Monday, 10 March 2025

Variance: A Measure of Data Spread

Variance measures how much the individual data points differ from the mean (average) of the data set. It is the average squared distance from the mean.

Formula

σ² = Σ (x − μ)² / n

where:

 

  1. x is a value in the data set
  2. μ is the mean of the data set
  3. n is the number of values in the data set

 

A larger variance indicates that the data points are more spread out from the mean, while a smaller variance indicates that the data points are closer to the mean.
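
As a minimal sketch, the "average squared distance from the mean" can be computed by hand in plain Python. The function name population_variance and the two small sample lists are illustrative assumptions, not part of the application shown later.

```python
def population_variance(values):
    """Return the average squared distance of the values from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


tight = [9, 10, 11]    # values close to their mean (10)
spread = [0, 10, 20]   # same mean, values further from it

print(f'variance (tight)  : {population_variance(tight)}')
print(f'variance (spread) : {population_variance(spread)}')
```

Both lists have the same mean, but the second one should report the larger variance because its values lie further from that mean.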

 

For example, consider the house prices below.

house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]

 

Because variance is expressed in squared units (here, dollars squared), it cannot be compared directly with the original data values. To judge whether the data is widely spread around the mean, we use the standard deviation instead. Standard deviation is the square root of the variance, so it is in the same units as the data.

 

The mean, variance and standard deviation of the above data set are given below.

 

mean : 111375.0

variance : 102234375.0

standard_deviation : 10111.101572034573

 

 

Now compare the standard deviation (about 10111.10 dollars) to the mean (111375.0 dollars). The standard deviation is much smaller than the mean, roughly 9% of it, which indicates that the data is not widely spread around the mean.
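
One way to make this comparison concrete is to express the standard deviation as a fraction of the mean, a ratio often called the coefficient of variation. The sketch below assumes that this ratio is an acceptable yardstick for "widely spread"; the variable name relative_spread is illustrative.

```python
import numpy as np

house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]

mean = np.mean(house_prices)
standard_deviation = np.std(house_prices)  # population standard deviation (ddof=0)

# Standard deviation expressed as a fraction of the mean
relative_spread = standard_deviation / mean

print(f'relative spread : {relative_spread:.1%}')
```

What counts as a "small" ratio depends on the domain, so treat any threshold as a judgment call rather than a fixed rule.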

 

The complete working application is given below.

 

variance.py

import numpy as np
import matplotlib.pyplot as plt

house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]

# Calculate the mean, variance and standard deviation of the data.
# np.var computes the population variance by default (ddof=0, divides by n).
mean = np.mean(house_prices)
variance = np.var(house_prices)
standard_deviation = np.sqrt(variance)

print(f'mean : {mean}')
print(f'variance : {variance}')
print(f'standard_deviation : {standard_deviation}')

# Scatter plot of the data points: price on the x-axis, position in the list on the y-axis
plt.scatter(house_prices, range(len(house_prices)), label='House prices', color='green', marker='o')

# Mark the mean, and one standard deviation on either side of it
plt.axvline(mean, color='blue', linestyle='dashed', linewidth=2, label=f'Mean: {mean:.2f}')
plt.axvline(mean - standard_deviation, color='r', linestyle='dashed', linewidth=2, label=f'Mean ± 1 SD (SD: {standard_deviation:.2f})')
plt.axvline(mean + standard_deviation, color='r', linestyle='dashed', linewidth=2)

plt.xlabel('House price')
plt.ylabel('Index in data set')
plt.legend()
plt.title('House prices with mean and standard deviation')
plt.show()

Output

mean : 111375.0
variance : 102234375.0
standard_deviation : 10111.101572034573



Note

  1. In practice, the standard deviation is the primary measure used to assess the spread or variability of data, because it is expressed in the same units as the data. You only need to calculate the variance itself when a particular statistical analysis requires it.
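
For instance, NumPy can return the standard deviation in a single call, without computing the variance first. A short sketch, reusing the same house prices:

```python
import numpy as np

house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]

# np.std returns the population standard deviation by default (ddof=0)
standard_deviation = np.std(house_prices)

# Sanity check: it matches the square root of the population variance
assert np.isclose(standard_deviation, np.sqrt(np.var(house_prices)))

print(f'standard_deviation : {standard_deviation}')
```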


 
