Variance measures how far the individual data points lie from the mean (average) of the dataset. It is the average squared distance from the mean.
Formula
$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$
where:
- x is a value in the data set
- μ is the mean of the data set
- n is the number of values in the data set
A larger variance indicates that the data points are more spread out from the mean, while a smaller variance indicates that the data points are closer to the mean.
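As a minimal sketch of the definition above (the average squared distance from the mean), the function below computes the variance step by step in plain Python. The function name population_variance and the two small lists are made up for illustration; they are not part of the house price example.

```python
def population_variance(values):
    """Average squared distance from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    # Square each deviation from the mean, then average the squares
    return sum((x - mean) ** 2 for x in values) / n


# Hypothetical data: both lists have a mean of 10, but different spread
tight = [9, 10, 11]
spread = [0, 10, 20]

print(population_variance(tight))
print(population_variance(spread))
```

The second list should report a much larger variance than the first, since its values sit farther from the shared mean of 10.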
For example, consider the house prices below.
house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]
Variance is expressed in squared units (here, dollars squared), so it is hard to compare directly with the original data values. To judge whether the data is widely spread from the mean, we look at the standard deviation instead. Standard deviation is the square root of the variance, so it is in the same units as the data.
The mean, variance and standard deviation values for the above data set are given below.
mean : 111375.0
variance : 102234375.0
standard_deviation : 10111.101572034573
Now compare the standard deviation (about 10,111 dollars) with the mean (111,375 dollars). The standard deviation is roughly 9% of the mean, which indicates that the data is not extremely widely spread from the mean.
The complete working application is given below.
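One way to make this comparison concrete is to express the standard deviation as a fraction of the mean, a ratio commonly called the coefficient of variation. The sketch below uses Python's standard statistics module rather than NumPy, and assumes the population form (dividing by n), which matches the default of np.var. The variable name ratio is an illustrative choice.

```python
import statistics

house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]

mean = statistics.mean(house_prices)
# pstdev is the population standard deviation (divides by n)
standard_deviation = statistics.pstdev(house_prices)

# Standard deviation relative to the mean
ratio = standard_deviation / mean
print(f'standard deviation is {ratio:.1%} of the mean')
```

For this data the ratio comes out at about 9%, which is a compact way of saying the prices cluster fairly closely around the mean.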
variance.py
import numpy as np
import matplotlib.pyplot as plt

house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]

# Calculate the mean, variance and standard deviation of the data
mean = np.mean(house_prices)
variance = np.var(house_prices)
standard_deviation = np.sqrt(variance)

print(f'mean : {mean}')
print(f'variance : {variance}')
print(f'standard_deviation : {standard_deviation}')

# Create a scatter plot of data points (price on the x-axis, position in the list on the y-axis)
plt.scatter(house_prices, range(len(house_prices)), label='House prices', color='green', marker='o')

# Mark the mean, and one standard deviation on either side of it
plt.axvline(mean, color='blue', linestyle='dashed', linewidth=2, label=f'Mean: {mean:.2f}')
plt.axvline(mean - standard_deviation, color='r', linestyle='dashed', linewidth=2, label=f'Mean ± Standard Deviation ({standard_deviation:.2f})')
plt.axvline(mean + standard_deviation, color='r', linestyle='dashed', linewidth=2)

plt.xlabel('Value')
plt.ylabel('Index')
plt.legend()
plt.title('Scatter plot with Variance')
plt.show()
Output
mean : 111375.0
variance : 102234375.0
standard_deviation : 10111.101572034573
Note
- In practice, the standard deviation is the primary measure used to assess the spread or variability of data. You do not need to calculate the variance separately unless a particular statistical analysis requires it.
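As a small illustration of this note, the standard deviation can be obtained in a single call, without computing the variance as a separate step. The sketch uses statistics.pstdev and statistics.pvariance from the standard library and assumes the population form (dividing by n); the names direct and two_step are illustrative.

```python
import math
import statistics

house_prices = [100000, 110000, 125000, 95000, 115000, 118000, 123000, 105000]

# Direct route: population standard deviation in one call
direct = statistics.pstdev(house_prices)

# Two-step route: variance first, then its square root
two_step = math.sqrt(statistics.pvariance(house_prices))

# Both routes should agree up to floating-point rounding
print(math.isclose(direct, two_step))
```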