Monday, 10 March 2025

Using the Interquartile Range (IQR) to Identify Outliers

 

IQR stands for interquartile range. It is a measure of variability that describes the spread of the middle 50% of a data set.

 

How is IQR calculated?

IQR is calculated by taking the difference between the third quartile (Q3) and the first quartile (Q1).

 

Q1 = Element at position (n + 1)/4

Q3 = Element at position 3(n + 1)/4

 

‘n’ stands for the number of elements in the data set, and positions are counted from 1 in the sorted data.

 

For example, let's say we have the following data set (the data must be sorted in ascending order first):

 

10, 20, 30, 40, 50, 60, 70, 80, 90, 100

 

Let’s find Q1 for the above data set.

The position of the first quartile is (n + 1)/4. Since there are 10 data points in the data set, the position of the first quartile is (10 + 1)/4 = 11/4 = 2.75. Rounding to the nearest whole position gives 3, so we take the third element as the value of Q1, which is 30.

 

Another simple way to find Q1 is to arrange your data in ascending order and then find the median (middle value) of the lower half of the data.

 

In the above example, the lower half of the data is [10, 20, 30, 40, 50], and its median is 30.

 

Let’s find Q3 for the above data set.

The position of the third quartile is 3(n + 1)/4. Since there are 10 data points in the data set, the position of the third quartile is 3(10 + 1)/4 = 33/4 = 8.25. Rounding to the nearest whole position gives 8, so the third quartile is the eighth value in the data set. So 80 is the Q3 value.
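The position rule can be sketched in a few lines of Python. This is only an illustration: the helper name quartile_by_position is hypothetical, and rounding a fractional position half up is an assumption, since the text only says to take the nearest position.

```python
import math

def quartile_by_position(sorted_data, k):
    """Return quartile k (1 or 3) as the element nearest to position k(n + 1)/4.

    Positions are 1-based, as in the text. Rounding half up is an assumption.
    """
    n = len(sorted_data)
    position = k * (n + 1) / 4
    nearest = math.floor(position + 0.5)  # round to the nearest whole position
    nearest = min(max(nearest, 1), n)     # clamp so small data sets stay in range
    return sorted_data[nearest - 1]       # convert 1-based position to 0-based index

data = sorted([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
q1 = quartile_by_position(data, 1)  # position 2.75 -> 3rd element
q3 = quartile_by_position(data, 3)  # position 8.25 -> 8th element

print('Q1:', q1)
print('Q3:', q3)
print('IQR:', q3 - q1)
```

For the ten-element example this picks the third and eighth elements, matching the hand calculation.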

 

Another simple way to find Q3 is to arrange your data in ascending order and then find the median (middle value) of the upper half of the data.

 

In the above example, the upper half of the data is [60, 70, 80, 90, 100], and its median is 80.
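The median-of-halves approach could look like the sketch below, which uses only the standard library. The function name quartiles_by_halves is hypothetical. For a data set with an odd number of elements, this sketch leaves the middle element out of both halves, which is one common convention and an assumption here, since the text only covers an even-sized example.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Needs at least two values. For odd-sized data the middle element is
    excluded from both halves (an assumption, see the note above).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # smallest half of the data
    upper = ordered[-half:]   # largest half of the data
    return median(lower), median(upper)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
q1, q3 = quartiles_by_halves(data)

print('Q1:', q1)
print('Q3:', q3)
print('IQR:', q3 - q1)
```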

 

So, the IQR for this data set is 80 - 30 = 50.

 

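The following script computes Q1, Q3, and the IQR with NumPy. Note that np.percentile interpolates linearly between neighbouring values by default, so its quartiles differ slightly from the hand-computed values above.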

 

iqr.py

import numpy as np

data_points = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Calculate Q1
q1 = np.percentile(data_points, 25)

# Calculate Q3
q3 = np.percentile(data_points, 75)

# Calculate IQR
iqr = q3 - q1

print('Q1:', q1)
print('Q3:', q3)
print('IQR:', iqr)

 

Output

Q1: 32.5
Q3: 77.5
IQR: 45.0
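How much the quartiles depend on the chosen convention can be checked with the standard library's statistics.quantiles, which offers two methods. In this small sketch, 'inclusive' reproduces the 32.5 and 77.5 printed above, while 'exclusive' places the quartiles at positions based on n + 1, like the formulas in this post, but interpolates instead of rounding to the nearest element.

```python
from statistics import quantiles

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1_inc, _, q3_inc = quantiles(data, n=4, method='inclusive')
q1_exc, _, q3_exc = quantiles(data, n=4, method='exclusive')

print('inclusive:', q1_inc, q3_inc, 'IQR:', q3_inc - q1_inc)
print('exclusive:', q1_exc, q3_exc, 'IQR:', q3_exc - q1_exc)
```

The two methods give IQRs of 45.0 and 55.0 for this data, on either side of the hand-computed 50, so it is worth stating which convention was used when reporting quartiles.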

Because the IQR depends only on the central portion of the data distribution, it is largely unaffected by outliers.
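A quick way to see this is to compare the IQR with the range before and after one value is replaced by an extreme one. The sketch below uses the standard library only. The value 1000 and the helper names iqr and value_range are made up for illustration.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range using linear interpolation ('inclusive' method)."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    return q3 - q1

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

clean = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
skewed = clean[:-1] + [1000]  # hypothetical: 100 replaced by an extreme value

print('Range:', value_range(clean), '->', value_range(skewed))
print('IQR:  ', iqr(clean), '->', iqr(skewed))
```

The range jumps from 90 to 990, while the IQR stays at 45.0 because the extreme value lies outside the middle 50% of the data.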

 

iqr_box_plot.py

import numpy as np
import matplotlib.pyplot as plt

# Sample house prices (hard-coded example values)
house_prices = [50000, 300000, 950000, 500000, 750000, 100000, 325000, 650000, 25000, 200000, 35000]

# Calculate quartiles
q1 = np.percentile(house_prices, 25)  # Q1 (First Quartile)
q3 = np.percentile(house_prices, 75)  # Q3 (Third Quartile)

# Create the box plot
plt.boxplot(house_prices)

# Label the quartiles
plt.text(0.85, q1, f'Q1: {q1}', va='center', ha='left', bbox=dict(facecolor='white', edgecolor='black'))
plt.text(0.85, q3, f'Q3: {q3}', va='center', ha='left', bbox=dict(facecolor='white', edgecolor='black'))

# Set labels and title
plt.ylabel('House Price')
plt.title('Box Plot with Q1 and Q3 Quartiles')

# Show the plot
plt.show()

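Output: a box plot of the house prices, titled 'Box Plot with Q1 and Q3 Quartiles', with the Q1 and Q3 values labelled.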
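To actually flag outliers with the IQR, a widely used convention (not stated in this post, so treat it as an assumption) is to mark any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The sketch below applies that rule to the house prices using the standard library. The function name iqr_outliers and the extra 5,000,000 price are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

house_prices = [50000, 300000, 950000, 500000, 750000, 100000,
                325000, 650000, 25000, 200000, 35000]

# The original prices all fall inside the fences
print(iqr_outliers(house_prices))

# Adding one hypothetical, very expensive house produces an outlier
print(iqr_outliers(house_prices + [5_000_000]))
```

With the original prices nothing is flagged. Only the added 5,000,000 value falls beyond the upper fence. The multiplier k = 1.5 is a convention rather than a fixed rule, and a larger value makes the test stricter.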

 
