Monday, 3 March 2025

Find and replace Outliers using IQR

The Interquartile Range (IQR) approach is a statistical method for detecting and handling outliers in a dataset. It is based on quartiles, which divide the sorted data into four equal parts; the IQR is the distance between the first quartile (Q1) and the third quartile (Q3). The approach is useful because quartiles are barely affected by extreme values, so it works well for skewed distributions.

Step 1: Calculate IQR

Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
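
As a quick illustration of Step 1, here is a minimal sketch that runs the same three lines on a small DataFrame. The sample values and the column name "a" are assumptions chosen only for demonstration.

```python
import pandas as pd

# Illustrative sample data (an assumption for this sketch).
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024]})
column = "a"

Q1 = df[column].quantile(0.25)  # first quartile (linear interpolation by default)
Q3 = df[column].quantile(0.75)  # third quartile
IQR = Q3 - Q1                   # spread of the middle 50% of the values

# For this data: Q1 = 3.75, Q3 = 80.0, IQR = 76.25
print(Q1, Q3, IQR)
```

Note that the largest value (1024) has almost no influence on Q1 and Q3, which is why the IQR is a stable base for the bounds in the next step.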

   

Step 2: Calculate the lower and upper bounds from the IQR and a threshold. The threshold is typically between 1.5 and 3; 1.5 is the common default, and larger values flag only the more extreme points.

lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR
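
To see how the threshold changes the result, the sketch below computes the bounds for thresholds 1.5 and 3 on the same illustrative values. The helper name iqr_bounds and the sample data are assumptions made for this example.

```python
import pandas as pd

# Illustrative sample data (an assumption for this sketch).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024])
Q1 = values.quantile(0.25)
Q3 = values.quantile(0.75)
IQR = Q3 - Q1

def iqr_bounds(threshold):
    """Hypothetical helper: return (lower_bound, upper_bound) for a threshold."""
    return Q1 - threshold * IQR, Q3 + threshold * IQR

for threshold in (1.5, 3):
    lower_bound, upper_bound = iqr_bounds(threshold)
    # Values outside [lower_bound, upper_bound] are treated as outliers.
    flagged = values[(values < lower_bound) | (values > upper_bound)]
    print(threshold, lower_bound, upper_bound, flagged.tolist())
```

With a threshold of 1.5 the bounds are -110.625 and 194.375, so both 256 and 1024 are flagged; with a threshold of 3 the bounds widen to -225.0 and 308.75, and only 1024 is flagged.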

 

Step 3: Replace the outliers.

If a column value is less than lower_bound, replace it with lower_bound.

df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])

If a column value is greater than upper_bound, replace it with upper_bound.

df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
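
The two np.where calls can also be written as a single Series.clip call, which caps values at both bounds at once. The sketch below is one possible alternative, not the method used in this post; it assumes a threshold of 3 and illustrative sample data, and checks that both approaches give the same result.

```python
import numpy as np
import pandas as pd

# Illustrative sample data and threshold (assumptions for this sketch).
df = pd.DataFrame({"b": [1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
column = "b"
threshold = 3

Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR

# Two-step replacement with np.where, as described in Step 3.
two_step = np.where(df[column] < lower_bound, lower_bound, df[column])
two_step = np.where(two_step > upper_bound, upper_bound, two_step)

# One-step alternative: clip caps values below/above the given bounds.
clipped = df[column].clip(lower=lower_bound, upper=upper_bound)

same = np.array_equal(two_step, clipped.to_numpy())
print(same)
```

Either form leaves values inside the bounds unchanged; clip simply avoids scanning the column twice and keeps the result as a pandas Series.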

A complete working example follows.

replace_outliers.py

import pandas as pd
import numpy as np

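def find_and_replace_outlier_columns(df, threshold=1.5):
    """Cap outliers in each numeric column of df in place; return the affected column names."""
    outlier_columns = []
    for column in df.columns:
        # Only int64 and float64 columns are checked
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            # Rows whose value falls outside [lower_bound, upper_bound]
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                print(f'\nOutliers exists in the column {column}')
                outlier_columns.append(column)
                # Cap low and high outliers at the nearest bound
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    return outlier_columns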


df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024],
    'b': [1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

print(df.describe())

print('\nReplacing outliers with lower and upper bounds')
find_and_replace_outlier_columns(df, 3)
print(df.describe())

 

Output

                a            b
count    12.000000    12.000000
mean    127.666667   261.416667
std     292.352631   637.057220
min       1.000000     1.000000
25%       3.750000     3.750000
50%       6.500000     6.500000
75%      80.000000     9.250000
max    1024.000000  2058.000000

Replacing outliers with lower and upper bounds

Outliers exists in the column a

Outliers exists in the column b
                a        b
count   12.000000  12.0000
mean    68.062500   8.8750
std    107.414455   8.3445
min      1.000000   1.0000
25%      3.750000   3.7500
50%      6.500000   6.5000
75%     80.000000   9.2500
max    308.750000  25.7500

 

 
