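The interquartile range (IQR) approach is a statistical method for detecting and handling outliers in a dataset. It is based on quartiles, the cut points that divide the sorted data into four equally sized parts; the IQR is the distance between the first quartile (Q1) and the third quartile (Q3). Because quartiles are barely affected by extreme values, the approach is robust and does not assume a normal distribution, which makes it a common choice for skewed data.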
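To illustrate the robustness claim, here is a minimal sketch that compares the IQR with the standard deviation before and after one value is replaced by an extreme one. The helper name `iqr` and the sample values are assumptions chosen for illustration.

```python
import pandas as pd


def iqr(values):
    """Return Q3 - Q1 for a sequence of numbers (hypothetical helper)."""
    s = pd.Series(values)
    return s.quantile(0.75) - s.quantile(0.25)


base = list(range(1, 12))          # 1, 2, ..., 11
with_extreme = base[:-1] + [1000]  # same data, largest value replaced by 1000

iqr_base = iqr(base)
iqr_extreme = iqr(with_extreme)
std_base = pd.Series(base).std()
std_extreme = pd.Series(with_extreme).std()

print(f"IQR: {iqr_base} -> {iqr_extreme}")
print(f"std: {std_base:.2f} -> {std_extreme:.2f}")
```

With these values the IQR stays at 5.0, while the standard deviation grows from about 3.3 to about 300.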
Step 1: Calculate IQR
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
Step 2: Calculate the lower and upper bounds from the IQR and a threshold multiplier. The threshold is typically between 1.5 and 3; a larger value flags fewer points as outliers.
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR
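Steps 1 and 2 can be sketched as one small function to show how the threshold changes what is flagged. The function name `iqr_bounds` and the sample series are assumptions made for this illustration.

```python
import pandas as pd


def iqr_bounds(series, threshold=1.5):
    """Return (lower_bound, upper_bound) for a numeric Series (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


# Illustrative sample values.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024])

for threshold in (1.5, 3):
    lower, upper = iqr_bounds(values, threshold)
    # Keep only the values that fall outside the bounds.
    flagged = values[(values < lower) | (values > upper)].tolist()
    print(f"threshold={threshold}: bounds=({lower}, {upper}), flagged={flagged}")
```

For this series, a threshold of 1.5 flags 256 and 1024, while a threshold of 3 flags only 1024.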
Step 3: Replace the outliers. If the column value is less than lower_bound, replace it with lower_bound.
df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
If the column value is greater than upper_bound, replace it with upper_bound.
df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
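As an alternative to the two np.where calls, pandas offers `Series.clip`, which caps both tails in a single call. The sketch below is not the method used in this post; it only checks, on assumed sample data, that the two approaches give the same values.

```python
import numpy as np
import pandas as pd

# Illustrative data and settings (assumptions for this sketch).
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024]})
column = "a"
threshold = 1.5

Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR

# Two-step replacement with np.where, as in Step 3.
capped_where = np.where(df[column] < lower_bound, lower_bound, df[column])
capped_where = np.where(capped_where > upper_bound, upper_bound, capped_where)

# Single call that caps values below lower_bound and above upper_bound.
capped_clip = df[column].clip(lower=lower_bound, upper=upper_bound)

print(np.allclose(capped_where, capped_clip))
```

Both approaches leave values inside the bounds untouched; `clip` simply expresses the same capping in one step.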
A complete working example follows.
replace_outliers.py
import pandas as pd
import numpy as np


def find_and_replace_outlier_columns(df, threshold=1.5):
    """Cap outliers in each numeric column of df in place; return the affected column names."""
    outlier_columns = []
    for column in df.columns:
        # Quartiles are only computed for integer and float columns.
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR

            # Rows whose value in this column falls outside the bounds.
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                print(f'\nOutliers exists in the column {column}')
                outlier_columns.append(column)
                # Cap values outside the bounds at the nearest bound.
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    return outlier_columns


df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024],
    'b': [1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

print(df.describe())
print('\nReplacing outliers with lower and upper bounds')
find_and_replace_outlier_columns(df, 3)
print(df.describe())
Output
                a            b
count    12.000000    12.000000
mean    127.666667   261.416667
std     292.352631   637.057220
min       1.000000     1.000000
25%       3.750000     3.750000
50%       6.500000     6.500000
75%      80.000000     9.250000
max    1024.000000  2058.000000
Replacing outliers with lower and upper bounds
Outliers exists in the column a
Outliers exists in the column b
                a        b
count   12.000000  12.0000
mean    68.062500   8.8750
std    107.414455   8.3445
min      1.000000   1.0000
25%      3.750000   3.7500
50%      6.500000   6.5000
75%     80.000000   9.2500
max    308.750000  25.7500
 