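The Interquartile Range (IQR) approach is a statistical method for detecting and handling outliers in a dataset. It is based on quartiles, which divide the sorted data into four equal parts; the IQR is the distance between the first quartile (Q1) and the third quartile (Q3), that is, the spread of the middle 50% of the values. The approach is particularly useful because quartiles are barely affected by extreme values, so it also works well for skewed distributions.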
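To make the "resistant to extreme values" point concrete, here is a small sketch. The two samples and the helper name `iqr` are made up for illustration; they differ only in their largest value, which shifts the mean and standard deviation a lot while the IQR stays the same.

```python
import pandas as pd

# Two hypothetical samples that differ only in their largest value.
clean = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1200])


def iqr(s):
    """Return Q3 - Q1 for a pandas Series (illustrative helper)."""
    return s.quantile(0.75) - s.quantile(0.25)


# The mean and standard deviation move with the extreme value...
print("mean:", clean.mean(), "->", skewed.mean())
print("std: ", round(clean.std(), 2), "->", round(skewed.std(), 2))
# ...while the IQR does not change.
print("IQR: ", iqr(clean), "->", iqr(skewed))
```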
Step 1: Calculate IQR
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
Step 2: Calculate the lower and upper bounds from the IQR and a threshold. In general, the threshold is between 1.5 and 3; 1.5 is the conventional choice, and larger values flag only the more extreme points.
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR
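As a worked example of Steps 1 and 2, the sketch below computes the bounds for column 'a' of the sample DataFrame used later in this post, once with threshold 1.5 and once with threshold 3. The `bounds` dictionary is only there to collect the results for illustration.

```python
import pandas as pd

# Column 'a' from the sample DataFrame used later in this post.
a = pd.Series([1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024])

Q1 = a.quantile(0.25)  # 3.75
Q3 = a.quantile(0.75)  # 80.0
IQR = Q3 - Q1          # 76.25

# Collect (lower_bound, upper_bound) per threshold.
bounds = {}
for threshold in (1.5, 3):
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR
    bounds[threshold] = (lower_bound, upper_bound)
    print(f"threshold={threshold}: lower={lower_bound}, upper={upper_bound}")
```

With threshold 1.5 the upper bound is 194.375, so both 256 and 1024 count as outliers; with threshold 3 the upper bound is 308.75 and only 1024 does.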
Step 3: Replace the outliers.
If the column value is less than lower_bound, replace it with lower_bound.
df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
If the column value is greater than upper_bound, replace it with upper_bound.
df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
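The two np.where calls can also be written as a single call to pandas' `Series.clip`, which caps both tails at once. The sketch below is an alternative to Step 3, not the method used in the application that follows; it assumes column 'b' of the sample data, stored as floats so that fractional bounds fit without any dtype change.

```python
import numpy as np
import pandas as pd

# Column 'b' from the sample DataFrame in this post, stored as floats.
b = pd.Series([1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype="float64")

Q1, Q3 = b.quantile(0.25), b.quantile(0.75)
IQR = Q3 - Q1
threshold = 3
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR

# Two-pass replacement, as in Step 3.
two_pass = np.where(b < lower_bound, lower_bound, b)
two_pass = np.where(two_pass > upper_bound, upper_bound, two_pass)

# Single call that caps values below lower_bound and above upper_bound.
clipped = b.clip(lower=lower_bound, upper=upper_bound)

# Both approaches should produce the same values.
print(np.array_equal(two_pass, clipped.to_numpy()))
```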
A complete working application is shown below.
replace_outliers.py
import pandas as pd
import numpy as np


def find_and_replace_outlier_columns(df, threshold=1.5):
    # Cap the outliers of every numeric column of df, in place.
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            # Step 1: quartiles and IQR
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1

            # Step 2: lower and upper bounds
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR

            # Rows whose value falls outside the bounds
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

            if not outliers.empty:
                print(f'\nOutliers exists in the column {column}')
                # Step 3: replace the outliers with the nearest bound
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])


df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024],
    'b': [1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

print(df.describe())
print('\nReplacing outliers with lower and upper bounds')
find_and_replace_outlier_columns(df, 3)
print(df.describe())
Output
                 a            b
count    12.000000    12.000000
mean    127.666667   261.416667
std     292.352631   637.057220
min       1.000000     1.000000
25%       3.750000     3.750000
50%       6.500000     6.500000
75%      80.000000     9.250000
max    1024.000000  2058.000000

Replacing outliers with lower and upper bounds

Outliers exists in the column a

Outliers exists in the column b
                a        b
count   12.000000  12.0000
mean    68.062500   8.8750
std    107.414455   8.3445
min      1.000000   1.0000
25%      3.750000   3.7500
50%      6.500000   6.5000
75%     80.000000   9.2500
max    308.750000  25.7500
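The application above modifies the DataFrame in place. If you would rather keep the original data, a variant along the following lines may help. This is only a sketch: the function name `cap_outliers_iqr` and its return shape are assumptions made for illustration, and it swaps the np.where calls for `Series.clip`.

```python
import pandas as pd


def cap_outliers_iqr(df, threshold=1.5):
    """Return (capped copy of df, names of the columns that had outliers)."""
    capped = df.copy()
    changed = []
    for column in capped.select_dtypes(include="number").columns:
        q1 = capped[column].quantile(0.25)
        q3 = capped[column].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - threshold * iqr
        upper = q3 + threshold * iqr
        is_outlier = (capped[column] < lower) | (capped[column] > upper)
        if is_outlier.any():
            changed.append(column)
            # Cast to float first so fractional bounds fit in integer columns.
            capped[column] = capped[column].astype("float64").clip(lower=lower, upper=upper)
    return capped, changed


df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024],
    'b': [1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

capped, changed = cap_outliers_iqr(df, threshold=3)
print(changed)       # columns in which at least one value was capped
print(capped.max())  # largest value per column after capping
print(df.max())      # the original DataFrame is left untouched
```

Returning the list of affected columns lets the caller decide what to report, instead of printing from inside the function.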