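The interquartile range (IQR) approach is a statistical method for detecting and handling outliers in a dataset. It is based on quartiles, which divide the sorted data into four equal parts. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), so it covers the middle 50% of the values. The approach is useful because quartiles are barely affected by extreme values, and because it does not assume the data is normally distributed, so it remains usable on skewed distributions.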
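To illustrate the resistance to extreme values, the minimal sketch below compares the IQR and the standard deviation of two small series that differ in a single value. The helper name `iqr` and both series are made up for this illustration.

```python
import pandas as pd


def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)


# Illustrative data: identical except that the last value is extreme
clean = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
with_extreme = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# The IQR is unchanged by the extreme value
print(iqr(clean), iqr(with_extreme))  # 4.5 4.5

# The standard deviation grows by orders of magnitude
print(clean.std(), with_extreme.std())
```

Because the bounds in the following steps are built from Q1, Q3 and the IQR, a single extreme value has little influence on where they land.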
Step 1: Calculate IQR
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
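As a quick check of Step 1, the sketch below computes the quartiles for a standalone Series holding the values of column 'a' from the sample data used later in this post. The variable name `values` is only illustrative. Note that `quantile` interpolates linearly between data points by default, which is why Q1 is not one of the original values.

```python
import pandas as pd

# Values of column 'a' from the sample DataFrame used later in this post
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024])

Q1 = values.quantile(0.25)  # 25th percentile
Q3 = values.quantile(0.75)  # 75th percentile
IQR = Q3 - Q1               # spread of the middle 50% of the data

print(Q1, Q3, IQR)  # 3.75 80.0 76.25
```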
Step 2: Calculate the lower and upper bounds from the IQR and a threshold. The threshold is typically between 1.5 and 3; a larger threshold widens the bounds, so fewer values are treated as outliers.
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR
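To see how the threshold changes the result, the sketch below computes the bounds for thresholds 1.5 and 3 on the values of column 'a' from the sample data used later in this post. The helper `iqr_bounds` is a hypothetical name introduced here for illustration only.

```python
import pandas as pd


def iqr_bounds(series, threshold=1.5):
    """Return (lower_bound, upper_bound) for a numeric Series."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    return Q1 - threshold * IQR, Q3 + threshold * IQR


values = pd.Series([1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024])

for threshold in (1.5, 3):
    lower_bound, upper_bound = iqr_bounds(values, threshold)
    # Values outside the bounds are the ones treated as outliers
    flagged = values[(values < lower_bound) | (values > upper_bound)]
    print(threshold, lower_bound, upper_bound, flagged.tolist())

# 1.5 -110.625 194.375 [256, 1024]
# 3 -225.0 308.75 [1024]
```

With a threshold of 1.5 both 256 and 1024 fall outside the bounds; with 3 only 1024 does.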
Step 3: Replace the outliers.
If the column value is less than lower_bound, then replace it with the lower_bound.
df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
If the column value is greater than upper_bound, then replace it with the upper_bound.
df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
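The two np.where replacements can also be written as a single call to the pandas `Series.clip` method. The sketch below shows both forms on a small made-up series; the bounds are fixed numbers chosen for illustration rather than values derived from an IQR.

```python
import numpy as np
import pandas as pd

# Illustrative data and bounds (not derived from an IQR)
values = pd.Series([-500.0, 1.0, 5.0, 9.0, 700.0])
lower_bound, upper_bound = 0.0, 10.0

# Two-step replacement with np.where, as in Step 3
capped = np.where(values < lower_bound, lower_bound, values)
capped = np.where(capped > upper_bound, upper_bound, capped)

# Single-call form: values below/above the bounds are set to the bounds
clipped = values.clip(lower=lower_bound, upper=upper_bound)

print(clipped.tolist())  # [0.0, 1.0, 5.0, 9.0, 10.0]
```

One practical difference: np.where returns a NumPy array, while clip returns a Series that keeps the original index.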
The complete working application is given below.
replace_outliers.py
import pandas as pd
import numpy as np


def find_and_replace_outlier_columns(df, threshold=1.5):
    # Cap the outliers of every int64/float64 column of df, in place
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            # Step 1: quartiles and IQR
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1

            # Step 2: lower and upper bounds
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR

            # Rows whose value lies outside the bounds
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

            if not outliers.empty:
                print(f'\nOutliers exists in the column {column}')
                # Step 3: replace outliers with the nearest bound
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])


df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024],
    'b': [1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

print(df.describe())

print('\nReplacing outliers with lower and upper bounds')
find_and_replace_outlier_columns(df, 3)

print(df.describe())
Output
                 a            b
count    12.000000    12.000000
mean    127.666667   261.416667
std     292.352631   637.057220
min       1.000000     1.000000
25%       3.750000     3.750000
50%       6.500000     6.500000
75%      80.000000     9.250000
max    1024.000000  2058.000000

Replacing outliers with lower and upper bounds

Outliers exists in the column a

Outliers exists in the column b
                a        b
count   12.000000  12.0000
mean    68.062500   8.8750
std    107.414455   8.3445
min      1.000000   1.0000
25%      3.750000   3.7500
50%      6.500000   6.5000
75%     80.000000   9.2500
max    308.750000  25.7500
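The application above modifies the DataFrame in place. If the original data needs to be kept, one possible variant is sketched below: it works on a copy and also returns the names of the columns that contained outliers. The function name `cap_outliers` and its return shape are assumptions made for this sketch, not part of the application above.

```python
import pandas as pd


def cap_outliers(df, threshold=1.5):
    """Return (capped copy of df, list of columns that contained outliers)."""
    capped = df.copy()
    outlier_columns = []
    # select_dtypes(include="number") picks every numeric column
    for column in capped.select_dtypes(include="number").columns:
        Q1 = capped[column].quantile(0.25)
        Q3 = capped[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - threshold * IQR
        upper_bound = Q3 + threshold * IQR
        is_outlier = (capped[column] < lower_bound) | (capped[column] > upper_bound)
        if is_outlier.any():
            outlier_columns.append(column)
            # Cast to float so fractional bounds can be stored, then cap
            capped[column] = capped[column].astype(float).clip(
                lower=lower_bound, upper=upper_bound
            )
    return capped, outlier_columns


df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 32, 64, 128, 256, 1024],
    'b': [1024, 2058, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

capped, outlier_columns = cap_outliers(df, threshold=3)
print(outlier_columns)        # ['a', 'b']
print(capped.max().tolist())  # [308.75, 25.75]
print(df.max().tolist())      # [1024, 2058] (original is unchanged)
```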