Monday, 3 March 2025

How to handle outliers in data?

The following are the most commonly used methods to handle outliers.

a. Remove outliers

This is the simplest approach: drop the rows that contain outliers from the dataset. The drawback is that it reduces the sample size. If there are only a few outliers and removing them does not shrink the dataset significantly, you can remove them straight away.
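The post does not say how to identify the outliers before removing them. The sketch below assumes the common 1.5 * IQR rule as the detection criterion. The sample data and the names lower_bound, upper_bound and df_clean are made up for illustration.

```python
import pandas as pd

# Hypothetical data; 'Column' mirrors the placeholder name used in this post.
df = pd.DataFrame({'Column': [10, 12, 11, 13, 12, 11, 95]})

# Assumption: a value is an outlier if it lies more than 1.5 * IQR
# below the first quartile or above the third quartile.
q1 = df['Column'].quantile(0.25)
q3 = df['Column'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the rows that fall inside the bounds (inclusive).
df_clean = df[df['Column'].between(lower_bound, upper_bound)]

print(len(df), len(df_clean))
```

Here the value 95 is dropped, so df_clean has one row fewer than df. That row loss is the sample-size cost described above.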

 

b. Replace the outliers with mean, mode or median

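Go through the basic statistics of the column, choose the most suitable measure (mean, median or mode), and replace the outliers with that value. Keep in mind that the mean is itself pulled toward extreme values, so the median is often the safer choice for skewed numeric data.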
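As a minimal sketch of this method, the example below replaces outliers with the median. It assumes the 1.5 * IQR rule for flagging outliers, which the post does not prescribe. The sample data and the names is_outlier and median_value are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'Column' mirrors the placeholder name used in this post.
df = pd.DataFrame({'Column': [10, 12, 11, 13, 12, 11, 95]})

# Assumption: flag outliers with the 1.5 * IQR rule.
q1 = df['Column'].quantile(0.25)
q3 = df['Column'].quantile(0.75)
iqr = q3 - q1
is_outlier = ~df['Column'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Compute the median from the non-outlier values only, so the
# replacement value is not influenced by the outliers themselves.
median_value = df.loc[~is_outlier, 'Column'].median()

# Replace the flagged values and keep every row.
df['Column'] = np.where(is_outlier, median_value, df['Column'])

print(df['Column'].tolist())
```

The number of rows is unchanged. Only the flagged value is overwritten.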

 

c. Transform the data

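Apply a mathematical transformation, such as the logarithm, to compress large values and reduce the influence of outliers. Note that np.log is only defined for positive values.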

df['transformed_column'] = np.log(df['original_column'])
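If the column can contain zeros, np.log returns -inf for them. One possible workaround, shown here as a sketch with made-up data, is np.log1p, which computes log(1 + x). This is a swapped-in variant rather than the exact transformation from the line above.

```python
import numpy as np
import pandas as pd

# Hypothetical data that includes a zero and one very large value.
df = pd.DataFrame({'original_column': [0.0, 1.0, 10.0, 100.0, 10000.0]})

# log1p maps 0 to 0 instead of -inf and compresses the large values.
df['transformed_column'] = np.log1p(df['original_column'])

# The transformation can be reversed with np.expm1 when the
# original scale is needed again.
restored = np.expm1(df['transformed_column'])

print(df)
```

Neither np.log nor np.log1p is suitable for negative values. Those would need to be shifted or handled with a different transformation first.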

 

d. Winsorization

Winsorization replaces the outliers with less extreme values, typically the values at chosen lower and upper percentiles.

 

Example

lower_limit = df['Column'].quantile(0.05)
upper_limit = df['Column'].quantile(0.95)

 

Any value below lower_limit is replaced with lower_limit, and any value above upper_limit is replaced with upper_limit.

 

Replace all values lower than lower_limit with lower_limit:

df['Column'] = np.where(df['Column'] < lower_limit, lower_limit, df['Column'])

Replace all values higher than upper_limit with upper_limit:

df['Column'] = np.where(df['Column'] > upper_limit, upper_limit, df['Column'])

 

Unlike removing outliers entirely, Winsorization doesn't reduce the sample size.
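The two np.where steps can also be written as a single call to the pandas Series.clip method. The sketch below uses made-up data and writes the result to a hypothetical 'Winsorized' column so the original values stay available for comparison.

```python
import pandas as pd

# Hypothetical data: one very small and one very large value.
values = [1.0] + [float(v) for v in range(20, 39)] + [200.0]
df = pd.DataFrame({'Column': values})

# Same 5th and 95th percentile limits as in the example above.
lower_limit = df['Column'].quantile(0.05)
upper_limit = df['Column'].quantile(0.95)

# clip caps values below lower_limit and above upper_limit in one step.
df['Winsorized'] = df['Column'].clip(lower=lower_limit, upper=upper_limit)

print(df['Column'].min(), df['Column'].max())
print(df['Winsorized'].min(), df['Winsorized'].max())
print(len(df['Column']), len(df['Winsorized']))
```

The row count is the same before and after. Only the two extreme values are pulled in to the percentile limits.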

  
