The following are the most commonly used methods for handling outliers.
a. Remove outliers
This is the simplest approach: drop the outlier rows from the dataset. The drawback is that it reduces the sample size. If there are only a few outliers and dropping them does not shrink the dataset significantly, you can remove them straight away.
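The removal step could be sketched as below. This post does not say how the outliers are identified, so the 1.5 * IQR rule used here is an assumption, as are the sample values; 'Column' is the same placeholder name used later in this post.

```python
import pandas as pd

# Hypothetical sample data with one obvious extreme value.
df = pd.DataFrame({"Column": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0]})

# Assumed detection rule: anything beyond 1.5 * IQR from the quartiles.
q1 = df["Column"].quantile(0.25)
q3 = df["Column"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the rows that fall inside the bounds.
inside = df["Column"].between(lower_bound, upper_bound)
df_clean = df[inside].reset_index(drop=True)

# Report how much of the sample was lost.
print(f"Removed {len(df) - len(df_clean)} of {len(df)} rows")
```

Checking the number of removed rows against the total is a quick way to judge whether the loss in sample size is acceptable.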
b. Replace the outliers with mean, mode or median
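Instead of dropping the outliers, replace them with a central value. Look at the basic statistics of the data, choose whichever of the mean, median or mode represents it best, and substitute that value for each outlier. The median is often the safer choice for numeric data, because the mean is itself pulled towards the outliers.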
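Here is a minimal sketch of the replacement, using the median. The sample data and the 1.5 * IQR rule for flagging outliers are assumptions made for illustration, not something this post prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one obvious extreme value.
df = pd.DataFrame({"Column": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0]})

# Assumed detection rule: anything beyond 1.5 * IQR from the quartiles.
q1 = df["Column"].quantile(0.25)
q3 = df["Column"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["Column"] < q1 - 1.5 * iqr) | (df["Column"] > q3 + 1.5 * iqr)

# The median is barely affected by the extreme value, unlike the mean.
median_value = df["Column"].median()

# Replace flagged values and leave the rest untouched.
df["Column"] = np.where(is_outlier, median_value, df["Column"])
print(df)
```

To use the mean or mode instead, swap `median()` for `mean()` or `mode()[0]`.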
c. Transform the data
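Apply a mathematical transformation, such as a logarithm, to compress large values and reduce the influence of outliers. The log transform is defined only for positive values. For example, with NumPy imported as np: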
df['transformed_column'] = np.log(df['original_column'])
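If the column can contain zeros, np.log returns -inf for them. One common workaround, sketched here with made-up sample values, is np.log1p, which computes log(1 + x):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data that includes a zero.
df = pd.DataFrame({"original_column": [0.0, 1.0, 10.0, 100.0, 10000.0]})

# log1p(x) = log(1 + x): zero maps to zero, large values are compressed.
df["transformed_column"] = np.log1p(df["original_column"])

# np.expm1 is the inverse, so the original scale can be recovered.
df["recovered"] = np.expm1(df["transformed_column"])
print(df)
```

The transformation keeps the order of the values and only shrinks the gaps between the large ones.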
d. Winsorization
Winsorization replaces the outliers with less extreme values, typically the values at chosen percentiles.
Example
lower_limit = df['Column'].quantile(0.05)
upper_limit = df['Column'].quantile(0.95)
Any value below lower_limit will be replaced with lower_limit, and any value above upper_limit will be replaced with upper_limit.
Replace all the values that are lower than lower_limit with lower_limit:
df['Column'] = np.where(df['Column'] < lower_limit, lower_limit, df['Column'])
Replace all the values that are higher than upper_limit with upper_limit:
df['Column'] = np.where(df['Column'] > upper_limit, upper_limit, df['Column'])
Unlike removing outliers entirely, Winsorization doesn't reduce the sample size.
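The two np.where steps above can also be written as a single call to the pandas clip method. The following is a sketch on made-up data; the 'Column_winsorized' name is only illustrative.

```python
import pandas as pd

# Hypothetical data: the values 1..99 plus one extreme value.
values = [float(x) for x in range(1, 100)] + [1000.0]
df = pd.DataFrame({"Column": values})

# Same 5th and 95th percentile limits as in the example above.
lower_limit = df["Column"].quantile(0.05)
upper_limit = df["Column"].quantile(0.95)

# clip caps values at both limits in one step.
df["Column_winsorized"] = df["Column"].clip(lower=lower_limit, upper=upper_limit)

# The row count is unchanged; only the extremes were pulled in.
print(len(df), df["Column_winsorized"].min(), df["Column_winsorized"].max())
```

Writing the result to a new column keeps the original values available for comparison.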