Outliers are data points that are very different from most of the other values in a dataset. They can make analysis misleading because they don’t follow the usual pattern of the data. Since outliers can affect results and lead to incorrect conclusions, it’s important to find them and decide how to handle them properly.
Examples of outliers
- In an employee salary dataset, an outlier could be someone whose salary is much higher or much lower than everyone else's.
- In a customer purchases dataset, suppose a typical customer spends between $25 and $150. An outlier could be a customer who spends $25,000.
- Outliers can occur because of data entry errors. For example, when calculating the average weight of a group of people, a value of 10,000 kg is almost certainly a data entry error.
- Suppose house prices in your locality range from $150,000 to $300,000. Typical outliers would be $25,000 or $1 million.
- Suppose your application's API response times usually fall between 500 ms and 4 seconds. A response time of 1 minute would be an outlier.
- Suppose most customers place two or three orders a month. Hundreds of orders from one customer within a month would be an outlier.
How do outliers occur?
The following are some common causes of outliers.
- Data entry errors
- Errors in the measurement process
- Sampling: a very small sample drawn from a larger dataset may over-represent unusual values
- Intentional tampering with the data
How to identify outliers?
There are several ways to identify outliers.
a. Visually inspecting the data
Plotting the data as a box plot, histogram, or scatter plot makes outliers easy to spot.
For example, in a box plot, outliers are typically shown as individual points beyond the "whiskers" of the plot.
box_plot.py
import matplotlib.pyplot as plt
import numpy as np

# Create some sample data
house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000])

# Highlight outliers with red color
outlier_color = dict(markerfacecolor='r', marker='o', markersize=15, linestyle='none')

# Create a box plot of the house prices
plt.boxplot(house_prices, flierprops=outlier_color)
plt.show()
Output
Let’s look at another example, which demonstrates a limitation of the box plot.
Consider the following house prices.
house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])
Intuitively, there are two outliers here.
- 700000
- 200
However, the box plot does not flag 200 as an outlier. This is because the points a box plot marks as outliers depend on the detection method Matplotlib uses.
By default, Matplotlib's boxplot function identifies outliers based on the interquartile range (IQR) method. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1). An outlier is any data point that is more than 1.5 times the IQR below Q1 or above Q3.
Step 1: Sort the house prices.
[200, 100000, 120000, 150000, 170000, 200000, 700000]
Step 2: Calculate Q1 and Q3.
Q1 is the lower quartile, which corresponds to the 25th percentile, and Q3 is the upper quartile, which corresponds to the 75th percentile.
IQR = Q3 - Q1
iqr.py
import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame({'data': [100000, 120000, 150000, 170000, 200000, 700000, 200]})

# Calculate the IQR
Q3 = df['data'].quantile(0.75)
Q1 = df['data'].quantile(0.25)
iqr = Q3 - Q1

print(f'Q1 : {Q1}')
print(f'Q3 : {Q3}')
print(f'iqr : {iqr}')
Output
Q1 : 110000.0
Q3 : 185000.0
iqr : 75000.0
Step 3: Calculate the lower and upper bounds.
Lower Bound: Q1 - 1.5 * IQR = 110000 - 1.5 * 75000 = 110000 - 112500 = -2500
Upper Bound: Q3 + 1.5 * IQR = 185000 + 1.5 * 75000 = 185000 + 112500 = 297500
Because the lower bound is -2500 and 200 > -2500, the value 200 is treated as a regular data point, not an outlier. The value 700000 is above the upper bound of 297500, so it is flagged.
You can confirm this with the following program.
box_plot_issues.py
import numpy as np
import matplotlib.pyplot as plt

# Sample data
house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Calculate Q1, Q3, IQR, and the outlier bounds
Q1 = np.percentile(house_prices, 25)
Q3 = np.percentile(house_prices, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Create a single horizontal box plot, highlighting outliers in red
plt.figure(figsize=(8, 6))
outlier_color = dict(markerfacecolor='r', marker='o', markersize=8, linestyle='none')
plt.boxplot(house_prices, vert=False, flierprops=outlier_color)

# Optional: uncomment to annotate the plot with Q1, Q3, and IQR values
# plt.text(Q1, 1.05, f'Q1 = {Q1}', fontsize=12, color='b', horizontalalignment='center')
# plt.text(Q3, 1.05, f'Q3 = {Q3}', fontsize=12, color='b', horizontalalignment='center')
# plt.text((Q1 + Q3) / 2, 1.15, f'IQR = {IQR}', fontsize=12, color='b', horizontalalignment='center')
# plt.text(lower_bound, 1.25, f'Lower IQR Bound = {lower_bound}', fontsize=12, color='g', horizontalalignment='center')
# plt.text(upper_bound, 1.25, f'Upper IQR Bound = {upper_bound}', fontsize=12, color='g', horizontalalignment='center')

# Create custom legend entries that show the computed values
custom_legend_label = [
    plt.Line2D([0], [0], marker='o', color='r', label='Outliers', markersize=8, linestyle='none'),
    plt.Line2D([0], [0], marker='', color='b', label=f'Q1 = {Q1}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='b', label=f'Q3 = {Q3}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='b', label=f'IQR = {IQR}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='g', label=f'Lower IQR Bound = {lower_bound}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='g', label=f'Upper IQR Bound = {upper_bound}', markersize=0, linestyle='-')
]

# Add the custom legend to the plot
plt.legend(handles=custom_legend_label, loc='upper right')
plt.xlabel('House Price (in USD)')
plt.title('House Prices with Quartiles and IQR Values')
plt.grid(True)
plt.show()
Output
Identify the outliers using a scatter plot
scatter_plot.py
import numpy as np
import matplotlib.pyplot as plt

# Sample house prices
house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Create an array for x-axis values (e.g., index or house number)
x_values = np.arange(1, len(house_prices) + 1)

# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(x_values, house_prices, color='blue', label='House Prices')
plt.xlabel('House Number')
plt.ylabel('House Price (in USD)')
plt.title('House Prices Scatter Plot')
plt.grid(True)
plt.legend()
plt.show()
Output
In the plot above, you can clearly see that the prices 200 and 700000 are outliers.
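b. By calculating the summary statistics of the data points
Summary statistics such as the minimum, maximum, mean, median, standard deviation, and variance can reveal outliers. For example, a minimum or maximum that lies far from the median, or a mean that differs noticeably from the median, suggests that extreme values are present.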
summary_stats.py
import numpy as np

# Data points
data = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Calculate summary statistics
mean = np.mean(data)          # Mean
median = np.median(data)      # Median
std_dev = np.std(data)        # Standard Deviation
variance = np.var(data)       # Variance
min_value = np.min(data)      # Minimum Value
max_value = np.max(data)      # Maximum Value
range_value = np.ptp(data)    # Range (Peak-to-Peak)

# Print the summary statistics
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)
print("Variance:", variance)
print("Minimum Value:", min_value)
print("Maximum Value:", max_value)
print("Range (Peak-to-Peak):", range_value)
Output
Mean: 205742.85714285713
Median: 150000.0
Standard Deviation: 210268.25626289082
Variance: 44212739591.83672
Minimum Value: 200
Maximum Value: 700000
Range (Peak-to-Peak): 699800
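One way to read these numbers is to see how much the mean and median move when the suspicious values are left out. The following is a minimal sketch using only Python's standard `statistics` module. The `suspects` set is picked by eye for illustration and is an assumption, not something the code above produces.

```python
import statistics

data = [100000, 120000, 150000, 170000, 200000, 700000, 200]

# Hypothetical: the values we suspect are outliers, chosen by eye.
suspects = {700000, 200}
trimmed = [x for x in data if x not in suspects]

# Compare the mean and median with and without the suspect values.
mean_all = statistics.mean(data)
median_all = statistics.median(data)
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"All data:         mean = {mean_all:.2f}, median = {median_all}")
print(f"Without suspects: mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

The mean drops from about 205,743 to 148,000 once the two values are removed, while the median stays at 150,000. A mean that shifts this much while the median does not is a sign that extreme values are present.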
c. By calculating z-score of each data point
Z-score quantifies how far a particular data point is from the mean (average) of a dataset in terms of standard deviations.
Formula
Z = (X - μ) / σ
Where
- Z is the z-score of the data point.
- X is the data point you want to standardize.
- μ is the mean (average) of the dataset.
- σ is the standard deviation of the dataset.
Data points with Z-scores beyond a certain threshold (e.g., ±2) are often considered outliers.
z_scores.py
import numpy as np

# Data points
data = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Calculate the mean and standard deviation of the data
mean = np.mean(data)
std_dev = np.std(data)

# Calculate z-scores for each data point
z_scores = (data - mean) / std_dev

# Print the z-scores
print("Z-Scores:", z_scores)
Output
Z-Scores: [-0.50289501 -0.40777842 -0.26510353 -0.16998694 -0.02731205 2.35060276 -0.97752681]
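To turn the z-scores into a decision, compare their absolute values with a threshold. The sketch below uses the standard library rather than NumPy and the ±2 cut-off given above as an example. `statistics.pstdev` is the population standard deviation, which is also what `np.std` computes by default. The variable names are illustrative.

```python
import statistics

data = [100000, 120000, 150000, 170000, 200000, 700000, 200]
threshold = 2  # assumption: the example cut-off of ±2 from the text

mean = statistics.mean(data)
std_dev = statistics.pstdev(data)  # population standard deviation

# Z-score of each point, then keep the points whose |z| exceeds the threshold.
z_scores = [(x - mean) / std_dev for x in data]
outliers = [x for x, z in zip(data, z_scores) if abs(z) > threshold]

print("Outliers:", outliers)
```

On this data only 700000 is flagged. The value 200 has a z-score of about -0.98, so a ±2 threshold does not catch it.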
d. Using the IQR method
This method is explained in the box plot section above: any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is treated as an outlier.
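The IQR rule can also be applied directly in code, without drawing a plot. The following is a minimal sketch using the standard library. The function name `iqr_outliers` and its `k` parameter are made up for this example. For this data, `statistics.quantiles` with `method="inclusive"` returns Q1 = 110000 and Q3 = 185000, the same quartiles pandas reported earlier.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the IQR rule."""
    # With n=4, quantiles() returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Keep only the values that fall outside the two bounds.
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers

house_prices = [100000, 120000, 150000, 170000, 200000, 700000, 200]
lower_bound, upper_bound, outliers = iqr_outliers(house_prices)

print("Lower bound:", lower_bound)
print("Upper bound:", upper_bound)
print("Outliers:", outliers)
```

For the house prices this flags only 700000, because 200 still lies above the lower bound of -2500.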
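e. Use your domain knowledge
Use what you know about the domain to decide which values are implausible. For example, a body weight of 10,000 kg cannot be a valid measurement, whatever the statistics say.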
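Domain rules can often be written as simple range checks. In the sketch below, the plausible price range is a made-up assumption for illustration. In practice it would come from knowledge of the local housing market.

```python
# Hypothetical plausible range for house prices in this locality (USD).
MIN_PLAUSIBLE_PRICE = 50_000
MAX_PLAUSIBLE_PRICE = 500_000

house_prices = [100000, 120000, 150000, 170000, 200000, 700000, 200]

# Flag every price that falls outside the range we consider realistic.
implausible = [
    price for price in house_prices
    if not MIN_PLAUSIBLE_PRICE <= price <= MAX_PLAUSIBLE_PRICE
]

print("Implausible prices:", implausible)
```

A range check like this catches both 700000 and 200, because it does not depend on how the rest of the sample is distributed.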
f. Use machine learning models to detect outliers
Algorithms such as isolation forests and one-class SVMs can identify points that deviate significantly from the majority of the data.
isolation_forest.py
import numpy as np
from sklearn.ensemble import IsolationForest

# Data points (reshaped to a single-feature column, as scikit-learn expects)
data = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200]).reshape(-1, 1)

# Create an Isolation Forest model
model = IsolationForest(contamination='auto', random_state=42)

# Fit the model to the data
model.fit(data)

# Predict outliers/anomalies (-1 marks an outlier, 1 marks an inlier)
outliers = model.predict(data)

# Find and print the outlier numbers
outlier_indices = np.where(outliers == -1)[0]
outlier_values = data[outlier_indices]
print("Outlier Numbers:", outlier_values)
Output
Outlier Numbers: [[700000] [ 200]]