Sunday, 2 March 2025

What are outliers in machine learning?

 

Outliers are data points that are very different from most of the other values in a dataset. They can make analysis misleading because they don’t follow the usual pattern of the data. Since outliers can affect results and lead to incorrect conclusions, it’s important to find them and decide how to handle them properly.

 

Examples of outliers

  1. In an employee salary dataset, an outlier could be someone whose salary is far higher or far lower than everyone else's in the dataset.
  2. In a customer purchases dataset, suppose a typical customer spends between $25 and $150. An outlier could be a customer who spends $25,000.
  3. Outliers can result from data entry errors. For example, when calculating the average weight of a group of people, a value of 10,000 kg is an outlier that is most likely a data entry error.
  4. Suppose house prices in your locality range from $150,000 to $300,000. Outliers could be prices such as $25,000 or $1 million.
  5. Suppose your application's API response times usually fall between 500 ms and 4 seconds. A response time of 1 minute is an outlier.
  6. Most customers purchase items two or three times a month. Hundreds of orders from a single customer within a month would be an outlier.

 

How do outliers occur?

The following are some common causes of outliers.

  1. Data entry errors.
  2. Errors in the measurement process.
  3. Taking a very small sample from a larger dataset, which can lead to outliers in the sample.
  4. Intentional tampering with the data.

 

How to identify outliers?

There are various ways to identify outliers.

 

a. Visually inspecting the data

Visualizing the data with box plots, histograms, or scatter plots can reveal outliers.

 

For example, in a box plot, outliers are typically shown as individual points beyond the "whiskers" of the plot.

 

box_plot.py

import matplotlib.pyplot as plt
import numpy as np

# Create some sample data
house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000])

# Highlight outliers with red color
outlier_color = dict(markerfacecolor='r', marker='o', markersize=15, linestyle='none')

# Create a box plot of the house prices
plt.boxplot(house_prices, flierprops=outlier_color)
plt.show()

Output



Let’s look at another example that demonstrates a limitation of the box plot.

 

Let’s have a look at the following house prices.

house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

 

Intuitively, there are two outliers here.

  1. 700000
  2. 200

However, the box plot does not flag 200 as an outlier. This is a consequence of the outlier detection rule that Matplotlib's box plot uses by default.

 

By default, Matplotlib's boxplot function identifies outliers based on the interquartile range (IQR) method. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1). An outlier is any data point that is more than 1.5 times the IQR below Q1 or above Q3.

 

Step 1: Sort the house prices.

[200, 100000, 120000, 150000, 170000, 200000, 700000]

 

Step 2: Calculate Q1 and Q3.

 

‘Q1’ is the lower quartile, which corresponds to the 25th percentile, and ‘Q3’ is the upper quartile, which corresponds to the 75th percentile.

 

IQR = Q3 - Q1

 

iqr.py

import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame({'data': [100000, 120000, 150000, 170000, 200000, 700000, 200]})

# Calculate the IQR
Q3 = df['data'].quantile(0.75)
Q1 = df['data'].quantile(0.25)
iqr = Q3 - Q1

print(f'Q1 : {Q1}')
print(f'Q3 : {Q3}')
print(f'iqr : {iqr}')

Output

Q1 : 110000.0
Q3 : 185000.0
iqr : 75000.0


Lower Bound: Q1 - 1.5 * IQR = 110000 - 1.5 * 75000 = 110000 - 112500 = -2500
Upper Bound: Q3 + 1.5 * IQR = 185000 + 1.5 * 75000 = 185000 + 112500 = 297500

Since the lower bound is -2500 and 200 > -2500, the value 200 is treated as a regular data point, not an outlier. The value 700000 is above the upper bound of 297500, so it is flagged as an outlier.
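
The bound calculation above can be sketched as a small reusable function. This is a minimal illustration, not the code Matplotlib runs internally: the function name `iqr_outliers` and its parameter `k` are made up for this example, and it uses Python's standard `statistics` module with `method="inclusive"`, which reproduces the Q1 and Q3 values shown above for this data.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the IQR rule.

    `k` is the multiplier applied to the IQR (1.5 in the text above).
    """
    # n=4 gives the three quartile cut points: Q1, median, Q3.
    # method="inclusive" interpolates linearly between the data points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    outliers = sorted(v for v in values if v < lower_bound or v > upper_bound)
    return lower_bound, upper_bound, outliers


house_prices = [100000, 120000, 150000, 170000, 200000, 700000, 200]

lower, upper, flagged = iqr_outliers(house_prices)
print("Lower bound:", lower)
print("Upper bound:", upper)
print("Outliers:", flagged)
```

Lowering `k` tightens the bounds. For this data, `iqr_outliers(house_prices, k=1.0)` also flags 200. In Matplotlib, the `whis` parameter of `boxplot` plays the role of this multiplier.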

 

You can confirm this with the following program.

 

box_plot_issues.py

import numpy as np
import matplotlib.pyplot as plt

# Sample data
house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Calculate Q1, Q3, and IQR
Q1 = np.percentile(house_prices, 25)
Q3 = np.percentile(house_prices, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Create a horizontal box plot and highlight outliers in red
plt.figure(figsize=(8, 6))
outlier_color = dict(markerfacecolor='r', marker='o', markersize=8, linestyle='none')
plt.boxplot(house_prices, vert=False, flierprops=outlier_color)

# Annotate the plot with Q1, Q3, and IQR values
# plt.text(Q1, 1.05, f'Q1 = {Q1}', fontsize=12, color='b', horizontalalignment='center')
# plt.text(Q3, 1.05, f'Q3 = {Q3}', fontsize=12, color='b', horizontalalignment='center')
# plt.text((Q1 + Q3) / 2, 1.15, f'IQR = {IQR}', fontsize=12, color='b', horizontalalignment='center')
# plt.text(lower_bound, 1.25, f'Lower IQR Bound = {lower_bound}', fontsize=12, color='g', horizontalalignment='center')
# plt.text(upper_bound, 1.25, f'Upper IQR Bound = {upper_bound}', fontsize=12, color='g', horizontalalignment='center')

# Create a custom legend label
custom_legend_label = [
    plt.Line2D([0], [0], marker='o', color='r', label='Outliers', markersize=8, linestyle='none'),
    plt.Line2D([0], [0], marker='', color='b', label=f'Q1 = {Q1}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='b', label=f'Q3 = {Q3}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='b', label=f'IQR = {IQR}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='g', label=f'Lower IQR Bound = {lower_bound}', markersize=0, linestyle='-'),
    plt.Line2D([0], [0], marker='', color='g', label=f'Upper IQR Bound = {upper_bound}', markersize=0, linestyle='-')
]

# Add the custom legend label to the plot
plt.legend(handles=custom_legend_label, loc='upper right')

plt.xlabel('House Price (in USD)')
plt.title('House Prices with Quartiles and IQR Values')
plt.grid(True)

plt.show()

Output



 

Identify the outliers using a scatter plot

 

scatter_plot.py

 

import numpy as np
import matplotlib.pyplot as plt

# Sample house prices
house_prices = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Create an array for x-axis values (e.g., index or house number)
x_values = np.arange(1, len(house_prices) + 1)

# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(x_values, house_prices, color='blue', label='House Prices')

plt.xlabel('House Number')
plt.ylabel('House Price (in USD)')
plt.title('House Prices Scatter Plot')
plt.grid(True)

plt.legend()
plt.show()

 

Output

 


 

In the plot above, the prices 200 and 700000 clearly stand out as outliers.

 

b. By calculating the summary statistics of the data points

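Summary statistics such as the minimum, maximum, mean, median, standard deviation, and variance can also point to outliers. For example, a mean that is far from the median, or a minimum or maximum that is far from the bulk of the data, suggests that extreme values are present.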

 

summary_stats.py

 

import numpy as np

# Data points
data = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Calculate summary statistics
mean = np.mean(data)             # Mean
median = np.median(data)         # Median
std_dev = np.std(data)           # Standard Deviation
variance = np.var(data)          # Variance
min_value = np.min(data)         # Minimum Value
max_value = np.max(data)         # Maximum Value
range_value = np.ptp(data)       # Range (Peak-to-Peak)

# Print the summary statistics
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)
print("Variance:", variance)
print("Minimum Value:", min_value)
print("Maximum Value:", max_value)
print("Range (Peak-to-Peak):", range_value)

 

Output

Mean: 205742.85714285713
Median: 150000.0
Standard Deviation: 210268.25626289082
Variance: 44212739591.83672
Minimum Value: 200
Maximum Value: 700000
Range (Peak-to-Peak): 699800
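
One rough way to read these numbers is to compare the mean with the median, since extreme values pull the mean but barely move the median. The sketch below is only an illustration: the helper `mean_median_gap` is a hypothetical name introduced here, and it uses the standard `statistics` module instead of NumPy.

```python
from statistics import mean, median

data = [100000, 120000, 150000, 170000, 200000, 700000, 200]


def mean_median_gap(values):
    """Relative gap between mean and median (hypothetical helper)."""
    return (mean(values) - median(values)) / median(values)


# Gap for the full dataset
gap_all = mean_median_gap(data)

# Gap after dropping the largest value, to see how much it pulls the mean
largest = max(data)
trimmed = [v for v in data if v != largest]
gap_trimmed = mean_median_gap(trimmed)

print(f"Gap with all values: {gap_all:.2f}")
print(f"Gap without {largest}: {gap_trimmed:.2f}")
```

A large positive gap hints at unusually high values and a negative gap at unusually low ones. This is a hint to investigate further, not a detection rule.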

 

c. By calculating z-score of each data point

Z-score quantifies how far a particular data point is from the mean (average) of a dataset in terms of standard deviations.

 

Formula

Z = (X - μ) / σ

Where

  1. Z is the z-score of the data point.
  2. X is the data point you want to standardize.
  3. μ is the mean (average) of the dataset.
  4. σ is the standard deviation of the dataset.

 

Data points whose z-score lies beyond a chosen threshold (e.g., ±2 or ±3) are often considered outliers.

 

z_scores.py

import numpy as np

# Data points
data = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200])

# Calculate the mean and standard deviation of the data
mean = np.mean(data)
std_dev = np.std(data)

# Calculate z-scores for each data point
z_scores = (data - mean) / std_dev

# Print the z-scores
print("Z-Scores:", z_scores)

 

Output

Z-Scores: [-0.50289501 -0.40777842 -0.26510353 -0.16998694 -0.02731205  2.35060276 -0.97752681]
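
Applying a threshold to these z-scores can be sketched as follows. The function name `zscore_outliers` and the default threshold of 2 are choices made for this illustration. The code uses `statistics.pstdev`, the population standard deviation, which matches the default behaviour of `np.std`.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [v for v in values if abs((v - mu) / sigma) > threshold]


data = [100000, 120000, 150000, 170000, 200000, 700000, 200]

flagged = zscore_outliers(data)
print("Outliers at threshold 2:", flagged)
```

With a threshold of 2, only 700000 is flagged. The value 200 has a z-score of about -0.98 and is not flagged, because the extreme value 700000 inflates both the mean and the standard deviation.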

 

d. Using IQR method

This is already explained in the box plot section above.

 

e. Use your domain knowledge

Use your knowledge of the domain to decide which values are implausible and should be treated as outliers.
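
A domain rule can be as simple as a plausible range. In the sketch below, the range limits are assumptions invented for illustration and are not taken from any real housing market. The point is that a rule like this flags 200, which the IQR method above does not.

```python
# Hypothetical plausible range for house prices in USD (assumed for illustration)
MIN_PLAUSIBLE_PRICE = 50_000
MAX_PLAUSIBLE_PRICE = 500_000


def outside_plausible_range(values, low, high):
    """Return the values that fall outside the range [low, high]."""
    return [v for v in values if v < low or v > high]


house_prices = [100000, 120000, 150000, 170000, 200000, 700000, 200]

suspicious = outside_plausible_range(
    house_prices, MIN_PLAUSIBLE_PRICE, MAX_PLAUSIBLE_PRICE
)
print("Implausible prices:", suspicious)
```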

 

f. Use machine learning models to predict outliers

Algorithms such as Isolation Forest and one-class SVM can be used to identify data points that deviate significantly from the majority.

 

isolation_forest.py

 

import numpy as np
from sklearn.ensemble import IsolationForest

# Data points
data = np.array([100000, 120000, 150000, 170000, 200000, 700000, 200]).reshape(-1, 1)

# Create an Isolation Forest model
model = IsolationForest(contamination='auto', random_state=42)

# Fit the model to the data
model.fit(data)

# Predict outliers/anomalies
outliers = model.predict(data)

# Find and print the outlier numbers
outlier_indices = np.where(outliers == -1)[0]
outlier_values = data[outlier_indices]

print("Outlier Numbers:", outlier_values)

Output

Outlier Numbers: [[700000]
 [   200]]

 

 

 
