Programming for beginners: What is a residual in Machine Learning?

A residual is the difference between an actual (observed) value and the value predicted by a model. Many evaluation metrics, such as mean squared error, are computed from residuals to assess the fit of a regression model.

Formula

εᵢ = yᵢ - ŷᵢ

Where:

εᵢ represents the residual for the i-th data point.
yᵢ is the actual value of the dependent variable for the i-th data point.
ŷᵢ is the predicted value of the dependent variable for the i-th data point. This is generated by the regression model.
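
To make the formula concrete, here is a minimal sketch that applies it to three made-up data points. The numbers in `actual` and `predicted` are hypothetical and chosen only for illustration; they do not come from any real model.

```python
# Hypothetical actual and predicted values (illustration only)
actual = [3.0, 5.0, 7.5]
predicted = [2.5, 5.0, 8.0]

# Residual for each point: actual value minus predicted value
residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]

print(residuals)  # [0.5, 0.0, -0.5]
```

The first point has a residual of 0.5, the second is predicted exactly, and the third has a residual of -0.5.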

Example

import numpy as np

year = np.array([2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015])
actual_population = np.array([1425.776, 1417.173, 1407.564, 1396.387, 1384.332, 1371.818, 1359.003, 1346.021, 1332.993])

# Fit a straight line (degree-1 polynomial) to the data
model1 = np.polyfit(year, actual_population, 1)

# Predict the values using the fitted slope (m) and intercept (c)
m = model1[0]
c = model1[1]
predicted_population = m * year + c

# Calculate residuals
residuals = actual_population - predicted_population

Positive residual

A positive residual indicates that the model underpredicted the value (predicted value < observed value).

Negative residual

A negative residual indicates that the model overpredicted the value (predicted value > observed value).
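
As a rough sketch, the sign of each residual can be used to count how often a model underpredicts or overpredicts. The residual values below are rounded from the output of the population script later in this post; the variable names `under` and `over` are just illustrative.

```python
import numpy as np

# Residuals rounded from the population example's output in this post
residuals = np.array([-3.65, -0.48, 1.68, 2.27, 1.99, 1.25, 0.20, -1.01, -2.26])

under = residuals > 0  # positive residual: the model predicted too low
over = residuals < 0   # negative residual: the model predicted too high

print("Underpredicted points:", int(under.sum()))
print("Overpredicted points:", int(over.sum()))
```

For these values, five points are underpredicted and four are overpredicted.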

We can visualize the distribution of residuals and identify any patterns or outliers using a scatter plot.

residuals.py

import numpy as np
import matplotlib.pyplot as plt

# Define the data: years and the corresponding population values
year = np.array([2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015])
actual_population = np.array([1425.776, 1417.173, 1407.564, 1396.387, 1384.332, 1371.818, 1359.003, 1346.021, 1332.993])

# Fit a straight line (degree-1 polynomial) to the data
model1 = np.polyfit(year, actual_population, 1)

# Predict the values using the fitted slope (m) and intercept (c)
m = model1[0]
c = model1[1]
predicted_population = m * year + c

# Calculate residuals
residuals = actual_population - predicted_population

print("Actual Values:", actual_population)
print("Predicted Values:", predicted_population)
print("Residuals:", residuals)

# Create a scatterplot of residuals
plt.scatter(np.arange(len(residuals)), residuals, c='blue', marker='o', label='Residuals')

# Add a horizontal line at y=0 for reference
plt.axhline(y=0, color='red', linestyle='--', linewidth=1, label='Zero Residual Line')

# Add labels and a legend
plt.xlabel('Data Point')
plt.ylabel('Residual')
plt.title('Residual Plot')
plt.legend()

# Show the plot
plt.grid()
plt.show()

Output

Actual Values: [1425.776 1417.173 1407.564 1396.387 1384.332 1371.818 1359.003 1346.021
 1332.993]
Predicted Values: [1429.42604444 1417.65472778 1405.88341111 1394.11209444 1382.34077778
 1370.56946111 1358.79814444 1347.02682778 1335.25551111]
Residuals: [-3.65004444 -0.48172778  1.68058889  2.27490556  1.99122222  1.24853889
  0.20485556 -1.00582778 -2.26251111]

plt.axhline(y=0, color='red', linestyle='--', linewidth=1, label='Zero Residual Line')

This line draws a dashed horizontal line at y=0, which makes it easy to see how far each residual deviates from zero.

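How are residuals used?

An effective model has residuals that are close to zero and scattered randomly around zero, with no visible pattern. You can check this visually with a residual plot like the one above. In this example the residuals go from negative to positive and back to negative, which suggests that a straight line does not fully capture the trend in the data.
We can also detect outliers using residuals by identifying the residuals that are much larger or smaller than the others.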
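
One simple way to flag such points is sketched below: standardize the residuals and mark those that lie more than a chosen number of standard deviations from the mean. The residual values are hypothetical, and the threshold of 2 is an assumption (a common rule of thumb), not a fixed rule; a different cutoff may suit your data better.

```python
import numpy as np

# Hypothetical residuals; the value at index 4 is deliberately large
residuals = np.array([0.5, -0.3, 0.2, -0.4, 6.0, 0.1, -0.2, 0.3])

# Standardize: how many standard deviations each residual is from the mean
z_scores = (residuals - residuals.mean()) / residuals.std()

# Assumed threshold: flag residuals more than 2 standard deviations away
threshold = 2.0
outliers = np.abs(z_scores) > threshold

print("Outlier indices:", np.flatnonzero(outliers))
print("Outlier residuals:", residuals[outliers])
```

Here only the residual 6.0 at index 4 is flagged. Note that a very large outlier inflates the standard deviation itself, so this simple check works best when outliers are few.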


Tuesday, 18 March 2025
