Monday, 24 February 2025

Simple Linear Regression for Machine Learning: A Step-by-Step Tutorial

Linear regression models the relationship between one or more independent variables and a dependent variable by fitting a linear equation to the observed data.

Simple linear regression

Simple linear regression models the relationship between a dependent variable and a single independent variable.

 

Formula

y = mx + c

 

In the above formula:

 

  1. y is the dependent variable that we are trying to predict.
  2. x is the independent variable, or the input feature.
  3. m is the slope of the line, which represents the change in y for a unit change in x. A positive slope means that the line goes up as we move to the right, while a negative slope means that the line goes down as we move to the right. A slope of 0 means that the line is horizontal. m is also the coefficient of x.
  4. c is the intercept of the line, which represents the value of y when x is 0.

 

For example, the equation y = 5x + 10 has a slope of 5 and a y-intercept of 10. This means that the line goes up 5 units for every 1 unit we move to the right, and it crosses the y-axis at the point (0, 10).

 

Example 1: y = 5x + 10 (positive slope, 5)

 


Example 2: y = -5x + 10 (negative slope, -5)

 


Example 3: y = 0x + 10 (slope 0, horizontal line)
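The three example lines above can be reproduced with a short matplotlib sketch (an illustration added here for convenience; it is not part of the original tutorial):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)

# One line per example: positive slope, negative slope, and zero slope.
plt.plot(x, 5 * x + 10, label='y = 5x + 10 (slope 5)')
plt.plot(x, -5 * x + 10, label='y = -5x + 10 (slope -5)')
plt.plot(x, 0 * x + 10, label='y = 0x + 10 (slope 0)')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()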


 

 

Let’s try to find the coefficient (m) and intercept (c) for the data below.

 

Canada_per_capita_income.csv 

year,income
1970,3399.299037
1971,3768.297935
1972,4251.175484
1973,4804.463248
1974,5576.514583
1975,5998.144346
1976,7062.131392
1977,7100.12617
1978,7247.967035
1979,7602.912681
1980,8355.96812
1981,9434.390652
1982,9619.438377
1983,10416.53659
1984,10790.32872
1985,11018.95585
1986,11482.89153
1987,12974.80662
1988,15080.28345
1989,16426.72548
1990,16838.6732
1991,17266.09769
1992,16412.08309
1993,15875.58673
1994,15755.82027
1995,16369.31725
1996,16699.82668
1997,17310.75775
1998,16622.67187
1999,17581.02414
2000,18987.38241
2001,18601.39724
2002,19232.17556
2003,22739.42628
2004,25719.14715
2005,29198.05569
2006,32738.2629
2007,36144.48122
2008,37446.48609
2009,32755.17682
2010,38420.52289
2011,42334.71121
2012,42665.25597
2013,42676.46837
2014,41039.8936
2015,35175.18898
2016,34229.19363

As you can see in the above data, there are two columns:

  1. year
  2. income (Canada's per capita income, up to the year 2016)

 

We can use the year as x and the income as y.

 

How to calculate the coefficient (m) and intercept (c)?

The following formula calculates the coefficient, or slope, of the equation:

m = (n∑xy − ∑x ∑y) / (n∑x² − (∑x)²)

Where:

 

  1. n is the number of data points.
  2. ∑ represents summation; for example, ∑x is the sum of all the x values.
  3. x is the independent variable (input).
  4. y is the dependent variable (output).
  5. xy represents the product of the corresponding x and y values.

 

The following formula calculates the y-intercept:

c = (∑y − m∑x) / n

 

Once you have the slope "m" and y-intercept "c," you can use them to predict y-values based on given x-values using the equation y=mx+c. This equation allows you to make predictions for new x-values that were not in the original dataset.
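As a sanity check, the two formulas above can be applied directly to the CSV data with a few lines of pandas (a minimal sketch; the file path is an assumption, adjust it to wherever the CSV lives on your machine):

import pandas as pd

df = pd.read_csv('Canada_per_capita_income.csv')  # path assumed

x = df['year']
y = df['income']
n = len(df)

# m = (n∑xy − ∑x ∑y) / (n∑x² − (∑x)²)
m = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)

# c = (∑y − m∑x) / n
c = (y.sum() - m * x.sum()) / n

print('slope m:', m)
print('intercept c:', c)
print('predicted income for 2020:', m * 2020 + c)

Because these formulas use the entire dataset while the scikit-learn example below fits on a 70% training split, the numbers will be close to, but not exactly equal to, the coefficients printed by the script.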

 

 

Predict using the scikit-learn library

Let’s forecast income from the input data ‘Canada_per_capita_income.csv’ using the scikit-learn library.

 

Step 1: Read the CSV file content into a Pandas DataFrame.

df = pd.read_csv(csv_file)
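A quick way to confirm the file was read correctly is to look at the first few rows and the overall shape (an optional check, assuming pandas has been imported as pd):

print(df.head())   # first five rows: year and income columns
print(df.shape)    # (47, 2) for this dataset: 47 years, 2 columns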

 

Step 2: Split the data into features and target values.

X = df[['year']]

y = df[['income']]

 

df[['year']] selects the 'year' column as a DataFrame with a single column.

df[['income']] selects the 'income' column as a DataFrame with a single column.

 

The variable ‘X’ holds the features, or inputs, to the model, i.e., all the independent variables. The variable ‘y’ holds the dependent variable.
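Note that the double brackets matter: df[['year']] keeps X two-dimensional, which is the shape scikit-learn expects for the feature matrix. A quick illustrative check:

print(type(df[['year']]))   # <class 'pandas.core.frame.DataFrame'>
print(type(df['year']))     # <class 'pandas.core.series.Series'>

print(X.shape)              # (47, 1) - 2D feature matrix
print(y.shape)              # (47, 1)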

 

 

Step 3: Split the dataset into training and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

 

Using the train_test_split function from the sklearn.model_selection module, we can split the dataset into train and test sets.

 

The following table summarizes the important variables/parameters in the above statement.

 

Variable/Parameter    Description

X_train               The subset of feature data used to train the model.

X_test                The subset of feature data used to test the model.

y_train               The subset of output/target data used to train the model.

y_test                The expected outcomes for the X_test data.

test_size             The proportion of the data allocated to the testing set. In this example, 30% of the data is used for testing and 70% for training.

random_state          Sets the random seed for reproducibility, so that running the same code multiple times produces the same train-test split.
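You can verify the split sizes by printing the shapes (an optional check; with 47 rows and test_size=0.3, roughly 70% of the rows end up in the training set and the rest in the test set):

print(X_train.shape, y_train.shape)   # training subset (about 70% of the rows)
print(X_test.shape, y_test.shape)     # test subset (about 30% of the rows)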

 

Step 4: Initialize the LinearRegression model and train it on the training dataset.

# Initialize the linear regression model
model = LinearRegression()

 

The above snippet creates an instance of the LinearRegression class and assigns it to the variable model.

# Train the model on the training data
model.fit(X_train, y_train)

 

The ‘fit’ method trains the model on the given features and target/output values. The linear regression model finds the best-fitting line that represents the relationship between the input features and the target values in the training data.
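After fit has run, the learned slope and intercept are available as attributes of the model (these are the same values printed by the complete program later):

print('slope (m):', model.coef_)          # coefficient of the 'year' feature
print('intercept (c):', model.intercept_)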

 

Step 5: Make predictions on the test feature set.

y_pred = model.predict(X_test)

Step 6: Calculate the Mean Squared Error (MSE) between the predicted values (y_pred) and the actual target values (y_test).

 

mse = mean_squared_error(y_test, y_pred)

 

Mean Squared Error is a common metric used to evaluate the performance of a regression model.
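Since MSE is expressed in squared units of income, it is sometimes easier to interpret its square root, which is back in the same units as the income values (an optional extra, not part of the original script):

rmse = mse ** 0.5
print('Root Mean Squared Error:', rmse)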

 

Step 7: Plot a scatter plot with the training data, test data, and predicted values.

 

plt.scatter(X_train, y_train, color='blue', label='Trained Data')
plt.scatter(X_test, y_test, color='red', label='Actual')
plt.scatter(X_test, y_pred, color='black', label='Predicted')
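If you also want to draw the fitted regression line itself (an addition to the original snippet), you can predict over the full feature range and finish the plot with labels and a legend:

plt.plot(X, model.predict(X), color='green', label='Regression line')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Income')
plt.show()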

 

Step 8: Predict the income for future years.

# Years for prediction
years_to_predict = [2020, 2021, 2022, 2023]

# Create a DataFrame for the years to predict
future_years = pd.DataFrame({'year': years_to_predict})

# Predict the per capita income for the future years
future_years_per_capita_income = model.predict(future_years)

# Add the predictions to the DataFrame
future_years['income'] = future_years_per_capita_income
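As a cross-check, one of these predictions can be recomputed by hand from y = mx + c using the fitted slope and intercept (since y was given as a single-column DataFrame, coef_ is a 2D array and intercept_ is a 1D array, hence the indexing below):

m = model.coef_[0][0]
c = model.intercept_[0]
print(m * 2020 + c)   # should match the model's prediction for the year 2020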

 

The complete working application is given below.

 

canada_per_capita_income.py

 

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import os

csv_file = os.getcwd() + '/data/csvs/Canada_per_capita_income.csv'

# Read the CSV file
df = pd.read_csv(csv_file)

# Split the data into features (X) and target (y)
X = df[['year']]
y = df[['income']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Print the coefficients and MSE
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)

# Plot the data and regression line
# plt.plot(df["year"], df["income"], marker='o')
plt.scatter(X_train, y_train, color='blue', label='Trained Data')
plt.scatter(X_test, y_test, color='red', label='Actual')
plt.scatter(X_test, y_pred, color='black', label='Predicted')

plt.legend()
plt.xlabel('Year')
plt.ylabel('Income')
plt.title('Linear Regression Example')
plt.show()

# Years for prediction
years_to_predict = [2020, 2021, 2022, 2023]

# Create a DataFrame for the years to predict
future_years = pd.DataFrame({'year': years_to_predict})

# Predict the per capita income for the future years
future_years_per_capita_income = model.predict(future_years)

# Add the predictions to the DataFrame
future_years['income'] = future_years_per_capita_income

print('\n', future_years)

Output


Coefficients: [[820.31304136]]
Intercept: [-1616038.77823245]
Mean Squared Error: 14000295.245757869

    year        income
0  2020  40993.565325
1  2021  41813.878366
2  2022  42634.191408
3  2023  43454.504449

 


References

https://www.kaggle.com/datasets/gurdit559/canada-per-capita-income-single-variable-data-set
