Linear regression models the relationship between one or more independent variables and a dependent variable by fitting a linear equation to the observed data.
Simple linear regression
Simple linear regression models the relationship between a dependent variable and a single independent variable.
Formula
y = mx + c
In the above formula:
- y is the dependent variable that we are trying to predict.
- x is the independent variable, or the input feature.
- m is the slope of the line, which represents the change in y for a unit change in x. A positive slope means the line goes up as we move to the right, while a negative slope means the line goes down as we move to the right. A slope of 0 means the line is horizontal. m is the coefficient of x.
- c is the intercept of the line, which represents the value of y when x is 0.
For example, the equation y = 5x + 10 has a slope of 5 and a y-intercept of 10. This means that the line goes up 5 for every 1 unit we move to the right, and it crosses the y-axis at the point (0, 10).
Example 1: y = 5x + 10 (positive slope, 5)
Example 2: y = -5x + 10 (negative slope, -5)
Example 3: y = 0x + 10 (slope of 0, a horizontal line)
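To see the effect of a positive, negative, and zero slope, here is a minimal matplotlib sketch that draws the three example lines; the x range of -5 to 5 is an arbitrary choice for illustration.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)  # arbitrary x range, just for illustration

plt.plot(x, 5 * x + 10, label='y = 5x + 10 (slope 5)')
plt.plot(x, -5 * x + 10, label='y = -5x + 10 (slope -5)')
plt.plot(x, 0 * x + 10, label='y = 0x + 10 (slope 0)')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()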
Let’s try to find the coefficient (m) and intercept (c) for the data below.
Canada_per_capita_income.csv
year,income
1970,3399.299037
1971,3768.297935
1972,4251.175484
1973,4804.463248
1974,5576.514583
1975,5998.144346
1976,7062.131392
1977,7100.12617
1978,7247.967035
1979,7602.912681
1980,8355.96812
1981,9434.390652
1982,9619.438377
1983,10416.53659
1984,10790.32872
1985,11018.95585
1986,11482.89153
1987,12974.80662
1988,15080.28345
1989,16426.72548
1990,16838.6732
1991,17266.09769
1992,16412.08309
1993,15875.58673
1994,15755.82027
1995,16369.31725
1996,16699.82668
1997,17310.75775
1998,16622.67187
1999,17581.02414
2000,18987.38241
2001,18601.39724
2002,19232.17556
2003,22739.42628
2004,25719.14715
2005,29198.05569
2006,32738.2629
2007,36144.48122
2008,37446.48609
2009,32755.17682
2010,38420.52289
2011,42334.71121
2012,42665.25597
2013,42676.46837
2014,41039.8936
2015,35175.18898
2016,34229.19363
As you can see in the above data, there are two columns:
- Year
- Canada's per capita income, up to the year 2016
We can use the year as x and the income as y.
How to calculate coefficient (m) and intercept (c)?
The following formula calculates the coefficient, or slope, of the equation.
m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²)
Where:
- n is the number of data points.
- ∑ represents the sum notation; ∑x is the sum of all x values.
- x is the independent variable (input).
- y is the dependent variable (output).
- xy represents the product of the corresponding x and y values.
The following formula calculates the y-intercept.
c = (∑y - m∑x) / n
Once you have the slope m and the y-intercept c, you can use them to predict y-values for given x-values using the equation y = mx + c. This lets you make predictions for new x-values that were not in the original dataset.
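As an illustration, here is a minimal sketch that computes m and c directly from these formulas with pandas. The file path is an assumption taken from the full application later in this post; adjust it to wherever your copy of the CSV lives.

import pandas as pd

# Assumed path, matching the full application below; change as needed.
df = pd.read_csv('data/csvs/Canada_per_capita_income.csv')

x = df['year']
y = df['income']
n = len(df)

# Slope: m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²)
m = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)

# Intercept: c = (∑y - m∑x) / n
c = (y.sum() - m * x.sum()) / n

print('slope m =', m)
print('intercept c =', c)
print('predicted income for 2020 =', m * 2020 + c)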
Predict using the scikit-learn library
Let’s predict the income from the input data ‘Canada_per_capita_income.csv’ using the scikit-learn library.
Step 1: Read the csv file content to Pandas dataframe.
df = pd.read_csv(csv_file)
Step 2: Split the data into features and target values.
X = df[['year']]
y = df[['income']]
df[['year']] selects the 'year' column as a DataFrame with a single column.
df[['income']] selects the 'income' column as a DataFrame with a single column.
The variable ‘X’ holds the features, or inputs, to the model, i.e. all the independent variables. The variable ‘y’ holds the dependent variable.
Step 3: Split the dataset into test and train data sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Using the train_test_split function from the sklearn.model_selection module, we can split the dataset into train and test sets.
The following table summarizes the important variables/parameters in the above statement.
Variable/Parameter | Description
X_train | The subset of feature data used to train the model.
X_test | The subset of feature data used to test the model.
y_train | The subset of output/target data used to train the model.
y_test | The expected outcomes for the X_test data.
test_size | The proportion of the data allocated to the testing set. In this example, 30% of the data is used for testing and 70% for training.
random_state | Sets the random seed for reproducibility, so that running the same code multiple times produces the same train-test split.
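As a quick sanity check, you can print the shapes of the four resulting objects; this is a small sketch, assuming X and y hold the 47 rows read from the CSV and the split from Step 3 has been run.

# With 47 rows and test_size=0.3, roughly 32 rows go to training and 15 to testing.
print(X_train.shape, X_test.shape)  # e.g. (32, 1) (15, 1)
print(y_train.shape, y_test.shape)  # e.g. (32, 1) (15, 1)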
Step 4: Initialize LinearRegression model and train the model on training dataset.
# Initialize the linear regression model
model = LinearRegression()
The above snippet creates an instance of the LinearRegression class and assigns it to the variable model.
# Train the model on the training data
model.fit(X_train, y_train)
The ‘fit’ method trains the model on the given features and target/output values. The linear regression model finds the best-fitting line that represents the relationship between the input features and the target values in the training data.
Step 5: Make predictions on the test feature set.
y_pred = model.predict(X_test)
Step 6: Calculate the Mean Squared Error (MSE) between the predicted values (y_pred) and the actual target values (y_test).
mse = mean_squared_error(y_test, y_pred)
Mean Squared Error is a common metric used to evaluate the performance of a regression model.
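For reference, MSE is the average of the squared differences between the actual and predicted values: MSE = (1/n) ∑(y_actual - y_predicted)². Here is a minimal sketch (assuming y_test and y_pred from the previous steps) that computes it by hand and compares the result with scikit-learn's helper.

import numpy as np
from sklearn.metrics import mean_squared_error

# Manual MSE: mean of the squared prediction errors.
errors = np.asarray(y_test) - np.asarray(y_pred)
manual_mse = np.mean(errors ** 2)

print('manual MSE :', manual_mse)
print('sklearn MSE:', mean_squared_error(y_test, y_pred))  # should match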
Step 7: Plot a scatter plot with the training data, test data, and predicted values.
plt.scatter(X_train, y_train, color='blue', label='Trained Data')
plt.scatter(X_test, y_test, color='red', label='Actual')
plt.scatter(X_test, y_pred, color='black', label='Predicted')
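Optionally, you can also draw the fitted regression line y = mx + c on the same plot; a small sketch (assuming df and model from the previous steps) follows.

# Predict over the full year range to draw the fitted line.
line_X = df[['year']].sort_values('year')  # keep 2-D shape for model.predict
plt.plot(line_X['year'], model.predict(line_X), color='green', label='Fitted line')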
Step 8: Predict the income for future years.
# Years for prediction
years_to_predict = [2020, 2021, 2022, 2023]
# Create a DataFrame for the years to predict
future_years = pd.DataFrame({'year': years_to_predict})
# Predict the per capita income for the future years
future_years_per_capita_income = model.predict(future_years)
# Add the predictions to the DataFrame
future_years['income'] = future_years_per_capita_income
The complete working application is given below.
canada_per_capita_income.py
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import os
csv_file = os.getcwd() + '/data/csvs/Canada_per_capita_income.csv'
# Read the CSV file
df = pd.read_csv(csv_file)
# Split the data into features (X) and target (y)
X = df[['year']]
y = df[['income']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the linear regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
# Print the coefficients and MSE
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
# Plot the data and regression line
# plt.plot(df["year"], df["income"], marker='o')
plt.scatter(X_train, y_train, color='blue', label='Trained Data')
plt.scatter(X_test, y_test, color='red', label='Actual')
plt.scatter(X_test, y_pred, color='black', label='Predicted')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Income')
plt.title('Linear Regression Example')
plt.show()
# Years for prediction
years_to_predict = [2020, 2021, 2022, 2023]
# Create a DataFrame for the years to predict
future_years = pd.DataFrame({'year': years_to_predict})
# Predict the per capita income for the future years
future_years_per_capita_income = model.predict(future_years)
# Add the predictions to the DataFrame
future_years['income'] = future_years_per_capita_income
print('\n', future_years)
Output
Coefficients: [[820.31304136]]
Intercept: [-1616038.77823245]
Mean Squared Error: 14000295.245757869

   year        income
0  2020  40993.565325
1  2021  41813.878366
2  2022  42634.191408
3  2023  43454.504449
References
https://www.kaggle.com/datasets/gurdit559/canada-per-capita-income-single-variable-data-set