Sunday, 16 March 2025

Multiple Linear Regression for Machine Learning: A Step-by-Step Tutorial

Multiple linear regression is used to model the relationship between multiple independent variables (features) and the dependent variable (target or outcome). 

 

The model takes the form

y = β0 + β1x1 + β2x2 + … + βpxp + ε

Here,

  1. y is the dependent variable (target).
  2. x1, x2, …, xp are the independent variables (features).
  3. β0 is the intercept (the value of y when all features are zero).
  4. β1, β2, …, βp are the coefficients for each feature.
  5. ε represents the error term.
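
As a quick illustration with made-up numbers: if β0 = 2000, β1 = 1500 (price added per GB of RAM) and β2 = 30 (price added per GB of storage), a phone with x1 = 4 and x2 = 64 would be predicted at y = 2000 + 1500·4 + 30·64 = 9920, plus the error term ε.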

 

 

Where to use multiple linear regression?

Multiple linear regression can be used to solve a variety of problems, such as:

 

  1. Predicting the price of a house based on its square footage, number of bedrooms, and location.
  2. Predicting the price of a second-hand mobile phone based on its specifications.
  3. Predicting how variables like company earnings, market capitalization, interest rates, and industry performance collectively affect a stock price.

 

Working example

I am using the Mobile Prices 2023 dataset from Kaggle (https://www.kaggle.com/datasets/howisusmanali/mobile-prices-2023).

 

Let’s try to understand the dataset before actual implementation.
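
Before exploring the data, load the CSV into a pandas DataFrame. A minimal sketch, assuming the file is saved locally as mobile_prices_2023.csv (the file name used in the full program below):

import pandas as pd

# Load the mobile prices dataset into a DataFrame
df = pd.read_csv('mobile_prices_2023.csv')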

 

Get total rows and columns

print(df.shape)

 

The above snippet prints the following line.

(1836, 11)

 

From the above output, I can see that there are 1836 rows and 11 columns in the data.

 

Get basic information about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1836 entries, 0 to 1835
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Phone Name         1836 non-null   object 
 1   Rating ?/5         1836 non-null   float64
 2   Number of Ratings  1836 non-null   object 
 3   RAM                1836 non-null   object 
 4   ROM/Storage        1662 non-null   object 
 5   Back/Rare Camera   1827 non-null   object 
 6   Front Camera       1435 non-null   object 
 7   Battery            1826 non-null   object 
 8   Processor          1781 non-null   object 
 9   Price in INR       1836 non-null   object 
 10  Date of Scraping   1836 non-null   object 
dtypes: float64(1), object(10)
memory usage: 157.9+ KB

From the above output, we can see that the dataset has 11 columns, and some of them (like ROM/Storage and Front Camera) have missing values.

 

Preview the first few rows

pd.set_option('display.max_columns', None)

print(df.head())

Phone Name,Rating ?/5,Number of Ratings,RAM,ROM/Storage,Back/Rare Camera,Front Camera,Battery,Processor,Price in INR,Date of Scraping
"POCO C50 (Royal Blue, 32 GB)",4.2,"33,561",2 GB RAM,32 GB ROM,8MP Dual Camera,5MP Front Camera,5000 mAh,"Mediatek Helio A22 Processor, Upto 2.0 GHz Processor","₹5,649",2023-06-17
"POCO M4 5G (Cool Blue, 64 GB)",4.2,"77,128",4 GB RAM,64 GB ROM,50MP + 2MP,8MP Front Camera,5000 mAh,Mediatek Dimensity 700 Processor,"₹11,999",2023-06-17
"POCO C51 (Royal Blue, 64 GB)",4.3,"15,175",4 GB RAM,64 GB ROM,8MP Dual Rear Camera,5MP Front Camera,5000 mAh,Helio G36 Processor,"₹6,999",2023-06-17
"POCO C55 (Cool Blue, 64 GB)",4.2,"22,621",4 GB RAM,64 GB ROM,50MP Dual Rear Camera,5MP Front Camera,5000 mAh,Mediatek Helio G85 Processor,"₹7,749",2023-06-17
"POCO C51 (Power Black, 64 GB)",4.3,"15,175",4 GB RAM,64 GB ROM,8MP Dual Rear Camera,5MP Front Camera,5000 mAh,Helio G36 Processor,"₹6,999",2023-06-17

Data cleaning

Handle missing values

Let’s see how many missing values are there in each column.

print(df.isnull().sum())

 

The above snippet prints the following information.

Phone Name             0
Rating ?/5             0
Number of Ratings      0
RAM                    0
ROM/Storage          174
Back/Rare Camera       9
Front Camera         401
Battery               10
Processor             55
Price in INR           0
Date of Scraping       0
dtype: int64

From the above output, I can clearly see that:

  1. ROM/Storage has 174 missing values.
  2. Back/Rare Camera has 9 missing values.
  3. Front Camera has 401 missing values.
  4. Battery has 10 missing values.
  5. Processor has 55 missing values.

 

To keep things simple, let's drop all the rows with missing values.

 

df=df.dropna()
print(df.isnull().sum())
print(df.shape)

 

The above snippet prints the following information.

Phone Name           0
Rating ?/5           0
Number of Ratings    0
RAM                  0
ROM/Storage          0
Back/Rare Camera     0
Front Camera         0
Battery              0
Processor            0
Price in INR         0
Date of Scraping     0
dtype: int64
(1291, 11)

 

After dropping all the rows with null values, I had 1291 rows and 11 columns.

 

Drop duplicates

 

df = df.drop_duplicates()
print(df.shape)

 

The above snippet prints the following information.

(1252, 11)

 

After dropping duplicates, I had 1252 rows and 11 columns.

 

Extract the brand from the column ‘Phone Name’ and encode it using label encoder

A phone name looks like ‘POCO C50 (Royal Blue, 32 GB)’; I just want to use the brand as a feature when predicting the price of the phone.

 

# Extract brand names and convert the names to numerical labels
df['Brand Label'] = df['Phone Name'].apply(lambda x: x.split()[0])

# Use LabelEncoder to convert categories to numerical labels
label_encoder = LabelEncoder()
df['Brand Label'] = label_encoder.fit_transform(df['Brand Label'])

 

LabelEncoder class is a utility provided by Scikit-learn to encode categorical labels (class labels) into numerical values.
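
For example, here is a quick illustration of what LabelEncoder does (the brand names below are just for demonstration):

from sklearn.preprocessing import LabelEncoder

brands = ['POCO', 'SAMSUNG', 'APPLE', 'POCO']
encoder = LabelEncoder()
# Classes are sorted alphabetically: APPLE=0, POCO=1, SAMSUNG=2
print(encoder.fit_transform(brands))   # prints [1 2 0 1]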

 

Normalize the ‘Number of Ratings’

df['Number of Ratings'] = df['Number of Ratings'].str.replace(',', '').astype(float)
scaler = MinMaxScaler()
df['Number of Ratings'] = scaler.fit_transform(df[['Number of Ratings']])
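
MinMaxScaler rescales every value into the range [0, 1] using (x - min) / (max - min). A small illustration with made-up rating counts:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

sample = pd.DataFrame({'Number of Ratings': [100.0, 550.0, 1000.0]})
scaler = MinMaxScaler()
# Each value is scaled into [0, 1]: prints 0.0, 0.5 and 1.0
print(scaler.fit_transform(sample))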

 

Convert the column 'Price in INR' to float

Right now the price is represented as a string like "₹5,649"; we need to convert it to a numeric value. The following snippet does this.

 

df['Price in INR'] = df['Price in INR'] \
    .str.replace(',', '') \
    .str.replace('₹', '') \
    .astype(float)

Convert the columns 'RAM’ and 'ROM/Storage' to float type

df['RAM'] = df['RAM'].str.extract(r'(\d+)').astype(float)
df['ROM/Storage'] = df['ROM/Storage'].str.extract(r'(\d+)').astype(float)

 

Convert the data of column 'Date of Scraping' to timestamp

df['Date of Scraping'] = df['Date of Scraping'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timestamp())

 

Extract the battery capacity from battery column

df['Battery Capacity'] = df['Battery'].str.extract(r'(\d+) mAh')

# Convert 'Battery Capacity' to numerical
df['Battery Capacity'] = pd.to_numeric(df['Battery Capacity'], errors='coerce')

 

Convert front and back camera details to number

# Extract back camera details
df['Back Camera MP'] = df['Back/Rare Camera'].str.extract(r'(\d+)MP')
df['Back Camera MP'] = pd.to_numeric(df['Back Camera MP'], errors='coerce')

# Extract front camera detail
df['Front Camera MP'] = df['Front Camera'].str.extract(r'(\d+)MP')
df['Front Camera MP'] = pd.to_numeric(df['Front Camera MP'], errors='coerce')
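
The regex extractions above can produce NaN for rows whose text does not match the expected pattern; as in the full program below, we drop those rows before encoding.

df = df.dropna()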

 

Encode "Brand Label", "Processor" using label encoding

columns_to_encode = ["Brand Label", "Processor"]
label_encoder = LabelEncoder()
for i in columns_to_encode:
    df[i] = label_encoder.fit_transform(df[i])

 

Now that all the data is in numeric format, it is time to drop the remaining textual columns.

df = df.drop(['Phone Name', 'Battery', 'Back/Rare Camera', 'Front Camera'], axis=1)

Divide the data into training and test samples

Using ‘train_test_split’ method, we can split the data into training and test sets.
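
Before splitting, we first separate the features (X) from the target (y), as in the full program below:

# Split the data into features (X) and target (y)
X = df.drop(['Price in INR'], axis=1)
y = df[['Price in INR']]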

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

The above line splits the dataset into two subsets:

  1. the training set (X_train, y_train) and
  2. the testing set (X_test, y_test)

 

 

X_train: This represents the feature data that will be used for training a machine learning model.

 

y_train: This represents the target data corresponding to the X_train data set.

 

X_test: This represents another set of feature data, separate from the training data. This set is used for evaluating the model's performance after it has been trained.

 

y_test: Similar to y_train, this represents the target data corresponding to the X_test data. It is used for evaluating the model's predictions.

 

test_size: It specifies the proportion of the dataset that should be allocated to the testing set. In our example, I set it to 0.3, which means 30% of the data will be used for testing, and the remaining 70% will be used for training.

 

random_state: Explicitly setting the random seed ensures reproducibility. By setting it to a fixed value (e.g., 42), you ensure that the data splitting process will be the same every time you run your code. Setting the random seed is helpful for debugging and comparing different model runs.

 

Initialize linear regression model

model = LinearRegression()

 

The above statement creates an instance of the LinearRegression model. Once the model object is created, you can use it to perform various tasks such as:

 

  1. training the model on your data,
  2. making predictions, and
  3. evaluating its performance.

 

Train the model on the training data

model.fit(X_train, y_train)

 

The above statement trains the model on the given dataset. The ‘fit’ method takes two arguments, X_train and y_train; the machine learning model uses the training data X_train and the corresponding target values y_train to learn the underlying patterns or relationships in the data.

 

 

For multiple linear regression, this means finding the intercept and the coefficients of the linear equation that best fits the data. In our example, the model will learn the values below.

  1. β0 is the intercept
  2. β1,β2,…,βp are the coefficients for each feature
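
After fitting, these learned values can be inspected through the model's attributes, as the full program does at the end:

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)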

 

Make predictions using the trained model

Now that we have a trained model, it is time to make predictions.

 

y_pred = model.predict(X_test)

 

The y_pred variable stores the predicted values generated by the model for the test features X_test.
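
Evaluate the model

To get a rough idea of how far the predictions are from the actual prices, the full program computes the mean squared error on the test set:

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)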

 

Find the complete working application below.

 

mobile_prices_2023.py

 

import os
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np

csv_file = 'mobile_prices_2023.csv'

input_data_set = pd.read_csv(csv_file)
df = input_data_set

def clean_and_transform_data(df):
    # Drop missing values
    df = df.dropna()

    # Drop duplicates
    df = df.drop_duplicates()

    # Use LabelEncoder to convert categories to numerical labels
    label_encoder = LabelEncoder()

    # Extract brand names and convert the names to numerical labels
    df['Brand Label'] = df['Phone Name'].apply(lambda x: x.split()[0])
    df['Brand Label'] = label_encoder.fit_transform(df['Brand Label'])

    # Normalize ratings
    df['Number of Ratings'] = df['Number of Ratings'].str.replace(',', '').astype(float)
    scaler = MinMaxScaler()
    df['Number of Ratings'] = scaler.fit_transform(df[['Number of Ratings']])

    # Convert the column 'Price in INR' to float
    df['Price in INR'] = df['Price in INR'] \
        .str.replace(',', '') \
        .str.replace('₹', '') \
        .astype(float)

    df['RAM'] = df['RAM'].str.extract(r'(\d+)').astype(float)
    df['ROM/Storage'] = df['ROM/Storage'].str.extract(r'(\d+)').astype(float)

    # Convert 'Date' to Unix timestamps
    df['Date of Scraping'] = df['Date of Scraping'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timestamp())

    # Extract battery capacity
    df['Battery Capacity'] = df['Battery'].str.extract(r'(\d+) mAh')

    # Convert 'Battery Capacity' to numerical
    df['Battery Capacity'] = pd.to_numeric(df['Battery Capacity'], errors='coerce')

    # Extract back camera details
    df['Back Camera MP'] = df['Back/Rare Camera'].str.extract(r'(\d+)MP')
    df['Back Camera MP'] = pd.to_numeric(df['Back Camera MP'], errors='coerce')

    # Extract front camera detail
    df['Front Camera MP'] = df['Front Camera'].str.extract(r'(\d+)MP')
    df['Front Camera MP'] = pd.to_numeric(df['Front Camera MP'], errors='coerce')

    df = df.dropna()

    columns_to_encode = ["Brand Label", "Processor"]
    label_encoder = LabelEncoder()
    for i in columns_to_encode:
        df[i] = label_encoder.fit_transform(df[i])

    df = df.drop(['Phone Name', 'Battery', 'Back/Rare Camera', 'Front Camera'], axis=1)

    return df


df = clean_and_transform_data(df)

# Split the data into features (X) and target (y)
X = df.drop(['Price in INR'], axis=1)
y = df[['Price in INR']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Print the coefficients and MSE
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)

# Test
test_samples = input_data_set.loc[[20, 35, 320, 345]]
new_rows_to_test = test_samples
new_rows_to_test = clean_and_transform_data(new_rows_to_test)
new_rows_to_test = new_rows_to_test.drop(['Price in INR'], axis=1)
predictions = model.predict(new_rows_to_test)

count = 0
print('\nActual Price \t predicted price')
for index, row in test_samples.iterrows():
    print(row['Price in INR'], '\t', predictions[count])
    count = count+1

 

Output

Coefficients: [[-1.47717811e+03 -6.10695250e+03 -1.26356266e+01  1.34486639e+02
   1.37388450e+00  2.49258392e-11 -9.34928574e+01 -4.79621280e+00
   7.71726101e+01  7.46384232e+01]]
Intercept: [31556.82111766]
Mean Squared Error: 91932502.53967698

Actual Price     predicted price
₹6,499   [6775.60749958]
₹17,499      [18537.81036251]
₹22,999      [24688.73868959]
₹18,999      [18175.84459827]

 

References

https://www.kaggle.com/datasets/howisusmanali/mobile-prices-2023

 

 
