Multiple linear regression is used to model the relationship between multiple independent variables (features) and the dependent variable (target or outcome).
The model takes the form

y = β0 + β1x1 + β2x2 + … + βpxp + ε

Here
- y is the dependent variable (target).
- x1, x2, …, xp are the independent variables (features).
- β0 is the intercept (the value of y when all features are zero).
- β1, β2, …, βp are the coefficients for each feature.
- ε represents the error term.
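For example, with two features and purely illustrative (made-up) coefficients, the prediction is just the intercept plus a weighted sum of the feature values:

# A minimal numeric sketch with made-up coefficients (not learned from any dataset)
b0 = 5.0            # intercept β0
b1, b2 = 2.0, 0.5   # coefficients β1, β2
x1, x2 = 3.0, 4.0   # feature values

y = b0 + b1 * x1 + b2 * x2   # 5.0 + 6.0 + 2.0, ignoring the error term ε
print(y)                     # 13.0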
Where to use multiple linear regression?
Multiple linear regression can be used to solve a variety of problems, such as:
- Predicting the price of a house based on its square footage, number of bedrooms, and location.
- Predicting the price of a second-hand mobile phone.
- Predicting a stock price from variables like company earnings, market capitalization, interest rates, and industry performance.
Working example
I am using the Mobile Prices 2023 dataset from Kaggle (https://www.kaggle.com/datasets/howisusmanali/mobile-prices-2023).
Let’s try to understand the dataset before the actual implementation.
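Assuming the CSV has been downloaded locally as 'mobile_prices_2023.csv' (the same file name used in the complete script at the end of this post), it can be loaded into a pandas DataFrame:

import pandas as pd

csv_file = 'mobile_prices_2023.csv'
df = pd.read_csv(csv_file)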
Get total rows and columns
print(df.shape)
The above snippet prints the following line.
(1836, 11)
From the above output, I can see that there are 1836 rows and 11 columns in the dataset.
Get basic information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1836 entries, 0 to 1835
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Phone Name         1836 non-null   object
 1   Rating ?/5         1836 non-null   float64
 2   Number of Ratings  1836 non-null   object
 3   RAM                1836 non-null   object
 4   ROM/Storage        1662 non-null   object
 5   Back/Rare Camera   1827 non-null   object
 6   Front Camera       1435 non-null   object
 7   Battery            1826 non-null   object
 8   Processor          1781 non-null   object
 9   Price in INR       1836 non-null   object
 10  Date of Scraping   1836 non-null   object
dtypes: float64(1), object(10)
memory usage: 157.9+ KB
From the above output, we can see that the dataset has 11 columns (10 feature columns plus the target 'Price in INR'). Some features (like ROM/Storage and Front Camera) have missing values.
Print the first few rows of the dataset
pd.set_option('display.max_columns', None)
print(df.head())
Phone Name,Rating ?/5,Number of Ratings,RAM,ROM/Storage,Back/Rare Camera,Front Camera,Battery,Processor,Price in INR,Date of Scraping
"POCO C50 (Royal Blue, 32 GB)",4.2,"33,561",2 GB RAM,32 GB ROM,8MP Dual Camera,5MP Front Camera,5000 mAh,"Mediatek Helio A22 Processor, Upto 2.0 GHz Processor","₹5,649",2023-06-17
"POCO M4 5G (Cool Blue, 64 GB)",4.2,"77,128",4 GB RAM,64 GB ROM,50MP + 2MP,8MP Front Camera,5000 mAh,Mediatek Dimensity 700 Processor,"₹11,999",2023-06-17
"POCO C51 (Royal Blue, 64 GB)",4.3,"15,175",4 GB RAM,64 GB ROM,8MP Dual Rear Camera,5MP Front Camera,5000 mAh,Helio G36 Processor,"₹6,999",2023-06-17
"POCO C55 (Cool Blue, 64 GB)",4.2,"22,621",4 GB RAM,64 GB ROM,50MP Dual Rear Camera,5MP Front Camera,5000 mAh,Mediatek Helio G85 Processor,"₹7,749",2023-06-17
"POCO C51 (Power Black, 64 GB)",4.3,"15,175",4 GB RAM,64 GB ROM,8MP Dual Rear Camera,5MP Front Camera,5000 mAh,Helio G36 Processor,"₹6,999",2023-06-17
Data cleaning
Handle missing values
Let’s see how many missing values are there in each column.
print(df.isnull().sum())
The above snippet prints the following information.
Phone Name             0
Rating ?/5             0
Number of Ratings      0
RAM                    0
ROM/Storage          174
Back/Rare Camera       9
Front Camera         401
Battery               10
Processor             55
Price in INR           0
Date of Scraping       0
dtype: int64
From the above output, I can clearly see that:
- ROM/Storage has 174 missing values,
- Back/Rare Camera has 9 missing values,
- Front Camera has 401 missing values,
- Battery has 10 missing values, and
- Processor has 55 missing values.
To keep things simple, let’s drop all the rows with missing values.
df = df.dropna()
print(df.isnull().sum())
print(df.shape)
The above snippet prints the following information.
Phone Name           0
Rating ?/5           0
Number of Ratings    0
RAM                  0
ROM/Storage          0
Back/Rare Camera     0
Front Camera         0
Battery              0
Processor            0
Price in INR         0
Date of Scraping     0
dtype: int64
(1291, 11)
After dropping all the rows with null values, I had 1291 rows and 11 columns.
Drop duplicates
df = df.drop_duplicates()
print(df.shape)
The above snippet prints the following information.
(1252, 11)
After dropping duplicates, I had 1252 rows and 11 columns.
Extract the brand from the column ‘Phone Name’ and encode it using label encoder
A phone name looks like ‘POCO C50 (Royal Blue, 32 GB)’; I just want to use the brand as a feature to predict the price of the phone.
# Extract brand names and convert the names to numerical labels
df['Brand Label'] = df['Phone Name'].apply(lambda x: x.split()[0])

# Use LabelEncoder to convert categories to numerical labels
label_encoder = LabelEncoder()
df['Brand Label'] = label_encoder.fit_transform(df['Brand Label'])
The LabelEncoder class is a utility provided by scikit-learn to encode categorical labels (class labels) into numerical values.
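As a quick illustration (with made-up brand names, not the actual column), LabelEncoder assigns an integer to each distinct category, following the sorted order of the classes:

from sklearn.preprocessing import LabelEncoder

brands = ['POCO', 'SAMSUNG', 'APPLE', 'POCO']   # made-up sample values
encoder = LabelEncoder()
print(encoder.fit_transform(brands))            # [1 2 0 1]
print(encoder.classes_)                         # ['APPLE' 'POCO' 'SAMSUNG']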
Normalize the ‘Number of Ratings’
df['Number of Ratings'] = df['Number of Ratings'].str.replace(',', '').astype(float)
scaler = MinMaxScaler()
df['Number of Ratings'] = scaler.fit_transform(df[['Number of Ratings']])
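MinMaxScaler rescales each value into the [0, 1] range using (x - min) / (max - min). A minimal sketch with made-up rating counts:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

counts = np.array([[100.0], [550.0], [1000.0]])   # made-up values
print(MinMaxScaler().fit_transform(counts))       # [[0.], [0.5], [1.]]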
Convert the column 'Price in INR' to float
Right now the price is represented as a string like "₹5,649"; we need to convert it to a numeric value. The following snippet does this.
df['Price in INR'] = df['Price in INR'] \
    .str.replace(',', '') \
    .str.replace('₹', '') \
    .astype(float)
Convert the columns 'RAM’ and 'ROM/Storage' to float type
df['RAM'] = df['RAM'].str.extract(r'(\d+)').astype(float)
df['ROM/Storage'] = df['ROM/Storage'].str.extract(r'(\d+)').astype(float)
Convert the data of column 'Date of Scraping' to timestamp
df['Date of Scraping'] = df['Date of Scraping'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timestamp())
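Here datetime.strptime parses the date string and .timestamp() converts it to seconds since the Unix epoch; note that for a naive datetime the exact value depends on the local timezone:

from datetime import datetime

ts = datetime.strptime('2023-06-17', '%Y-%m-%d').timestamp()
print(ts)   # seconds since 1970-01-01 (value depends on the local timezone)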
Extract the battery capacity from battery column
df['Battery Capacity'] = df['Battery'].str.extract(r'(\d+) mAh')
df['Battery Capacity'] = pd.to_numeric(df['Battery Capacity'], errors='coerce')
Convert front and back camera details to numbers
# Extract back camera details
df['Back Camera MP'] = df['Back/Rare Camera'].str.extract(r'(\d+)MP')
df['Back Camera MP'] = pd.to_numeric(df['Back Camera MP'], errors='coerce')

# Extract front camera detail
df['Front Camera MP'] = df['Front Camera'].str.extract(r'(\d+)MP')
df['Front Camera MP'] = pd.to_numeric(df['Front Camera MP'], errors='coerce')
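The regular expression r'(\d+)MP' captures only the first number followed by 'MP', so a value like '50MP + 2MP' is reduced to 50. A small sketch with sample values taken from the rows shown earlier:

import pandas as pd

cameras = pd.Series(['50MP + 2MP', '8MP Dual Camera'])
print(cameras.str.extract(r'(\d+)MP'))   # first match per row: '50' and '8' (as strings)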
Encode "Brand Label", "Processor" using label encoding
columns_to_encode = ["Brand Label", "Processor"]
label_encoder = LabelEncoder()
for i in columns_to_encode:
    df[i] = label_encoder.fit_transform(df[i])
As we now have all the data in numeric format, it is time to drop the remaining textual columns.
df = df.drop(['Phone Name', 'Battery', 'Back/Rare Camera', 'Front Camera'], axis=1)
Divide the data into training and test samples
Before splitting, separate the features (X) from the target (y); then use the ‘train_test_split’ method to split the data into training and test sets.

X = df.drop(['Price in INR'], axis=1)
y = df[['Price in INR']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

The call to train_test_split divides the dataset into two subsets:
- a training set (X_train, y_train) and
- a test set (X_test, y_test).
X_train: This represents the feature data that will be used for training a machine learning model.
y_train: This represents the target data corresponding to the X_train data set.
X_test: This represents another set of feature data, separate from the training data. This set is used for evaluating the model's performance after it has been trained.
y_test: Similar to y_train, this represents the target data corresponding to the X_test data. It is used for evaluating the model's predictions.
test_size: It specifies the proportion of the dataset that should be allocated to the test set. In our example, I set it to 0.3, which means 30% of the data will be used for testing and the remaining 70% for training.
random_state: Explicitly setting the random seed ensures reproducibility. By setting it to a fixed value (e.g., 42), you ensure that the data splitting process will be the same every time you run your code. Setting the random seed is helpful for debugging and comparing different model runs.
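As a quick sanity check, you can print the shapes of the resulting splits; with test_size=0.3 the test set should hold roughly 30% of the rows:

print(X_train.shape, X_test.shape)   # roughly 70% / 30% of the rows
print(y_train.shape, y_test.shape)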
Initialize linear regression model
model = LinearRegression()
The above statement creates an instance of the LinearRegression model. Once the model object is created, you can use it to perform various tasks, such as:
- training the model on your data,
- making predictions, and
- evaluating its performance.
Train the model on the training data
model.fit(X_train, y_train)
The above statement trains the model on the given dataset. The ‘fit’ method takes two arguments, X_train and y_train; the model uses the training data X_train and the corresponding target values y_train to learn the underlying patterns or relationships in the data.
For multiple linear regression, this means finding the coefficients and the intercept of the linear equation that best fits the data. In our example, the model will learn the following values (they can be inspected as shown in the snippet after this list):
- β0 is the intercept
- β1,β2,…,βp are the coefficients for each feature
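These learned values are available as attributes of the fitted model object; the complete script at the end prints them the same way:

print("Intercept:", model.intercept_)     # β0
print("Coefficients:", model.coef_)       # β1, β2, …, βp (one per feature column)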
Make predictions using a trained machine learning model.
Now that we have a trained model, it is time to make predictions.
y_pred = model.predict(X_test)
The y_pred variable stores the predicted values generated by the machine learning model.
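Comparing y_pred against the held-out y_test gives a measure of the model's quality; the complete script below uses the mean squared error for this:

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)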
Find the complete working application below.
mobile_prices_2023.py
import os
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np

csv_file = 'mobile_prices_2023.csv'
input_data_set = pd.read_csv(csv_file)
df = input_data_set


def clean_and_transform_data(df):
    # Drop missing values
    df = df.dropna()

    # Drop duplicates
    df = df.drop_duplicates()

    # Use LabelEncoder to convert categories to numerical labels
    label_encoder = LabelEncoder()

    # Extract brand names and convert the names to numerical labels
    df['Brand Label'] = df['Phone Name'].apply(lambda x: x.split()[0])
    df['Brand Label'] = label_encoder.fit_transform(df['Brand Label'])

    # Normalize ratings
    df['Number of Ratings'] = df['Number of Ratings'].str.replace(',', '').astype(float)
    scaler = MinMaxScaler()
    df['Number of Ratings'] = scaler.fit_transform(df[['Number of Ratings']])

    # Convert the column 'Price in INR' to float
    df['Price in INR'] = df['Price in INR'] \
        .str.replace(',', '') \
        .str.replace('₹', '') \
        .astype(float)

    df['RAM'] = df['RAM'].str.extract(r'(\d+)').astype(float)
    df['ROM/Storage'] = df['ROM/Storage'].str.extract(r'(\d+)').astype(float)

    # Convert 'Date of Scraping' to Unix timestamps
    df['Date of Scraping'] = df['Date of Scraping'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').timestamp())

    # Extract battery capacity
    df['Battery Capacity'] = df['Battery'].str.extract(r'(\d+) mAh')
    # Convert 'Battery Capacity' to numerical
    df['Battery Capacity'] = pd.to_numeric(df['Battery Capacity'], errors='coerce')

    # Extract back camera details
    df['Back Camera MP'] = df['Back/Rare Camera'].str.extract(r'(\d+)MP')
    df['Back Camera MP'] = pd.to_numeric(df['Back Camera MP'], errors='coerce')

    # Extract front camera detail
    df['Front Camera MP'] = df['Front Camera'].str.extract(r'(\d+)MP')
    df['Front Camera MP'] = pd.to_numeric(df['Front Camera MP'], errors='coerce')

    df = df.dropna()

    columns_to_encode = ["Brand Label", "Processor"]
    label_encoder = LabelEncoder()
    for i in columns_to_encode:
        df[i] = label_encoder.fit_transform(df[i])

    df = df.drop(['Phone Name', 'Battery', 'Back/Rare Camera', 'Front Camera'], axis=1)

    return df


df = clean_and_transform_data(df)

# Split the data into features (X) and target (y)
X = df.drop(['Price in INR'], axis=1)
y = df[['Price in INR']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Print the coefficients and MSE
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)

# Test
test_samples = input_data_set.loc[[20, 35, 320, 345]]
new_rows_to_test = test_samples
new_rows_to_test = clean_and_transform_data(new_rows_to_test)
new_rows_to_test = new_rows_to_test.drop(['Price in INR'], axis=1)
predictions = model.predict(new_rows_to_test)

count = 0
print('\nActual Price \t predicted price')
for index, row in test_samples.iterrows():
    print(row['Price in INR'], '\t', predictions[count])
    count = count + 1
Output
Coefficients: [[-1.47717811e+03 -6.10695250e+03 -1.26356266e+01  1.34486639e+02
   1.37388450e+00  2.49258392e-11 -9.34928574e+01 -4.79621280e+00
   7.71726101e+01  7.46384232e+01]]
Intercept: [31556.82111766]
Mean Squared Error: 91932502.53967698

Actual Price     predicted price
₹6,499           [6775.60749958]
₹17,499          [18537.81036251]
₹22,999          [24688.73868959]
₹18,999          [18175.84459827]
References
https://www.kaggle.com/datasets/howisusmanali/mobile-prices-2023