Logistic regression is a classification model used to predict a categorical outcome; it comes under supervised learning.
Logistic regression models the probability of a binary outcome (0 or 1, true or false, yes or no, etc.). When there are more than two possible discrete outcomes, you can use multinomial logistic regression.
Examples
- Predict whether a user will buy a given product or not.
- Predict whether a given email is spam or not.
- Predict whether a patient has a disease or not.
- Predict whether an applicant is likely to default on their loan.
- Predict whether a customer will cancel their subscription or not.
- Predict whether a given customer review is positive or negative.
- Predict whether an employee will leave or stay with the organization.
- Predict whether a given credit card transaction is fraudulent or not.
- Predict whether a stock price will go up or not.
- Predict whether a given item has a defect or not.
The probability of the event occurring is calculated by a sigmoid function.
Sigmoid function
A sigmoid function maps the output of a model into the range 0 to 1; in other words, it transforms the model's output into a probability.
The most frequently used sigmoid function is the logistic function, which is defined as
sigmoid(x) = 1 / (1 + e^(-x))
where x is the input to the logistic function.
Other commonly used activation functions are given below.
Hyperbolic tangent function
Softplus function
Rectified linear unit (ReLU) function
Exponential linear unit (ELU) function
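For reference, here is a small NumPy sketch of their standard definitions (the helper names and the ELU alpha of 1.0 are my own assumptions, given purely for illustration):

import numpy as np

def tanh(x):
    # Hyperbolic tangent: (e^x - e^-x) / (e^x + e^-x), output range (-1, 1)
    return np.tanh(x)

def softplus(x):
    # Softplus: ln(1 + e^x), a smooth approximation of ReLU
    return np.log1p(np.exp(x))

def relu(x):
    # Rectified linear unit: max(0, x)
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Exponential linear unit: x when x > 0, otherwise alpha * (e^x - 1)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))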
How does logistic regression work internally?
It is a four-step process.
- Map the training data to a linear equation.
- Apply a sigmoid function to map the outcome of the linear equation to a value between 0 and 1.
- Define a decision boundary.
- Make predictions.
Map the training data to a linear equation
A simple linear equation models the relationship between a dependent variable and an independent variable.
Formula
y = mx + c
In the above formula
- y is the dependent variable that we are trying to predict
- x is the independent variable or the input feature,
- m is the slope of the line, which represents the change in y for a unit change in x. A positive slope means that the line goes up as we move to the right, while a negative slope means that the line goes down as we move to the right. A slope of 0 means that the line is horizontal. The coefficient of x is m here.
- c is the intercept of the line, which represents the value of y when x is 0.
For example, the equation y = 5x + 10 has a slope of 5 and a y-intercept of 10. This means that the line goes up 5 for every 1 unit we move to the right, and it crosses the y-axis at the point (0, 10).
If we have more than one independent variable (x1, x2, …, xp) and one response variable (y), the linear equation can be written as
y = β0 + β1x1 + β2x2 + … + βpxp
where β1, β2, …, βp are the coefficients for each feature and β0 is the intercept.
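To make this concrete, here is a tiny NumPy sketch of the multi-feature linear equation; the coefficient and feature values are made-up numbers, purely for illustration:

import numpy as np

beta_0 = -1.5                       # intercept (made-up value)
betas = np.array([0.8, -0.4, 2.1])  # one coefficient per feature (made-up values)
x = np.array([1.2, 3.0, 0.5])       # one observation with three features

# y = beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x3
y = beta_0 + np.dot(betas, x)
print(y)  # approximately -0.69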
Apply the sigmoid function to map the outcome of the linear equation between 0 and 1
The predicted response from the linear equation is mapped between 0 and 1 using the sigmoid function, so the predicted probability becomes p = 1 / (1 + e^-(β0 + β1x1 + … + βpxp)).
The sigmoid function is visualized as an S-shaped curve, which you can confirm from the application below.
sigmoid.py
import numpy as np
import matplotlib.pyplot as plt

# Define the sigmoid function
def sigmoid(y):
    return 1 / (1 + np.exp(-y))

# Create a range of values for 'y'
y = np.linspace(-6, 6, 300)

# Calculate sigmoid values for the range of 'y'
sigmoid_values = sigmoid(y)

# Create a plot
plt.figure(figsize=(8, 5))
plt.plot(y, sigmoid_values, label='Sigmoid Function', color='green')
plt.xlabel('y')
plt.ylabel('sigmoid(y)')
plt.title('Sigmoid Function')
plt.axhline(y=1, color='black', linestyle='--', linewidth=0.9)
plt.axhline(y=0, color='black', linestyle='--', linewidth=0.9)
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.9)
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend()
plt.show()
Output
Define decision boundary
The sigmoid function maps the projected value from the linear equation to a value between 0 and 1. Now we need to define a boundary or threshold value to map the sigmoid output to a discrete value like yes/no, pass/fail, buy/not_buy, etc.
For example, suppose I select the threshold value 0.4; mathematically, it can be expressed as follows:
p ≥ 0.4 => class = 1
p < 0.4 => class = 0
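As a small illustration (the probability values and the 0.4 threshold below are made-up examples), this mapping can be done with np.where:

import numpy as np

probabilities = np.array([0.12, 0.45, 0.39, 0.87])  # example sigmoid outputs
threshold = 0.4
predicted_classes = np.where(probabilities >= threshold, 1, 0)
print(predicted_classes)  # [0 1 0 1]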
Make predictions
With the linear equation, sigmoid function, and decision boundary in place, we can use them together to predict the outcome category.
I will be using the weather dataset (https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) and logistic regression to predict whether it is going to rain tomorrow or not.
To get a quick glance at logistic regression in scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(solver='liblinear', random_state=47)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Model accuracy score : ', accuracy_score(y_test, y_pred))
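Besides predict, scikit-learn's LogisticRegression also provides predict_proba, which returns the class probabilities produced by the sigmoid before any threshold is applied. A quick peek, assuming the same logreg and X_test as above:

# Each row holds [P(class 0), P(class 1)] for one test sample
probs = logreg.predict_proba(X_test)
print(probs[:5])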
Let’s analyze the data
Get some rows to understand the data.
five_rows = df.head()
Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
2008-12-01,Albury,13.4,22.9,0.6,NA,NA,W,44,W,WNW,20,24,71,22,1007.7,1007.1,8,NA,16.9,21.8,No,No
2008-12-02,Albury,7.4,25.1,0,NA,NA,WNW,44,NNW,WSW,4,22,44,25,1010.6,1007.8,NA,NA,17.2,24.3,No,No
2008-12-03,Albury,12.9,25.7,0,NA,NA,WSW,46,W,WSW,19,26,38,30,1007.6,1008.7,NA,2,21,23.2,No,No
2008-12-04,Albury,9.2,28,0,NA,NA,NE,24,SE,E,11,9,45,16,1017.6,1012.8,NA,NA,18.1,26.5,No,No
2008-12-05,Albury,17.5,32.3,1,NA,NA,W,41,ENE,NW,7,20,82,33,1010.8,1006,7,8,17.8,29.7,No,No
2008-12-06,Albury,14.6,29.7,0.2,NA,NA,WNW,56,W,W,19,24,55,23,1009.2,1005.4,NA,NA,20.6,28.9,No,No
Get total rows and columns
shape = df.shape
total_rows = shape[0]
total_columns = shape[1]
Get column names
columns = df.columns
columns : Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday', 'RainTomorrow'], dtype='object')
Get detailed summary of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           145460 non-null  object
 1   Location       145460 non-null  object
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object
 10  WindDir3pm     141232 non-null  object
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object
 22  RainTomorrow   142193 non-null  object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
Get column wise missing values
column_wise_missing_values = df.isnull().sum()
Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
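It can also help to look at the missing values as a percentage of the total row count; this is an optional extra step, not part of the original analysis:

# Share of missing values per column, largest first
missing_percentage = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_percentage.sort_values(ascending=False).head())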
Check duplicate rows based on all columns
duplicate_count = df.duplicated().sum()
print(f'duplicate_count : {duplicate_count}')
If there are any duplicates, we can use the method ‘drop_duplicates’ to drop the duplicates.
df = df.drop_duplicates()
Print all the unique values, missing values count, column datatype
unique_counts = df.nunique()
missing_values = df.isnull().sum()
result_df = pd.DataFrame({'Data Type': df.dtypes, 'Unique Count': unique_counts, 'missing_values' : missing_values})
print(result_df)
You can use the below snippet to get the basic information about the dataframe that we discussed above.
def basic_analysis(df):
    five_rows = df.head()
    print(f'five_rows : {five_rows}')

    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]
    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'duplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes, 'Unique Count': unique_counts, 'missing_values' : missing_values})
    print(result_df)
Above snippet prints below output.
five_rows :          Date Location  MinTemp  ...  Temp3pm  RainToday  RainTomorrow
0  2008-12-01   Albury     13.4  ...     21.8         No            No
1  2008-12-02   Albury      7.4  ...     24.3         No            No
2  2008-12-03   Albury     12.9  ...     23.2         No            No
3  2008-12-04   Albury      9.2  ...     26.5         No            No
4  2008-12-05   Albury     17.5  ...     29.7         No            No

[5 rows x 23 columns]

total_rows : 145460
total_columns : 23

columns : Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

Detailed information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           145460 non-null  object
 1   Location       145460 non-null  object
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object
 10  WindDir3pm     141232 non-null  object
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object
 22  RainTomorrow   142193 non-null  object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB

Column wise missing values
column_wise_missing_values : Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

duplicate_count : 0

Unique counts, missing values count, data types column wise
              Data Type  Unique Count  missing_values
Date             object          3436               0
Location         object            49               0
MinTemp         float64           389            1485
MaxTemp         float64           505            1261
Rainfall        float64           681            3261
Evaporation     float64           358           62790
Sunshine        float64           145           69835
WindGustDir      object            16           10326
WindGustSpeed   float64            67           10263
WindDir9am       object            16           10566
WindDir3pm       object            16            4228
WindSpeed9am    float64            43            1767
WindSpeed3pm    float64            44            3062
Humidity9am     float64           101            2654
Humidity3pm     float64           101            4507
Pressure9am     float64           546           15065
Pressure3pm     float64           549           15028
Cloud9am        float64            10           55888
Cloud3pm        float64            10           59358
Temp9am         float64           441            1767
Temp3pm         float64           502            3609
RainToday        object             2            3261
RainTomorrow     object             2            3267
From the output, I identified the below insights.
- There are 145460 rows and 23 columns.
- There are 16 columns of type float64 and 7 columns of type object.
- There are no duplicates in the input dataset.
- From the 'Unique counts column wise' section, I identified the below columns as categorical columns based on their type and unique value count. As the unique value count for the RainToday and RainTomorrow columns is 2, these two are considered binary categorical variables.
1. Location (0 missing values)
2. WindGustDir (10326 missing values)
3. WindDir9am (10566 missing values)
4. WindDir3pm (4228 missing values)
5. RainToday (3261 missing values)
6. RainTomorrow (3267 missing values)
- There is a date variable denoted by Date column
- We need to predict the target ‘RainTomorrow’.
Let’s find the frequency count of categorical columns to get some more insights
categorical_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
for col in categorical_columns:
    print(f'\nFrequency count of values for the column {col}\n')
    print(df[col].value_counts())
Above snippet prints below information.
Frequency count of values for the column Location

Location
Canberra            3436
Sydney              3344
Darwin              3193
Melbourne           3193
Brisbane            3193
Adelaide            3193
Perth               3193
Hobart              3193
Albany              3040
MountGambier        3040
Ballarat            3040
Townsville          3040
GoldCoast           3040
Cairns              3040
Launceston          3040
AliceSprings        3040
Bendigo             3040
Albury              3040
MountGinini         3040
Wollongong          3040
Newcastle           3039
Tuggeranong         3039
Penrith             3039
Woomera             3009
Nuriootpa           3009
Cobar               3009
CoffsHarbour        3009
Moree               3009
Sale                3009
PerthAirport        3009
PearceRAAF          3009
Witchcliffe         3009
BadgerysCreek       3009
Mildura             3009
NorfolkIsland       3009
MelbourneAirport    3009
Richmond            3009
SydneyAirport       3009
WaggaWagga          3009
Williamtown         3009
Dartmoor            3009
Watsonia            3009
Portland            3009
Walpole             3006
NorahHead           3004
SalmonGums          3001
Katherine           1578
Nhil                1578
Uluru               1578
Name: count, dtype: int64

Frequency count of values for the column WindGustDir

WindGustDir
W      9915
SE     9418
N      9313
SSE    9216
E      9181
S      9168
WSW    9069
SW     8967
SSW    8736
WNW    8252
NW     8122
ENE    8104
ESE    7372
NE     7133
NNW    6620
NNE    6548
Name: count, dtype: int64

Frequency count of values for the column WindDir9am

WindDir9am
N      11758
SE      9287
E       9176
SSE     9112
NW      8749
S       8659
W       8459
SW      8423
NNE     8129
NNW     7980
ENE     7836
NE      7671
ESE     7630
SSW     7587
WNW     7414
WSW     7024
Name: count, dtype: int64

Frequency count of values for the column WindDir3pm

WindDir3pm
SE     10838
W      10110
S       9926
WSW     9518
SSE     9399
SW      9354
N       8890
WNW     8874
NW      8610
ESE     8505
E       8472
NE      8263
SSW     8156
NNW     7870
ENE     7857
NNE     6590
Name: count, dtype: int64

Frequency count of values for the column RainToday

RainToday
No     110319
Yes     31880
Name: count, dtype: int64

Frequency count of values for the column RainTomorrow

RainTomorrow
No     110316
Yes     31877
Name: count, dtype: int64
Let’s start pre-processing the data
Let’s take a deep copy of the data and apply all the transformations on it.
df_copy = df.copy(deep=True)
Deep copy impacts the performance when the data set size is huge.
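A quick sanity check of what the deep copy gives us (purely illustrative; the throwaway tmp variable is not used anywhere else):

# Changes made to a deep copy do not affect the original DataFrame
tmp = df.copy(deep=True)
tmp.loc[0, 'Location'] = 'SomewhereElse'
print(df.loc[0, 'Location'])  # still 'Albury'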
Convert the Date column data type to datetime
df_copy['Date'] = pd.to_datetime(df_copy['Date'])
The above statement converts the values in the 'Date' column of the Pandas DataFrame df_copy into datetime objects. After executing this line, the 'Date' column in df_copy will contain datetime objects instead of the original date strings.
Extract year, month and day values from ‘Date’ column and store them in separate features year, month and day
df_copy['Year'] = df_copy['Date'].dt.year
The above statement creates a new column called 'Year' in the DataFrame df_copy. I used the 'dt' accessor to access the datetime properties of the 'Date' column and the .year attribute to extract the year component from each date, then assigned it to the 'Year' column.
df_copy['Month'] = df_copy['Date'].dt.month
The above statement creates a new column called 'Month' in the DataFrame df_copy. I used the 'dt' accessor to access the datetime properties of the 'Date' column and the .month attribute to extract the month component from each date, then assigned it to the 'Month' column.
df_copy['Day'] = df_copy['Date'].dt.day
The above statement creates a new column called 'Day' in the DataFrame df_copy. I used the 'dt' accessor to access the datetime properties of the 'Date' column and the .day attribute to extract the day component from each date, then assigned it to the 'Day' column.
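As a quick standalone illustration of the dt accessor (the dates below are made up):

import pandas as pd

toy = pd.DataFrame({'Date': ['2008-12-01', '2009-01-15']})
toy['Date'] = pd.to_datetime(toy['Date'])
toy['Year'] = toy['Date'].dt.year    # 2008, 2009
toy['Month'] = toy['Date'].dt.month  # 12, 1
toy['Day'] = toy['Date'].dt.day      # 1, 15
print(toy)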
Drop Date column data
As we have already extracted the Year, Month and Day information from the Date column, we can remove it from the data frame to optimize memory.
df_copy.drop('Date', axis=1, inplace=True)
The first argument specifies the column 'Date' that we want to remove.
The second argument, axis=1, specifies that we want to drop a column; axis=0 would refer to rows.
inplace=True specifies that I want to modify the DataFrame df_copy in place. The 'Date' column is removed from df_copy directly, and the modified DataFrame is stored back in df_copy without the need to assign it explicitly to a new variable.
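If you prefer not to modify the DataFrame in place, an equivalent alternative (my own stylistic suggestion, not what the snippet above does) is to assign the result back:

# Equivalent to the in-place drop above
df_copy = df_copy.drop('Date', axis=1)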
Let’s clean up the categorical variables
Let’s identify the null value count in the categorical variables.
# Print null value count and unique values count of categorical columns
# Create an empty DataFrame to store the counts
null_and_unique_value_count = pd.DataFrame(index=categorical_columns)

# Count of null values for each column
null_and_unique_value_count['Null Count'] = df[categorical_columns].isnull().sum()

# Count of unique values for each column
null_and_unique_value_count['Unique Count'] = df[categorical_columns].nunique()

# Display the combined DataFrame
print(null_and_unique_value_count)
Above snippet prints below statistics.
              Null Count  Unique Count
Location               0            49
WindGustDir        10326            16
WindDir9am         10566            16
WindDir3pm          4228            16
RainToday           3261             2
RainTomorrow        3267             2
From the above output, you can confirm that the WindGustDir, WindDir9am, WindDir3pm, RainToday and RainTomorrow columns contain missing values.
Let’s do the following steps on the categorical column data.
- Fill the missing values with the mode.
- Encode the column data with a label encoder.
label_encoder = LabelEncoder()
for col in categorical_columns:
    # Fill missing values with mode
    mode_value = df_copy[col].mode().iloc[0]
    df_copy[col].fillna(mode_value, inplace=True)

    # Encode the data
    df_copy[col] = label_encoder.fit_transform(df_copy[col])
To confirm, print the null values count in the categorical columns.
print(df_copy[categorical_columns].isnull().sum())
Location        0
WindGustDir     0
WindDir9am      0
WindDir3pm      0
RainToday       0
RainTomorrow    0
Print the first five rows of the data to visualize the transformed data.
df_copy[categorical_columns].head()
   Location  WindGustDir  WindDir9am  WindDir3pm  RainToday  RainTomorrow
0         2           13          13          14          0             0
1         2           14           6          15          0             0
2         2           15          13          15          0             0
3         2            4           9           0          0             0
4         2           13           1           7          0             0
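A small variation of the encoding loop above (my own suggestion, meant to replace that loop rather than run after it) keeps one encoder per column, so you can later look up which original category each integer stands for:

from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in categorical_columns:
    # Fill missing values with mode, then encode with a column-specific encoder
    mode_value = df_copy[col].mode().iloc[0]
    df_copy[col] = df_copy[col].fillna(mode_value)
    encoders[col] = LabelEncoder()
    df_copy[col] = encoders[col].fit_transform(df_copy[col])

# classes_ lists the original categories in encoded order,
# e.g. encoders['RainToday'].classes_ -> array(['No', 'Yes'], dtype=object)
print(encoders['RainToday'].classes_)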
Let’s clean up the numerical variables
The following snippet gets all the numeric columns.
numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
print('The numerical variables are :', numerical_columns)
print('Total numeric columns are : ', len(numerical_columns))
The above snippet prints the below output.
The numerical variables are : ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']
Total numeric columns are :  16
Let’s get count of missing values in the numeric columns.
null_values_count = df[numerical_columns].isnull().sum()
print(f'null_values_count : \n{null_values_count}')
Above snippet prints below statistics.
null_values_count :
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustSpeed    10263
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
From the above output, you can confirm that all the numeric columns have missing values.
Let’s get basic statistics on the numeric columns.
First, adjust the display options to show all rows and columns.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
Execute below line to get the statistics.
print(round(df[numerical_columns].describe()))
Above statement print below statistics.
        MinTemp   MaxTemp  Rainfall  Evaporation  Sunshine  WindGustSpeed
count  143975.0  144199.0  142199.0      82670.0   75625.0       135197.0
mean       12.0      23.0       2.0          5.0       8.0           40.0
std         6.0       7.0       8.0          4.0       4.0           14.0
min        -8.0      -5.0       0.0          0.0       0.0            6.0
25%         8.0      18.0       0.0          3.0       5.0           31.0
50%        12.0      23.0       0.0          5.0       8.0           39.0
75%        17.0      28.0       1.0          7.0      11.0           48.0
max        34.0      48.0     371.0        145.0      14.0          135.0

       WindSpeed9am  WindSpeed3pm  Humidity9am  Humidity3pm  Pressure9am
count      143693.0      142398.0     142806.0     140953.0     130395.0
mean           14.0          19.0         69.0         52.0       1018.0
std             9.0           9.0         19.0         21.0          7.0
min             0.0           0.0          0.0          0.0        980.0
25%             7.0          13.0         57.0         37.0       1013.0
50%            13.0          19.0         70.0         52.0       1018.0
75%            19.0          24.0         83.0         66.0       1022.0
max           130.0          87.0        100.0        100.0       1041.0

       Pressure3pm  Cloud9am  Cloud3pm   Temp9am   Temp3pm
count     130432.0   89572.0   86102.0  143693.0  141851.0
mean        1015.0       4.0       5.0      17.0      22.0
std            7.0       3.0       3.0       6.0       7.0
min          977.0       0.0       0.0      -7.0      -5.0
25%         1010.0       1.0       2.0      12.0      17.0
50%         1015.0       5.0       5.0      17.0      21.0
75%         1020.0       7.0       7.0      22.0      26.0
max         1040.0       9.0       9.0      40.0      47.0
By closely inspecting the data, we can see that the following columns have outliers.
- Rainfall (the 75th percentile is 1.0, but the max value is 371.0)
- Evaporation (the 75th percentile is 7.0, but the max value is 145.0)
- WindSpeed9am (the 75th percentile is 19.0, but the max value is 130.0)
- WindSpeed3pm (the 75th percentile is 24.0, but the max value is 87.0)
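If you want a numeric check in addition to eyeballing the describe() output, one common heuristic (an aside, assuming the usual 1.5 * IQR rule) is to count values outside the interquartile-range fences:

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for a column such as 'Rainfall'
q1 = df['Rainfall'].quantile(0.25)
q3 = df['Rainfall'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_count = ((df['Rainfall'] < lower) | (df['Rainfall'] > upper)).sum()
print(f'Rainfall outliers by the IQR rule: {outlier_count}')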
Let’s draw box plots for these columns and confirm our observation.
box_plot_for_outliers.py
import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
df = pd.read_csv('weatherAUS.csv')

plt.figure(figsize=(15,10))

plt.subplot(2, 2, 1)
fig = df.boxplot(column='Rainfall')
fig.set_title('')
fig.set_ylabel('Rainfall')

plt.subplot(2, 2, 2)
fig = df.boxplot(column='Evaporation')
fig.set_title('')
fig.set_ylabel('Evaporation')

plt.subplot(2, 2, 3)
fig = df.boxplot(column='WindSpeed9am')
fig.set_title('')
fig.set_ylabel('WindSpeed9am')

plt.subplot(2, 2, 4)
fig = df.boxplot(column='WindSpeed3pm')
fig.set_title('')
fig.set_ylabel('WindSpeed3pm')

plt.show()
Output
From the above diagram, we can reconfirm that there are outliers.
The following snippet creates a box plot for each of the numeric columns.
box_plot_for_all_numeric_columns.py
import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
df = pd.read_csv('weatherAUS.csv')

numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
# print(len(numerical_columns))

plt.figure(figsize=(14, 14))
rows = 4
cols = 4

for i in range(len(numerical_columns)):
    col_name = numerical_columns[i]
    plt.subplot(rows, cols, i+1)
    fig = df.boxplot(column=col_name)
    fig.set_ylabel(col_name)

plt.show()
Output
Let’s use the Winsorization approach to replace extreme values with less extreme ones.
outlier_columns = ['Rainfall', 'Evaporation', 'WindSpeed9am', 'WindSpeed9am']
for col_name in outlier_columns:
    lower_limit = df_copy[col_name].quantile(0.1)
    upper_limit = df_copy[col_name].quantile(0.9)
    df_copy[col_name] = np.where(df_copy[col_name] < lower_limit, lower_limit, df_copy[col_name])
    df_copy[col_name] = np.where(df_copy[col_name] > upper_limit, upper_limit, df_copy[col_name])
After above snippet, statistics looks like below.
       Rainfall  Evaporation  WindSpeed9am  WindSpeed9am
count  142199.0      82670.0      143693.0      143693.0
mean        1.0          5.0          14.0          14.0
std         2.0          3.0           7.0           7.0
min         0.0          1.0           4.0           4.0
25%         0.0          3.0           7.0           7.0
50%         0.0          5.0          13.0          13.0
75%         1.0          7.0          19.0          19.0
max         6.0         10.0          26.0          26.0
Let’s replace all the null values of numeric columns with median.
for col_name in numerical_columns:
    col_median = df_copy[col_name].median()
    df_copy[col_name].fillna(col_median, inplace=True)
You can check for remaining null values using the below statement.
print(df_copy[numerical_columns].isnull().sum())
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
Preprocessing step is done now…😊
Divide the dataset into feature set (X) and target variable (y)
target_column_name = 'RainTomorrow'
X = df_copy.drop([target_column_name], axis=1)
y = df_copy[target_column_name]

Split the data into training and test samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)

Get a LogisticRegression object.
logreg = LogisticRegression(solver='liblinear', random_state=47)

Train the model.
logreg.fit(X_train, y_train)

Make predictions.
y_pred = logreg.predict(X_test)

Print the accuracy score.
print('Model accuracy score : ', accuracy_score(y_test, y_pred))
Find the below working application.
weather_prediction.py
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def basic_analysis(df, categorical_columns):
    five_rows = df.head()
    print(f'five_rows : {five_rows}')

    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]
    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'duplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes, 'Unique Count': unique_counts, 'missing_values' : missing_values})
    print(result_df)

    for col in categorical_columns:
        print(f'\nFrequency count of values for the column {col}\n')
        print(df[col].value_counts())

    # Print null value count and unique values count of categorical columns
    # Create an empty DataFrame to store the counts
    null_and_unique_value_count = pd.DataFrame(index=categorical_columns)

    # Count of null values for each column
    null_and_unique_value_count['Null Count'] = df[categorical_columns].isnull().sum()

    # Count of unique values for each column
    null_and_unique_value_count['Unique Count'] = df[categorical_columns].nunique()

    # Display the combined DataFrame
    print(null_and_unique_value_count)

def preprocess(df, categorical_columns):
    df_copy = df.copy(deep=True)

    # Convert the type of Date from object to datetime
    df_copy['Date'] = pd.to_datetime(df_copy['Date'])
    df_copy['Year'] = df_copy['Date'].dt.year
    df_copy['Month'] = df_copy['Date'].dt.month
    df_copy['Day'] = df_copy['Date'].dt.day
    df_copy.drop('Date', axis=1, inplace=True)

    # Encode the categorical columns
    label_encoder = LabelEncoder()
    for col in categorical_columns:
        # Fill missing values with mode
        mode_value = df_copy[col].mode().iloc[0]
        df_copy[col].fillna(mode_value, inplace=True)

        # Encode the data
        df_copy[col] = label_encoder.fit_transform(df_copy[col])

    #print('\nNull values count in categorical columns')
    #print(df_copy[categorical_columns].isnull().sum())

    #print('\nfirst five rows of categorical columns')
    #print(df_copy[categorical_columns].head())

    # Work with numerical columns
    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    #print('The numerical variables are :', numerical_columns)
    #print('Total numeric columns are : ', len(numerical_columns))

    #null_values_count = df[numerical_columns].isnull().sum()
    #print(f'null_values_count : \n{null_values_count}')

    #print('Basic statistics on numerical columns\n')
    #print(round(df[numerical_columns].describe()))

    # Replace all the outliers using Winsorization approach
    # Winsorization involves replacing extreme values with less extreme
    outlier_columns = ['Rainfall', 'Evaporation', 'WindSpeed9am', 'WindSpeed9am']
    for col_name in outlier_columns:
        lower_limit = df_copy[col_name].quantile(0.1)
        upper_limit = df_copy[col_name].quantile(0.9)
        df_copy[col_name] = np.where(df_copy[col_name] < lower_limit, lower_limit, df_copy[col_name])
        df_copy[col_name] = np.where(df_copy[col_name] > upper_limit, upper_limit, df_copy[col_name])

    #print('Basic statistics on numerical columns\n')
    #print(round(df_copy[outlier_columns].describe()))

    # Let's replace all the missing values with median
    for col_name in numerical_columns:
        median = df_copy[col_name].median()
        df_copy[col_name].fillna(median, inplace=True)

    #print(df_copy)
    #print(df_copy[numerical_columns].isnull().sum())

    return df_copy

def train_and_test_model(df):
    # Divide the data into features and target variable
    target_column_name = 'RainTomorrow'
    X = df.drop([target_column_name], axis=1)

    #scaler = MinMaxScaler()
    #X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

    y = df[target_column_name]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)
    #print(X_train.describe())

    # instantiate the model
    logreg = LogisticRegression(solver='liblinear', random_state=47)

    # fit the model
    logreg.fit(X_train, y_train)

    y_pred = logreg.predict(X_test)
    print('Model accuracy score : ', accuracy_score(y_test, y_pred))

df = pd.read_csv('weatherAUS.csv')
categorical_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
# basic_analysis(df, categorical_columns)

preprocessed_data = preprocess(df, categorical_columns)
# preprocessed_data.info()

train_and_test_model(preprocessed_data)
Output
Model accuracy score : 0.8351436821119208
Note
- When there are outliers in the dataset, use median to replace missing values.
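A tiny illustration of why, with made-up numbers: the mean gets dragged toward the outlier while the median does not.

import numpy as np

values = np.array([1.0, 2.0, 2.5, 3.0, 250.0])  # made-up data with one outlier
print(np.mean(values))    # 51.7 -> pulled up by the outlier
print(np.median(values))  # 2.5  -> robust to the outlier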
You can download the dataset from below link.
https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package