Saturday, 22 March 2025

Logistic Regression in Machine Learning

Logistic regression is a classification model used to predict a categorical outcome. It comes under supervised learning.

 

Logistic regression is used to model the probability of a binary outcome (0 or 1, true or false, yes or no, etc.). You can go with multinomial logistic regression when you have more than two possible discrete outcomes.

 

Examples

  1. Predict whether a user will buy a given product or not.
  2. Predict whether a given email is spam or not.
  3. Predict whether a patient has a disease or not.
  4. Predict whether an applicant is likely to default on their loan.
  5. Predict whether a customer will cancel their subscription or not.
  6. Predict whether a given customer review is positive or negative.
  7. Predict whether an employee will leave or stay with the organization.
  8. Predict whether a given credit card transaction is fraudulent or not.
  9. Predict whether a stock price will go up or not.
  10. Predict whether a given item has a defect or not.

 

The probability of the event occurring is calculated by a sigmoid function.

 

Sigmoid function

A sigmoid function maps the output of a model into the range 0 to 1; in other words, it transforms the model’s output into a probability.

 

The most frequently used sigmoid function is the logistic function, which is defined as below.

σ(x) = 1 / (1 + e^(-x))

where x is the input to the logistic function.

 

Other sigmoid and sigmoid-like activation functions are given below.

 

Hyperbolic tangent function

tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))

Softplus function

softplus(x) = ln(1 + e^x)

Rectified linear unit (ReLU) function

ReLU(x) = max(0, x)

Exponential linear unit (ELU) function

ELU(x) = x if x > 0, α(e^x − 1) if x ≤ 0 (where α is a positive hyperparameter)
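As a quick reference, below is a minimal NumPy sketch of these functions; the α value for ELU is a hyperparameter (1.0 is a common default).

import numpy as np

def tanh(x):
    # Hyperbolic tangent: S-shaped, maps inputs to (-1, 1)
    return np.tanh(x)

def softplus(x):
    # Smooth approximation of ReLU: ln(1 + e^x)
    return np.log1p(np.exp(x))

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise
    return np.maximum(0, x)

def elu(x, alpha=1.0):
    # Exponential linear unit: x for x > 0, alpha * (e^x - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(tanh(0.5), softplus(0.5), relu(-0.5), elu(-0.5))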

How does logistic regression work internally?

It is a four-step process.

  1. Map the training data to a linear equation.
  2. Apply a sigmoid function to map the outcome of the linear equation between 0 and 1.
  3. Define a decision boundary.
  4. Make predictions.

 

Map the training data to a linear equation

A simple linear equation models the relationship between a dependent variable and an independent variable.

 

Formula

y = mx + c

 

In the above formula

 

  1. y is the dependent variable that we are trying to predict.
  2. x is the independent variable, or the input feature.
  3. m is the slope of the line (the coefficient of x), which represents the change in y for a unit change in x. A positive slope means that the line goes up as we move to the right, while a negative slope means that the line goes down as we move to the right. A slope of 0 means that the line is horizontal.
  4. c is the intercept of the line, which represents the value of y when x is 0.

 

For example, the equation y = 5x + 10 has a slope of 5 and a y-intercept of 10. This means that the line goes up 5 for every 1 unit we move to the right, and it crosses the y-axis at the point (0, 10).

 

If we have more than one independent variable (x1, x2, …, xp) and one response variable (y), then the relationship is given mathematically by the following equation.

 


y = β0 + β1x1 + β2x2 + … + βpxp

Here β0 is the intercept, and β1, β2, …, βp are the coefficients of the independent variables x1, x2, …, xp.

 

Apply a sigmoid function to map the outcome of the linear equation between 0 and 1

The predicted response from the linear equation is mapped between 0 and 1 using a sigmoid function.
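As a small illustration with made-up coefficient values, the linear combination is computed first and then squashed through the sigmoid.

import numpy as np

# Hypothetical intercept and coefficients for two features
beta_0 = -1.0
beta = np.array([0.8, -0.5])

# One sample with two feature values
x = np.array([2.0, 1.0])

# Step 1: linear combination y = beta_0 + beta_1 * x1 + beta_2 * x2
y = beta_0 + np.dot(beta, x)  # 0.1

# Step 2: the sigmoid maps y to a probability between 0 and 1
probability = 1 / (1 + np.exp(-y))
print(probability)  # ~0.525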

 

The sigmoid function is visualized as an S-shaped curve; you can confirm the same with the below application.

 

sigmoid.py

import numpy as np
import matplotlib.pyplot as plt

# Define the sigmoid function
def sigmoid(y):
    return 1 / (1 + np.exp(-y))

# Create a range of values for 'y'
y = np.linspace(-6, 6, 300)

# Calculate sigmoid values for the range of 'y'
sigmoid_values = sigmoid(y)

# Create a plot
plt.figure(figsize=(8, 5))
plt.plot(y, sigmoid_values, label='Sigmoid Function', color='green')
plt.xlabel('y')
plt.ylabel('sigmoid(y)')
plt.title('Sigmoid Function')

plt.axhline(y=1, color='black', linestyle='--', linewidth=0.9)
plt.axhline(y=0, color='black', linestyle='--', linewidth=0.9)
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.9)

plt.grid(True, linestyle='--', alpha=0.6)

plt.legend()
plt.show()

 

Output

 

(The plot shows an S-shaped curve that approaches 0 for large negative inputs and 1 for large positive inputs, crossing 0.5 at y = 0.)
Define decision boundary

The sigmoid function maps the projected value from the linear equation to between 0 and 1. Now we need to define a boundary or threshold value to map the sigmoid output to a discrete class like yes/no, pass/fail, buy/not_buy, etc.

 

For example, suppose I select the threshold value as 0.4; mathematically, it can be expressed as follows:

 

p ≥ 0.4 => class = 1

p < 0.4 => class = 0
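In code, this thresholding is a one-liner. A small sketch with made-up probability values:

import numpy as np

# Hypothetical sigmoid outputs for five samples
probabilities = np.array([0.15, 0.42, 0.39, 0.80, 0.40])

# Apply the 0.4 threshold: p >= 0.4 maps to class 1, p < 0.4 maps to class 0
threshold = 0.4
classes = (probabilities >= threshold).astype(int)
print(classes)  # [0 1 0 1 1]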



Make predictions

As the linear equation, sigmoid function, and decision boundary are in place, we can use them together to predict the outcome category.

 

I will be using the weather dataset (https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) and logistic regression to predict whether it is going to rain tomorrow or not (the RainTomorrow column).

 

To get a quick glance at logistic regression with scikit-learn, the core steps look like below (the full, runnable application appears at the end of this post).

logreg = LogisticRegression(solver='liblinear', random_state=47)

logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('Model accuracy score : ',accuracy_score(y_test, y_pred))

 

Let’s analyze the data

Get some rows to understand the data.

five_rows = df.head()

 

Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
2008-12-01,Albury,13.4,22.9,0.6,NA,NA,W,44,W,WNW,20,24,71,22,1007.7,1007.1,8,NA,16.9,21.8,No,No
2008-12-02,Albury,7.4,25.1,0,NA,NA,WNW,44,NNW,WSW,4,22,44,25,1010.6,1007.8,NA,NA,17.2,24.3,No,No
2008-12-03,Albury,12.9,25.7,0,NA,NA,WSW,46,W,WSW,19,26,38,30,1007.6,1008.7,NA,2,21,23.2,No,No
2008-12-04,Albury,9.2,28,0,NA,NA,NE,24,SE,E,11,9,45,16,1017.6,1012.8,NA,NA,18.1,26.5,No,No
2008-12-05,Albury,17.5,32.3,1,NA,NA,W,41,ENE,NW,7,20,82,33,1010.8,1006,7,8,17.8,29.7,No,No
2008-12-06,Albury,14.6,29.7,0.2,NA,NA,WNW,56,W,W,19,24,55,23,1009.2,1005.4,NA,NA,20.6,28.9,No,No

 

Get total rows and columns

shape = df.shape
total_rows = shape[0]
total_columns = shape[1]

 

Get column names

columns = df.columns

columns : Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

 

Get detailed summary of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object 
 22  RainTomorrow   142193 non-null  object 
dtypes: float64(16), object(7)
memory usage: 25.5+ MB

 

Get column wise missing values

column_wise_missing_values = df.isnull().sum()

Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267

 

Check duplicate rows based on all columns

duplicate_count = df.duplicated().sum()

print(f'duplicate_count : {duplicate_count}')

If there are any duplicates, we can use the ‘drop_duplicates’ method to drop them.

df = df.drop_duplicates()

 

Print the unique value count, missing value count, and column data type

unique_counts = df.nunique()
missing_values = df.isnull().sum()
result_df = pd.DataFrame({'Data Type': df.dtypes, 'Unique Count': unique_counts, 'missing_values' : missing_values})
print(result_df)

You can use the below snippet to get all the basic information about the DataFrame that we discussed above.

 

def basic_analysis(df):

    five_rows = df.head()
    print(f'five_rows : {five_rows}')

    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]

    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'duplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes, 'Unique Count': unique_counts, 'missing_values' : missing_values})
    print(result_df)

Above snippet prints below output.

five_rows :          Date Location  MinTemp  ...  Temp3pm  RainToday  RainTomorrow
0  2008-12-01   Albury     13.4  ...     21.8         No            No
1  2008-12-02   Albury      7.4  ...     24.3         No            No
2  2008-12-03   Albury     12.9  ...     23.2         No            No
3  2008-12-04   Albury      9.2  ...     26.5         No            No
4  2008-12-05   Albury     17.5  ...     29.7         No            No

[5 rows x 23 columns]

total_rows : 145460
total_columns : 23

columns : Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

Detailed information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object 
 22  RainTomorrow   142193 non-null  object 
dtypes: float64(16), object(7)
memory usage: 25.5+ MB

Column wise missing values
column_wise_missing_values : Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64
duplicate_count : 0

Unique counts, missing values count, data types column wise
              Data Type  Unique Count  missing_values
Date             object          3436               0
Location         object            49               0
MinTemp         float64           389            1485
MaxTemp         float64           505            1261
Rainfall        float64           681            3261
Evaporation     float64           358           62790
Sunshine        float64           145           69835
WindGustDir      object            16           10326
WindGustSpeed   float64            67           10263
WindDir9am       object            16           10566
WindDir3pm       object            16            4228
WindSpeed9am    float64            43            1767
WindSpeed3pm    float64            44            3062
Humidity9am     float64           101            2654
Humidity3pm     float64           101            4507
Pressure9am     float64           546           15065
Pressure3pm     float64           549           15028
Cloud9am        float64            10           55888
Cloud3pm        float64            10           59358
Temp9am         float64           441            1767
Temp3pm         float64           502            3609
RainToday        object             2            3261
RainTomorrow     object             2            3267

 

From the output, I identified the below insights.

  1. There are 145460 rows and 23 columns.
  2. There are 16 columns of type float64 and 7 columns of type object.
  3. There are no duplicates in the input dataset.
  4. From the ‘Unique counts column wise’ section, I identified the below columns as categorical, based on their type and unique value count. As the unique value count of the RainToday and RainTomorrow columns is 2, these two are considered binary categorical variables.
       1. Location (0 missing values)
       2. WindGustDir (10326 missing values)
       3. WindDir9am (10566 missing values)
       4. WindDir3pm (4228 missing values)
       5. RainToday (3261 missing values)
       6. RainTomorrow (3267 missing values)
  5. There is a date variable denoted by the Date column.
  6. We need to predict the target ‘RainTomorrow’.

 

Let’s find the frequency count of the categorical columns to get some more insights.

categorical_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
for col in categorical_columns:
    print(f'\nFrequency count of values for the column {col}\n')
    print(df[col].value_counts())

 

Above snippet prints below information.

Frequency count of values for the column Location

Location
Canberra            3436
Sydney              3344
Darwin              3193
Melbourne           3193
Brisbane            3193
Adelaide            3193
Perth               3193
Hobart              3193
Albany              3040
MountGambier        3040
Ballarat            3040
Townsville          3040
GoldCoast           3040
Cairns              3040
Launceston          3040
AliceSprings        3040
Bendigo             3040
Albury              3040
MountGinini         3040
Wollongong          3040
Newcastle           3039
Tuggeranong         3039
Penrith             3039
Woomera             3009
Nuriootpa           3009
Cobar               3009
CoffsHarbour        3009
Moree               3009
Sale                3009
PerthAirport        3009
PearceRAAF          3009
Witchcliffe         3009
BadgerysCreek       3009
Mildura             3009
NorfolkIsland       3009
MelbourneAirport    3009
Richmond            3009
SydneyAirport       3009
WaggaWagga          3009
Williamtown         3009
Dartmoor            3009
Watsonia            3009
Portland            3009
Walpole             3006
NorahHead           3004
SalmonGums          3001
Katherine           1578
Nhil                1578
Uluru               1578
Name: count, dtype: int64

Frequency count of values for the column WindGustDir

WindGustDir
W      9915
SE     9418
N      9313
SSE    9216
E      9181
S      9168
WSW    9069
SW     8967
SSW    8736
WNW    8252
NW     8122
ENE    8104
ESE    7372
NE     7133
NNW    6620
NNE    6548
Name: count, dtype: int64

Frequency count of values for the column WindDir9am

WindDir9am
N      11758
SE      9287
E       9176
SSE     9112
NW      8749
S       8659
W       8459
SW      8423
NNE     8129
NNW     7980
ENE     7836
NE      7671
ESE     7630
SSW     7587
WNW     7414
WSW     7024
Name: count, dtype: int64

Frequency count of values for the column WindDir3pm

WindDir3pm
SE     10838
W      10110
S       9926
WSW     9518
SSE     9399
SW      9354
N       8890
WNW     8874
NW      8610
ESE     8505
E       8472
NE      8263
SSW     8156
NNW     7870
ENE     7857
NNE     6590
Name: count, dtype: int64

Frequency count of values for the column RainToday

RainToday
No     110319
Yes     31880
Name: count, dtype: int64

Frequency count of values for the column RainTomorrow

RainTomorrow
No     110316
Yes     31877
Name: count, dtype: int64

 

Let’s start pre-processing the data

Let’s take a deep copy of the data and apply all the transformations on it.

df_copy = df.copy(deep=True)

 

Note that a deep copy duplicates the underlying data, so it impacts memory and performance when the dataset is huge.

 

Convert the Date column data type to datetime

df_copy['Date'] = pd.to_datetime(df_copy['Date'])

 

Above statement converts the values in the 'Date' column of the Pandas DataFrame (df_copy) into datetime objects. After executing this line, the 'Date' column in the df_copy DataFrame will contain datetime objects instead of the original string values.
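Since the sample rows show dates in YYYY-MM-DD format, you can optionally make the parsing explicit (and usually faster) by passing the format; errors='coerce' turns any unparseable value into NaT instead of raising an error.

df_copy['Date'] = pd.to_datetime(df_copy['Date'], format='%Y-%m-%d', errors='coerce')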

 

Extract year, month and day values from ‘Date’ column and store them in separate features year, month and day

df_copy['Year'] = df_copy['Date'].dt.year

 

Above statement creates a new column called 'Year' in the DataFrame df_copy. I used the ‘dt’ accessor to access the datetime properties of the 'Date' column and the .year attribute to extract the year component from each date and assign it to the 'Year' column.

df_copy['Month'] = df_copy['Date'].dt.month

Above statement creates a new column called 'Month' in the DataFrame df_copy. I used the ‘dt’ accessor to access the datetime properties of the 'Date' column and the .month attribute to extract the month component from each date and assign it to the 'Month' column.

 

df_copy['Day'] = df_copy['Date'].dt.day

 

Above statement creates a new column called 'Day' in the DataFrame df_copy. I used the ‘dt’ accessor to access the datetime properties of the 'Date' column and the .day attribute to extract the day component from each date and assign it to the 'Day' column.

 

Drop Date column data

As we have already extracted the Year, Month and Day information from the Date column, we can remove it from the DataFrame to optimize memory.

df_copy.drop('Date', axis=1, inplace=True)

 

The first argument specifies the column ‘Date’ that we want to remove.

The second argument, axis=1, specifies that we want to delete a column; axis=0 would specify a row.

inplace=True specifies that we want to modify the DataFrame df_copy in place: the 'Date' column is removed from df_copy directly, and the modified DataFrame is stored back in df_copy without the need to assign it explicitly to a new variable.
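Equivalently, if you prefer to avoid inplace=True (a style that is generally discouraged in modern pandas code), you can reassign the result:

df_copy = df_copy.drop('Date', axis=1)
# or, using the columns keyword
df_copy = df_copy.drop(columns=['Date'])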

 

Let’s clean up the categorical variables

Let’s identify the null value count in categorical variables.

# print null value count and unique values count of categorical columns
# Create an empty DataFrame to store the counts
null_and_unique_value_count = pd.DataFrame(index=categorical_columns)

# Count of null values for each column
null_and_unique_value_count['Null Count'] = df[categorical_columns].isnull().sum()

# Count of unique values for each column
null_and_unique_value_count['Unique Count'] = df[categorical_columns].nunique()

# Display the combined DataFrame
print(null_and_unique_value_count)

 

Above snippet prints below statistics.

              Null Count  Unique Count
Location               0            49
WindGustDir        10326            16
WindDir9am         10566            16
WindDir3pm          4228            16
RainToday           3261             2
RainTomorrow        3267             2

 

From the above output, you can confirm that the WindGustDir, WindDir9am, WindDir3pm, RainToday, and RainTomorrow columns contain missing values.

 

Let’s do the following steps on the categorical column data.

  1. Fill the missing values with the mode.
  2. Encode the column data with a label encoder.

 

label_encoder = LabelEncoder()
for col in categorical_columns:
	# Fill missing values with the mode (the most frequent value)
	mode_value = df_copy[col].mode().iloc[0]
	df_copy[col].fillna(mode_value, inplace=True)

	# Encode the data
	df_copy[col] = label_encoder.fit_transform(df_copy[col])

To confirm, print the null value count of the categorical columns.

print(df_copy[categorical_columns].isnull().sum())

Location        0
WindGustDir     0
WindDir9am      0
WindDir3pm      0
RainToday       0
RainTomorrow    0

 

Print the first five rows of the data to visualize the transformed data.

df_copy[categorical_columns].head()

   Location  WindGustDir  WindDir9am  WindDir3pm  RainToday  RainTomorrow
0         2           13          13          14          0             0
1         2           14           6          15          0             0
2         2           15          13          15          0             0
3         2            4           9           0          0             0
4         2           13           1           7          0             0

 

Let’s clean up the numerical variables

The following snippet gets all the numeric columns.

numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
print('The numerical variables are :', numerical_columns)
print('Total numeric columns are : ', len(numerical_columns))

Above snippet prints below output.

The numerical variables are : ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']
Total numeric columns are :  16
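As an alternative to checking dtype != 'O' manually, pandas’ select_dtypes does the same job in one call:

# Select all numeric columns in one call
numerical_columns = df.select_dtypes(include='number').columns.tolist()
print('The numerical variables are :', numerical_columns)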

 

Let’s get count of missing values in the numeric columns.


null_values_count = df[numerical_columns].isnull().sum()
print(f'null_values_count : \n{null_values_count}')

Above snippet prints below statistics.

null_values_count : 
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustSpeed    10263
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609

 

From the above snippet, you can confirm that all the numeric columns have missing values.

 

Let’s get basic statistics on the numeric columns.

 

First adjust display options to show all rows and columns

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

 

Execute the below line to get the statistics.

print(round(df[numerical_columns].describe()))

Above statement print below statistics.

        MinTemp   MaxTemp  Rainfall  Evaporation  Sunshine  WindGustSpeed   
count  143975.0  144199.0  142199.0      82670.0   75625.0       135197.0  \
mean       12.0      23.0       2.0          5.0       8.0           40.0   
std         6.0       7.0       8.0          4.0       4.0           14.0   
min        -8.0      -5.0       0.0          0.0       0.0            6.0   
25%         8.0      18.0       0.0          3.0       5.0           31.0   
50%        12.0      23.0       0.0          5.0       8.0           39.0   
75%        17.0      28.0       1.0          7.0      11.0           48.0   
max        34.0      48.0     371.0        145.0      14.0          135.0   

       WindSpeed9am  WindSpeed3pm  Humidity9am  Humidity3pm  Pressure9am   
count      143693.0      142398.0     142806.0     140953.0     130395.0  \
mean           14.0          19.0         69.0         52.0       1018.0   
std             9.0           9.0         19.0         21.0          7.0   
min             0.0           0.0          0.0          0.0        980.0   
25%             7.0          13.0         57.0         37.0       1013.0   
50%            13.0          19.0         70.0         52.0       1018.0   
75%            19.0          24.0         83.0         66.0       1022.0   
max           130.0          87.0        100.0        100.0       1041.0   

       Pressure3pm  Cloud9am  Cloud3pm   Temp9am   Temp3pm  
count     130432.0   89572.0   86102.0  143693.0  141851.0  
mean        1015.0       4.0       5.0      17.0      22.0  
std            7.0       3.0       3.0       6.0       7.0  
min          977.0       0.0       0.0      -7.0      -5.0  
25%         1010.0       1.0       2.0      12.0      17.0  
50%         1015.0       5.0       5.0      17.0      21.0  
75%         1020.0       7.0       7.0      22.0      26.0  
max         1040.0       9.0       9.0      40.0      47.0  

 

By closely inspecting the data, we can see that the following columns have outliers.

  1. Rainfall (the 75th percentile is 1.0, but the max value is 371.0)
  2. Evaporation (the 75th percentile is 7.0, but the max value is 145.0)
  3. WindSpeed9am (the 75th percentile is 19.0, but the max value is 130.0)
  4. WindSpeed3pm (the 75th percentile is 24.0, but the max value is 87.0)

 

Let’s draw a box plot and confirm our observation for the numerical columns.

 

box_plot_for_outliers.py

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame
df = pd.read_csv('weatherAUS.csv')

plt.figure(figsize=(15,10))

plt.subplot(2, 2, 1)
fig = df.boxplot(column='Rainfall')
fig.set_title('')
fig.set_ylabel('Rainfall')

plt.subplot(2, 2, 2)
fig = df.boxplot(column='Evaporation')
fig.set_title('')
fig.set_ylabel('Evaporation')

plt.subplot(2, 2, 3)
fig = df.boxplot(column='WindSpeed9am')
fig.set_title('')
fig.set_ylabel('WindSpeed9am')

plt.subplot(2, 2, 4)
fig = df.boxplot(column='WindSpeed3pm')
fig.set_title('')
fig.set_ylabel('WindSpeed3pm')

plt.show()

 

Output

(Box plots for Rainfall, Evaporation, WindSpeed9am and WindSpeed3pm; each shows a cluster of points far above the upper whisker.)
 

 

From the above diagram, we can reconfirm there are outliers.

 

The following snippet creates a box plot for each of the numeric columns.

 

box_plot_for_all_numeric_columns.py


 

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame
df = pd.read_csv('weatherAUS.csv')

numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
# print(len(numerical_columns))

plt.figure(figsize=(14, 14))

rows = 4
cols = 4

for i in range(len(numerical_columns)):
    col_name = numerical_columns[i]
    plt.subplot(rows, cols, i+1)
    fig = df.boxplot(column=col_name)
    fig.set_ylabel(col_name)

plt.show()

 

Output

 

(A 4×4 grid of box plots, one for each of the 16 numeric columns.)
 

Let’s use the Winsorization approach to replace extreme values with less extreme ones; here I cap each column at its 10th and 90th percentiles.


 

outlier_columns = ['Rainfall', 'Evaporation', 'WindSpeed9am', 'WindSpeed3pm']
for col_name in outlier_columns:
	# Cap values below the 10th percentile and above the 90th percentile
	lower_limit = df_copy[col_name].quantile(0.1)
	upper_limit = df_copy[col_name].quantile(0.9)
	df_copy[col_name] = np.where(df_copy[col_name] < lower_limit, lower_limit, df_copy[col_name])
	df_copy[col_name] = np.where(df_copy[col_name] > upper_limit, upper_limit, df_copy[col_name])

 

After the above snippet, the statistics look like below.

       Rainfall  Evaporation  WindSpeed9am
count  142199.0      82670.0      143693.0
mean        1.0          5.0          14.0
std         2.0          3.0           7.0
min         0.0          1.0           4.0
25%         0.0          3.0           7.0
50%         0.0          5.0          13.0
75%         1.0          7.0          19.0
max         6.0         10.0          26.0
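The same capping can be written more compactly with pandas’ clip method, which bounds a Series between a lower and an upper value:

for col_name in outlier_columns:
	lower_limit = df_copy[col_name].quantile(0.1)
	upper_limit = df_copy[col_name].quantile(0.9)
	# clip replaces values outside [lower_limit, upper_limit] with the nearest limit
	df_copy[col_name] = df_copy[col_name].clip(lower=lower_limit, upper=upper_limit)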

Let’s replace all the null values in the numeric columns with the median.

for col_name in numerical_columns:
	col_median = df_copy[col_name].median()
	df_copy[col_name].fillna(col_median, inplace=True)

 

You can check for null values using the below statement.

print(df_copy[numerical_columns].isnull().sum())

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0

 

Preprocessing step is done now…😊

 

Divide the dataset into the feature set (X) and the target variable (y).

target_column_name = 'RainTomorrow'
X = df_copy.drop([target_column_name], axis=1)
y = df_copy[target_column_name]

Split the data into training and test samples.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)

Instantiate the LogisticRegression model.
logreg = LogisticRegression(solver='liblinear', random_state=47)

Train the model.
logreg.fit(X_train, y_train)

Make predictions on the test data.
y_pred = logreg.predict(X_test)

Print the accuracy score
print('Model accuracy score : ',accuracy_score(y_test, y_pred))
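predict() applies a default threshold of 0.5. To use a custom decision boundary like the 0.4 discussed earlier, you can threshold the raw probabilities from predict_proba yourself; a small sketch:

# Column 1 of predict_proba holds the probability of the positive class (1)
y_prob = logreg.predict_proba(X_test)[:, 1]

# Apply a custom threshold instead of the default 0.5
custom_threshold = 0.4
y_pred_custom = (y_prob >= custom_threshold).astype(int)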

 

Find the below working application.

 

weather_prediction.py

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def basic_analysis(df, categorical_columns):

    five_rows = df.head()
    print(f'five_rows : {five_rows}')

    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]

    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'duplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes, 'Unique Count': unique_counts, 'missing_values' : missing_values})
    print(result_df)

    for col in categorical_columns:
        print(f'\nFrequency count of values for the column {col}\n')
        print(df[col].value_counts())

    # print null value count and unique values count of categorical columns
    # Create an empty DataFrame to store the counts
    null_and_unique_value_count = pd.DataFrame(index=categorical_columns)

    # Count of null values for each column
    null_and_unique_value_count['Null Count'] = df[categorical_columns].isnull().sum()

    # Count of unique values for each column
    null_and_unique_value_count['Unique Count'] = df[categorical_columns].nunique()

    # Display the combined DataFrame
    print(null_and_unique_value_count)


def preprocess(df, categorical_columns):
    df_copy = df.copy(deep=True)

    # Convert the Date column from object to datetime type
    df_copy['Date'] = pd.to_datetime(df_copy['Date'])

    df_copy['Year'] = df_copy['Date'].dt.year
    df_copy['Month'] = df_copy['Date'].dt.month
    df_copy['Day'] = df_copy['Date'].dt.day

    df_copy.drop('Date', axis=1, inplace=True)

    # Encode the categorical columns
    label_encoder = LabelEncoder()
    for col in categorical_columns:
        # Fill missing values with mode
        mode_value = df_copy[col].mode().iloc[0]
        df_copy[col].fillna(mode_value, inplace=True)

        # Encode the data
        df_copy[col] = label_encoder.fit_transform(df_copy[col])

    #print('\nNull values count in categorical columns')
    #print(df_copy[categorical_columns].isnull().sum())

    #print('\nfirst five rows of categorical columns')
    #print(df_copy[categorical_columns].head())

    # Work with numerical columns
    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    #print('The numerical variables are :', numerical_columns)
    #print('Total numeric columns are : ', len(numerical_columns))

    #null_values_count = df[numerical_columns].isnull().sum()
    # print(f'null_values_count : \n{null_values_count}')

    # print('Basic statistics on numerical columns\n')
    # print(round(df[numerical_columns].describe()))

    # Replace all the outliers using Winsorization approach
    # Winsorization involves replacing extreme values with less extreme
    outlier_columns = ['Rainfall', 'Evaporation', 'WindSpeed9am', 'WindSpeed3pm']
    for col_name in outlier_columns:
        lower_limit = df_copy[col_name].quantile(0.1)
        upper_limit = df_copy[col_name].quantile(0.9)
        df_copy[col_name] = np.where(df_copy[col_name] < lower_limit, lower_limit, df_copy[col_name])
        df_copy[col_name] = np.where(df_copy[col_name] > upper_limit, upper_limit, df_copy[col_name])

    #print('Basic statistics on numerical columns\n')
    # print(round(df_copy[outlier_columns].describe()))

    # let's replace all the missing values with median
    for col_name in numerical_columns:
        median = df_copy[col_name].median()
        df_copy[col_name].fillna(median, inplace=True)

    #print(df_copy)
    #print(df_copy[numerical_columns].isnull().sum())
    return df_copy

def train_and_test_model(df):
    # Divide the data into features and target variable
    target_column_name = 'RainTomorrow'
    X = df.drop([target_column_name], axis=1)

    #scaler = MinMaxScaler()
    #X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

    y = df[target_column_name]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)
    #print(X_train.describe())

    # instantiate the model
    logreg = LogisticRegression(solver='liblinear', random_state=47)

    # fit the model
    logreg.fit(X_train, y_train)

    y_pred = logreg.predict(X_test)
    print('Model accuracy score : ',accuracy_score(y_test, y_pred))

df = pd.read_csv('weatherAUS.csv')
categorical_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

# basic_analysis(df, categorical_columns)
preprocessed_data = preprocess(df, categorical_columns)
# preprocessed_data.info()

train_and_test_model(preprocessed_data)

 

Output

Model accuracy score :  0.8351436821119208
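Note that the classes are imbalanced (roughly 110k ‘No’ vs 32k ‘Yes’, as seen in the frequency counts above), so accuracy alone can be misleading; a model that always predicts ‘No’ would already score around 78%. A confusion matrix and classification report give a fuller picture:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall and F1 score
print(classification_report(y_test, y_pred))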

 

Note

  1. When there are outliers in the dataset, use the median (not the mean) to replace missing values, since the median is robust to extreme values, as the small example below shows.
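A tiny demonstration with a made-up list of values:

import numpy as np

values = np.array([1, 2, 3, 4, 1000])
print(np.mean(values))    # 202.0, dragged up by the single outlier
print(np.median(values))  # 3.0, barely affected by the outlier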

 

You can download the dataset from the below link.

https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package

