Programming for beginners: How to handle missing values?

A missing value is the one which is undefined or absent or a null value in the dataset. These are typically represented as NaN for numerical data and null for textual or object data.

Since missing value affect the accuracy of prediction of model, we need to handle them with care. Depending on the size, context of data different strategies can be applied to address missing values like remove missing values, replace missing values with some default value etc.,

Why data can be missing?

Data can be missed due to several reasons, following are some.

Data entry errors
Data might not be collected fully, for example, while taking the survey, some might be mandatory questions and some might be optional. User might skip the optional questions while submitting the survey, which leads to missing value here.
Data is lost due to some file corruption.

How to handle missing values?

Following are some commonly used strategies to handle missing values.

a. Remove rows with missing values

it is very simple approach to implement, just remove all the rows with missing values. You can go with this approach, when the number of missing values are small and do not significantly impact the dataset size.

df_exclude_missing_values = df.dropna()

Above snippet drop the rows that contain atleast one missing value, this operation should not impact the original dataframe 'df', the variable 'df_exclude_missing_values' holds the final result after dropping missing values.

df_exclude_missing_values_of_col_b = df.dropna(subset=['b'])

Above statement exclude rows from the original DataFrame df where the column 'b' has missing values (NaN). The variable 'df_exclude_missing_values_of_col_b' holds the final result after dropping missing values.

df.dropna(inplace=True)

Above snippet drop the rows that contain atleast one missing value and the operation should be performed on the original DataFrame, that means Changes are reflected to the original dataframe.

Find the below working application.

drop_missing_values.py

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5, 6, 7],
    'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
    'c' : [True, None, True, False, True, None, False]
})

df_exclude_missing_values = df.dropna()
df_exclude_missing_values_of_col_b = df.dropna(subset=['b'])

print(f'df : \n',df)
print(f'\ndf_exclude_missing_values : \n',df_exclude_missing_values)
print(f'\ndf_exclude_missing_values_of_col_b : \n',df_exclude_missing_values_of_col_b)

# Incase if you want to reflect the changes in actual dataframe, set inplace argument to True
df.dropna(inplace=True)
print(f'\ndf : \n',df)

Output

df : 
    a       b      c
0  1   12.30   True
1  2     NaN   None
2  3  143.45   True
3  4     NaN  False
4  5   34.00   True
5  6   56.00   None
6  7  109.00  False

df_exclude_missing_values : 
    a       b      c
0  1   12.30   True
2  3  143.45   True
4  5   34.00   True
6  7  109.00  False

df_exclude_missing_values_of_col_b : 
    a       b      c
0  1   12.30   True
2  3  143.45   True
4  5   34.00   True
5  6   56.00   None
6  7  109.00  False
df : 
    a       b      c
0  1   12.30   True
2  3  143.45   True
4  5   34.00   True
6  7  109.00  False

b. Replace missing values with some default value

Replace missing values with some default value. In case of numerical values, you can replace the missing values with 0, 1 or -1 depends on the context. In case of boolean values, you can replace missing values with True or False etc.,

Example

df['b'].fillna(0, inplace=True)
Above snippet replaces the missing values in a numeric column ‘b’ with 0.

df['c'].fillna(True, inplace=True)
Above snippet replaces the missing values in a boolean column ‘c’ with True

Find the below working application.

replace_missing_values.py

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5, 6, 7],
    'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
    'c' : [True, None, True, False, True, None, False]
})

print(f'df : \n',df)

df['b'].fillna(0, inplace=True)
df['c'].fillna(True, inplace=True)

print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')

print(f'df : \n',df)

Output

df : 
    a       b      c
0  1   12.30   True
1  2     NaN   None
2  3  143.45   True
3  4     NaN  False
4  5   34.00   True
5  6   56.00   None
6  7  109.00  False

Dataframe after replacing numerical missing values with 0 and boolean values with True
df : 
    a       b      c
0  1   12.30   True
1  2    0.00   True
2  3  143.45   True
3  4    0.00  False
4  5   34.00   True
5  6   56.00   True
6  7  109.00  False

c. Replace with mode

This approach will work both categorical, non-categorical columns. Find the most frequent value in the column and replace all the missing values with that.

Example

mode_value = df['c'].mode().iloc[0]

Above statement get the most frequent value in the column ‘c’.

df['c'].fillna(mode_value, inplace=True)

Above snippet replace the missing values in the column ‘c’ with most frequent value.

replace_missing_values_with_mode.py

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5, 6, 7],
    'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
    'c' : [True, None, True, False, True, None, False]
})

print(f'df : \n',df)

mode_value = df['b'].mode().iloc[0]
df['b'].fillna(mode_value, inplace=True)
print(f'\nmode_value for the column b is : {mode_value}')

mode_value = df['c'].mode().iloc[0]
df['c'].fillna(mode_value, inplace=True)
print(f'mode_value for the column c is : {mode_value}')

print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')

print(f'df : \n',df)

Output

df : 
    a       b      c
0  1   12.30   True
1  2     NaN   None
2  3  143.45   True
3  4     NaN  False
4  5   34.00   True
5  6   56.00   None
6  7  109.00  False

mode_value for the column b is : 12.3
mode_value for the column c is : True

Dataframe after replacing numerical missing values with 0 and boolean values with True
df : 
    a       b      c
0  1   12.30   True
1  2   12.30   True
2  3  143.45   True
3  4   12.30  False
4  5   34.00   True
5  6   56.00   True
6  7  109.00  False

d. Replace missing values with median

Median represent the middle value of the data points, when the data points are arranged in either ascending or descending order. This approach is not applicable for categorical columns.

Example

median_value = df['b'].median()
df['b'].fillna(median_value, inplace=True)

Find the below working application.

replace_missing_values_with_median.py

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5, 6, 7],
    'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
    'c' : [True, None, True, False, True, None, False]
})

print(f'df : \n',df)

median_value = df['b'].median()
df['b'].fillna(median_value, inplace=True)
print(f'\nmedian_value for the column b is : {median_value}')

mode_value = df['c'].mode().iloc[0]
df['c'].fillna(mode_value, inplace=True)
print(f'mode_value for the column c is : {mode_value}')

print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')

print(f'df : \n',df)

Output

df : 
    a       b      c
0  1   12.30   True
1  2     NaN   None
2  3  143.45   True
3  4     NaN  False
4  5   34.00   True
5  6   56.00   None
6  7  109.00  False

median_value for the column b is : 56.0
mode_value for the column c is : 0    True
Name: c, dtype: object

Dataframe after replacing numerical missing values with 0 and boolean values with True
df : 
    a       b      c
0  1   12.30   True
1  2   56.00    NaN
2  3  143.45   True
3  4   56.00  False
4  5   34.00   True
5  6   56.00    NaN
6  7  109.00  False

e. Replace missing values with mean

This approach will work on numerical columns. We can take the average of all the values and replace missing values with the average.

mean_value = df['b'].mean()
df['b'].fillna(mean_value, inplace=True)

Find the below working application.

replace_missing_values_with_mean.py

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5, 6, 7],
    'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
    'c' : [True, None, True, False, True, None, False]
})

print(f'df : \n',df)

mean_value = df['b'].mean()
df['b'].fillna(mean_value, inplace=True)
print(f'\nmean_value for the column b is : {mean_value}')

mode_value = df['c'].mode().iloc[0]
df['c'].fillna(mode_value, inplace=True)
print(f'mode_value for the column c is :\n {mode_value}')

print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')

print(f'df : \n',df)

Output

df : 
    a       b      c
0  1   12.30   True
1  2     NaN   None
2  3  143.45   True
3  4     NaN  False
4  5   34.00   True
5  6   56.00   None
6  7  109.00  False

mean_value for the column b is : 70.95
mode_value for the column c is :
 True

Dataframe after replacing numerical missing values with 0 and boolean values with True
df : 
    a       b      c
0  1   12.30   True
1  2   70.95   True
2  3  143.45   True
3  4   70.95  False
4  5   34.00   True
5  6   56.00   True
6  7  109.00  False

f. Use predictive models to predict the missing values

We can use predictive models like linear regression, to predict the missing numerical values based on other features.

Procedure to achieve above one is given below.

Step 1: Train the model on rows without missing values

Step 2: Use the trained model to predict the missing value.

predict_missing_values.py

from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5, 6, 7],
    'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
    'c' : [7, 6, 5, 4, 3, 2, 1]
})

print(f'df : \n', df)

# Separate data into two sets: rows with missing values and rows without missing values
rows_with_missing_values = df[df['b'].isnull()]
rows_without_missing_values = df.dropna(subset=['b'])

# Train a model to predict the missing values
model = LinearRegression()
model.fit(rows_without_missing_values.drop('b', axis=1), rows_without_missing_values['b'])

# Predict missing values
predicted_values = model.predict(rows_with_missing_values.drop('b', axis=1))
rows_with_missing_values['b'] = predicted_values

# Concatenate the data back together and reset the index
df = pd.concat([rows_with_missing_values, rows_without_missing_values], ignore_index=True)
print(f'df after imputation: \n', df)

Output

df : 
    a       b  c
0  1   12.30  7
1  2     NaN  6
2  3  143.45  5
3  4     NaN  4
4  5   34.00  3
5  6   56.00  2
6  7  109.00  1

df after imputation: 
    a           b  c
0  2   55.355172  6
1  4   68.350862  4
2  1   12.300000  7
3  3  143.450000  5
4  5   34.000000  3
5  6   56.000000  2
6  7  109.000000  1

Process finished with exit code 0

g. Use domain knowledge to fill missing data

Sometimes, based on your domain knowledge, you can fill the missing data.

When to use mean, mode and median to replace missing values

When your data is skewed or contain outliers, then use median to replace missing values. Using median is a better choice, when the distribution is not symmetric.

Use mode, when you have categorical data.

Use mean, when the data follows a roughly symmetric distribution without significant outliers. For example, we can use mean to replace the values for the columns age, temperature, salary etc.,

Previous Next Home

Programming for beginners

Thursday, 20 February 2025

How to handle missing values?

No comments:

Post a Comment