A missing value is the one which is undefined or absent or a null value in the dataset. These are typically represented as NaN for numerical data and null for textual or object data.
Since missing value affect the accuracy of prediction of model, we need to handle them with care. Depending on the size, context of data different strategies can be applied to address missing values like remove missing values, replace missing values with some default value etc.,
Why data can be missing?
Data can be missed due to several reasons, following are some.
- Data entry errors
- Data might not be collected fully, for example, while taking the survey, some might be mandatory questions and some might be optional. User might skip the optional questions while submitting the survey, which leads to missing value here.
- Data is lost due to some file corruption.
How to handle missing values?
Following are some commonly used strategies to handle missing values.
a. Remove rows with missing values
it is very simple approach to implement, just remove all the rows with missing values. You can go with this approach, when the number of missing values are small and do not significantly impact the dataset size.
df_exclude_missing_values = df.dropna()
Above snippet drop the rows that contain atleast one missing value, this operation should not impact the original dataframe 'df', the variable 'df_exclude_missing_values' holds the final result after dropping missing values.
df_exclude_missing_values_of_col_b = df.dropna(subset=['b'])
Above statement exclude rows from the original DataFrame df where the column 'b' has missing values (NaN). The variable 'df_exclude_missing_values_of_col_b' holds the final result after dropping missing values.
df.dropna(inplace=True)
Above snippet drop the rows that contain atleast one missing value and the operation should be performed on the original DataFrame, that means Changes are reflected to the original dataframe.
Find the below working application.
drop_missing_values.py
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a' : [1, 2, 3, 4, 5, 6, 7],
'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
'c' : [True, None, True, False, True, None, False]
})
df_exclude_missing_values = df.dropna()
df_exclude_missing_values_of_col_b = df.dropna(subset=['b'])
print(f'df : \n',df)
print(f'\ndf_exclude_missing_values : \n',df_exclude_missing_values)
print(f'\ndf_exclude_missing_values_of_col_b : \n',df_exclude_missing_values_of_col_b)
# Incase if you want to reflect the changes in actual dataframe, set inplace argument to True
df.dropna(inplace=True)
print(f'\ndf : \n',df)
Output
df : a b c 0 1 12.30 True 1 2 NaN None 2 3 143.45 True 3 4 NaN False 4 5 34.00 True 5 6 56.00 None 6 7 109.00 False df_exclude_missing_values : a b c 0 1 12.30 True 2 3 143.45 True 4 5 34.00 True 6 7 109.00 False df_exclude_missing_values_of_col_b : a b c 0 1 12.30 True 2 3 143.45 True 4 5 34.00 True 5 6 56.00 None 6 7 109.00 False df : a b c 0 1 12.30 True 2 3 143.45 True 4 5 34.00 True 6 7 109.00 False
b. Replace missing values with some default value
Replace missing values with some default value. In case of numerical values, you can replace the missing values with 0, 1 or -1 depends on the context. In case of boolean values, you can replace missing values with True or False etc.,
Example
df['b'].fillna(0, inplace=True)
Above snippet replaces the missing values in a numeric column ‘b’ with 0.
df['c'].fillna(True, inplace=True)
Above snippet replaces the missing values in a boolean column ‘c’ with True
Find the below working application.
replace_missing_values.py
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a' : [1, 2, 3, 4, 5, 6, 7],
'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
'c' : [True, None, True, False, True, None, False]
})
print(f'df : \n',df)
df['b'].fillna(0, inplace=True)
df['c'].fillna(True, inplace=True)
print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')
print(f'df : \n',df)
Output
df : a b c 0 1 12.30 True 1 2 NaN None 2 3 143.45 True 3 4 NaN False 4 5 34.00 True 5 6 56.00 None 6 7 109.00 False Dataframe after replacing numerical missing values with 0 and boolean values with True df : a b c 0 1 12.30 True 1 2 0.00 True 2 3 143.45 True 3 4 0.00 False 4 5 34.00 True 5 6 56.00 True 6 7 109.00 False
c. Replace with mode
This approach will work both categorical, non-categorical columns. Find the most frequent value in the column and replace all the missing values with that.
Example
mode_value = df['c'].mode().iloc[0]
Above statement get the most frequent value in the column ‘c’.
df['c'].fillna(mode_value, inplace=True)
Above snippet replace the missing values in the column ‘c’ with most frequent value.
replace_missing_values_with_mode.py
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a' : [1, 2, 3, 4, 5, 6, 7],
'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
'c' : [True, None, True, False, True, None, False]
})
print(f'df : \n',df)
mode_value = df['b'].mode().iloc[0]
df['b'].fillna(mode_value, inplace=True)
print(f'\nmode_value for the column b is : {mode_value}')
mode_value = df['c'].mode().iloc[0]
df['c'].fillna(mode_value, inplace=True)
print(f'mode_value for the column c is : {mode_value}')
print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')
print(f'df : \n',df)
Output
df : a b c 0 1 12.30 True 1 2 NaN None 2 3 143.45 True 3 4 NaN False 4 5 34.00 True 5 6 56.00 None 6 7 109.00 False mode_value for the column b is : 12.3 mode_value for the column c is : True Dataframe after replacing numerical missing values with 0 and boolean values with True df : a b c 0 1 12.30 True 1 2 12.30 True 2 3 143.45 True 3 4 12.30 False 4 5 34.00 True 5 6 56.00 True 6 7 109.00 False
d. Replace missing values with median
Median represent the middle value of the data points, when the data points are arranged in either ascending or descending order. This approach is not applicable for categorical columns.
Example
median_value = df['b'].median()
df['b'].fillna(median_value, inplace=True)
Find the below working application.
replace_missing_values_with_median.py
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a' : [1, 2, 3, 4, 5, 6, 7],
'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
'c' : [True, None, True, False, True, None, False]
})
print(f'df : \n',df)
median_value = df['b'].median()
df['b'].fillna(median_value, inplace=True)
print(f'\nmedian_value for the column b is : {median_value}')
mode_value = df['c'].mode().iloc[0]
df['c'].fillna(mode_value, inplace=True)
print(f'mode_value for the column c is : {mode_value}')
print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')
print(f'df : \n',df)
Output
df : a b c 0 1 12.30 True 1 2 NaN None 2 3 143.45 True 3 4 NaN False 4 5 34.00 True 5 6 56.00 None 6 7 109.00 False median_value for the column b is : 56.0 mode_value for the column c is : 0 True Name: c, dtype: object Dataframe after replacing numerical missing values with 0 and boolean values with True df : a b c 0 1 12.30 True 1 2 56.00 NaN 2 3 143.45 True 3 4 56.00 False 4 5 34.00 True 5 6 56.00 NaN 6 7 109.00 False
e. Replace missing values with mean
This approach will work on numerical columns. We can take the average of all the values and replace missing values with the average.
mean_value = df['b'].mean()
df['b'].fillna(mean_value, inplace=True)
Find the below working application.
replace_missing_values_with_mean.py
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a' : [1, 2, 3, 4, 5, 6, 7],
'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
'c' : [True, None, True, False, True, None, False]
})
print(f'df : \n',df)
mean_value = df['b'].mean()
df['b'].fillna(mean_value, inplace=True)
print(f'\nmean_value for the column b is : {mean_value}')
mode_value = df['c'].mode().iloc[0]
df['c'].fillna(mode_value, inplace=True)
print(f'mode_value for the column c is :\n {mode_value}')
print('\nDataframe after replacing numerical missing values with 0 and boolean values with True')
print(f'df : \n',df)
Output
df : a b c 0 1 12.30 True 1 2 NaN None 2 3 143.45 True 3 4 NaN False 4 5 34.00 True 5 6 56.00 None 6 7 109.00 False mean_value for the column b is : 70.95 mode_value for the column c is : True Dataframe after replacing numerical missing values with 0 and boolean values with True df : a b c 0 1 12.30 True 1 2 70.95 True 2 3 143.45 True 3 4 70.95 False 4 5 34.00 True 5 6 56.00 True 6 7 109.00 False
f. Use predictive models to predict the missing values
We can use predictive models like linear regression, to predict the missing numerical values based on other features.
Procedure to achieve above one is given below.
Step 1: Train the model on rows without missing values
Step 2: Use the trained model to predict the missing value.
predict_missing_values.py
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a' : [1, 2, 3, 4, 5, 6, 7],
'b' : [12.3, np.NAN, 143.45, np.NAN, 34, 56, 109],
'c' : [7, 6, 5, 4, 3, 2, 1]
})
print(f'df : \n', df)
# Separate data into two sets: rows with missing values and rows without missing values
rows_with_missing_values = df[df['b'].isnull()]
rows_without_missing_values = df.dropna(subset=['b'])
# Train a model to predict the missing values
model = LinearRegression()
model.fit(rows_without_missing_values.drop('b', axis=1), rows_without_missing_values['b'])
# Predict missing values
predicted_values = model.predict(rows_with_missing_values.drop('b', axis=1))
rows_with_missing_values['b'] = predicted_values
# Concatenate the data back together and reset the index
df = pd.concat([rows_with_missing_values, rows_without_missing_values], ignore_index=True)
print(f'df after imputation: \n', df)
Output
df : a b c 0 1 12.30 7 1 2 NaN 6 2 3 143.45 5 3 4 NaN 4 4 5 34.00 3 5 6 56.00 2 6 7 109.00 1 df after imputation: a b c 0 2 55.355172 6 1 4 68.350862 4 2 1 12.300000 7 3 3 143.450000 5 4 5 34.000000 3 5 6 56.000000 2 6 7 109.000000 1 Process finished with exit code 0
g. Use domain knowledge to fill missing data
Sometimes, based on your domain knowledge, you can fill the missing data.
When to use mean, mode and median to replace missing values
When your data is skewed or contain outliers, then use median to replace missing values. Using median is a better choice, when the distribution is not symmetric.
Use mode, when you have categorical data.
Use mean, when the data follows a roughly symmetric distribution without significant outliers. For example, we can use mean to replace the values for the columns age, temperature, salary etc.,
Previous Next Home
No comments:
Post a Comment