Friday 3 November 2023

Pandas: sort_values vs missing values in a DataFrame

By default, the ‘sort_values’ method places missing values last. To confirm this, let’s experiment with the dataset below.

        Name   Age       City  Gender
0   Krishna  34.0  Bangalore    Male
1     Sailu  35.0  Hyderabad  Female
2      Joel  29.0       None    Male
3     Chamu   NaN    Chennai  Female
4  Jitendra  52.0       None    Male
5       Raj   NaN    Chennai    Male

 

Sort the DataFrame by Age column

sort_by_age_ascending_1 = df.sort_values('Age')

The above snippet sorts the DataFrame by the ‘Age’ column and assigns the sorted DataFrame to the variable sort_by_age_ascending_1. The original DataFrame df is not modified. The content of ‘sort_by_age_ascending_1’ is given below.

        Name   Age       City  Gender
2      Joel  29.0       None    Male
0   Krishna  34.0  Bangalore    Male
1     Sailu  35.0  Hyderabad  Female
4  Jitendra  52.0       None    Male
3     Chamu   NaN    Chennai  Female
5       Raj   NaN    Chennai    Male

 

As you can see in the above output, all rows with NaN in the Age column are moved to the end, and the remaining rows are sorted in ascending order of Age.

 

You get the same result by explicitly passing the argument na_position='last', which is the default.

sort_by_age_ascending_2 = df.sort_values('Age', na_position='last')
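A related point worth checking is that na_position is independent of the sort direction. The sketch below is an illustration, not part of the original application: it uses a trimmed copy of the same dataset, and the variable name sort_by_age_descending is made up for this example. It sorts by Age in descending order and shows that the NaN rows still stay at the end.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the dataset used in this post (Name and Age only)
df = pd.DataFrame({
    'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', 'Raj'],
    'Age': [34, 35, 29, np.nan, 52, np.nan],
})

# Descending sort; na_position is left at its default ('last'),
# so the rows with NaN in Age are still placed at the bottom
sort_by_age_descending = df.sort_values('Age', ascending=False)

print(sort_by_age_descending)
```

If you want the missing values on top in a descending sort, pass na_position='first' together with ascending=False.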

 

Sort the DataFrame by Age column and move all the missing values to the top

By passing the argument na_position='first', we can move all the rows with missing values to the top.

 

sort_by_age_ascending_none_to_first_1 = df.sort_values('Age', na_position='first')

 

The above snippet generates the below DataFrame.

        Name   Age       City  Gender
3     Chamu   NaN    Chennai  Female
5       Raj   NaN    Chennai    Male
2      Joel  29.0       None    Male
0   Krishna  34.0  Bangalore    Male
1     Sailu  35.0  Hyderabad  Female
4  Jitendra  52.0       None    Male
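The same argument should also work for the City column of this dataset, where the missing entries are None rather than NaN, because pandas treats None as a missing value too. The following is a minimal sketch under that assumption. The variable name sort_by_city_none_to_first is hypothetical and does not appear in the application below.

```python
import pandas as pd

# Name and City columns from the dataset used in this post
df = pd.DataFrame({
    'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', 'Raj'],
    'City': ['Bangalore', 'Hyderabad', None, 'Chennai', None, 'Chennai'],
})

# Sort alphabetically by City and move the rows with a missing City to the top
sort_by_city_none_to_first = df.sort_values('City', na_position='first')

print(sort_by_city_none_to_first)
```

The two rows without a City (Joel and Jitendra) come first, followed by the remaining rows in alphabetical order of City.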

 

Find the complete working application below.

 

missing_values_handling.py

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, np.nan, 52, np.nan],
        'City': ['Bangalore', 'Hyderabad', None, 'Chennai', None, 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male']}
df = pd.DataFrame(data)

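# Sort by Age in ascending order; missing values go last by default
sort_by_age_ascending_1 = df.sort_values('Age')
# Same result, with the default na_position spelled out
sort_by_age_ascending_2 = df.sort_values('Age', na_position='last')

# Sort by Age in ascending order, with missing values moved to the top
sort_by_age_ascending_none_to_first_1 = df.sort_values('Age', na_position='first')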

print('df : \n', df)
print('\nsort_by_age_ascending_1 : \n', sort_by_age_ascending_1)
print('\nsort_by_age_ascending_2 : \n', sort_by_age_ascending_2)
print('\nsort_by_age_ascending_none_to_first_1 : \n', sort_by_age_ascending_none_to_first_1)

Output

df : 
        Name   Age       City  Gender
0   Krishna  34.0  Bangalore    Male
1     Sailu  35.0  Hyderabad  Female
2      Joel  29.0       None    Male
3     Chamu   NaN    Chennai  Female
4  Jitendra  52.0       None    Male
5       Raj   NaN    Chennai    Male

sort_by_age_ascending_1 : 
        Name   Age       City  Gender
2      Joel  29.0       None    Male
0   Krishna  34.0  Bangalore    Male
1     Sailu  35.0  Hyderabad  Female
4  Jitendra  52.0       None    Male
3     Chamu   NaN    Chennai  Female
5       Raj   NaN    Chennai    Male

sort_by_age_ascending_2 : 
        Name   Age       City  Gender
2      Joel  29.0       None    Male
0   Krishna  34.0  Bangalore    Male
1     Sailu  35.0  Hyderabad  Female
4  Jitendra  52.0       None    Male
3     Chamu   NaN    Chennai  Female
5       Raj   NaN    Chennai    Male

sort_by_age_ascending_none_to_first_1 : 
        Name   Age       City  Gender
3     Chamu   NaN    Chennai  Female
5       Raj   NaN    Chennai    Male
2      Joel  29.0       None    Male
0   Krishna  34.0  Bangalore    Male
1     Sailu  35.0  Hyderabad  Female
4  Jitendra  52.0       None    Male


 
