Monday, 13 November 2023

Pandas: assign rank to the DataFrame by a specific column

Using rank method, we can assign ranks to the values in a DataFrame or a specific column.

For example, I have following student dataset and I would like to add ranks to the student based on their percentage.

       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male          96

 

Example 1: Give 1st rank to the student who scores highest and 2nd rank to the next and so on.

df['Rank'] = df['Percentage'].rank(ascending=False)

 

Let’s sort the DataFrame by its Rank.

df = df.sort_values('Rank')

 

Above snippet generate below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
5       Raj   34    Chennai    Male          96   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

 

As you see the above output,  as student ‘Chamu’ scored 100 percentage, her rank is 1.

 

Find the below working application.

 

rank_method_demo.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
        'Percentage': [98, 76, 99, 100, 87, 96]}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

print('\nAssign ranks based on student scores')
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')

print(df)

Output

Original DataFrame
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male          96

Assign ranks based on student scores
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
5       Raj   34    Chennai    Male          96   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

How Pandas calculate the rank ?

‘rank()’ method assign the rank to each value based on its relative positon within the data set. Following are the steps in the rank calculation process.

 

a. sorting: Data is sorted in Ascending order by default while giving the ranking. You can customize the sorting order by passing the parameter ascending to False.

 

b. Ties handling: When there is a tie in the data (like two student got same percentage of marks), then ‘method’ parameter is used to assign the ranks. Following table summarizes the possible values of ‘method’ parameter.

Parameter Value

Description

min

Assign the minimum rank

max

Assign the maximum ranl

first

Assign the rank based on order of appearance

dense

Assign consecutive ranks without any gaps

avergae

This is the default one and the average value rank is assigned to the tied values.

 

Let’s experiment by adding a tie to two students and experiment.

       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male         100

 

As you observe above data set, two students Chamu and Raj got 100 percentage of marks, let’s see how can we handle this tie.

 

Default ranking without method argument

df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')

 

As we do not pass any value to the method argument here, it use ‘average’ to handle the ties.

 

Above snippet generate below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.5
5       Raj   34    Chennai    Male         100   1.5
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

 

Since there is a tie between student (Chamu and Raj), 1 and 2 ranks will be allotted to them, so system takes the average (1 + 2) /2 and assign the them.

 

Ranking with ‘min’ method

df['Rank'] = df['Percentage'].rank(ascending=False, method='min')
df = df.sort_values('Rank')

 

Above snippet generate below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

 

Ranking with ‘max’ method

df['Rank'] = df['Percentage'].rank(ascending=False, method='min')
df = df.sort_values('Rank')

Above snippet generate below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   2.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Ranking with ‘first’ method

df['Rank'] = df['Percentage'].rank(ascending=False, method='first')
df = df.sort_values('Rank')

Above snippet generates below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

As ‘Chamu’ appear in the data set before ‘Raj’, ‘Chamu is given rank 1 and Raj is given 2.

 

Ranking with ‘dense’ method

df['Rank'] = df['Percentage'].rank(ascending=False, method='dense')
df = df.sort_values('Rank')

Above snippet generates below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
4  Jitendra   52  Bangalore    Male          87   4.0
1     Sailu   35  Hyderabad  Female          76   5.0

As you see the ‘dense’ method output, Pandas assign consecutive ranks without any gaps, you can confirm the same with ‘min’ method output, where the rank 2 is missing.

 

It would be great of the ranks are integers than real values, you can convert the float to int using astype method.

df['Rank'] = df['Percentage'].rank(ascending=False, method='dense').astype('int')
df = df.sort_values('Rank')

Above snippet generate below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100     1
5       Raj   34    Chennai    Male         100     1
2      Joel   29  Hyderabad    Male          99     2
0   Krishna   34  Bangalore    Male          98     3
4  Jitendra   52  Bangalore    Male          87     4
1     Sailu   35  Hyderabad  Female          76     5

 

Find the below working application.

 

rank_ties_handling.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
        'Percentage': [98, 76, 99, 100, 87, 100]}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

print('\nAssign ranks based on student percentages')
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')
print(df)

print('\nAssign ranks based on student percentages and method="min"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='min')
df = df.sort_values('Rank')
print(df)

print('\nAssign ranks based on student percentages and method="max"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='max')
df = df.sort_values('Rank')
print(df)

print('\nAssign ranks based on student percentages and method="first"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='first')
df = df.sort_values('Rank')
print(df)

print('\nAssign ranks based on student percentages and method="dense"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='dense')
df = df.sort_values('Rank')
print(df)

print('\nAssign ranks based on student percentages and method="dense and convert the ranks to integers"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='dense').astype('int')
df = df.sort_values('Rank')
print(df)

 

Output

Original DataFrame
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male         100

Assign ranks based on student percentages
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.5
5       Raj   34    Chennai    Male         100   1.5
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="min"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="max"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   2.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="first"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="dense"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
4  Jitendra   52  Bangalore    Male          87   4.0
1     Sailu   35  Hyderabad  Female          76   5.0

Assign ranks based on student percentages and method="dense and convert the ranks to integers"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100     1
5       Raj   34    Chennai    Male         100     1
2      Joel   29  Hyderabad    Male          99     2
0   Krishna   34  Bangalore    Male          98     3
4  Jitendra   52  Bangalore    Male          87     4
1     Sailu   35  Hyderabad  Female          76     5

 

c. Assign ranks

Ranks are assigned post sorting and ties handling steps. Lowest value is assigned with rank 1 by default.

 

d. Missing values handling

Missing values are handled by na_option parameter. Following table summarize the possible values of na_option parameter.

 

Value

Description

keep

Assign NaN rank to NaN values. It is the default value.

top

Assign lowest rank to NaN values

bottom

Assign highest rank to NaN values

 

Let’s experiment with ‘value’ parameter with below dataset.

       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male         NaN
1     Sailu   35  Hyderabad  Female        76.0
2      Joel   29  Hyderabad    Male         NaN
3     Chamu   35    Chennai  Female       100.0
4  Jitendra   52  Bangalore    Male        87.0
5       Raj   34    Chennai    Male        89.0

 

With na_option=keep

df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')

Above snippet generate below data set.

       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   NaN
2      Joel   29  Hyderabad    Male         NaN   NaN

With na_option=top

       Name  Age       City  Gender  Percentage  Rank
0   Krishna   34  Bangalore    Male         NaN   1.5
2      Joel   29  Hyderabad    Male         NaN   1.5
3     Chamu   35    Chennai  Female       100.0   3.0
5       Raj   34    Chennai    Male        89.0   4.0
4  Jitendra   52  Bangalore    Male        87.0   5.0
1     Sailu   35  Hyderabad  Female        76.0   6.0

As you see the output, missing values are assigned with highest ranks.

 

With na_option=bottom

3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   5.5
2      Joel   29  Hyderabad    Male         NaN   5.5

As you see the output, missing values are assigned with lowest ranks.

 

rank_missing_values_handling.py

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
        'Percentage': [np.nan, 76, np.nan, 100, 87, 89]}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

print('\nAssign ranks based on student percentages')
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')
print(df)

print('\nAssign ranks based on student percentages and na_option="top"')
df['Rank'] = df['Percentage'].rank(ascending=False, na_option='top')
df = df.sort_values('Rank')
print(df)

print('\nAssign ranks based on student percentages and na_option="bottom"')
df['Rank'] = df['Percentage'].rank(ascending=False, na_option='bottom')
df = df.sort_values('Rank')
print(df)

Output

Original DataFrame
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male         NaN
1     Sailu   35  Hyderabad  Female        76.0
2      Joel   29  Hyderabad    Male         NaN
3     Chamu   35    Chennai  Female       100.0
4  Jitendra   52  Bangalore    Male        87.0
5       Raj   34    Chennai    Male        89.0

Assign ranks based on student percentages
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   NaN
2      Joel   29  Hyderabad    Male         NaN   NaN

Assign ranks based on student percentages and na_option="top"
       Name  Age       City  Gender  Percentage  Rank
0   Krishna   34  Bangalore    Male         NaN   1.5
2      Joel   29  Hyderabad    Male         NaN   1.5
3     Chamu   35    Chennai  Female       100.0   3.0
5       Raj   34    Chennai    Male        89.0   4.0
4  Jitendra   52  Bangalore    Male        87.0   5.0
1     Sailu   35  Hyderabad  Female        76.0   6.0

Assign ranks based on student percentages and na_option="bottom"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   5.5
2      Joel   29  Hyderabad    Male         NaN   5.5

 

Previous                                                 Next                                                 Home

No comments:

Post a Comment