Using the rank() method, we can assign ranks to the values in a DataFrame or in a specific column.
For example, I have the following student dataset and I would like to rank the students by their percentage.
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male          96
Example 1: Give rank 1 to the student with the highest score, rank 2 to the next, and so on.
df['Rank'] = df['Percentage'].rank(ascending=False)
Let’s sort the DataFrame by its Rank.
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
5       Raj   34    Chennai    Male          96   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0
As the output shows, the student ‘Chamu’ scored 100 percent, so her rank is 1.
Find the complete working application below.
rank_method_demo.py
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
        'Percentage': [98, 76, 99, 100, 87, 96]}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
print('\nAssign ranks based on student scores')
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')
print(df)
Output
Original DataFrame
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male          96

Assign ranks based on student scores
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
5       Raj   34    Chennai    Male          96   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0
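The example above ranks a single column. Since rank() is also available on a DataFrame, here is a minimal sketch of ranking several numeric columns in one call. The three-row dataset and the variable name ranks are illustrative choices, not part of the application above.

```python
import pandas as pd

# A trimmed-down version of the student dataset (illustrative only)
df = pd.DataFrame({
    'Name': ['Krishna', 'Sailu', 'Joel'],
    'Age': [34, 35, 29],
    'Percentage': [98, 76, 99],
})

# Select the numeric columns and rank them together.
# Each column is ranked independently of the others.
ranks = df[['Age', 'Percentage']].rank(ascending=False)
print(ranks)
```

Each column of the result holds the ranks for the matching column of the input, so the oldest student gets rank 1 in Age while the top scorer gets rank 1 in Percentage.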
How does Pandas calculate the rank?
The ‘rank()’ method assigns a rank to each value based on its relative position within the data set. The rank calculation involves the following steps.
a. Sorting: By default, the data is ranked in ascending order. You can reverse the order by setting the ascending parameter to False.
b. Ties handling: When there is a tie in the data (for example, two students got the same percentage), the ‘method’ parameter decides how the ranks are assigned. The following table summarizes the possible values of the ‘method’ parameter.
Parameter Value | Description |
min | Assign the lowest rank of the tied group to every tied value |
max | Assign the highest rank of the tied group to every tied value |
first | Assign ranks in the order the tied values appear in the data |
dense | Like min, but the rank increases by exactly 1 between groups, so there are no gaps |
average | Assign the average rank of the tied group to every tied value. This is the default. |
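To see the five options from the table next to each other, the following sketch ranks one small Series with every method and collects the results in a single DataFrame. The Series values and the names scores, methods and comparison are made up for illustration.

```python
import pandas as pd

# Two students share the top score of 100 (illustrative data)
scores = pd.Series([100, 99, 100, 98],
                   index=['Chamu', 'Joel', 'Raj', 'Krishna'])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column per tie-handling method, highest score ranked first
comparison = pd.DataFrame(
    {m: scores.rank(ascending=False, method=m) for m in methods}
)
comparison.insert(0, 'score', scores)
print(comparison)
```

Only the rows for the tied values differ between the average, min, max and first columns. The dense column also differs for the rows after the tie, because it does not skip rank 2.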
Let’s experiment by introducing a tie between two students.
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male         100
As you can see in the above data set, two students, Chamu and Raj, scored 100 percent. Let’s see how we can handle this tie.
Default ranking without method argument
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')
As we do not pass any value to the method argument here, it uses ‘average’ to handle the ties.
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.5
5       Raj   34    Chennai    Male         100   1.5
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0
Since Chamu and Raj are tied, they occupy ranks 1 and 2 between them, so Pandas assigns each of them the average, (1 + 2) / 2 = 1.5.
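The same arithmetic applies to larger ties. As a small sketch with made-up scores, three students tied at the top occupy ranks 1, 2 and 3, so each of them should receive (1 + 2 + 3) / 3 = 2.0, and the next student should receive rank 4.

```python
import pandas as pd

# Three students share the top score (illustrative data)
scores = pd.Series([100, 100, 100, 90])

# method='average' is the default, so it is not passed explicitly
ranks = scores.rank(ascending=False)

# The tied values occupy positions 1, 2 and 3
expected_tie_rank = (1 + 2 + 3) / 3

print(ranks.tolist())       # ranks for all four scores
print(expected_tie_rank)    # the hand-computed average
```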
Ranking with ‘min’ method
df['Rank'] = df['Percentage'].rank(ascending=False, method='min')
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0
Ranking with ‘max’ method
df['Rank'] = df['Percentage'].rank(ascending=False, method='max')
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   2.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0
Ranking with ‘first’ method
df['Rank'] = df['Percentage'].rank(ascending=False, method='first')
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0
As ‘Chamu’ appears in the data set before ‘Raj’, Chamu is given rank 1 and Raj is given rank 2.
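Because ‘first’ depends on the order of appearance, reordering the rows changes which of the tied students gets the better rank. The sketch below uses a two-row Series (illustrative data and variable names) and ranks it once in its original order and once reversed.

```python
import pandas as pd

# Chamu appears before Raj (illustrative data)
scores = pd.Series([100, 100], index=['Chamu', 'Raj'])

# Same values, but now Raj appears before Chamu
reversed_scores = scores.iloc[::-1]

original_ranks = scores.rank(ascending=False, method='first')
reversed_ranks = reversed_scores.rank(ascending=False, method='first')

print(original_ranks)
print(reversed_ranks)
```

If the tie-breaking order matters for your use case, it is worth sorting the data into a known order before calling rank() with method='first'.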
Ranking with ‘dense’ method
df['Rank'] = df['Percentage'].rank(ascending=False, method='dense')
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
4  Jitendra   52  Bangalore    Male          87   4.0
1     Sailu   35  Hyderabad  Female          76   5.0
As the ‘dense’ method output shows, Pandas assigns consecutive ranks without any gaps. Compare this with the ‘min’ method output, where rank 2 is missing.
It would be nicer if the ranks were integers rather than floats. You can convert them from float to int using the astype method.
df['Rank'] = df['Percentage'].rank(ascending=False, method='dense').astype('int')
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100     1
5       Raj   34    Chennai    Male         100     1
2      Joel   29  Hyderabad    Male          99     2
0   Krishna   34  Bangalore    Male          98     3
4  Jitendra   52  Bangalore    Male          87     4
1     Sailu   35  Hyderabad  Female          76     5
Find the complete working application below.
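One caveat with astype('int'): it only works when every rank is a real number. If the column contains missing values, their rank is NaN by default, and NaN cannot be stored in a plain integer column. A possible workaround, sketched here with made-up data, is the nullable 'Int64' dtype, which keeps the missing rank as <NA>.

```python
import numpy as np
import pandas as pd

# One score is missing (illustrative data)
scores = pd.Series([100, np.nan, 99, 100])

# The missing score gets a NaN rank, so the result stays float
dense = scores.rank(ascending=False, method='dense')

# Converting to a plain int fails because of the NaN
try:
    dense.astype('int')
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable integer dtype keeps the gap as <NA>
nullable = dense.astype('Int64')
print(nullable)
```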
rank_ties_handling.py
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
        'Percentage': [98, 76, 99, 100, 87, 100]}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
print('\nAssign ranks based on student percentages')
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')
print(df)
print('\nAssign ranks based on student percentages and method="min"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='min')
df = df.sort_values('Rank')
print(df)
print('\nAssign ranks based on student percentages and method="max"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='max')
df = df.sort_values('Rank')
print(df)
print('\nAssign ranks based on student percentages and method="first"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='first')
df = df.sort_values('Rank')
print(df)
print('\nAssign ranks based on student percentages and method="dense"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='dense')
df = df.sort_values('Rank')
print(df)
print('\nAssign ranks based on student percentages and method="dense and convert the ranks to integers"')
df['Rank'] = df['Percentage'].rank(ascending=False, method='dense').astype('int')
df = df.sort_values('Rank')
print(df)
Output
Original DataFrame
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male          98
1     Sailu   35  Hyderabad  Female          76
2      Joel   29  Hyderabad    Male          99
3     Chamu   35    Chennai  Female         100
4  Jitendra   52  Bangalore    Male          87
5       Raj   34    Chennai    Male         100

Assign ranks based on student percentages
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.5
5       Raj   34    Chennai    Male         100   1.5
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="min"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="max"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   2.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="first"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   2.0
2      Joel   29  Hyderabad    Male          99   3.0
0   Krishna   34  Bangalore    Male          98   4.0
4  Jitendra   52  Bangalore    Male          87   5.0
1     Sailu   35  Hyderabad  Female          76   6.0

Assign ranks based on student percentages and method="dense"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100   1.0
5       Raj   34    Chennai    Male         100   1.0
2      Joel   29  Hyderabad    Male          99   2.0
0   Krishna   34  Bangalore    Male          98   3.0
4  Jitendra   52  Bangalore    Male          87   4.0
1     Sailu   35  Hyderabad  Female          76   5.0

Assign ranks based on student percentages and method="dense and convert the ranks to integers"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female         100     1
5       Raj   34    Chennai    Male         100     1
2      Joel   29  Hyderabad    Male          99     2
0   Krishna   34  Bangalore    Male          98     3
4  Jitendra   52  Bangalore    Male          87     4
1     Sailu   35  Hyderabad  Female          76     5
c. Assign ranks
Ranks are assigned after the sorting and tie-handling steps. By default, the lowest value is assigned rank 1.
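As a quick check of this default, the sketch below ranks the same three made-up scores twice, once with the default ascending order and once with ascending=False. The variable names are illustrative.

```python
import pandas as pd

scores = pd.Series([98, 76, 99])

# Default: ascending=True, so the lowest value (76) gets rank 1
ascending_ranks = scores.rank()

# ascending=False: the highest value (99) gets rank 1
descending_ranks = scores.rank(ascending=False)

print(ascending_ranks.tolist())
print(descending_ranks.tolist())
```

For a leaderboard-style ranking, where the best score should be rank 1, ascending=False is usually the setting you want, which is why the examples in this post pass it.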
d. Missing values handling
Missing values are handled by the na_option parameter. The following table summarizes the possible values of the na_option parameter.
Value | Description |
keep | Assign NaN rank to NaN values. This is the default value. |
top | Assign the lowest rank numbers (starting at 1) to NaN values, so they come first |
bottom | Assign the highest rank numbers to NaN values, so they come last |
Let’s experiment with the ‘na_option’ parameter using the dataset below.
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male         NaN
1     Sailu   35  Hyderabad  Female        76.0
2      Joel   29  Hyderabad    Male         NaN
3     Chamu   35    Chennai  Female       100.0
4  Jitendra   52  Bangalore    Male        87.0
5       Raj   34    Chennai    Male        89.0
With na_option=keep
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   NaN
2      Joel   29  Hyderabad    Male         NaN   NaN
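With na_option=top
df['Rank'] = df['Percentage'].rank(ascending=False, na_option='top')
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
0   Krishna   34  Bangalore    Male         NaN   1.5
2      Joel   29  Hyderabad    Male         NaN   1.5
3     Chamu   35    Chennai  Female       100.0   3.0
5       Raj   34    Chennai    Male        89.0   4.0
4  Jitendra   52  Bangalore    Male        87.0   5.0
1     Sailu   35  Hyderabad  Female        76.0   6.0
As the output shows, the missing values receive the lowest rank numbers (1.5 each, the average of ranks 1 and 2), so they appear at the top.
With na_option=bottom
df['Rank'] = df['Percentage'].rank(ascending=False, na_option='bottom')
df = df.sort_values('Rank')
The above snippet generates the following data set.
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   5.5
2      Joel   29  Hyderabad    Male         NaN   5.5
As the output shows, the missing values receive the highest rank numbers (5.5 each, the average of ranks 5 and 6), so they appear at the bottom.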
rank_missing_values_handling.py
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male'],
        'Percentage': [np.nan, 76, np.nan, 100, 87, 89]}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
print('\nAssign ranks based on student percentages')
df['Rank'] = df['Percentage'].rank(ascending=False)
df = df.sort_values('Rank')
print(df)
print('\nAssign ranks based on student percentages and na_option="top"')
df['Rank'] = df['Percentage'].rank(ascending=False, na_option='top')
df = df.sort_values('Rank')
print(df)
print('\nAssign ranks based on student percentages and na_option="bottom"')
df['Rank'] = df['Percentage'].rank(ascending=False, na_option='bottom')
df = df.sort_values('Rank')
print(df)
Output
Original DataFrame
       Name  Age       City  Gender  Percentage
0   Krishna   34  Bangalore    Male         NaN
1     Sailu   35  Hyderabad  Female        76.0
2      Joel   29  Hyderabad    Male         NaN
3     Chamu   35    Chennai  Female       100.0
4  Jitendra   52  Bangalore    Male        87.0
5       Raj   34    Chennai    Male        89.0

Assign ranks based on student percentages
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   NaN
2      Joel   29  Hyderabad    Male         NaN   NaN

Assign ranks based on student percentages and na_option="top"
       Name  Age       City  Gender  Percentage  Rank
0   Krishna   34  Bangalore    Male         NaN   1.5
2      Joel   29  Hyderabad    Male         NaN   1.5
3     Chamu   35    Chennai  Female       100.0   3.0
5       Raj   34    Chennai    Male        89.0   4.0
4  Jitendra   52  Bangalore    Male        87.0   5.0
1     Sailu   35  Hyderabad  Female        76.0   6.0

Assign ranks based on student percentages and na_option="bottom"
       Name  Age       City  Gender  Percentage  Rank
3     Chamu   35    Chennai  Female       100.0   1.0
5       Raj   34    Chennai    Male        89.0   2.0
4  Jitendra   52  Bangalore    Male        87.0   3.0
1     Sailu   35  Hyderabad  Female        76.0   4.0
0   Krishna   34  Bangalore    Male         NaN   5.5
2      Joel   29  Hyderabad    Male         NaN   5.5
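For a compact recap of the three na_option values, the following sketch ranks a three-element Series with a single missing value, so no ties are involved and the effect of each option is easy to read off. The data and the variable names keep, top and bottom are illustrative.

```python
import numpy as np
import pandas as pd

# One missing score and two real scores (illustrative data)
scores = pd.Series([np.nan, 76, 100])

# Default na_option='keep': the missing score stays unranked (NaN)
keep = scores.rank(ascending=False)

# 'top': the missing score takes rank 1, the real scores follow
top = scores.rank(ascending=False, na_option='top')

# 'bottom': the real scores are ranked first, the missing score is last
bottom = scores.rank(ascending=False, na_option='bottom')

print(pd.DataFrame({'score': scores, 'keep': keep,
                    'top': top, 'bottom': bottom}))
```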