Monday 27 November 2023

Pandas: Delete duplicated rows using drop_duplicates method

We can use the 'drop_duplicates()' method to remove duplicate rows from a DataFrame. By default, 'drop_duplicates()' keeps the first occurrence of each row and drops the subsequent duplicates.

Let’s use the data set below to experiment with the drop_duplicates method.

      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

df_unique = df.drop_duplicates()

 

By applying drop_duplicates(), we obtain a new DataFrame (df_unique) in which the duplicates have been removed and only the first occurrence of each unique row is retained. The original DataFrame (df) is left unchanged. ‘df_unique’ contains the following data set.

      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
6  Krishna   34  Hyderabad

By default, the drop_duplicates() method compares the values of all columns when identifying duplicates.
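
To preview which rows would be removed before dropping anything, the related duplicated() method can be used. The following is a minimal sketch, assuming the same sample data as in this post; the variable name 'mask' is only illustrative.

```python
import pandas as pd

# Same sample data as used throughout this post
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)

# True marks a row that repeats an earlier row across all columns
mask = df.duplicated()
print(mask.tolist())

# Filtering out the marked rows gives the same rows as drop_duplicates()
print(df[~mask].equals(df.drop_duplicates()))
```

Rows 3, 4 and 5 are marked as duplicates, which are exactly the rows missing from df_unique above.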

 

Specify subset of columns while identifying duplicates

You can restrict the comparison to a subset of columns by passing a list of column names to the 'subset' parameter.

 

Example

df_unique_by_name_and_age = df.drop_duplicates(subset=['Name', 'Age'])

The above snippet removes duplicates based on the 'Name' and 'Age' columns and keeps the first occurrence of each unique combination. ‘df_unique_by_name_and_age’ contains the following data set.

      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad

We can choose which duplicates to keep with the ‘keep’ argument. It takes the values 'first', 'last', and False.

 

a.   'first': The default option. Drops all duplicates except the first occurrence of the row.

b.   'last': Drops all duplicates except the last occurrence.

c.    False: Drops every row that has a duplicate, keeping none of the occurrences.
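
The three options can be compared side by side by looking at which index labels survive. This is a small sketch, assuming the sample data from this post; the dictionary name 'kept_labels' is made up for the illustration.

```python
import pandas as pd

# Same sample data as used throughout this post
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)

# Collect the index labels that remain for each value of 'keep'
kept_labels = {}
for keep in ('first', 'last', False):
    result = df.drop_duplicates(subset=['Name', 'Age'], keep=keep)
    kept_labels[keep] = result.index.tolist()
    print(repr(keep), '->', kept_labels[keep])
```

With 'first' the labels 0, 1, 2 remain, with 'last' the labels 1, 4, 6 remain, and with False only label 1 (Sailu) remains, because it is the only Name and Age combination that occurs once.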

 

By setting the argument ‘keep’ to False, we drop every row that appears more than once, including its first occurrence.

 

df_unique_by_name_and_age_keep_none = df.drop_duplicates(subset=['Name', 'Age'], keep=False)

df_unique_by_name_and_age_keep_none points to the following data set.

    Name  Age       City
1  Sailu   35  Hyderabad

df_unique_by_name_and_age_keep_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')

df_unique_by_name_and_age_keep_last points to the following data set.

      Name  Age       City
1    Sailu   35  Hyderabad
4     Joel   29  Hyderabad
6  Krishna   34  Hyderabad
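
Note that the retained rows keep their original index labels (1, 4 and 6 here). If a fresh index starting at 0 is preferred, one option is the 'ignore_index' parameter. This is a sketch that assumes pandas 1.0 or later, where that parameter is available; on older versions, chaining reset_index(drop=True) should give the same result. The variable names are illustrative.

```python
import pandas as pd

# Same sample data as used throughout this post
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)

# ignore_index=True relabels the surviving rows 0, 1, 2, ...
df_renumbered = df.drop_duplicates(subset=['Name', 'Age'], keep='last', ignore_index=True)
print(df_renumbered)

# Equivalent approach that also works on older pandas versions
df_reset = df.drop_duplicates(subset=['Name', 'Age'], keep='last').reset_index(drop=True)
print(df_renumbered.equals(df_reset))
```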

Find the complete working application below.

 

drop_duplicates.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}

df = pd.DataFrame(data)
df_unique = df.drop_duplicates()

print('Original DataFrame')
print(df)

print('\nData after dropping duplicates')
print(df_unique)

df_unique_by_name_and_age = df.drop_duplicates(subset=['Name', 'Age'])
print('\nData set unique by Name and Age')
print(df_unique_by_name_and_age)

df_unique_by_name_and_age_keep_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
print('\nData set unique by Name and Age and keep last values')
print(df_unique_by_name_and_age_keep_last)

df_unique_by_name_and_age_keep_none = df.drop_duplicates(subset=['Name', 'Age'], keep=False)
print('\nData set unique by Name and Age and do not keep duplicate rows')
print(df_unique_by_name_and_age_keep_none)

Output

Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Data after dropping duplicates
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
6  Krishna   34  Hyderabad

Data set unique by Name and Age
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad

Data set unique by Name and Age and keep last values
      Name  Age       City
1    Sailu   35  Hyderabad
4     Joel   29  Hyderabad
6  Krishna   34  Hyderabad

Data set unique by Name and Age and do not keep duplicate rows
    Name  Age       City
1  Sailu   35  Hyderabad


