Programming for beginners: Effectively Detecting Duplicate Rows with Pandas duplicated()

‘duplicated()’ method identifies and mark duplicate rows in a DataFrame. It returns a boolean series that indicates whether a given row is duplicate or not.

Let’s use below data set to experiment with duplicated method.

      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Let’s identify duplicate rows using duplicated method.

duplicated_rows_condition = df.duplicated()

In the above example, duplicated() method checks all columns in a row by default and returns a boolean series indicating the duplicate rows. ‘duplicated_rows_condition’

0    False
1    False
2    False
3     True
4     True
5     True
6    False

The row with

a. index 3 and 5 is marked as True because it's a duplicate of the row with index 0.

b. Index 4 is marked as True because it’s a duplicate of the row with index 2.

Find the below working application.

duplicated_method_demo.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}

df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

print('\nDuplicated rows')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

Output

Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Following table summarizes the arguments of duplicated method.

Argument

Descrption

subset

We can specify the columns to be considers while identifying the duplicates.

keep

Specifies which rows are marked as duplicates. It can take the values 'first', 'last', and False.

a. 'first': It is the default option and marks all duplicates as True except the first occurrence of the row.

b. 'last': Marks all duplicates as True except the last occurrence of the row.

c. False: Marks all duplicates as True.

How to specify particular columns to identify duplicates?

Example 1: Duplicated rows by considering "Name" column alone.

duplicated_rows_condition = df.duplicated(subset=['Name'])
duplicated_rows = df[duplicated_rows_condition]

Example 2: Duplicated rows by considering "Name" and "City" columns.

duplicated_rows_condition = df.duplicated(subset=['Name', 'City'])
duplicated_rows = df[duplicated_rows_condition]

Find the below working application.

specify_columns_to_duplicated.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}

df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

print('\nDuplicated rows by considering all the columns')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

print('\nDuplicated rows by considering "Name" column alone')
duplicated_rows_condition = df.duplicated(subset=['Name'])
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

print('\nDuplicated rows by considering "Name" and "City" columns')
duplicated_rows_condition = df.duplicated(subset=['Name', 'City'])
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

Output

Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering all the columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Duplicated rows by considering "Name" column alone
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering "Name" and "City" columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Keep all the duplicated rows

‘keep’ argument specifies which rows are marked as duplicates. It can take the values 'first', 'last', and False.

a. 'first': It is the default option and marks all duplicates as True except the first occurrence of the row.

b. 'last': Marks all duplicates as True except the last occurrence of the row.

c. False: Marks all duplicates as True.

By setting the argument ‘keep’ to False, we can keep all the duplicated rows.

keep_all_duplicates.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}

df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

print('\nDuplicated rows by considering all the columns')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

print('\nDuplicated rows by considering all the columns by keeping all the duplicates')
duplicated_rows_condition = df.duplicated(keep=False)
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

Output

Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering all the columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Duplicated rows by considering all the columns by keeping all the duplicates
      Name  Age       City
0  Krishna   34  Bangalore
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

By applying ~ operator, we can get the non duplicated rows in a DataFrame

Example

duplicated_rows_condition = ~(df.duplicated(keep=False))
duplicated_rows = df[duplicated_rows_condition]

non_duplicated_rows.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}

df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

print('\nDuplicated rows by considering all the columns')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

print('\nNon-Duplicated rows by considering all the columns')
duplicated_rows_condition = ~(df.duplicated(keep=False))
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)

Output

Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering all the columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Non-Duplicated rows by considering all the columns
      Name  Age       City
1    Sailu   35  Hyderabad
6  Krishna   34  Hyderabad

Previous Next Home

Programming for beginners

Sunday, 19 November 2023

Effectively Detecting Duplicate Rows with Pandas duplicated()

No comments:

Post a Comment