The ‘duplicated()’ method identifies and marks duplicate rows in a DataFrame. It returns a boolean Series that indicates whether each row is a duplicate or not.
Let’s use the data set below to experiment with the duplicated method.
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad
Let’s identify the duplicate rows using the duplicated() method.
duplicated_rows_condition = df.duplicated()
In the above example, the duplicated() method checks all the columns in a row by default and returns a boolean Series indicating the duplicate rows. ‘duplicated_rows_condition’ contains the following Series.
0    False
1    False
2    False
3     True
4     True
5     True
6    False
In the result:
a. The rows with indexes 3 and 5 are marked as True because they are duplicates of the row with index 0.
b. The row with index 4 is marked as True because it’s a duplicate of the row with index 2.
Find the below working application.
duplicated_method_demo.py
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
'Age': [34, 35, 29, 34, 29, 34, 34],
'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
print('\nDuplicated rows')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
Output
Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
The following table summarizes the arguments of the duplicated method.

Argument    Description

subset      We can specify the columns to be considered while identifying the duplicates.

keep        Specifies which rows are marked as duplicates. It can take the values 'first', 'last', and False.
            a. 'first': The default option; marks all duplicates as True except the first occurrence of the row.
            b. 'last': Marks all duplicates as True except the last occurrence of the row.
            c. False: Marks all duplicates as True.
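The 'last' option is not exercised by the applications in this post; as a quick sketch of its behaviour with the same sample data, note that it flips which occurrence of each duplicate group survives unflagged.

```python
import pandas as pd

# Same sample data as used throughout this post
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)

# keep='last' marks every duplicate as True except the LAST occurrence,
# so rows 0, 2 and 3 are flagged instead of rows 3, 4 and 5
print(df[df.duplicated(keep='last')])
```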
How to specify particular columns to identify duplicates?
Example 1: Duplicated rows by considering "Name" column alone.
duplicated_rows_condition = df.duplicated(subset=['Name'])
duplicated_rows = df[duplicated_rows_condition]
Example 2: Duplicated rows by considering "Name" and "City" columns.
duplicated_rows_condition = df.duplicated(subset=['Name', 'City'])
duplicated_rows = df[duplicated_rows_condition]
Find the below working application.
specify_columns_to_duplicated.py
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
'Age': [34, 35, 29, 34, 29, 34, 34],
'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
print('\nDuplicated rows by considering all the columns')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
print('\nDuplicated rows by considering "Name" column alone')
duplicated_rows_condition = df.duplicated(subset=['Name'])
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
print('\nDuplicated rows by considering "Name" and "City" columns')
duplicated_rows_condition = df.duplicated(subset=['Name', 'City'])
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
Output
Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering all the columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Duplicated rows by considering "Name" column alone
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering "Name" and "City" columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
Keep all the duplicated rows
The ‘keep’ argument specifies which rows are marked as duplicates. It can take the values 'first', 'last', and False.
a. 'first': It is the default option and marks all duplicates as True except the first occurrence of the row.
b. 'last': Marks all duplicates as True except the last occurrence of the row.
c. False: Marks all duplicates as True.
By setting the argument ‘keep’ to False, we can get all the duplicated rows, including their first occurrences.
keep_all_duplicates.py
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
'Age': [34, 35, 29, 34, 29, 34, 34],
'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
print('\nDuplicated rows by considering all the columns')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
print('\nDuplicated rows by considering all the columns by keeping all the duplicates')
duplicated_rows_condition = df.duplicated(keep=False)
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
Output
Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering all the columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Duplicated rows by considering all the columns by keeping all the duplicates
      Name  Age       City
0  Krishna   34  Bangalore
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
By applying the ~ (negation) operator, we can get the non-duplicated rows in a DataFrame. Since keep=False marks every occurrence of a duplicate, negating it keeps only the rows that appear exactly once.
Example
duplicated_rows_condition = ~(df.duplicated(keep=False))
duplicated_rows = df[duplicated_rows_condition]
non_duplicated_rows.py
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', "Krishna", "Krishna"],
'Age': [34, 35, 29, 34, 29, 34, 34],
'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
print('\nDuplicated rows by considering all the columns')
duplicated_rows_condition = df.duplicated()
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
print('\nNon-Duplicated rows by considering all the columns')
duplicated_rows_condition = ~(df.duplicated(keep=False))
duplicated_rows = df[duplicated_rows_condition]
print(duplicated_rows)
Output
Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Duplicated rows by considering all the columns
      Name  Age       City
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore

Non-Duplicated rows by considering all the columns
      Name  Age       City
1    Sailu   35  Hyderabad
6  Krishna   34  Hyderabad
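As a side note, pandas also provides the drop_duplicates() method, which accepts the same ‘subset’ and ‘keep’ arguments; df.drop_duplicates(keep=False) is a shorthand for the boolean filter used in the application above. A quick sketch with the same sample data:

```python
import pandas as pd

# Same sample data as used throughout this post
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)

# Both expressions keep only the rows that occur exactly once
via_filter = df[~df.duplicated(keep=False)]
via_drop = df.drop_duplicates(keep=False)
print(via_filter.equals(via_drop))  # True
```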