We can use the 'drop_duplicates()' method to remove duplicate rows from a DataFrame.
By default, 'drop_duplicates()' keeps the first occurrence of each row and drops the
subsequent occurrences.
Let's use the data set below to experiment with the drop_duplicates method.
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad
df_unique = df.drop_duplicates()
Calling drop_duplicates() returns a new DataFrame (df_unique) in which the duplicates have been removed and only the first occurrence of each unique row is retained. The original DataFrame (df) is left unchanged. 'df_unique' contains the following rows.
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
6  Krishna   34  Hyderabad
By default, the drop_duplicates() method compares the values in all columns when identifying duplicates. That is why row 6 is retained: it differs from the other Krishna rows in the City column.
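To see which rows are treated as duplicates before dropping anything, one option is the companion 'duplicated()' method, which returns a boolean mask. The following is a minimal sketch that rebuilds the sample data from this post; the variable names mask and same are only illustrative.

```python
import pandas as pd

# Same sample data as used in this post.
df = pd.DataFrame({
    'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
    'Age': [34, 35, 29, 34, 29, 34, 34],
    'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore',
             'Hyderabad', 'Bangalore', 'Hyderabad'],
})

# duplicated() returns a boolean Series: True marks a row that repeats an earlier row.
mask = df.duplicated()
print(mask.tolist())

# Filtering with the inverted mask selects the same rows that drop_duplicates() keeps.
same = df[~mask].equals(df.drop_duplicates())
print(same)
```

Rows 3, 4 and 5 are flagged because an identical row appears earlier; row 6 is not flagged because its City value differs.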
Specify a subset of columns while identifying duplicates
You can restrict the comparison to a subset of columns by passing a list of column names to the 'subset' parameter.
Example
df_unique_by_name_and_age = df.drop_duplicates(subset=['Name', 'Age'])
The above snippet removes duplicates based on the 'Name' and 'Age' columns only, and keeps the first occurrence of each unique combination. 'df_unique_by_name_and_age' contains the following rows.
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
We can choose which duplicates to keep with the 'keep' argument. It accepts the values 'first', 'last', and False.
a. 'first': The default option. Drops all duplicates except the first occurrence of the row.
b. 'last': Drops all duplicates except the last occurrence of the row.
c. False: Drops every row that has a duplicate, keeping none of the occurrences.
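The three options can be compared side by side. The sketch below applies each 'keep' value to whole-row duplicates (no 'subset') on the sample data from this post and records which index labels survive; the dictionary name kept is only illustrative.

```python
import pandas as pd

# Same sample data as used in this post.
df = pd.DataFrame({
    'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
    'Age': [34, 35, 29, 34, 29, 34, 34],
    'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore',
             'Hyderabad', 'Bangalore', 'Hyderabad'],
})

# Map each keep option to the index labels of the rows that remain.
kept = {}
for keep in ('first', 'last', False):
    kept[keep] = df.drop_duplicates(keep=keep).index.tolist()
    print(keep, kept[keep])
```

With 'first' the retained labels are 0, 1, 2, 6; with 'last' they are 1, 4, 5, 6; with False only the rows that never repeat (1 and 6) remain.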
For example, setting the 'keep' argument to False drops all the duplicated rows, including their first occurrence.
df_unique_by_name_and_age_keep_none = df.drop_duplicates(subset=['Name', 'Age'], keep=False)
'df_unique_by_name_and_age_keep_none' contains the following rows. Only Sailu remains, because every other Name and Age combination appears more than once.
    Name  Age       City
1  Sailu   35  Hyderabad
df_unique_by_name_and_age_keep_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
'df_unique_by_name_and_age_keep_last' contains the following rows.
      Name  Age       City
1    Sailu   35  Hyderabad
4     Joel   29  Hyderabad
6  Krishna   34  Hyderabad
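Note that the retained rows keep their original index labels (1, 4 and 6 in the result above). If a fresh 0-based index is preferred, drop_duplicates() accepts an 'ignore_index' argument. The sketch below assumes pandas 1.0 or later, where this argument is available; the variable name deduped is only illustrative.

```python
import pandas as pd

# Same sample data as used in this post.
df = pd.DataFrame({
    'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
    'Age': [34, 35, 29, 34, 29, 34, 34],
    'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore',
             'Hyderabad', 'Bangalore', 'Hyderabad'],
})

# ignore_index=True relabels the surviving rows 0, 1, 2, ... instead of keeping 1, 4, 6.
deduped = df.drop_duplicates(subset=['Name', 'Age'], keep='last', ignore_index=True)
print(deduped)
```

On older pandas versions, chaining reset_index(drop=True) after drop_duplicates() should give the same relabelling.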
The complete working application is given below.
drop_duplicates.py
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Krishna', 'Joel', 'Krishna', 'Krishna'],
        'Age': [34, 35, 29, 34, 29, 34, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad']}
df = pd.DataFrame(data)
df_unique = df.drop_duplicates()
print('Original DataFrame')
print(df)
print('\nData after dropping duplicates')
print(df_unique)
df_unique_by_name_and_age = df.drop_duplicates(subset=['Name', 'Age'])
print('\nData set unique by Name and Age')
print(df_unique_by_name_and_age)
df_unique_by_name_and_age_keep_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
print('\nData set unique by Name and Age and keep last values')
print(df_unique_by_name_and_age_keep_last)
df_unique_by_name_and_age_keep_none = df.drop_duplicates(subset=['Name', 'Age'], keep=False)
print('\nData set unique by Name and Age and do not keep duplicate rows')
print(df_unique_by_name_and_age_keep_none)
Output
Original DataFrame
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
3  Krishna   34  Bangalore
4     Joel   29  Hyderabad
5  Krishna   34  Bangalore
6  Krishna   34  Hyderabad

Data after dropping duplicates
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad
6  Krishna   34  Hyderabad

Data set unique by Name and Age
      Name  Age       City
0  Krishna   34  Bangalore
1    Sailu   35  Hyderabad
2     Joel   29  Hyderabad

Data set unique by Name and Age and keep last values
      Name  Age       City
1    Sailu   35  Hyderabad
4     Joel   29  Hyderabad
6  Krishna   34  Hyderabad

Data set unique by Name and Age and do not keep duplicate rows
    Name  Age       City
1  Sailu   35  Hyderabad