Using the ‘str.split()’ method, we can split the text content of a DataFrame column.
I am using the following data set to demonstrate the examples.
             Name  Age       City                   Hobbies
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking
2     Joel,Chelli  234  Hyderabad   Trekking, reading books
3       Chamu,Maj   35    Chennai                     Chess
4      Gopi,Battu   52  Bangalore                Read Books
5     Siva,Ponnam   34    Chennai                   Cricket
Add two new columns (FirstName, LastName) to the dataset
To do this, we need to split the Name column on the comma separator (,).
Let’s split the Name column data using the ‘str.split’ method.
name_split_series = df['Name'].str.split(',')
‘name_split_series’ points to a Series that contains the following data. Each element is a list of the parts produced by the split.
0    [Krishna, Gurram]
1       [Sailu, Dokku]
2       [Joel, Chelli]
3         [Chamu, Maj]
4        [Gopi, Battu]
5       [Siva, Ponnam]
The following statements extract the first name and last name values from ‘name_split_series’.
first_names_series = name_split_series.str.get(0)
last_names_series = name_split_series.str.get(1)
first_names_series contains the following data:
0 Krishna
1 Sailu
2 Joel
3 Chamu
4 Gopi
5 Siva
last_names_series contains the following data:
0 Gurram
1 Dokku
2 Chelli
3 Maj
4 Battu
5 Ponnam
Let’s assign first_names_series and last_names_series to the columns FirstName and LastName of the original DataFrame.
df['FirstName'] = first_names_series
df['LastName'] = last_names_series
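One caveat worth knowing: if a name has no comma, the split list has only one element, and str.get(1) yields a missing value (NaN) instead of raising an error. The following is a minimal sketch of that behaviour. The two sample names and the variable names (names, name_parts, first_names, last_names) are made up for illustration and are not part of the data set above.

```python
import pandas as pd

# Hypothetical sample: the second name has no comma separator
names = pd.Series(['Krishna,Gurram', 'Chamu'])

# Each element becomes a list of parts
name_parts = names.str.split(',')

first_names = name_parts.str.get(0)
# Position 1 does not exist for 'Chamu', so that row gets a missing value
last_names = name_parts.str.get(1)

print(first_names.tolist())
print(last_names.tolist())
```

If your data can contain such rows, you may want to check for missing values in the new column (for example with isna()) before relying on it.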
Find the complete working application below.
split_text_content.py
import pandas as pd
# Create a sample DataFrame
data = {
    'Name': ['Krishna,Gurram', 'Sailu,Dokku', 'Joel,Chelli', 'Chamu,Maj', 'Gopi,Battu', 'Siva,Ponnam'],
    'Age': [34, 35, 234, 35, 52, 34],
    'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
    'Hobbies': ['Football,Cricket,Tennis', 'Tennis, cricket,Trekking', 'Trekking, reading books', 'Chess', 'Read Books', 'Cricket'],
}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
name_split_series = df['Name'].str.split(',')
first_names_series = name_split_series.str.get(0)
last_names_series = name_split_series.str.get(1)
df['FirstName'] = first_names_series
df['LastName'] = last_names_series
print('\nname_split_series\n',name_split_series)
print('\nfirst_names_series\n',first_names_series)
print('\nlast_names_series\n',last_names_series)
print('\nDataFrame after adding FirstName and LastName columns')
print(df)
Output
Original DataFrame
             Name  Age       City                   Hobbies
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking
2     Joel,Chelli  234  Hyderabad   Trekking, reading books
3       Chamu,Maj   35    Chennai                     Chess
4      Gopi,Battu   52  Bangalore                Read Books
5     Siva,Ponnam   34    Chennai                   Cricket

name_split_series
0    [Krishna, Gurram]
1       [Sailu, Dokku]
2       [Joel, Chelli]
3         [Chamu, Maj]
4        [Gopi, Battu]
5       [Siva, Ponnam]
Name: Name, dtype: object

first_names_series
0    Krishna
1      Sailu
2       Joel
3      Chamu
4       Gopi
5       Siva
Name: Name, dtype: object

last_names_series
0    Gurram
1     Dokku
2    Chelli
3       Maj
4     Battu
5    Ponnam
Name: Name, dtype: object

DataFrame after adding FirstName and LastName columns
             Name  Age       City                   Hobbies  FirstName  LastName
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis    Krishna    Gurram
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking      Sailu     Dokku
2     Joel,Chelli  234  Hyderabad   Trekking, reading books       Joel    Chelli
3       Chamu,Maj   35    Chennai                     Chess      Chamu       Maj
4      Gopi,Battu   52  Bangalore                Read Books       Gopi     Battu
5     Siva,Ponnam   34    Chennai                   Cricket       Siva    Ponnam
Expand the resulting splits into separate columns
By default, the ‘str.split’ method splits each value into a list of parts and returns a Series of lists. By setting the argument ‘expand’ to True, we get the resulting splits as a new DataFrame, with one column per part.
name_split_df = df['Name'].str.split(',', expand=True)
In the above example, name_split_df points to the following DataFrame. Its columns are labelled with the integers 0 and 1.
         0       1
0  Krishna  Gurram
1    Sailu   Dokku
2     Joel  Chelli
3    Chamu     Maj
4     Gopi   Battu
5     Siva  Ponnam
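To make the difference between the two return types concrete, here is a small sketch. The two-row names Series is a hypothetical sample, not the data set used in this post. It also shows that, with expand=True, a row that produces fewer parts than the others is padded with a missing value.

```python
import pandas as pd

# Hypothetical sample: the second value has no comma
names = pd.Series(['Krishna,Gurram', 'Chamu'])

# Default behaviour: a Series whose elements are lists
as_lists = names.str.split(',')

# expand=True: a DataFrame with one column per part
as_frame = names.str.split(',', expand=True)

print(type(as_lists).__name__)   # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame)
```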
The following statement assigns the splits to the FirstName and LastName columns.
df[['FirstName', 'LastName']] = name_split_df
The above statement is equivalent to the following two statements.
df['FirstName'] = name_split_df[0]
df['LastName'] = name_split_df[1]
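As a quick check of that equivalence, the sketch below applies both forms to separate copies of a small DataFrame and compares the results. The two-row sample and the names df_a and df_b are assumptions introduced only for this illustration.

```python
import pandas as pd

# Small sample taken from the Name column of the data set
df = pd.DataFrame({'Name': ['Krishna,Gurram', 'Sailu,Dokku']})
name_split_df = df['Name'].str.split(',', expand=True)

# Form 1: assign both new columns in a single statement
df_a = df.copy()
df_a[['FirstName', 'LastName']] = name_split_df

# Form 2: assign the new columns one at a time
df_b = df.copy()
df_b['FirstName'] = name_split_df[0]
df_b['LastName'] = name_split_df[1]

# Both copies should hold the same columns and values
print(df_a.equals(df_b))
```

Keep in mind that the single-statement form expects the split result to have exactly as many columns as there are target column names.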
Find the complete working application below.
split_text_into_separate_columns.py
import pandas as pd
# Create a sample DataFrame
data = {
    'Name': ['Krishna,Gurram', 'Sailu,Dokku', 'Joel,Chelli', 'Chamu,Maj', 'Gopi,Battu', 'Siva,Ponnam'],
    'Age': [34, 35, 234, 35, 52, 34],
    'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
    'Hobbies': ['Football,Cricket,Tennis', 'Tennis, cricket,Trekking', 'Trekking, reading books', 'Chess', 'Read Books', 'Cricket'],
}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
name_split_df = df['Name'].str.split(',', expand=True)
print('\nname_split_df\n',name_split_df)
# Assign the splits to FirstName and LastName columns
df[['FirstName', 'LastName']] = name_split_df
# You can use below statements also to achieve the same result
# df['FirstName'] = name_split_df[0]
# df['LastName'] = name_split_df[1]
print('\nDataFrame after adding FirstName and LastName columns')
print(df)
Output
Original DataFrame
             Name  Age       City                   Hobbies
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking
2     Joel,Chelli  234  Hyderabad   Trekking, reading books
3       Chamu,Maj   35    Chennai                     Chess
4      Gopi,Battu   52  Bangalore                Read Books
5     Siva,Ponnam   34    Chennai                   Cricket

name_split_df
         0       1
0  Krishna  Gurram
1    Sailu   Dokku
2     Joel  Chelli
3    Chamu     Maj
4     Gopi   Battu
5     Siva  Ponnam

DataFrame after adding FirstName and LastName columns
             Name  Age       City                   Hobbies  FirstName  LastName
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis    Krishna    Gurram
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking      Sailu     Dokku
2     Joel,Chelli  234  Hyderabad   Trekking, reading books       Joel    Chelli
3       Chamu,Maj   35    Chennai                     Chess      Chamu       Maj
4      Gopi,Battu   52  Bangalore                Read Books       Gopi     Battu
5     Siva,Ponnam   34    Chennai                   Cricket       Siva    Ponnam
Limit number of splits
By setting the argument ‘n’ to an integer, we can limit the number of splits. With n=1, each value is split at most once, at the first separator, so it yields at most two parts.
Example
hobbies_split_df = df['Hobbies'].str.split(',', expand=True, n=1)
For the hobby ‘Football,Cricket,Tennis’, the first part contains Football and the second part contains Cricket,Tennis. A value without a comma, such as ‘Chess’, gets None in the second part.
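Note that the separator here is a bare comma, so a value such as ‘Tennis, cricket,Trekking’ keeps the space that follows the first comma. One possible way to clean that up, sketched below under the assumption that surrounding whitespace is not meaningful in your data, is to apply str.strip() to the second part. The hobbies Series holds three values taken from the Hobbies column; the variable names parts and rest are introduced only for this illustration.

```python
import pandas as pd

# Three sample values from the Hobbies column
hobbies = pd.Series(['Football,Cricket,Tennis', 'Tennis, cricket,Trekking', 'Chess'])

# n=1: split only at the first comma, giving at most two parts
parts = hobbies.str.split(',', n=1, expand=True)

# The second part of row 1 still starts with a space
print(repr(parts.loc[1, 1]))

# Remove leading and trailing whitespace; missing values stay missing
rest = parts[1].str.strip()
print(rest.tolist())
```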
Find the complete working application below.
specify_split_count.py
import pandas as pd
# Create a sample DataFrame
data = {
    'Name': ['Krishna,Gurram', 'Sailu,Dokku', 'Joel,Chelli', 'Chamu,Maj', 'Gopi,Battu', 'Siva,Ponnam'],
    'Age': [34, 35, 234, 35, 52, 34],
    'Hobbies': ['Football,Cricket,Tennis', 'Tennis, cricket,Trekking', 'Trekking, reading books', 'Chess', 'Read Books', 'Cricket'],
}
df = pd.DataFrame(data)
print('Original DataFrame')
print(df)
hobbies_split_df = df['Hobbies'].str.split(',', expand=True, n=1)
print('\nhobbies_split_df\n',hobbies_split_df)
# Assign the splits to FirstHobby and RestOfHobbies columns
df[['FirstHobby', 'RestOfHobbies']] = hobbies_split_df
print('\nDataFrame after adding FirstHobby and RestOfHobbies columns')
print(df)
Output
Original DataFrame
             Name  Age                   Hobbies
0  Krishna,Gurram   34   Football,Cricket,Tennis
1     Sailu,Dokku   35  Tennis, cricket,Trekking
2     Joel,Chelli  234   Trekking, reading books
3       Chamu,Maj   35                     Chess
4      Gopi,Battu   52                Read Books
5     Siva,Ponnam   34                   Cricket

hobbies_split_df
            0                 1
0    Football    Cricket,Tennis
1      Tennis  cricket,Trekking
2    Trekking     reading books
3       Chess              None
4  Read Books              None
5     Cricket              None

DataFrame after adding FirstHobby and RestOfHobbies columns
             Name  Age                   Hobbies  FirstHobby     RestOfHobbies
0  Krishna,Gurram   34   Football,Cricket,Tennis    Football    Cricket,Tennis
1     Sailu,Dokku   35  Tennis, cricket,Trekking      Tennis  cricket,Trekking
2     Joel,Chelli  234   Trekking, reading books    Trekking     reading books
3       Chamu,Maj   35                     Chess       Chess              None
4      Gopi,Battu   52                Read Books  Read Books              None
5     Siva,Ponnam   34                   Cricket     Cricket              None