Monday 26 February 2024

Pandas: Split the text content using split method

Using the ‘str.split()’ method, we can split the text content of a DataFrame column on a separator.

I am using the following data set to demonstrate the examples.

             Name  Age       City                   Hobbies
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking
2     Joel,Chelli  234  Hyderabad   Trekking, reading books
3       Chamu,Maj   35    Chennai                     Chess
4      Gopi,Battu   52  Bangalore                Read Books
5     Siva,Ponnam   34    Chennai                   Cricket

Add two new columns (FirstName, LastName) to the dataset

To do this, we need to split the Name column on the comma separator (,).

Let’s split the Name column using the split method.

name_split_series = df['Name'].str.split(',')

‘name_split_series’ points to a Series that contains the following data.

0    [Krishna, Gurram]
1       [Sailu, Dokku]
2       [Joel, Chelli]
3         [Chamu, Maj]
4        [Gopi, Battu]
5       [Siva, Ponnam]

The following statements extract the first name and last name values from ‘name_split_series’.

first_names_series = name_split_series.str.get(0)
last_names_series = name_split_series.str.get(1)

first_names_series contains the following data.

0    Krishna
1      Sailu
2       Joel
3      Chamu
4       Gopi
5       Siva

last_names_series contains the following data.

0    Gurram
1     Dokku
2    Chelli
3       Maj
4     Battu
5    Ponnam

 

Let’s assign first_names_series and last_names_series to the FirstName and LastName columns of the original DataFrame.

 


df['FirstName'] = first_names_series
df['LastName'] = last_names_series
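
One case worth checking before assigning the columns is a name that has no comma. The sketch below uses a made-up value, 'Madonna', which is not part of the data set above, to show that the element lookup yields a missing value in that case rather than raising an error. It also shows that .str[1] can be used as shorthand for .str.get(1).

```python
import pandas as pd

# 'Madonna' is a hypothetical value with no comma, added only for illustration
names = pd.Series(['Krishna,Gurram', 'Madonna'])

parts = names.str.split(',')

# .str.get(i) picks element i of each list; .str[i] performs the same lookup
first_names = parts.str.get(0)
last_names = parts.str[1]

print(first_names.tolist())
# True marks the row where no last name was found
print(last_names.isna().tolist())
```

If missing last names are a problem for later steps, last_names.isna() can be used to find the affected rows.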

 

Find the complete working application below.

 

split_text_content.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna,Gurram', 'Sailu,Dokku', 'Joel,Chelli', 'Chamu,Maj', 'Gopi,Battu', "Siva,Ponnam"],
        'Age': [34, 35, 234, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Hobbies': ['Football,Cricket,Tennis', 'Tennis, cricket,Trekking', 'Trekking, reading books', 'Chess', 'Read Books', 'Cricket']}

df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

name_split_series = df['Name'].str.split(',')
first_names_series = name_split_series.str.get(0)
last_names_series = name_split_series.str.get(1)

df['FirstName'] = first_names_series
df['LastName'] = last_names_series

print('\nname_split_series\n',name_split_series)
print('\nfirst_names_series\n',first_names_series)
print('\nlast_names_series\n',last_names_series)

print('\nDataFrame after adding FirstName and LastName columns')
print(df) 

 

Output

Original DataFrame
             Name  Age       City                   Hobbies
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking
2     Joel,Chelli  234  Hyderabad   Trekking, reading books
3       Chamu,Maj   35    Chennai                     Chess
4      Gopi,Battu   52  Bangalore                Read Books
5     Siva,Ponnam   34    Chennai                   Cricket

name_split_series
 0    [Krishna, Gurram]
1       [Sailu, Dokku]
2       [Joel, Chelli]
3         [Chamu, Maj]
4        [Gopi, Battu]
5       [Siva, Ponnam]
Name: Name, dtype: object

first_names_series
 0    Krishna
1      Sailu
2       Joel
3      Chamu
4       Gopi
5       Siva
Name: Name, dtype: object

last_names_series
 0    Gurram
1     Dokku
2    Chelli
3       Maj
4     Battu
5    Ponnam
Name: Name, dtype: object

DataFrame after adding FirstName and LastName columns
             Name  Age       City                   Hobbies FirstName LastName
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis   Krishna   Gurram
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking     Sailu    Dokku
2     Joel,Chelli  234  Hyderabad   Trekking, reading books      Joel   Chelli
3       Chamu,Maj   35    Chennai                     Chess     Chamu      Maj
4      Gopi,Battu   52  Bangalore                Read Books      Gopi    Battu
5     Siva,Ponnam   34    Chennai                   Cricket      Siva   Ponnam 

 

Expand the resulting splits into separate columns

By default, the ‘str.split’ method splits each value into a list. By setting the argument expand to True, we get the resulting splits as columns of a new DataFrame instead.

name_split_df = df['Name'].str.split(',', expand=True)

 

In the above example, name_split_df points to the following DataFrame.

          0       1
0  Krishna  Gurram
1    Sailu   Dokku
2     Joel  Chelli
3    Chamu     Maj
4     Gopi   Battu
5     Siva  Ponnam

 

The following statement assigns the splits to the FirstName and LastName columns.

df[['FirstName', 'LastName']] = name_split_df

 

The above statement is equivalent to the following two statements.

 

df['FirstName'] = name_split_df[0]

df['LastName'] = name_split_df[1]
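
Every name in the data set has exactly two parts, so the expanded DataFrame has two columns. As a small illustration of what happens when rows split into different numbers of parts, the sketch below applies expand=True to two values taken from the Hobbies column. The variable names are chosen for this example only.

```python
import pandas as pd

hobbies = pd.Series(['Football,Cricket,Tennis', 'Chess'])

# expand=True creates one column per split position;
# rows with fewer parts are padded with missing values
hobbies_df = hobbies.str.split(',', expand=True)

print(hobbies_df.shape)
print(hobbies_df)
```

The number of columns follows the row with the most parts, so a multi-column assignment such as df[['FirstName', 'LastName']] = ... only works when the column counts match.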

 

Find the complete working application below.

 

split_text_into_separate_columns.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna,Gurram', 'Sailu,Dokku', 'Joel,Chelli', 'Chamu,Maj', 'Gopi,Battu', "Siva,Ponnam"],
        'Age': [34, 35, 234, 35, 52, 34],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Hobbies': ['Football,Cricket,Tennis', 'Tennis, cricket,Trekking', 'Trekking, reading books', 'Chess', 'Read Books', 'Cricket']}

df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

name_split_df = df['Name'].str.split(',', expand=True)
print('\nname_split_df\n',name_split_df)

# Assign the splits to FirstName and LastName columns
df[['FirstName', 'LastName']] = name_split_df

# The statements below achieve the same result
# df['FirstName'] = name_split_df[0]
# df['LastName'] = name_split_df[1]


print('\nDataFrame after adding FirstName and LastName columns')
print(df) 

 

Output

Original DataFrame
             Name  Age       City                   Hobbies
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking
2     Joel,Chelli  234  Hyderabad   Trekking, reading books
3       Chamu,Maj   35    Chennai                     Chess
4      Gopi,Battu   52  Bangalore                Read Books
5     Siva,Ponnam   34    Chennai                   Cricket

name_split_df
          0       1
0  Krishna  Gurram
1    Sailu   Dokku
2     Joel  Chelli
3    Chamu     Maj
4     Gopi   Battu
5     Siva  Ponnam

DataFrame after adding FirstName and LastName columns
             Name  Age       City                   Hobbies FirstName LastName
0  Krishna,Gurram   34  Bangalore   Football,Cricket,Tennis   Krishna   Gurram
1     Sailu,Dokku   35  Hyderabad  Tennis, cricket,Trekking     Sailu    Dokku
2     Joel,Chelli  234  Hyderabad   Trekking, reading books      Joel   Chelli
3       Chamu,Maj   35    Chennai                     Chess     Chamu      Maj
4      Gopi,Battu   52  Bangalore                Read Books      Gopi    Battu
5     Siva,Ponnam   34    Chennai                   Cricket      Siva   Ponnam 

Limit the number of splits

By setting the argument ‘n’ to an integer, we can limit the number of splits performed on each value. The default, n=-1, performs all possible splits.

 

Example

hobbies_split_df = df['Hobbies'].str.split(',', expand=True, n=1)

For the hobby ‘Football,Cricket,Tennis’, the first part contains Football and the second part contains Cricket,Tennis.
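
With n=1, split counts from the left. As a side note not covered above, pandas also provides str.rsplit, which counts splits from the right. The sketch below compares the two on a single hobby value; the variable names are illustrative.

```python
import pandas as pd

hobbies = pd.Series(['Football,Cricket,Tennis'])

# Split once, starting from the left
left_first = hobbies.str.split(',', n=1)

# Split once, starting from the right
right_first = hobbies.str.rsplit(',', n=1)

print(left_first[0])   # ['Football', 'Cricket,Tennis']
print(right_first[0])  # ['Football,Cricket', 'Tennis']
```

rsplit is handy when the last item is the one to separate out, for example the final hobby in the list.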

 

Find the complete working application below.

 

specify_split_count.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna,Gurram', 'Sailu,Dokku', 'Joel,Chelli', 'Chamu,Maj', 'Gopi,Battu', "Siva,Ponnam"],
        'Age': [34, 35, 234, 35, 52, 34],
        'Hobbies': ['Football,Cricket,Tennis', 'Tennis, cricket,Trekking', 'Trekking, reading books', 'Chess', 'Read Books', 'Cricket']}

df = pd.DataFrame(data)
print('Original DataFrame')
print(df)

hobbies_split_df = df['Hobbies'].str.split(',', expand=True, n=1)
print('\nhobbies_split_df\n',hobbies_split_df)

# Assign the splits to FirstHobby and RestOfHobbies columns
df[['FirstHobby', 'RestOfHobbies']] = hobbies_split_df

print('\nDataFrame after adding FirstHobby and RestOfHobbies columns')
print(df)

Output

Original DataFrame
             Name  Age                   Hobbies
0  Krishna,Gurram   34   Football,Cricket,Tennis
1     Sailu,Dokku   35  Tennis, cricket,Trekking
2     Joel,Chelli  234   Trekking, reading books
3       Chamu,Maj   35                     Chess
4      Gopi,Battu   52                Read Books
5     Siva,Ponnam   34                   Cricket

hobbies_split_df
             0                  1
0    Football     Cricket,Tennis
1      Tennis   cricket,Trekking
2    Trekking      reading books
3       Chess               None
4  Read Books               None
5     Cricket               None

DataFrame after adding FirstHobby and RestOfHobbies columns
             Name  Age                   Hobbies  FirstHobby      RestOfHobbies
0  Krishna,Gurram   34   Football,Cricket,Tennis    Football     Cricket,Tennis
1     Sailu,Dokku   35  Tennis, cricket,Trekking      Tennis   cricket,Trekking
2     Joel,Chelli  234   Trekking, reading books    Trekking      reading books
3       Chamu,Maj   35                     Chess       Chess               None
4      Gopi,Battu   52                Read Books  Read Books               None
5     Siva,Ponnam   34                   Cricket     Cricket               None
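
Note that some source values have a space after the comma (for example ‘Tennis, cricket,Trekking’), so the second part keeps a leading space. One possible way to clean this up, shown here as a sketch rather than as part of the application above, is to apply str.strip to each column of the split result.

```python
import pandas as pd

hobbies = pd.Series(['Tennis, cricket,Trekking', 'Chess'])
split_df = hobbies.str.split(',', expand=True, n=1)

# Strip leading/trailing whitespace from every column;
# missing values stay missing
clean_df = split_df.apply(lambda col: col.str.strip())

print(repr(split_df.iloc[0, 1]))
print(repr(clean_df.iloc[0, 1]))
```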

 

 
