Programming for beginners: Pandas: categorical columns in a DataFrame

Categorical columns hold data that takes a limited, fixed set of values, such as gender or month. They are mainly used for efficient storage and manipulation of such data.

Example

df['Gender'] = df['Gender'].astype("category")
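To see what this conversion does, here is a minimal sketch on a small Series. The names `gender` and `gender_cat` are made up for illustration, and the values mirror the 'Gender' column used later in this post. After the conversion, pandas keeps each distinct label once and stores a small integer code per row.

```python
import pandas as pd

# Illustrative Series; the variable names are assumptions, not part of the post's script
gender = pd.Series(['Male', 'Female', 'Male', 'Female', 'Male', 'Male'])

# Convert the object column to the categorical data type
gender_cat = gender.astype('category')

print(gender_cat.dtype)                  # category
print(list(gender_cat.cat.categories))   # ['Female', 'Male'] -> each label stored once
print(gender_cat.cat.codes.tolist())     # [1, 0, 1, 0, 1, 1] -> one integer code per row
```

The codes are positions in the sorted list of categories, which is why 'Female' maps to 0 and 'Male' to 1.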

categorical_column.py

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', "Raj"],
        'Age': [34, 35, 29, 41, 52, 31],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male']}

df = pd.DataFrame(data)

# Add 10,000,000 rows, alternating between two city/gender combinations
rows = []
for i in range(10000000):
    if i % 2 == 0:
        new_row = {'Name': 'Name ' + str(i), 'Age': 28, 'City': 'Bangalore', 'Gender': 'Female'}
    else:
        new_row = {'Name': 'Name ' + str(i), 'Age': 28, 'City': 'Hyderabad', 'Gender': 'Male'}
    rows.append(new_row)

# Append the generated rows; the original index labels are kept
df = pd.concat([df, pd.DataFrame(rows)])
df.info()

print('\nChange the Gender to categorical column\n')
df['Gender'] = df['Gender'].astype("category")
df.info()

In this example, the 'Gender' column is initially created as a regular object column. It is then converted to the categorical data type using the astype() method.

Output

<class 'pandas.core.frame.DataFrame'>
Index: 10000006 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Name    object
 1   Age     int64 
 2   City    object
 3   Gender  object
dtypes: int64(1), object(3)
memory usage: 381.5+ MB

Change the Gender to categorical column

<class 'pandas.core.frame.DataFrame'>
Index: 10000006 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype   
---  ------  -----   
 0   Name    object  
 1   Age     int64   
 2   City    object  
 3   Gender  category
dtypes: category(1), int64(1), object(2)
memory usage: 314.7+ MB

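As the output above shows, the DataFrame takes 314.7+ MB once 'Gender' is a categorical column, compared with 381.5+ MB when it is a regular object column. The trailing + means the figure is a lower bound, because by default pandas does not count the string contents of object columns.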
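The figures reported by df.info() are shallow estimates. The sketch below measures the full footprint of a single column with memory_usage(deep=True). It uses a smaller, hypothetical DataFrame (`small_df`, 100,000 rows) so that it runs quickly; the exact byte counts depend on your pandas version and platform, so none are shown here.

```python
import pandas as pd

# Hypothetical, smaller DataFrame so the sketch runs quickly
n = 100_000
small_df = pd.DataFrame({
    'Gender': ['Female' if i % 2 == 0 else 'Male' for i in range(n)],
})

# deep=True also counts the memory used by the string values themselves
object_bytes = small_df['Gender'].memory_usage(deep=True)
category_bytes = small_df['Gender'].astype('category').memory_usage(deep=True)

print(f'object column:      {object_bytes:,} bytes')
print(f'categorical column: {category_bytes:,} bytes')
```

Calling df.info(memory_usage='deep') gives the same kind of deep measurement for the whole DataFrame.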

Benefits of categorical column

a. Efficient memory usage

b. Faster computations
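One rough way to check the speed benefit yourself is to time the same groupby on an object column and on its categorical version. This is only a sketch: the DataFrame `demo`, the helper `mean_age` and the row count are assumptions made for illustration, and the timings vary by machine and pandas version, so the gap may be large, small, or absent on your setup.

```python
import timeit

import pandas as pd

# Hypothetical data: 200,000 rows with two genders and ages from 20 to 59
n = 200_000
demo = pd.DataFrame({
    'Gender': ['Female' if i % 2 == 0 else 'Male' for i in range(n)],
    'Age': [20 + i % 40 for i in range(n)],
})

# Same data, but with 'Gender' stored as a categorical column
demo_cat = demo.assign(Gender=demo['Gender'].astype('category'))


def mean_age(frame):
    # observed=True keeps only the categories that actually occur in the data
    return frame.groupby('Gender', observed=True)['Age'].mean()


# Run each groupby 20 times and report the total time
t_object = timeit.timeit(lambda: mean_age(demo), number=20)
t_category = timeit.timeit(lambda: mean_age(demo_cat), number=20)

print(f'object column:      {t_object:.3f} s')
print(f'categorical column: {t_category:.3f} s')
```

Both calls return the same averages; only the time taken to compute them differs.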

Sunday, 29 October 2023
