Category columns are used to represent categorical data, that is, values drawn from a small, fixed set of labels such as gender or month. They are mainly used for efficient storage and manipulation of such data.
Example
df['Gender'] = df['Gender'].astype("category")
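To see what this conversion does, the following minimal sketch converts a small Series and inspects the result. The variable names `gender` and `gender_cat` are illustrative only. It assumes the categories are inferred from the data, in which case pandas sorts them and stores each value as a small integer code pointing into that list.

```python
import pandas as pd

# Six values, but only two distinct labels
gender = pd.Series(['Male', 'Female', 'Male', 'Female', 'Male', 'Male'])
gender_cat = gender.astype("category")

print(gender_cat.dtype)                  # category
print(list(gender_cat.cat.categories))   # ['Female', 'Male']
print(gender_cat.cat.codes.tolist())     # [1, 0, 1, 0, 1, 1]
print(gender_cat.cat.codes.dtype)        # int8
```

Each label is stored once in `cat.categories`, and the column itself holds only the integer codes.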
categorical_column.py
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Krishna', 'Sailu', 'Joel', 'Chamu', 'Jitendra', 'Raj'],
        'Age': [34, 35, 29, 41, 52, 31],
        'City': ['Bangalore', 'Hyderabad', 'Hyderabad', 'Chennai', 'Bangalore', 'Chennai'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male']}
df = pd.DataFrame(data)

# Add 10,000,000 rows, alternating between two fixed City/Gender combinations
rows = []
for i in range(10000000):
    if i % 2 == 0:
        new_row = {'Name': 'Name ' + str(i), 'Age': 28, 'City': 'Bangalore', 'Gender': 'Female'}
    else:
        new_row = {'Name': 'Name ' + str(i), 'Age': 28, 'City': 'Hyderabad', 'Gender': 'Male'}
    rows.append(new_row)

# The index is not reset here, so the appended rows are labelled 0 to 9999999
df = pd.concat([df, pd.DataFrame(rows)])
df.info()
print('\nChange the Gender to categorical column\n')
df['Gender'] = df['Gender'].astype("category")
df.info()
In this example, the ‘Gender’ column is initially created as a regular object column. We then convert it to the categorical data type using the astype() method.
Output
<class 'pandas.core.frame.DataFrame'>
Index: 10000006 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -----
 0   Name    object
 1   Age     int64
 2   City    object
 3   Gender  object
dtypes: int64(1), object(3)
memory usage: 381.5+ MB

Change the Gender to categorical column

<class 'pandas.core.frame.DataFrame'>
Index: 10000006 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -----
 0   Name    object
 1   Age     int64
 2   City    object
 3   Gender  category
dtypes: category(1), int64(1), object(2)
memory usage: 314.7+ MB
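As the output above shows, the DataFrame takes 314.7+ MB with ‘Gender’ as a categorical column, compared with 381.5+ MB with ‘Gender’ as an object column. The difference of about 67 MB is roughly 7 bytes per row: on a typical 64-bit build, an object column stores an 8-byte reference for each value, whereas a categorical column with only two categories stores a 1-byte integer code. The "+" means info() reports a lower bound, because it does not measure the strings that the object columns refer to.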
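For a measurement that also counts the string values, `memory_usage(deep=True)` can be used. The sketch below is an illustration rather than part of the original script: it uses a smaller, hypothetical one-million-row Series so that it runs quickly, and forces `dtype=object` so the comparison matches the object column discussed here. The exact figures depend on the Python and pandas versions, so none are shown.

```python
import pandas as pd

# Smaller sample than the 10-million-row DataFrame, to keep the run quick
gender = pd.Series(['Female', 'Male'] * 500_000, dtype=object)

# deep=True asks pandas to include the size of the Python string objects
as_object = gender.memory_usage(index=False, deep=True)
as_category = gender.astype("category").memory_usage(index=False, deep=True)

print(f"object:   {as_object / 1024**2:.1f} MB")
print(f"category: {as_category / 1024**2:.1f} MB")
```

Passing `index=False` leaves the index out of both figures, so only the column data is compared.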
Benefits of categorical column
a. Efficient memory usage
b. Faster computations (for example, when grouping by the column)
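One way to check the second benefit on your own machine is to time the same groupby with an object key and with a categorical key. This is a rough sketch under stated assumptions: the `timed` helper, the `GenderCat` column, and the one-million-row sample are made up for illustration, and a single `perf_counter` measurement is noisy, so the timings will vary by machine and pandas version.

```python
import time
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'Gender': pd.Series(['Female', 'Male'] * (n // 2), dtype=object),
    'Age': [28, 35] * (n // 2),
})
df['GenderCat'] = df['Gender'].astype("category")

def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Same aggregation, keyed once by the object column and once by the categorical one
by_object, t_object = timed(lambda: df.groupby('Gender')['Age'].mean())
by_category, t_category = timed(
    lambda: df.groupby('GenderCat', observed=True)['Age'].mean())

print(f"object key:   {t_object:.4f} s")
print(f"category key: {t_category:.4f} s")
```

Both calls produce the same group means; only the key dtype differs. `observed=True` is passed explicitly so that only categories present in the data appear in the result.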