Thursday, 27 February 2025

Quick guide to categorical variables in machine learning

 

A categorical variable represents categories or labels. In general, it takes a fixed or limited number of values.

 

Examples

  1. gender is a categorical variable, as it has two values: Male, Female
  2. country_of_origin : India, China, Japan etc.,
  3. education_qualification : High school, college, graduate, post graduate, doctorate etc.,
  4. occupation : Doctor, Engineer, Teacher etc.,
  5. marital_status : single, married, divorced
  6. product_categories : "Electronics", "Clothing", "Books", "Audio files", "Groceries" etc.,
  7. color : red, green, yellow etc.,
  8. survey_rating: "Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied" and "Very Satisfied" 
  9. approval_status: approved, not_approved
  10. subscription_status: subscribed, not_subscribed
  11. smoking_status: smoker, non_smoker

 

Unlike a numerical variable, which represents a quantity, a categorical variable represents a quality or characteristic of the datapoint.

 

A categorical variable is also known as a qualitative variable.

 

Types of categorical variables

Categorical variables are divided into several types.

 

a. Nominal variable: It represents categories with no inherent order. Examples: gender, country_of_origin, color etc.,

 

b. Ordinal variable: It represents categories with a natural order. Examples: education_qualification, survey_rating.

 

c. Binary variable: It is a special case of a categorical variable that has only two states, such as 0/1, yes/no, true/false etc., Examples: approval_status, subscription_status, smoking_status.

 

d. Multi-class variable: It is a categorical variable with more than two categories or levels. Use it when two categories are not enough to represent the situation.

 

Example:

Car type: SUV, Sedan, Hatchback etc.,

Customer segment: High-Value, Medium-Value, Low-Value etc.,
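
The difference between a nominal and an ordinal variable can also be expressed in Pandas. The following is a minimal sketch, not a definitive approach. The sample values are made up for illustration, and the column names color and survey_rating are borrowed from the examples list above.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    'color': ['red', 'green', 'yellow', 'green'],
    'survey_rating': ['Neutral', 'Satisfied', 'Very Dissatisfied', 'Satisfied'],
})

# Nominal: categories without any order
df['color'] = pd.Categorical(df['color'])

# Ordinal: categories with an explicit order, from lowest to highest
rating_order = ['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
df['survey_rating'] = pd.Categorical(df['survey_rating'], categories=rating_order, ordered=True)

print(df['color'].cat.ordered)
print(df['survey_rating'].cat.ordered)

# Ordered categories support min/max and comparisons
lowest_rating = df['survey_rating'].min()
neutral_or_better = (df['survey_rating'] >= 'Neutral').tolist()
print(lowest_rating)
print(neutral_or_better)
```

Because survey_rating is marked as ordered, Pandas can compare its values, which is not meaningful for a nominal column like color.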

 

How to identify whether a column is categorical column or not?

By examining the data and its characteristics, we can identify the categorical columns. The following are some of the common ways to identify a categorical column.

 

a. By checking the data type: Categorical columns are in general of type object or category. The following snippet can be used in Pandas to get the categorical columns.

 

categorical_columns = df.select_dtypes(include=['object', 'category']).columns

 

b. By checking the number of unique values: In general, a categorical column has a small number of unique values compared to the number of rows. In Pandas, you can use the nunique() method to count unique values. But there is no single universal formula for choosing the threshold, so the number of unique values alone cannot decide whether a column is categorical.

 

For example, if a column has fewer than 15 unique values in a dataset with 1000 rows, it may be considered categorical. Here 15 is not a fixed number; you can use 20, 25 etc., so use your domain expertise here.
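
To get a feel for a suitable threshold, it can help to look at the unique value count of every column next to the number of rows. The following is a rough sketch on made-up data. The columns customer_id, city and subscribed are hypothetical, and the ratio is only a helper for inspection, not a rule.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105, 106],
    'city': ['Chennai', 'Bangalore', 'Chennai', 'Hyderabad', 'Bangalore', 'Chennai'],
    'subscribed': [1, 0, 1, 1, 0, 0],
})

# Number of unique values per column
unique_counts = df.nunique()

# Share of unique values relative to the number of rows
unique_ratio = unique_counts / len(df)

summary = pd.DataFrame({'unique_count': unique_counts, 'unique_ratio': unique_ratio})
print(summary)
```

Here customer_id is unique in every row, so it is most likely an identifier, whereas city and subscribed repeat often and are candidates for categorical columns.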

 

The following snippet finds the categorical columns in a dataframe using approaches a and b.

def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns

    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype if df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns if df[col].nunique() < count_threshold]

    # Combine both lists, keeping the order and skipping duplicates
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)

    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns

Find the working application below.

categorical_column_identifications.py

import pandas as pd

def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns

    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype if df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns if df[col].nunique() < count_threshold]

    # Combine both lists, keeping the order and skipping duplicates
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)

    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns

# Sample DataFrame
data = {
    'Name': ['Harika', 'Krishna', 'Joel', 'Gopi', 'Sailu', 'Raj', 'Ravi', 'Narasimha'],
    'Age': [25, 30, 22, 35, 36, 42, 23, 37],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male'],
    'City': ['Chennai', 'Bangalore', 'Chennai', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore']
}

df = pd.DataFrame(data)
# Use a separate name for the result so that the function is not overwritten
result = categorical_columns(df, 5)
print(f'categorical_columns : {result}')

Output

categorical_columns : ['Gender', 'City']

c. Use your domain knowledge: Look into the data and use your domain knowledge to identify the categorical columns.

 

d. By visualizing the distribution of values in a column: Use techniques like frequency counts, histograms, or bar plots to visualize the distribution of values in a column.
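
As a small illustration of frequency counts, the sketch below applies value_counts() to the City and Age columns of the sample DataFrame used earlier. Treat it as one possible way to inspect a column, not as a fixed recipe.

```python
import pandas as pd

# Same City and Age values as in the sample DataFrame above
df = pd.DataFrame({
    'City': ['Chennai', 'Bangalore', 'Chennai', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore'],
    'Age': [25, 30, 22, 35, 36, 42, 23, 37],
})

# Frequency of each distinct value in the column
city_counts = df['City'].value_counts()
age_counts = df['Age'].value_counts()

print(city_counts)
print(age_counts)
```

In City a few values repeat many times, which is typical for a categorical column. In Age every value occurs only once, which hints at a numerical column. If matplotlib is installed, city_counts.plot(kind='bar') draws the same counts as a bar plot.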

 

I am using the below dataset to visually find the categorical columns.

 

https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package

 

 

bar_chart_with_column_unique_values.py

import pandas as pd
import matplotlib.pyplot as plt

def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns

    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype if
                                              df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns if df[col].nunique() < count_threshold]

    # Add elements to the list
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)

    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns


df = pd.read_csv('weatherAUS.csv')

threshold = 50
columns = categorical_columns(df, threshold)
print(columns)
unique_values = []

for column in columns:
    unique_values.append(df[column].nunique())

total_columns = len(df.columns)
# Use a separate name for the count so that the function is not overwritten
categorical_column_count = len(columns)

plt.bar(columns, unique_values, label=f'total rows : {len(df)}\n '
                                      f'total_columns : {total_columns}\n '
                                      f'categorical_columns : {categorical_column_count}\n'
                                      f'threshold_used = {threshold}')
plt.legend()

plt.show()

Output



 

From the above bar chart, we can see that the following columns are identified as categorical.

 

  1. Location
  2. WindGustDir
  3. WindDir9am
  4. WindDir3pm
  5. RainToday
  6. RainTomorrow
  7. WindSpeed9am
  8. WindSpeed3pm
  9. Cloud9am
  10. Cloud3pm

 

Let’s check the data of these categorical columns to tweak the list further.

 

As you can see in the above image, we can remove the following columns from the categorical list, as they do not fit as categorical columns.

  1. WindSpeed9am
  2. WindSpeed3pm
  3. Cloud9am
  4. Cloud3pm
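
One way to shortlist such columns for a manual review is to separate the numeric candidates from the text candidates. The following is only a sketch on made-up rows that mimic a few columns of the dataset. The candidates list is hypothetical, and a numeric column can still be categorical, so the final decision stays with your domain knowledge.

```python
import pandas as pd

# Hypothetical rows that mimic a few columns of the dataset
df = pd.DataFrame({
    'RainToday': ['No', 'Yes', 'No', 'No'],
    'WindDir9am': ['W', 'NNW', 'W', 'SE'],
    'WindSpeed9am': [20.0, 4.0, 19.0, 11.0],
    'Cloud9am': [8.0, 7.0, 8.0, 1.0],
})

# Columns picked up by the unique value threshold (assumed for this sketch)
candidates = ['RainToday', 'WindDir9am', 'WindSpeed9am', 'Cloud9am']

# Numeric candidates need a closer look, as they may be measurements
numeric_candidates = [col for col in candidates if pd.api.types.is_numeric_dtype(df[col])]
text_candidates = [col for col in candidates if col not in numeric_candidates]

print(f'review manually : {numeric_candidates}')
print(f'likely categorical : {text_candidates}')
```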

 

bar_chart_with_column_unique_values.py

import pandas as pd
import matplotlib.pyplot as plt

def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns

    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype if
                                              df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns if df[col].nunique() < count_threshold]

    # Add elements to the list
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)

    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns


df = pd.read_csv('weatherAUS.csv')
df.info()

threshold = 50
columns = categorical_columns(df, threshold)
columns_to_remove = ['WindSpeed9am', 'WindSpeed3pm', 'Cloud9am', 'Cloud3pm']
columns = [item for item in columns if item not in columns_to_remove]

unique_values = []

# Set display options to prevent column trimming
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping

print(df[columns].head())

for column in columns:
    unique_values.append(df[column].nunique())

total_columns = len(df.columns)
# Use a separate name for the count so that the function is not overwritten
categorical_column_count = len(columns)

plt.bar(columns, unique_values, label=f'total rows : {len(df)}\n '
                                      f'total_columns : {total_columns}\n '
                                      f'categorical_columns : {categorical_column_count}\n'
                                      f'threshold_used = {threshold}')
plt.legend()

plt.show()

 

Output

 


 
