A categorical variable represents categories or labels. In general, it takes a fixed or limited number of values.
Examples
- gender : Male, Female
- country_of_origin : India, China, Japan etc.,
- education_qualification : High school, college, graduate, post graduate, doctorate etc.,
- occupation : Doctor, Engineer, Teacher etc.,
- marital_status : single, married, divorced
- product_categories : "Electronics", "Clothing", "Books", "Audio files", "Groceries" etc.,
- color : red, green, yellow etc.,
- survey_rating: "Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied" and "Very Satisfied"
- approval_status: approved, not_approved
- subscription_status: subscribed, not_subscribed
- smoking_status: smoker, non_smoker
Unlike a numerical variable, which represents a quantity, a categorical variable represents a quality or characteristic of the data point.
A categorical variable is also known as a qualitative variable.
Types of categorical variables
Categorical variables are divided into several types.
a. Nominal variable: It represents categories with no inherent order. Examples: gender, country_of_origin, color etc.,
b. Ordinal variable: It represents categories with a natural order. Examples: education_qualification, survey_rating.
c. Binary variable: It is a special kind of categorical variable that has only two states, such as 0/1, yes/no, true/false etc., Examples: approval_status, subscription_status, smoking_status
d. Multi-class variable: It is a categorical variable with more than two categories or levels. It is used to represent situations that two categories cannot describe.
Example:
Car type: SUV, Sedan, Hatchback etc.,
Customer segment: High-Value, Medium-Value, Low-Value etc,
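The difference between nominal and ordinal variables can also be expressed in Pandas. The snippet below is a minimal sketch, assuming a small made-up DataFrame (the sample values and the name rating_levels are for illustration only). It stores color as an unordered category and survey_rating as an ordered one.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "color": ["red", "green", "yellow", "green"],
    "survey_rating": ["Neutral", "Satisfied", "Very Dissatisfied", "Satisfied"],
})

# Nominal: categories without any order.
df["color"] = df["color"].astype("category")

# Ordinal: categories listed explicitly from lowest to highest.
rating_levels = ["Very Dissatisfied", "Dissatisfied", "Neutral",
                 "Satisfied", "Very Satisfied"]
df["survey_rating"] = pd.Categorical(df["survey_rating"],
                                     categories=rating_levels,
                                     ordered=True)

print(df["color"].cat.ordered)          # False
print(df["survey_rating"].cat.ordered)  # True

# Ordered categories support min/max and comparisons.
lowest_rating = df["survey_rating"].min()
at_least_neutral = (df["survey_rating"] >= "Neutral").tolist()
print(lowest_rating)
print(at_least_neutral)
```

Declaring the order explicitly matters because Pandas would otherwise sort the labels alphabetically, which does not match the meaning of the ratings.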
How to identify whether a column is categorical or not?
By examining the data and its characteristics, we can identify the categorical columns. Following are some common ways to identify a categorical column.
a. By checking the data type: Categorical columns are, in general, of type object or category. The following snippet can be used in Pandas to get the categorical columns.
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
b. By checking the number of unique values: In general, a categorical column has a small number of unique values. In Pandas, you can use the nunique() method to count unique values. However, there is no single universal formula for choosing the threshold that decides whether a column is categorical based solely on the number of unique values.
For example, if a column has fewer than 15 unique values in a dataset with 1000 rows, it may be considered categorical. Here 15 is not a fixed number; you can use 20, 25 etc., depending on your domain expertise.
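Before choosing a threshold, it can help to look at the unique counts themselves. The snippet below is a minimal sketch, assuming a small made-up DataFrame (the column names and values are hypothetical). It lists the number of unique values per column together with the share of unique values relative to the row count; the ratio is only a supporting hint and not an established rule.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105, 106],
    "segment": ["High", "Low", "Low", "Medium", "High", "Low"],
    "subscribed": [1, 0, 0, 1, 1, 0],
})

# Number of unique values per column, and that number relative to the row count.
unique_counts = df.nunique()
summary = pd.DataFrame({
    "unique_values": unique_counts,
    "unique_ratio": unique_counts / len(df),
})
print(summary)
```

A column such as customer_id, where every row has its own value, is very unlikely to be categorical, while segment and subscribed repeat a few values across many rows.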
The following snippet finds the categorical columns in a dataframe using approaches a and b.
def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns
    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype
                                              if df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns
                                               if df[col].nunique() < count_threshold]

    # Merge both lists without duplicates, keeping the order
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)
    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns
Find the complete working application below.
categorical_column_identifications.py
import pandas as pd


def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns
    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype
                                              if df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns
                                               if df[col].nunique() < count_threshold]

    # Merge both lists without duplicates, keeping the order
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)
    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns


# Sample DataFrame
data = {
    'Name': ['Harika', 'Krishna', 'Joel', 'Gopi', 'Sailu', 'Raj', 'Ravi', 'Narasimha'],
    'Age': [25, 30, 22, 35, 36, 42, 23, 37],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male'],
    'City': ['Chennai', 'Bangalore', 'Chennai', 'Bangalore', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore']
}

df = pd.DataFrame(data)

# Use a separate name for the result so the function is not shadowed
result = categorical_columns(df, 5)
print(f'categorical_columns : {result}')
Output
categorical_columns : ['Gender', 'City']
c. By using your domain knowledge: Look into the data and use your domain knowledge to identify the categorical columns.
d. By visualizing the distribution of values in a column: Use techniques like frequency counts, histograms, or bar plots to visualize the distribution of values in a column.
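A frequency count is the simplest of these techniques. The snippet below is a minimal sketch that reuses the City values from the sample DataFrame shown earlier and counts how often each value occurs with value_counts(). The variable names counts and shares are for illustration only.

```python
import pandas as pd

# City values taken from the sample DataFrame used earlier in this post.
city = pd.Series(['Chennai', 'Bangalore', 'Chennai', 'Bangalore',
                  'Hyderabad', 'Bangalore', 'Hyderabad', 'Bangalore'],
                 name='City')

# Absolute frequency of each distinct value, most frequent first.
counts = city.value_counts()
print(counts)

# Relative frequency (share of rows) of each distinct value.
shares = city.value_counts(normalize=True)
print(shares)

# With matplotlib installed, counts.plot(kind='bar') draws the same
# information as a bar plot.
```

A column where a handful of values account for all the rows, as here, is a good candidate for a categorical column.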
I am using the dataset below to find the categorical columns visually.
https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package
bar_chart_with_column_unique_values.py
import pandas as pd
import matplotlib.pyplot as plt


def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns
    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype
                                              if df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns
                                               if df[col].nunique() < count_threshold]

    # Merge both lists without duplicates, keeping the order
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)
    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns


df = pd.read_csv('weatherAUS.csv')
threshold = 50

columns = categorical_columns(df, threshold)
print(columns)

# Number of unique values in each candidate column
unique_values = []
for column in columns:
    unique_values.append(df[column].nunique())

total_columns = len(df.columns)
categorical_column_count = len(columns)

plt.bar(columns, unique_values,
        label=f'total rows : {len(df)}\n '
              f'total_columns : {total_columns}\n '
              f'categorical_columns : {categorical_column_count}\n'
              f'threshold_used = {threshold}')
plt.legend()
plt.show()
Output
From the above bar chart, we can see that the following columns are categorical.
- Location
- WindGustDir
- WindDir9am
- WindDir3pm
- RainToday
- RainTomorrow
- WindSpeed9am
- WindSpeed3pm
- Cloud9am
- Cloud3pm
Let’s check the data in these columns to refine the list further.
After inspecting the data, we can remove the following columns from the categorical list, as they do not fit as categorical columns.
- WindSpeed9am
- WindSpeed3pm
- Cloud9am
- Cloud3pm
bar_chart_with_column_unique_values.py
import pandas as pd
import matplotlib.pyplot as plt


def categorical_columns(df, count_threshold):
    # Approach 1: Data Type Inspection
    categorical_columns_dtype = df.select_dtypes(include=['object', 'category']).columns
    categorical_columns_dtype_by_threshold = [col for col in categorical_columns_dtype
                                              if df[col].nunique() < count_threshold]

    # Approach 2: Unique Value Count
    categorical_columns_unique_by_threshold = [col for col in df.columns
                                               if df[col].nunique() < count_threshold]

    # Merge both lists without duplicates, keeping the order
    all_categorical_columns = []
    all_categorical_columns.extend(categorical_columns_dtype_by_threshold)
    for item in categorical_columns_unique_by_threshold:
        if item not in all_categorical_columns:
            all_categorical_columns.append(item)

    return all_categorical_columns


df = pd.read_csv('weatherAUS.csv')
df.info()
threshold = 50

columns = categorical_columns(df, threshold)

# Drop the columns that were judged not to be categorical
columns_to_remove = ['WindSpeed9am', 'WindSpeed3pm', 'Cloud9am', 'Cloud3pm']
columns = [item for item in columns if item not in columns_to_remove]

# Set display options to prevent column trimming
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping
print(df[columns].head())

# Number of unique values in each remaining column
unique_values = []
for column in columns:
    unique_values.append(df[column].nunique())

total_columns = len(df.columns)
categorical_column_count = len(columns)

plt.bar(columns, unique_values,
        label=f'total rows : {len(df)}\n '
              f'total_columns : {total_columns}\n '
              f'categorical_columns : {categorical_column_count}\n'
              f'threshold_used = {threshold}')
plt.legend()
plt.show()
Output