Programming for beginners: Label Encoding: A Simple Way to Convert Categorical Data for Machine Learning

Label encoding is the process of converting categorical data into numbers. Each unique category is assigned a unique integer. This is often required because many machine learning algorithms perform mathematical computations and therefore need numerical input; for example, linear regression, decision trees, and support vector machines work with numerical data.

Examples of Categorical data

favorite_color: red, green, yellow, etc.
gender: male, female
marital_status: single, married, divorced
vehicle_type: car, bike, bus
rating: poor, average, good, excellent
occupation: doctor, teacher, software engineer
religion: Christian, Muslim, Hindu, Buddhist
education: high school, degree, masters, PhD

Let’s write an example using the scikit-learn library.

label_encoding.py

from sklearn.preprocessing import LabelEncoder

# Sample categorical labels
categorical_labels = ['poor', 'average', 'good', 'average', 'good', 'excellent']

# Creating an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fitting the encoder on the categorical labels and transforming them
encoded_labels = label_encoder.fit_transform(categorical_labels)

print(encoded_labels)

Output

[3 0 2 0 2 1]

In this example, LabelEncoder sorts the unique labels alphabetically and assigns each one an integer, starting from 0. In this case,

'poor' is mapped to 3,
'average' is mapped to 0,
'good' is mapped to 2, and
'excellent' is mapped to 1
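Rather than reading the mapping off the output, you can inspect it on the fitted encoder. The sketch below relies on the `classes_` attribute, where the position of each label is its integer code; the `mapping` variable is only an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

categorical_labels = ['poor', 'average', 'good', 'average', 'good', 'excellent']

label_encoder = LabelEncoder()
label_encoder.fit(categorical_labels)

# classes_ holds the unique labels in sorted order;
# the index of each label is the integer assigned to it
mapping = {str(label): index for index, label in enumerate(label_encoder.classes_)}

print(mapping)  # {'average': 0, 'excellent': 1, 'good': 2, 'poor': 3}
```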

Advantages of Label encoding

Easy to implement
It is a lossless transformation: the original data can be recovered from the encoded data with the encoder's inverse_transform method.
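The round trip can be checked with a short sketch. It reuses the sample labels from the example above; `decoded_labels` is an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

categorical_labels = ['poor', 'average', 'good', 'average', 'good', 'excellent']

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(categorical_labels)

# inverse_transform maps the integers back to the original labels
decoded_labels = label_encoder.inverse_transform(encoded_labels)

print(decoded_labels.tolist())
# ['poor', 'average', 'good', 'average', 'good', 'excellent']
```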

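Limitation of Label encoding

Label encoding assigns integers, and many machine learning algorithms treat those integers as ordered quantities. In the example above, 'poor' (3) ends up larger than 'excellent' (1), which does not reflect the real order of the ratings. The problem becomes more pronounced when you have many unique categories. In these cases, use other encoding techniques such as one-hot encoding.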
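For comparison, here is a minimal sketch of one-hot encoding using scikit-learn's OneHotEncoder. The `colors` sample data is made up for illustration. Each category gets its own 0/1 column, so no order is implied between categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input: one row per sample, one column per feature
colors = np.array(['red', 'green', 'yellow', 'green']).reshape(-1, 1)

one_hot_encoder = OneHotEncoder()

# The default output is a sparse matrix; toarray() makes it a dense array
one_hot = one_hot_encoder.fit_transform(colors).toarray()

# Column order follows the sorted categories: green, red, yellow
print(one_hot_encoder.categories_[0].tolist())
print(one_hot)
```

Each row contains a single 1 in the column of its category. The trade-off is one extra column per unique category, so the encoded data gets wider as the number of categories grows.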


Sunday, 2 March 2025
