Label encoding is a process that converts categorical data into numbers. Each distinct category is assigned a unique integer. This is often required because many machine learning algorithms need numerical input to perform mathematical computations; for example, linear regression, decision trees, and support vector machines all work with numerical data.
Examples of Categorical data
- favorite_color: red, green, yellow, etc.
- gender: male, female
- Marital status: single, married, divorced
- vehicle_type: car, bike, bus
- rating: poor, average, good, excellent
- occupation: doctor, teacher, software engineer
- Religion: Christian, Muslim, Hindu, Buddhist
- education: high school, degree, masters, PhD
Let’s write an example using the scikit-learn library.
label_encoding.py
from sklearn.preprocessing import LabelEncoder

# Sample categorical labels
categorical_labels = ['poor', 'average', 'good', 'average', 'good', 'excellent']

# Creating an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fitting the encoder on the categorical labels and transforming them
encoded_labels = label_encoder.fit_transform(categorical_labels)

print(encoded_labels)
Output
[3 0 2 0 2 1]
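In this example, the LabelEncoder maps each unique label to a corresponding integer. It sorts the unique labels alphabetically and numbers them starting from 0, so the integers do not follow the natural order of the ratings. In this case,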
- 'poor' is mapped to 3,
- 'average' is mapped to 0,
- 'good' is mapped to 2, and
- 'excellent' is mapped to 1
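To check the mapping rather than work it out by hand, you can inspect the fitted encoder's `classes_` attribute. The snippet below is a minimal sketch that reuses the sample labels from the example above; the `mapping` dictionary is just an illustrative helper name.

```python
from sklearn.preprocessing import LabelEncoder

# Same sample labels as in the example above
categorical_labels = ['poor', 'average', 'good', 'average', 'good', 'excellent']

label_encoder = LabelEncoder()
label_encoder.fit(categorical_labels)

# classes_ holds the unique labels in sorted order;
# the position of each label is the integer assigned to it
mapping = {str(label): index for index, label in enumerate(label_encoder.classes_)}

print(mapping)
# {'average': 0, 'excellent': 1, 'good': 2, 'poor': 3}
```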
Advantages of Label encoding
- Easy to implement
- It is a lossless transformation: the original labels can be recovered from the encoded data.
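The lossless property can be seen with the encoder's `inverse_transform` method. The following is a small sketch using the same sample labels as before; `decoded_labels` is an illustrative variable name.

```python
from sklearn.preprocessing import LabelEncoder

categorical_labels = ['poor', 'average', 'good', 'average', 'good', 'excellent']

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(categorical_labels)

# Convert the integers back to the original string labels
decoded_labels = label_encoder.inverse_transform(encoded_labels)

# The round trip returns exactly the labels we started with
print(decoded_labels.tolist() == categorical_labels)
```

Note that the same fitted encoder instance must be kept around, since the mapping lives in the encoder and not in the encoded numbers.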
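Limitations of Label encoding
Label encoding can be ineffective for machine learning algorithms when a feature has many unique categories, because the assigned integers imply an order and a magnitude that the categories may not actually have. In these cases, use other encoding techniques such as one-hot encoding.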
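As a rough illustration of the one-hot alternative, the sketch below uses scikit-learn's `OneHotEncoder` on a few made-up vehicle types. The sample values and variable names are assumptions chosen for this example only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D array: one row per sample, one column per feature
vehicle_types = np.array(['car', 'bike', 'bus', 'car']).reshape(-1, 1)

one_hot_encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; convert it to a dense array
one_hot = one_hot_encoder.fit_transform(vehicle_types).toarray()

# Column order follows the sorted categories: bike, bus, car
print(one_hot_encoder.categories_[0])
print(one_hot)
```

Each row contains a single 1 in the column of its category, so no category appears numerically larger or smaller than another. The trade-off is one extra column per unique category.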