Sunday, 2 March 2025

How to use one-hot encoding in machine learning?

One-hot encoding is a technique that converts categorical data into a binary matrix representation. It is one of the most popular data pre-processing techniques.

Example 1: If you have a categorical variable called "gender" with the values "male" and "female", then one-hot encoding will create two binary features, "gender_male" and "gender_female". For each row, a feature is 1 if the row's value is the corresponding category and 0 otherwise.

 

gender    gender_male    gender_female
male      1              0
female    0              1
female    0              1
male      1              0
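The mechanics behind this table can be sketched in a few lines of plain Python. This is only an illustration: the helper name one_hot and the variable names are made up for this example and are not part of any library.

```python
# Hand-rolled one-hot encoding, for illustration only.
def one_hot(values, categories):
    """Return one row per value, with a 1 in the slot of the matching category."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


genders = ["male", "female", "female", "male"]
categories = ["male", "female"]  # column order: gender_male, gender_female

rows = one_hot(genders, categories)
for value, row in zip(genders, rows):
    print(value, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.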

 

Example 2: If you have a categorical variable called "color" with the values "red", "green", and "blue", then one-hot encoding will create three binary features: color_red, color_green, and color_blue. For each row, a feature is 1 if the row's value is the corresponding category and 0 otherwise.

 

color    color_red    color_green    color_blue
red      1            0              0
green    0            1              0
blue     0            0              1
green    0            1              0
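For a quick look at such a table, pandas can produce it directly. The sketch below uses pandas.get_dummies, which is a different tool from the scikit-learn encoder used in the rest of this post. It assumes a reasonably recent pandas version that supports the dtype argument.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies names the new columns <prefix>_<category>;
# dtype=int gives 0/1 values instead of booleans
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

# The columns come back in sorted order, so reorder them to match the table
dummies = dummies[["color_red", "color_green", "color_blue"]]

print(pd.concat([df, dummies], axis=1))
```

get_dummies is convenient for exploration, while an encoder object is usually the better fit when the same encoding has to be reapplied to new data later.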

 

one_hot_encoding.py 

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a DataFrame with a categorical column
df = pd.DataFrame(
    {'color': ['red', 'green', 'blue', 'red']}
)

# Create a OneHotEncoder object
encoder = OneHotEncoder()

# Transform the DataFrame
encoded_sparse_matrix = encoder.fit_transform(df)

# The result is a sparse matrix, you can convert it to a dense array
one_hot_encoded_array = encoded_sparse_matrix.toarray()

print("Original categorical data:")
print(df)

print("\nencoded sparse matrix:")
print(type(encoded_sparse_matrix))

print("\nOne-hot encoded data:")
print(one_hot_encoded_array)

Output

Original categorical data:
   color
0    red
1  green
2   blue
3    red

encoded sparse matrix:
<class 'scipy.sparse._csr.csr_matrix'>

One-hot encoded data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

df = pd.DataFrame(
    {'color': ['red', 'green', 'blue', 'red']}
)

This code first creates a DataFrame with a categorical column called color. The color column has three unique values: red, green, and blue.

 

encoder = OneHotEncoder()

This creates a OneHotEncoder object, which is used to encode the categorical data in the color column.

 

encoded_sparse_matrix = encoder.fit_transform(df)

This fits the encoder to the color column and converts its categorical data to one-hot encoded data, returned as a SciPy sparse matrix.

 

one_hot_encoded_array = encoded_sparse_matrix.toarray()

This converts the sparse matrix to a dense NumPy array. The result is not a DataFrame: it is a plain array with three unnamed columns, one for each of the categories red, green, and blue. A column holds 1 where the color column equals its category and 0 otherwise.

 

In the above output, the columns are in the order blue, green, red, because OneHotEncoder sorts the categories alphabetically by default.
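The order the encoder has settled on can be checked after fitting through its categories_ attribute. A short sketch, where the variable name learned_order is made up for this example:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

encoder = OneHotEncoder()
encoder.fit(df)

# categories_ holds one array per input column, in the same order
# as the columns of the encoded output
learned_order = encoder.categories_[0].tolist()
print(learned_order)  # ['blue', 'green', 'red']
```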

 

But what if I want the order to be red, green, and blue?

 

Customize the encoding sequence

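We can customize the encoding sequence by passing the desired order to the categories parameter of the OneHotEncoder constructor. The parameter takes one list per input column, which is why the order is wrapped in an outer list.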

# Define the desired order of categories
desired_order = ['red', 'green', 'blue']

# Create a OneHotEncoder object
encoder = OneHotEncoder(categories=[desired_order])

 

Find the complete working application below.

 

one_hot_encoding_maintain_order.py

 

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a DataFrame with a categorical column
df = pd.DataFrame(
    {'color': ['red', 'green', 'blue', 'red']}
)

# Define the desired order of categories
desired_order = ['red', 'green', 'blue']

# Create a OneHotEncoder object
encoder = OneHotEncoder(categories=[desired_order])

# Transform the DataFrame
encoded_sparse_matrix = encoder.fit_transform(df)

# The result is a sparse matrix, you can convert it to a dense array
one_hot_encoded_array = encoded_sparse_matrix.toarray()

print("Original categorical data:")
print(df)

print("\nencoded sparse matrix:")
print(type(encoded_sparse_matrix))

print("\nOne-hot encoded data:")
print(one_hot_encoded_array)

Output

Original categorical data:
   color
0    red
1  green
2   blue
3    red

encoded sparse matrix:
<class 'scipy.sparse._csr.csr_matrix'>

One-hot encoded data:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]

 

