One-hot encoding is a technique that converts categorical data into a binary matrix representation. It is a popular data pre-processing technique.
Example 1: If you have a categorical variable called "gender" with the values "male" and "female", then one-hot encoding will create two binary features, "gender_male" and "gender_female". Each feature will be 1 if the row has the corresponding category and 0 otherwise.
gender | gender_male | gender_female
male | 1 | 0
female | 0 | 1
female | 0 | 1
male | 1 | 0
Example 2: If you have a categorical variable called "color" with the values "red", "green", and "blue", then one-hot encoding will create three binary features: color_red, color_green, and color_blue. Each feature will be 1 if the row has the corresponding category and 0 otherwise.
color | color_red | color_green | color_blue
red | 1 | 0 | 0
green | 0 | 1 | 0
blue | 0 | 0 | 1
green | 0 | 1 | 0
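Before turning to a library, the idea can be sketched in a few lines of plain Python. This is only an illustration of the mapping shown in the color table above; the helper name one_hot_encode is hypothetical and is not part of pandas or scikit-learn.

```python
def one_hot_encode(values, categories):
    """Return one row of 0/1 flags per value, with one flag per category."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


# Same data and column order as the "color" table above
categories = ["red", "green", "blue"]
colors = ["red", "green", "blue", "green"]

rows = one_hot_encode(colors, categories)
for color, row in zip(colors, rows):
    print(color, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.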
one_hot_encoding.py
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a DataFrame with a categorical column
df = pd.DataFrame(
    {'color': ['red', 'green', 'blue', 'red']}
)

# Create a OneHotEncoder object
encoder = OneHotEncoder()

# Transform the DataFrame
encoded_sparse_matrix = encoder.fit_transform(df)

# The result is a sparse matrix, you can convert it to a dense array
one_hot_encoded_array = encoded_sparse_matrix.toarray()

print("Original categorical data:")
print(df)
print("\nencoded sparse matrix:")
print(type(encoded_sparse_matrix))
print("\nOne-hot encoded data:")
print(one_hot_encoded_array)
Output
Original categorical data:
   color
0    red
1  green
2   blue
3    red

encoded sparse matrix:
<class 'scipy.sparse._csr.csr_matrix'>

One-hot encoded data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
df = pd.DataFrame(
{'color': ['red', 'green', 'blue', 'red']}
)
This code creates a DataFrame with a categorical column called color. The color column has three unique values: red, green, and blue.
encoder = OneHotEncoder()
This creates a OneHotEncoder object, which is used to encode the categorical data in the color column.
encoded_sparse_matrix = encoder.fit_transform(df)
This fits the encoder to the color column and converts its categorical data to one-hot encoded data, returned as a SciPy sparse matrix.
one_hot_encoded_array = encoded_sparse_matrix.toarray()
This converts the sparse matrix to a dense NumPy array. Note that the result is an unlabeled array, not a DataFrame: it has three columns, one per unique value of color. Each column is 1 if the color in that row equals the column's category and 0 otherwise.
In the above output, the columns are in the order blue, green, and red, because by default OneHotEncoder sorts the categories it finds in the data (alphabetically, for strings).
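The column order can be confirmed by inspecting the fitted encoder. The sketch below reuses the same DataFrame and reads the categories_ attribute, which holds one array of categories per input column; the variable name learned_order is just an illustrative choice.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# Fit with default settings so the encoder determines the categories itself
encoder = OneHotEncoder()
encoder.fit(df)

# categories_ has one entry per input column; entry 0 belongs to 'color'.
# Its order is the order of the output columns.
learned_order = list(encoder.categories_[0])
print(learned_order)
```

The first category in this list corresponds to the first column of the encoded array, the second to the second column, and so on.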
But what if I want the order to be red, green, and blue?
Customize the encoding sequence
We can customize the encoding sequence by passing the desired order to the OneHotEncoder constructor through its categories argument. The argument takes a list containing one list of categories per input column.
# Define the desired order of categories
desired_order = ['red', 'green', 'blue']

# Create a OneHotEncoder object
encoder = OneHotEncoder(categories=[desired_order])
The complete working example is below.
one_hot_encoding_maintain_order.py
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a DataFrame with a categorical column
df = pd.DataFrame(
    {'color': ['red', 'green', 'blue', 'red']}
)

# Define the desired order of categories
desired_order = ['red', 'green', 'blue']

# Create a OneHotEncoder object
encoder = OneHotEncoder(categories=[desired_order])

# Transform the DataFrame
encoded_sparse_matrix = encoder.fit_transform(df)

# The result is a sparse matrix, you can convert it to a dense array
one_hot_encoded_array = encoded_sparse_matrix.toarray()

print("Original categorical data:")
print(df)
print("\nencoded sparse matrix:")
print(type(encoded_sparse_matrix))
print("\nOne-hot encoded data:")
print(one_hot_encoded_array)
Output
Original categorical data:
   color
0    red
1  green
2   blue
3    red

encoded sparse matrix:
<class 'scipy.sparse._csr.csr_matrix'>

One-hot encoded data:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]