Monday, 24 February 2025

Quick guide to feature engineering

Feature engineering is an important step in the machine learning process: it transforms raw data into features that a model can learn from. Done well, feature engineering can significantly improve a model's performance and make its predictions more accurate and reliable.

 

There are many feature engineering techniques, and which ones you can use depends on the dataset and on the machine learning algorithm you plan to apply.

 

Following are the key steps in the feature engineering process.

 

a. Feature selection: Most of the time, we do not need every feature in the dataset. This step selects the most relevant features and discards the irrelevant ones. You can use domain knowledge, or a statistical approach such as computing correlation coefficients, to identify the features that are relevant to the given problem.

 

Suppose we have a dataset with the following features:

 

·      height (cm)

·      weight (kg)

·      age (years)

·      blood_sugar_level (mg/dL)

·      eye_color

·      favorite_sport

 

Here, eye_color and favorite_sport are unlikely to be relevant for predicting diabetes risk, so we remove these irrelevant features based on correlation analysis or domain knowledge.
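As a minimal sketch of correlation-based selection with pandas (the tiny dataset and the has_diabetes target column below are made up for illustration):

import pandas as pd

# Hypothetical data matching the feature list above, plus a target column
df = pd.DataFrame({
    "height": [170, 165, 180, 175],
    "weight": [70, 60, 95, 88],
    "age": [45, 30, 58, 52],
    "blood_sugar_level": [120, 95, 230, 190],
    "has_diabetes": [0, 0, 1, 1],
})

# Correlation of each numeric feature with the target
correlations = df.corr(numeric_only=True)["has_diabetes"].drop("has_diabetes")
print(correlations)

# Keep only features whose absolute correlation exceeds a chosen threshold
selected = correlations[correlations.abs() > 0.3].index.tolist()
print("Selected features:", selected)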

 

b. Feature creation: Create new features from the existing ones. For example, you can derive a body mass index (bmi) column from a person's height and weight.
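For instance, a bmi column can be computed in one line with pandas (column names assumed from the dataset above):

import pandas as pd

df = pd.DataFrame({"height": [170, 165, 180], "weight": [70, 60, 95]})

# BMI = weight (kg) / (height (m))^2; height here is stored in centimetres
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
print(df)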

 

c. Feature transformation: Convert the existing features into a form that machine learning models can consume. Techniques like normalization, scaling, and logarithmic transformation are used in this phase.

 

Suppose blood_sugar_level ranges from 70 to 300 mg/dL, but some machine learning models work better with normally distributed data. We can apply a log transformation to reduce skewness.

Example (base-10 logarithm)

log10(180) ≈ 2.26
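A quick sketch with NumPy, matching the base-10 example above (np.log1p, the natural log of 1 + x, is a common alternative for skewed non-negative data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"blood_sugar_level": [70, 120, 180, 300]})

# Base-10 log transform: log10(180) ≈ 2.26
df["log_blood_sugar"] = np.log10(df["blood_sugar_level"])
print(df)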

 

d. Feature encoding: Categorical data is converted to numbers using encoding techniques such as label encoding, one-hot encoding, and binary encoding.

 

If we have a categorical feature like gender, with the values Male, Female, and Other, machine learning models cannot work with text directly, so we convert it into numerical values.

 

One-Hot Encoding: Convert the category into binary columns

gender    Male   Female   Other
Male        1       0       0
Female      0       1       0
Other       0       0       1

 

Label Encoding: Assign a number to each category

Male    0
Female  1
Other   2
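Both encodings are short with pandas (a sketch; scikit-learn's OneHotEncoder and LabelEncoder are the usual production alternatives):

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["gender"], dtype=int)
print(one_hot)

# Label encoding: map each category to an integer
df["gender_label"] = df["gender"].map({"Male": 0, "Female": 1, "Other": 2})
print(df)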

 

e. Handling missing values: Missing values cause accuracy problems in model predictions. We can handle them in various ways: remove the rows with missing values; replace missing values with the mean, median, or mode; or replace them with some default value.

 

Suppose age has some missing values. We can handle this in different ways:

 

·      Remove rows with missing values (if very few are missing).

·      Replace with the mean: If most ages are around 40, replace missing values with 40.

·      Replace with median or mode (if the data is skewed).
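With pandas, each of these strategies is a one-liner (a minimal sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [40, 38, np.nan, 45, np.nan]})

# Option 1: drop the rows with missing values
dropped = df.dropna(subset=["age"])

# Option 2: replace missing values with the mean
filled_mean = df["age"].fillna(df["age"].mean())

# Option 3: replace with the median (more robust when the data is skewed)
filled_median = df["age"].fillna(df["age"].median())

print(filled_mean.tolist())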

 

f. Feature scaling: Most machine learning algorithms are sensitive to the scale of the input features. You can use techniques like Z-score scaling and min-max scaling to ensure that features have similar ranges.
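A short sketch with scikit-learn (StandardScaler performs Z-score scaling; MinMaxScaler squeezes values into the [0, 1] range):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "age": [25, 40, 55, 70],
    "blood_sugar_level": [80, 120, 200, 300],
})

# Z-score scaling: each column gets mean 0 and standard deviation 1
z_scaled = StandardScaler().fit_transform(df)

# Min-max scaling: each column is rescaled to [0, 1]
mm_scaled = MinMaxScaler().fit_transform(df)

print(z_scaled)
print(mm_scaled)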

 



