Programming for beginners: The Importance of Data Normalization in Machine Learning

Data normalization is a preprocessing technique in machine learning that transforms the numerical features of a dataset to a common scale. This often improves the performance of algorithms that are sensitive to feature magnitude, such as distance-based or gradient-based methods. Without normalization, features with larger magnitudes can dominate the others.
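
To make the dominance effect concrete, here is a small sketch with two hypothetical features, age in years and income in dollars. The records and the assumed feature ranges are invented purely for illustration. It measures how much of the squared Euclidean distance between two records comes from income, before and after scaling both features to [0, 1].

```python
# Hypothetical records: (age in years, income in dollars).
a = (25, 50_000)
b = (65, 52_000)

def income_share(p, q):
    """Fraction of the squared Euclidean distance contributed by income."""
    age_term = (p[0] - q[0]) ** 2
    income_term = (p[1] - q[1]) ** 2
    return income_term / (age_term + income_term)

# Assumed feature ranges, used only for this illustration.
AGE_MIN, AGE_MAX = 18, 90
INCOME_MIN, INCOME_MAX = 20_000, 200_000

def scale(record):
    """Scale one record to [0, 1] per feature using the assumed ranges."""
    age, income = record
    return (
        (age - AGE_MIN) / (AGE_MAX - AGE_MIN),
        (income - INCOME_MIN) / (INCOME_MAX - INCOME_MIN),
    )

raw_share = income_share(a, b)
scaled_share = income_share(scale(a), scale(b))

print("income share of squared distance (raw):   ", raw_share)
print("income share of squared distance (scaled):", scaled_share)
```

On the raw values almost all of the distance comes from income, even though the two ages differ by 40 years. After scaling, the age difference accounts for most of the distance.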

This post covers the following normalization techniques:

Min-Max Normalization
Z-score normalization
Robust scaling
Max Absolute Scaling
Logarithmic normalization

Min-Max Normalization

Min-max normalization rescales the data to a fixed range, usually [0, 1]. The smallest value maps to 0 and the largest value maps to 1.

Formula

X_normalized = (X - X_min) / (X_max - X_min)

The following example applies the formula in plain Python.

min_max.py

# Sample data to normalize
data = [27, 98, 123, 4, 6, 897, 643]
min_val = min(data)
max_val = max(data)

# Apply (x - min) / (max - min) to every value
normalized_values = []

for i in data:
    normalized_val = (i - min_val) / (max_val - min_val)
    normalized_values.append(normalized_val)

# Print each original value next to its normalized value
print('actual_value,normalized_value')
for i in range(len(data)):
    print(data[i],',',normalized_values[i])

Output

actual_value,normalized_value
27 , 0.025755879059350503
98 , 0.10526315789473684
123 , 0.13325867861142218
4 , 0.0
6 , 0.0022396416573348264
897 , 1.0
643 , 0.7155655095184771
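
Because the target range is only "usually" 0 to 1, the same formula can be extended to an arbitrary range. The sketch below is one possible way to do this. The function name `min_max_scale` and its parameters `new_min` and `new_max` are hypothetical, and the handling of constant data is an assumed convention rather than a standard rule.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        # Mapping everything to new_min is an assumed convention.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]

data = [27, 98, 123, 4, 6, 897, 643]

# Scale the same sample data to the range [-1, 1]
scaled = min_max_scale(data, -1.0, 1.0)

for actual, value in zip(data, scaled):
    print(actual, ',', value)
```

With the default arguments the function reduces to the formula above, so the smallest value still maps to 0 and the largest to 1.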

Min-max normalization using scikit-learn's MinMaxScaler

min_max_scikit.py

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame
data = pd.DataFrame({
    'ratings': [27, 98, 123, 4, 6, 897, 643],
    'product_code': [123, 43, 91, 8, 98, 45, 3]
})

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data using the scaler
normalized_data = scaler.fit_transform(data[['ratings']])
normalized_df = pd.DataFrame(normalized_data, columns=['ratings'])

print('actual_rating,normalized_rating')
for index, row in normalized_df.iterrows():
    print(data.iloc[index]['ratings'],',',row['ratings'])

Output

actual_rating,normalized_rating
27 , 0.025755879059350506
98 , 0.10526315789473685
123 , 0.13325867861142218
4 , 0.0
6 , 0.002239641657334827
897 , 1.0
643 , 0.7155655095184771

In this example, MinMaxScaler normalizes the values in the 'ratings' column of the pandas DataFrame. The resulting DataFrame keeps the same column name, and its values are scaled to the range [0, 1]. Apart from floating-point rounding in the last digit, the results match the plain Python version.

Z-score normalization

Z-score normalization (also called standardization) rescales the data so that the mean is 0 and the standard deviation is 1. This is done by subtracting the mean from every value and then dividing the result by the standard deviation.

Formula

X_normalized = (X - mean(X)) / std(X)

z_score_normalization.py

import statistics

# Create a list of numbers
data = [27, 98, 123, 4, 6, 897, 643]

# Calculate the mean
mean = statistics.mean(data)

# Calculate the standard deviation
std_dev = statistics.stdev(data)

normalized_data = []
for i in data:
    normalized_value = (i-mean)/std_dev
    normalized_data.append(normalized_value)

# Print the results
print("mean:", mean)
print("standard deviation:", std_dev)

print('actual_rating,normalized_rating')
for i in range(len(data)):
    print(data[i], ',', normalized_data[i])

Output

mean: 256.85714285714283
standard deviation: 360.9577207797004
actual_rating,normalized_rating
27 , -0.6367979672539797
98 , -0.4400990301966597
123 , -0.3708388410919695
4 , -0.7005173412302946
6 , -0.6949765261019195
897 , 1.7734566135892376
643 , 1.0697730922855857

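Z-score normalization using scikit-learn's StandardScaler. Note that StandardScaler divides by the population standard deviation (divisor n), while statistics.stdev in the previous example computes the sample standard deviation (divisor n - 1), so the normalized values below differ slightly from the earlier output.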

z_score_normalization_scikit.py

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Create a sample pandas DataFrame
data = pd.DataFrame({
    'ratings': [27, 98, 123, 4, 6, 897, 643],
    'product_code': [123, 43, 91, 8, 98, 45, 3]
})

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data using the scaler
normalized_data = scaler.fit_transform(data[['ratings']])
normalized_df = pd.DataFrame(normalized_data, columns=['ratings'])

print('actual_rating,normalized_rating')
for index, row in normalized_df.iterrows():
    print(data.iloc[index]['ratings'],',',row['ratings'])

Output

actual_rating,normalized_rating
27 , -0.687820417174377
98 , -0.47536128272088707
123 , -0.40055172833585534
4 , -0.7566452072086062
6 , -0.7506604428578036
897 , 1.9155520754247257
643 , 1.155487002872804
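
The plain Python output and the StandardScaler output for the same ratings are not identical, and the choice of standard deviation explains the gap. The sketch below uses only the standard library: statistics.stdev divides by n - 1 (sample standard deviation) and statistics.pstdev divides by n (population standard deviation). The variable names are chosen for illustration.

```python
import statistics

data = [27, 98, 123, 4, 6, 897, 643]
mean = statistics.mean(data)

sample_std = statistics.stdev(data)       # divisor n - 1
population_std = statistics.pstdev(data)  # divisor n

# Z-scores computed with each standard deviation
z_sample = [(x - mean) / sample_std for x in data]
z_population = [(x - mean) / population_std for x in data]

print('actual_rating,z_sample,z_population')
for x, zs, zp in zip(data, z_sample, z_population):
    print(x, ',', zs, ',', zp)
```

The z_sample column reproduces the plain Python output, and the z_population column reproduces the StandardScaler output. The difference shrinks as the number of samples grows.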

Robust scaling

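Robust scaling is similar to Z-score normalization, but it uses the median and the interquartile range (IQR, the difference between the 75th and 25th percentiles) instead of the mean and standard deviation. Because the median and IQR are barely affected by extreme values, this approach is less sensitive to outliers.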

Formula

X_robust_scaled = (X - median(X)) / IQR(X)

robust_scaling.py

import statistics
import numpy as np

# Create a list of numbers
data = [27, 98, 123, 4, 6, 897, 643]

sorted_data = sorted(data)

# Calculate the median
median = statistics.median(sorted_data)

# Calculate the interquartile range
q1 = np.percentile(sorted_data, 25)
q3 = np.percentile(sorted_data, 75)
iqr = q3 - q1

normalized_data = []
for i in data:
    normalized_value = (i-median)/iqr
    normalized_data.append(normalized_value)

# Print the results
print("median:", median)
print("iqr:", iqr)

print('actual_rating,normalized_rating')
for i in range(len(data)):
    print(data[i], ',', normalized_data[i])

Output

median: 98
iqr: 366.5
actual_rating,normalized_rating
27 , -0.1937244201909959
98 , 0.0
123 , 0.06821282401091405
4 , -0.25648021828103684
6 , -0.25102319236016374
897 , 2.180081855388813
643 , 1.4870395634379263

The following example applies robust scaling to the same data using NumPy arrays.

robust_scaling_numpy_arrays.py

import numpy as np

# Sample data with outliers
data = np.array( [27, 98, 123, 4, 6, 897, 643])

# Calculate the median and IQR
median = np.median(data)
iqr = np.percentile(data, 75) - np.percentile(data, 25)

# Apply robust scaling
robust_scaled_data = (data - median) / iqr

# Print the robust-scaled data
print("Robust Scaled Data:")
print(robust_scaled_data)

Output

Robust Scaled Data:
[-0.19372442  0.          0.06821282 -0.25648022 -0.25102319  2.18008186
  1.48703956]
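
The claim that the median and IQR are less sensitive to outliers than the mean and standard deviation can be checked directly. The sketch below uses only the standard library and two small hypothetical lists that differ in a single extreme value. The helper name `center_and_spread` is made up for this illustration.

```python
import statistics

def center_and_spread(values):
    """Return (mean, stdev, median, IQR) for a list of numbers."""
    # method='inclusive' interpolates linearly between the data points
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return (
        statistics.mean(values),
        statistics.stdev(values),
        statistics.median(values),
        q3 - q1,
    )

clean = [10, 12, 14, 16, 18]            # hypothetical data
with_outlier = [10, 12, 14, 16, 1000]   # last value replaced by an outlier

for name, values in (("clean", clean), ("with outlier", with_outlier)):
    mean, std, median, iqr = center_and_spread(values)
    print(name, "-> mean:", mean, "stdev:", std, "median:", median, "iqr:", iqr)
```

The single outlier shifts the mean and inflates the standard deviation, while the median and IQR stay the same for this data. This is why robust scaling keeps the bulk of the values on a comparable scale even when a few extreme values are present.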

Max Absolute Scaling

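Max absolute scaling divides each feature by its maximum absolute value, so the scaled values fall in the range [-1, 1]. In scikit-learn it is implemented by the MaxAbsScaler class.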

Formula

scaled_value = value / max_abs_value

The following example applies it with scikit-learn. Each column is scaled independently by its own maximum absolute value.

max_abs_scaling.py

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Create the training data
X_train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create the scaler
scaler = MaxAbsScaler()

# Fit the scaler to the training data
scaler.fit(X_train)

# Transform the training data
X_train_scaled = scaler.transform(X_train)

print(X_train_scaled)

Output

[[0.14285714 0.25       0.33333333]
 [0.57142857 0.625      0.66666667]
 [1.         1.         1.        ]]
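
The training data above contains only positive numbers, so it does not show how negative values behave. The sketch below applies the formula to a single hypothetical feature with mixed signs in plain Python. The function name `max_abs_scale` is made up for this illustration, and the all-zero handling is an assumed convention.

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value in the list."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # Every value is zero; return zeros to avoid dividing by zero
        # (an assumed convention for this sketch).
        return [0.0 for _ in values]
    return [v / max_abs for v in values]

data = [-4, 2, 8, -10]  # hypothetical feature with mixed signs
scaled = max_abs_scale(data)

print(scaled)
```

The signs are preserved and zero stays at zero, because the data is only divided and never shifted.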

Logarithmic normalization

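This technique replaces each value of a feature with its logarithm, which compresses large values and reduces the spread of features that span several orders of magnitude. The logarithm is defined only for positive values.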

Natural logarithm formula

Log Transformation(x) = ln(x)

Base 10 logarithm formula

Log Transformation(x) = log10(x)

logarithmic_transformation.py

import numpy as np

data = np.array([1, 10, 100, 1000, 10000, 100000, 1000000])

# Perform natural logarithm (base e) transformation
log_transformed_data = np.log(data)

# Perform base 10 logarithm transformation
log10_transformed_data = np.log10(data)

print("Original Data:")
print(data)

print("\nNatural Logarithm Transformation:")
print(log_transformed_data)

print("\nBase 10 Logarithm Transformation:")
print(log10_transformed_data)

Output

Original Data:
[      1      10     100    1000   10000  100000 1000000]

Natural Logarithm Transformation:
[ 0.          2.30258509  4.60517019  6.90775528  9.21034037 11.51292546
 13.81551056]

Base 10 Logarithm Transformation:
[0. 1. 2. 3. 4. 5. 6.]
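
The sample data above starts at 1, but real features such as counts often contain zeros, where the logarithm is undefined. One common workaround, shown here as a sketch rather than as part of the original example, is to compute ln(1 + x) instead. The data values are hypothetical; math.log1p and math.expm1 are standard library functions.

```python
import math

data = [0, 9, 99, 999]  # hypothetical counts, including a zero

# math.log(0) raises ValueError, so plain ln(x) cannot be applied to this list.
# math.log1p(x) computes ln(1 + x), which is defined at x = 0.
log1p_transformed = [math.log1p(x) for x in data]

# math.expm1 is the inverse: expm1(log1p(x)) recovers x up to rounding.
recovered = [math.expm1(y) for y in log1p_transformed]

print(log1p_transformed)
print(recovered)
```

Zero maps to zero under this transformation, and the original values can be recovered with the inverse function when needed.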


Monday, 10 March 2025
