Data normalization is a pre-processing technique in machine learning that
transforms the numerical features of a dataset to a common scale. This can
improve the performance of many machine learning algorithms, particularly those
based on distances or gradient descent. Without normalization, features with
larger magnitudes can dominate the others. Common normalization techniques
include:
- Min-Max Normalization
- Z-score normalization
- Robust Scaling
- Max Absolute Scaling
- Logarithmic normalization
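To see why a large-magnitude feature can dominate, the following minimal sketch compares Euclidean distances before and after scaling. The three samples (income, age) and the helper min_max_columns are hypothetical and exist only for illustration; the helper assumes every column has at least two distinct values.

```python
import math

# Hypothetical samples: (annual_income, age)
a = (30_000, 25)
b = (30_500, 60)   # similar income to a, very different age
c = (45_000, 26)   # very different income from a, similar age


def min_max_columns(rows):
    """Min-max scale each column of a list of equal-length tuples to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


# On the raw data the income column decides almost everything.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# After scaling, the age difference counts as much as the income difference.
sa, sb, sc = min_max_columns([a, b, c])
scaled_ab = math.dist(sa, sb)
scaled_ac = math.dist(sa, sc)

print("raw distances:   ", raw_ab, raw_ac)
print("scaled distances:", scaled_ab, scaled_ac)
```

On the raw values, a looks far closer to b than to c, even though a and b differ by 35 years in age. After scaling, the two distances are nearly equal.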
Min-Max Normalization
Min-max normalization rescales the data to a fixed range, usually [0, 1].
Formula
X_normalized = (X - X_min) / (X_max - X_min)
The following example applies the formula in plain Python.
min_max.py
data = [27, 98, 123, 4, 6, 897, 643]

min_val = min(data)
max_val = max(data)

normalized_values = []
for i in data:
    normalized_val = (i - min_val) / (max_val - min_val)
    normalized_values.append(normalized_val)

print('actual_value,normalized_value')
for i in range(len(data)):
    print(data[i],',',normalized_values[i])
Output
actual_value,normalized_value
27 , 0.025755879059350503
98 , 0.10526315789473684
123 , 0.13325867861142218
4 , 0.0
6 , 0.0022396416573348264
897 , 1.0
643 , 0.7155655095184771
Min-max normalization using scikit-learn's MinMaxScaler
min_max_scikit.py
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame
data = pd.DataFrame({
    'ratings': [27, 98, 123, 4, 6, 897, 643],
    'product_code': [123, 43, 91, 8, 98, 45, 3]
})

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data using the scaler
normalized_data = scaler.fit_transform(data[['ratings']])
normalized_df = pd.DataFrame(normalized_data, columns=['ratings'])

print('actual_rating,normalized_rating')
for index, row in normalized_df.iterrows():
    print(data.iloc[index]['ratings'],',',row['ratings'])
print()
Output
actual_rating,normalized_rating
27 , 0.025755879059350506
98 , 0.10526315789473685
123 , 0.13325867861142218
4 , 0.0
6 , 0.002239641657334827
897 , 1.0
643 , 0.7155655095184771
In this example, MinMaxScaler normalizes the values in the ‘ratings’ column of the Pandas DataFrame; the ‘product_code’ column is left out of the scaling. The resulting DataFrame keeps the same column name, and its values are scaled to the range [0, 1].
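Min-max scaling is not limited to [0, 1]. As a minimal sketch, the formula can be generalized to any target range [lower, upper] as lower + (X - X_min) * (upper - lower) / (X_max - X_min). The function name min_max_scale and its parameters are hypothetical, chosen only for this illustration.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly so that min(values) -> lower and max(values) -> upper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula would divide by zero when every value is the same.
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]


data = [27, 98, 123, 4, 6, 897, 643]

# Scale to [-1, 1] instead of the default [0, 1].
scaled = min_max_scale(data, lower=-1.0, upper=1.0)

for original, value in zip(data, scaled):
    print(original, ',', value)
```

With the default arguments, the function reduces to the [0, 1] formula used in the examples above.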
Z-score normalization
It normalizes the data such that the mean is 0 and the standard deviation is 1. This is achieved by subtracting the mean from all the values and then dividing by the standard deviation.
X_normalized = (X - mean(X)) / std(X)
z_score_normalization.py
import statistics

# Create a list of numbers
data = [27, 98, 123, 4, 6, 897, 643]

# Calculate the mean
mean = statistics.mean(data)

# Calculate the standard deviation
std_dev = statistics.stdev(data)

normalized_data = []
for i in data:
    normalized_value = (i-mean)/std_dev
    normalized_data.append(normalized_value)

# Print the results
print("mean:", mean)
print("standard deviation:", std_dev)
print('actual_rating,normalized_rating')
for i in range(len(data)):
    print(data[i], ',', normalized_data[i])
Output
mean: 256.85714285714283
standard deviation: 360.9577207797004
actual_rating,normalized_rating
27 , -0.6367979672539797
98 , -0.4400990301966597
123 , -0.3708388410919695
4 , -0.7005173412302946
6 , -0.6949765261019195
897 , 1.7734566135892376
643 , 1.0697730922855857
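Z-score normalization using scikit-learn's StandardScaler. Note that StandardScaler divides by the population standard deviation (dividing by n), whereas statistics.stdev in the previous example computes the sample standard deviation (dividing by n - 1), so the two sets of normalized values differ slightly.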
z_score_normalization_scikit.py
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Create a sample pandas DataFrame
data = pd.DataFrame({
    'ratings': [27, 98, 123, 4, 6, 897, 643],
    'product_code': [123, 43, 91, 8, 98, 45, 3]
})

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data using the scaler
normalized_data = scaler.fit_transform(data[['ratings']])
normalized_df = pd.DataFrame(normalized_data, columns=['ratings'])

print('actual_rating,normalized_rating')
for index, row in normalized_df.iterrows():
    print(data.iloc[index]['ratings'],',',row['ratings'])
Output
actual_rating,normalized_rating
27 , -0.687820417174377
98 , -0.47536128272088707
123 , -0.40055172833585534
4 , -0.7566452072086062
6 , -0.7506604428578036
897 , 1.9155520754247257
643 , 1.155487002872804
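The plain Python and scikit-learn outputs above differ because they use different standard deviations. The following sketch reproduces both results with the standard library only: statistics.stdev is the sample standard deviation and statistics.pstdev is the population standard deviation. The variable names are illustrative.

```python
import statistics

data = [27, 98, 123, 4, 6, 897, 643]
mean = statistics.mean(data)

sample_std = statistics.stdev(data)       # divides by n - 1
population_std = statistics.pstdev(data)  # divides by n

# Z-scores under each definition of the standard deviation.
z_sample = [(x - mean) / sample_std for x in data]
z_population = [(x - mean) / population_std for x in data]

print("sample std:", sample_std)
print("population std:", population_std)
print('actual_rating,z_sample,z_population')
for x, zs, zp in zip(data, z_sample, z_population):
    print(x, ',', zs, ',', zp)
```

The z_sample column matches the plain Python example, and the z_population column matches the StandardScaler output up to floating-point rounding.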
Robust scaling
Robust scaling is similar to Z-Score scaling, but it uses the median and interquartile range (IQR) instead of the mean and standard deviation. This approach is less sensitive to outliers.
Formula
X_robust_scaled = (X - median(X)) / IQR(X)
robust_scaling.py
import statistics
import numpy as np

# Create a list of numbers
data = [27, 98, 123, 4, 6, 897, 643]
sorted_data = sorted(data)

# Calculate the median
median = statistics.median(sorted_data)

# Calculate the interquartile range
q1 = np.percentile(sorted_data, 25)
q3 = np.percentile(sorted_data, 75)
iqr = q3 - q1

normalized_data = []
for i in data:
    normalized_value = (i-median)/iqr
    normalized_data.append(normalized_value)

# Print the results
print("median:", median)
print("iqr:", iqr)
print('actual_rating,normalized_rating')
for i in range(len(data)):
    print(data[i], ',', normalized_data[i])
Output
median: 98
iqr: 366.5
actual_rating,normalized_rating
27 , -0.1937244201909959
98 , 0.0
123 , 0.06821282401091405
4 , -0.25648021828103684
6 , -0.25102319236016374
897 , 2.180081855388813
643 , 1.4870395634379263
The following example applies robust scaling to the same data using NumPy arrays.
robust_scaling_numpy_arrays.py
import numpy as np

# Sample data with outliers
data = np.array([27, 98, 123, 4, 6, 897, 643])

# Calculate the median and IQR
median = np.median(data)
iqr = np.percentile(data, 75) - np.percentile(data, 25)

# Apply robust scaling
robust_scaled_data = (data - median) / iqr

# Print the robust-scaled data
print("Robust Scaled Data:")
print(robust_scaled_data)
Output
Robust Scaled Data:
[-0.19372442 0. 0.06821282 -0.25648022 -0.25102319 2.18008186 1.48703956]
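To illustrate why the median and IQR are less sensitive to outliers than the mean and standard deviation, the sketch below computes all four statistics on a small hypothetical dataset with and without one extreme value. The helper summarize and both datasets are made up for this illustration. It uses statistics.quantiles with method="inclusive", which interpolates linearly in the same way as the np.percentile calls above.

```python
import statistics


def summarize(values):
    """Return the statistics used by z-score scaling and by robust scaling."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


clean = [10, 12, 14, 16, 18]
with_outlier = clean + [1000]  # one extreme value added

clean_stats = summarize(clean)
outlier_stats = summarize(with_outlier)

for name in ("mean", "stdev", "median", "iqr"):
    print(name, ':', clean_stats[name], '->', outlier_stats[name])
```

The single outlier moves the mean from 14 to about 178 and inflates the standard deviation, while the median only moves from 14 to 15 and the IQR from 4 to 5.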
Max Absolute Scaling
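Max absolute scaling divides each feature by its maximum absolute value, so the scaled values fall within the range [-1, 1]. In scikit-learn, it is implemented by the MaxAbsScaler class.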
Formula
scaled_value = value / max_abs_value
The following example applies MaxAbsScaler to a small array. Each column is scaled independently.
max_abs_scaling.py
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Create the training data
X_train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create the scaler
scaler = MaxAbsScaler()

# Fit the scaler to the training data
scaler.fit(X_train)

# Transform the training data
X_train_scaled = scaler.transform(X_train)

print(X_train_scaled)
Output
[[0.14285714 0.25       0.33333333]
 [0.57142857 0.625      0.66666667]
 [1.         1.         1.        ]]
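The example above contains only positive values. As a minimal sketch of how the formula treats negative values, the plain Python function below (max_abs_scale is a hypothetical name, not part of scikit-learn) divides by the largest absolute value, so signs are preserved and results land in [-1, 1].

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the list."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # All values are zero; nothing to scale.
        return [0.0 for _ in values]
    return [v / max_abs for v in values]


scaled = max_abs_scale([-4, 2, 8, -10])
print(scaled)  # [-0.4, 0.2, 0.8, -1.0]
```

Unlike min-max normalization, this does not shift the data, so zero stays at zero.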
Logarithmic normalization
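This technique replaces each feature value with its logarithm, which compresses values that span several orders of magnitude. The logarithm is defined only for positive values.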
Natural logarithm formula
Log Transformation(x)=ln(x)
For base 10 logarithm
Log Transformation(x)=log10(x)
logarithmic_transformation.py
import numpy as np

data = np.array([1, 10, 100, 1000, 10000, 100000, 1000000])

# Perform natural logarithm (base e) transformation
log_transformed_data = np.log(data)

# Perform base 10 logarithm transformation
log10_transformed_data = np.log10(data)

print("Original Data:")
print(data)

print("\nNatural Logarithm Transformation:")
print(log_transformed_data)

print("\nBase 10 Logarithm Transformation:")
print(log10_transformed_data)
Output
Original Data:
[ 1 10 100 1000 10000 100000 1000000]

Natural Logarithm Transformation:
[ 0. 2.30258509 4.60517019 6.90775528 9.21034037 11.51292546 13.81551056]

Base 10 Logarithm Transformation:
[0. 1. 2. 3. 4. 5. 6.]
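Because ln(0) is undefined, a plain log transformation fails on data that contains zeros. One common workaround, which is an addition here and not one of the techniques listed above, is to apply ln(1 + x) instead. The sketch below uses math.log1p and its inverse math.expm1 from the standard library on a hypothetical dataset that includes a zero.

```python
import math

# Hypothetical non-negative data that includes a zero.
data = [0, 9, 99, 999]

# log1p(x) computes ln(1 + x), so zero maps to zero instead of failing.
log1p_values = [math.log1p(x) for x in data]

# expm1 is the inverse: exp(v) - 1 recovers the original value.
recovered = [math.expm1(v) for v in log1p_values]

print("log1p:", log1p_values)
print("recovered:", recovered)
```

This variant still requires x > -1, so it is suited to counts and other non-negative features.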