The k-means clustering algorithm groups data points into clusters based on their similarity. Here k represents the number of clusters.
Example 1: data points are divided into two clusters.
Example 2: data points are divided into three clusters.
Real world examples of K-means clustering
- Group customers based on their purchase behaviour.
- Group text based on its content. For example, all business-related articles go into the Business section, all sports-related articles go into the Sports section, and so on.
- Detect fraudulent transactions by clustering regular and fraudulent transactions into separate groups.
- Classify mutual fund assets into their respective risk categories (such as moderate risk, high risk, etc.).
How does the k-means algorithm work?
K-means is an iterative algorithm: it repeats the same steps until it converges to a solution (a local optimum, not necessarily the best possible clustering).
Step 1: Choose k centroids randomly in the data set as the initial cluster centroids.
Step 2: Calculate the distance between each datapoint and the k centroids. Assign each data point to the cluster with the nearest centroid.
Step 3: Recalculate the centroid of each cluster. The centroid of a cluster is the mean of all the data points in that cluster.
Step 4: Repeat steps 2 and 3 until the centroids do not change significantly.
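To make these steps concrete, below is a minimal NumPy sketch of the loop. The function name simple_kmeans, the tolerance, and the seed are illustrative choices for this sketch, not from any library.

import numpy as np

def simple_kmeans(points, k, max_iters=100, tol=1e-4, seed=47):
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 4: stop when the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels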
How is the centroid calculated for two-dimensional data points?
Suppose you have the data points below.
[ (2, 3), (4, 5), (1, 12), (6, 9), (5, 11), (3, 14), (8, 25) ]
By calculating the mean (average) of the x-coordinates and the mean of the y-coordinates separately, we can get the coordinates of the centroid.
Mean of x-coordinates: (2 + 4 + 1 + 6 + 5 + 3 + 8) / 7 = 29 / 7 ≈ 4.14
Mean of y-coordinates: (3 + 5 + 12 + 9 + 11 + 14 + 25) / 7 = 79 / 7 ≈ 11.29
So the centroid is (4.14, 11.29).
centroid_in_2d_space.py
import numpy as np
import matplotlib.pyplot as plt

arr = np.array([[2, 3], [4, 5], [1, 12], [6, 9], [5, 11], [3, 14], [8, 25]])

# The centroid is the mean of the x-coordinates and the mean of the y-coordinates
centroid_x = np.mean(arr[:, 0])
centroid_y = np.mean(arr[:, 1])

print(f'centroid_x : {centroid_x}')
print(f'centroid_y : {centroid_y}')

# Plot data points
plt.scatter(arr[:, 0], arr[:, 1], label='Data Points', c='red')

# Plot centroid
plt.scatter(centroid_x, centroid_y, c='blue', marker='x', s=100, label='Centroid')
plt.legend()  # show the labels defined above
plt.show()
Output: a scatter plot of the data points (red) with the centroid marked as a blue ×.
Points to consider while using k-means algorithm
- The number of clusters k must be specified in advance.
- The algorithm is sensitive to the initial choice of centroids (one common mitigation is shown below).
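One common way to reduce the sensitivity to initialization is to run the algorithm several times from different random centroids and keep the best run; scikit-learn's KMeans supports this through its n_init parameter. A minimal sketch follows; the toy data points here are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points (illustrative values only)
points = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]])

# n_init=10 runs k-means 10 times from different initial centroids and
# keeps the run with the lowest inertia (within-cluster sum of squares)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=47)
kmeans.fit(points)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # centroids of the best run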
Let’s apply the k-means clustering technique to the California housing prices dataset (https://www.kaggle.com/datasets/camnugent/california-housing-prices).
Basic Analysis on the data
Get total rows and columns in the dataset.
shape = df.shape
total_rows = shape[0]
total_columns = shape[1]
print(f'\ntotal_rows : {total_rows}')
print(f'total_columns : {total_columns}')
total_rows : 20640
total_columns : 10
Let’s get an idea about the dataset by looking at the column names.
columns = df.columns
print(f'\ncolumns : {columns}')
columns : Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Let’s get detailed information on the dataset by executing the statement below.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
From the output, we can observe the following points.
- total_bedrooms column has some missing values.
- There are 9 columns of type float64, and one column, ocean_proximity, is of type object.
Let’s get an overview of the data by looking at some sample data points.
five_rows = df.head()
print(five_rows)
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY
Let's find the count of missing values column-wise.
column_wise_missing_values = df.isnull().sum()
print(f'column_wise_missing_values : {column_wise_missing_values}')
The above snippet prints the information below.
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
We can see that 207 entries are missing in the total_bedrooms column.
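That is only a small fraction of the rows. A quick check of the ratio (this snippet is an addition for illustration):

missing_ratio = df['total_bedrooms'].isnull().mean()
print(f'missing_ratio : {missing_ratio:.2%}')  # 207 / 20640, about 1%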
Check for duplicated rows
duplicate_count = df.duplicated().sum()
print(f'\nduplicate_count : \n{duplicate_count}')
The above snippet prints the information below.
duplicate_count : 0
Let’s find the unique value counts column-wise.
unique_counts = df.nunique()
missing_values = df.isnull().sum()
result_df = pd.DataFrame({'Data Type': df.dtypes,
                          'unique_count': unique_counts,
                          'missing_values': missing_values})
print(result_df)
The above snippet prints the data below.
Unique counts, missing values count, data types column wise
                   Data Type  unique_count  missing_values
longitude            float64           844               0
latitude             float64           862               0
housing_median_age   float64            52               0
total_rooms          float64          5926               0
total_bedrooms       float64          1923             207
population           float64          3888               0
households           float64          1815               0
median_income        float64         12928               0
median_house_value   float64          3842               0
ocean_proximity       object             5               0
As you can see, the column ‘ocean_proximity’ has 5 unique values. We can encode it before processing.
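Before encoding, we can quickly inspect the distinct categories (a small illustrative check; the values themselves come from the dataset, e.g. NEAR BAY appears in the sample rows above):

# List the distinct categories in the ocean_proximity column
print(df['ocean_proximity'].unique())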
Preprocess the data
Encode the 'ocean_proximity' column data using a label encoder.
label_encoder = LabelEncoder()
df['ocean_proximity'] = label_encoder.fit_transform(df['ocean_proximity'])
Replace missing values in the 'total_bedrooms' column with the mode.
mode = df['total_bedrooms'].dropna().mode()
df['total_bedrooms'].fillna(mode.iloc[0], inplace=True)
Form the clusters
Create an instance of the KMeans algorithm.
no_of_clusters = 5
kmeans = KMeans(n_clusters=no_of_clusters, random_state=47)
Fit the KMeans model to the data
kmeans.fit(df)
Get the centroids and labels.
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
Plot a diagram to visualize the clusters. Since visualizing high-dimensional data directly is challenging, we display a scatter plot of just two of the features (housing_median_age and median_house_value).
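The corresponding plotting snippet (the same code used in the complete application below):

plt.scatter(df['housing_median_age'], df['median_house_value'], c=labels, cmap='viridis', s=50)
plt.title('K-Means Clustering (First Two Dimensions)')
plt.xlabel('housing_median_age')
plt.ylabel('median_house_value')
plt.show()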
The complete working application is below.
cluster_houses_info.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def basic_analysis(df):
    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]
    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    five_rows = df.head()
    print(f'five_rows : \n{five_rows}')

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : \n{column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'\nduplicate_count : \n{duplicate_count}')

    # Print unique counts, missing value counts, and data types column wise
    print('\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes,
                              'unique_count': unique_counts,
                              'missing_values': missing_values})
    print(result_df)

    # Summary statistics for the numerical columns
    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    print(round(df[numerical_columns].describe()))

def preprocess_data(df):
    # Encode ocean_proximity column
    label_encoder = LabelEncoder()
    df['ocean_proximity'] = label_encoder.fit_transform(df['ocean_proximity'])

    # Replace missing values in 'total_bedrooms' column with mode
    mode = df['total_bedrooms'].dropna().mode()
    print(f'total_bedrooms mode : {mode.iloc[0]}\n')
    df['total_bedrooms'].fillna(mode.iloc[0], inplace=True)

def cluster(df):
    no_of_clusters = 5
    kmeans = KMeans(n_clusters=no_of_clusters, random_state=47)

    # Fit the KMeans model to the data
    kmeans.fit(df)

    # Get cluster centroids and labels
    centroids = kmeans.cluster_centers_
    # print(f'centroids : {centroids}')
    labels = kmeans.labels_

    # Since visualizing high-dimensional data directly is challenging,
    # we display a scatter plot of two of the features only
    plt.scatter(df['housing_median_age'], df['median_house_value'], c=labels, cmap='viridis', s=50)
    plt.title('K-Means Clustering (First Two Dimensions)')
    plt.xlabel('housing_median_age')
    plt.ylabel('median_house_value')
    plt.show()

df = pd.read_csv('housing.csv')
basic_analysis(df)
preprocess_data(df)
cluster(df)
Output: a scatter plot of housing_median_age versus median_house_value, with the points colored by their cluster label.