The k-means clustering algorithm groups data points into clusters based on their similarity. Here k represents the number of clusters.
Example 1: data points are divided into two clusters.
Example 2: data points are divided into three clusters.
Real world examples of K-means clustering
- Group customers based on their purchase behaviour.
- Group text based on its content. For example, all business-related articles go into the Business section, all sports-related articles go into the Sports section, and so on.
- Detect fraudulent transactions by clustering regular and fraudulent transactions into separate groups.
- Classify mutual fund assets into their respective risk categories (such as moderate risk, high risk, etc.).
How does the k-means algorithm work?
K-means is an iterative algorithm: it repeats the same steps until it converges to a solution (a local optimum, not necessarily the best possible clustering).
Step 1: Choose k centroids randomly in the data set as the initial cluster centroids.
Step 2: Calculate the distance between each datapoint and the k centroids. Assign each data point to the cluster with the nearest centroid.
Step 3: Recalculate the centroid of each cluster. The centroid of a cluster is the mean of all the data points in that cluster.
Step 4: Repeat steps 2 and 3 until the centroids do not change significantly.
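To make these steps concrete, below is a minimal NumPy sketch of the loop. The function name simple_kmeans, the tolerance, and the seed are illustrative choices for this sketch, not from any library.

import numpy as np

def simple_kmeans(points, k, max_iters=100, tol=1e-4, seed=47):
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 4: stop when the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels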
How is the centroid calculated for two-dimensional data points?
Suppose you have the data points below.
[ (2, 3), (4, 5), (1, 12), (6, 9), (5, 11), (3, 14), (8, 25) ]
By calculating the mean (average) of the x-coordinates and the mean of the y-coordinates separately, we can get the coordinates of the centroid.
Mean of x-coordinates: (2 + 4 + 1 + 6 + 5 + 3 + 8) / 7 = 29 / 7 ≈ 4.14
Mean of y-coordinates: (3 + 5 + 12 + 9 + 11 + 14 + 25) / 7 = 79 / 7 ≈ 11.29
So the centroid is (4.14, 11.29).
centroid_in_2d_space.py
import numpy as np
import matplotlib.pyplot as plt

arr = np.array([[2, 3], [4, 5], [1, 12], [6, 9], [5, 11], [3, 14], [8, 25]])

# The centroid is the mean of the x-coordinates and the mean of the y-coordinates
centroid_x = np.mean(arr[:, 0])
centroid_y = np.mean(arr[:, 1])

print(f'centroid_x : {centroid_x}')
print(f'centroid_y : {centroid_y}')

# Plot data points
plt.scatter(arr[:, 0], arr[:, 1], label='Data Points', c='red')

# Plot centroid
plt.scatter(centroid_x, centroid_y, c='blue', marker='x', s=100, label='Centroid')
plt.legend()  # show the labels defined above
plt.show()
Output: a scatter plot of the data points (red) with the centroid marked as a blue ×.
Points to consider while using k-means algorithm
- The number of clusters k must be specified in advance.
- The algorithm is sensitive to the initial choice of centroids (one common mitigation is shown below).
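One common way to reduce the sensitivity to initialization is to run the algorithm several times from different random centroids and keep the best run; scikit-learn's KMeans supports this through its n_init parameter. A minimal sketch follows; the toy data points here are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points (illustrative values only)
points = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]])

# n_init=10 runs k-means 10 times from different initial centroids and
# keeps the run with the lowest inertia (within-cluster sum of squares)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=47)
kmeans.fit(points)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # centroids of the best run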
Let’s apply the k-means clustering technique to the California housing prices dataset (https://www.kaggle.com/datasets/camnugent/california-housing-prices).
Basic Analysis on the data
Get total rows and columns in the dataset.
shape = df.shape
total_rows = shape[0]
total_columns = shape[1]
print(f'\ntotal_rows : {total_rows}')
print(f'total_columns : {total_columns}')
total_rows : 20640
total_columns : 10
Let’s get an idea about the dataset by looking at the column names.
columns = df.columns
print(f'\ncolumns : {columns}')
columns : Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Let’s get detailed information on the dataset by executing the statement below.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
From the output, we can observe the following points.
- total_bedrooms column has some missing values.
- There are 9 columns of type float64, and one column, ocean_proximity, is of type object.
Let’s get an overview of the data by looking at some sample data points.
five_rows = df.head()
print(five_rows)
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY
Let's find the count of missing values column-wise.
column_wise_missing_values = df.isnull().sum()
print(f'column_wise_missing_values : {column_wise_missing_values}')
The above snippet prints the information below.
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
We can see that 207 entries are missing in the total_bedrooms column.
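That is only a small fraction of the rows. A quick check of the ratio (this snippet is an addition for illustration):

missing_ratio = df['total_bedrooms'].isnull().mean()
print(f'missing_ratio : {missing_ratio:.2%}')  # 207 / 20640, about 1%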
Check for duplicated rows
duplicate_count = df.duplicated().sum()
print(f'\nduplicate_count : \n{duplicate_count}')
The above snippet prints the information below.
duplicate_count : 0
Let’s find the unique value counts column-wise.
unique_counts = df.nunique()
missing_values = df.isnull().sum()
result_df = pd.DataFrame({'Data Type': df.dtypes,
                          'unique_count': unique_counts,
                          'missing_values': missing_values})
print(result_df)
The above snippet prints the data below.
Unique counts, missing values count, data types column wise
                   Data Type  unique_count  missing_values
longitude            float64           844               0
latitude             float64           862               0
housing_median_age   float64            52               0
total_rooms          float64          5926               0
total_bedrooms       float64          1923             207
population           float64          3888               0
households           float64          1815               0
median_income        float64         12928               0
median_house_value   float64          3842               0
ocean_proximity       object             5               0
As you can see, the column ‘ocean_proximity’ has 5 unique values. We can encode it before processing.
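Before encoding, we can quickly inspect the distinct categories (a small illustrative check; the values themselves come from the dataset, e.g. NEAR BAY appears in the sample rows above):

# List the distinct categories in the ocean_proximity column
print(df['ocean_proximity'].unique())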
Preprocess the data
Encode the 'ocean_proximity' column data using a label encoder.
label_encoder = LabelEncoder()
df['ocean_proximity'] = label_encoder.fit_transform(df['ocean_proximity'])
Replace missing values in the 'total_bedrooms' column with the mode.
mode = df['total_bedrooms'].dropna().mode()
df['total_bedrooms'].fillna(mode.iloc[0], inplace=True)
Form the clusters
Create an instance of the KMeans algorithm.
no_of_clusters = 5
kmeans = KMeans(n_clusters=no_of_clusters, random_state=47)
Fit the KMeans model to the data
kmeans.fit(df)
Get the centroids and labels.
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
Plot a diagram to visualize the clusters. Since visualizing high-dimensional data directly is challenging, we display a scatter plot of just two of the features (housing_median_age and median_house_value).
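The corresponding plotting snippet (the same code used in the complete application below):

plt.scatter(df['housing_median_age'], df['median_house_value'], c=labels, cmap='viridis', s=50)
plt.title('K-Means Clustering (First Two Dimensions)')
plt.xlabel('housing_median_age')
plt.ylabel('median_house_value')
plt.show()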
The complete working application is below.
cluster_houses_info.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def basic_analysis(df):
    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]
    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    five_rows = df.head()
    print(f'five_rows : \n{five_rows}')

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : \n{column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'\nduplicate_count : \n{duplicate_count}')

    # Print unique counts, missing value counts, and data types column wise
    print('\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes,
                              'unique_count': unique_counts,
                              'missing_values': missing_values})
    print(result_df)

    # Summary statistics for the numerical columns
    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    print(round(df[numerical_columns].describe()))

def preprocess_data(df):
    # Encode ocean_proximity column
    label_encoder = LabelEncoder()
    df['ocean_proximity'] = label_encoder.fit_transform(df['ocean_proximity'])

    # Replace missing values in 'total_bedrooms' column with mode
    mode = df['total_bedrooms'].dropna().mode()
    print(f'total_bedrooms mode : {mode.iloc[0]}\n')
    df['total_bedrooms'].fillna(mode.iloc[0], inplace=True)

def cluster(df):
    no_of_clusters = 5
    kmeans = KMeans(n_clusters=no_of_clusters, random_state=47)

    # Fit the KMeans model to the data
    kmeans.fit(df)

    # Get cluster centroids and labels
    centroids = kmeans.cluster_centers_
    # print(f'centroids : {centroids}')
    labels = kmeans.labels_

    # Since visualizing high-dimensional data directly is challenging,
    # we display a scatter plot of two of the features only
    plt.scatter(df['housing_median_age'], df['median_house_value'], c=labels, cmap='viridis', s=50)
    plt.title('K-Means Clustering (First Two Dimensions)')
    plt.xlabel('housing_median_age')
    plt.ylabel('median_house_value')
    plt.show()

df = pd.read_csv('housing.csv')
basic_analysis(df)
preprocess_data(df)
cluster(df)
Output: a scatter plot of housing_median_age versus median_house_value, with the points colored by their cluster label.