Sunday, 23 March 2025

K-nearest neighbors algorithm in Machine Learning

k-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems.


How does it work?

For a new data point, KNN finds the k closest points in the training data and predicts from them. For classification it takes the majority label among those k neighbors, and for regression it takes the average of their values.
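
The idea can be sketched in a few lines of plain Python. This is only a minimal illustration, not how scikit-learn implements it: the helper names (k_nearest, knn_classify, knn_regress) and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Return the indices of the k training points closest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    return order[:k]

def knn_classify(train_points, train_labels, query, k):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(train_labels[i] for i in k_nearest(train_points, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_points, train_values, query, k):
    """Regression: average of the k nearest target values."""
    nearest = k_nearest(train_points, query, k)
    return sum(train_values[i] for i in nearest) / len(nearest)

# Hypothetical toy data: two well-separated clusters in 2D
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

print(knn_classify(points, labels, (2, 2), k=3))      # A
print(knn_regress(points, values, (6.5, 6.5), k=3))   # 51.0
```

Note that there is no real training step here: the "model" is just the stored training data, and all the work happens at prediction time.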

 

Pros

  1. Easy to implement
  2. Easy to understand
  3. It can be used for both regression and classification tasks

 

Cons

  1. Computationally expensive for large datasets, because every prediction has to compare the new point with the stored training points.
  2. It is sensitive to the hyperparameter k.

 

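If k is too low, the prediction follows individual noisy points (high variance), which can lead to misclassification. If k is too high, the neighborhood becomes so large that local structure is averaged away (high bias), and each prediction also has more neighbors to process.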

 

Have a look at the image below.

As you can see in the image, the rectangle and star shapes are the training data points. The prediction for the new data point, which is shown as a triangle, depends on the value of k.

 

  1. If k=5, the algorithm selects 3 star data points and 2 rectangle data points, so the majority vote predicts star.
  2. If k=9, the algorithm selects 4 star data points and 5 rectangle data points, so the majority vote predicts rectangle.

 

From the above, you can see that the outcome of the model changes when the value of k is changed from 5 to 9.
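
The flip between k=5 and k=9 can be reproduced with a small vote count. The neighbor ordering below is hypothetical (the original image is not available here); it is chosen only so that the counts match the two cases described above.

```python
from collections import Counter

# Hypothetical labels of the 9 closest training points, nearest first.
# The first 5 contain 3 stars and 2 rectangles; all 9 contain 4 stars and 5 rectangles.
neighbors_by_distance = ["star", "rectangle", "star", "rectangle", "star",
                         "rectangle", "rectangle", "star", "rectangle"]

def vote(labels_sorted, k):
    """Return the majority label and the vote counts among the k nearest neighbors."""
    counts = Counter(labels_sorted[:k])
    return counts.most_common(1)[0][0], dict(counts)

print(vote(neighbors_by_distance, 5))   # ('star', {'star': 3, 'rectangle': 2})
print(vote(neighbors_by_distance, 9))   # ('rectangle', {'star': 4, 'rectangle': 5})
```

With two classes, an odd k is a common choice because it rules out tied votes.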

 

How are the k nearest neighbors of a given data point identified?

By calculating the distance between the given point and all other points in the dataset, and then taking the k points with the smallest distances.

 

The following distance metrics are commonly used to calculate the distance between two points.

  1. Euclidean Distance
  2. Manhattan Distance
  3. Minkowski distance

 

Euclidean Distance

If you have two points (x1, y1) and (x2, y2) in a two-dimensional space, Euclidean distance is calculated using the formula below.

dist = sqrt((x1 - x2)^2 + (y1 - y2)^2)


Where

  1. x1 and x2 are the x-coordinates of the two points
  2. y1 and y2 are the y-coordinates of the two points

 

If you have two points in a three-dimensional space, the formula looks like this.

 

dist = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)

 

where:

  1. x1 and x2 are the x-coordinates of the two points
  2. y1 and y2 are the y-coordinates of the two points
  3. z1 and z2 are the z-coordinates of the two points

 

Euclidean distance is a good choice for many applications, but be cautious: because the differences are squared, it is sensitive to outliers and to features with large numeric ranges.
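
As a quick check of the formulas, here is a minimal sketch of Euclidean distance for points of any dimension. The function name euclidean is made up for this example; the standard library's math.dist computes the same value.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))        # 5.0  (3-4-5 triangle)
print(euclidean((1, 2, 3), (4, 6, 3)))  # 5.0  (sqrt(9 + 16 + 0))
```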

 

Manhattan Distance

If you have two points (x1, y1) and (x2, y2) in a two-dimensional space, Manhattan distance is calculated using the formula below.

dist = |x1 - x2| + |y1 - y2|

If you have two points (x1, y1, z1) and (x2, y2, z2) in a three-dimensional space, Manhattan distance is calculated using the formula below.

dist = |x1 - x2| + |y1 - y2| + |z1 - z2|

Manhattan Distance is easy to understand and implement.
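
A minimal sketch of Manhattan distance, using the same two pairs of points as a worked example (the function name manhattan is made up for this illustration):

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan((0, 0), (3, 4)))        # 7  (3 + 4)
print(manhattan((1, 2, 3), (4, 6, 3)))  # 7  (3 + 4 + 0)
```

For the same points the Euclidean distance is 5, so the Manhattan distance is never the smaller of the two.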

 

Minkowski distance

It's a generalization of both the Euclidean distance and the Manhattan distance.

 

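Formula

dist(a, b) = (sum over i of |a_i - b_i|^p)^(1/p)

Here a_i and b_i are the i-th coordinates of the two points. Written out for the coordinates x, y, z, ..., it is given below.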

 

dist = (|x1 - x2|^p + |y1 - y2|^p + |z1 - z2|^p + ...)^(1/p)

 

Where

  1. x1 and x2 are the x-coordinates of the two points
  2. y1 and y2 are the y-coordinates of the two points
  3. z1 and z2 are the z-coordinates of the two points
  4. p is a real number with p >= 1. When p=1, Minkowski distance is equal to the Manhattan distance, and when p=2, it is equal to the Euclidean distance. You can experiment with different p values to find the best fit for your data. A larger value of p gives more weight to the largest coordinate difference, which makes the distance more sensitive to outliers, while a smaller value of p makes it less sensitive.
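
The special cases can be verified with a short sketch. The function name minkowski is made up for this example, and p >= 1 is assumed so that the result is a proper distance.

```python
def minkowski(a, b, p):
    """(sum of |difference|^p) ^ (1/p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))    # 7.0, same as Manhattan distance
print(minkowski(a, b, 2))    # 5.0, same as Euclidean distance
print(minkowski(a, b, 10))   # slightly above 4, the largest single difference
```

As p grows, the result moves toward the largest coordinate difference (4 in this example), which is why a single extreme coordinate dominates the distance for large p.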

 

Real world examples of KNN algorithm

a. KNN is used in image classification tasks. An unknown image is compared against the labelled images in the training set and is assigned the label that is most common among its k nearest images.

 

b. It is used to detect fraudulent transactions by comparing each new transaction to known, labelled transactions.

 

c. It is used in recommender systems. For example, in a news recommender system, we can recommend news to a user based on the preferences of the k nearest users who have a similar news reading history.

 

d. You can use KNN regression to predict house prices in a locality from the prices of the most similar houses.

 

e. We can assign customers to known interest groups using KNN, based on the groups of their most similar customers.

 

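Example in scikit-learn

 

# X_train, y_train and X_test are assumed to be prepared already
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with k=7 neighbors
knn = KNeighborsClassifier(n_neighbors=7)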

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels for the test data
y_pred = knn.predict(X_test)
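
For a version that runs on its own, here is a small sketch with made-up data. The six training points and two test points are hypothetical, and k=3 is used instead of 7 because n_neighbors cannot exceed the number of training samples.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5)
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[0.5, 0.5], [5.5, 5.5]]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(y_pred.tolist())   # [0, 1]
```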

 

Dataset used to demonstrate the application.

https://www.kaggle.com/datasets/gkalpolukcu/knn-algorithm-dataset/download?datasetVersionNumber=1

 

Let’s try to understand the above dataset.

 

Load the CSV file and get the shape (number of rows and columns) of the dataset

import pandas as pd

df = pd.read_csv('KNNAlgorithmDataset.csv')
shape = df.shape
total_rows = shape[0]
total_columns = shape[1]

print(f'\ntotal_rows : {total_rows}')
print(f'total_columns : {total_columns}')

The above snippet prints the following information.

total_rows : 569
total_columns : 33

 

Print the column names to get a quick glance at the data definition

columns = df.columns
print(f'\ncolumns : {columns}')

columns : Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

 

Let’s get basic information about the dataframe, such as column names, data types and non-null counts.

df.info()

The above snippet prints the following information.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)

From the above output, we can confirm that

  1. There are 32 numeric columns (31 are of type float64 and 1 is of type int64).
  2. One column (diagnosis) is of type object.
  3. All the columns except ‘Unnamed: 32’ have no null values.
  4. The column ‘Unnamed: 32’ contains only null values.

 

Print the first five rows to get a feel for the content

five_rows = df.head()
print(f'five_rows : \n{five_rows}')

 

The above snippet prints the following data.

five_rows : 
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean   
0    842302         M        17.99         10.38          122.80     1001.0  \
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean   
0          0.11840           0.27760          0.3001              0.14710  \
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   symmetry_mean  fractal_dimension_mean  radius_se  texture_se  perimeter_se   
0         0.2419                 0.07871     1.0950      0.9053         8.589  \
1         0.1812                 0.05667     0.5435      0.7339         3.398   
2         0.2069                 0.05999     0.7456      0.7869         4.585   
3         0.2597                 0.09744     0.4956      1.1560         3.445   
4         0.1809                 0.05883     0.7572      0.7813         5.438   

   area_se  smoothness_se  compactness_se  concavity_se  concave points_se   
0   153.40       0.006399         0.04904       0.05373            0.01587  \
1    74.08       0.005225         0.01308       0.01860            0.01340   
2    94.03       0.006150         0.04006       0.03832            0.02058   
3    27.23       0.009110         0.07458       0.05661            0.01867   
4    94.44       0.011490         0.02461       0.05688            0.01885   

   symmetry_se  fractal_dimension_se  radius_worst  texture_worst   
0      0.03003              0.006193         25.38          17.33  \
1      0.01389              0.003532         24.99          23.41   
2      0.02250              0.004571         23.57          25.53   
3      0.05963              0.009208         14.91          26.50   
4      0.01756              0.005115         22.54          16.67   

   perimeter_worst  area_worst  smoothness_worst  compactness_worst   
0           184.60      2019.0            0.1622             0.6656  \
1           158.80      1956.0            0.1238             0.1866   
2           152.50      1709.0            0.1444             0.4245   
3            98.87       567.7            0.2098             0.8663   
4           152.20      1575.0            0.1374             0.2050   

   concavity_worst  concave points_worst  symmetry_worst   
0           0.7119                0.2654          0.4601  \
1           0.2416                0.1860          0.2750   
2           0.4504                0.2430          0.3613   
3           0.6869                0.2575          0.6638   
4           0.4000                0.1625          0.2364   

   fractal_dimension_worst  Unnamed: 32  
0                  0.11890          NaN  
1                  0.08902          NaN  
2                  0.08758          NaN  
3                  0.17300          NaN  
4                  0.07678          NaN  

Find duplicate records

duplicate_count = df.duplicated().sum()
print(f'duplicate_count : {duplicate_count}')

 

The above snippet prints the following output.

duplicate_count : 0

 

Print unique counts and missing values per column

unique_counts = df.nunique()
missing_values = df.isnull().sum()
result_df = pd.DataFrame({'Data Type': df.dtypes, 'unique_count': unique_counts, 'missing_values' : missing_values})
print(result_df)

The above snippet prints the following output.

                        Data Type  unique_count  missing_values
id                          int64           569               0
diagnosis                  object             2               0
radius_mean               float64           456               0
texture_mean              float64           479               0
perimeter_mean            float64           522               0
area_mean                 float64           539               0
smoothness_mean           float64           474               0
compactness_mean          float64           537               0
concavity_mean            float64           537               0
concave points_mean       float64           542               0
symmetry_mean             float64           432               0
fractal_dimension_mean    float64           499               0
radius_se                 float64           540               0
texture_se                float64           519               0
perimeter_se              float64           533               0
area_se                   float64           528               0
smoothness_se             float64           547               0
compactness_se            float64           541               0
concavity_se              float64           533               0
concave points_se         float64           507               0
symmetry_se               float64           498               0
fractal_dimension_se      float64           545               0
radius_worst              float64           457               0
texture_worst             float64           511               0
perimeter_worst           float64           514               0
area_worst                float64           544               0
smoothness_worst          float64           411               0
compactness_worst         float64           529               0
concavity_worst           float64           539               0
concave points_worst      float64           492               0
symmetry_worst            float64           500               0
fractal_dimension_worst   float64           535               0
Unnamed: 32               float64             0             569

 

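Get basic statistics of the numeric columns to look for outliers. The values are rounded to whole numbers, so columns whose values are all well below 1 show up as 0.0.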

numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
print(round(df[numerical_columns].describe()))

                id  radius_mean  texture_mean  perimeter_mean  area_mean   
count        569.0        569.0         569.0           569.0      569.0  \
mean    30371831.0         14.0          19.0            92.0      655.0   
std    125020586.0          4.0           4.0            24.0      352.0   
min         8670.0          7.0          10.0            44.0      144.0   
25%       869218.0         12.0          16.0            75.0      420.0   
50%       906024.0         13.0          19.0            86.0      551.0   
75%      8813129.0         16.0          22.0           104.0      783.0   
max    911320502.0         28.0          39.0           188.0     2501.0   

       smoothness_mean  compactness_mean  concavity_mean  concave points_mean   
count            569.0             569.0           569.0                569.0  \
mean               0.0               0.0             0.0                  0.0   
std                0.0               0.0             0.0                  0.0   
min                0.0               0.0             0.0                  0.0   
25%                0.0               0.0             0.0                  0.0   
50%                0.0               0.0             0.0                  0.0   
75%                0.0               0.0             0.0                  0.0   
max                0.0               0.0             0.0                  0.0   

       symmetry_mean  fractal_dimension_mean  radius_se  texture_se   
count          569.0                   569.0      569.0       569.0  \
mean             0.0                     0.0        0.0         1.0   
std              0.0                     0.0        0.0         1.0   
min              0.0                     0.0        0.0         0.0   
25%              0.0                     0.0        0.0         1.0   
50%              0.0                     0.0        0.0         1.0   
75%              0.0                     0.0        0.0         1.0   
max              0.0                     0.0        3.0         5.0   

       perimeter_se  area_se  smoothness_se  compactness_se  concavity_se   
count         569.0    569.0          569.0           569.0         569.0  \
mean            3.0     40.0            0.0             0.0           0.0   
std             2.0     45.0            0.0             0.0           0.0   
min             1.0      7.0            0.0             0.0           0.0   
25%             2.0     18.0            0.0             0.0           0.0   
50%             2.0     25.0            0.0             0.0           0.0   
75%             3.0     45.0            0.0             0.0           0.0   
max            22.0    542.0            0.0             0.0           0.0   

       concave points_se  symmetry_se  fractal_dimension_se  radius_worst   
count              569.0        569.0                 569.0         569.0  \
mean                 0.0          0.0                   0.0          16.0   
std                  0.0          0.0                   0.0           5.0   
min                  0.0          0.0                   0.0           8.0   
25%                  0.0          0.0                   0.0          13.0   
50%                  0.0          0.0                   0.0          15.0   
75%                  0.0          0.0                   0.0          19.0   
max                  0.0          0.0                   0.0          36.0   

       texture_worst  perimeter_worst  area_worst  smoothness_worst   
count          569.0            569.0       569.0             569.0  \
mean            26.0            107.0       881.0               0.0   
std              6.0             34.0       569.0               0.0   
min             12.0             50.0       185.0               0.0   
25%             21.0             84.0       515.0               0.0   
50%             25.0             98.0       686.0               0.0   
75%             30.0            125.0      1084.0               0.0   
max             50.0            251.0      4254.0               0.0   

       compactness_worst  concavity_worst  concave points_worst   
count              569.0            569.0                 569.0  \
mean                 0.0              0.0                   0.0   
std                  0.0              0.0                   0.0   
min                  0.0              0.0                   0.0   
25%                  0.0              0.0                   0.0   
50%                  0.0              0.0                   0.0   
75%                  0.0              0.0                   0.0   
max                  1.0              1.0                   0.0   

       symmetry_worst  fractal_dimension_worst  Unnamed: 32  
count           569.0                    569.0          0.0  
mean              0.0                      0.0          NaN  
std               0.0                      0.0          NaN  
min               0.0                      0.0          NaN  
25%               0.0                      0.0          NaN  
50%               0.0                      0.0          NaN  
75%               0.0                      0.0          NaN  
max               1.0                      0.0          NaN  

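Looking at the above statistics, some columns appear to have outliers. For example, the maximum of area_mean (2501) is far above its 75th percentile (783). To keep things simple, I am using the IQR (interquartile range) to find the columns that contain outliers.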

 

def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns

 

For each numeric column, the above function

  1. Calculates the first quartile (Q1), third quartile (Q3), and the IQR (IQR = Q3 - Q1).
  2. Defines lower and upper bounds using the IQR and a specified threshold (default is 1.5 times the IQR).
  3. Marks the column as an outlier column if it has at least one value that is less than the lower bound or greater than the upper bound.

The call below passes a threshold of 3 instead of the default 1.5, so only more extreme values are flagged.

outlier_columns = find_outlier_columns(df, 3)
print(f'Total outlier_columns : {len(outlier_columns)}')
print(f'outlier_columns : \n{outlier_columns}')

 

The above snippet prints the following insights.

Total outlier_columns : 24
outlier_columns : 
['id', 'radius_mean', 'texture_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst']
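
The IQR rule is easy to trace by hand on a tiny list. The sketch below uses only the standard library rather than pandas, with made-up numbers; method="inclusive" is chosen because it interpolates linearly between data points, which should match the default behaviour of the pandas quantile call used above.

```python
import statistics

def iqr_bounds(values, threshold=1.5):
    """Return (lower_bound, upper_bound) for the IQR outlier rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

# Hypothetical data with one obvious outlier
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Q1 = 3, Q3 = 7, IQR = 4, so the bounds are 3 - 6 and 7 + 6
lower, upper = iqr_bounds(data)
outliers = [v for v in data if v < lower or v > upper]
print(lower, upper, outliers)   # -3.0 13.0 [100]

# A threshold of 3 widens the bounds to 3 - 12 and 7 + 12
print(iqr_bounds(data, threshold=3))   # (-9.0, 19.0)
```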

 

Preprocess the data

Drop the column ‘Unnamed: 32’ as it does not have any non-null values.

 

df.drop('Unnamed: 32', axis=1, inplace=True)

 

Encode the ‘diagnosis’ column data

‘diagnosis’ is the only non-numeric column and it has only two unique values, so we can encode it as 0 and 1. LabelEncoder assigns the integer codes in sorted order of the class labels.

 

label_encoder = LabelEncoder()
df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])
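
To see which label gets which code, you can inspect classes_ after fitting. The sketch below assumes the two diagnosis values are 'B' and 'M'; only 'M' is visible in the rows printed earlier, so treat 'B' as an assumption.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the diagnosis column
diagnosis = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(diagnosis)

# classes_ holds the labels in sorted order; the position is the code
print(encoder.classes_.tolist())   # ['B', 'M']
print(encoded.tolist())            # [1, 0, 0, 1]

# inverse_transform maps codes back to the original labels
print(encoder.inverse_transform([0, 1]).tolist())   # ['B', 'M']
```

Under that assumption, 1 stands for 'M' in every later step, which matters when reading metrics such as recall.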

 

Replace outliers with lower and upper bounds

 

The following function clips every outlier to the nearest bound. The complete program later calls it with a threshold of 3.

def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                # Values below the lower bound are raised to the lower bound
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                # Values above the upper bound are lowered to the upper bound
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
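
The same clipping can also be written with the pandas clip method. This is an alternative sketch, not the code used in the rest of this post; the helper name clip_outliers and the sample values are made up.

```python
import pandas as pd

def clip_outliers(series, threshold=1.5):
    """Return a copy of series with values limited to the IQR bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - threshold * iqr, upper=q3 + threshold * iqr)

# Hypothetical column with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

clipped = clip_outliers(s)
print(clipped.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 13.0]
```

Clipping keeps every row but moves extreme values to the bound (here 100 becomes 13), which is a milder choice than deleting the rows.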

 

Let’s also drop the ‘id’ column. It is only a record identifier and carries no information about the diagnosis.

df.drop('id', axis=1, inplace=True)

 


 

Train and test the model

Split the data into features (X) and target (y).

X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']

Standardize the features. KNN is distance based, so features with large numeric ranges (such as area_mean) would otherwise dominate features with small ranges (such as smoothness_mean). StandardScaler rescales each feature to zero mean and unit variance.

 

sc = StandardScaler()
X = sc.fit_transform(X)
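
Why scaling matters for KNN can be shown with two made-up points whose features live on very different scales (think of an area in the hundreds and a smoothness around 0.1). This is a plain Python sketch with hypothetical helper names; like StandardScaler, it uses the population standard deviation.

```python
import math
import statistics

def standardize(rows):
    """Build a function that z-scores a row using the column stats of rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    def transform(row):
        return tuple((v - m) / s for v, m, s in zip(row, means, stds))
    return transform

def nearest_index(rows, query):
    """Index of the row with the smallest Euclidean distance to query."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], query))

# Hypothetical (area, smoothness) values
train = [(500.0, 0.10), (520.0, 0.20)]
query = (515.0, 0.11)

# Raw distances are driven almost entirely by the area column
raw_nearest = nearest_index(train, query)

# After scaling, both columns contribute on comparable terms
scale = standardize(train)
scaled_nearest = nearest_index([scale(r) for r in train], scale(query))

print(raw_nearest, scaled_nearest)   # 1 0
```

On the raw values the second point wins because its area is closer, even though the smoothness is almost identical to the first point. After scaling, the first point is nearest.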

 

Divide the data into training and testing samples.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)
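
One thing to be aware of: in the steps above the scaler is fitted on all rows before the split, so the test rows influence the scaling statistics. A possible variant, sketched here on synthetic stand-in data (make_classification is an assumption, not the Kaggle dataset), splits first and wraps the scaler and the classifier in a pipeline so that the scaler only sees the training rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), used only to keep the sketch runnable
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=47)

# fit() fits the scaler on X_train only, then fits KNN on the scaled X_train
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

# score() applies the training-set scaling to X_test before predicting
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The accuracy printed here belongs to the synthetic data and says nothing about the breast cancer dataset used in this post.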

 

Get an instance of KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=3)

 

Train the model.

knn_classifier.fit(X_train, y_train)

 

Validate the model against the test sample.

y_pred = knn_classifier.predict(X_test)

Get the accuracy score.

accuracy = accuracy_score(y_test, y_pred)
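
Accuracy alone does not say which kind of mistakes the model makes, and for a diagnosis task a missed positive case usually matters more than a false alarm. The sketch below counts the four outcomes by hand on made-up predictions. It assumes that 1 is the positive class, and the helper name confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of predicted positives that were right

print(tp, fp, tn, fn)                 # 3 1 3 1
print(accuracy, recall, precision)    # 0.75 0.75 0.75
```

scikit-learn provides the same numbers through confusion_matrix, recall_score and precision_score in sklearn.metrics.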

knn.py

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns

def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

def basic_analysis(df):
    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]

    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    five_rows = df.head()
    print(f'five_rows : \n{five_rows}')

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'\nduplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes, 'unique_count': unique_counts, 'missing_values' : missing_values})
    print(result_df)

    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    print(round(df[numerical_columns].describe()))

    outlier_columns = find_outlier_columns(df, 3)
    print(f'Total outlier_columns : {len(outlier_columns)}')
    print(f'outlier_columns : \n{outlier_columns}')

def preprocess_data(df):
    df.drop('Unnamed: 32', axis=1, inplace=True)
    df.drop('id', axis=1, inplace=True)

    label_encoder = LabelEncoder()
    df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])

    find_and_replace_outlier_columns(df, 3)

    #print(round(df.describe()))

def train_test_model(df):
    #df.info()

    # Split the data into features (X) and target (y)
    X = df.drop(['diagnosis'], axis=1)
    y = df['diagnosis']

    sc = StandardScaler()
    X = sc.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

    # Initialize the K-NN classifier (Let's choose k=3)
    knn_classifier = KNeighborsClassifier(n_neighbors=3)

    # Fit the classifier to the training data
    knn_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = knn_classifier.predict(X_test)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)


df = pd.read_csv('KNNAlgorithmDataset.csv')
# basic_analysis(df)
preprocess_data(df)
train_test_model(df)

Output

Accuracy: 0.9824561403508771

We can enhance the above program to evaluate different neighbor counts.

training_accuracy = []
testing_accuracy = []

for neighbor_count in range(1, 20):
    # Initialize the K-NN classifier (Let's choose k=neighbor_count)
    knn_classifier = KNeighborsClassifier(n_neighbors=neighbor_count)

    # Fit the classifier to the training data
    knn_classifier.fit(X_train, y_train)

    training_accuracy.append(knn_classifier.score(X_train, y_train))
    testing_accuracy.append(knn_classifier.score(X_test, y_test))

Find the working application below.

knn_diff_neighbors.py 

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns

def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

def basic_analysis(df):
    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]

    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    five_rows = df.head()
    print(f'five_rows : \n{five_rows}')

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'\nduplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes, 'unique_count': unique_counts, 'missing_values' : missing_values})
    print(result_df)

    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    print(round(df[numerical_columns].describe()))

    outlier_columns = find_outlier_columns(df, 3)
    print(f'Total outlier_columns : {len(outlier_columns)}')
    print(f'outlier_columns : \n{outlier_columns}')

def preprocess_data(df):
    df.drop('Unnamed: 32', axis=1, inplace=True)
    df.drop('id', axis=1, inplace=True)

    label_encoder = LabelEncoder()
    df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])

    find_and_replace_outlier_columns(df, 3)

    #print(round(df.describe()))

def train_test_model(df):
    #df.info()

    # Split the data into features (X) and target (y)
    X = df.drop(['diagnosis'], axis=1)
    y = df['diagnosis']

    sc = StandardScaler()
    X = sc.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

    training_accuracy = []
    testing_accuracy = []

    for neighbor_count in range(1, 20):
        # Initialize the K-NN classifier (Let's choose k=neighbor_count)
        knn_classifier = KNeighborsClassifier(n_neighbors=neighbor_count)

        # Fit the classifier to the training data
        knn_classifier.fit(X_train, y_train)

        training_accuracy.append(knn_classifier.score(X_train, y_train))
        testing_accuracy.append(knn_classifier.score(X_test, y_test))

    num = [i for i in range(1, 20)]
    score = pd.DataFrame({'neighbor_count': num, 'training_score': training_accuracy, 'testing_score': testing_accuracy})
    print(score)
    plt.plot(num, training_accuracy, label='Training accuracy')
    plt.plot(num, testing_accuracy, label='Testing accuracy')

    plt.legend()
    plt.show()


df = pd.read_csv('KNNAlgorithmDataset.csv')
# basic_analysis(df)
preprocess_data(df)
train_test_model(df)

 

Output

    neighbor_count  training_score  testing_score
0                1        1.000000       0.970760
1                2        0.972362       0.976608
2                3        0.982412       0.982456
3                4        0.969849       0.982456
4                5        0.974874       0.982456
5                6        0.967337       0.976608
6                7        0.972362       0.976608
7                8        0.969849       0.976608
8                9        0.972362       0.976608
9               10        0.962312       0.970760
10              11        0.967337       0.964912
11              12        0.962312       0.959064
12              13        0.962312       0.964912
13              14        0.957286       0.947368
14              15        0.964824       0.947368
15              16        0.957286       0.953216
16              17        0.959799       0.947368
17              18        0.954774       0.941520
18              19        0.957286       0.941520
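
In the table above, the testing score peaks at 0.982456 for k = 3, 4 and 5 and drops for larger k. Picking k directly from the test score makes the reported accuracy a little optimistic, because the test set has then been used for tuning. A common alternative is cross-validation. The sketch below runs on synthetic stand-in data (make_classification is an assumption, not the Kaggle dataset), so the k it selects does not carry over to the table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), used only to keep the sketch runnable
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)

cv_scores = {}
for k in range(1, 20):
    # Scaling sits inside the pipeline, so it is refitted on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 folds
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"best k by 5-fold CV: {best_k} (mean accuracy {cv_scores[best_k]:.3f})")
```

With this approach, the held-out test set is used only once, to report the accuracy of the final k.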

 
