k-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used to address both classification and regression problems.
How does it work?
KNN finds the k training data points closest to the new data point and predicts the new point's category (for classification) or value (for regression) from the labels or values of those k nearest neighbors.
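To make the idea concrete, here is a minimal from-scratch sketch (hypothetical toy data, Euclidean distance, majority vote); the real examples later in this post use scikit-learn.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two clusters labelled 0 and 1
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0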
Pros
- Easy to implement
- Easy to understand
- It can be used for both regression and classification tasks

Cons
- Computationally expensive for large datasets.
- It is sensitive to the hyperparameter k.

If k is too small, the model becomes sensitive to noise in the training data and can misclassify points; if k is too large, the prediction is smoothed over too many neighbors and the computation becomes more expensive.
Have a look at the image below.
In the image, the rectangle and star shapes are the training data points. When we need to predict the outcome for the new data point, represented by the triangle, the result depends on the value of k.
- If k=5, the algorithm selects 3 star data points and 2 rectangle data points.
- If k=9, the algorithm selects 4 star data points and 5 rectangle data points.
 
From the above statements, you can see that the model's outcome changes when k is altered from 5 to 9.
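The vote flip is easy to reproduce; here is a small sketch with hypothetical neighbor labels ordered by distance:

from collections import Counter

# Hypothetical labels of the 9 nearest neighbors, ordered by distance
nearest_labels = ['star', 'star', 'rectangle', 'star', 'rectangle',   # the 5 closest
                  'rectangle', 'star', 'rectangle', 'rectangle']      # the next 4

for k in (5, 9):
    votes = Counter(nearest_labels[:k])
    print(k, votes.most_common(1)[0][0])   # k=5 -> star (3 vs 2), k=9 -> rectangle (5 vs 4)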
How are the k-nearest neighbors of a given data point identified?
By calculating the distance between the given point and all other points in the dataset, we can find its k nearest neighbors.
The following distance metrics are commonly used to calculate the distance between two points.
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance
 
Euclidean Distance
Suppose you have two points (x1, y1) and (x2, y2) in a two-dimensional space. The Euclidean distance is calculated using the formula below.
dist = sqrt((x1 - x2)^2 + (y1 - y2)^2)
where:
- x1 and x2 are the x-coordinates of the two points
- y1 and y2 are the y-coordinates of the two points
 
If you have two points in a three-dimensional space, the formula looks like below.
dist = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)
where:
- x1 and x2 are the x-coordinates of the two points
- y1 and y2 are the y-coordinates of the two points
- z1 and z2 are the z-coordinates of the two points
 
Euclidean distance is a good choice for many applications, but be cautious: it is sensitive to outliers.
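As a quick check of the formula, here is a small NumPy sketch with two hypothetical points:

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)                   # 5.0
print(np.linalg.norm(p - q))  # the same result via NumPy's built-in norm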
Manhattan Distance
Suppose you have two points (x1, y1) and (x2, y2) in a two-dimensional space. The Manhattan distance is calculated using the formula below.
dist = |x1 - x2| + |y1 - y2|

For two points (x1, y1, z1) and (x2, y2, z2) in a three-dimensional space, it is calculated as below.
dist = |x1 - x2| + |y1 - y2| + |z1 - z2|
Manhattan Distance is easy to understand and implement.
Minkowski Distance
It's a generalization of both the Euclidean distance and the Manhattan distance.
In simple terms, the formula is given below.
dist = (|x1 - x2|^p + |y1 - y2|^p + |z1 - z2|^p + ...)^(1/p)
where:
- x1 and x2 are the x-coordinates of the two points
- y1 and y2 are the y-coordinates of the two points
- z1 and z2 are the z-coordinates of the two points
- p is a real number. When p=1, the Minkowski distance is equal to the Manhattan distance, and when p=2, it is equal to the Euclidean distance. You can experiment with different values of p to find the distance metric that best fits your data. A larger value of p gives more weight to large coordinate differences, making the distance more sensitive to outliers, while a smaller value of p is more robust to them.
 
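A short sketch of the Minkowski distance (hypothetical points) showing that p=1 and p=2 recover the Manhattan and Euclidean distances:

import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(minkowski(a, b, p=1))  # 7.0 -> Manhattan: |1-4| + |2-6| + |3-3|
print(minkowski(a, b, p=2))  # 5.0 -> Euclidean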
Real-world examples of the KNN algorithm
a. The KNN algorithm is used in image classification tasks. Given an unknown image, it is compared against the labelled images in the training set, and a decision is made based on its k nearest matches.
b. It is used to detect fraudulent transactions by comparing new transactions to a set of known fraudulent transactions.
c. It is used in recommender systems. For example, in a news recommender system, we can recommend news to the user based on the preferences of K-nearest users who have similar news reading history.
d. You can use KNN algorithm to predict the house prices in a locality.
e. We can group the customers by their interests using KNN algorithm.
Example in scikit learn
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with k=7 neighbors
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels for the test data
y_pred = knn.predict(X_test)
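Since KNN also handles regression, scikit-learn provides KNeighborsRegressor with the same interface; the prediction is the mean of the k nearest targets. A minimal sketch on hypothetical synthetic data:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data just to illustrate the interface
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn_reg = KNeighborsRegressor(n_neighbors=7)   # prediction = mean of the 7 nearest targets
knn_reg.fit(X_train, y_train)
print(knn_reg.score(X_test, y_test))           # R^2 on the held-out data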
Dataset used to demonstrate the application.
https://www.kaggle.com/datasets/gkalpolukcu/knn-algorithm-dataset/download?datasetVersionNumber=1
Let’s try to understand the above dataset.
Get the shape (number of rows and columns) of the dataset
shape = df.shape
total_rows = shape[0]
total_columns = shape[1]
print(f'\ntotal_rows : {total_rows}')
print(f'total_columns : {total_columns}')
The above snippet prints the following output.
total_rows : 569
total_columns : 33
Print the column names to get a quick glance at the data definition
columns = df.columns
print(f'\ncolumns : {columns}')
columns : Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')
Let’s get basic information about the dataframe, like column names, data types, non-null counts, etc.
df.info()
The above snippet prints the following output.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
From the above snippet, we can confirm that:
- There are 32 numeric columns (31 of type float64 and 1 of type int64).
- One column (diagnosis) is of type object.
- All the columns except ‘Unnamed: 32’ have no null values.
- The column ‘Unnamed: 32’ contains only null values.
 
Print the first five rows to get a basic look at the content
five_rows = df.head()
print(f'five_rows : \n{five_rows}')
The above snippet prints the following data.
five_rows : 
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean   
0    842302         M        17.99         10.38          122.80     1001.0  \
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   
   smoothness_mean  compactness_mean  concavity_mean  concave points_mean   
0          0.11840           0.27760          0.3001              0.14710  \
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   
   symmetry_mean  fractal_dimension_mean  radius_se  texture_se  perimeter_se   
0         0.2419                 0.07871     1.0950      0.9053         8.589  \
1         0.1812                 0.05667     0.5435      0.7339         3.398   
2         0.2069                 0.05999     0.7456      0.7869         4.585   
3         0.2597                 0.09744     0.4956      1.1560         3.445   
4         0.1809                 0.05883     0.7572      0.7813         5.438   
   area_se  smoothness_se  compactness_se  concavity_se  concave points_se   
0   153.40       0.006399         0.04904       0.05373            0.01587  \
1    74.08       0.005225         0.01308       0.01860            0.01340   
2    94.03       0.006150         0.04006       0.03832            0.02058   
3    27.23       0.009110         0.07458       0.05661            0.01867   
4    94.44       0.011490         0.02461       0.05688            0.01885   
   symmetry_se  fractal_dimension_se  radius_worst  texture_worst   
0      0.03003              0.006193         25.38          17.33  \
1      0.01389              0.003532         24.99          23.41   
2      0.02250              0.004571         23.57          25.53   
3      0.05963              0.009208         14.91          26.50   
4      0.01756              0.005115         22.54          16.67   
   perimeter_worst  area_worst  smoothness_worst  compactness_worst   
0           184.60      2019.0            0.1622             0.6656  \
1           158.80      1956.0            0.1238             0.1866   
2           152.50      1709.0            0.1444             0.4245   
3            98.87       567.7            0.2098             0.8663   
4           152.20      1575.0            0.1374             0.2050   
   concavity_worst  concave points_worst  symmetry_worst   
0           0.7119                0.2654          0.4601  \
1           0.2416                0.1860          0.2750   
2           0.4504                0.2430          0.3613   
3           0.6869                0.2575          0.6638   
4           0.4000                0.1625          0.2364   
   fractal_dimension_worst  Unnamed: 32  
0                  0.11890          NaN  
1                  0.08902          NaN  
2                  0.08758          NaN  
3                  0.17300          NaN  
4                  0.07678          NaN  
Find duplicate records
duplicate_count = df.duplicated().sum()
print(f'duplicate_count : {duplicate_count}')
The above snippet prints the following output.
duplicate_count : 0
Print unique counts and missing values
unique_counts = df.nunique()
missing_values = df.isnull().sum()
result_df = pd.DataFrame({'Data Type': df.dtypes,
                          'unique_count': unique_counts,
                          'missing_values': missing_values})
print(result_df)
The above snippet prints the following output.
                         Data Type  unique_count  missing_values
id                           int64           569               0
diagnosis                   object             2               0
radius_mean                float64           456               0
texture_mean               float64           479               0
perimeter_mean             float64           522               0
area_mean                  float64           539               0
smoothness_mean            float64           474               0
compactness_mean           float64           537               0
concavity_mean             float64           537               0
concave points_mean        float64           542               0
symmetry_mean              float64           432               0
fractal_dimension_mean     float64           499               0
radius_se                  float64           540               0
texture_se                 float64           519               0
perimeter_se               float64           533               0
area_se                    float64           528               0
smoothness_se              float64           547               0
compactness_se             float64           541               0
concavity_se               float64           533               0
concave points_se          float64           507               0
symmetry_se                float64           498               0
fractal_dimension_se       float64           545               0
radius_worst               float64           457               0
texture_worst              float64           511               0
perimeter_worst            float64           514               0
area_worst                 float64           544               0
smoothness_worst           float64           411               0
compactness_worst          float64           529               0
concavity_worst            float64           539               0
concave points_worst       float64           492               0
symmetry_worst             float64           500               0
fractal_dimension_worst    float64           535               0
Unnamed: 32                float64             0             569
Get basic statistics of the numeric columns to look for outliers
numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
print(round(df[numerical_columns].describe()))
                id  radius_mean  texture_mean  perimeter_mean  area_mean   
count        569.0        569.0         569.0           569.0      569.0  \
mean    30371831.0         14.0          19.0            92.0      655.0   
std    125020586.0          4.0           4.0            24.0      352.0   
min         8670.0          7.0          10.0            44.0      144.0   
25%       869218.0         12.0          16.0            75.0      420.0   
50%       906024.0         13.0          19.0            86.0      551.0   
75%      8813129.0         16.0          22.0           104.0      783.0   
max    911320502.0         28.0          39.0           188.0     2501.0   
       smoothness_mean  compactness_mean  concavity_mean  concave points_mean   
count            569.0             569.0           569.0                569.0  \
mean               0.0               0.0             0.0                  0.0   
std                0.0               0.0             0.0                  0.0   
min                0.0               0.0             0.0                  0.0   
25%                0.0               0.0             0.0                  0.0   
50%                0.0               0.0             0.0                  0.0   
75%                0.0               0.0             0.0                  0.0   
max                0.0               0.0             0.0                  0.0   
       symmetry_mean  fractal_dimension_mean  radius_se  texture_se   
count          569.0                   569.0      569.0       569.0  \
mean             0.0                     0.0        0.0         1.0   
std              0.0                     0.0        0.0         1.0   
min              0.0                     0.0        0.0         0.0   
25%              0.0                     0.0        0.0         1.0   
50%              0.0                     0.0        0.0         1.0   
75%              0.0                     0.0        0.0         1.0   
max              0.0                     0.0        3.0         5.0   
       perimeter_se  area_se  smoothness_se  compactness_se  concavity_se   
count         569.0    569.0          569.0           569.0         569.0  \
mean            3.0     40.0            0.0             0.0           0.0   
std             2.0     45.0            0.0             0.0           0.0   
min             1.0      7.0            0.0             0.0           0.0   
25%             2.0     18.0            0.0             0.0           0.0   
50%             2.0     25.0            0.0             0.0           0.0   
75%             3.0     45.0            0.0             0.0           0.0   
max            22.0    542.0            0.0             0.0           0.0   
       concave points_se  symmetry_se  fractal_dimension_se  radius_worst   
count              569.0        569.0                 569.0         569.0  \
mean                 0.0          0.0                   0.0          16.0   
std                  0.0          0.0                   0.0           5.0   
min                  0.0          0.0                   0.0           8.0   
25%                  0.0          0.0                   0.0          13.0   
50%                  0.0          0.0                   0.0          15.0   
75%                  0.0          0.0                   0.0          19.0   
max                  0.0          0.0                   0.0          36.0   
       texture_worst  perimeter_worst  area_worst  smoothness_worst   
count          569.0            569.0       569.0             569.0  \
mean            26.0            107.0       881.0               0.0   
std              6.0             34.0       569.0               0.0   
min             12.0             50.0       185.0               0.0   
25%             21.0             84.0       515.0               0.0   
50%             25.0             98.0       686.0               0.0   
75%             30.0            125.0      1084.0               0.0   
max             50.0            251.0      4254.0               0.0   
       compactness_worst  concavity_worst  concave points_worst   
count              569.0            569.0                 569.0  \
mean                 0.0              0.0                   0.0   
std                  0.0              0.0                   0.0   
min                  0.0              0.0                   0.0   
25%                  0.0              0.0                   0.0   
50%                  0.0              0.0                   0.0   
75%                  0.0              0.0                   0.0   
max                  1.0              1.0                   0.0   
       symmetry_worst  fractal_dimension_worst  Unnamed: 32  
count           569.0                    569.0          0.0  
mean              0.0                      0.0          NaN  
std               0.0                      0.0          NaN  
min               0.0                      0.0          NaN  
25%               0.0                      0.0          NaN  
50%               0.0                      0.0          NaN  
75%               0.0                      0.0          NaN  
max               1.0                      0.0          NaN  
Looking at the above statistics, I can sense that some columns have outliers. To keep things simple, I am going with the IQR method to find the outlier columns.
def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns
For each numeric column, the above function:
- calculates the first quartile (Q1), the third quartile (Q3), and the IQR (IQR = Q3 - Q1);
- defines lower and upper bounds using the IQR and a specified threshold (default is 1.5 times the IQR);
- marks the column as an outlier column if it has any value below the lower bound or above the upper bound.
 
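To see the bounds in action on a single column, here is a small worked sketch with hypothetical values:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])              # one extreme value
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)   # Q1 = 2.0, Q3 = 4.0
IQR = Q3 - Q1                                 # 2.0
lower, upper = Q1 - 3 * IQR, Q3 + 3 * IQR     # -4.0 and 10.0 with threshold 3
print(s[(s < lower) | (s > upper)].tolist())  # [100] -> this column has outliers

Now apply the function to our dataset with a threshold of 3.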
outlier_columns = find_outlier_columns(df, 3)
print(f'Total outlier_columns : {len(outlier_columns)}')
print(f'outlier_columns : \n{outlier_columns}')
The above snippet prints the following insights.
Total outlier_columns : 24
outlier_columns : 
['id', 'radius_mean', 'texture_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst']
Preprocess the data
Drop the columns ‘Unnamed: 32’ (it does not have a single non-null value) and ‘id’ (it is just a row identifier and carries no predictive signal).
df.drop('Unnamed: 32', axis=1, inplace=True)
df.drop('id', axis=1, inplace=True)
Encode the ‘diagnosis’ column data
We have only one non-numeric column, ‘diagnosis’, and it has only two unique values, so we can go with binary encoding.
label_encoder = LabelEncoder()
df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])
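If you want to confirm which label maps to which integer, label_encoder.classes_ holds the original labels in encoded order (index 0 encodes to 0, index 1 to 1); for this dataset that means B -> 0 and M -> 1.

print(label_encoder.classes_)   # ['B' 'M']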
Replace outliers with the lower and upper bounds
The following function does that: every value below the lower bound is replaced with the lower bound, and every value above the upper bound with the upper bound.
def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                # Cap the values at the computed bounds
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

# Apply it with a threshold of 3, matching the outlier detection above
find_and_replace_outlier_columns(df, 3)
Train and test the model
Split the data into features (X) and target (y).
X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']
Standardize the features. KNN is a distance-based algorithm, so features with larger numeric ranges would dominate the distance calculation. StandardScaler rescales every feature to zero mean and unit variance so that all features contribute comparably.
sc = StandardScaler()
X = sc.fit_transform(X)
Divide the data into training and testing samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)
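Note that the scaler above is fitted on the full feature matrix before splitting, so statistics from the (future) test rows influence the scaling. A stricter variant (a sketch, assuming X is the unscaled feature matrix) splits first and fits the scaler on the training portion only:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learn mean/std from the training rows only
X_test = sc.transform(X_test)         # reuse the training statistics on the test rows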
Get an instance of KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
Train the model.
knn_classifier.fit(X_train, y_train)
Validate the model against the test sample.
y_pred = knn_classifier.predict(X_test)
Get the accuracy score.
accuracy = accuracy_score(y_test, y_pred)
knn.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns


def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])


def basic_analysis(df):
    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]
    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    five_rows = df.head()
    print(f'five_rows : \n{five_rows}')

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'\nduplicate_count : {duplicate_count}')

    # Print unique counts, missing values count, data types column wise
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes,
                              'unique_count': unique_counts,
                              'missing_values': missing_values})
    print(result_df)

    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    print(round(df[numerical_columns].describe()))

    outlier_columns = find_outlier_columns(df, 3)
    print(f'Total outlier_columns : {len(outlier_columns)}')
    print(f'outlier_columns : \n{outlier_columns}')


def preprocess_data(df):
    df.drop('Unnamed: 32', axis=1, inplace=True)
    df.drop('id', axis=1, inplace=True)
    label_encoder = LabelEncoder()
    df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])
    find_and_replace_outlier_columns(df, 3)


def train_test_model(df):
    # Split the data into features (X) and target (y)
    X = df.drop(['diagnosis'], axis=1)
    y = df['diagnosis']

    sc = StandardScaler()
    X = sc.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

    # Initialize the K-NN classifier (Let's choose k=3)
    knn_classifier = KNeighborsClassifier(n_neighbors=3)

    # Fit the classifier to the training data
    knn_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = knn_classifier.predict(X_test)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)


df = pd.read_csv('KNNAlgorithmDataset.csv')
# basic_analysis(df)
preprocess_data(df)
train_test_model(df)
Output
Accuracy: 0.9824561403508771
We can enhance the above program to evaluate different neighbor counts.
training_accuracy = []
testing_accuracy = []
for neighbor_count in range(1, 20):
    # Initialize the K-NN classifier with k=neighbor_count
    knn_classifier = KNeighborsClassifier(n_neighbors=neighbor_count)

    # Fit the classifier to the training data
    knn_classifier.fit(X_train, y_train)

    training_accuracy.append(knn_classifier.score(X_train, y_train))
    testing_accuracy.append(knn_classifier.score(X_test, y_test))
Find the complete working application below.
knn_diff_neighbors.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])


def preprocess_data(df):
    df.drop('Unnamed: 32', axis=1, inplace=True)
    df.drop('id', axis=1, inplace=True)
    label_encoder = LabelEncoder()
    df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])
    find_and_replace_outlier_columns(df, 3)


def train_test_model(df):
    # Split the data into features (X) and target (y)
    X = df.drop(['diagnosis'], axis=1)
    y = df['diagnosis']

    sc = StandardScaler()
    X = sc.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

    training_accuracy = []
    testing_accuracy = []
    for neighbor_count in range(1, 20):
        # Initialize the K-NN classifier with k=neighbor_count
        knn_classifier = KNeighborsClassifier(n_neighbors=neighbor_count)

        # Fit the classifier to the training data
        knn_classifier.fit(X_train, y_train)

        training_accuracy.append(knn_classifier.score(X_train, y_train))
        testing_accuracy.append(knn_classifier.score(X_test, y_test))

    num = [i for i in range(1, 20)]
    score = pd.DataFrame({'neighbor_count': num,
                          'training_score': training_accuracy,
                          'testing_score': testing_accuracy})
    print(score)

    plt.plot(num, training_accuracy, label='Training accuracy')
    plt.plot(num, testing_accuracy, label='Testing accuracy')
    plt.legend()
    plt.show()


df = pd.read_csv('KNNAlgorithmDataset.csv')
preprocess_data(df)
train_test_model(df)
Output
    neighbor_count  training_score  testing_score
0                1        1.000000       0.970760
1                2        0.972362       0.976608
2                3        0.982412       0.982456
3                4        0.969849       0.982456
4                5        0.974874       0.982456
5                6        0.967337       0.976608
6                7        0.972362       0.976608
7                8        0.969849       0.976608
8                9        0.972362       0.976608
9               10        0.962312       0.970760
10              11        0.967337       0.964912
11              12        0.962312       0.959064
12              13        0.962312       0.964912
13              14        0.957286       0.947368
14              15        0.964824       0.947368
15              16        0.957286       0.953216
16              17        0.959799       0.947368
17              18        0.954774       0.941520
18              19        0.957286       0.941520
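From the table, the testing score peaks around k=3 to 5 and declines for larger k. Instead of relying on a single train/test split, you can also pick k by cross-validation; here is a sketch using cross_val_score (assuming the preprocessed X and y from the program above):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean 5-fold cross-validated accuracy for a few candidate values of k
for k in (3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())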