k-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used to address both classification and regression problems.
How does it work?
It finds the k training data points nearest to the new data point and predicts the category (label) or value of the new point from the labels or values of those k nearest neighbors.
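To make this concrete, here is a minimal from-scratch sketch of KNN classification using NumPy. The knn_predict helper and the toy points are illustrative assumptions, not library code or part of the original article.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Compute the Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two classes in 2-D space
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # A
print(knn_predict(X_train, y_train, np.array([6.5, 6.5]), k=3))  # B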
Pros
- Easy to implement
- Easy to understand
- It can be used for both regression and classification tasks
Cons
- Computationally expensive for large datasets.
- It is sensitive to the hyperparameter k.
If k is too small, the prediction becomes sensitive to noise in the training data (high variance) and can lead to misclassification. If k is too large, the decision is smoothed toward the majority class and the computation becomes more expensive.
Have a look at the image below. The rectangle and star shapes are the training data points. When we need to predict the outcome for the new data point, represented by the triangle shape, the result depends on the value of k.
- If k=5, the algorithm selects 3 star data points and 2 rectangle data points.
- If k=9, the algorithm selects 4 star data points and 5 rectangle data points.
From these two cases, you can see that the model's outcome changes when k is altered from 5 to 9.
How are the k nearest neighbors of a given data point identified?
By calculating the distance between the given point and all other points in the dataset, we can find its k nearest neighbors.
The following distance metrics are commonly used to calculate the distance between two points.
- Euclidean Distance
- Manhattan Distance
- Minkowski distance
Euclidean Distance
Suppose you have two points (x1, y1) and (x2, y2) in a two-dimensional space. The Euclidean distance is calculated using the formula below.
dist = sqrt((x1 - x2)^2 + (y1 - y2)^2)
Where
- x1 and x2 are the x-coordinates of the two points
- y1 and y2 are the y-coordinates of the two points
Suppose you have two points in a three-dimensional space; the formula looks like below.
dist = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)
where:
- x1 and x2 are the x-coordinates of the two points
- y1 and y2 are the y-coordinates of the two points
- z1 and z2 are the z-coordinates of the two points
Euclidean distance is a good choice for many applications, but be cautious as it is sensitive to outliers.
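As a small illustrative sketch, the formula translates directly to NumPy. The euclidean_distance helper and sample points below are assumptions added for illustration.

import numpy as np

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0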
Manhattan Distance
Suppose you have two points (x1, y1) and (x2, y2) in a two-dimensional space. The Manhattan distance is calculated using the formula below.
dist = |x1 - x2| + |y1 - y2|
Suppose you have two points (x1, y1, z1) and (x2, y2, z2) in a three-dimensional space. The Manhattan distance is calculated using the formula below.
dist = |x1 - x2| + |y1 - y2| + |z1 - z2|
Manhattan Distance is easy to understand and implement.
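As a worked example, take the points (1, 2) and (4, 6): the Manhattan distance is |1 - 4| + |2 - 6| = 3 + 4 = 7, while the Euclidean distance between the same points is sqrt(9 + 16) = 5.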
Minkowski distance
It's a generalization of both the Euclidean distance and the Manhattan distance.
Formula
In simple terms, it is given below.
dist = (|x1 - x2|^p + |y1 - y2|^p + |z1 - z2|^p + ...)^(1/p)
Where
- x1 and x2 are the x-coordinates of the two points
- y1 and y2 are the y-coordinates of the two points
- z1 and z2 are the z-coordinates of the two points
- p is a real number. When p=1, the Minkowski distance is equal to the Manhattan distance, and when p=2, it is equal to the Euclidean distance. We can experiment with different values of p to find the distance metric that best fits a given problem. Note that a larger value of p gives more weight to the largest coordinate differences, making the distance more sensitive to outlying coordinates, while a smaller p treats all coordinate differences more evenly.
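The sketch below shows the generalization in code; the minkowski_distance helper is an illustrative assumption. With p=1 it reproduces the Manhattan distance and with p=2 the Euclidean distance from the worked example above.

import numpy as np

def minkowski_distance(point_a, point_b, p=2):
    # (sum of |coordinate differences|^p) ^ (1/p)
    diffs = np.abs(np.asarray(point_a) - np.asarray(point_b))
    return np.sum(diffs ** p) ** (1 / p)

print(minkowski_distance((1, 2), (4, 6), p=1))  # 7.0 (Manhattan)
print(minkowski_distance((1, 2), (4, 6), p=2))  # 5.0 (Euclidean)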
Real-world examples of the KNN algorithm
a. The KNN algorithm is used in image classification tasks. Given an unknown image, it is compared against the k nearest labelled images from the training set, and the label is decided from those neighbors.
b. It is used to detect fraudulent transactions by comparing new transactions to a set of known fraudulent transactions.
c. It is used in recommender systems. For example, in a news recommender system, we can recommend news to a user based on the preferences of the k nearest users who have a similar reading history.
d. You can use the KNN algorithm to predict house prices in a locality.
e. We can group customers by their interests using the KNN algorithm.
Example in scikit-learn

from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with k=7 neighbors
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels for the test data
y_pred = knn.predict(X_test)
Dataset used to demonstrate the application.
https://www.kaggle.com/datasets/gkalpolukcu/knn-algorithm-dataset/download?datasetVersionNumber=1
Let's try to understand the above dataset (the Breast Cancer Wisconsin data, where the diagnosis column marks each record as malignant 'M' or benign 'B').
Get the shape (number of rows and columns) of the dataset
import pandas as pd

# Load the downloaded dataset
df = pd.read_csv('KNNAlgorithmDataset.csv')

shape = df.shape
total_rows = shape[0]
total_columns = shape[1]
print(f'\ntotal_rows : {total_rows}')
print(f'total_columns : {total_columns}')
The above snippet prints the following information.
total_rows : 569
total_columns : 33
Print the column names to get a quick glance at the data definition.
columns = df.columns
print(f'\ncolumns : {columns}')
columns : Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'], dtype='object')
Let's get basic information about the dataframe, such as column names, data types, and non-null counts.
df.info()
The above snippet prints the following information.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
From the above output, we can confirm that:
- There are 32 numeric columns (31 of type float64 and 1 of type int64).
- One column (diagnosis) is of type object.
- All columns except 'Unnamed: 32' have no null values.
- The column 'Unnamed: 32' contains only null values.
Print the first five rows to get a feel for the content.
five_rows = df.head()
print(f'five_rows : \n{five_rows}')
The above snippet prints the following data.
five_rows :
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean
0    842302         M        17.99         10.38          122.80     1001.0
1    842517         M        20.57         17.77          132.90     1326.0
2  84300903         M        19.69         21.25          130.00     1203.0
3  84348301         M        11.42         20.38           77.58      386.1
4  84358402         M        20.29         14.34          135.10     1297.0

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean
0          0.11840           0.27760          0.3001              0.14710
1          0.08474           0.07864          0.0869              0.07017
2          0.10960           0.15990          0.1974              0.12790
3          0.14250           0.28390          0.2414              0.10520
4          0.10030           0.13280          0.1980              0.10430

   symmetry_mean  fractal_dimension_mean  radius_se  texture_se  perimeter_se
0         0.2419                 0.07871     1.0950      0.9053         8.589
1         0.1812                 0.05667     0.5435      0.7339         3.398
2         0.2069                 0.05999     0.7456      0.7869         4.585
3         0.2597                 0.09744     0.4956      1.1560         3.445
4         0.1809                 0.05883     0.7572      0.7813         5.438

   area_se  smoothness_se  compactness_se  concavity_se  concave points_se
0   153.40       0.006399         0.04904       0.05373            0.01587
1    74.08       0.005225         0.01308       0.01860            0.01340
2    94.03       0.006150         0.04006       0.03832            0.02058
3    27.23       0.009110         0.07458       0.05661            0.01867
4    94.44       0.011490         0.02461       0.05688            0.01885

   symmetry_se  fractal_dimension_se  radius_worst  texture_worst
0      0.03003              0.006193         25.38          17.33
1      0.01389              0.003532         24.99          23.41
2      0.02250              0.004571         23.57          25.53
3      0.05963              0.009208         14.91          26.50
4      0.01756              0.005115         22.54          16.67

   perimeter_worst  area_worst  smoothness_worst  compactness_worst
0           184.60      2019.0            0.1622             0.6656
1           158.80      1956.0            0.1238             0.1866
2           152.50      1709.0            0.1444             0.4245
3            98.87       567.7            0.2098             0.8663
4           152.20      1575.0            0.1374             0.2050

   concavity_worst  concave points_worst  symmetry_worst
0           0.7119                0.2654          0.4601
1           0.2416                0.1860          0.2750
2           0.4504                0.2430          0.3613
3           0.6869                0.2575          0.6638
4           0.4000                0.1625          0.2364

   fractal_dimension_worst  Unnamed: 32
0                  0.11890          NaN
1                  0.08902          NaN
2                  0.08758          NaN
3                  0.17300          NaN
4                  0.07678          NaN
Find duplicate records
duplicate_count = df.duplicated().sum()
print(f'duplicate_count : {duplicate_count}')
The above snippet prints the following output.
duplicate_count : 0
Print unique counts and missing values
unique_counts = df.nunique()
missing_values = df.isnull().sum()
result_df = pd.DataFrame({'Data Type': df.dtypes,
                          'unique_count': unique_counts,
                          'missing_values' : missing_values})
print(result_df)
The above snippet prints the following output.
                        Data Type  unique_count  missing_values
id                          int64           569               0
diagnosis                  object             2               0
radius_mean               float64           456               0
texture_mean              float64           479               0
perimeter_mean            float64           522               0
area_mean                 float64           539               0
smoothness_mean           float64           474               0
compactness_mean          float64           537               0
concavity_mean            float64           537               0
concave points_mean       float64           542               0
symmetry_mean             float64           432               0
fractal_dimension_mean    float64           499               0
radius_se                 float64           540               0
texture_se                float64           519               0
perimeter_se              float64           533               0
area_se                   float64           528               0
smoothness_se             float64           547               0
compactness_se            float64           541               0
concavity_se              float64           533               0
concave points_se         float64           507               0
symmetry_se               float64           498               0
fractal_dimension_se      float64           545               0
radius_worst              float64           457               0
texture_worst             float64           511               0
perimeter_worst           float64           514               0
area_worst                float64           544               0
smoothness_worst          float64           411               0
compactness_worst         float64           529               0
concavity_worst           float64           539               0
concave points_worst      float64           492               0
symmetry_worst            float64           500               0
fractal_dimension_worst   float64           535               0
Unnamed: 32               float64             0             569
Get basic statistics of the numeric columns' data to look for outliers.
numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
print(round(df[numerical_columns].describe()))
                id  radius_mean  texture_mean  perimeter_mean  area_mean
count        569.0        569.0         569.0           569.0      569.0
mean    30371831.0         14.0          19.0            92.0      655.0
std    125020586.0          4.0           4.0            24.0      352.0
min         8670.0          7.0          10.0            44.0      144.0
25%       869218.0         12.0          16.0            75.0      420.0
50%       906024.0         13.0          19.0            86.0      551.0
75%      8813129.0         16.0          22.0           104.0      783.0
max    911320502.0         28.0          39.0           188.0     2501.0

       smoothness_mean  compactness_mean  concavity_mean  concave points_mean
count            569.0             569.0           569.0                569.0
mean               0.0               0.0             0.0                  0.0
std                0.0               0.0             0.0                  0.0
min                0.0               0.0             0.0                  0.0
25%                0.0               0.0             0.0                  0.0
50%                0.0               0.0             0.0                  0.0
75%                0.0               0.0             0.0                  0.0
max                0.0               0.0             0.0                  0.0

       symmetry_mean  fractal_dimension_mean  radius_se  texture_se
count          569.0                   569.0      569.0       569.0
mean             0.0                     0.0        0.0         1.0
std              0.0                     0.0        0.0         1.0
min              0.0                     0.0        0.0         0.0
25%              0.0                     0.0        0.0         1.0
50%              0.0                     0.0        0.0         1.0
75%              0.0                     0.0        0.0         1.0
max              0.0                     0.0        3.0         5.0

       perimeter_se  area_se  smoothness_se  compactness_se  concavity_se
count         569.0    569.0          569.0           569.0         569.0
mean            3.0     40.0            0.0             0.0           0.0
std             2.0     45.0            0.0             0.0           0.0
min             1.0      7.0            0.0             0.0           0.0
25%             2.0     18.0            0.0             0.0           0.0
50%             2.0     25.0            0.0             0.0           0.0
75%             3.0     45.0            0.0             0.0           0.0
max            22.0    542.0            0.0             0.0           0.0

       concave points_se  symmetry_se  fractal_dimension_se  radius_worst
count              569.0        569.0                 569.0         569.0
mean                 0.0          0.0                   0.0          16.0
std                  0.0          0.0                   0.0           5.0
min                  0.0          0.0                   0.0           8.0
25%                  0.0          0.0                   0.0          13.0
50%                  0.0          0.0                   0.0          15.0
75%                  0.0          0.0                   0.0          19.0
max                  0.0          0.0                   0.0          36.0

       texture_worst  perimeter_worst  area_worst  smoothness_worst
count          569.0            569.0       569.0             569.0
mean            26.0            107.0       881.0               0.0
std              6.0             34.0       569.0               0.0
min             12.0             50.0       185.0               0.0
25%             21.0             84.0       515.0               0.0
50%             25.0             98.0       686.0               0.0
75%             30.0            125.0      1084.0               0.0
max             50.0            251.0      4254.0               0.0

       compactness_worst  concavity_worst  concave points_worst
count              569.0            569.0                 569.0
mean                 0.0              0.0                   0.0
std                  0.0              0.0                   0.0
min                  0.0              0.0                   0.0
25%                  0.0              0.0                   0.0
50%                  0.0              0.0                   0.0
75%                  0.0              0.0                   0.0
max                  1.0              1.0                   0.0

       symmetry_worst  fractal_dimension_worst  Unnamed: 32
count           569.0                    569.0          0.0
mean              0.0                      0.0          NaN
std               0.0                      0.0          NaN
min               0.0                      0.0          NaN
25%               0.0                      0.0          NaN
50%               0.0                      0.0          NaN
75%               0.0                      0.0          NaN
max               1.0                      0.0          NaN
Looking at the statistics above, I can sense that some columns have outliers. (Note that round() truncates the small-valued columns to 0 in this table; the gap between the 75% value and the max in columns like area_mean, area_se, and area_worst is what hints at outliers.) To keep things simple, I am going with the IQR method to find the outlier columns.
def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns
For each numeric column, the above function:
- Calculates the first quartile (Q1), the third quartile (Q3), and the interquartile range (IQR = Q3 - Q1).
- Defines lower and upper bounds using the IQR and a specified threshold (default is 1.5 times the IQR).
- If the column has a value that is either less than the lower bound or greater than the upper bound, marks it as an outlier column.
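As a quick worked example with made-up numbers: if a column has Q1 = 12 and Q3 = 16, then IQR = 4; with a threshold of 1.5, the bounds are 12 - 1.5 * 4 = 6 and 16 + 1.5 * 4 = 22, so any value outside [6, 22] flags the column as an outlier column.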
outlier_columns = find_outlier_columns(df, 3)
print(f'Total outlier_columns : {len(outlier_columns)}')
print(f'outlier_columns : \n{outlier_columns}')
The above snippet prints the following insights.
Total outlier_columns : 24
outlier_columns :
['id', 'radius_mean', 'texture_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst']
Preprocess the data
Drop the 'id' column, as it carries no predictive information, and the 'Unnamed: 32' column, as it does not have any non-null values.

df.drop('Unnamed: 32', axis=1, inplace=True)
df.drop('id', axis=1, inplace=True)
Encode the ‘diagnosis’ column data
We have only one non-numeric column, 'diagnosis', and it has only two unique values, so we can go with label (binary) encoding.
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])
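Since LabelEncoder assigns integers to classes in sorted order, after this step 'B' (benign) becomes 0 and 'M' (malignant) becomes 1.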
Replace outliers with the lower and upper bounds
The following function will do that.
def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                # Clip values outside the IQR bounds to the nearest bound
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

find_and_replace_outlier_columns(df, 3)
Train and test the model
Split the data into features (X) and target (y).
X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']
Standardize the features. KNN is a distance-based algorithm, so features with larger numeric ranges would otherwise dominate the distance calculation; StandardScaler brings all features onto a comparable scale.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X = sc.fit_transform(X)
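One caveat worth noting: fitting the scaler on the full dataset before splitting lets test-set statistics influence the transform. A stricter approach is to call fit_transform on X_train only and transform on X_test.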
Divide the data into training and testing samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)
Get an instance of KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
Train the model.
knn_classifier.fit(X_train, y_train)
Validate the model against test sample.
y_pred = knn_classifier.predict(X_test)
Get the accuracy score.
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
knn.py
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns

def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

def basic_analysis(df):
    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]
    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    five_rows = df.head()
    print(f'five_rows : \n{five_rows}')

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'\nduplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes,
                              'unique_count': unique_counts,
                              'missing_values' : missing_values})
    print(result_df)

    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    print(round(df[numerical_columns].describe()))

    outlier_columns = find_outlier_columns(df, 3)
    print(f'Total outlier_columns : {len(outlier_columns)}')
    print(f'outlier_columns : \n{outlier_columns}')

def preprocess_data(df):
    df.drop('Unnamed: 32', axis=1, inplace=True)
    df.drop('id', axis=1, inplace=True)

    label_encoder = LabelEncoder()
    df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])

    find_and_replace_outlier_columns(df, 3)
    # print(round(df.describe()))

def train_test_model(df):
    # Split the data into features (X) and target (y)
    X = df.drop(['diagnosis'], axis=1)
    y = df['diagnosis']

    sc = StandardScaler()
    X = sc.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

    # Initialize the K-NN classifier (let's choose k=3)
    knn_classifier = KNeighborsClassifier(n_neighbors=3)

    # Fit the classifier to the training data
    knn_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = knn_classifier.predict(X_test)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

df = pd.read_csv('KNNAlgorithmDataset.csv')
# basic_analysis(df)
preprocess_data(df)
train_test_model(df)
Output
Accuracy: 0.9824561403508771
We can enhance the above program to validate against different neighbor counts.
training_accuracy = []
testing_accuracy = []
for neighbor_count in range(1, 20):
    # Initialize the K-NN classifier with k=neighbor_count
    knn_classifier = KNeighborsClassifier(n_neighbors=neighbor_count)

    # Fit the classifier to the training data
    knn_classifier.fit(X_train, y_train)

    training_accuracy.append(knn_classifier.score(X_train, y_train))
    testing_accuracy.append(knn_classifier.score(X_test, y_test))
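For a scikit-learn classifier, score(X, y) returns the mean accuracy on the given data, so this is equivalent to calling accuracy_score on the predictions; it simply saves the explicit predict step.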
The complete working application is below.
knn_diff_neighbors.py
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Adjust display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

def find_outlier_columns(df, threshold=1.5):
    outlier_columns = []
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                outlier_columns.append(column)
    return outlier_columns

def find_and_replace_outlier_columns(df, threshold=1.5):
    for column in df.columns:
        if df[column].dtype in [np.int64, np.float64]:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
            if not outliers.empty:
                df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
                df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

def basic_analysis(df):
    shape = df.shape
    total_rows = shape[0]
    total_columns = shape[1]
    print(f'\ntotal_rows : {total_rows}')
    print(f'total_columns : {total_columns}')

    columns = df.columns
    print(f'\ncolumns : {columns}')

    print('\nDetailed information')
    df.info()

    five_rows = df.head()
    print(f'five_rows : \n{five_rows}')

    print('\nColumn wise missing values')
    column_wise_missing_values = df.isnull().sum()
    print(f'column_wise_missing_values : {column_wise_missing_values}')

    # Check for duplicate rows based on all columns
    duplicate_count = df.duplicated().sum()
    print(f'\nduplicate_count : {duplicate_count}')

    # Print unique counts
    print(f'\nUnique counts, missing values count, data types column wise')
    unique_counts = df.nunique()
    missing_values = df.isnull().sum()
    result_df = pd.DataFrame({'Data Type': df.dtypes,
                              'unique_count': unique_counts,
                              'missing_values' : missing_values})
    print(result_df)

    numerical_columns = [var for var in df.columns if df[var].dtype != 'O']
    print(round(df[numerical_columns].describe()))

    outlier_columns = find_outlier_columns(df, 3)
    print(f'Total outlier_columns : {len(outlier_columns)}')
    print(f'outlier_columns : \n{outlier_columns}')

def preprocess_data(df):
    df.drop('Unnamed: 32', axis=1, inplace=True)
    df.drop('id', axis=1, inplace=True)

    label_encoder = LabelEncoder()
    df['diagnosis'] = label_encoder.fit_transform(df['diagnosis'])

    find_and_replace_outlier_columns(df, 3)
    # print(round(df.describe()))

def train_test_model(df):
    # Split the data into features (X) and target (y)
    X = df.drop(['diagnosis'], axis=1)
    y = df['diagnosis']

    sc = StandardScaler()
    X = sc.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

    training_accuracy = []
    testing_accuracy = []
    for neighbor_count in range(1, 20):
        # Initialize the K-NN classifier with k=neighbor_count
        knn_classifier = KNeighborsClassifier(n_neighbors=neighbor_count)

        # Fit the classifier to the training data
        knn_classifier.fit(X_train, y_train)

        training_accuracy.append(knn_classifier.score(X_train, y_train))
        testing_accuracy.append(knn_classifier.score(X_test, y_test))

    num = [i for i in range(1, 20)]
    score = pd.DataFrame({'neighbor_count': num,
                          'training_score': training_accuracy,
                          'testing_score': testing_accuracy})
    print(score)

    plt.plot(num, training_accuracy, label='Training accuracy')
    plt.plot(num, testing_accuracy, label='Testing accuracy')
    plt.legend()
    plt.show()

df = pd.read_csv('KNNAlgorithmDataset.csv')
# basic_analysis(df)
preprocess_data(df)
train_test_model(df)
Output
    neighbor_count  training_score  testing_score
0                1        1.000000       0.970760
1                2        0.972362       0.976608
2                3        0.982412       0.982456
3                4        0.969849       0.982456
4                5        0.974874       0.982456
5                6        0.967337       0.976608
6                7        0.972362       0.976608
7                8        0.969849       0.976608
8                9        0.972362       0.976608
9               10        0.962312       0.970760
10              11        0.967337       0.964912
11              12        0.962312       0.959064
12              13        0.962312       0.964912
13              14        0.957286       0.947368
14              15        0.964824       0.947368
15              16        0.957286       0.953216
16              17        0.959799       0.947368
17              18        0.954774       0.941520
18              19        0.957286       0.941520
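From this table, the testing accuracy peaks around k=3 to k=5 (about 0.982) and gradually declines for larger k, which illustrates the earlier point about the model's sensitivity to the choice of k.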