The K-Nearest Neighbors (KNN) algorithm is a simple yet widely used machine learning technique for classification and regression. To make a prediction for a given input, KNN finds the K most similar (i.e., closest) data points to that input and uses either their most common class (for classification) or their average value (for regression) as the result.
KNN can be used to classify based on the similarity of nearby data points.
KNN employs distance metrics (such as Euclidean distance) to calculate how far apart data points are from one another, and therefore to determine which points are a given input's nearest neighbors.
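As a quick illustration, Euclidean distance is just the straight-line distance between two feature vectors. A minimal sketch (the helper name and sample points are made up for this example):

```python
import math

# Euclidean distance between two points given as coordinate lists
def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0 (a classic 3-4-5 triangle)
```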
KNN is an instance-based learning method and therefore makes no assumptions about the underlying distribution of the input data (unlike many other machine learning algorithms).
The K-Nearest Neighbors (KNN) algorithm is often called a “lazy learner” because it does not build a model from the training data up front. Instead, KNN simply stores the entire dataset and, at prediction time, performs all the computation necessary to produce a result.
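This “store everything, compute at query time” behavior can be sketched from scratch. The class below is a toy illustration, not a substitute for scikit-learn; the class name, sample points, and labels are invented for this example:

```python
import math
from collections import Counter

class SimpleKNN:
    """Minimal KNN classifier: 'training' is just storing the data."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learning: no model is built here, the dataset is memorized
        self.X, self.y = X, y
        return self

    def predict_one(self, point):
        # All the work happens at query time: distances, sort, majority vote
        dists = sorted(
            (math.dist(point, x), label) for x, label in zip(self.X, self.y)
        )
        top_k = [label for _, label in dists[:self.k]]
        return Counter(top_k).most_common(1)[0][0]

knn = SimpleKNN(k=3).fit([[0, 0], [0, 1], [5, 5], [6, 5]], ['A', 'A', 'B', 'B'])
print(knn.predict_one([5, 6]))  # 'B' (two of its three nearest neighbors are 'B')
```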
For example, consider two classes: Category 1 and Category 2.

KNN assigns a category to a new point based on the points nearest to it. As shown in the following images, KNN predicts whether or not a new point belongs to a category based on the locations of its closest neighbors.
In the image, green represents points belonging to Category 1 and red represents points belonging to Category 2. The new point is located where the circles appear. Counting the neighbors classified as red (Category 2), we see that most of the new point's neighbors are red, so KNN determines that the new point belongs to Category 2.
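The majority vote described above is straightforward to express in code. A small sketch (the neighbor labels here are hypothetical, standing in for the five nearest points in the figure):

```python
from collections import Counter

# Hypothetical labels of the 5 nearest neighbors to the new point
neighbor_labels = ['red', 'red', 'green', 'red', 'green']

# Majority vote: the most common label wins
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # 'red' -> Category 2 in the example above
```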
The Implementation Pipeline
Before we code, let’s look at the logical flow of our system:
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Load Dataset
# 1. Load the dataset
file_path = '/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv'
df = pd.read_csv(file_path)
# 2. Clean column names
df.columns = df.columns.str.strip()
# 3. Handle missing values
df.fillna(df.median(numeric_only=True), inplace=True)
# 4. Define Features (X) and Target (y)
X = df.drop('Label', axis=1)
y = df['Label']
# 5. Feature Scaling (Essential for KNN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 6. Split into Training (80%) and Testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
print(f"Preprocessing Complete. Training samples: {X_train.shape[0]}")
Preprocessing Complete. Training samples: 940177
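Step 5 above labels feature scaling as essential for KNN, and it is worth seeing why: distance metrics treat every feature equally, so a feature measured in large units swamps the others. A toy sketch with made-up numbers (the two columns are hypothetical features on very different scales):

```python
import numpy as np

# Two toy features on very different scales (e.g. packet bytes vs. a ratio)
X = np.array([[1000.0, 0.1],
              [2000.0, 0.9],
              [1500.0, 0.5]])

# Unscaled: the first feature dominates any Euclidean distance
print(round(float(np.linalg.norm(X[0] - X[1])), 1))  # 1000.0

# Standardize each column (what StandardScaler does): zero mean, unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Scaled: both features now contribute comparably to the distance
print(round(float(np.linalg.norm(X_scaled[0] - X_scaled[1])), 3))
```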
3D Scatter Plot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
# Sample data
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)
# Create figure
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
# Scatter plot
ax.scatter(x, y, z)
# Labels
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
plt.title('3D Scatter Plot')
plt.show()
LM Plot of Total Bill vs. Tip
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset
df = sns.load_dataset('tips')
# LM plot
sns.lmplot(x='total_bill', y='tip', data=df)
plt.show()
Joint Plot of Total Bill vs. Tip
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset
df = sns.load_dataset('tips')
# Joint plot
sns.jointplot(x='total_bill', y='tip', data=df)
plt.show()
Violin Plot of Total Bill by Day
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset (you can replace with your own df)
# Example: using built-in dataset
df = sns.load_dataset('tips')
plt.figure(figsize=(8, 5))
# Violin plot
sns.violinplot(x='day', y='total_bill', data=df)
plt.title('Violin Plot of Total Bill by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show()
Swarm Plot of Total Bill by Day
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset
df = sns.load_dataset('tips')
plt.figure(figsize=(8, 5))
# Swarm plot
sns.swarmplot(x='day', y='total_bill', data=df)
plt.title('Swarm Plot of Total Bill by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show()
Train the K-Nearest Neighbors (KNN) Model
# Initialize the model
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the model to the training data
knn.fit(X_train, y_train)

Evaluate the Model
# Make predictions
y_pred = knn.predict(X_test)
# Print Performance Metrics
print("--- KNN Evaluation Metrics ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Create Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(7,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds', xticklabels=['Normal', 'Attack'], yticklabels=['Normal', 'Attack'])
plt.title('Confusion Matrix: IoT Intrusion Detection')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
--- KNN Evaluation Metrics ---
Accuracy: 0.9969
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99     61522
           1       1.00      1.00      1.00    173523

    accuracy                           1.00    235045
   macro avg       1.00      1.00      1.00    235045
weighted avg       1.00      1.00      1.00    235045
Error Rate vs. K Value
error_rate = []
# This may take a few minutes given the large dataset size
for i in range(1, 11):
    knn_i = KNeighborsClassifier(n_neighbors=i)
    knn_i.fit(X_train, y_train)
    pred_i = knn_i.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1, 11), error_rate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red')
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
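Once the loop has filled `error_rate`, the plot can be read off programmatically: the best K is simply the one with the lowest error. A minimal sketch (the error values below are hypothetical placeholders for the loop's actual output):

```python
import numpy as np

# Hypothetical error rates for k = 1..10 (substitute the loop's real output)
error_rate = [0.0040, 0.0030, 0.0031, 0.0032, 0.0031,
              0.0033, 0.0034, 0.0035, 0.0036, 0.0037]

# argmin gives a 0-based index, so add 1 because k starts at 1
best_k = int(np.argmin(error_rate)) + 1
print(f"Lowest error rate at k = {best_k}")  # k = 2 for these placeholder values
```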
