The K-Nearest Neighbors (KNN) algorithm is a simple yet widely used machine learning technique for classification and regression. To make a prediction for a given input, KNN finds the K most similar (i.e., closest) data points to that input and uses either their most common class (for classification) or their average value (for regression) as the result.
KNN can be used to classify based on the similarity of nearby data points.
KNN employs distance metrics (such as Euclidean distance) to calculate how far apart data points are from one another, and therefore to determine which points are a given input's nearest neighbors.
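As a quick illustration, Euclidean distance is just the straight-line distance between two feature vectors. A minimal sketch (the helper name and sample points are made up for this example):

```python
import math

# Euclidean distance between two points given as coordinate lists
def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0 (a classic 3-4-5 triangle)
```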
KNN is an instance-based learning method and therefore makes no assumptions about the underlying distribution of the input data (unlike many other machine learning algorithms).
The K-Nearest Neighbors (KNN) algorithm is often called a “lazy learner” because it does not build a model from the training data up front. Instead, KNN simply stores the entire dataset and, at prediction time, performs all the computation necessary to produce a result.
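This “store everything, compute at query time” behavior can be sketched from scratch. The class below is a toy illustration, not a substitute for scikit-learn; the class name, sample points, and labels are invented for this example:

```python
import math
from collections import Counter

class SimpleKNN:
    """Minimal KNN classifier: 'training' is just storing the data."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learning: no model is built here, the dataset is memorized
        self.X, self.y = X, y
        return self

    def predict_one(self, point):
        # All the work happens at query time: distances, sort, majority vote
        dists = sorted(
            (math.dist(point, x), label) for x, label in zip(self.X, self.y)
        )
        top_k = [label for _, label in dists[:self.k]]
        return Counter(top_k).most_common(1)[0][0]

knn = SimpleKNN(k=3).fit([[0, 0], [0, 1], [5, 5], [6, 5]], ['A', 'A', 'B', 'B'])
print(knn.predict_one([5, 6]))  # 'B' (two of its three nearest neighbors are 'B')
```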
For example, consider two classes: Category 1 and Category 2.

KNN assigns a category to a new point based on the points nearest to it. As shown in the following images, KNN predicts whether or not a new point belongs to a category based on the locations of its closest neighbors.
In the image, green represents points belonging to Category 1 and red represents points belonging to Category 2. The new point is located where the circles appear. Counting the neighbors classified as red (Category 2), we see that most of the new point's neighbors are red, so KNN determines that the new point belongs to Category 2.
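The majority vote described above is straightforward to express in code. A small sketch (the neighbor labels here are hypothetical, standing in for the five nearest points in the figure):

```python
from collections import Counter

# Hypothetical labels of the 5 nearest neighbors to the new point
neighbor_labels = ['red', 'red', 'green', 'red', 'green']

# Majority vote: the most common label wins
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # 'red' -> Category 2 in the example above
```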
The Implementation Pipeline
Before we code, let’s look at the logical flow of our system:
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Load Dataset
# 1. Load the dataset
file_path = '/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv'
df = pd.read_csv(file_path)
# 2. Clean column names
df.columns = df.columns.str.strip()
# 3. Handle missing values
df.fillna(df.median(numeric_only=True), inplace=True)
# 4. Define Features (X) and Target (y)
X = df.drop('Label', axis=1)
y = df['Label']
# 5. Feature Scaling (Essential for KNN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 6. Split into Training (80%) and Testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
print(f"Preprocessing Complete. Training samples: {X_train.shape[0]}")
Preprocessing Complete. Training samples: 940177
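Step 5 above labels feature scaling as essential for KNN, and it is worth seeing why: distance metrics treat every feature equally, so a feature measured in large units swamps the others. A toy sketch with made-up numbers (the two columns are hypothetical features on very different scales):

```python
import numpy as np

# Two toy features on very different scales (e.g. packet bytes vs. a ratio)
X = np.array([[1000.0, 0.1],
              [2000.0, 0.9],
              [1500.0, 0.5]])

# Unscaled: the first feature dominates any Euclidean distance
print(round(float(np.linalg.norm(X[0] - X[1])), 1))  # 1000.0

# Standardize each column (what StandardScaler does): zero mean, unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Scaled: both features now contribute comparably to the distance
print(round(float(np.linalg.norm(X_scaled[0] - X_scaled[1])), 3))
```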
3D Scatter Plot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
# Sample data
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)
# Create figure
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
# Scatter plot
ax.scatter(x, y, z)
# Labels
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
ax.set_zlabel('Z Axis')
plt.title('3D Scatter Plot')
plt.show()
LM Plot of Total Bill vs. Tip
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset
df = sns.load_dataset('tips')
# LM plot
sns.lmplot(x='total_bill', y='tip', data=df)
plt.show()
Joint Plot of Total Bill vs. Tip
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset
df = sns.load_dataset('tips')
# Joint plot
sns.jointplot(x='total_bill', y='tip', data=df)
plt.show()
Violin Plot of Total Bill by Day
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset (you can replace with your own df)
# Example: using built-in dataset
df = sns.load_dataset('tips')
plt.figure(figsize=(8, 5))
# Violin plot
sns.violinplot(x='day', y='total_bill', data=df)
plt.title('Violin Plot of Total Bill by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show()
Swarm Plot of Total Bill by Day
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset
df = sns.load_dataset('tips')
plt.figure(figsize=(8, 5))
# Swarm plot
sns.swarmplot(x='day', y='total_bill', data=df)
plt.title('Swarm Plot of Total Bill by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show()
Train the K-Nearest Neighbors (KNN) Model
# Initialize the model
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the model to the training data
knn.fit(X_train, y_train)

Evaluate the Model
# Make predictions
y_pred = knn.predict(X_test)
# Print Performance Metrics
print("--- KNN Evaluation Metrics ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Create Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(7,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds', xticklabels=['Normal', 'Attack'], yticklabels=['Normal', 'Attack'])
plt.title('Confusion Matrix: IoT Intrusion Detection')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
--- KNN Evaluation Metrics ---
Accuracy: 0.9969
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99     61522
           1       1.00      1.00      1.00    173523

    accuracy                           1.00    235045
   macro avg       1.00      1.00      1.00    235045
weighted avg       1.00      1.00      1.00    235045
Error Rate vs. K Value
error_rate = []
# This may take a few minutes given the large dataset size
for i in range(1, 11):
    knn_i = KNeighborsClassifier(n_neighbors=i)
    knn_i.fit(X_train, y_train)
    pred_i = knn_i.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1, 11), error_rate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red')
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
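Once the loop has filled `error_rate`, the plot can be read off programmatically: the best K is simply the one with the lowest error. A minimal sketch (the error values below are hypothetical placeholders for the loop's actual output):

```python
import numpy as np

# Hypothetical error rates for k = 1..10 (substitute the loop's real output)
error_rate = [0.0040, 0.0030, 0.0031, 0.0032, 0.0031,
              0.0033, 0.0034, 0.0035, 0.0036, 0.0037]

# argmin gives a 0-based index, so add 1 because k starts at 1
best_k = int(np.argmin(error_rate)) + 1
print(f"Lowest error rate at k = {best_k}")  # k = 2 for these placeholder values
```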
