Naive Bayes Project

What is Naive Bayes?

Naive Bayes is a probabilistic classifier based on Bayes’ Theorem. It calculates the probability of a label (Attack or Normal) given a set of input features (like Port Number or Protocol).

The “Naive” part comes from a big assumption: the model treats every feature as independent of every other feature once the label is known. In network traffic, we know that Protocol and Port are often related, but by “ignoring” this connection, the model only has to keep a few simple statistics per feature and per class. That makes it fast to train and light on memory, which suits resource-constrained IoT devices.
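
To make the independence assumption concrete, here is a minimal sketch of the calculation done by hand. All probabilities are hypothetical numbers chosen for illustration, not values measured from the dataset, and the feature names are placeholders.

```python
# Hypothetical numbers chosen only for illustration; they do not come from the dataset.
priors = {"Normal": 0.3, "Attack": 0.7}

# P(feature value | class), treated as independent given the class (the "naive" step).
likelihoods = {
    "Normal": {"protocol=TCP": 0.6, "dst_port=80": 0.5},
    "Attack": {"protocol=TCP": 0.9, "dst_port=80": 0.2},
}

def posterior(observed):
    """Return P(class | observed features) using Bayes' theorem."""
    scores = {}
    for label, prior in priors.items():
        # Unnormalized score: prior times the product of per-feature likelihoods.
        score = prior
        for feature in observed:
            score *= likelihoods[label][feature]
        scores[label] = score
    # Divide by the total so the posteriors sum to 1.
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(["protocol=TCP", "dst_port=80"])
print(result)
```

The model never stores how Protocol and Port vary together; it only multiplies one likelihood per feature, which is why so little memory is needed.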

The Implementation Pipeline

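Before we code, let’s look at the logical flow of our system: import the libraries, load the data and select the features, split the data into training and test sets, train the model, make predictions, and evaluate the results.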

Importing the Essentials

We start by importing the libraries needed for data handling (pandas, NumPy), the model and its metrics (scikit-learn), and visualization (Matplotlib, seaborn).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # Gaussian variant for continuous features
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, ConfusionMatrixDisplay

Data Loading and Feature Engineering

Our dataset contains 24 features, but we will focus on 10 of the most critical network attributes. Because most of these features are continuous numbers (like Flow Duration), we use the Gaussian version of the algorithm, which assumes that each feature follows a normal distribution (a “Bell Curve”) within each class.
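
As a rough illustration of the bell-curve assumption, the sketch below evaluates the normal density for one feature value under two classes. The means and variances are hypothetical, and `gaussian_likelihood` is a helper written for this example, not a scikit-learn function.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics for a single feature (not taken from the dataset).
normal_mean, normal_var = 100.0, 400.0
attack_mean, attack_var = 10.0, 25.0

x = 15.0  # the observed feature value
p_x_given_normal = gaussian_likelihood(x, normal_mean, normal_var)
p_x_given_attack = gaussian_likelihood(x, attack_mean, attack_var)
print(p_x_given_normal, p_x_given_attack)
```

A value close to one class's mean gets a much higher density under that class, and this density is what the model multiplies into the class score.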

# Load the IoT dataset
df = pd.read_csv('ACI-IoT-2023.csv')

# Define our 10 key features
features = [
    'Src Port', 'Dst Port', 'Protocol', 'Flow Duration', 
    'Total Fwd Packet', 'Total Bwd packets', 'Total Length of Fwd Packet', 
    'Fwd Packet Length Min', 'Bwd Packet Length Max', 'Flow IAT Min'
]

X = df[features]
y = df['Label'] # 0: Normal, 1: Attack

# Split into 80% Training and 20% Testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Training the Model

This is where the math happens. The model looks at the 80% training data and estimates, for each class, how common that class is (the prior) and the mean and variance of every feature (the likelihood). For example, it learns how typical a Dst Port of 80 is for a normal request versus a DDoS attack.

# Initialize Gaussian Naive Bayes
model = GaussianNB()

# Train the model
model.fit(X_train, y_train)

Making Predictions

Now we give the model the 20% “Test” data it has never seen before. It calculates the probability for both classes and picks the one with the highest score.
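
The per-class probabilities behind each prediction can be inspected with `predict_proba`. The sketch below uses a small synthetic data set (hypothetical values, with made-up names such as `demo_model` and `X_new`) to show that `predict` returns the class with the highest probability.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic data set (hypothetical values): one feature, two well-separated classes.
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
demo_model = GaussianNB().fit(X_demo, y_demo)

X_new = np.array([[2.5], [10.5]])

# One row per sample, one column per class, in the order given by demo_model.classes_.
probabilities = demo_model.predict_proba(X_new)

# Picking the column with the highest probability reproduces predict().
manual_prediction = demo_model.classes_[np.argmax(probabilities, axis=1)]
print(probabilities)
print(manual_prediction, demo_model.predict(X_new))
```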

# Predict the labels for the test set
y_pred = model.predict(X_test)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Detailed Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Output:

Model Accuracy: 80.81%
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.28      0.43     61522
           1       0.80      1.00      0.88    173523

    accuracy                           0.81    235045
   macro avg       0.88      0.64      0.66    235045
weighted avg       0.84      0.81      0.77    235045
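
The recall column can be turned back into approximate row counts, which makes the report easier to read. The sketch below uses only the numbers printed above. Because the report rounds recall to two decimals, the counts are estimates rather than exact values.

```python
# Values copied from the classification report above (recall is rounded to two decimals).
support_normal = 61522    # test rows whose true label is 0 (Normal)
support_attack = 173523   # test rows whose true label is 1 (Attack)
recall_normal = 0.28
recall_attack = 1.00

# Recall = correctly predicted rows of a class / all rows of that class.
normal_kept = round(recall_normal * support_normal)     # Normal rows predicted as Normal
normal_flagged = support_normal - normal_kept           # Normal rows predicted as Attack
attacks_caught = round(recall_attack * support_attack)  # Attack rows predicted as Attack

approx_accuracy = (normal_kept + attacks_caught) / (support_normal + support_attack)
print(normal_kept, normal_flagged, f"{approx_accuracy:.3f}")
```

Read this way, the model catches almost every attack, but it does so by flagging a large share of normal traffic as an attack, which is the trade-off hidden behind the single accuracy figure.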

The Evaluation Matrix Diagram

In cybersecurity, even 99% accuracy isn’t enough if the 1% you missed was a critical system hack. We use a Confusion Matrix to visualize three outcomes:

True Positives: Attacks correctly caught.
False Positives: Normal traffic wrongly blocked (Annoying).
False Negatives: Attacks that slipped through (Dangerous).

# Create the Diagram
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10,7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds')
plt.title('Naive Bayes Evaluation: IoT Attack Detection')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print the final score
print(f"Final Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
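
The outcomes listed above can also be read off the matrix as plain numbers. This is a minimal sketch on toy labels (hypothetical values, with made-up names `y_true_demo` and `y_pred_demo`); the same `ravel()` call should work on `cm` as long as the labels are 0 and 1.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 0 = Normal, 1 = Attack.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1, 1, 1, 0]

# For binary labels 0/1, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

attack_recall = tp / (tp + fn)     # share of attacks that were caught
false_alarm_rate = fp / (fp + tn)  # share of normal traffic wrongly flagged
print(tn, fp, fn, tp, attack_recall, false_alarm_rate)
```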

# Scatter plot of two features, colored by label (uses plt and sns imported earlier)

# 1. Set the plot size
plt.figure(figsize=(12, 7))

# 2. Draw the scatter plot on a 5,000-row sample so it renders quickly (the data has about 1.2M rows)
sample = df.sample(5000, random_state=42)

# Map the numeric Label (0 or 1) to readable names; 'hue' assigns one color per name
traffic_type = sample['Label'].map({0: 'Normal (0)', 1: 'Attack (1)'})
sns.scatterplot(data=sample, x='Flow Duration', y='Total Fwd Packet', hue=traffic_type, palette='viridis', alpha=0.6)

# 3. Titles and labels
plt.title('Scatter Plot: Flow Duration vs Total Fwd Packets (IoT Traffic)')
plt.xlabel('Flow Duration (Microseconds)')
plt.ylabel('Total Forward Packets')
# Let seaborn supply the legend entries; passing labels= here can mismatch colors and names
plt.legend(title='Traffic Type')

# 4. Show the grid so values are easier to read
plt.grid(True, linestyle='--', alpha=0.5)

# 5. Show plot
plt.show()

Conclusion

Naive Bayes proves that you don’t always need a complex “Black Box” Neural Network to secure an IoT environment. Because of its speed and simplicity, it remains a top choice for real-time intrusion detection where every millisecond counts.
