Transitioning from a Single Tree to an Endless Forest: Learning the Decision Logic of Random Forests
You are likely already familiar with Decision Trees: they are easy to understand, they make decisions one step at a time much as a person would, and because they are so readable, their logic is easy to communicate.
However, Decision Trees used alone have a notable limitation: they are very sensitive to the training data. Change just a few rows, and the entire tree can grow into a significantly different shape. In machine learning terms, this is known as high variance. Think of a student who gets every answer right on a practice test, but cannot answer the same questions on the final exam once the wording changes.
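To see this instability concretely, here is a minimal sketch (assuming scikit-learn and a synthetic dataset, not the IoT data used later): two trees trained on samples that overlap in 95% of their rows can still end up with different structures and disagreeing predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two trees trained on heavily overlapping slices of the same data
tree_a = DecisionTreeClassifier(random_state=0).fit(X[:190], y[:190])
tree_b = DecisionTreeClassifier(random_state=0).fit(X[10:], y[10:])

# Same algorithm, almost the same data -- yet the fitted trees can differ
print("nodes:", tree_a.tree_.node_count, "vs", tree_b.tree_.node_count)
print("disagreements:", (tree_a.predict(X) != tree_b.predict(X)).sum())
```

Random Forest is designed precisely to smooth out this kind of sensitivity.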



Random Forest Algorithm: How it Works #
Each Tree is Built from its own Random Sample of Data
Random Forest builds many trees, and each tree is trained on its own random sample of the training data, drawn with replacement (a bootstrap sample). As a result, each tree may learn from a somewhat different subset of the rows.
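Bootstrap sampling is easy to sketch with NumPy (the row count here is a made-up toy value): drawing n indices with replacement means some rows appear multiple times while others are left out entirely.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10  # toy dataset size

# Bootstrap sample: draw n_rows indices *with replacement*
boot_idx = rng.integers(0, n_rows, size=n_rows)
distinct = len(set(boot_idx.tolist()))

print("sampled indices:", sorted(boot_idx.tolist()))
# On average only ~63% of rows appear in any one bootstrap sample
print("distinct rows used:", distinct)
```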
Each Tree Uses Randomly Sampled Features
While building each individual tree, the random forest algorithm does not consider all of the features/columns of the training dataset at once; at each potential split, it evaluates only a random subset of the available features. This forces the trees to rely on different features, making their predictions less correlated.
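A minimal sketch of this feature subsampling, using hypothetical column names (the real names depend on your dataset): at each split, only a random subset of features is considered as candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature names for illustration
features = np.array(["Flow Duration", "Total Fwd Packet", "Total Bwd packets",
                     "Fwd IAT Mean", "Bwd IAT Mean", "Packet Length Std",
                     "Flow Bytes/s", "Flow Packets/s", "ACK Flag Count"])

# sqrt(p) is a common default for classification forests
m = int(np.sqrt(len(features)))
split_candidates = rng.choice(features, size=m, replace=False)
print(split_candidates)
```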
All Trees Predict Values
All of the trees in the random forest build their own predictions based on what they learned from the data they trained on. Therefore, all of the trees produce their own outputs.
Prediction Combination
For classification problems, the random forest's prediction is the class predicted by the majority of the trees, i.e., a vote. For regression problems, it is the mean of all of the trees' predictions.
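The combination step can be sketched with plain NumPy; the per-tree outputs below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical class predictions from 5 trees for 3 samples (rows = trees)
tree_preds = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
])

# Classification: majority vote per sample (column)
votes = (tree_preds.sum(axis=0) > tree_preds.shape[0] / 2).astype(int)
print(votes)  # [1 0 1]

# Regression: average the trees' numeric outputs per sample
reg_preds = np.array([[2.0, 3.0], [2.4, 2.6], [1.6, 3.4]])
print(reg_preds.mean(axis=0))  # [2. 3.]
```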
Why Does it Work?
Randomizing both the training rows and the candidate features for each tree reduces overfitting: the individual trees make largely uncorrelated errors, so averaging them lowers the variance of the overall prediction and improves its accuracy and reliability.
The Implementation Pipeline #
Before we code, let’s look at the logical flow of our system:
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model & Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, precision_recall_curve,
                             roc_curve, auc)
from sklearn.preprocessing import StandardScaler

# Settings
%matplotlib inline
sns.set(style="whitegrid")
Load Dataset
# Load the dataset
df = pd.read_csv('/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv')
# Handling potential Infinite values or NaNs from flow calculations
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
Show First 10 Rows
df.head(10)
Output

Summary Statistics
df.describe()
Output

Check Class Balance
# Check class balance
print(f"Dataset Shape: {df.shape}")
print(df['Label'].value_counts())
Output
Dataset Shape: (1175222, 24)
Label
1    866152
0    309070
Name: count, dtype: int64
Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Flow Duration', y='Total Fwd Packet', hue='Label', palette='viridis', s=100)
plt.title('Scatter Plot: Flow Duration vs Total Forward Packets')
plt.xlabel(r'Flow Duration ($\mu s$)')
plt.ylabel('Total Forward Packets')
plt.legend(title='Label (0: Normal, 1: Attack)')
plt.grid(True)
plt.savefig('scatter_plot.png')
plt.show()
Step Plot
# 2. Create the Step Plot for Flow Duration
# Plot only the first few rows; with over a million samples,
# per-point markers would be unreadable
subset = df.head(20)
plt.figure(figsize=(10, 5))
plt.step(subset.index, subset["Flow Duration"], where='post', label='Flow Duration', marker='o', linewidth=2)
plt.title("Step Plot: Flow Duration Transition")
plt.xlabel("Sample Index")
plt.ylabel("Duration (microseconds)")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Step Plot
plt.figure(figsize=(10, 5))
features = ["Total Fwd Packet", "Total Bwd packets", "Total Length of Fwd Packet"]
subset = df.head(20)  # again, plot a small slice of the data
for feature in features:
    plt.step(subset.index, subset[feature], where='post', label=feature, marker='s')
plt.title("Step Plot: Packet Feature Comparison")
plt.xlabel("Sample Index")
plt.ylabel("Count / Length")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Training Model
# Define Features (X) and Target (y)
X = df.drop('Label', axis=1)
y = df['Label']
# Split the data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the Random Forest and generate predictions
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
y_prob = rf_model.predict_proba(X_test)[:, 1]  # probability of the 'Attack' class
Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Normal', 'Attack'], yticklabels=['Normal', 'Attack'])
plt.title('Confusion Matrix: IoT Intrusion Detection')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Top 10 Important Network Features
importances = rf_model.feature_importances_
indices = np.argsort(importances)[-10:] # Top 10 features
plt.figure(figsize=(10, 6))
plt.title('Top 10 Important Network Features')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [X.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Receiver Operating Characteristic (ROC)
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
Evaluate the Model
print("--- Final Evaluation Report ---")
print(classification_report(y_test, y_pred))
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred):.4%}")
Output
--- Final Evaluation Report ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     61814
           1       1.00      1.00      1.00    173231

    accuracy                           1.00    235045
   macro avg       1.00      1.00      1.00    235045
weighted avg       1.00      1.00      1.00    235045

Overall Accuracy: 99.9545%