Transitioning from a Single Tree to an Endless Forest: Learning the Decision Logic of Random Forests
You are likely already familiar with Decision Trees: they are easy to understand, they make decisions one step at a time much as a person would, and because they are so readable, their logic is easy to communicate.
However, Decision Trees used alone have a notable limitation: they are very sensitive to the training data. Change just a few rows, and the entire tree can grow into a significantly different shape. In machine learning terms, this is known as high variance. Think of a student who gets every answer right on a practice test, but cannot answer the same questions on the final exam once the wording changes.
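To see this instability concretely, here is a minimal sketch (assuming scikit-learn and a synthetic dataset, not the IoT data used later): two trees trained on samples that overlap in 95% of their rows can still end up with different structures and disagreeing predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two trees trained on heavily overlapping slices of the same data
tree_a = DecisionTreeClassifier(random_state=0).fit(X[:190], y[:190])
tree_b = DecisionTreeClassifier(random_state=0).fit(X[10:], y[10:])

# Same algorithm, almost the same data -- yet the fitted trees can differ
print("nodes:", tree_a.tree_.node_count, "vs", tree_b.tree_.node_count)
print("disagreements:", (tree_a.predict(X) != tree_b.predict(X)).sum())
```

Random Forest is designed precisely to smooth out this kind of sensitivity.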



Random Forest Algorithm: How it Works #
Each Tree is Built from its own Random Sample of Data
Random Forest builds many trees, and each tree is trained on its own random sample of the training data, drawn with replacement (a bootstrap sample). As a result, each tree may learn from a somewhat different subset of the rows.
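Bootstrap sampling is easy to sketch with NumPy (the row count here is a made-up toy value): drawing n indices with replacement means some rows appear multiple times while others are left out entirely.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10  # toy dataset size

# Bootstrap sample: draw n_rows indices *with replacement*
boot_idx = rng.integers(0, n_rows, size=n_rows)
distinct = len(set(boot_idx.tolist()))

print("sampled indices:", sorted(boot_idx.tolist()))
# On average only ~63% of rows appear in any one bootstrap sample
print("distinct rows used:", distinct)
```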
Each Tree Uses Randomly Sampled Features
While building each individual tree, the random forest algorithm does not consider all of the features/columns of the training dataset at once; at each potential split, it evaluates only a random subset of the available features. This forces the trees to rely on different features, making their predictions less correlated.
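A minimal sketch of this feature subsampling, using hypothetical column names (the real names depend on your dataset): at each split, only a random subset of features is considered as candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature names for illustration
features = np.array(["Flow Duration", "Total Fwd Packet", "Total Bwd packets",
                     "Fwd IAT Mean", "Bwd IAT Mean", "Packet Length Std",
                     "Flow Bytes/s", "Flow Packets/s", "ACK Flag Count"])

# sqrt(p) is a common default for classification forests
m = int(np.sqrt(len(features)))
split_candidates = rng.choice(features, size=m, replace=False)
print(split_candidates)
```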
All Trees Predict Values
All of the trees in the random forest build their own predictions based on what they learned from the data they trained on. Therefore, all of the trees produce their own outputs.
Prediction Combination
For classification problems, the random forest's prediction is the class predicted by the majority of the trees, i.e., a vote. For regression problems, it is the mean of all of the trees' predictions.
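The combination step can be sketched with plain NumPy; the per-tree outputs below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical class predictions from 5 trees for 3 samples (rows = trees)
tree_preds = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
])

# Classification: majority vote per sample (column)
votes = (tree_preds.sum(axis=0) > tree_preds.shape[0] / 2).astype(int)
print(votes)  # [1 0 1]

# Regression: average the trees' numeric outputs per sample
reg_preds = np.array([[2.0, 3.0], [2.4, 2.6], [1.6, 3.4]])
print(reg_preds.mean(axis=0))  # [2. 3.]
```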
Why Does it Work?
Randomizing both the training rows and the candidate features for each tree reduces overfitting: the individual trees make largely uncorrelated errors, so averaging them lowers the variance of the overall prediction and improves its accuracy and reliability.
The Implementation Pipeline #
Before we code, let’s look at the logical flow of our system:
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model & Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, precision_recall_curve,
                             roc_curve, auc)
from sklearn.preprocessing import StandardScaler

# Settings
%matplotlib inline
sns.set(style="whitegrid")
Load Dataset
# Load the dataset
df = pd.read_csv('/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv')
# Handling potential Infinite values or NaNs from flow calculations
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
Show First 10 Rows
df.head(10)
Output

Summary Statistics
df.describe()
Output

Check Class Balance
# Check class balance
print(f"Dataset Shape: {df.shape}")
print(df['Label'].value_counts())
Output
Dataset Shape: (1175222, 24)
Label
1    866152
0    309070
Name: count, dtype: int64
Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Flow Duration', y='Total Fwd Packet', hue='Label', palette='viridis', s=100)
plt.title('Scatter Plot: Flow Duration vs Total Forward Packets')
plt.xlabel(r'Flow Duration ($\mu s$)')
plt.ylabel('Total Forward Packets')
plt.legend(title='Label (0: Normal, 1: Attack)')
plt.grid(True)
plt.savefig('scatter_plot.png')
plt.show()
Step Plot
# 2. Create the Step Plot for Flow Duration
# Plot only the first few rows; with over a million samples,
# per-point markers would be unreadable
subset = df.head(20)
plt.figure(figsize=(10, 5))
plt.step(subset.index, subset["Flow Duration"], where='post', label='Flow Duration', marker='o', linewidth=2)
plt.title("Step Plot: Flow Duration Transition")
plt.xlabel("Sample Index")
plt.ylabel("Duration (microseconds)")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Step Plot
plt.figure(figsize=(10, 5))
features = ["Total Fwd Packet", "Total Bwd packets", "Total Length of Fwd Packet"]
subset = df.head(20)  # again, plot a small slice of the data
for feature in features:
    plt.step(subset.index, subset[feature], where='post', label=feature, marker='s')
plt.title("Step Plot: Packet Feature Comparison")
plt.xlabel("Sample Index")
plt.ylabel("Count / Length")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Training Model
# Define Features (X) and Target (y)
X = df.drop('Label', axis=1)
y = df['Label']
# Split the data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the Random Forest and generate predictions
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
y_prob = rf_model.predict_proba(X_test)[:, 1]  # probability of the 'Attack' class
Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Normal', 'Attack'], yticklabels=['Normal', 'Attack'])
plt.title('Confusion Matrix: IoT Intrusion Detection')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
Top 10 Important Network Features
importances = rf_model.feature_importances_
indices = np.argsort(importances)[-10:] # Top 10 features
plt.figure(figsize=(10, 6))
plt.title('Top 10 Important Network Features')
plt.barh(range(len(indices)), importances[indices], color='teal', align='center')
plt.yticks(range(len(indices)), [X.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Receiver Operating Characteristic (ROC)
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
Evaluate the Model
print("--- Final Evaluation Report ---")
print(classification_report(y_test, y_pred))
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred):.4%}")
Output
--- Final Evaluation Report ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     61814
           1       1.00      1.00      1.00    173231

    accuracy                           1.00    235045
   macro avg       1.00      1.00      1.00    235045
weighted avg       1.00      1.00      1.00    235045

Overall Accuracy: 99.9545%