What is Multivariate Regression? #
Multivariate Regression is a statistical and machine learning method used when you want to predict more than one output variable simultaneously. Instead of creating separate models for each target, this approach builds a single unified model that learns the relationships between input features and multiple outputs at the same time.
This is particularly powerful when the output variables are interrelated, because the model can capture and use those relationships to improve prediction accuracy.

Key Characteristics #
- Produces two or more outputs in one prediction
- Uses a single model to handle multiple target variables
- Relies on multiple input features for prediction
- Performs better when outputs are correlated or dependent
- Reduces redundancy compared to building separate models
Why Use Multivariate Regression? #
In many real-world problems, outputs are not independent. Modeling them together allows the algorithm to:
- Capture shared patterns between targets
- Improve overall prediction performance
- Reduce training time and complexity
- Provide more realistic and consistent predictions
Example #
Imagine you want to predict a student’s performance. Instead of predicting each subject separately, you can build one model that predicts both:
- Math marks
- Science marks
using inputs such as:
- Study hours
- Class attendance
Since performance in different subjects is often related, the model can learn these connections and make better joint predictions.
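This joint setup can be tried directly in scikit-learn, whose `LinearRegression` accepts a two-dimensional target matrix and fits all outputs in one model. The numbers below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: each row is one student.
# Inputs: [study hours, class attendance %]
X = np.array([
    [2, 60],
    [4, 70],
    [6, 80],
    [8, 90],
    [10, 95],
])

# Outputs: [math marks, science marks] -- two targets per student
Y = np.array([
    [50, 55],
    [60, 62],
    [70, 71],
    [80, 78],
    [88, 86],
])

# A single model learns both targets at once:
# scikit-learn's LinearRegression supports a 2-D target matrix natively.
model = LinearRegression()
model.fit(X, Y)

# One call predicts both subjects jointly.
pred = model.predict([[5, 75]])
print(pred.shape)  # one row, two predicted marks
```

Because both columns of `Y` are fit against the same design matrix, the prediction for a new student comes back as a single row containing both marks.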
Architecture of Multivariate Regression #
To understand multivariate regression properly, it helps to start with a quick recap of simple linear regression. In simple linear regression, we predict a single output using one input feature. As we move forward, multiple linear regression allows several inputs but still predicts only one output.
Multivariate regression goes a step further — it enables us to predict multiple outputs at the same time using one unified model. Instead of writing separate equations for each target variable, we represent everything in a matrix form, which makes computations efficient and scalable.
Mathematical Representation #
The general form of multivariate regression is:

Y = XB + ε

Although this equation looks simple, it represents an entire system in which multiple outputs are predicted simultaneously in a structured way.
Components of the Equation #
Y (Output Matrix) #
- Contains all the target variables
- Shape: (n × p) → where n = number of observations, p = number of outputs
- Example: Math and Science scores
X (Input Matrix) #
- Includes all input features
- Shape: (n × (k+1)) → includes intercept term
- Example: Study hours, attendance
B (Coefficient Matrix) #
- Stores weights (coefficients) for each feature-output pair
- Shape: ((k+1) × p)
- Each column corresponds to one output variable
ε (Error Matrix) #
- Represents the difference between actual and predicted values
- Captures noise or unexplained variation in the data
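Putting the shapes above together, the matrix equation can be written out as:

```latex
\underbrace{Y}_{n \times p}
\;=\;
\underbrace{X}_{n \times (k+1)}\,
\underbrace{B}_{(k+1) \times p}
\;+\;
\underbrace{\varepsilon}_{n \times p}
```

Each of the p columns of B holds the coefficients for one output, so multiplying X by B produces all p predictions for every observation in a single matrix product.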
How It Works #
Instead of solving separate equations like:
- Math = f(inputs)
- Science = f(inputs)
Multivariate regression solves them together in one system. This allows the model to:
- Learn shared patterns between outputs
- Capture relationships among target variables
- Improve prediction accuracy when outputs are correlated
Working of Multivariate Regression #
Multivariate regression follows a systematic process to learn from data and generate predictions for multiple outputs simultaneously. Instead of handling each target separately, the model works in a step-by-step pipeline, using matrix operations to make the process efficient and scalable.
Step 1: Prepare Input and Output Matrices #
The first step is to organize the dataset into matrix form:
- X (Input Matrix)
  Contains all the independent variables (features)
  Example: Area, number of rooms
- Y (Output Matrix)
  Contains multiple dependent variables (targets)
  Example: House price and rental value
Structuring data this way allows the model to process multiple outputs in a unified manner.
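As a minimal sketch of this step, the house example can be laid out with NumPy; the figures are invented for illustration, and a column of ones is prepended to X so the intercept is learned as part of B:

```python
import numpy as np

# Hypothetical raw features: [area (sq ft), number of rooms]
area_rooms = np.array([
    [1000.0, 2],
    [1500.0, 3],
    [2000.0, 3],
    [2500.0, 4],
])

# Prepend a column of ones so the intercept becomes the first row of B.
n = area_rooms.shape[0]
X = np.hstack([np.ones((n, 1)), area_rooms])  # shape (n, k+1) = (4, 3)

# Two targets per house: [sale price, monthly rental value]
Y = np.array([
    [200000.0, 1200.0],
    [280000.0, 1600.0],
    [350000.0, 1900.0],
    [430000.0, 2300.0],
])  # shape (n, p) = (4, 2)

print(X.shape, Y.shape)
```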
Step 2: Estimate the Coefficient Matrix #
To determine the optimal weights, we compute the coefficient matrix B using the normal equation adapted for multiple outputs:
B = (XᵀX)⁻¹XᵀY
This formula finds the values of B that minimize the overall prediction error across all outputs.
Understanding the Components #
- Xᵀ (Transpose of X)
  Converts rows into columns and vice versa
- (XᵀX)⁻¹ (Matrix Inverse)
  Helps in solving the system of equations uniquely
- XᵀY
  Captures the relationship between inputs and outputs
By combining these operations, we obtain the best-fit coefficient matrix.
Step 3: Generate Predictions #
Once the coefficient matrix is computed, predictions for all outputs are made simultaneously using:
Ŷ = XB
Here, Ŷ (Y hat) represents the predicted values for all target variables.
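Steps 2 and 3 can be sketched with NumPy on the same made-up house data from Step 1. One assumption worth flagging: `np.linalg.pinv` is used in place of a plain matrix inverse so the sketch still works when XᵀX is singular or ill-conditioned:

```python
import numpy as np

# Same toy setup as Step 1: intercept column plus two features.
X = np.array([
    [1.0, 1000.0, 2],
    [1.0, 1500.0, 3],
    [1.0, 2000.0, 3],
    [1.0, 2500.0, 4],
])
Y = np.array([
    [200000.0, 1200.0],
    [280000.0, 1600.0],
    [350000.0, 1900.0],
    [430000.0, 2300.0],
])

# Step 2: normal equation B = (X^T X)^{-1} X^T Y,
# with a pseudo-inverse for numerical safety.
B = np.linalg.pinv(X.T @ X) @ X.T @ Y  # shape (k+1, p) = (3, 2)

# Step 3: all outputs predicted in one matrix product.
Y_hat = X @ B

print(B.shape)                  # (3, 2): one coefficient column per target
print(np.round(Y_hat - Y, 2))  # residuals across both targets
```

Each column of `B` holds the intercept and feature weights for one target, so the single product `X @ B` yields price and rental predictions for every house at once.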
How the Process Works Together #
- Data is structured into matrices
- Coefficients are calculated using linear algebra
- Predictions are generated in one operation
This approach ensures that all outputs are learned jointly, allowing the model to capture dependencies between them.
Implementing Linear Regression in Python #
Before we code, let’s look at the logical flow of our system:
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
Load Dataset
# Load the dataset
df = pd.read_csv('/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv')
# For demonstration, we'll assume 'df' is already loaded
df.head(10)
Distribution of Flow Duration by Traffic Type (0=Normal, 1=Attack)
# 1. Setting the visual style
sns.set(style="whitegrid")
# 2. Select key features
features_to_plot = ['Flow Duration', 'Total Fwd Packets', 'Protocol']
# 3. Create the figure
plt.figure(figsize=(12, 6))
# FIXED: Added hue='Label' and legend=False to resolve the FutureWarning
sns.violinplot(
x='Label',
y='Flow Duration',
hue='Label',
data=df,
palette='muted',
split=True,
legend=False
)
# 4. Adding titles and labels
plt.title('Distribution of Flow Duration by Traffic Type (0=Normal, 1=Attack)')
plt.xlabel('Traffic Label')
plt.ylabel('Flow Duration (Microseconds)')
# Applying log scale
plt.yscale('log')
plt.show()
Distribution of Normal vs. Attack Traffic
# Setting the style
sns.set_theme(style="whitegrid")
# Create the Bar Plot for Class Counts
plt.figure(figsize=(8, 6))
# FIXED: Added hue='Label' and legend=False
ax = sns.countplot(x='Label', data=df, hue='Label', palette='viridis', legend=False)
# Adding labels for clarity
plt.title('Distribution of Normal vs. Attack Traffic')
plt.xlabel('Traffic Label (0: Normal, 1: Attack)')
plt.ylabel('Number of Samples')
# Adding count annotations on top of bars
for p in ax.patches:
ax.annotate(f'{int(p.get_height())}', (p.get_x() + 0.3, p.get_height() + 5000))
plt.show()
Strip Plot: Flow Duration across Traffic Types
# 1. Set visual style
sns.set_theme(style="ticks")
# 2. Select a feature to visualize
# Sampling for performance
df_sample = df.sample(n=10000, random_state=42)
# 3. Create the Strip Plot
plt.figure(figsize=(10, 6))
sns.stripplot(
data=df_sample,
x='Label',
y='Flow Duration',
hue='Label',
jitter=True,
alpha=0.5,
palette='viridis',
dodge=True,
legend=False # Keeps the plot clean as 'Label' is on the x-axis
)
# 4. Refine the plot
plt.title('Strip Plot: Flow Duration across Traffic Types')
plt.xlabel('Traffic Label (0: Normal, 1: Attack)')
# FIXED: Added 'r' before the string to handle the LaTeX backslash correctly
plt.ylabel(r'Flow Duration ($\mu s$)')
# Using log scale
plt.yscale('log')
plt.show()
Correlation Heatmap: IoT-IDS Feature Relationships
# 1. Calculate the Correlation Matrix
# We calculate how every feature relates to every other feature (-1 to 1)
corr_matrix = df.corr(numeric_only=True)
# 2. Set up the figure
plt.figure(figsize=(16, 10))
# 3. Create the Heatmap
# cmap='coolwarm': Blue for negative correlation, Red for positive
# annot=True: Shows the numerical value in each cell
sns.heatmap(
corr_matrix,
annot=True,
fmt=".2f",
cmap='coolwarm',
linewidths=0.5,
cbar_kws={"shrink": .8}
)
# 4. Add title and adjust layout
plt.title('Correlation Heatmap: IoT-IDS Feature Relationships', fontsize=20)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
Training Model
# Dropping rows with missing values
df.dropna(inplace=True)
# Defining Features (X) and Target (y)
# Using the engineered flow features described in the dataset metadata
features = ['Src Port', 'Dst Port', 'Protocol', 'Flow Duration',
'Total Fwd Packet', 'Total Bwd packets', 'Flow IAT Min']
X = df[features]
y = df['Label']
# Split the data (80% Training, 20% Testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scaling features for better convergence
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Train the Model
# Initialize the model
model = LinearRegression()
# Fit the model
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
coefficients = pd.DataFrame({'Feature': features, 'Coefficient': model.coef_})
print(coefficients.sort_values(by='Coefficient', ascending=False))

Output:
             Feature  Coefficient
6       Flow IAT Min     0.122328
0           Src Port     0.103173
1           Dst Port     0.041321
4   Total Fwd Packet     0.002735
5  Total Bwd packets    -0.008765
3      Flow Duration    -0.148633
2           Protocol    -0.199873
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared Score: {r2:.4f}")

Output:
Mean Squared Error: 0.1237
R-squared Score: 0.3600
Actual vs Predicted Labels (Regression)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.1)
plt.plot([0, 1], [0, 1], color='red', linestyle='--') # Perfect prediction line
plt.title('Actual vs Predicted Labels (Regression)')
plt.xlabel('Actual Label (0 or 1)')
plt.ylabel('Predicted Value')
plt.show()
