
Lasso Regression Project

Lasso Regression (Least Absolute Shrinkage and Selection Operator) #

Lasso Regression is a type of linear regression that incorporates L1 regularization to improve model performance and prevent overfitting. Unlike standard linear regression, it introduces a penalty term to the loss function, which helps control the complexity of the model.

One of the most powerful aspects of Lasso is its ability to automatically select important features. It does this by shrinking less significant feature coefficients all the way down to zero, effectively removing them from the model. This makes Lasso especially valuable when working with datasets that contain a large number of features.

Key Characteristics #

  • Applies L1 Regularization
    Adds a penalty equal to the sum of the absolute values of coefficients, helping reduce model complexity.
  • Reduces Overfitting
    Prevents the model from fitting noise in the data by constraining coefficient values.
  • Feature Selection Capability
    Forces some coefficients to become exactly zero, automatically eliminating irrelevant features.
  • Promotes Sparse Models
    Produces simpler and more interpretable models by keeping only the most impactful variables.
  • Works Well with High-Dimensional Data
    Particularly useful when the number of features is large compared to the number of observations.

How L1 Regularization Works #

In Lasso Regression, the loss function is modified by adding a penalty term:

Loss = RSS + λ Σᵢ |wᵢ|

Where:

  • RSS = Residual Sum of Squares
  • λ (lambda) = Regularization parameter (controls strength of penalty)
  • wᵢ = Model coefficients

As the value of λ increases, more coefficients are pushed toward zero. Some may become exactly zero, which results in feature selection.
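To see this effect concretely, here is a minimal sketch using scikit-learn on synthetic data (the sample sizes and λ values are arbitrary illustrative choices; note that scikit-learn calls λ "alpha"):

# Illustrative only: count how many coefficients Lasso zeroes out as alpha grows
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 100 features, only 10 of which are actually informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_zero = np.sum(model.coef_ == 0)
    print(f"alpha={alpha}: {n_zero} of 100 coefficients are exactly zero")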

Bias–Variance Tradeoff #

The bias–variance tradeoff is a fundamental concept in machine learning that explains the relationship between a model’s complexity and its performance on new, unseen data.

In simple terms, it highlights the challenge of building a model that is neither too simple nor too complex. A model with the right balance can capture meaningful patterns in the data while still generalizing well beyond the training set.

  • Bias refers to errors caused by overly simplistic assumptions in the model
  • Variance refers to errors caused by the model being too sensitive to small fluctuations in the training data

Improving one often worsens the other, which is why finding the optimal balance is critical.

Role of Lasso Regression in Bias–Variance Tradeoff #

Lasso Regression plays an important role in managing this tradeoff through its regularization parameter (λ or α). By adjusting this parameter, we can control how flexible or constrained the model becomes.

[Figure: Lasso Regression bias–variance tradeoff graph]

Effect of Regularization Strength #

Low λ (Weak Regularization) #

  • Model fits the training data very closely
  • Captures even minor patterns and noise
  • Low bias, high variance
  • High chance of overfitting
  • Performs well on training data but poorly on unseen data

High λ (Strong Regularization) #

  • Coefficients are significantly reduced or forced to zero
  • Model becomes simpler and less flexible
  • Higher bias, lower variance
  • Reduces the risk of overfitting
  • Typically leads to better generalization on new data
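These two regimes are easy to demonstrate empirically. The sketch below (on synthetic data; the λ values are illustrative, not tuned) compares training and test R² under a weak and a strong penalty:

# Illustrative comparison of weak vs. strong regularization
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

# Few samples relative to features, so an unregularized fit tends to overfit
X, y = make_regression(n_samples=100, n_features=80, n_informative=5,
                       noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for alpha, label in [(0.001, 'low lambda (weak)'), (5.0, 'high lambda (strong)')]:
    model = Lasso(alpha=alpha, max_iter=50000).fit(X_tr, y_tr)
    print(f"{label}: train R2 = {model.score(X_tr, y_tr):.3f}, "
          f"test R2 = {model.score(X_te, y_te):.3f}")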

How Lasso Achieves Balance #

Lasso Regression manages the bias–variance tradeoff in a structured way:

  • Regularization shrinks coefficients, which reduces model complexity and lowers variance
  • Feature selection removes irrelevant variables, making the model more stable
  • While some bias may increase, the reduction in variance often leads to improved overall performance

In many real-world cases, this tradeoff results in a lower test error, which is the ultimate goal.

Working of Lasso Regression #

Let’s understand how Lasso Regression works step by step:

1. Basic Linear Model #

Lasso Regression starts with the foundation of a standard linear regression model. In this setup, the target variable is expressed as a linear combination of input features and their corresponding weights.

Without any constraints, this model may become overly complex, especially when dealing with a large number of features, leading to overfitting.
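In matrix form, this base model is just a weighted sum of the inputs plus an intercept, ŷ = Xw + b. A tiny NumPy sketch with made-up weights:

import numpy as np

# Toy example: 3 samples, 2 features; weights chosen arbitrarily for illustration
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -0.2])  # coefficients w_j
b = 1.0                    # intercept

y_hat = X @ w + b          # prediction: linear combination of features
print(y_hat)               # [1.1, 1.7, 2.3]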

2. L1-Regularized Loss Function #

To control model complexity, Lasso introduces an L1 penalty term into the loss function:

L = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ |wⱼ|

  • The first term represents the prediction error (Residual Sum of Squares)
  • The second term is the L1 penalty, which adds the absolute values of coefficients
  • λ (lambda) controls how strongly the penalty is applied

As λ increases, the model is forced to keep coefficients smaller, reducing complexity.
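Written out in NumPy, the objective looks like this (a hypothetical helper for illustration; scikit-learn's own Lasso objective additionally scales the RSS term by 1/(2n)):

import numpy as np

def lasso_loss(X, y, w, lam):
    # First term: Residual Sum of Squares (prediction error)
    rss = np.sum((y - X @ w) ** 2)
    # Second term: L1 penalty on the coefficient magnitudes
    l1_penalty = lam * np.sum(np.abs(w))
    return rss + l1_penalty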

3. Shrinking of Coefficients #

One of the defining features of Lasso is how it handles coefficients:

  • Increasing λ gradually shrinks coefficient values
  • Features with less importance are penalized more heavily
  • Some coefficients become exactly zero, not just small

This behavior is what differentiates Lasso from other regularization methods.
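scikit-learn's lasso_path makes this shrinkage visible: it fits the model over a grid of alpha values and returns the full coefficient trajectory. A sketch on synthetic data (the grid is an arbitrary illustrative choice):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=3.0, random_state=0)

# Coefficients are computed along a decreasing grid of alpha values
alphas, coefs, _ = lasso_path(X, y, alphas=np.logspace(1, -2, 10))

# Watch the number of surviving (non-zero) coefficients shrink as alpha grows
for a, c in zip(alphas, coefs.T):
    print(f"alpha = {a:8.3f} -> {np.sum(c != 0)} non-zero coefficients")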

4. Built-in Feature Selection #

Because Lasso can push coefficients to zero:

  • Features with zero coefficients are effectively removed from the model
  • No separate feature selection technique is required
  • The final model becomes simpler, faster, and easier to interpret

This makes Lasso highly useful in scenarios with many irrelevant or redundant features.
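In code, the selection step is simply a matter of reading off which coefficients survived. A minimal sketch (the feature names here are placeholders):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=2.0, random_state=0)
feature_names = [f"feature_{j}" for j in range(X.shape[1])]  # placeholder names

model = Lasso(alpha=1.0).fit(X, y)

# Keep only the features whose coefficients were not driven to zero
selected = [name for name, coef in zip(feature_names, model.coef_) if coef != 0]
print(f"Selected {len(selected)} of {len(feature_names)} features: {selected}")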

5. Optimization Process #

Solving the Lasso objective is not as straightforward as ordinary least squares, because the absolute value in the L1 penalty is not differentiable at zero, so there is no closed-form solution. It is typically optimized using:

  • Coordinate Descent Algorithm
    • Updates one coefficient at a time while keeping others fixed
    • Iteratively improves the solution until convergence

This method is efficient even for high-dimensional datasets.
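A bare-bones version of the algorithm fits in a few lines. The sketch below is a toy implementation built around the soft-thresholding update (using the ½·RSS convention for the objective; it omits the convergence checks and optimizations a production solver uses):

import numpy as np

def soft_threshold(rho, lam):
    # Closed-form solution for a single coordinate under the L1 penalty
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    # Minimizes (1/2) * RSS + lam * ||w||_1 (toy version, fixed iteration count)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Partial residual: remove feature j's current contribution
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            # Update coordinate j while all the others stay fixed
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w

# Quick check on synthetic data: weights that are truly zero are
# typically recovered as exactly zero
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(lasso_coordinate_descent(X, y, lam=5.0))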

Implementing Lasso Regression #

Let's now build the project step by step, starting with the required libraries.

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import confusion_matrix, classification_report

Load Dataset

# Load the Kaggle IoT intrusion detection dataset (final_dataset.csv)
df = pd.read_csv('/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv')
df.head(10)

Output

[Output: first 10 rows of the dataset]

3D Surface Analysis: Flow Duration vs Packets

# 1. Sample the data (1.2M rows is too many for a 3D surface plot)
# We take a sample of 2,000 points to ensure the plot renders quickly
df_sample = df.sample(2000, random_state=42)

# 2. Define the axes
# We use Flow Duration, Fwd Packets, and Bwd Packets to see traffic volume over time
x = df_sample['Flow Duration']
y = df_sample['Total Fwd Packet']
z = df_sample['Total Bwd packets']

# 3. Initialize the 3D Plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

# 4. Create the Surface Plot
# plot_trisurf creates a surface by joining points into triangles
surf = ax.plot_trisurf(x, y, z, cmap='magma', edgecolor='none', alpha=0.8)

# 5. Add Labels and Styling
ax.set_title('3D Surface Analysis: Flow Duration vs Packets', fontsize=15)
ax.set_xlabel('Flow Duration', fontsize=12)
ax.set_ylabel('Total Fwd Packets', fontsize=12)
ax.set_zlabel('Total Bwd Packets', fontsize=12)

# Add a color bar which maps values to colors
fig.colorbar(surf, shrink=0.5, aspect=10)

# Adjust the viewing angle for better perspective
ax.view_init(elev=30, azim=45)

plt.show()
[Output: 3D surface plot]

Joint Plot

# 1. Create the Joint Plot
# We use 'Flow Duration' and 'Total Fwd Packet' to see how traffic density 
# varies across different categories of the 'Label'.
joint_grid = sns.jointplot(
    data=df, 
    x='Flow Duration', 
    y='Total Fwd Packet', 
    hue='Label',      # 0 for Normal, 1 for Attack
    kind='scatter',   # Scatter plot with marginal distributions
    palette='Set1',   # Distinct colors for classes
    alpha=0.6,        # Transparency to handle potential overplotting
    height=8          # Size of the plot
)

# 2. Refine labels and add a title
joint_grid.set_axis_labels('Flow Duration (μs)', 'Total Forward Packets', fontsize=12)
joint_grid.fig.suptitle("Joint Distribution of Flow Duration vs Packets by Label", y=1.02, fontsize=14)

plt.show()
[Output: joint plot]

Rug Plot

# 1. Sample the data for clarity (Recommended for 1.2M+ rows)
# We take 1,000 points to keep the 'rug hairs' distinguishable
df_sample = df.sample(1000, random_state=42)

# 2. Create the Rug Plot
plt.figure(figsize=(12, 3))

# We plot 'Flow Duration' and color-code by 'Label' (0: Normal, 1: Attack)
sns.rugplot(
    data=df_sample, 
    x='Flow Duration', 
    hue='Label', 
    height=0.5,      # Adjusts the height of the rug ticks
    palette='husl',  # High-contrast colors
    alpha=0.6        # Transparency to see overlapping ticks
)

# 3. Add Labels and Titles
plt.title('Rug Plot: Distribution of Flow Duration (Normal vs. Attack)', fontsize=14)
plt.xlabel('Flow Duration (μs)', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.3) # Add subtle vertical grid lines

# Remove Y-axis as it's not needed for a 1D rug plot
plt.yticks([]) 

plt.show()
[Output: rug plot]

Fill Between Plot

# 1. Sample and Sort the data
# We sample 100 rows to make the 'fill' area clearly visible
df_sample = df.sample(100, random_state=42).reset_index()
# Sorting ensures the line and fill area don't criss-cross randomly
df_sample = df_sample.sort_values(by='index')

# 2. Define the plot
plt.figure(figsize=(14, 6))

# X-axis: Sample index
x = np.arange(len(df_sample))
# Y-axes: The two boundaries
y1 = df_sample['Idle Mean']
y2 = df_sample['Idle Max']

# 3. Create the Plot
# Plot the lines
plt.plot(x, y1, label='Idle Mean', color='blue', alpha=0.6)
plt.plot(x, y2, label='Idle Max', color='red', alpha=0.6)

# Fill the area between them
plt.fill_between(x, y1, y2, color='gray', alpha=0.3, label='Idle Time Variance')

# 4. Add Styling and Labels
plt.title('Fill Between Plot: Variance in Device Idle Times', fontsize=14)
plt.xlabel('Sample Sequence (Random Subset)', fontsize=12)
plt.ylabel('Microseconds (μs)', fontsize=12)
plt.legend(loc='upper right')
plt.grid(True, linestyle='--', alpha=0.5)

plt.show()
[Output: fill-between plot]

Data Preprocessing

# Separate Features (X) and Target (y)
X = df.drop('Label', axis=1)
y = df['Label']

# Split into Training and Testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data (Standardization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Train the Lasso Model

# Initialize Lasso with an alpha (penalty strength)
# alpha=0.01 is a common starting point; higher alpha means more features become zero.
lasso_model = Lasso(alpha=0.01)

# Fit the model
lasso_model.fit(X_train_scaled, y_train)

# Identify which features were kept (non-zero coefficients)
kept_features = np.sum(lasso_model.coef_ != 0)
print(f"Lasso kept {kept_features} out of {X.shape[1]} features.")

Model Prediction

# Get raw predictions
raw_preds = lasso_model.predict(X_test_scaled)

# Convert to binary labels (Threshold = 0.5)
y_pred = [1 if x >= 0.5 else 0 for x in raw_preds]
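Note that Lasso is a regression model, so thresholding its continuous output at 0.5 is a pragmatic workaround for this binary label. If a true classifier is preferred, L1-penalized logistic regression gives the same sparsity-inducing behavior; here is a sketch using the scaled splits from above (the C value and solver choice are illustrative, not tuned):

from sklearn.linear_model import LogisticRegression

# Alternative to thresholded Lasso: a classifier with the same L1 penalty
log_l1 = LogisticRegression(penalty='l1', solver='saga', C=1.0, max_iter=1000)
log_l1.fit(X_train_scaled, y_train)
y_pred_logreg = log_l1.predict(X_test_scaled)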

Evaluation Metrics & Visualization

# Calculate Metrics
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Normal', 'Attack'], yticklabels=['Normal', 'Attack'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Lasso Regression: Confusion Matrix for IoT Intrusion Detection')
plt.show()

# Print full classification report
print(classification_report(y_test, y_pred))
[Output: confusion matrix heatmap]
 precision    recall  f1-score   support

           0       0.78      0.78      0.78     61522
           1       0.92      0.92      0.92    173523

    accuracy                           0.88    235045
   macro avg       0.85      0.85      0.85    235045
weighted avg       0.88      0.88      0.88    235045
