
Simple Linear Regression Project

What is Linear Regression? #

Linear Regression is one of the most widely used techniques in supervised machine learning. It is designed to understand and model the relationship between a target variable (output) and one or more input variables (features). The goal is to predict continuous numerical values by fitting a straight line that best represents the data pattern.

In simple terms, it finds a mathematical relationship that explains how changes in input variables affect the output.

Key Concepts #

    • It assumes a linear relationship between input (independent variables) and output (dependent variable).

    • The model determines a best-fit line that minimizes prediction errors.

    • This line is calculated using the least squares method, which reduces the difference between actual and predicted values.

    • It can be used for both simple regression (one input variable) and multiple regression (several input variables).

    • The equation of linear regression is typically written as:
      Y = mX + c
      where:
        • Y = predicted value

        • X = input feature

        • m = slope (impact of X on Y)

        • c = intercept (starting value)
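The equation above can be evaluated directly. Here is a minimal sketch with made-up numbers (m = 5, c = 50 are purely illustrative):

```python
# Hypothetical parameters: slope m = 5, intercept c = 50
m, c = 5, 50

def predict(x):
    """Apply the line equation Y = mX + c."""
    return m * x + c

print(predict(4))  # 5 * 4 + 50 = 70
```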

Why Use Linear Regression? #

Linear Regression is popular because it is:

    • Easy to understand and implement

    • Fast to train compared to complex models

    • Interpretable (you can clearly see how inputs affect output)

    • Useful as a baseline model in machine learning projects

It is commonly applied in:

    • Forecasting trends

    • Business analytics

    • Price prediction

    • Risk analysis

Example: Predicting Exam Scores #

Imagine you want to estimate a student’s exam score based on how many hours they studied.

From observation, you notice a pattern:
As study time increases, exam scores also tend to improve.

In this case:

    • Independent Variable (Input): Hours studied
      → This is the factor you measure or control

    • Dependent Variable (Output): Exam score
      → This is the outcome you want to predict

The regression model uses study hours to estimate the expected score by drawing a line that best fits the observed data.

How It Works (Step-by-Step) #

    1. Collect data (e.g., hours studied and exam scores)

    2. Plot the data on a graph

    3. Fit a straight line that best represents the data points

    4. Use the line to make predictions for new inputs
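The steps above can be sketched with scikit-learn on a small hours-studied vs. exam-score dataset (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Collect data (hypothetical hours studied vs. exam scores)
hours = np.array([[1], [2], [3], [4], [5]])
scores = np.array([52, 55, 61, 64, 70])

# 2-3. Fit a straight line to the data points
model = LinearRegression()
model.fit(hours, scores)

# 4. Predict the score for a new input (6 hours of study)
print(model.predict([[6]])[0])  # ≈ 73.9
```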

Important Assumptions #

For accurate results, Linear Regression relies on a few assumptions:

    • The relationship between variables is linear

    • Data points are independent

    • Errors are normally distributed

    • There is minimal multicollinearity (in multiple regression)

Best-Fit Line in Linear Regression #

Linear regression is a fundamental technique used to understand and model the relationship between variables. The best-fit line is a straight line that most closely represents how the input variable(s) (X) relate to the output variable (Y).

Instead of passing exactly through every point, this line is positioned in such a way that it reduces the overall prediction error between actual data values and predicted values.

Objective of the Best-Fit Line #

The main purpose of linear regression is to determine a line that minimizes the difference between observed values and predicted outcomes. This difference is commonly referred to as error or residual.

By minimizing this error, the model becomes more accurate and can be used to make reliable predictions on new data.


Understanding Variables #

    • Y (Dependent Variable / Target)
      This is the output we want to predict.

    • X (Independent Variable / Predictor)
      This is the input used to predict the output.

Mathematical Representation #

A simple linear regression model can be written as:

Y = θ₁ + θ₂X

Where:

    • θ₁ (Intercept) → The value of Y when X = 0

    • θ₂ (Slope) → The rate at which Y changes with respect to X

Key Idea #

The best-fit line is chosen such that it minimizes the total squared error between actual and predicted values. This method is known as:

Least Squares Method
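The least squares estimates have a simple closed form. Here is a sketch of those textbook formulas on toy data (chosen to lie exactly on y = 2x + 1):

```python
import numpy as np

# Toy data that lies exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form least squares estimates:
# slope = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²,  intercept = ȳ - slope·x̄
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # 2.0 1.0
```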

Understanding the Best-Fit Line in Linear Regression #

Linear Regression is one of the most widely used techniques in supervised learning. It helps us understand how a dependent variable changes with respect to one or more independent variables. The main objective is to draw a straight line that best represents the relationship between the variables.

1. Purpose of the Best-Fit Line #

The goal of linear regression is to find a line that closely matches the given data points. This line is called the best-fit line because it minimizes the difference between the actual observed values and the values predicted by the model.

Instead of passing through every point, the line is positioned in such a way that the overall error between predictions and real values is as small as possible.

2. Mathematical Equation of the Line #

In simple linear regression (with one input variable), the equation of the best-fit line is:

y = mx + b

Where:

    • y → Predicted output (dependent variable)

    • x → Input feature (independent variable)

    • m → Slope of the line (rate of change)

    • b → Intercept (value of y when x = 0)

What do these mean? #

    • Slope (m): Shows how much the output changes when the input increases by one unit.
      For example, if m = 5, then y increases by 5 units for every 1-unit increase in x.

    • Intercept (b): Represents the starting point of the line on the y-axis.

3. Error Minimization using Least Squares #

To determine the best values of m and b, we use the Least Squares Method.

Residual (Error) #

The difference between the actual value and the predicted value is called a residual:

Residual = yᵢ − ŷᵢ

Where:

    • yᵢ → Actual value

    • ŷᵢ → Predicted value

Objective Function #

The goal is to minimize the Sum of Squared Errors (SSE):

SSE = ∑(yᵢ − ŷᵢ)²

Squaring ensures:

    • All errors become positive

    • Larger errors are penalized more heavily

This process guarantees that the final line fits the data as closely as possible.
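To make the objective concrete, here is a small sketch (toy data) that computes the SSE for two candidate lines; the line closer to the data yields a much smaller sum, and squaring penalizes the large residuals of the poor line heavily:

```python
import numpy as np

# Toy data scattered around y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])

def sse(m, b):
    """Sum of squared residuals Σ(yᵢ - ŷᵢ)² for the line ŷ = mx + b."""
    residuals = y - (m * x + b)
    return np.sum(residuals ** 2)

# A near-optimal line vs. a deliberately poor one
print(sse(2.0, 0.0))  # small total error
print(sse(1.0, 1.0))  # much larger total error
```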

4. Understanding the Best-Fit Line #

Once the model is trained:

    • The slope explains the strength and direction of the relationship

    • The intercept gives the baseline prediction

    • The line can be used to make predictions for new input values

5. Key Assumptions of Linear Regression #

For reliable results, linear regression relies on some important assumptions:

    • Linearity: The relationship between variables should be linear

    • Independence: Observations should not depend on each other

    • Homoscedasticity: Error variance should remain constant

    • Normality of Errors: Residuals should be normally distributed

Hypothesis Function in Linear Regression #

In linear regression, the hypothesis function is the mathematical expression used to predict the value of the target variable based on input features. It defines how the input variables are mapped to the output.

In simple terms, it is the formula your model uses to make predictions.

1. Hypothesis Function (Single Variable) #

For a basic linear regression model with one independent variable, the hypothesis function is written as:

h(x) = β₀ + β₁x

Explanation of Terms: #

    • h(x) or ŷ → Predicted value of the target variable

    • x → Input feature (independent variable)

    • β₀ (Intercept) → Value of the prediction when x = 0

    • β₁ (Slope/Coefficient) → Represents how much the prediction changes when x increases by one unit

Example:
If β₁ = 3, it means that for every 1-unit increase in x, the predicted value increases by 3 units.

2. Hypothesis Function (Multiple Variables) #

When there are multiple input features, the model becomes multiple linear regression, and the hypothesis function expands as:

h(x₁, x₂, …, xₖ) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

Explanation: #

    • x₁, x₂, …, xₖ → Different input features

    • β₀ → Intercept term

    • β₁, β₂, …, βₖ → Coefficients showing the impact of each feature

Each coefficient tells how strongly its corresponding variable affects the final prediction.
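The multi-variable hypothesis can be evaluated with plain arithmetic. A sketch with hypothetical parameter values (β₀ = 10 and three coefficients chosen for illustration):

```python
# Hypothetical learned parameters: intercept β₀ and coefficients β₁..βₖ
beta0 = 10.0
betas = [2.0, -1.0, 0.5]

def hypothesis(xs):
    """h(x₁..xₖ) = β₀ + β₁x₁ + ... + βₖxₖ"""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

print(hypothesis([3.0, 4.0, 2.0]))  # 10 + 6 - 4 + 1 = 13.0
```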

3. How the Hypothesis Function Works #

The hypothesis function:

    • Takes input values (features)

    • Multiplies them with learned coefficients

    • Adds the intercept

    • Produces a predicted output

This output is then compared with actual values to measure error and improve the model.

4. Role in Model Training #

The hypothesis function is central to training a linear regression model:

    • Initially, coefficients (β values) are random

    • The model adjusts them using optimization techniques (like Gradient Descent)

    • The goal is to minimize prediction error

Over time, the hypothesis function becomes more accurate.

5. Vector Form (Advanced Representation) #

For better performance and scalability, especially in programming, the hypothesis function is often written in vector form:

h(x) = θᵀx

Where:

    • θ (theta) → Vector of parameters (β values)

    • x → Feature vector

This form is widely used in libraries like NumPy and machine learning frameworks.
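In NumPy the vector form is a single dot product. In the sketch below the intercept θ₀ is absorbed into θ by prepending a constant 1 to the feature vector (a common convention, assumed here; the numbers are illustrative):

```python
import numpy as np

# Parameter vector θ = [θ₀, θ₁, θ₂] and a feature vector with a leading 1
theta = np.array([10.0, 2.0, -1.0])
x = np.array([1.0, 3.0, 4.0])  # first entry is the bias term

# h(x) = θᵀx
prediction = theta @ x
print(prediction)  # 10 + 6 - 4 = 12.0
```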

6. Key Characteristics #

    • It assumes a linear relationship between inputs and output

    • It is simple, fast, and easy to interpret

    • Feature scaling helps optimization methods like gradient descent converge faster

Assumptions of Linear Regression #

Linear regression works best when certain conditions are met. These assumptions ensure that your model produces reliable and meaningful results.

1. Linearity #

The core idea of linear regression is that the relationship between the independent variables (X) and the dependent variable (Y) follows a straight-line pattern.
If the relationship is curved or complex, a simple linear model may not capture it properly.


2. Independence of Errors #

The prediction errors (residuals) should be independent of each other.
This means that one error should not influence another—especially important in time-series or sequential data.

3. Homoscedasticity (Constant Variance) #

The spread of errors should remain consistent across all levels of the input variables.
If the variance of errors increases or decreases (forming patterns like a funnel shape), it is called heteroscedasticity, which can weaken the reliability of the model.


4. Normal Distribution of Errors #

The residuals should follow a normal (bell-shaped) distribution.
This assumption is important for making valid statistical inferences like confidence intervals and hypothesis tests.

5. No Multicollinearity (Multiple Regression) #

In models with multiple input variables, predictors should not be highly correlated with each other.
Strong relationships between inputs can confuse the model and make it difficult to determine the true impact of each variable.

6. No Autocorrelation #

Residuals should not display patterns over time.
If errors are correlated (e.g., increasing or decreasing trends), it suggests the model is missing important information.

7. Additivity #

The effect of each independent variable on the dependent variable is assumed to be separate and additive.
This means the total impact is simply the sum of individual effects, without interaction unless explicitly modeled.

Types of Linear Regression #

Linear Regression can be categorized based on the number of input variables (features) used to predict the output. If the model uses only one input feature, it is called Simple Linear Regression. When multiple input features are involved, it is known as Multiple Linear Regression.

1. Simple Linear Regression #

Simple Linear Regression is applied when there is only one independent variable used to predict a dependent variable. It assumes that the relationship between input and output can be represented by a straight line.

Mathematical Representation #

ŷ = θ₀ + θ₁x

Explanation of Terms #

    • ŷ (y-hat): Predicted value of the dependent variable

    • x: Independent variable (input feature)

    • θ₀ (theta zero): Intercept (value of ŷ when x = 0)

    • θ₁ (theta one): Slope (change in ŷ for a one-unit increase in x)

Example #

Suppose you want to estimate a person’s salary based on their years of experience. Here, experience is the input (x), and salary is the output (y).

Key Points #

    • Works best when the relationship is approximately linear

    • Easy to implement and interpret

    • Sensitive to outliers (extreme values can affect the line significantly)

2. Multiple Linear Regression #

Multiple Linear Regression is used when two or more independent variables are used to predict a single dependent variable. It helps model more complex relationships compared to simple linear regression.

Mathematical Representation #

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ

Explanation of Terms #

    • x₁, x₂, …, xₙ: Input features

    • θ₁, θ₂, …, θₙ: Coefficients (impact of each feature on output)

    • θ₀: Intercept

    • ŷ: Predicted output

Example #

Predicting house prices using multiple factors such as:

    • Area (square feet)

    • Number of bedrooms

    • Location

    • Age of the property

Key Points #

    • Captures more realistic scenarios with multiple influencing factors

    • Requires more data for accurate predictions

    • Can suffer from multicollinearity (when input features are highly correlated)
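A minimal multiple-regression sketch on made-up housing-style data (categorical inputs like location would need encoding first, which is omitted here; all numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [area_sqft, bedrooms, property_age] -> price
X = np.array([
    [1000, 2, 10],
    [1500, 3, 5],
    [2000, 4, 2],
    [1200, 2, 8],
])
y = np.array([200_000, 300_000, 400_000, 240_000])

model = LinearRegression()
model.fit(X, y)

# One coefficient per feature, plus a single intercept
print(model.coef_, model.intercept_)
print(model.predict([[1800, 3, 4]])[0])
```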

Implementing the Linear Regression Model #

Before we code, let’s look at the logical flow of our system:

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report, accuracy_score

Load Dataset

# Load the dataset
df = pd.read_csv('/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv')

# Quick sanity check: column-wise sums over the first five rows
df.head().sum()
Src Port                      1.549860e+05
Dst Port                      4.110900e+04
Protocol                      3.000000e+01
Flow Duration                 5.688227e+06
Total Fwd Packet              2.000000e+01
Total Bwd packets             2.000000e+01
Total Length of Fwd Packet    9.000000e+02
Fwd Packet Length Min         6.000000e+00
Bwd Packet Length Max         3.073000e+03
Flow IAT Min                  2.060000e+02
Fwd IAT Min                   2.459000e+03
Fwd Header Length             6.520000e+02
Bwd Packets/s                 9.928075e+01
Packet Length Max             3.079000e+03
Packet Length Std             8.032699e+02
RST Flag Count                0.000000e+00
FWD Init Win Bytes            8.988500e+04
Bwd Init Win Bytes            1.270000e+03
Idle Mean                     7.644808e+15
Idle Max                      8.494231e+15
FIN Flag Count                5.000000e+00
SYN Flag Count                6.000000e+00
ACK Flag Count                3.700000e+01
Label                         0.000000e+00
dtype: float64

3D Surface Plot: Ports vs. Flow Duration

# Initialize the 3D figure
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

# Creating the 3D Surface Plot (Triangular Surface)
# We use Src Port (X), Dst Port (Y), and Flow Duration (Z)
surf = ax.plot_trisurf(df['Src Port'], 
                       df['Dst Port'], 
                       df['Flow Duration'], 
                       cmap='viridis', 
                       edgecolor='none', 
                       alpha=0.8)

# Adding labels and title
ax.set_xlabel('Source Port')
ax.set_ylabel('Destination Port')
ax.set_zlabel('Flow Duration')
plt.title('3D Surface Plot: Ports vs. Flow Duration')

# Adding a color bar to show the scale of the Z-axis (Flow Duration)
fig.colorbar(surf, ax=ax, shrink=0.5, aspect=5, label='Flow Duration')

# Adjusting the view angle for better visibility
ax.view_init(elev=30, azim=120)

plt.show()

3D Pie Chart: Distribution of TCP Flags

# Identify the columns for the pie chart slices
# We use the sums of the specific flags provided in your data
labels = ['FIN Flags', 'SYN Flags', 'ACK Flags']
sizes = [df['FIN Flag Count'].sum(), 
         df['SYN Flag Count'].sum(), 
         df['ACK Flag Count'].sum()]

# Explode the slices for a 3D-like depth effect
# (the larger 0.1 offset pulls the ACK slice slightly further from the center)
explode = (0.05, 0.05, 0.1)  

# Create the pie chart
fig, ax = plt.subplots(figsize=(8, 6))
ax.pie(sizes, 
       explode=explode, 
       labels=labels, 
       autopct='%1.1f%%',
       shadow=True, 
       startangle=90, 
       colors=['#ff6666', '#66b3ff', '#99ff99'])

# Set aspect ratio to be equal so that the pie is drawn as a circle.
ax.axis('equal')  

plt.title('3D Pie Chart: Distribution of TCP Flags')
plt.show()

Violin Plot: Packet Length Std Distribution per Protocol

# Assign 'Protocol' to both x and hue to fix the warning
sns.violinplot(x='Protocol', y='Packet Length Std', data=df, 
               hue='Protocol', inner="quartile", palette="muted", legend=False)

# Customizing the chart
plt.title('Violin Plot: Packet Length Std Distribution per Protocol')
plt.xlabel('Protocol')
plt.ylabel('Packet Length Std')

# Display the plot
plt.show()

Strip Plot: Flow Duration across Protocols

# Strip Plot: Individual data points of Flow Duration per Protocol
# 'jitter=True' helps spread out the points so they don't overlap too much
sns.stripplot(x='Protocol', y='Flow Duration', data=df, 
              hue='Protocol', jitter=True, palette='viridis', legend=False)

# Customizing the chart
plt.title('Strip Plot: Flow Duration across Protocols')
plt.xlabel('Protocol')
plt.ylabel('Flow Duration')

# Optional: Use log scale if Flow Duration values vary drastically
# plt.yscale('log')

# Display the plot
plt.show()

Training Model

# Selecting Features (X) and Target (y)
# Use four flow-level features as examples
features = ['Flow Duration', 'Total Fwd Packet', 'Total Bwd packets', 'Flow IAT Min']
X = df[features]
y = df['Label']

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
Training set size: 940177
Testing set size: 235045

Train the model

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

print("Model training complete.")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
Model training complete.
Intercept: 0.7620919319155204
Coefficients: [-1.56320526e-08 5.80824421e-06 -2.14543328e-05 2.55014796e-08]
# Get continuous predictions
raw_predictions = model.predict(X_test)

# Convert to binary labels using a 0.5 threshold
final_predictions = [1 if p >= 0.5 else 0 for p in raw_predictions]

Confusion Matrix: IoT Attack Detection

# 1. Accuracy
accuracy = accuracy_score(y_test, final_predictions)
print(f"Accuracy Score: {accuracy * 100:.2f}%")

# 2. Confusion Matrix
cm = confusion_matrix(y_test, final_predictions)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix: IoT Attack Detection')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# 3. Detailed Report
print("\nClassification Report:")
print(classification_report(y_test, final_predictions))
Accuracy Score: 78.72%
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.22      0.35     61522
           1       0.78      0.99      0.87    173523

    accuracy                           0.79    235045
   macro avg       0.83      0.60      0.61    235045
weighted avg       0.80      0.79      0.74    235045

Now Improve Accuracy Using StandardScaler and a Random Forest Classifier

# 1. Prepare Data
X = df.drop('Label', axis=1) # Use all 24 engineered features
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Feature Scaling (note: tree-based models are largely scale-invariant,
# but scaling keeps the pipeline consistent for other model types)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Train Random Forest
# n_estimators=100 means we are using 100 decision trees together
rf_model = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train)

# 4. Predict
y_pred = rf_model.predict(X_test_scaled)

# 5. Evaluate
print(f"New Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
New Accuracy: 99.96%

Confusion Matrix:
[[ 61497     25]
 [    61 173462]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     61522
           1       1.00      1.00      1.00    173523

    accuracy                           1.00    235045
   macro avg       1.00      1.00      1.00    235045
weighted avg       1.00      1.00      1.00    235045