Logistic Regression Project

Logistic regression is a supervised learning algorithm commonly used to solve classification problems in machine learning. Unlike linear regression, which predicts continuous values, logistic regression outputs the predicted probability that an input belongs to a class.

Logistic regression is best suited to binary classification, where there are two possible outputs such as Yes/No, True/False, or 0/1. To convert the input into a probability between 0 and 1, logistic regression applies the sigmoid function, a mathematical function whose curve is shaped like an “S”.
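To make the sigmoid step concrete, here is a minimal sketch using only the Python standard library. The sigmoid is sigma(z) = 1 / (1 + e^(-z)), applied to a linear score z = w . x + b. The weights, bias, and feature values below are hypothetical and chosen only for illustration; they are not taken from any trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    # Use the form that keeps the argument of exp() non-positive,
    # so very large |z| cannot overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical weights and bias for a two-feature model (illustrative values only).
weights = [0.8, -0.4]
bias = 0.1
features = [2.0, 1.0]

# Linear score: z = w . x + b
z = sum(w * x for w, x in zip(weights, features)) + bias

# The sigmoid turns the score into a probability for class 1.
probability = sigmoid(z)

# A common convention is to predict class 1 when the probability is at least 0.5.
predicted_class = 1 if probability >= 0.5 else 0

print(f"z = {z:.2f}, P(class 1) = {probability:.4f}, predicted class = {predicted_class}")
```

A score of 0 maps to a probability of exactly 0.5, which is why the sign of the linear score decides the predicted class under the 0.5 threshold.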

Types of Logistic Regression

There are three major types of logistic regression, each based on the nature of the dependent variable.

1: Binomial (binary) logistic regression – used when the dependent variable has only two possible outcomes (e.g. yes/no, pass/fail). This is the most commonly used type of logistic regression and corresponds to binary classification problems.

2: Multinomial logistic regression – used when the dependent variable has three or more possible outcomes and the outcomes are unordered (e.g. classifying animals as cat, dog, or sheep). It extends binary logistic regression to support more than two classes.

3: Ordinal logistic regression – used when the dependent variable has three or more possible outcomes and the outcomes have a natural order or ranking (e.g. a rating of low/medium/high). It takes the ordering of the outcomes into account when fitting the model.
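For the multinomial case, the sigmoid is usually replaced by the softmax function, which turns one score per class into probabilities that sum to 1. The sketch below is a standard-library illustration of that idea, not a full multinomial model; the class scores are hypothetical values chosen for the cat/dog/sheep example.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities that sum to 1."""
    # Subtracting the largest score keeps exp() from overflowing
    # and does not change the resulting probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class scores (illustrative values only).
classes = ["cat", "dog", "sheep"]
scores = [2.0, 1.0, 0.1]

probs = softmax(scores)

# The predicted class is the one with the highest probability.
best = classes[probs.index(max(probs))]

for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
print("predicted:", best)
```

With only two classes and scores [z, 0], softmax gives the same probability as the sigmoid of z, which is one way to see the multinomial model as an extension of the binary one.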

The Implementation Pipeline

Before we code, let’s look at the logical flow of our system:

Import Libraries

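import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix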

Load Dataset

df = pd.read_csv("/kaggle/input/datasets/organizations/ailearner-researchlab/iot-intrusion-detection-hybrid-ml-dl-dataset/final_dataset.csv") 
df_sample = df.sample(frac=0.1, random_state=42)
df.head()


Target Variable Distribution (Label Balance)

plt.figure(figsize=(8, 5))
sns.countplot(x='Label', hue='Label', data=df_sample, palette='viridis', legend=False)
plt.title('Distribution of Normal (0) vs Attack (1) Traffic')
plt.xlabel('Traffic Type')
plt.ylabel('Count')
plt.show()

Correlation Heatmap

plt.figure(figsize=(16, 10))
# Compute the correlation for numeric columns only
corr_matrix = df_sample.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.show()

Feature Distribution (Histograms)

features_to_plot = ['Flow Duration', 'Total Fwd Packet', 'Total Bwd packets', 'Flow IAT Min']

df_sample[features_to_plot].hist(bins=30, figsize=(15, 10), color='skyblue', edgecolor='black')
plt.suptitle('Distribution of Key Network Features')
plt.show()

Box Plots (Outliers Detection)

plt.figure(figsize=(12, 6))
sns.boxplot(x='Label', y='Flow Duration', data=df_sample)
plt.yscale('log')  # Log scale because flow duration values can be very large
plt.title('Flow Duration vs Label (Log Scale)')
plt.show()

Data Preprocessing

# 1. Separate Features (X) and Target (y)
X = df.drop('Label', axis=1)
y = df['Label']

# 2. Split data into Training (80%) and Testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Feature Scaling (Very important for Logistic Regression!)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
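To show what the scaling step does, here is a rough standard-library sketch of standardization for a single column: subtract the training mean and divide by the training standard deviation. It mirrors the idea behind StandardScaler but is not its implementation, and the column values are hypothetical, not taken from the dataset. The statistics are computed on the training data only and then reused for the test data, which is why the code above calls fit_transform on X_train but only transform on X_test.

```python
import statistics

def fit_standardizer(column):
    """Return the mean and population standard deviation of a training column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return mean, std

def standardize(column, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    # Guard against zero variance so a constant column does not divide by zero.
    scale = std if std > 0 else 1.0
    return [(x - mean) / scale for x in column]

# Hypothetical flow durations (illustrative values only).
train_col = [10.0, 20.0, 30.0, 40.0]
test_col = [25.0, 50.0]

# Fit on the training column only, then reuse the same statistics for the test column.
mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)

print("train mean/std:", mean, round(std, 4))
print("scaled train:", [round(v, 4) for v in train_scaled])
print("scaled test:", [round(v, 4) for v in test_scaled])
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally will not, because it is scaled with the training statistics.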

Train the Logistic Regression Model

# Initialize the model
# Using 'sag' or 'saga' solver is better for large datasets (1.2M rows)
log_model = LogisticRegression(solver='saga', max_iter=1000)

# Train the model
log_model.fit(X_train, y_train)

Evaluate the Model

# Make predictions
y_pred = log_model.predict(X_test)

# Print results
print("--- Accuracy Score ---")
print(f"{accuracy_score(y_test, y_pred):.4f}")

print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred))

print("\n--- Confusion Matrix ---")
print(confusion_matrix(y_test, y_pred))

--- Accuracy Score ---
0.8855

--- Classification Report ---
              precision    recall  f1-score   support

           0       0.78      0.78      0.78     61522
           1       0.92      0.92      0.92    173523

    accuracy                           0.89    235045
   macro avg       0.85      0.85      0.85    235045
weighted avg       0.89      0.89      0.89    235045


--- Confusion Matrix ---
[[ 48058  13464]
 [ 13457 160066]]
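The numbers in the classification report can be recomputed by hand from the confusion matrix, which helps when interpreting the result. The sketch below uses the matrix printed above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels. The majority-class baseline at the end is an added comparison, not part of the original output.

```python
# Confusion matrix as printed above: rows are true labels, columns are predicted labels.
cm = [[48058, 13464],
      [13457, 160066]]

tn, fp = cm[0]  # true class 0: predicted 0 (TN), predicted 1 (FP)
fn, tp = cm[1]  # true class 1: predicted 0 (FN), predicted 1 (TP)
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Class 1 (attack): precision and recall
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 (normal): precision and recall
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Accuracy of always predicting the majority class (1), for comparison.
baseline = (fn + tp) / total

print(f"accuracy     = {accuracy:.4f}")
print(f"class 0: precision = {precision_0:.2f}, recall = {recall_0:.2f}")
print(f"class 1: precision = {precision_1:.2f}, recall = {recall_1:.2f}")
print(f"majority-class baseline accuracy = {baseline:.4f}")
```

Because class 1 makes up roughly three quarters of the test set, accuracy alone flatters the model. The weaker precision and recall on class 0 show where most of the errors fall.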
