
Logistic Regression

Logistic Regression is a supervised learning algorithm commonly used to solve classification problems in machine learning. Unlike Linear Regression, which produces continuous predicted values, logistic regression outputs a predicted probability that an input belongs to a class.

Logistic regression is best suited to binary classification, where there are two possible outputs such as Yes/No, True/False, or 0/1. To convert the model's output into a probability between 0 and 1, logistic regression uses the sigmoid function (a mathematical function shaped like an "S", described below).

Types of Logistic Regression #

There are three major types of logistic regression, each based on the nature of the dependent variable.

1: Binomial (binary) logistic regression – used when the dependent variable has only two possible outcomes (e.g. yes/no, pass/fail). This is the most commonly used type of logistic regression and corresponds to binary classification problems.

2: Multinomial logistic regression – used when the dependent variable has three or more possible outcomes and the outcomes are unordered (e.g. classifying animals into categories such as cat/dog/sheep). Multinomial logistic regression extends binary logistic regression to support multiple classes.

3: Ordinal logistic regression – used when the dependent variable has three or more possible outcomes that have a natural order or ranking (e.g. ratings of low/medium/high). Ordinal logistic regression takes the ordering of the outcomes into account when modeling.

Assumptions of Logistic Regression #

The following assumptions should be understood and checked before applying logistic regression:

1: Independence of observations
The observations fed into the logistic regression model should be independent of one another (no correlation or dependence between data points).
2: Binary dependent variable
Binary logistic regression assumes the dependent variable has only two possible outcomes. If there are more than two classes, multinomial logistic regression (which uses the softmax function) is the appropriate choice.
3: Linearity of independent variables and log odds
Logistic regression assumes a linear relationship between the independent variables and the log odds of the dependent variable.
4: No extreme outliers
The dataset should not contain extreme outliers, since they can distort the estimated regression coefficients.
5: Adequate sample size
A sufficiently large number of observations is needed to produce stable and reproducible coefficient estimates.

Understanding Sigmoid Function #

1: The sigmoid function converts the model's raw output, which can be any real number, into a value between 0 and 1 that can be interpreted as a probability.

2: The sigmoid (logistic) function maps every real number onto an "S"-shaped curve whose outputs lie strictly between 0 and 1. Because probabilities must lie in this range, the sigmoid function is a natural choice for logistic regression.

3: Logistic regression combines the sigmoid output with a threshold value, generally 0.5. When the sigmoid score for a data point is greater than or equal to 0.5, the point is assigned to Class 1; when the score is less than 0.5, it is assigned to Class 0. This thresholding turns a continuous score into a usable class label.
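A minimal Python sketch of this idea, assuming NumPy is available; the raw scores below are hypothetical values chosen only to illustrate the thresholding:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-3.0, -0.2, 0.0, 1.5, 4.0])   # hypothetical raw model outputs z
probs = sigmoid(scores)                           # probabilities in (0, 1)
classes = (probs >= 0.5).astype(int)              # threshold at 0.5

print(probs)    # approx. [0.047 0.450 0.500 0.818 0.982]
print(classes)  # [0 0 1 1 1]
```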

1. Problem Setup #

We are given a dataset: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$

Where:

  • $x^{(i)} \in \mathbb{R}^n$ → feature vector
  • $y^{(i)} \in \{0, 1\}$ → class label
  • $m$ → number of training examples
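As a rough illustration of this notation, a hypothetical toy dataset with $m = 4$ examples and $n = 2$ features might look like this in NumPy:

```python
import numpy as np

X = np.array([[0.5, 1.2],
              [1.1, 0.3],
              [2.0, 2.5],
              [0.2, 0.1]])   # X[i] is the feature vector x^(i); shape (m, n)
y = np.array([0, 0, 1, 0])   # y[i] is the class label y^(i), each in {0, 1}

m, n = X.shape
print(m, n)  # 4 2
```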

2. Linear Model (Score Function) #

Logistic Regression starts with a linear combination: $z = w^T x + b$

Expanded: $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$

Where:

  • $w$ = weight vector
  • $b$ = bias

3. Sigmoid (Logistic) Function #

To map this value into a probability: $\sigma(z) = \frac{1}{1 + e^{-z}}$


So the model becomes: $\hat{y} = P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^T x + b)}}$

This is the hypothesis function.
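A minimal sketch of this hypothesis for a single example, with hypothetical weights, bias, and features:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])   # hypothetical weight vector
b = 0.1                     # hypothetical bias
x = np.array([2.0, 1.0])    # hypothetical feature vector

z = np.dot(w, x) + b        # linear score: w^T x + b
y_hat = sigmoid(z)          # P(y = 1 | x)
print(z, y_hat)             # 1.3, approx. 0.786
```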

4. Odds and Log-Odds #

Logistic Regression models log-odds (logit):

Odds: #

$\frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \frac{\hat{y}}{1 - \hat{y}}$

Log-Odds: #

$\log\left(\frac{\hat{y}}{1 - \hat{y}}\right) = w^T x + b$

Key insight:

Logistic Regression assumes the log-odds are linear in the input features.
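A quick numeric check of this insight, using the same hypothetical weights and features as above: applying the sigmoid and then taking the log-odds recovers the linear score exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b
y_hat = sigmoid(z)
log_odds = np.log(y_hat / (1.0 - y_hat))

print(z, log_odds)   # both 1.3, up to floating-point error
```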

5. Likelihood Function #

We want to estimate $w, b$ such that predictions match the data.

For one example: $P(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1-y}$

For the entire dataset: $L(w,b) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1 - \hat{y}^{(i)}\right)^{1 - y^{(i)}}$

6. Log-Likelihood #

Take the log (to simplify): $\ell(w,b) = \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$

7. Cost Function (Binary Cross-Entropy) #

We minimize the negative log-likelihood: $J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$

This is called:

  • Log Loss
  • Binary Cross-Entropy
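A minimal NumPy sketch of this cost, with hypothetical labels and predicted probabilities; the clipping guards against taking the log of exactly 0 or 1:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y     = np.array([1, 0, 1, 1])          # hypothetical labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6])  # hypothetical predicted probabilities

print(binary_cross_entropy(y, y_hat))   # approx. 0.30
```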

8. Gradient Descent Optimization #

We update weights using gradients.

Gradient w.r.t weights: #

$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}$

Gradient w.r.t bias: #

$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)$

Update Rule: #

$w := w - \alpha \frac{\partial J}{\partial w} \qquad b := b - \alpha \frac{\partial J}{\partial b}$

Where:

  • $\alpha$ = learning rate
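A minimal sketch of a single gradient-descent step using these formulas; the data, initial parameters, and learning rate are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.2], [1.1, 0.3], [2.0, 2.5], [0.2, 0.1]])  # hypothetical data
y = np.array([0, 0, 1, 0])
m = len(y)

w, b, alpha = np.zeros(2), 0.0, 0.1   # initial parameters and learning rate

y_hat = sigmoid(X @ w + b)            # predictions for all m examples
dw = (X.T @ (y_hat - y)) / m          # dJ/dw
db = np.mean(y_hat - y)               # dJ/db

w = w - alpha * dw                    # update rule
b = b - alpha * db
print(w, b)
```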

9. Decision Boundary #

Prediction rule: $\text{class} = \begin{cases} 1 & \text{if } \hat{y} \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$

Since $\hat{y} = 0.5$ exactly when $w^T x + b = 0$:

Decision boundary: $w^T x + b = 0$
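For two features, this boundary is a straight line. A small sketch with hypothetical weights:

```python
import numpy as np

w1, w2, b = 1.5, -2.0, 0.5
x1 = np.linspace(-3, 3, 7)
x2 = -(w1 * x1 + b) / w2      # points lying exactly on the boundary

# On the boundary the model is exactly undecided: sigmoid(0) = 0.5.
z = w1 * x1 + w2 * x2 + b
print(np.allclose(z, 0.0))    # True
```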

10. Vectorized Form (Efficient Implementation) #

Let:

  • $X \in \mathbb{R}^{m \times n}$
  • $w \in \mathbb{R}^{n}$

Then: $\hat{y} = \sigma(Xw + b)$
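A minimal sketch of the vectorized prediction, with hypothetical data and weights; a single matrix product produces the probabilities for all $m$ examples at once:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.2], [1.1, 0.3], [2.0, 2.5], [0.2, 0.1]])  # (m, n)
w = np.array([0.8, -0.4])                                        # (n,)
b = 0.1

y_hat = sigmoid(X @ w + b)    # shape (m,): P(y = 1 | x) for each example
print(y_hat)
```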

11. Regularization (Avoid Overfitting) #

L2 Regularization: #

$J(w,b) = \text{Loss} + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$

L1 Regularization: #

$J(w,b) = \text{Loss} + \frac{\lambda}{m} \sum_{j=1}^{n} |w_j|$
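As a rough illustration, the L2 penalty can be added directly to the binary cross-entropy loss; the helper below and its inputs are hypothetical:

```python
import numpy as np

def l2_regularized_cost(y, y_hat, w, lambda_, eps=1e-12):
    m = len(y)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    loss = -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)   # L2 term; the bias is not penalized
    return loss + penalty

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
w     = np.array([0.8, -0.4])

print(l2_regularized_cost(y, y_hat, w, lambda_=0.5))
```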

12. Summary (Mathematical Flow) #

$x \rightarrow z = w^T x + b \rightarrow \sigma(z) \rightarrow \hat{y}$
$\text{Loss} \rightarrow \text{Gradient} \rightarrow \text{Update } (w, b)$
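Putting the pieces together, a compact (and deliberately simplified) training-loop sketch on a hypothetical, linearly separable toy dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 examples, 2 features (hypothetical values).
X = np.array([[0.5, 1.2], [1.1, 0.3], [2.0, 2.5], [3.0, 2.2]])
y = np.array([0, 0, 1, 1])
m, n = X.shape

w, b = np.zeros(n), 0.0
alpha, epochs = 0.5, 1000

for _ in range(epochs):
    y_hat = sigmoid(X @ w + b)            # forward pass
    dw = (X.T @ (y_hat - y)) / m          # gradient w.r.t. weights
    db = np.mean(y_hat - y)               # gradient w.r.t. bias
    w -= alpha * dw                       # update
    b -= alpha * db

# Predicted classes; likely [0 0 1 1] since this toy data is separable.
print((sigmoid(X @ w + b) >= 0.5).astype(int))
```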

Conclusion of Logistic Regression #

Logistic Regression is a supervised learning algorithm used for classification, especially binary problems. It predicts probabilities (0–1) instead of continuous values by applying the sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

It models the relationship between features and the log-odds of the outcome, uses cross-entropy loss for training, and classifies data using a threshold (usually 0.5).
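In practice, a library implementation is usually used rather than hand-written gradient descent. A minimal sketch with scikit-learn; the toy data here is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5, 1.2], [1.1, 0.3], [2.0, 2.5], [3.0, 2.2]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression()           # uses L2 regularization by default
model.fit(X, y)

print(model.predict_proba(X)[:, 1])    # P(y = 1 | x) for each example
print(model.predict(X))                # classes via the 0.5 threshold
```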
