Naive Bayes is a machine learning classification algorithm that predicts the category of a data point using probability. It assumes that all features are independent of each other. Naive Bayes performs well in many real-world applications such as spam filtering, document categorization and sentiment analysis.




[Figure: a two-class dataset, with the class-conditional distribution estimated along each dimension and the two combined via conditional independence.]

Here:
- Original data has two classes: green circles (y = 1) and red squares (y = 2).
- Estimate the probability distribution along the first dimension, i.e. P(x1∣y=1) and P(x1∣y=2)
- Estimate the probability distribution along the second dimension, i.e. P(x2∣y=1) and P(x2∣y=2)
- Combine both dimensions using conditional independence, i.e. P(x∣y) = ∏α P(xα∣y) (a short sketch of these steps follows below)
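A minimal sketch of these steps in Python, assuming Gaussian class-conditional distributions and a few made-up 2-D points per class (all data and numbers are illustrative only):

```python
import numpy as np
from scipy.stats import norm

# Made-up 2-D samples (illustrative only)
X1 = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.3]])  # class y = 1 (green circles)
X2 = np.array([[3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])  # class y = 2 (red squares)

def fit_per_dimension(X):
    # Estimate P(x_alpha | y) per dimension as a Gaussian: mean and std
    return X.mean(axis=0), X.std(axis=0, ddof=1)

mu1, sd1 = fit_per_dimension(X1)
mu2, sd2 = fit_per_dimension(X2)

def class_likelihood(x, mu, sd):
    # Conditional independence: P(x | y) = product over dimensions of P(x_alpha | y)
    return np.prod(norm.pdf(x, loc=mu, scale=sd))

x_new = np.array([1.1, 2.0])
print(class_likelihood(x_new, mu1, sd1))  # likelihood under class y = 1
print(class_likelihood(x_new, mu2, sd2))  # likelihood under class y = 2
```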
Key Features of Naive Bayes Classifiers #
The main idea behind the Naive Bayes classifier is to use Bayes’ Theorem to classify data based on the probabilities of different classes given the features of the data. It is mostly used for high-dimensional text classification.
- The Naive Bayes classifier is a simple probabilistic classifier with very few parameters, so the models it builds can be trained and used for prediction faster than many other classification algorithms.
- It is called “naive” because it assumes that each feature is independent of the existence of every other feature. In other words, each feature contributes to the prediction on its own, with no relation to the others.
- The Naive Bayes algorithm is used in spam filtering, sentiment analysis, article classification and many more applications.
The Naive Bayes algorithm is a probabilistic classifier based on Bayes’ Theorem. Despite its effectiveness in complex tasks like spam detection or sentiment analysis, it relies on a few critical—and often “naive”—simplifications.
Understanding these assumptions is key to knowing when the model will perform well and when it might struggle.
1. The Independence Assumption (The “Naive” Part) #
The core assumption is Conditional Independence. It assumes that the presence of a particular feature in a class is completely unrelated to the presence of any other feature, given the class variable.
Mathematically, for a set of features x1,x2,…,xn and a class C:
P(x1, x2, …, xn∣C) = P(x1∣C) · P(x2∣C) · … · P(xn∣C)
- In Reality: This is rarely true. For example, in a medical dataset, “High Blood Pressure” and “High Cholesterol” are often correlated. Naive Bayes treats them as two completely separate pieces of evidence.
- Why it’s used: It drastically simplifies the computation, turning a complex multidimensional joint probability into a simple product of individual probabilities (see the toy sketch below).
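As a toy illustration of that product (the numbers are made up; real implementations usually sum log-probabilities instead, to avoid numerical underflow):

```python
# Toy example of the naive product rule (numbers are made up)
# Per-feature likelihoods for one class, e.g. P(word_i | spam)
p_features_given_class = [0.8, 0.3, 0.6]

joint = 1.0
for p in p_features_given_class:
    joint *= p  # P(x1, x2, x3 | C) = P(x1|C) * P(x2|C) * P(x3|C)

print(joint)  # 0.8 * 0.3 * 0.6 = 0.144
```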
2. Equal Importance of Features #
Naive Bayes assumes that every feature contributes equally to the outcome. It does not weight one feature as more significant than another.
- Example: In email filtering, the algorithm assumes the word “Money” is just as informative as the word “Discount” unless the data frequencies specifically prove otherwise. It cannot inherently “know” that some attributes are more critical than others without sufficient training data.
3. Distributed Data (Likelihood Assumptions) #
Depending on the version of Naive Bayes you use, the algorithm makes assumptions about the distribution of your continuous numerical data:
- Gaussian Naive Bayes: Assumes that continuous features follow a Normal (Gaussian) Distribution.
- Multinomial Naive Bayes: Assumes the data follows a multinomial distribution (ideal for word counts in text).
- Bernoulli Naive Bayes: Assumes features are binary (e.g., a word is either “present” or “absent”).
4. Large Training Datasets #
Because Naive Bayes relies on frequency counts to estimate probabilities, it assumes that the training data is representative of the real world.
- The Zero-Frequency Problem: If a specific feature value (like the word “Bargain”) never appears in the training data for a certain class, the estimated probability for that feature becomes 0. Since the algorithm multiplies probabilities together, the entire result becomes 0.
- The Fix: This is usually handled by Laplace Smoothing (adding a small number to each count; see the sketch below), but the underlying assumption remains that the training set is comprehensive.
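A minimal sketch of Laplace (add-one) smoothing, using made-up word counts for a single class:

```python
# Add-one (Laplace) smoothing for P(word | class); counts are made up
word_counts = {"free": 12, "offer": 7, "bargain": 0}  # counts within one class
vocab_size = len(word_counts)
total = sum(word_counts.values())

for word, count in word_counts.items():
    unsmoothed = count / total                     # "bargain" -> 0.0, which zeroes the product
    smoothed = (count + 1) / (total + vocab_size)  # never exactly zero
    print(word, round(unsmoothed, 3), round(smoothed, 3))
```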
Summary Table #
| Assumption | Meaning | Impact |
| --- | --- | --- |
| Independence | Features don’t influence each other. | Simplifies math; ignores feature relationships. |
| Equality | All features have the same weight. | May oversimplify complex dependencies. |
| Distribution | Data follows a specific curve (e.g., Normal). | Performance drops if data is skewed. |
| Completeness | All possible outcomes are in the data. | Requires smoothing to avoid “Zero Probability” errors. |
Bayes’ Theorem #
Understand the Problem #
In many real-world situations, we want to answer:
“Given some observed data (X), what is the probability of a class (y)?”
Example:
- You see certain symptoms (X)
- You want to know the probability of a disease (y)
The Formula #
P(y∣X) = P(X∣y) · P(y) / P(X)

This formula helps us reverse probabilities: from P(X∣y), which is often easy to estimate from data, we get P(y∣X), which is what we actually want.
Break Down Each Term #
1. Posterior → P(y∣X) #
What we want to find
“Probability of class y given data X”
2. Likelihood → P(X∣y) #
How likely the data is for a class
“If y is true, how likely is X?”
3. Prior → P(y) #
Initial belief before seeing data
“How common is class y?”
4. Evidence → P(X) #
Overall probability of data
“How common is X in general?”
Intuition (Very Important) #
Bayes’ Theorem says:
Updated belief = (How well data fits the class × Prior belief) ÷ Total data probability
Simple Numerical Example #
Problem: #
Suppose:
- 1% of people have the disease → P(y) = 0.01
- The test detects the disease 90% of the time (its true positive rate) → P(X∣y) = 0.9
- False positive rate = 5% → P(X∣¬y) = 0.05
We want:
Probability of disease given positive test
Calculate Evidence #
P(X) = P(X∣y) · P(y) + P(X∣¬y) · P(¬y)
P(X) = (0.9 × 0.01) + (0.05 × 0.99) = 0.009 + 0.0495 = 0.0585
Apply Bayes’ Formula #
P(y∣X) = P(X∣y) · P(y) / P(X) = 0.009 / 0.0585 ≈ 0.1538
Final Result #
Even after a positive test, the probability of having the disease is only about 15.38%.
Important Insight:
- Even accurate tests can mislead if the disease is rare
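The same calculation written out in Python, using only the numbers given in the problem:

```python
# Bayes' theorem for the disease-test example above
p_y = 0.01              # prior: P(disease)
p_x_given_y = 0.90      # P(positive test | disease)
p_x_given_not_y = 0.05  # P(positive test | no disease), the false positive rate

# Evidence: P(X) = P(X|y) * P(y) + P(X|not y) * P(not y)
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)  # 0.0585

# Posterior: P(y|X) = P(X|y) * P(y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 4))  # 0.1538 -> about 15.38%
```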
Why It Matters in Machine Learning #
Bayes’ Theorem is used in:
- Spam detection
- Naive Bayes classifier
- Medical diagnosis
- Prediction systems
Summary (Quick Revision) #
- Start with prior belief → P(y)
- Use data likelihood → P(X∣y)
- Normalize using → P(X)
- Get updated belief → P(y∣X)
Types of Naive Bayes Classifiers #
Different types of Naive Bayes classifier are used depending on the kind of data you have.
1. Gaussian Naive Bayes (For Continuous Data) #
Gaussian Naive Bayes is a classification algorithm used when your data consists of continuous numerical values, such as height, weight, or temperature.
It assumes that each feature follows a normal (Gaussian) distribution, meaning the data forms a bell-shaped curve.
The model learns by calculating the mean and variance of each feature for every class. Using this information, it estimates the probability that a new data point belongs to a particular class.
Finally, it predicts the class with the highest probability.
This method is best used when your data contains real numeric values.
Examples: #
- Height and weight measurements
- Temperature readings
- Sensor data
Key Assumption: #
The data follows a normal (Gaussian) distribution
Simple Example: #
Imagine you want to predict whether a person is male or female based on height.
Gaussian Naive Bayes analyzes the height distribution for both groups and then decides which group a new person’s height most likely belongs to.
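A minimal runnable sketch with scikit-learn’s GaussianNB, using made-up height and weight values (the labels and numbers are illustrative only):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up (height cm, weight kg) samples; labels: 0 = female, 1 = male
X = np.array([[160, 55], [165, 60], [158, 52],
              [178, 80], [182, 85], [175, 78]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()  # learns a per-class mean and variance for each feature
model.fit(X, y)

print(model.predict([[170, 68]]))        # predicted class for a new person
print(model.predict_proba([[170, 68]]))  # probability of each class
```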
2. Multinomial Naive Bayes (For Discrete Data) #
Multinomial Naive Bayes is used when your data consists of discrete values, especially counts like how many times something appears.
It is most commonly used in text classification problems, such as spam detection or document categorization.
Instead of just checking if a word exists, this model looks at how often each word appears in a document.
Based on these counts, it calculates the probability of the document belonging to a certain class.
Finally, it predicts the class with the highest probability.
This method works best when dealing with frequency-based data.
Examples: #
- Word counts in emails
- Number of times a keyword appears
- Document classification
Key Idea: #
More frequent words have more influence on the prediction
Simple Example: #
Imagine an email containing words like “free”, “offer”, “win” many times.
Multinomial Naive Bayes counts these words and predicts whether the email is spam or not spam based on their frequency.
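A minimal runnable sketch with scikit-learn’s CountVectorizer and MultinomialNB on a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus; labels: 1 = spam, 0 = not spam
texts = ["free offer win win free", "meeting notes attached",
         "win a free prize offer", "project schedule update"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()  # turns each document into word counts
X = vectorizer.fit_transform(texts)

model = MultinomialNB()  # alpha=1.0 by default, i.e. Laplace smoothing
model.fit(X, labels)

test = vectorizer.transform(["free offer just for you"])
print(model.predict(test))  # expected to lean towards [1] (spam)
```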
3. Bernoulli Naive Bayes (For Binary Data) #
Bernoulli Naive Bayes is used when your data is binary, meaning it only has two possible values like yes/no, true/false, or 0/1.
Instead of counting how many times a feature appears, this model only checks whether the feature is present or not.
It calculates probabilities based on this presence or absence of features for each class.
Finally, it predicts the class with the highest probability.
This method is best when features are simple and binary.
Examples: #
- Whether a word exists in an email (Yes/No)
- Clicked or not clicked
- Purchased or not purchased
Key Idea: #
Only presence matters, not frequency
Simple Example: #
If an email contains the word “discount”, Bernoulli Naive Bayes checks only its presence (not how many times it appears) and predicts whether the email is spam or not.
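A minimal runnable sketch with scikit-learn’s BernoulliNB on made-up binary features:

```python
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features per email:
# [contains "discount", contains "urgent", has attachment]
X = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1],
     [0, 1, 1]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

model = BernoulliNB()  # models presence/absence of each feature per class
model.fit(X, y)

print(model.predict([[1, 0, 1]]))  # prediction for a new email
```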

