Distribution Analysis, Correlation Analysis, and Outlier Detection #
Introduction to Data Analysis #
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It is an essential step in data science, machine learning, and statistical research.
Some of the most important data analysis techniques include:
- Distribution Analysis
- Correlation Analysis
- Outlier Detection
- Trend Analysis
- Regression Analysis
In this tutorial, we will focus on three important techniques:
- Distribution Analysis
- Correlation Analysis
- Outlier Detection
1 Distribution Analysis #
What is Distribution Analysis? #
Distribution analysis is the process of understanding how data values are spread across a dataset. It helps us understand:
- Data shape
- Data center
- Data spread
- Data patterns
- Data skewness
It answers questions like:
- Is the data normally distributed?
- Is the data skewed?
- What is the range of values?
- Where do most values lie?
Types of Data Distribution #
Normal Distribution #
A normal distribution (Gaussian distribution) is symmetric and bell-shaped.
Properties:
- Mean = Median = Mode
- Symmetrical shape
- No skewness
Example:
Student heights often follow an approximately normal distribution.
Skewed Distribution #
Positive Skew (Right Skew) #
Properties:
- Tail on the right side
- Mean > Median
Example:
Income distribution.
Negative Skew (Left Skew) #
Properties:
- Tail on the left side
- Mean < Median
Example:
Retirement age within a single organization: most people retire near the standard age, while a few retire much earlier.
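The mean/median rule for skewed data can be checked on small samples. The sketch below uses made-up numbers (not real income or retirement data) chosen only so that one sample has a long right tail and the other a long left tail.

```python
from statistics import mean, median

# Made-up samples, chosen only to illustrate the mean/median rule.
right_skewed = [20, 22, 25, 27, 30, 35, 150]  # long tail on the right
left_skewed = [5, 58, 60, 62, 63, 64, 65]     # long tail on the left

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    # The extreme value pulls the mean toward the tail; the median barely moves.
    print(f"{name}: mean={mean(values):.1f}, median={median(values)}")
```

In the first sample the mean is above the median, and in the second it is below, which matches the properties listed for each type of skew.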
Measures Used in Distribution Analysis #
Central Tendency Measures #
These show the center of data.
Mean:
Sum of all values divided by the number of values: $\mu = \frac{\sum x}{N}$
Median:
Middle value after sorting.
Mode:
Most frequent value.
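As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The sample values are made up for the example.

```python
from statistics import mean, median, mode

# Made-up sample; not taken from any real dataset.
values = [2, 3, 3, 5, 7]

center_mean = mean(values)      # sum of values / number of values
center_median = median(values)  # middle value after sorting
center_mode = mode(values)      # most frequent value

print("mean:", center_mean)
print("median:", center_median)
print("mode:", center_mode)
```

Here the mean (4) is larger than the median and mode (both 3) because the value 7 pulls it upward.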
Spread Measures #
These show data variability.
Range:
$\text{Range} = x_{\max} - x_{\min}$
Variance:
$\text{Variance} = \frac{\sum (x - \mu)^2}{N}$
Standard Deviation:
$SD = \sqrt{\text{Variance}}$
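The spread measures can be worked through by hand on a small sample. This is a minimal sketch with made-up values; it uses the population form (divide by N), as in the variance formula in this section.

```python
import math

# Made-up sample; not taken from any real dataset.
values = [10, 12, 11, 13, 14]

n = len(values)
mu = sum(values) / n                               # mean
value_range = max(values) - min(values)            # largest minus smallest
variance = sum((x - mu) ** 2 for x in values) / n  # population variance
sd = math.sqrt(variance)                           # standard deviation

print("range:", value_range)
print("variance:", variance)
print("standard deviation:", round(sd, 4))
```

Note that libraries differ in their defaults: `numpy.std` divides by N unless `ddof` is set, while pandas `Series.std()` divides by N − 1 by default, so the two can give slightly different numbers on the same data.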
Shape Measures #
Skewness:
Measures asymmetry.
Kurtosis:
Measures how heavy the tails are compared with a normal distribution (often loosely described as peakedness).
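One common way to compute both shape measures is from central moments. The sketch below assumes the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 − 3); the helper names and sample values are illustrative, and library functions such as pandas `Series.skew()` apply a bias correction, so their results may differ slightly.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean)^k."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # made-up, perfectly symmetric
right_tailed = [1, 2, 3, 4, 20]  # made-up, one large value on the right

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-tailed):", round(skewness(right_tailed), 3))
print("excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 3))
```

The symmetric sample has zero skewness, while the sample with one large value has positive skewness, consistent with a right skew.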
Visualization Methods #
Distribution is commonly analyzed using:
Histogram:
Shows frequency distribution.
Box Plot:
Shows quartiles and outliers.
Density Plot:
Shows a smoothed estimate of the probability density.
Python Example #
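```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (assumes data.csv has a numeric 'Age' column)
df = pd.read_csv("data.csv")

# Histogram (missing values are dropped so they do not break the binning)
plt.hist(df['Age'].dropna())
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Boxplot
sns.boxplot(x=df['Age'])
plt.show()

# Summary statistics
print(df['Age'].describe())
```
2 Correlation Analysis #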
What is Correlation Analysis? #
Correlation analysis measures the relationship between two variables.
It helps answer:
- Do variables move together?
- How strong is the relationship?
- Is the relationship positive or negative?
Types of Correlation #
Positive Correlation #
When one variable increases, the other increases.
Example:
Study hours vs marks.
Negative Correlation #
When one increases, the other decreases.
Example:
Speed vs travel time.
Zero Correlation #
No relationship.
Example:
Shoe size vs intelligence.
Correlation Coefficient #
The correlation coefficient (r) ranges from −1 to +1:
- +1 Perfect positive
- 0 No correlation
- −1 Perfect negative
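Interpretation (based on the absolute value of r; the sign gives the direction). These cutoffs are a common rule of thumb, not a fixed standard:
| Absolute value of r | Meaning |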
|---|---|
| 0.9 to 1 | Very strong |
| 0.7 to 0.9 | Strong |
| 0.5 to 0.7 | Moderate |
| 0.3 to 0.5 | Weak |
| 0 to 0.3 | Very weak |
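The table can be turned into a small lookup helper. This is a sketch only: the function name `describe_strength` is hypothetical, and because the table's ranges share their endpoints, the code assumes each boundary value belongs to the stronger category.

```python
def describe_strength(r):
    """Map a correlation coefficient to the strength labels in the table.

    Assumption: boundary values (0.3, 0.5, 0.7, 0.9) go to the stronger label.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength ignores the sign
    if size >= 0.9:
        return "Very strong"
    if size >= 0.7:
        return "Strong"
    if size >= 0.5:
        return "Moderate"
    if size >= 0.3:
        return "Weak"
    return "Very weak"

for r in (0.95, -0.75, 0.4, -0.1):
    direction = "positive" if r > 0 else "negative"
    print(f"r = {r:+.2f}: {describe_strength(r)} {direction} correlation")
```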
Correlation Methods #
Pearson Correlation #
Used for linear relationships.
Formula:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
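To see how the Pearson formula works term by term, it can be written out directly. This is a minimal sketch; the function name `pearson_r` and the hours/marks values are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed directly from the formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    denominator = math.sqrt(sum_sq_x * sum_sq_y)
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator

# Made-up study hours and marks
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

print(round(pearson_r(hours, marks), 3))
```

For this made-up data the result is close to +1, which the interpretation table would label a very strong positive correlation.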
Spearman Correlation #
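Used for monotonic relationships (not necessarily linear) or ranked data. It is the Pearson correlation computed on the ranks of the values.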
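A small sketch can show why ranks help with non-linear data. It assumes Spearman correlation is computed as Pearson correlation on average ranks; the helper names and the cubic example data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the deviation formula."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs) *
                    sum((y - y_bar) ** 2 for y in ys))
    return num / den

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Made-up data: y grows with x, but along a curve (y = x^3)
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print("Pearson: ", round(pearson_r(x, y), 3))
print("Spearman:", round(spearman_r(x, y), 3))
```

Because y always increases with x, the ranks match exactly and Spearman is 1, while Pearson is below 1 because the relationship is not a straight line.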
Kendall Correlation #
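Used for ordinal data. It compares the number of concordant pairs (both variables move in the same direction) with the number of discordant pairs.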
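The pair-counting idea behind Kendall correlation can be sketched as follows. This assumes the simplest variant (often called tau-a, which does not adjust for ties); the function name and the two made-up rankings are illustrative. Library implementations such as `scipy.stats.kendalltau` use a tie-adjusted variant by default, so results can differ when ties are present.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables order this pair the same way
        elif direction < 0:
            discordant += 1   # the variables order this pair in opposite ways
    n = len(xs)
    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs

# Made-up rankings of five items by two judges; they disagree on items 2 and 3
judge_a = [1, 2, 3, 4, 5]
judge_b = [1, 3, 2, 4, 5]

print(kendall_tau_a(judge_a, judge_b))
```

Of the 10 possible pairs, 9 are concordant and 1 is discordant, giving (9 − 1) / 10 = 0.8.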
Visualization Techniques #
Scatter Plot:
Shows relationship.
Heatmap:
Shows correlation matrix.
Python Example #
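```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (assumes data.csv has numeric 'Hours' and 'Marks' columns)
df = pd.read_csv("data.csv")

# Correlation matrix (Pearson by default; non-numeric columns are skipped)
corr = df.corr(numeric_only=True)
print(corr)

# Heatmap, with the color scale fixed to the full range of r
sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
plt.show()

# Scatter plot
plt.scatter(df['Hours'], df['Marks'])
plt.xlabel("Hours")
plt.ylabel("Marks")
plt.show()
```
3 Outlier Detection #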
What are Outliers? #
Outliers are data points significantly different from other observations.
They may occur due to:
- Measurement error
- Data entry error
- Natural variation
- Fraud
- Rare events
Example:
Dataset:
10, 12, 11, 13, 100
Here 100 is an outlier.
Why Outlier Detection is Important #
Outliers can:
- Distort mean
- Reduce model accuracy
- Affect regression
- Mislead analysis
Sometimes outliers are important (fraud detection).
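The effect on the mean is easy to see with the example dataset from this section (10, 12, 11, 13, 100). The sketch below compares the mean and median with and without the value 100.

```python
from statistics import mean, median

data = [10, 12, 11, 13, 100]                    # example dataset from this section
without_outlier = [x for x in data if x != 100]

print("with 100:    mean =", mean(data), " median =", median(data))
print("without 100: mean =", mean(without_outlier), " median =", median(without_outlier))
```

The single outlier moves the mean from 11.5 to 29.2, while the median only moves from 11.5 to 12. This is why the median is often preferred as a measure of center when outliers are present.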
Methods for Detecting Outliers #
Z-Score Method #
Formula:
$Z = \frac{x - \mu}{\sigma}$
Rule:
| Z Score | Meaning |
|---|---|
| > 3 | Outlier |
| < -3 | Outlier |
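Applying the rule to the small example dataset from this tutorial (10, 12, 11, 13, 100) shows a limitation worth knowing. This is a sketch using the population standard deviation; the threshold of 3 is the rule from the table.

```python
import math

data = [10, 12, 11, 13, 100]  # example dataset from this tutorial

mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))  # population SD

z_scores = [(x - mu) / sigma for x in data]
flagged = [x for x, z in zip(data, z_scores) if abs(z) > 3]

for x, z in zip(data, z_scores):
    print(f"x = {x:>3}  z = {z:+.2f}")
print("flagged:", flagged)
```

The value 100 gets a Z score of about 2.0 and is not flagged. The outlier itself inflates the mean and standard deviation, and with only five values no Z score can exceed √(5 − 1) = 2 when the population standard deviation is used. The Z-score rule is therefore better suited to larger samples.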
IQR Method (Most Common) #
Steps:
Step 1:
Find Q1 (25%)
Step 2:
Find Q3 (75%)
Step 3:
$IQR = Q3 - Q1$
Step 4:
Lower bound: $Q1 - 1.5 \times IQR$
Upper bound: $Q3 + 1.5 \times IQR$
Values outside these bounds are outliers.
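The four steps can be traced on the example dataset from this tutorial (10, 12, 11, 13, 100). The sketch uses `statistics.quantiles` from the standard library with `method="inclusive"`, an assumption made here so that the quartiles are interpolated between the sorted values; other quartile conventions can give slightly different bounds.

```python
from statistics import quantiles

data = [10, 12, 11, 13, 100]  # example dataset from this tutorial

# Steps 1 and 2: quartiles (n=4 returns the three cut points Q1, Q2, Q3)
q1, _, q3 = quantiles(data, n=4, method="inclusive")

# Step 3: interquartile range
iqr = q3 - q1

# Step 4: bounds
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
print("bounds:", lower, "to", upper)
print("outliers:", outliers)
```

Here Q1 = 11 and Q3 = 13, so the bounds are 8 and 16 and the value 100 is flagged. Because quartiles are barely affected by a single extreme value, the IQR method works even on this very small sample.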
Visualization Methods #
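A box plot is a convenient visual check for outliers: points drawn beyond the whiskers (by default 1.5 × IQR from the quartiles in matplotlib and seaborn) are potential outliers.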
Python Example #
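Z Score #
```python
import numpy as np
import pandas as pd

# Load the dataset (assumes data.csv has a numeric 'Age' column)
df = pd.read_csv("data.csv")

mean = df['Age'].mean()
std = df['Age'].std(ddof=0)  # population standard deviation
z_scores = (df['Age'] - mean) / std

# Keep rows more than 3 standard deviations from the mean
outliers = df[np.abs(z_scores) > 3]
print(outliers)
```
IQR Method #
```python
# Reuses df from the Z Score example above
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep rows that fall outside the bounds
outliers = df[(df['Age'] < lower) | (df['Age'] > upper)]
print(outliers)
```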
Best Practices #
Distribution Analysis #
Always:
- Check skewness
- Check missing values
- Check spread
Correlation Analysis #
Remember:
Correlation does NOT mean causation.
Example:
Ice cream sales and drowning incidents are correlated, but neither causes the other; both tend to rise in hot weather.
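The ice cream example can be mimicked with a few made-up numbers in which both series are constructed to rise with temperature. All values below are invented for illustration, and `pearson_r` is a small helper written for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the deviation formula."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs) *
                    sum((y - y_bar) ** 2 for y in ys))
    return num / den

# Invented monthly figures; both series were made to rise with temperature.
temperature = [18, 21, 24, 27, 30, 33]
ice_cream_sales = [110, 135, 160, 170, 205, 230]
drownings = [2, 3, 3, 5, 6, 8]

r_sales_drownings = pearson_r(ice_cream_sales, drownings)
print("ice cream vs drownings:", round(r_sales_drownings, 2))
print("temperature vs ice cream:", round(pearson_r(temperature, ice_cream_sales), 2))
print("temperature vs drownings:", round(pearson_r(temperature, drownings), 2))
```

The two series are strongly correlated with each other even though, by construction, the only thing linking them is the shared dependence on temperature.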
Outlier Detection #
Do not always remove outliers.
First check:
- Is it an error?
- Is it a real value?
- Is it important to the analysis?
Distribution analysis helps understand data structure.
Correlation analysis helps understand relationships.
Outlier detection helps find unusual values, which may be errors to clean or rare events worth keeping.
Together these techniques form the foundation of Exploratory Data Analysis (EDA).
They improve:
- Data quality
- Model performance
- Decision accuracy
| Technique | Purpose | Key Methods | Visualization | Benefits |
|---|---|---|---|---|
| Distribution Analysis | Understand how data is spread | Mean, Median, Standard Deviation, Skewness | Histogram, Boxplot, Density Plot | Understand data behavior |
| Correlation Analysis | Find relationship between variables | Pearson, Spearman, Kendall | Scatter Plot, Heatmap | Feature selection |
| Outlier Detection | Identify abnormal values | Z-Score, IQR | Boxplot, Scatter Plot | Improve model accuracy |
