Data analysis techniques

Distribution Analysis, Correlation Analysis, and Outlier Detection #

Introduction to Data Analysis #

Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It is an essential step in data science, machine learning, and statistical research.

Some of the most important data analysis techniques include:

Distribution Analysis
Correlation Analysis
Outlier Detection
Trend Analysis
Regression Analysis

In this tutorial, we will focus on three important techniques:

Distribution Analysis
Correlation Analysis
Outlier Detection

Distribution Analysis #

What is Distribution Analysis? #

Distribution analysis is the process of understanding how data values are spread across a dataset. It helps us understand:

Data shape
Data center
Data spread
Data patterns
Data skewness

It answers questions like:

Is the data normally distributed?
Is the data skewed?
What is the range of values?
Where do most values lie?

Types of Data Distribution #

Normal Distribution #

A normal distribution (Gaussian distribution) is symmetric and bell-shaped.

Properties:

Mean = Median = Mode
Symmetrical shape
No skewness

Example:
Student heights often follow an approximately normal distribution.

Skewed Distribution #

Positive Skew (Right Skew) #

Properties:

Tail on the right side
Mean > Median

Example:
Income distribution.

Negative Skew (Left Skew) #

Properties:

Tail on the left side
Mean < Median

Example:
Retirement age within an organization: most people retire near the upper age limit, and a few retire much earlier.

Measures Used in Distribution Analysis #

Central Tendency Measures #

These show the center of data.

Mean:

$Mean = \frac{\sum X}{N}$

Median:
Middle value after sorting.

Mode:
Most frequent value.
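As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The `incomes` values are made up for illustration. They are right-skewed, so the mean lands above the median.

```python
import statistics

# Hypothetical right-skewed sample (illustrative values, not real data)
incomes = [20, 22, 22, 25, 27, 30, 95]

mean = statistics.mean(incomes)      # sum of the values divided by their count
median = statistics.median(incomes)  # middle value after sorting
mode = statistics.mode(incomes)      # most frequent value

print(f"mean={mean:.2f}, median={median}, mode={mode}")
# prints: mean=34.43, median=25, mode=22
```

The single large value (95) pulls the mean upward while the median and mode stay put, which is why the median is often preferred for skewed data.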

Spread Measures #

These show data variability.

Range: $Range = Max - Min$

Variance: $Variance = \frac{\sum (x-\mu)^2}{N}$

Standard Deviation: $SD = \sqrt{Variance}$
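The spread measures can be sketched with NumPy as follows. The `ages` sample is hypothetical. NumPy's `var()` and `std()` divide by N by default, which matches the formula above, while pandas divides by N - 1 by default, so the two libraries can give slightly different numbers for the same data.

```python
import numpy as np

ages = np.array([23, 25, 31, 35, 36])  # hypothetical sample

value_range = ages.max() - ages.min()  # Max - Min
variance = ages.var()                  # population variance: divides by N (ddof=0)
std_dev = ages.std()                   # square root of the population variance
sample_variance = ages.var(ddof=1)     # sample variance: divides by N - 1

print("range:", value_range)
print("variance (N):", variance)
print("standard deviation:", std_dev)
print("variance (N - 1):", sample_variance)
```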

Shape Measures #

Skewness:
Measures asymmetry.

Kurtosis:
Measures how heavy the tails are compared with a normal distribution (often loosely described as peakedness).
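A minimal sketch of both shape measures using pandas, with made-up samples. Note that `Series.skew()` and `Series.kurt()` return bias-corrected estimates, and `kurt()` reports excess kurtosis, so a normal distribution scores about 0 rather than 3.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])            # hypothetical, symmetric
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 20])  # hypothetical, long right tail
left_skewed = -right_skewed                       # mirror image: long left tail

skew_symmetric = symmetric.skew()   # approximately 0 for symmetric data
skew_right = right_skewed.skew()    # positive for a right tail
skew_left = left_skewed.skew()      # negative for a left tail
kurt_right = right_skewed.kurt()    # excess kurtosis; positive means heavy tails

print(skew_symmetric, skew_right, skew_left, kurt_right)
```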

Visualization Methods #

Distribution is commonly analyzed using:

Histogram:
Shows frequency distribution.

Box Plot:
Shows quartiles and outliers.

Density Plot:
Shows a smoothed estimate of the probability density.

Python Example #

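import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (assumes data.csv has a numeric 'Age' column)
df = pd.read_csv("data.csv")

# Histogram: drop missing values first, since NaNs break the bin calculation
plt.hist(df['Age'].dropna())
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Boxplot: pass the column by keyword so seaborn plots it along the x axis
sns.boxplot(x=df['Age'])
plt.show()

# Summary statistics (count, mean, std, min, quartiles, max)
print(df['Age'].describe())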

Correlation Analysis #

What is Correlation Analysis? #

Correlation analysis measures the strength and direction of the relationship between two variables.

It helps answer:

Do variables move together?
How strong is the relationship?
Is the relationship positive or negative?

Types of Correlation #

Positive Correlation #

When one variable increases, the other tends to increase.

Example:
Study hours vs marks.

Negative Correlation #

When one variable increases, the other tends to decrease.

Example:
Speed vs travel time.

Zero Correlation #

No linear relationship.

Example:
Shoe size vs intelligence.

Correlation Coefficient #

The correlation coefficient (r) ranges from −1 to +1:

+1 Perfect positive
0 No correlation
−1 Perfect negative

Interpretation (by the absolute value of r; the sign gives the direction):

Value	Meaning
0.9 to 1	Very strong
0.7 to 0.9	Strong
0.5 to 0.7	Moderate
0.3 to 0.5	Weak
0 to 0.3	Very weak
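The table above can be turned into a small helper, sketched below. The function name `correlation_strength` is hypothetical, and because the bands in the table share their endpoints, assigning each boundary value to the stronger band is an assumption made here.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the bands from the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    magnitude = abs(r)  # strength ignores the sign; the sign gives direction
    if magnitude >= 0.9:
        return "Very strong"
    if magnitude >= 0.7:
        return "Strong"
    if magnitude >= 0.5:
        return "Moderate"
    if magnitude >= 0.3:
        return "Weak"
    return "Very weak"


print(correlation_strength(0.95))   # Very strong
print(correlation_strength(-0.75))  # Strong
print(correlation_strength(0.1))    # Very weak
```

These cut-offs are rules of thumb. What counts as a strong correlation varies by field and sample size.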

Correlation Methods #

Pearson Correlation #

Used for linear relationships.

Formula: $r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}$
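To make the formula concrete, here is a sketch that computes r term by term and compares it with NumPy's `np.corrcoef`. The `hours` and `marks` values are invented for illustration.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])       # hypothetical study hours
marks = np.array([52, 55, 61, 70, 72])  # hypothetical marks

dx = hours - hours.mean()  # deviations of x from its mean
dy = marks - marks.mean()  # deviations of y from its mean

# Numerator: sum of products of deviations.
# Denominator: square root of the product of the summed squared deviations.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(hours, marks)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))
```

Both values should agree up to floating-point rounding (about 0.98 for this sample).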

Spearman Correlation #

Used for monotonic relationships that are not necessarily linear, and for ranked data.

Kendall Correlation #

Used for ordinal relationships.
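The difference between the Pearson and rank-based methods shows up on data that is monotonic but not linear. The sketch below uses an invented series and computes Spearman's coefficient as the Pearson correlation of the ranks, which is its definition. pandas also accepts `method="spearman"` or `method="kendall"` in `corr()`, although depending on the installed version those options may require SciPy.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6])
y = x ** 3  # always increasing, but not along a straight line

pearson = x.corr(y)                 # default method is Pearson
spearman = x.rank().corr(y.rank())  # Pearson on the ranks = Spearman's rho

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Pearson falls short of 1 because the points do not lie on a line, while Spearman equals 1 because the ordering of `x` and `y` agrees perfectly.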

Visualization Techniques #

Scatter Plot:
Shows relationship.

Heatmap:
Shows correlation matrix.

Python Example #

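import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (assumes data.csv has numeric 'Hours' and 'Marks' columns)
df = pd.read_csv("data.csv")

# Correlation matrix: keep numeric columns only, because recent pandas
# versions raise an error if text columns are passed to corr()
corr = df.select_dtypes(include="number").corr()

print(corr)

# Heatmap of the correlation matrix, fixed to the full -1..1 scale
sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
plt.show()

# Scatter plot of two variables
plt.scatter(df['Hours'], df['Marks'])
plt.xlabel("Hours")
plt.ylabel("Marks")
plt.show()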

Outlier Detection #

What are Outliers? #

Outliers are data points that differ significantly from the other observations.

They may occur due to:

Measurement error
Data entry error
Natural variation
Fraud
Rare events

Example:

Dataset:
10, 12, 11, 13, 100

Here 100 is an outlier.

Why Outlier Detection is Important #

Outliers can:

Distort mean
Reduce model accuracy
Affect regression
Mislead analysis

Sometimes the outliers are the signal of interest, as in fraud detection.
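The effect on the mean is easy to check with the example dataset from earlier in this section. This sketch compares the mean and the median with and without the value 100.

```python
import statistics

data = [10, 12, 11, 13, 100]
without_outlier = [v for v in data if v != 100]

mean_with = statistics.mean(data)                    # 29.2
median_with = statistics.median(data)                # 12
mean_without = statistics.mean(without_outlier)      # 11.5
median_without = statistics.median(without_outlier)  # 11.5

print(mean_with, median_with, mean_without, median_without)
```

One extreme value more than doubles the mean, while the median barely moves.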

Methods for Detecting Outliers #

Z-Score Method #

Formula:

$Z = \frac{x-\mu}{\sigma}$

Rule:

Z Score	Meaning
> 3	Outlier
< -3	Outlier
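A minimal sketch of the rule, with a hypothetical helper named `zscore_outliers`. It also illustrates a limitation worth knowing: in a very small sample the outlier inflates the standard deviation so much that no point can reach |Z| > 3. With the population standard deviation of n = 5 values, |Z| cannot exceed sqrt(n - 1) = 2, so the value 100 in the earlier example dataset is not flagged.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold.

    Assumes the values are not all identical (the standard deviation
    would be zero and the division undefined).
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()  # population std (ddof=0)
    return values[np.abs(z) > threshold]


small = [10, 12, 11, 13, 100]         # example dataset from this tutorial
large = [10, 12, 11, 13] * 5 + [100]  # same pattern, but 21 observations

small_flagged = zscore_outliers(small)  # empty: z of 100 is about 2.0
large_flagged = zscore_outliers(large)  # contains 100: z is about 4.5

print(small_flagged, large_flagged)
```

For small or heavily skewed samples, the IQR method is usually the safer choice.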

IQR Method (Most Common) #

Steps:

Step 1:
Find Q1 (25%)

Step 2:
Find Q3 (75%)

Step 3: $IQR = Q3 - Q1$

Step 4:

Lower bound: $Q1 - 1.5 \times IQR$

Upper bound: $Q3 + 1.5 \times IQR$

Values outside are outliers.
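The steps above can be traced on the example dataset from earlier (10, 12, 11, 13, 100). This sketch uses NumPy's default quartile definition (linear interpolation). Other quartile conventions can give slightly different bounds.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 100])

# Steps 1 and 2: quartiles
q1, q3 = np.percentile(data, [25, 75])  # 11.0 and 13.0

# Step 3: interquartile range
iqr = q3 - q1                           # 2.0

# Step 4: bounds
lower = q1 - 1.5 * iqr                  # 8.0
upper = q3 + 1.5 * iqr                  # 16.0

# Values outside the bounds are outliers
iqr_outliers = data[(data < lower) | (data > upper)]

print(lower, upper, iqr_outliers)
```

Only 100 falls outside the interval from 8 to 16, so it is the single value flagged.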

Visualization Methods #

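A box plot is a common choice for spotting outliers: points drawn beyond the whiskers (by default 1.5 × IQR from the box) are the candidates.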

Python Example #

Z Score #

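import numpy as np
import pandas as pd

# Load the dataset (assumes data.csv has a numeric 'Age' column)
df = pd.read_csv("data.csv")

# Mean and standard deviation of the column
mean = np.mean(df['Age'])
std = np.std(df['Age'])

# Z-score of every value
z_scores = (df['Age'] - mean) / std

# Keep the rows that lie more than 3 standard deviations from the mean
outliers = df[np.abs(z_scores) > 3]

print(outliers)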

IQR Method #

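# Reuses the df loaded from data.csv in the previous examples

# First and third quartiles
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)

# Interquartile range
IQR = Q3 - Q1

# Bounds beyond which a value counts as an outlier
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Rows with an Age outside the bounds
outliers = df[(df['Age'] < lower) | (df['Age'] > upper)]

print(outliers)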

Best Practices #

Distribution Analysis #

Always:

Check skewness
Check missing values
Check spread

Correlation Analysis #

Remember:

Correlation does NOT mean causation.

Example:

Ice cream sales and drowning incidents are correlated, but neither causes the other: both rise in hot weather.
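The effect of a hidden common cause can be simulated. The sketch below uses entirely synthetic numbers (the coefficients and noise levels are arbitrary assumptions, not real statistics): temperature drives both series, which makes them correlate strongly, and the correlation largely disappears once temperature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic data: temperature drives both quantities independently
temperature = rng.uniform(10, 35, size=500)
ice_cream = 5.0 * temperature + rng.normal(0, 10, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# Raw correlation between the two outcomes
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]


def residuals(y, x):
    """Part of y left over after fitting a straight line on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation after removing the temperature effect from both series
r_adjusted = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {r_raw:.2f}, adjusted for temperature: {r_adjusted:.2f}")
```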

Outlier Detection #

Do not always remove outliers.

First check:

Is it an error?
Is it real?
Is it important?
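One way to follow this advice is to flag suspected outliers in a separate column instead of deleting them, so they can be reviewed before any decision is made. A minimal sketch with invented ages and a hypothetical column name `is_outlier`:

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 25, 27, 30, 31, 34, 120]})  # hypothetical data

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1

# Mark rows outside the IQR bounds; nothing is removed from the data
df["is_outlier"] = ~df["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df)
```

The flagged row (Age 120) stays available for inspection, and later steps can include or exclude it explicitly.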

Distribution analysis helps understand data structure.
Correlation analysis helps understand relationships.
Outlier detection helps clean data.

Together these techniques form the foundation of Exploratory Data Analysis (EDA).

They improve:

Data quality
Model performance
Decision accuracy

Technique	Purpose	Key Methods	Visualization	Benefits
Distribution Analysis	Understand how data is spread	Mean, Median, Standard Deviation, Skewness	Histogram, Boxplot, Density Plot	Understand data behavior
Correlation Analysis	Find relationship between variables	Pearson, Spearman, Kendall	Scatter Plot, Heatmap	Feature selection
Outlier Detection	Identify abnormal values	Z-Score, IQR	Boxplot, Scatter Plot	Improve model accuracy
