Data analysis techniques

Distribution Analysis, Correlation Analysis, and Outlier Detection #

Introduction to Data Analysis #

Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It is an essential step in data science, machine learning, and statistical research.

Some of the most important data analysis techniques include:

  • Distribution Analysis
  • Correlation Analysis
  • Outlier Detection
  • Trend Analysis
  • Regression Analysis

In this tutorial, we will focus on three important techniques:

  • Distribution Analysis
  • Correlation Analysis
  • Outlier Detection

Distribution Analysis #

What is Distribution Analysis? #

Distribution analysis is the process of understanding how data values are spread across a dataset. It helps us understand:

  • Data shape
  • Data center
  • Data spread
  • Data patterns
  • Data skewness

It answers questions like:

  • Is the data normally distributed?
  • Is the data skewed?
  • What is the range of values?
  • Where do most values lie?

Types of Data Distribution #

Normal Distribution #

A normal distribution (Gaussian distribution) is symmetric and bell-shaped.

Properties:

  • Mean = Median = Mode
  • Symmetrical shape
  • No skewness

Example:
Student heights often follow a normal distribution.

Skewed Distribution #

Positive Skew (Right Skew) #

Properties:

  • Tail on the right side
  • Mean > Median

Example:
Income distribution.

Negative Skew (Left Skew) #

Properties:

  • Tail on the left side
  • Mean < Median

Example:
Retirement age within a single organization, where most people retire near the upper limit and a few retire much earlier.
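The mean and median relationships listed for each kind of skew can be checked on small samples. The following is a minimal sketch using only Python's standard library; both samples are invented for illustration and do not come from a real dataset.

```python
from statistics import mean, median

# Hypothetical samples, invented purely for illustration.
right_skewed = [20, 22, 25, 27, 30, 35, 120]  # long tail on the right
left_skewed = [5, 58, 60, 62, 63, 64, 65]     # long tail on the left

for name, sample in [("right skew", right_skewed), ("left skew", left_skewed)]:
    # The tail pulls the mean toward it, while the median barely moves.
    print(name, "mean:", round(mean(sample), 2), "median:", median(sample))
```

In the right-skewed sample the mean lands above the median, and in the left-skewed sample it lands below, matching the properties listed above.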

Measures Used in Distribution Analysis #

Central Tendency Measures #

These show the center of data.

Mean:

$\text{Mean} = \frac{\sum X}{N}$

Median:
Middle value after sorting.

Mode:
Most frequent value.
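As a quick illustration, the three measures can be computed with Python's built-in `statistics` module. This is a sketch on a made-up list of ages (the `ages` values are an assumption, not data from this tutorial).

```python
from statistics import mean, median, multimode

ages = [21, 22, 22, 23, 25, 22, 30]  # hypothetical sample

center = {
    "mean": mean(ages),        # sum of the values divided by their count
    "median": median(ages),    # middle value after sorting
    "modes": multimode(ages),  # every value tied for most frequent
}
print(center)
```

`multimode` is used instead of `mode` so that a sample with several equally frequent values reports all of them.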

Spread Measures #

These show data variability.

Range: $\text{Range} = \text{Max} - \text{Min}$

Variance: $\text{Variance} = \frac{\sum (x-\mu)^2}{N}$

Standard Deviation: $SD = \sqrt{\text{Variance}}$
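The spread formulas can be worked through by hand on a small sample. This is a minimal sketch; the `values` list is hypothetical, chosen so the arithmetic comes out in round numbers.

```python
from math import sqrt
from statistics import pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

n = len(values)
mu = sum(values) / n                               # mean = 5.0
value_range = max(values) - min(values)            # Max - Min
variance = sum((x - mu) ** 2 for x in values) / n  # divides by N (population form)
sd = sqrt(variance)

print(value_range, variance, sd)  # 7 4.0 2.0

# Cross-check against the standard library's population variance.
assert pvariance(values) == variance
```

Note that the formula above divides by N. pandas' `Series.var()` and `Series.std()` divide by N − 1 by default (the sample form), so their results will differ slightly from this calculation.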

Shape Measures #

Skewness:
Measures asymmetry.

Kurtosis:
Measures how heavy the tails are compared with a normal distribution (often loosely described as peakedness).
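Both shape measures can be sketched from central moments. The helper names below (`central_moment`, `skewness`, `excess_kurtosis`) are hypothetical, and the code uses the simple population (moment-based) definitions; library functions such as pandas' `skew()` and `kurt()` apply bias corrections, so their numbers will differ slightly on small samples.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)

def skewness(data):
    # Third moment scaled by the variance to the power 3/2.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Fourth moment scaled by variance squared, minus 3 so that
    # a normal distribution scores 0.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical, perfectly symmetric
right_tailed = [1, 2, 3, 4, 50]  # hypothetical, one large value on the right

print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric), excess_kurtosis(right_tailed))
```

The symmetric sample has zero skewness, while the sample with one large value has positive skewness and a much higher kurtosis.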

Visualization Methods #

Distribution is commonly analyzed using:

Histogram:
Shows frequency distribution.

Box Plot:
Shows quartiles and outliers.

Density Plot:
Shows a smoothed estimate of the probability density.

Python Example #

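import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (assumes data.csv has a numeric 'Age' column)
df = pd.read_csv("data.csv")

# Histogram: frequency of values in each bin (missing values dropped first)
plt.hist(df['Age'].dropna())
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Box plot: quartiles and potential outliers
sns.boxplot(x=df['Age'])
plt.show()

# Summary statistics: count, mean, std, min, quartiles, max
print(df['Age'].describe())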

Correlation Analysis #

What is Correlation Analysis? #

Correlation analysis measures the strength and direction of the relationship between two variables.

It helps answer:

  • Do variables move together?
  • How strong is the relationship?
  • Is the relationship positive or negative?

Types of Correlation #

Positive Correlation #

When one variable increases, the other increases.

Example:
Study hours vs marks.

Negative Correlation #

When one increases, the other decreases.

Example:
Speed vs travel time.

Zero Correlation #

No linear relationship between the variables.

Example:
Shoe size vs intelligence.

Correlation Coefficient #

The correlation coefficient (r) ranges from −1 to +1:

  • +1 Perfect positive correlation
  • 0 No linear correlation
  • −1 Perfect negative correlation

Interpretation (a common rule of thumb, applied to the absolute value of r):

Value | Meaning
--- | ---
0.9 to 1 | Very strong
0.7 to 0.9 | Strong
0.5 to 0.7 | Moderate
0.3 to 0.5 | Weak
0 to 0.3 | Very weak
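The interpretation table can be turned into a small lookup. This is a sketch only: the function name `describe_strength` is hypothetical, and because the table's ranges share their endpoints, assigning each boundary value to the stronger category is an assumption made here.

```python
def describe_strength(r):
    """Return (strength label, direction) for a correlation coefficient."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength depends on magnitude, not sign
    if size >= 0.9:
        label = "very strong"
    elif size >= 0.7:
        label = "strong"
    elif size >= 0.5:
        label = "moderate"
    elif size >= 0.3:
        label = "weak"
    else:
        label = "very weak"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return label, direction

print(describe_strength(-0.85))  # ('strong', 'negative')
print(describe_strength(0.2))    # ('very weak', 'positive')
```

These thresholds are a convention rather than a rule; what counts as a strong correlation varies by field.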

Correlation Methods #

Pearson Correlation #

Used for linear relationships.

Formula: $r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}$
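To make the formula concrete, here is a minimal sketch that computes r term by term without any library. The function name `pearson_r` and the `hours` and `marks` values are hypothetical, invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed directly from the formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of summed squared deviations.
    denominator = sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator

hours = [1, 2, 3, 4, 5]       # hypothetical study hours
marks = [52, 58, 61, 70, 74]  # hypothetical marks

print(round(pearson_r(hours, marks), 3))  # 0.99
```

The guard on the denominator matters in practice: a column with a single repeated value has no defined correlation with anything.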

Spearman Correlation #

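Used for monotonic relationships (which need not be linear) and for ranked data. It is the Pearson correlation computed on the ranks of the values.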
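The difference from Pearson shows up on data that rises steadily but not in a straight line. The sketch below ranks each variable (tied values share an average rank) and applies the Pearson formula to the ranks; all function names are hypothetical, and the data is invented.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson_r(xs, ys):
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator

def spearman_rho(xs, ys):
    return pearson_r(average_ranks(xs), average_ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # always increasing, but not linear

print(round(pearson_r(xs, ys), 3))  # 0.943
print(spearman_rho(xs, ys))         # 1.0
```

Because `ys` always increases with `xs`, the ranks match exactly and Spearman reports a perfect relationship, while Pearson is lower because the points do not lie on a straight line.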

Kendall Correlation #

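Used for ordinal (ranked) data. It compares how many pairs of observations are ordered the same way in both variables with how many are ordered oppositely.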
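Kendall's coefficient can be sketched by counting pairs. The version below is the simplest form (often called tau-a), which has no correction for ties; library implementations typically use tie-corrected variants, so results may differ when ties are present. The function name and the two judges' rankings are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:    # both variables order this pair the same way
            concordant += 1
        elif product < 0:  # the variables order this pair oppositely
            discordant += 1
        # product == 0 means a tie, which tau-a counts in neither group
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

judge_a = [1, 2, 3, 4, 5]  # hypothetical rankings of five items
judge_b = [2, 1, 3, 5, 4]

print(kendall_tau_a(judge_a, judge_b))  # 0.6
```

Of the 10 possible pairs, 8 are ordered the same way by both judges and 2 are swapped, giving (8 − 2) / 10 = 0.6.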

Visualization Techniques #

Scatter Plot:
Shows relationship.

Heatmap:
Shows correlation matrix.

Python Example #

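import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (assumes data.csv has numeric 'Hours' and 'Marks' columns)
df = pd.read_csv("data.csv")

# Correlation matrix (Pearson by default); numeric_only skips text columns
corr = df.corr(numeric_only=True)

print(corr)

# Heatmap of the matrix, with the color scale fixed to the full range of r
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()

# Scatter plot of one pair of variables
plt.scatter(df['Hours'], df['Marks'])
plt.xlabel("Hours")
plt.ylabel("Marks")
plt.show()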

Outlier Detection #

What are Outliers? #

Outliers are data points significantly different from other observations.

They may occur due to:

  • Measurement error
  • Data entry error
  • Natural variation
  • Fraud
  • Rare events

Example:

Dataset:
10, 12, 11, 13, 100

Here 100 is an outlier.

Why Outlier Detection is Important #

Outliers can:

  • Distort mean
  • Reduce model accuracy
  • Affect regression
  • Mislead analysis
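The first point can be seen on the small dataset from the earlier example (10, 12, 11, 13, 100). A minimal sketch using the standard library:

```python
from statistics import mean, median

data = [10, 12, 11, 13, 100]
without_outlier = [x for x in data if x != 100]

# With the outlier the mean is dragged upward; the median barely moves.
print(mean(data), median(data))                        # 29.2 12
print(mean(without_outlier), median(without_outlier))  # 11.5 11.5
```

A single extreme value more than doubles the mean, while the median shifts only from 11.5 to 12. This is why the median is often preferred as a measure of center when outliers are present.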

Sometimes outliers are important (fraud detection).

Methods for Detecting Outliers #

Z-Score Method #

Formula:

$Z = \frac{x - \mu}{\sigma}$

Rule:

Z Score | Meaning
--- | ---
> 3 | Outlier
< -3 | Outlier
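One caveat is worth seeing in code: the outlier itself inflates the mean and standard deviation, so in very small samples the rule may flag nothing. The sketch below applies the rule to the five-value dataset from earlier and to a larger hypothetical sample; the function name `z_scores` is an assumption.

```python
from statistics import mean, pstdev

def z_scores(data):
    """Z-score of each value, using the population standard deviation."""
    mu = mean(data)
    sigma = pstdev(data)
    if sigma == 0:
        raise ValueError("Z-scores are undefined when all values are equal")
    return [(x - mu) / sigma for x in data]

small = [10, 12, 11, 13, 100]
flagged_small = [x for x, z in zip(small, z_scores(small)) if abs(z) > 3]

# Hypothetical larger sample: the same typical values repeated, plus 100.
larger = [10, 11, 12, 13] * 5 + [100]
flagged_larger = [x for x, z in zip(larger, z_scores(larger)) if abs(z) > 3]

print(flagged_small)   # []
print(flagged_larger)  # [100]
```

In the five-value sample, 100 has a Z-score of only about 2, so the threshold of 3 misses it. With N values, no Z-score computed this way can exceed √(N − 1) in absolute value, which means the rule only becomes useful once the sample is reasonably large.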

IQR Method (Most Common) #

Steps:

Step 1:
Find Q1 (25%)

Step 2:
Find Q3 (75%)

Step 3:
Compute the interquartile range: $IQR = Q3 - Q1$

Step 4:
Compute the bounds.

Lower bound: $Q1 - 1.5 \times IQR$

Upper bound: $Q3 + 1.5 \times IQR$

Values outside these bounds are outliers.
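The four steps can be followed on the small dataset from the earlier example. This is a sketch using the standard library; the function name `iqr_bounds` is hypothetical, and `method="inclusive"` is chosen on the assumption that it should match the linear interpolation pandas uses by default in `quantile`.

```python
from statistics import quantiles

def iqr_bounds(data, k=1.5):
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    # Steps 1 and 2: quartiles (quantiles() sorts the data internally).
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    # Step 3: interquartile range.
    iqr = q3 - q1
    # Step 4: bounds.
    return q1 - k * iqr, q3 + k * iqr

data = [10, 12, 11, 13, 100]
lower, upper = iqr_bounds(data)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper, outliers)  # 8.0 16.0 [100]
```

Here Q1 = 11 and Q3 = 13, so IQR = 2 and the bounds are 8 and 16. Unlike the Z-score rule, this flags 100 even in a five-value sample, because quartiles are barely affected by a single extreme value.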

Visualization Methods #

A box plot is a common choice for spotting outliers visually: points drawn beyond the whiskers fall outside the IQR bounds.

Python Example #

Z Score #

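import numpy as np
import pandas as pd

# Load the dataset (assumes data.csv has a numeric 'Age' column)
df = pd.read_csv("data.csv")

# Population mean and standard deviation (ddof=0) of the column
mean = df['Age'].mean()
std = df['Age'].std(ddof=0)

# Z-score of every row
z_scores = (df['Age'] - mean) / std

# Keep rows whose absolute Z-score exceeds 3
outliers = df[np.abs(z_scores) > 3]

print(outliers)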

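IQR Method #

# First and third quartiles (df is the DataFrame loaded in the Z Score example)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)

# Interquartile range
IQR = Q3 - Q1

# Bounds beyond which a value counts as an outlier
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep rows that fall outside the bounds
outliers = df[(df['Age'] < lower) | (df['Age'] > upper)]

print(outliers)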

Best Practices #

Distribution Analysis #

Always:

  • Check skewness
  • Check missing values
  • Check spread

Correlation Analysis #

Remember:

Correlation does NOT mean causation.

Example:

Ice cream sales and drowning incidents are correlated, but one does not cause the other. Both rise in hot weather, which is the common cause.

Outlier Detection #

Do not always remove outliers.

First check:

  • Is it an error?
  • Is it a real value?
  • Is it important to the analysis?

Distribution analysis helps understand data structure.
Correlation analysis helps understand relationships.
Outlier detection helps clean data.

Together these techniques form the foundation of Exploratory Data Analysis (EDA).

They improve:

  • Data quality
  • Model performance
  • Decision accuracy

Technique | Purpose | Key Methods | Visualization | Benefits
--- | --- | --- | --- | ---
Distribution Analysis | Understand how data is spread | Mean, Median, Standard Deviation, Skewness | Histogram, Boxplot, Density Plot | Understand data behavior
Correlation Analysis | Find relationship between variables | Pearson, Spearman, Kendall | Scatter Plot, Heatmap | Feature selection
Outlier Detection | Identify abnormal values | Z-Score, IQR | Boxplot, Scatter Plot | Improve model accuracy