What is Data Cleaning? #
Data Cleaning is the process of fixing incorrect, incomplete, or messy data to improve its quality before analysis.
Handling Missing Values #
1- What are Missing Values? #
Empty or null values in data (NaN, None, blank cells).
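Before choosing a method, it helps to measure how much data is missing. The sketch below uses a small hypothetical DataFrame (the column names and values are made up for illustration) and counts missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
data = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Cairo", "Giza", None, "Cairo"],
})

# Count missing entries per column. NaN and None both count as missing;
# an empty string "" does not, so blank text may need converting first.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Fraction of rows that contain at least one missing value.
rows_with_missing = data.isna().any(axis=1).mean()
print(f"{rows_with_missing:.0%} of rows have a missing value")
```

A low fraction suggests that dropping rows is cheap, while a high one usually points toward filling instead.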
2- Why it matters #
- Can distort statistics such as averages and counts
- Can lead to incorrect results, or to errors in tools that do not accept null values
Methods to Handle Missing Values #
Remove Missing Data #
data.dropna(inplace=True)
Use this when only a small fraction of rows has missing values, because every row that contains a null is dropped.
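Dropping every incomplete row can be too aggressive. As a rough sketch on hypothetical data, `dropna` can be limited to key columns with `subset`, or relaxed with `thresh` so that rows with enough non-missing values are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
data = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Cairo", "Giza", None, None],
    "salary": [3000, 4200, 3900, np.nan],
})

# Drop a row only when the key column "age" is missing.
has_age = data.dropna(subset=["age"])

# Keep rows that have at least 2 non-missing values.
mostly_complete = data.dropna(thresh=2)

print(len(data), len(has_age), len(mostly_complete))  # prints: 4 2 3
```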
Fill Missing Values #
Numerical Data (Mean):
data["age"] = data["age"].fillna(data["age"].mean())
Categorical Data (Mode):
data["city"] = data["city"].fillna(data["city"].mode()[0])
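The mean and mode fills can be checked on a tiny hypothetical table. This sketch assigns the filled column back to the DataFrame, which avoids relying on `inplace=True` on a selected column (that pattern triggers warnings in recent pandas versions).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
data = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "city": ["Cairo", "Giza", "Cairo", None],
})

# Numerical column: replace missing values with the column mean (30.0 here).
data["age"] = data["age"].fillna(data["age"].mean())

# Categorical column: replace missing values with the most frequent value.
# mode() can return several values on a tie; [0] takes the first of them.
data["city"] = data["city"].fillna(data["city"].mode()[0])

print(data)
```

The mean is sensitive to extreme values, so the fill value is only as trustworthy as the rest of the column.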
Forward Fill / Backward Fill
# Forward fill: copy the last valid value forward
data.ffill(inplace=True)
# Backward fill: copy the next valid value backward
data.bfill(inplace=True)
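Forward and backward fill behave differently at the edges of the data. The sketch below, on a hypothetical series of ordered readings, shows that a forward fill cannot repair a leading gap, and that chaining both fills covers it.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps at the start and in the middle.
readings = pd.Series([np.nan, 10.0, np.nan, np.nan, 13.0])

# Forward fill copies the last valid value forward; the leading NaN remains.
forward = readings.ffill()

# Backward fill copies the next valid value backward.
backward = readings.bfill()

# Chaining both fills also covers the leading gap.
both = readings.ffill().bfill()

print(forward.tolist())
print(backward.tolist())   # [10.0, 10.0, 13.0, 13.0, 13.0]
print(both.tolist())       # [10.0, 10.0, 10.0, 10.0, 13.0]
```

These fills mostly make sense when row order is meaningful, such as time series.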
Duplicate Removal #
1- What are Duplicates? #
Repeated rows in the dataset
2- Why remove them? #
- They bias the analysis by counting the same record more than once
- They increase data size unnecessarily
Detect Duplicates #
print(data.duplicated().sum())
Remove Duplicates
data.drop_duplicates(inplace=True)
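Rows are not always exact copies of each other. As a sketch on hypothetical data, `duplicated` and `drop_duplicates` both accept `subset` to compare only selected columns, and `keep` to choose which occurrence survives.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
data = pd.DataFrame({
    "name": ["Ali", "Sara", "Ali", "Ali"],
    "city": ["Cairo", "Giza", "Cairo", "Alexandria"],
})

# Exact duplicates: row 2 repeats row 0 in every column.
exact_duplicates = data.duplicated().sum()

# Duplicates judged by "name" only: rows 2 and 3 repeat row 0.
same_name = data.duplicated(subset=["name"]).sum()

# Keep only the last occurrence of each name.
deduped = data.drop_duplicates(subset=["name"], keep="last")

print(exact_duplicates, same_name)  # prints: 1 2
print(deduped)
```

Which columns define a duplicate is a judgment about the dataset, so the `subset` choice here is only an assumption.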
Noise Handling #
What is Noise? #
Random errors or unusual values in data (outliers or incorrect entries)
Example:
Age = 150 ❌
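Entries like this can often be caught with a simple range check. The sketch below assumes a plausible age range of 0 to 120, which is an illustrative choice and not a rule from any library, and flags the rows that fall outside it.

```python
import pandas as pd

# Hypothetical sample data containing two implausible ages.
data = pd.DataFrame({"age": [23, 150, 41, -2, 35]})

# Assumed plausible range for a human age; adjust to the dataset.
MIN_AGE, MAX_AGE = 0, 120

# between() is inclusive on both ends by default.
valid = data["age"].between(MIN_AGE, MAX_AGE)
suspicious = data[~valid]

print(suspicious["age"].tolist())  # [150, -2]
```

Flagging first, and deciding later whether to drop or correct, keeps the original values available for review.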
Why handle noise? #
- Improves data accuracy
- Helps in better model performance
Methods to Handle Noise #
Remove Outliers (IQR Method) #
Q1 = data["salary"].quantile(0.25)
Q3 = data["salary"].quantile(0.75)
IQR = Q3 - Q1
data = data[(data["salary"] >= Q1 - 1.5*IQR) &
            (data["salary"] <= Q3 + 1.5*IQR)]
Replace Outliers
upper_limit = data["salary"].quantile(0.95)
data["salary"] = data["salary"].clip(upper=upper_limit)
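The difference between removing and replacing outliers is easiest to see side by side. This sketch uses a small hypothetical salary series with one extreme value: the IQR filter drops the row, while clipping at the 95th percentile keeps the row and caps its value.

```python
import pandas as pd

# Hypothetical salaries with one extreme value.
salary = pd.Series([3000, 3200, 3100, 2900, 3300, 50000])

# IQR fences: values beyond 1.5 * IQR from the quartiles count as outliers.
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Removal: the outlier row disappears, so the series gets shorter.
filtered = salary[(salary >= lower) & (salary <= upper)]

# Replacement: the row stays, but its value is capped at the 95th percentile.
upper_limit = salary.quantile(0.95)
clipped = salary.clip(upper=upper_limit)

print(len(salary), len(filtered), len(clipped))  # prints: 6 5 6
print(clipped.max() == upper_limit)              # prints: True
```

On very small samples the 95th percentile is itself pulled upward by the outlier, so the cap can remain high; the percentile is a tuning choice.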
Smoothing (Simple Method)
# min_periods=1 avoids NaN in the first rows, where the window is not yet full
data["salary"] = data["salary"].rolling(window=3, min_periods=1).mean()
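A rolling mean with `window=3` has no full window for the first two rows, so by default those positions come back as NaN. The sketch below, on hypothetical values, compares the default behavior with `min_periods=1`, which averages whatever values are available so far.

```python
import pandas as pd

# Hypothetical values to smooth.
salary = pd.Series([100.0, 200.0, 300.0, 400.0])

# Default: a result is produced only once 3 values are available.
strict = salary.rolling(window=3).mean()

# min_periods=1: partial windows at the start are averaged as well.
relaxed = salary.rolling(window=3, min_periods=1).mean()

print(strict.tolist())   # first two entries are NaN, then 200.0 and 300.0
print(relaxed.tolist())  # 100.0, 150.0, 200.0, 300.0
```

Smoothing changes every value, not just the noisy ones, so it suits ordered data where local averages are meaningful.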
Complete Example (All Steps)
import pandas as pd
data = pd.read_csv("data.csv")
# Handle missing values
data["age"] = data["age"].fillna(data["age"].mean())
data["city"] = data["city"].fillna(data["city"].mode()[0])
# Remove duplicates
data.drop_duplicates(inplace=True)
# Handle noise (outliers)
Q1 = data["salary"].quantile(0.25)
Q3 = data["salary"].quantile(0.75)
IQR = Q3 - Q1
data = data[(data["salary"] >= Q1 - 1.5*IQR) &
(data["salary"] <= Q3 + 1.5*IQR)]
# Save cleaned data
data.to_csv("clean_data.csv", index=False)

| Concept | Problem | Methods | Example Code |
|---|---|---|---|
| Missing Values | Empty or null data (NaN) | Remove (dropna()), Fill (mean, mode), Forward/Backward fill | data["age"].fillna(data["age"].mean()) |
| Duplicate Removal | Repeated rows in dataset | Detect (duplicated()), Remove (drop_duplicates()) | data.drop_duplicates(inplace=True) |
| Noise Handling | Incorrect or extreme values (outliers) | Remove (IQR), Replace (clip), Smoothing | data["salary"].clip(upper=limit) |
