Data Cleaning

What is Data Cleaning? #

Data Cleaning is the process of fixing incorrect, incomplete, or messy data to improve its quality before analysis.

Handling Missing Values #

1- What are Missing Values? #

Missing values are empty or null entries in the data, such as NaN, None, or blank cells.
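Before choosing how to handle missing values, it helps to count them. The sketch below uses a small made-up DataFrame (the column names and values are illustrative, not from a real dataset) to show one way to do that with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
data = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Paris", "Rome", None, "Paris"],
})

# isna() marks NaN and None as missing; sum() counts them per column.
missing_counts = data.isna().sum()

# mean() of the same boolean mask gives the share of missing values (0 to 1).
missing_share = data.isna().mean()

print(missing_counts)
print(missing_share)
```

Note that isna() does not treat an empty string as missing. Blank cells read from a CSV file by pd.read_csv normally arrive as NaN, so they are counted.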

2- Why it matters #

  • Can distort summary statistics such as means and counts
  • Can lead to incorrect results in later analysis

Methods to Handle Missing Values #

Remove Missing Data #

data.dropna(inplace=True)

Use this when only a small share of rows has missing values, because every row that contains a missing value is dropped.
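Dropping every row with any missing value can discard more data than intended. As a rough sketch on made-up data, dropna() also accepts the subset, thresh, and how parameters to narrow what gets removed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row is missing in every column.
data = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Paris", "Rome", None, None],
    "salary": [50000, 62000, 58000, np.nan],
})

# Drop a row only when its "age" value is missing.
rows_with_age = data.dropna(subset=["age"])

# Keep rows that have at least 2 non-missing values.
mostly_complete = data.dropna(thresh=2)

# Drop a row only when all of its columns are missing.
not_empty = data.dropna(how="all")

print(len(data), len(rows_with_age), len(mostly_complete), len(not_empty))
```

These calls return new DataFrames and leave data unchanged, which makes it easy to compare how many rows each option would remove.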

Fill Missing Values #

Numerical Data (Mean):

# Assign the result back; inplace=True on a selected column may not update data in newer pandas
data["age"] = data["age"].fillna(data["age"].mean())

Categorical Data (Mode):

# mode() returns a Series, so [0] picks the most frequent value
data["city"] = data["city"].fillna(data["city"].mode()[0])
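To see what mean and mode filling actually do, here is a small worked sketch on made-up values. The result is assigned back to the column, which avoids relying on inplace=True on a selected column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age and one missing city.
data = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "city": ["Paris", "Rome", None, "Paris"],
})

# Mean of the known ages: (20 + 30 + 40) / 3 = 30.
data["age"] = data["age"].fillna(data["age"].mean())

# mode() returns a Series because ties are possible; [0] takes the first entry.
data["city"] = data["city"].fillna(data["city"].mode()[0])

print(data)
```

The missing age becomes 30.0 and the missing city becomes "Paris", the most frequent value in that column.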

Forward Fill / Backward Fill

# Forward fill: copy the last known value into the gap
data.ffill(inplace=True)
# Backward fill: copy the next known value into the gap
data.bfill(inplace=True)
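Forward and backward fill behave differently at the edges of the data. The following sketch, on a made-up series of readings, shows which gaps each one can and cannot fill:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps at the start, middle, and end.
readings = pd.Series([np.nan, 10.0, np.nan, np.nan, 13.0, np.nan])

# Forward fill carries the last known value forward.
# The leading gap stays missing because nothing comes before it.
forward = readings.ffill()

# Backward fill pulls the next known value backward.
# The trailing gap stays missing because nothing comes after it.
backward = readings.bfill()

# Chaining both fills every gap as long as at least one value is known.
both = readings.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

These methods make the most sense for ordered data such as time series, where neighboring rows are related.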

Duplicate Removal #

1- What are Duplicates? #

Repeated rows in the dataset

2- Why remove them? #

  • They bias the analysis by counting the same record more than once
  • They increase data size unnecessarily

Detect Duplicates #

print(data.duplicated().sum())

Remove Duplicates

data.drop_duplicates(inplace=True)
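By default, duplicated() and drop_duplicates() compare all columns. As a sketch on made-up data, the subset and keep parameters let you define duplicates by selected columns and choose which copy survives:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical in every column.
data = pd.DataFrame({
    "name": ["Ali", "Sara", "Ali", "Ali"],
    "city": ["Paris", "Rome", "Paris", "Rome"],
})

# Rows that repeat an earlier row in every column.
full_dupes = data.duplicated().sum()

# Rows that repeat an earlier row in the "name" column only.
name_dupes = data.duplicated(subset=["name"]).sum()

# Remove exact duplicates, keeping the first occurrence (the default).
deduped = data.drop_duplicates()

# Keep only the last row for each name.
by_name = data.drop_duplicates(subset=["name"], keep="last")

print(full_dupes, name_dupes)
print(deduped)
print(by_name)
```

Which columns define a duplicate depends on the dataset, so the subset used here is only an assumption for illustration.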

Noise Handling #

What is Noise? #

Random errors or unusual values in data (outliers or incorrect entries)

Example:
Age = 150 ❌

Why handle noise? #

  • Improves data accuracy
  • Helps improve model performance

Methods to Handle Noise #

Remove Outliers (IQR Method) #

Q1 = data["salary"].quantile(0.25)
Q3 = data["salary"].quantile(0.75)
IQR = Q3 - Q1

data = data[(data["salary"] >= Q1 - 1.5*IQR) & 
            (data["salary"] <= Q3 + 1.5*IQR)]
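The IQR rule above can be traced on a tiny made-up salary column. The helper name iqr_bounds is hypothetical and is introduced here only to keep the sketch readable:

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical salaries (in thousands) with one extreme value.
data = pd.DataFrame({"salary": [30, 32, 35, 36, 38, 40, 200]})

# Here Q1 = 33.5 and Q3 = 39, so IQR = 5.5 and the limits are 25.25 and 47.25.
lower, upper = iqr_bounds(data["salary"])

# between() is inclusive on both ends, matching the >= and <= filter above.
cleaned = data[data["salary"].between(lower, upper)]

print(lower, upper)
print(cleaned)
```

Only the value 200 falls outside the limits, so it is the only row removed. The multiplier 1.5 is the conventional choice; a larger k removes fewer rows.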

Replace Outliers

upper_limit = data["salary"].quantile(0.95)
data["salary"] = data["salary"].clip(upper=upper_limit)

Smoothing (Simple Method)

# min_periods=1 avoids NaN in the first two rows, where the window is not yet full
data["salary"] = data["salary"].rolling(window=3, min_periods=1).mean()
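A rolling mean replaces each value with the average of itself and the values just before it. By default a window of 3 needs three values, so the first two results are missing. The sketch below, on made-up numbers, compares the default with min_periods=1:

```python
import pandas as pd

# Hypothetical values with one spike (90).
salary = pd.Series([10.0, 20.0, 90.0, 30.0, 40.0])

# Default: a result is produced only once the window holds 3 values.
strict = salary.rolling(window=3).mean()

# min_periods=1: average whatever values are available so far.
lenient = salary.rolling(window=3, min_periods=1).mean()

print(strict.tolist())
print(lenient.tolist())
```

The spike of 90 is pulled down to 40 in both versions, since (10 + 20 + 90) / 3 = 40. Smoothing assumes the rows are in a meaningful order, such as time.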

Complete Example (All Steps)

import pandas as pd

data = pd.read_csv("data.csv")

# Handle missing values
data["age"] = data["age"].fillna(data["age"].mean())
data["city"] = data["city"].fillna(data["city"].mode()[0])

# Remove duplicates
data.drop_duplicates(inplace=True)

# Handle noise (outliers)
Q1 = data["salary"].quantile(0.25)
Q3 = data["salary"].quantile(0.75)
IQR = Q3 - Q1

data = data[(data["salary"] >= Q1 - 1.5*IQR) & 
            (data["salary"] <= Q3 + 1.5*IQR)]

# Save cleaned data
data.to_csv("clean_data.csv", index=False)
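
The complete example depends on a data.csv file. To try the same three steps without a file, here is a self-contained sketch on made-up data. The function name clean and the sample values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values, drop duplicate rows, then drop salary outliers (IQR)."""
    out = df.copy()  # leave the caller's DataFrame untouched

    # Handle missing values
    out["age"] = out["age"].fillna(out["age"].mean())
    out["city"] = out["city"].fillna(out["city"].mode()[0])

    # Remove duplicates
    out = out.drop_duplicates()

    # Handle noise (outliers) with the IQR rule
    q1 = out["salary"].quantile(0.25)
    q3 = out["salary"].quantile(0.75)
    iqr = q3 - q1
    in_range = out["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[in_range].reset_index(drop=True)


# Hypothetical raw data: missing values, a repeated row, and one extreme salary.
raw = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 30.0, 40.0, 35.0],
    "city": ["Paris", "Rome", None, None, "Paris", "Paris"],
    "salary": [30, 32, 35, 35, 38, 500],
})

cleaned = clean(raw)
print(cleaned)
```

The order of the steps matters here: the two rows with a missing city only become exact duplicates after filling, so duplicates are removed after the fill step.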

| Concept | Problem | Methods | Example Code |
| --- | --- | --- | --- |
| Missing Values | Empty or null data (NaN) | Remove (dropna()), Fill (mean, mode), Forward/Backward fill | data["age"].fillna(data["age"].mean()) |
| Duplicate Removal | Repeated rows in dataset | Detect (duplicated()), Remove (drop_duplicates()) | data.drop_duplicates(inplace=True) |
| Noise Handling | Incorrect or extreme values (outliers) | Remove (IQR), Replace (clip), Smoothing | data["salary"].clip(upper=limit) |