What is Data Cleaning?
Data Cleaning is the process of fixing or removing incorrect, incomplete, or inconsistent data to make it ready for analysis.
- Improves data accuracy
- Ensures reliable analysis
- Removes errors and inconsistencies
- Helps in building better models
Common Data Issues
- Missing values (empty cells)
- Duplicate records
- Incorrect data types
- Outliers (extreme values)
- Inconsistent formatting (e.g., “USA” vs “usa”)
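Each of these issues can usually be spotted with a quick inspection before any cleaning is done. The sketch below is one possible way to do that with pandas; the sample table and its column names (`name`, `age`, `country`) are made up for illustration and are not part of any real dataset.

```python
import pandas as pd

# Hypothetical sample data that shows several issues from the list above
data = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "age": ["34", "29", "29", None],          # numbers stored as text, one value missing
    "country": ["USA", "usa", "usa", "India"],
})

missing_per_column = data.isnull().sum()          # count of missing values in each column
duplicate_rows = data.duplicated().sum()          # rows identical to an earlier row
column_types = data.dtypes                        # a text dtype for "age" hints at a wrong type
country_counts = data["country"].value_counts()   # shows "USA" and "usa" as separate values

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(column_types)
print(country_counts)
```

Outliers are not covered by this check; they typically need a numeric column and a rule for what counts as extreme.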
Steps in Data Cleaning
- Remove or fill missing values
- Remove duplicates
- Convert data types
- Standardize formats
- Handle outliers
Basic Python Example
Step 1: Load Data
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
Step 2: Handle Missing Values
# Check missing values
print(data.isnull().sum())
# Fill missing values in "age" with the column mean
data["age"] = data["age"].fillna(data["age"].mean())
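Filling with the mean is only one way to handle missing values; the steps list also mentions removing them. The sketch below shows both options on a small made-up table (the data and the choice of the median are assumptions for illustration, not something the example dataset prescribes).

```python
import pandas as pd

# Hypothetical data: one missing age and one missing country
data = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "country": ["usa", "india", None, "usa"],
})

# Option 1: drop rows where a required column is missing
dropped = data.dropna(subset=["country"])

# Option 2: fill a numeric column with its median,
# which is less affected by extreme values than the mean
filled = data.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())

print(dropped)
print(filled)
```

Dropping rows is simple but loses data, so it tends to suit cases where only a few rows are affected.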
Step 3: Remove Duplicates
data.drop_duplicates(inplace=True)
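By default `drop_duplicates` only removes rows that match in every column. When records should be unique by a key, the `subset` and `keep` parameters can help. The following is a minimal sketch; the `customer_id` and `updated` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: customer 2 appears twice with different update dates
data = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "updated": ["2024-01-01", "2024-01-05", "2024-02-10", "2024-01-07"],
})

# Count rows that repeat an earlier customer_id
n_dupes = data.duplicated(subset=["customer_id"]).sum()

# Keep only the last row (in row order) for each customer_id
deduped = data.drop_duplicates(subset=["customer_id"], keep="last").reset_index(drop=True)

print("Duplicates by customer_id:", n_dupes)
print(deduped)
```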
Step 4: Fix Data Types
# Cast age to integers; run this after filling missing values, since NaN cannot be cast to int
data["age"] = data["age"].astype(int)
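A direct `astype(int)` raises an error if the column contains text that is not a number. One way around this, sketched below on made-up data, is to convert with `pd.to_numeric` and `errors="coerce"` first, so unparseable entries become missing values that can then be filled. Filling with the median here is an assumption for illustration.

```python
import pandas as pd

# Hypothetical column where ages were read as text and one entry is not a number
data = pd.DataFrame({"age": ["34", "29", "unknown", "41"]})

# Convert to numbers; entries that cannot be parsed become NaN instead of raising an error
data["age"] = pd.to_numeric(data["age"], errors="coerce")

# Fill the resulting gap with the median, then cast to integers
data["age"] = data["age"].fillna(data["age"].median()).astype(int)

print(data)
```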
Step 5: Standardize Data
data["country"] = data["country"].str.lower()
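The remaining step from the list, handling outliers, is not covered in the walkthrough. One common approach is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR below the first quartile or above the third. This is an assumed choice for illustration, since no specific method is named here, and the sample ages are made up.

```python
import pandas as pd

# Hypothetical ages with one extreme value
data = pd.DataFrame({"age": [22, 25, 27, 30, 31, 33, 35, 120]})

# Compute the quartiles and the interquartile range
q1 = data["age"].quantile(0.25)
q3 = data["age"].quantile(0.75)
iqr = q3 - q1

# Values outside these bounds are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data["age"] < lower) | (data["age"] > upper)]
cleaned = data[data["age"].between(lower, upper)]

print(outliers)
print(cleaned)
```

Whether to remove, cap, or keep such values depends on the data; an extreme value can be a genuine observation rather than an error.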

