Data Cleaning

What is Data Cleaning?

Data Cleaning is the process of fixing or removing incorrect, incomplete, or inconsistent data to make it ready for analysis.

  • Improves data accuracy
  • Ensures reliable analysis
  • Removes errors and inconsistencies
  • Helps in building better models

Common Data Issues

  • Missing values (empty cells)
  • Duplicate records
  • Incorrect data types
  • Outliers (extreme values)
  • Inconsistent formatting (e.g., “USA” vs “usa”)
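As a rough illustration, most of the issues above can be spotted with a few pandas calls. This is a minimal sketch on a made-up table; the column names and values are assumptions for illustration, not data from this article.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age": ["25", "31", "31", None, "240"],
    "country": ["USA", "usa", "usa", "India", "india"],
})

# Missing values: count of empty cells per column
missing_per_column = df.isnull().sum()

# Duplicate records: rows identical to an earlier row
duplicate_rows = int(df.duplicated().sum())

# Incorrect data types: "age" was read as text, not as a number
age_is_numeric = pd.api.types.is_numeric_dtype(df["age"])

# Inconsistent formatting: distinct spellings before and after lowercasing
raw_country_values = df["country"].nunique()
normalized_country_values = df["country"].str.lower().nunique()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("age is numeric:", age_is_numeric)
print("country spellings:", raw_country_values, "->", normalized_country_values)
```

If the count of distinct values drops after lowercasing, the column mixes spellings of the same value.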

Steps in Data Cleaning

  • Remove or fill missing values
  • Remove duplicates
  • Convert data types
  • Standardize formats
  • Handle outliers

Basic Python Example

Step 1: Load Data

import pandas as pd

data = pd.read_csv("data.csv")
print(data.head())  # preview the first five rows
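If no data.csv file is at hand, the loading step can be tried on an in-memory CSV. The sketch below assumes a tiny made-up file with name, age, and country columns; the contents are hypothetical and only serve to make the example runnable.

```python
import io
import pandas as pd

# Hypothetical CSV contents, standing in for a real data.csv
csv_text = """name,age,country
Ann,25,USA
Bob,,usa
Bob,,usa
"""

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows
print(sample.shape)    # (rows, columns)
sample.info()          # column types and non-null counts
```

Empty fields in the CSV are read as missing values, which is why the age column shows fewer non-null entries than rows.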

Step 2: Handle Missing Values

# Check missing values
print(data.isnull().sum())

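# Fill missing ages with the column mean
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas)
data["age"] = data["age"].fillna(data["age"].mean())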
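Filling with the mean is only one option. The sketch below compares it with filling by the median and with dropping incomplete rows, using made-up ages; which choice is appropriate depends on the data and is not prescribed here.

```python
import pandas as pd

# Hypothetical ages with one missing value and one extreme value
df = pd.DataFrame({"age": [22.0, 25.0, None, 30.0, 95.0]})

# Option 1: fill with the mean (pulled upward by the extreme value)
mean_filled = df["age"].fillna(df["age"].mean())

# Option 2: fill with the median (less sensitive to extreme values)
median_filled = df["age"].fillna(df["age"].median())

# Option 3: drop rows where "age" is missing
dropped = df.dropna(subset=["age"])

print(mean_filled.tolist())
print(median_filled.tolist())
print(len(dropped), "rows kept after dropping")
```

The median is often the safer default when a column contains outliers, since a single extreme value can shift the mean noticeably.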

Step 3: Remove Duplicates

data.drop_duplicates(inplace=True)
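By default, drop_duplicates only removes rows that match in every column. When records should be unique by a key, the subset and keep parameters can be used. A minimal sketch, assuming a hypothetical customer_id column:

```python
import pandas as pd

# Hypothetical records: customer 2 appears twice with different spellings
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "country": ["usa", "india", "India", "uk"],
})

# No row is an exact copy of another, so nothing is removed here
exact = df.drop_duplicates()

# Treat rows with the same customer_id as duplicates and keep the first one
by_key = df.drop_duplicates(subset=["customer_id"], keep="first").reset_index(drop=True)

print(len(exact), "rows after exact de-duplication")
print(by_key)
```

Standardizing text columns before de-duplicating can also help, because "india" and "India" would then count as the same value.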

Step 4: Fix Data Types

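# Round first: the mean used in Step 2 may be fractional, and astype(int) truncates.
# Note that astype(int) raises an error if any missing values remain.
data["age"] = data["age"].round().astype(int)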
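When a numeric column arrives as text and contains entries that are not numbers, a direct astype(int) fails. One possible approach is pd.to_numeric with errors="coerce", sketched below on made-up values.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, plus bad and missing entries
raw = pd.Series(["25", "31", "unknown", None, "40"], dtype=object)

# Unparseable entries become NaN instead of raising an error
ages = pd.to_numeric(raw, errors="coerce")

# Nullable integer type keeps whole numbers and still allows missing values
nullable_ages = ages.astype("Int64")

print(ages.tolist())
print(nullable_ages)
```

The coerced NaN values can then be filled or dropped as in Step 2.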

Step 5: Standardize Data

# Trim stray whitespace and lowercase so "USA", " usa" and "usa" all match
data["country"] = data["country"].str.strip().str.lower()
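The step list also mentions handling outliers, which the example above does not cover. One common rule of thumb is the interquartile range (IQR) rule, sketched here on made-up ages. The 1.5 multiplier is a convention, not a requirement, and whether to remove or cap outliers is a judgment call.

```python
import pandas as pd

# Hypothetical ages with one implausible value
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 240])

# Interquartile range: spread of the middle 50% of the values
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

outliers = ages[is_outlier]                     # flagged values
filtered = ages[~is_outlier]                    # option 1: remove them
clipped = ages.clip(lower=lower, upper=upper)   # option 2: cap them at the bounds

print("bounds:", lower, upper)
print("outliers:", outliers.tolist())
print("clipped max:", clipped.max())
```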
