EDA basics

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the foundational step in any data science project. It helps us understand the structure, quality, and patterns in the data before applying any modeling or advanced analysis. In this tutorial, we’ll focus on two core components:

  1. Understanding the Dataset
  2. Summary Statistics

We’ll use Python and pandas for demonstration.

Import Libraries and Load Dataset #

# Import necessary libraries
import pandas as pd
import numpy as np

# Load dataset (example: Titanic dataset)
data = pd.read_csv("titanic.csv")

# Display first few rows
print(data.head())
  • pd.read_csv(): Reads a CSV file into a pandas DataFrame.
  • data.head(): Displays the first 5 rows to get an initial look.
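
If titanic.csv is not at hand, the same two calls can be tried on a small inline sample. The sketch below is only an illustration: the four rows are made up, and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical sample standing in for titanic.csv (values are made up)
csv_text = """PassengerId,Sex,Age,Fare
1,male,22,7.25
2,female,38,71.28
3,female,,7.92
4,male,35,8.05
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; tail(n) returns the last n
first_two = sample.head(2)
last_one = sample.tail(1)
print(first_two)
print(last_one)
```

Note that the empty Age field in the third row is read as a missing value (NaN).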

Understanding the Dataset #

Understanding the dataset means knowing what each column represents, which data type it holds, and how large and complete the dataset is.

Check Dataset Shape #

print("Number of rows and columns:", data.shape)
  • Output example: (891, 12) → 891 rows, 12 columns.
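
Since shape is a plain (rows, columns) tuple, it can be unpacked into two variables. A minimal sketch on a hypothetical three-row DataFrame (df and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy DataFrame, not the real Titanic data
df = pd.DataFrame({"Age": [22, 38, 26], "Sex": ["male", "female", "female"]})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```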

Check Column Names #

print("Columns in dataset:", data.columns.tolist())

Data Types

print(data.dtypes)
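  • int64 → Whole numbers (integers)
  • float64 → Decimal numbers; integer columns that contain missing values are also read as float64
  • object → Usually text (string) data, often treated as categorical; can also hold mixed types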
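
One way to act on the dtypes is to split the columns into numerical and non-numerical groups with select_dtypes. This is a sketch on a hypothetical three-column DataFrame; the column names mirror the Titanic dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical toy DataFrame with made-up values
df = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# "number" matches every numeric dtype (int64, float64, ...)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric (here: the text column)
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Non-numerical:", other_cols)
```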

Quick Summary #

# info() prints its report directly and returns None, so no print() is needed
data.info()

Missing Values

print(data.isnull().sum())
  • Helps identify which columns have missing values.
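
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a hypothetical four-row DataFrame with made-up values: the mean of the boolean isnull() mask gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with made-up values
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = df.isnull().sum()        # number of missing values per column
missing_pct = df.isnull().mean() * 100   # share of missing values, in percent

# Combine both into one report, worst columns first
report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
report = report.sort_values("percent", ascending=False)
print(report)
```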

Summary Statistics #

Summary statistics give numerical insight into data distributions and central tendencies.

Descriptive Statistics for Numerical Columns #

print(data.describe())
  • Includes:
    • count → Non-null values
    • mean → Average
    • std → Standard deviation
    • min / max → Range
    • 25%, 50%, 75% → Quartiles
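
The result of describe() is itself a pandas object, so individual statistics can be read by label. The sketch below uses a hypothetical five-value Fare series (made-up numbers) and compares the mean with the median (the 50% row); a mean far above the median hints at a right-skewed column.

```python
import pandas as pd

# Hypothetical fare values; the last one is deliberately extreme
fares = pd.Series([7.25, 8.05, 7.92, 13.0, 512.33], name="Fare")

stats = fares.describe()
mean_fare = stats["mean"]
median_fare = stats["50%"]   # the 50% quartile is the median

print(stats)
print("Mean above median:", mean_fare > median_fare)
```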

Descriptive Statistics for Categorical Columns #

print(data.describe(include='object'))
  • Includes:
    • count → Non-null values
    • unique → Number of unique categories
    • top → Most frequent category
    • freq → Frequency of top category

Value Counts (for individual categorical columns) #

print(data['Sex'].value_counts())

Shows how many males and females are present (example for Titanic dataset).
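
Two optional arguments of value_counts() are often useful during EDA: normalize=True returns proportions instead of counts, and dropna=False keeps missing values as their own category. A sketch on a hypothetical Embarked series with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration; one entry is missing
embarked = pd.Series(["S", "C", "S", np.nan, "S"], name="Embarked")

# Counts per category, with missing values counted as well
counts = embarked.value_counts(dropna=False)

# Proportions per category (missing values are excluded by default)
shares = embarked.value_counts(normalize=True)

print(counts)
print(shares)
```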

Visual Check (Optional but Recommended) #

Even simple plots can reveal data patterns and outliers.

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram for numerical column
sns.histplot(data['Age'], bins=20, kde=True)
plt.title("Age Distribution")
plt.show()

# Count plot for categorical column
sns.countplot(x='Sex', data=data)
plt.title("Gender Count")
plt.show()

Observations and Insights #

From EDA, you should document insights such as:

  • Missing values (Age column has 177 missing values)
  • Column types (categorical vs numerical)
  • Data distribution (Age is right-skewed)
  • Potential outliers (Fare column has very high values)
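
Observations about skewness and outliers can be backed with numbers. The sketch below is one common approach, not the only one: it uses skew() for the direction of skew and the 1.5 × IQR rule to flag unusually high or low values. The fare values are hypothetical.

```python
import pandas as pd

# Hypothetical fare values; the last one is deliberately extreme
fare = pd.Series([7.25, 8.05, 7.92, 13.0, 26.55, 512.33], name="Fare")

# Positive skewness indicates a longer tail on the right
skewness = fare.skew()

# 1.5 * IQR rule: values far outside the middle 50% are flagged
q1 = fare.quantile(0.25)
q3 = fare.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fare[(fare < lower) | (fare > upper)]

print("Skewness:", skewness)
print("Flagged values:", outliers.tolist())
```

Flagged values are candidates for a closer look, not automatically errors; a very high fare may be a genuine first-class ticket.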

| EDA Step | Command / Function | Purpose / Explanation |
|---|---|---|
| Load Data | pd.read_csv("file.csv") | Load CSV file into a pandas DataFrame |
| View Data | data.head() | Display first 5 rows of the dataset |
| Check Shape | data.shape | Show number of rows and columns |
| List Columns | data.columns | List all column names |
| Data Types | data.dtypes | Show data type of each column (int, float, object) |
| Dataset Info | data.info() | Overview of dataset: non-null counts, types, memory usage |
| Missing Values | data.isnull().sum() | Count missing values per column |
| Summary Stats (Numerical) | data.describe() | Count, mean, std, min, max, quartiles |
| Summary Stats (Categorical) | data.describe(include='object') | Count, unique, top, frequency of categories |
| Value Counts | data['column'].value_counts() | Frequency of each category in a column |
| Histogram (Numerical) | sns.histplot(data['column']) | Visualize distribution of a numerical column |
| Count Plot (Categorical) | sns.countplot(x='column', data=data) | Visualize frequency of categorical column |
| Observations | Manual review | Identify missing values, skewness, outliers |