Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the foundational step in any data science project. It helps us understand the structure, quality, and patterns in the data before applying any modeling or advanced analysis. In this tutorial, we’ll focus on two core components:
- Understanding the Dataset
- Summary Statistics
We’ll use Python and pandas for demonstration.
Import Libraries and Load Dataset #
# Import necessary libraries
import pandas as pd
import numpy as np
# Load dataset (example: Titanic dataset)
data = pd.read_csv("titanic.csv")
# Display first few rows
print(data.head())
- pd.read_csv(): Reads a CSV file into a pandas DataFrame.
- data.head(): Displays the first 5 rows to get an initial look.
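If titanic.csv is not at hand, the same two calls can be tried on an in-memory CSV. The sketch below is only an illustration: the three rows are made-up values with Titanic-like column names, and `csv_text` and `sample` are hypothetical names, not part of the tutorial's dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for titanic.csv (values are made up).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.93
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; without an argument it returns 5.
print(sample.head(2))
```

The empty Age field in the last row is read as NaN, which is how missing values typically show up after loading a CSV.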
Understanding the Dataset #
Understanding the dataset means knowing what each column represents, which data type it holds, and how complete it is.
Check Dataset Shape #
print("Number of rows and columns:", data.shape)
- Output example: (891, 12) → 891 rows, 12 columns.
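Because shape is a plain (rows, columns) tuple, it can be unpacked into two variables. A minimal sketch on a made-up toy frame (`toy`, `n_rows`, and `n_cols` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# shape is a tuple, so it unpacks directly into row and column counts.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")  # prints: 3 rows, 2 columns
```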
Check Column Names #
print("Columns in dataset:", data.columns.tolist())
Data Types
print(data.dtypes)
- int64 → Numerical integer
- float64 → Numerical decimal
- object → Text (string) or mixed-type data, often used for categorical columns
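Once the types are known, it is often handy to split the columns into numerical and non-numerical groups. One possible way to do this, sketched on a made-up toy frame, is select_dtypes (`numeric_cols` and `categorical_cols` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame with two numerical columns and one text column.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.93],
    "Sex": ["male", "female", "female"],
})

# "number" matches every numeric dtype (int64, float64, ...).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Treating every non-numeric column as categorical is a simplifying assumption; date or boolean columns may need their own handling.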
Quick Summary #
data.info()  # info() prints its summary directly and returns None, so no print() is needed
Missing Values
print(data.isnull().sum())
- Helps identify which columns have missing values.
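Raw counts are easier to judge next to the share of rows they represent. The sketch below, on a made-up toy frame, puts both side by side (`missing_count`, `missing_pct`, and `report` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives a True/False mask; sum() counts the True values per column.
missing_count = toy.isnull().sum()

# The mean of a True/False mask is the fraction of missing rows.
missing_pct = toy.isnull().mean() * 100

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report.sort_values("percent", ascending=False))
```

Sorting by percentage puts the most incomplete columns first, which helps when deciding whether to fill or drop them.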
Summary Statistics #
Summary statistics give numerical insight into data distributions and central tendencies.
Descriptive Statistics for Numerical Columns #
print(data.describe())
- Includes:
- count → Non-null values
- mean → Average
- std → Standard deviation
- min / max → Range
- 25%, 50%, 75% → Quartiles
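Each entry of describe() corresponds to a separate pandas method, which is useful when only one statistic is needed. A small sketch on made-up fare values (`fares` and `summary` are illustrative names):

```python
import pandas as pd

# Hypothetical fare values, made up for illustration.
fares = pd.Series([7.25, 7.93, 8.05, 53.10, 71.28], name="Fare")

summary = fares.describe()
print(summary)

# The same numbers are available one at a time.
print("mean:", fares.mean())
print("median (the 50% row):", fares.median())
print("sample std (ddof=1):", fares.std())
print("first quartile (the 25% row):", fares.quantile(0.25))
```

Note that std is the sample standard deviation (it divides by n - 1), and that a mean far above the median, as in this toy series, is a quick hint of right skew.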
Descriptive Statistics for Categorical Columns #
print(data.describe(include='object'))
- Includes:
- count → Non-null values
- unique → Number of unique categories
- top → Most frequent category
- freq → Frequency of top category
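To see how these four values relate to the raw data, here is a small sketch on a single made-up text column (`toy` and `cat_summary` are illustrative names):

```python
import pandas as pd

# Hypothetical embarkation codes with one missing entry (values are made up).
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", None]})

# On a text column, describe() reports count, unique, top and freq.
cat_summary = toy["Embarked"].describe()
print(cat_summary)
```

Here count is 5 because the missing entry is ignored, unique is 3 (S, C, Q), and top is S with freq 3.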
Value Counts (for individual categorical columns) #
print(data['Sex'].value_counts())
Shows how many males and females are present (example for Titanic dataset).
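Two optional arguments of value_counts() are worth knowing: normalize=True returns proportions instead of counts, and dropna=False keeps missing values as their own category. A sketch on made-up values (`sex`, `counts`, `shares`, and `with_missing` are illustrative names):

```python
import pandas as pd

# Hypothetical column with one missing entry (values are made up).
sex = pd.Series(["male", "female", "male", "male", None], name="Sex")

# Default: absolute counts, missing values ignored.
counts = sex.value_counts()

# Proportions of the non-missing values.
shares = sex.value_counts(normalize=True)

# Keep missing values as a separate category.
with_missing = sex.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```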
Visual Check (Optional but Recommended) #
Even simple plots can reveal data patterns and outliers.
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram for numerical column
sns.histplot(data['Age'], bins=20, kde=True)
plt.title("Age Distribution")
plt.show()
# Count plot for categorical column
sns.countplot(x='Sex', data=data)
plt.title("Gender Count")
plt.show()
Observations and Insights #
From EDA, you should document insights such as:
- Missing values (Age column has 177 missing values)
- Column types (categorical vs numerical)
- Data distribution (Age is right-skewed)
- Potential outliers (Fare column has very high values)
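Observations about skewness and outliers can be backed with numbers. The sketch below uses the common 1.5 × IQR rule of thumb, which is an assumption chosen for illustration and not a method prescribed by this tutorial; the fare values are made up, and `fares`, `lower`, `upper`, and `outliers` are illustrative names.

```python
import pandas as pd

# Hypothetical fare values with one very large entry (values are made up).
fares = pd.Series([7.25, 7.93, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")

# Positive skewness suggests a long right tail.
print("skewness:", fares.skew())

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]
print("potential outliers:", outliers.tolist())
```

Flagged values are candidates for a closer look, not automatic errors; a very high fare may be a genuine first-class ticket.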
| EDA Step | Command / Function | Purpose / Explanation |
|---|---|---|
| Load Data | pd.read_csv("file.csv") | Load CSV file into a pandas DataFrame |
| View Data | data.head() | Display first 5 rows of the dataset |
| Check Shape | data.shape | Show number of rows and columns |
| List Columns | data.columns | List all column names |
| Data Types | data.dtypes | Show data type of each column (int, float, object) |
| Dataset Info | data.info() | Overview of dataset: non-null counts, types, memory usage |
| Missing Values | data.isnull().sum() | Count missing values per column |
| Summary Stats (Numerical) | data.describe() | Count, mean, std, min, max, quartiles |
| Summary Stats (Categorical) | data.describe(include='object') | Count, unique, top, frequency of categories |
| Value Counts | data['column'].value_counts() | Frequency of each category in a column |
| Histogram (Numerical) | sns.histplot(data['column']) | Visualize distribution of a numerical column |
| Count Plot (Categorical) | sns.countplot(x='column', data=data) | Visualize frequency of categorical column |
| Observations | Manual review | Identify missing values, skewness, and outliers |

