Types of Data Analysis

Think of data analysis like solving a mystery. When you walk into a crime scene (your dataset), you first gather facts—that’s Descriptive. Then you dig into clues to understand motives—Diagnostic. After that, you make an educated guess about what might happen next—Predictive. Finally, you decide the best course of action—Prescriptive. Every business problem, from reducing customer churn to optimizing a supply chain, uses this ladder. Let’s climb it step by step, and I’ll show you code that brings each step to life.

The Four Flavors, One Ladder

To make this stick, here’s a quick bird’s-eye view:

Type	Core Question	Example Tool	Business Action
Descriptive	What happened?	Summary stats, dashboards	Monthly sales dashboards
Diagnostic	Why did it happen?	Correlation, drill-down	Root-cause analysis of a drop in website traffic
Predictive	What will happen?	Regression, time series	Forecast next quarter’s demand
Prescriptive	What should we do?	Optimization, decision models	Recommend discount levels that maximize profit

Notice how each type answers a deeper question than the last. And here’s the secret: you rarely use them in isolation. In a real project, you flow from one to the next, building insight upon insight.

Descriptive Analysis

Descriptive analysis is exactly what it sounds like—it describes your data. It uses measures of central tendency (mean, median), spread (standard deviation, percentiles), and visual summaries like histograms and box plots. No judgments, no predictions—just the pure, unvarnished truth of the past.

Let’s work with a tiny sales dataset so you can see it in action. I’ll simulate a month of daily revenue for an online store.

import pandas as pd
import numpy as np

# Create a simple sales DataFrame
np.random.seed(42)
dates = pd.date_range('2026-04-01', periods=30, freq='D')
revenue = np.random.normal(loc=500, scale=80, size=30).round(2)  # mean $500, std 80
df = pd.DataFrame({'date': dates, 'revenue': revenue})

# Descriptive stats at a glance
desc = df['revenue'].describe()
print(desc)

Illustrative output (the exact figures you get from the seeded draw above may differ):

count     30.000000
mean     503.401667
std       78.846578
min      332.160000
25%      446.132500
50%      501.080000
75%      558.152500
max      697.630000

From these eight numbers, you instantly know the average daily revenue ($503), the middle-of-the-road day ($501 median), and the spread. You can spot that the worst day brought in $332 and the best nearly $698. No complex statistics degree needed. When a stakeholder asks, “How did we do last month?” this is your answer.
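The histogram and box plot mentioned earlier take only a few more lines. Here is a minimal sketch, assuming the same simulated revenue column (rebuilt here so the block runs on its own). The bin count and the 1.5 x IQR rule for flagging unusual days are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the simulated month of revenue so this block runs on its own.
np.random.seed(42)
df = pd.DataFrame({
    "date": pd.date_range("2026-04-01", periods=30, freq="D"),
    "revenue": np.random.normal(loc=500, scale=80, size=30).round(2),
})

# Interquartile range: the width of the middle 50% of days.
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences for flagging unusual days (a rule of thumb).
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
unusual_days = df[(df["revenue"] < lower_fence) | (df["revenue"] > upper_fence)]
print(f"IQR: {iqr:.2f}, unusual days: {len(unusual_days)}")

# The histogram shows the shape of the distribution;
# the box plot shows the median, quartiles, and any outliers.
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 3))
ax_hist.hist(df["revenue"], bins=8, color="#1B3A5C", alpha=0.8)
ax_hist.set_xlabel("Daily revenue ($)")
ax_hist.set_ylabel("Number of days")
ax_box.boxplot(df["revenue"])
ax_box.set_ylabel("Daily revenue ($)")
fig.tight_layout()
# In a script, call plt.show() here; a notebook renders the figure automatically.
```

The IQR is a useful companion to the standard deviation because a single extreme day barely moves it.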

Diagnostic Analysis

Descriptive told you what; diagnostic tells you why. It digs into relationships, anomalies, and patterns. The most common tool is correlation—does a rise in one thing tend to go with a rise (or fall) in another? But remember: correlation is not causation. Diagnostic analysis often requires domain knowledge to separate coincidence from true cause.

Let’s add some more columns to our dataset to simulate a real diagnostic scenario. Suppose we also recorded marketing spend and website visits each day.

# Add extra columns for diagnostic analysis
df['marketing_spend'] = np.random.normal(150, 30, 30).round(2)
df['website_visits'] = (revenue * 0.2 + np.random.normal(0, 20, 30)).round(0)
# Compute correlation matrix
corr_matrix = df[['revenue', 'marketing_spend', 'website_visits']].corr()
print(corr_matrix.round(2))

You might see something like:

                 revenue  marketing_spend  website_visits
revenue             1.00              0.15            0.71
marketing_spend     0.15              1.00            0.08
website_visits      0.71              0.08            1.00

Revenue and website visits have a strong positive correlation (0.71). That’s a clue—maybe more visitors drive more sales. Marketing spend shows a very weak correlation with revenue (0.15) in this simulation, which itself is diagnostic gold: either our marketing channel isn’t effective, or there’s a time lag we’re not measuring. These insights guide you to ask smarter questions and run further analysis (maybe a lagged correlation or an experiment).
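One way to follow up on the time-lag idea is to correlate today's revenue with marketing spend from one or more days earlier. Here is a minimal sketch, assuming the same simulated columns (rebuilt here so it runs alone). The helper name `lagged_correlation` and the 0 to 5 day range are hypothetical choices for illustration. Because the simulated spend is pure noise, no lag should stand out here; on real data, a peak at some lag would be the clue worth chasing.

```python
import numpy as np
import pandas as pd

# Rebuild the two simulated columns needed for this check.
np.random.seed(42)
revenue = np.random.normal(loc=500, scale=80, size=30).round(2)
df = pd.DataFrame({
    "revenue": revenue,
    "marketing_spend": np.random.normal(150, 30, 30).round(2),
})

def lagged_correlation(data, driver, target, max_lag):
    """Correlate target on day t with driver on day t - lag, for lag = 0..max_lag."""
    results = {}
    for lag in range(max_lag + 1):
        # shift(lag) moves the driver forward in time, so row t holds the value
        # from day t - lag. Series.corr ignores the NaN rows the shift creates.
        results[lag] = data[target].corr(data[driver].shift(lag))
    return pd.Series(results, name=f"corr({target}, lagged {driver})")

lag_corr = lagged_correlation(df, "marketing_spend", "revenue", max_lag=5)
print(lag_corr.round(2))
```

With only 30 days, each extra lag removes a row from the comparison, so correlations at longer lags rest on less data and should be read with caution.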

Predictive Analysis

Here’s where you start using the past to forecast the future. Predictive analytics uses statistical models—like linear regression, time series forecasting, or machine learning—to anticipate what’s coming. Don’t let the word “predictive” intimidate you; with a few lines of Python, you can get a meaningful forecast.

Let’s use simple linear regression to predict tomorrow’s revenue based on website visits.

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

X = df[['website_visits']]
y = df['revenue']
model = LinearRegression()
model.fit(X, y)

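# Predict revenue for a day with, say, 120 website visits.
# 120 sits inside the range of simulated visits; a value like 1200 would be far
# outside it, and a linear fit cannot be trusted that far from the data.
# Passing a DataFrame with the training column name avoids scikit-learn's
# feature-name warning.
new_day = pd.DataFrame({'website_visits': [120]})
predicted_revenue = model.predict(new_day)
print(f"Predicted revenue for 120 visits: ${predicted_revenue[0]:.2f}")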

# Optional: quick scatter plot (you can generate your own image from this)
plt.figure(figsize=(7,3))
plt.scatter(df['website_visits'], df['revenue'], color='#1B3A5C', alpha=0.7)
plt.plot(df['website_visits'], model.predict(X), color='#A4D8F0', linewidth=2)
plt.xlabel('Website Visits'); plt.ylabel('Revenue')
plt.title('Revenue Prediction from Visits')
plt.tight_layout()
plt.show()

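The model learns the relationship and spits out a number. In the real world, you’d validate this on unseen data, but the concept is straightforward: you’re using a pattern fitted to past data to estimate a future value. Keep in mind that the output is a point estimate with no uncertainty attached; a held-out error metric or a prediction interval tells you how far to trust it. Predictive models are the backbone of demand forecasting, risk scoring, and inventory planning.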
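Validating on unseen data can be as simple as holding back some days when fitting. Here is a minimal sketch, assuming simulated revenue and visits columns like the ones above (rebuilt here in simplified form, so the exact values differ). The 70/30 split and the choice of mean absolute error are illustrative assumptions, and with only 30 rows the scores will be noisy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Rebuild a small revenue/visits dataset so this block runs on its own.
np.random.seed(42)
revenue = np.random.normal(loc=500, scale=80, size=30).round(2)
df = pd.DataFrame({"revenue": revenue})
df["website_visits"] = (revenue * 0.2 + np.random.normal(0, 20, 30)).round(0)

X = df[["website_visits"]]
y = df["revenue"]

# Hold back 30% of the days; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MAE is in dollars; R^2 compares the model against always predicting the mean.
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Held-out MAE: ${mae:.2f}, R^2: {r2:.2f}")
```

For genuinely time-ordered data, a chronological split (train on earlier days, test on later ones) is usually the safer choice, since a random split lets the model peek at the future.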

Prescriptive Analysis

Prescriptive analysis goes beyond predicting what will happen—it tells you what to do about it. It often uses optimization algorithms, simulation, or decision models to recommend the best action given constraints. Think of it as the GPS that not only warns you about traffic ahead but also reroutes you.

For a sneak peek, let’s solve a tiny prescriptive problem: we want to maximize profit by deciding the optimal discount level for a product, assuming a simple relationship.

# Prescriptive: find discount that maximizes profit (simplified model)
import numpy as np

def profit(discount):
    base_price = 50
    base_units = 100
    # Price after discount
    price = base_price * (1 - discount)
    # Unit increase model: lower price increases demand linearly
    units = base_units * (1 + 1.5 * discount)
    cost_per_unit = 20
    return units * (price - cost_per_unit)

discounts = np.linspace(0, 0.5, 100)
profits = [profit(d) for d in discounts]
best_d = discounts[np.argmax(profits)]
print(f"Optimal discount: {best_d*100:.1f}% gives profit ${max(profits):.2f}")

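This is a toy example, and with these particular numbers the recommendation is to offer no discount at all: a 1.5x demand response never makes up for the lost margin, so profit peaks at 0% ($3,000). That is still a prescriptive result, just an unglamorous one. In a large retail chain, prescriptive models determine markdown schedules, staff allocation, and supply chain adjustments by balancing hundreds of variables.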
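How much the recommended discount depends on the demand assumption is easy to check by rerunning the grid search for several sensitivities. Here is a sketch under the same toy model, with the 1.5 multiplier turned into a parameter. The name `demand_sensitivity` and the values tried are assumptions for illustration. For this specific model, setting the derivative of profit to zero gives an optimum of 0.3 - 0.5/k (clipped at zero) for sensitivity k, which serves as a cross-check on the grid result.

```python
import numpy as np

def profit(discount, demand_sensitivity=1.5):
    """Toy profit model: a discount lowers price and raises demand linearly."""
    base_price, base_units, cost_per_unit = 50, 100, 20
    price = base_price * (1 - discount)
    units = base_units * (1 + demand_sensitivity * discount)
    return units * (price - cost_per_unit)

def best_discount(demand_sensitivity):
    """Grid-search discounts from 0% to 50% in 0.1% steps for the highest profit."""
    grid = np.linspace(0, 0.5, 501)
    profits = profit(grid, demand_sensitivity)  # vectorized over the whole grid
    return grid[np.argmax(profits)]

for k in (1.5, 3.0, 5.0):
    d = best_discount(k)
    closed_form = max(0.0, 0.3 - 0.5 / k)  # analytic optimum for this toy model
    print(f"sensitivity {k}: grid {d:.3f}, closed form {closed_form:.3f}, "
          f"profit ${profit(d, k):,.2f}")
```

The takeaway is that the recommendation is only as good as the demand model behind it: a discount pays off in this setup only when demand responds strongly enough to offset the thinner margin.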

Putting It All Together – A Real-World Flow

Now that you’ve met each type individually, let’s see how they work as a team in a realistic project: customer churn analysis.

Descriptive: You pull the last quarter’s data and find that 8% of customers churned. You build a table of churn rate by plan type (premium, basic).
Diagnostic: You discover that churn is highest among basic-plan customers who haven’t contacted support in the last 90 days. Correlation and cohort analysis confirm the pattern.
Predictive: You build a logistic regression model that flags which basic-plan customers are most likely to leave next month, using features like tenure, support tickets, and login frequency.
Prescriptive: The model’s output feeds into a recommendation engine: for high-risk customers, offer a 15% discount on upgrading to premium, because simulations show this maximizes retention ROI within a $10,000 monthly incentive budget.

That’s the ladder in action. The code you saw for each step can be stitched together in a Jupyter Notebook to form a complete pipeline. You’ll start with df.describe(), move to correlation analysis, train a predictive model, and finally wrap it with a simple optimization function. The beauty is that you don’t need to run a massive server farm—Pandas, NumPy, and scikit-learn on your laptop can handle a surprising amount of this.
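To make the flow concrete, here is a compact sketch of the four steps on synthetic churn data. Everything in it is an assumption for illustration: the column names, the formula that generates churn, the $40 cost per incentive, and the greedy "highest risk first" rule standing in for a real recommendation engine. Only the $10,000 budget comes from the scenario above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Synthetic customers (all columns and coefficients are made up for illustration).
customers = pd.DataFrame({
    "plan": rng.choice(["basic", "premium"], size=n, p=[0.7, 0.3]),
    "tenure_months": rng.integers(1, 48, size=n),
    "support_tickets": rng.poisson(1.0, size=n),
    "logins_per_week": rng.poisson(4.0, size=n),
})
is_basic = (customers["plan"] == "basic").astype(int)

# Assumed ground truth: short tenure, few logins, and the basic plan raise churn risk.
logit = (-1.0 - 0.04 * customers["tenure_months"]
         - 0.25 * customers["logins_per_week"] + 0.8 * is_basic).to_numpy()
customers["churned"] = rng.random(n) < 1 / (1 + np.exp(-logit))

# 1. Descriptive: overall churn rate and churn by plan type.
print(f"Overall churn: {customers['churned'].mean():.1%}")
print(customers.groupby("plan")["churned"].mean().round(3))

# 2. Diagnostic: how each numeric feature moves with churn.
features = ["tenure_months", "support_tickets", "logins_per_week"]
print(customers[features].corrwith(customers["churned"].astype(int)).round(2))

# 3. Predictive: logistic regression gives each customer a churn probability.
X = customers[features].assign(is_basic=is_basic)
model = LogisticRegression(max_iter=1000).fit(X, customers["churned"])
customers["churn_risk"] = model.predict_proba(X)[:, 1]

# 4. Prescriptive: spend the incentive budget on the riskiest customers first.
budget, cost_per_offer = 10_000, 40  # cost_per_offer is a hypothetical figure
n_offers = min(budget // cost_per_offer, n)
targets = customers.nlargest(n_offers, "churn_risk")
print(f"Offer incentive to {len(targets)} customers, "
      f"mean predicted risk {targets['churn_risk'].mean():.1%}")
```

In practice the risk scores would be validated on held-out customers before any budget is committed, and the targeting rule would weigh expected retention value against offer cost rather than rank on risk alone.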

Practice Challenges

I want you to get your hands dirty. Try these three mini-tasks on your own to cement the concepts.

Descriptive challenge: Load the famous Iris dataset (from sklearn.datasets import load_iris) into a DataFrame. Use .describe() and at least one visualization (histogram or boxplot) to summarize petal length.
Diagnostic challenge: Using the same Iris data, compute the correlation matrix and identify which two numerical features have the strongest positive correlation. Print a short sentence explaining what that relationship might mean biologically.
Predictive + Prescriptive challenge: Create a synthetic dataset of study hours (1-10) and exam scores (score = 10*hours + random noise). Fit a linear regression to predict score. Then, using that model, determine the minimum study hours needed to achieve a score above 85 (prescriptive twist). Print your recommendation.
