What is Semi-Structured Data? #
Semi-structured data does not fit a fixed table of rows and columns, but it still carries organizational markers, such as tags or keys, that label and separate its elements.
- More flexible than structured data
- Can represent nested or hierarchical data without flattening it into tables
- Widely used in modern applications (APIs, web data)
- Bridges the gap between structured and unstructured data
Examples of Semi-Structured Data #
- JSON (JavaScript Object Notation)
- XML files
- HTML data from websites
- Documents stored in NoSQL databases (e.g. MongoDB)
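To make the "tags" idea concrete, here is a minimal sketch that reads a small XML record with Python's built-in xml.etree.ElementTree module. The sample record and its field names are made up for illustration. Note that XML carries no type information, so every value comes back as a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record, invented for this illustration.
xml_text = "<person><name>Alex</name><age>25</age><city>New York</city></person>"

# Parse the XML string into an element tree and get its root element.
root = ET.fromstring(xml_text)

# Each child element's tag acts like a key; its text is the value.
record = {child.tag: child.text for child in root}
print(record)
```

If a numeric age is needed, it has to be converted explicitly, for example with int(record["age"]).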
Characteristics of Semi-Structured Data #
- No fixed schema (flexible structure)
- Uses tags or key-value pairs
- Fields can vary from one record to the next
- Easier to process than unstructured data
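The characteristics above can be sketched with a few JSON records that share a collection but not a schema. The records and field names below are hypothetical. The code collects the union of keys and reports which fields each record lacks.

```python
import json

# Hypothetical records: same collection, different fields per record.
raw = """[
  {"name": "Alex", "age": 25, "city": "New York"},
  {"name": "Sam", "email": "sam@example.com"},
  {"name": "Kim", "age": 31, "address": {"city": "Boston", "zip": "02101"}}
]"""

records = json.loads(raw)

# The union of keys shows that no single schema covers every record.
all_keys = sorted({key for rec in records for key in rec})
print(all_keys)

# Report which fields each record is missing relative to that union.
for rec in records:
    missing = [key for key in all_keys if key not in rec]
    print(rec["name"], "is missing", missing)
```

Because the keys label each value, the data can still be inspected programmatically even though the records differ, which is what makes it easier to process than free text.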
Common Tools #
- Python (built-in json and xml.etree.ElementTree modules)
- MongoDB (NoSQL database)
- REST APIs (commonly return JSON)
- Apache Spark / Hadoop (large-scale processing)
Basic Python Example #
Example: Working with JSON Data #
import json
data = '{"name": "Alex", "age": 25, "city": "New York"}'
# Convert JSON to Python dictionary
parsed_data = json.loads(data)
print(parsed_data["name"])
Example: API Data (JSON) #
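import requests
# Send the name as a query parameter and fail fast on network or HTTP errors
response = requests.get("https://api.agify.io", params={"name": "alex"}, timeout=10)
response.raise_for_status()
# Decode the JSON response body into a Python dictionary
data = response.json()
print(data)
Best Practices #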
- Validate data structure (keys and values)
- Convert to structured format when needed
- Handle missing or inconsistent fields
- Use a real parser (such as Python's json module) instead of manual string slicing
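The practices above can be combined in a short sketch: parse with the json module, validate each record, fill missing optional fields with None, and emit flat rows with a fixed set of columns. The names REQUIRED_KEYS, COLUMNS, and to_row are hypothetical helpers, and treating only "name" as mandatory is an assumption made for this example.

```python
import json

REQUIRED_KEYS = {"name"}           # assumption: only "name" is mandatory
COLUMNS = ["name", "age", "city"]  # assumption: target columns for the flat rows


def to_row(record):
    """Validate one parsed record and return a flat dict with fixed columns."""
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    age = record.get("age")
    if age is not None and not isinstance(age, int):
        raise ValueError("age must be an integer")
    # Missing optional fields become None instead of raising KeyError.
    return {col: record.get(col) for col in COLUMNS}


# Hypothetical input: one complete record, one partial, one invalid.
raw = '[{"name": "Alex", "age": 25, "city": "New York"}, {"name": "Sam"}, {"age": 40}]'

rows, errors = [], []
for item in json.loads(raw):
    try:
        rows.append(to_row(item))
    except ValueError as exc:
        errors.append(str(exc))

print(rows)
print(errors)
```

Collecting errors rather than stopping at the first bad record is one possible design choice. It suits batch loads, where a single malformed record should not discard the rest.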
