Context
InsightHub, a B2B analytics company, receives daily customer data exports from SaaS tools, internal PostgreSQL databases, and third-party CSV uploads. Analysts currently clean files manually in notebooks, causing inconsistent logic, duplicate records, and delayed reporting.
You need to design a production-grade batch pipeline that ingests raw datasets, standardizes and validates them, and publishes clean tables for downstream analysis. The goal is to make data preparation repeatable, auditable, and easy to backfill.
Scale Requirements
- Sources: 25 daily CSV/JSON feeds + 3 PostgreSQL source tables
- Volume: 300 GB raw data/day, growing 8% month-over-month (a rough sizing sketch follows this list)
- Row count: ~1.2 billion records/day across all sources
- Latency target: Raw file arrival to analytics-ready tables in < 2 hours
- Retention: 180 days raw, 3 years curated
- SLA: 99.5% successful daily pipeline completion by 6:00 AM UTC
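The growth and retention figures above imply a sizeable raw footprint within a year. A back-of-envelope sizing, assuming the 300 GB/day starting point and 8% month-over-month compounding stated above; the 12-month horizon and variable names are illustrative only:

```python
# Back-of-envelope sizing from the figures above: 300 GB/day raw,
# 8% month-over-month growth, 180-day raw retention.
# The 12-month horizon and variable names are illustrative assumptions.
DAILY_GB = 300
MONTHLY_GROWTH = 0.08
MONTHS = 12

daily = float(DAILY_GB)
for month in range(1, MONTHS + 1):
    daily *= 1 + MONTHLY_GROWTH
    print(f"Month {month:2d}: ~{daily:,.0f} GB/day raw")

# Rough upper bound on the raw landing zone at month 12:
# 180 retained days priced at the month-12 daily rate.
print(f"Raw retention upper bound: ~{daily * 180 / 1024:,.1f} TB")
```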
Requirements
- Ingest raw files and database extracts into a landing zone without modifying source data.
- Detect schema issues, malformed rows, null spikes, duplicates, and invalid business values (see the validation sketch after this list).
- Standardize types, timestamps, categorical values, and column names across datasets (a combined standardization and deduplication sketch follows this list).
- Implement deterministic deduplication and basic enrichment (reference table joins, derived columns).
- Publish cleaned, analytics-ready tables to a warehouse for BI and ad hoc SQL analysis.
- Support idempotent reruns, partition-based backfills, and auditability of every transformation (see the partition-overwrite DAG sketch after this list).
- Define monitoring, alerting, and failure recovery for late files, bad schemas, and partial loads (an alerting sketch follows this list).
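For the validation requirement, a minimal sketch of batch-level checks for schema drift, null spikes, duplicate keys, and invalid business values. It uses pandas purely for readability; at ~1.2 billion rows/day these rules would run in Spark or as warehouse SQL, and every column name, threshold, and rule below is an assumption rather than part of the brief:

```python
import pandas as pd

# Illustrative expected schema and rules; real values would come from a config store.
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64",
                    "order_ts": "datetime64[ns, UTC]", "amount_usd": "float64"}
NULL_SPIKE_THRESHOLD = 0.05          # assumed: flag if >5% nulls in a column
VALID_AMOUNT_RANGE = (0, 1_000_000)  # assumed business rule

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable issues found in one daily batch."""
    issues = []

    # 1. Schema drift: missing or unexpected columns.
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    extra = set(df.columns) - set(EXPECTED_COLUMNS)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    if extra:
        issues.append(f"unexpected columns: {sorted(extra)}")

    # 2. Null spikes per expected column.
    for col in EXPECTED_COLUMNS:
        if col in df.columns:
            null_rate = df[col].isna().mean()
            if null_rate > NULL_SPIKE_THRESHOLD:
                issues.append(f"{col}: null rate {null_rate:.1%} exceeds threshold")

    # 3. Duplicate business keys.
    if "order_id" in df.columns:
        dup_count = int(df["order_id"].duplicated().sum())
        if dup_count:
            issues.append(f"{dup_count} duplicate order_id values")

    # 4. Invalid business values.
    if "amount_usd" in df.columns:
        lo, hi = VALID_AMOUNT_RANGE
        bad = int((~df["amount_usd"].between(lo, hi)).sum())
        if bad:
            issues.append(f"{bad} rows with amount_usd outside [{lo}, {hi}]")

    return issues
```

In practice the returned issues could be written to an audit table so a failing batch is quarantined rather than published.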
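A companion sketch for standardization and deterministic deduplication, again in pandas for illustration; the snake_case rule, the example columns, and the "latest extract wins" tie-breaker are assumptions:

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Normalize a column name like 'OrderID' or 'Order ID' to 'order_id'."""
    name = re.sub(r"[\s\-]+", "_", name.strip())
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns={c: to_snake_case(c) for c in df.columns})
    # Parse timestamps to UTC and coerce numeric types; bad values become
    # NaT/NaN and are caught by the validation step rather than silently dropped.
    df["order_ts"] = pd.to_datetime(df["order_ts"], utc=True, errors="coerce")
    df["amount_usd"] = pd.to_numeric(df["amount_usd"], errors="coerce")
    # Normalize categorical values (assumed mapping).
    df["status"] = df["status"].str.strip().str.lower().replace({"cancelled": "canceled"})
    return df

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    # Deterministic rule: for each business key keep the row from the latest
    # extract; a stable sort means reruns always pick the same row.
    return (df.sort_values(["order_id", "extracted_at"], kind="mergesort")
              .drop_duplicates(subset=["order_id"], keep="last"))
```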
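For idempotent reruns and backfills on the existing Airflow + Snowflake stack, one common pattern is to scope every load to the logical date and overwrite that partition (delete, then reload). A minimal DAG sketch follows; the table names, connection id, and the SnowflakeOperator import path are assumptions and depend on the installed provider version:

```python
from datetime import datetime

from airflow import DAG
# Import path assumed; newer Snowflake provider versions expose
# SQLExecuteQueryOperator instead of SnowflakeOperator.
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Idempotent partition overwrite: each run deletes and reloads only its own
# logical date, so reruns and backfills converge to the same result.
# Table, schema, and connection names are illustrative assumptions.
OVERWRITE_PARTITION_SQL = [
    "DELETE FROM analytics.orders_clean WHERE load_date = '{{ ds }}'",
    """
    INSERT INTO analytics.orders_clean
    SELECT * FROM staging.orders_validated WHERE load_date = '{{ ds }}'
    """,
]

with DAG(
    dag_id="orders_daily_clean",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",   # schedule_interval on older Airflow versions
    catchup=True,           # enables partition-based backfills out of the box
    max_active_runs=1,
) as dag:
    publish = SnowflakeOperator(
        task_id="overwrite_partition",
        snowflake_conn_id="snowflake_default",
        sql=OVERWRITE_PARTITION_SQL,
    )
```

Because each run rewrites only its own load_date partition, backfilling a date range with `airflow dags backfill` reproduces history without double-counting.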
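For monitoring and failure recovery, Airflow's built-in retry, failure-callback, and SLA hooks cover late files and partial loads without new infrastructure. A sketch of the default_args such a DAG could use; the alert helper is an assumed stand-in for whatever channel the team already uses:

```python
from datetime import timedelta

def notify_on_failure(context):
    """Airflow on_failure_callback: receives the task context when a task fails."""
    ti = context["task_instance"]
    send_alert(
        f"Pipeline failure: {ti.dag_id}.{ti.task_id} "
        f"for {context['ds']} (try {ti.try_number})"
    )

def send_alert(message: str) -> None:
    # Assumed stand-in for the team's alerting channel (Slack, PagerDuty, email).
    print(message)

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_on_failure,
    # Per-task SLA: alerts if a task has not finished within the window,
    # catching late files and stalled loads before the 6:00 UTC deadline.
    "sla": timedelta(hours=2),
}
```

These defaults would be passed as default_args to the DAG sketched above; schema failures surfaced by the validation step can fail their task explicitly so the same callback fires.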
Constraints
- Existing stack is AWS-centric and already uses S3, Airflow, and Snowflake.
- Team size is 3 data engineers; operational complexity should stay low.
- Budget for incremental infrastructure is capped at $15K/month.
- Some datasets contain PII and must support column-level masking and deletion requests within 7 days (a masking-policy sketch follows).
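The PII constraint maps naturally onto Snowflake column-level masking policies attached to the curated tables, plus a scheduled job that drains deletion requests. The sketch below issues the DDL and deletes through the Python connector; the policy, role, table, and column names are assumptions:

```python
import snowflake.connector  # assumed: the team's existing Snowflake Python connector

# Illustrative masking policy: analysts see a redacted value, a dedicated
# PII role sees the raw column. Policy, role, table, and column names are assumptions.
MASK_EMAIL_SQL = """
CREATE MASKING POLICY IF NOT EXISTS analytics.governance.mask_email
  AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END
"""

APPLY_POLICY_SQL = """
ALTER TABLE analytics.orders_clean
  MODIFY COLUMN customer_email
  SET MASKING POLICY analytics.governance.mask_email
"""

DELETE_CUSTOMER_SQL = "DELETE FROM analytics.orders_clean WHERE customer_id = %s"

def apply_pii_masking(conn) -> None:
    cur = conn.cursor()
    try:
        cur.execute(MASK_EMAIL_SQL)
        cur.execute(APPLY_POLICY_SQL)
    finally:
        cur.close()

def process_deletion_request(conn, customer_id: int) -> None:
    # Deletion requests must complete within 7 days; a scheduled job could
    # drain a request queue by running this per curated table.
    cur = conn.cursor()
    try:
        cur.execute(DELETE_CUSTOMER_SQL, (customer_id,))
    finally:
        cur.close()
```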