Context
InsightHub, a B2B analytics SaaS company, currently ingests application events, transactional database changes, and third-party CRM exports into PostgreSQL through ad hoc Python jobs. The platform now powers customer-facing dashboards, but data arrives 6-12 hours late, pipeline failures are hard to trace, and the current system cannot scale to support new enterprise customers.
You need to design a scalable analytics architecture that supports both near-real-time operational metrics and scheduled warehouse transformations for BI teams.
Scale Requirements
- Sources: Web/mobile events, PostgreSQL CDC, Salesforce daily exports, and billing system files
- Throughput: 150K events/sec peak, 25K events/sec average
- Daily volume: ~4 TB raw data/day
- Latency: <2 minutes for streaming metrics, <30 minutes for curated warehouse tables
- Retention: 180 days raw, 3 years curated analytics
- Consumers: Internal analysts, customer dashboards, finance reporting, ML feature generation
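The figures above imply a few derived numbers that are useful for sizing. The following is a back-of-envelope sketch, not a capacity plan. It assumes decimal terabytes, uncompressed data, and that the full ~4 TB/day is attributed to the event stream, which overstates the per-event size because CDC and batch files also contribute to that volume.

```python
# Back-of-envelope sizing derived from the stated scale requirements.
SECONDS_PER_DAY = 86_400
AVG_EVENTS_PER_SEC = 25_000
PEAK_EVENTS_PER_SEC = 150_000
RAW_BYTES_PER_DAY = 4 * 10**12   # ~4 TB/day, assuming decimal terabytes
RAW_RETENTION_DAYS = 180

# Events per day at the average rate.
events_per_day = AVG_EVENTS_PER_SEC * SECONDS_PER_DAY

# Upper bound on average event size (assumes all raw volume is events).
avg_event_bytes = RAW_BYTES_PER_DAY / events_per_day

# Ingest bandwidth the streaming layer must absorb at peak.
peak_bytes_per_sec = PEAK_EVENTS_PER_SEC * avg_event_bytes

# Raw storage held at steady state, before compression.
raw_retained_bytes = RAW_BYTES_PER_DAY * RAW_RETENTION_DAYS

print(f"events/day:        {events_per_day:,}")
print(f"avg event size:    {avg_event_bytes:,.0f} bytes")
print(f"peak ingest:       {peak_bytes_per_sec / 1e6:,.0f} MB/s")
print(f"raw retained:      {raw_retained_bytes / 1e12:,.0f} TB")
```

Under these assumptions the average rate is about 2.16 billion events per day at roughly 1.85 KB each, peak ingest is about 278 MB/s, and 180 days of raw data is about 720 TB before compression. The peak is 6x the average, so the ingestion layer needs either buffering or headroom sized for that ratio.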
Requirements
- Design ingestion for both streaming and batch sources with clear separation of raw, validated, and curated layers.
- Support CDC from PostgreSQL with exactly-once or effectively-once semantics for downstream warehouse tables.
- Build transformations for session metrics, account-level usage aggregates, and revenue reporting.
- Implement orchestration for scheduled ELT jobs, dependency management, retries, and backfills.
- Define data quality controls for schema drift, duplicates, null spikes, and late-arriving data.
- Publish analytics-ready tables to Snowflake with SLA-backed freshness guarantees.
- Include observability, alerting, and recovery procedures for ingestion, processing, and warehouse loads.
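One common way to meet the effectively-once requirement for CDC is to make the apply step idempotent: each change carries the source row's primary key and a log position, and the target ignores any change at or below the position already applied for that key. The sketch below illustrates the idea in memory. `ChangeEvent`, `apply_changes`, and the field names are hypothetical, and it assumes the log position (`lsn`) increases monotonically per key.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    key: str                     # primary key of the source row
    lsn: int                     # log position; assumed monotonically increasing
    op: str                      # "upsert" or "delete"
    row: Optional[dict] = None   # new row image for upserts


def apply_changes(table: Dict[str, dict],
                  applied_lsn: Dict[str, int],
                  events: Iterable[ChangeEvent]) -> int:
    """Apply change events idempotently and return how many took effect."""
    applied = 0
    for ev in events:
        # Skip duplicates and stale events: anything at or below the last
        # position already applied for this key has no effect.
        if ev.lsn <= applied_lsn.get(ev.key, -1):
            continue
        if ev.op == "delete":
            table.pop(ev.key, None)
        else:
            table[ev.key] = dict(ev.row or {})
        # Keep the position even after a delete so that an older upsert
        # arriving late cannot resurrect the row.
        applied_lsn[ev.key] = ev.lsn
        applied += 1
    return applied


table: Dict[str, dict] = {}
applied_lsn: Dict[str, int] = {}
batch = [
    ChangeEvent("acct-1", 10, "upsert", {"plan": "pro"}),
    ChangeEvent("acct-1", 12, "upsert", {"plan": "enterprise"}),
    ChangeEvent("acct-2", 11, "upsert", {"plan": "free"}),
    ChangeEvent("acct-2", 13, "delete"),
]
first = apply_changes(table, applied_lsn, batch)
replayed = apply_changes(table, applied_lsn, batch)  # redelivery is a no-op
```

Replaying the same batch after a retry leaves the table unchanged, which is what lets an at-least-once transport produce effectively-once results downstream. In a warehouse, the same guard would typically be expressed as a merge keyed on the primary key with a condition on the stored log position.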
Constraints
- Existing cloud footprint is AWS; avoid introducing more than one new managed platform.
- Team size is 5 data engineers and 1 platform engineer.
- Incremental infrastructure budget is capped at $35K/month.
- Must support SOC 2 auditability and PII minimization; raw PII cannot be retained beyond 30 days, so the 180-day raw retention can apply only to data with PII removed or masked.
- Historical backfill of 12 months must be possible without disrupting production SLAs.
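The 30-day PII limit and the 180-day raw retention interact: a raw partition has to pass through a PII-scrubbing step well before it is deleted. A minimal sketch of that decision, assuming raw data is partitioned by date, might look like the following. The function name and the action labels (`keep`, `scrub_pii`, `delete`) are hypothetical, and "beyond 30 days" is read here as an age strictly greater than 30 days.

```python
from datetime import date, timedelta

PII_RETENTION_DAYS = 30    # from the PII minimization constraint
RAW_RETENTION_DAYS = 180   # from the raw retention requirement


def retention_action(partition_date: date, today: date) -> str:
    """Return the lifecycle action for a raw partition of the given date."""
    age_days = (today - partition_date).days
    if age_days > RAW_RETENTION_DAYS:
        return "delete"      # past raw retention: drop the partition
    if age_days > PII_RETENTION_DAYS:
        return "scrub_pii"   # keep the data, but with PII removed or masked
    return "keep"            # still inside the PII window


today = date(2024, 7, 1)
for age in (5, 30, 31, 180, 181):
    partition = today - timedelta(days=age)
    print(partition, age, retention_action(partition, today))
```

A scheduled job could evaluate this per partition and record each action taken, which also gives auditors a trail showing when PII was removed. The same rule matters for the 12-month backfill: any reprocessed raw data older than 30 days would need to be scrubbed on arrival rather than landing in its original form.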