Context
InsightLoop, a B2B SaaS analytics company, collects product events from web SDKs, backend services, and third-party CRM exports. Today, data arrives through a mix of nightly CSV uploads and ad hoc API pulls into PostgreSQL, causing inconsistent schemas, duplicate records, and 8-12 hour reporting delays.
You need to design a modern data collection and analysis pipeline that supports both near-real-time operational analytics and scheduled batch reporting. The system must standardize ingestion, validate data quality, and produce analytics-ready tables for BI and downstream data science use cases.
Scale Requirements
- Sources: 3 primary sources initially (web/mobile events, application database CDC, daily SaaS exports)
- Throughput: 120K events/sec at peak, 25K events/sec on average
- Payload size: 1-3 KB JSON per streaming event; batch files total 50-200 GB/day
- Latency: <2 minutes for streaming data to warehouse, <2 hours for batch sources
- Storage: 6 TB/day raw, 18 months retained in the data lake, curated warehouse tables retained indefinitely (a rough sizing check follows this list)
- Consumers: 150 BI users, 20 scheduled dashboards, 8 downstream ML/analytics jobs
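The storage figure is roughly consistent with the throughput figures. A quick back-of-the-envelope check, assuming a 2.5 KB average event (the midpoint of the stated 1-3 KB range) and batch files at the top of the stated 50-200 GB/day range:

```python
# Back-of-the-envelope sizing check; the 2.5 KB average event size is an
# assumption within the stated 1-3 KB range.
AVG_EVENTS_PER_SEC = 25_000      # stated average throughput
AVG_EVENT_BYTES = 2_500          # assumed midpoint of 1-3 KB
BATCH_BYTES_PER_DAY = 200e9      # upper end of the stated batch volume
SECONDS_PER_DAY = 86_400

streaming_bytes = AVG_EVENTS_PER_SEC * AVG_EVENT_BYTES * SECONDS_PER_DAY
total_tb = (streaming_bytes + BATCH_BYTES_PER_DAY) / 1e12

print(f"streaming:         {streaming_bytes / 1e12:.1f} TB/day")  # ~5.4 TB/day
print(f"streaming + batch: {total_tb:.1f} TB/day")                # ~5.6 TB/day
```

That lands just under the stated 6 TB/day raw once landing-format and metadata overhead are added, so the figures hang together.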
Requirements
- Design an ingestion layer for both streaming and batch data sources (a streaming producer sketch follows this list).
- Define how you would handle schema evolution, deduplication, and late-arriving data (see the merge sketch below).
- Build raw, cleaned, and curated layers suitable for analytics and backfills.
- Ensure idempotent processing for retries and replays (the same merge sketch applies).
- Orchestrate transformations and dependencies between ingestion, validation, and warehouse modeling (see the DAG sketch below).
- Propose monitoring for freshness, throughput, failed loads, and data quality (see the freshness check below).
- Support replay/backfill of the last 90 days without corrupting downstream tables (see the backfill sketch below).
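For the streaming side of the ingestion layer, one option that fits the limited-Kafka-operations constraint is Amazon Kinesis Data Streams, with Firehose delivering micro-batches to S3. The sketch below is a minimal producer wrapper, not a prescribed design; the stream name, region, and event fields are all illustrative assumptions.

```python
import json

import boto3

# Minimal streaming-producer sketch, assuming Kinesis Data Streams is chosen
# over self-managed Kafka. Stream name, region, and event shape are illustrative.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_event(event: dict) -> None:
    """Send one product event; partitioning by event_id keeps retries of the
    same event on the same shard, which simplifies downstream deduplication."""
    kinesis.put_record(
        StreamName="product-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["event_id"]),
    )

send_event({
    "event_id": "9f2c7b1e-demo-4a5d-8f3a-000000000001",  # client-generated ID (assumed)
    "event_type": "feature_used",
    "occurred_at": "2024-05-01T12:00:00Z",
    "properties": {"plan": "enterprise"},
})
```

Batch sources (SaaS exports, CDC extracts) can land directly in S3 and load on a schedule, so both paths converge on the same raw layer.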
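For deduplication, late-arriving data, and idempotent retries, a common pattern is to land everything append-only in the raw layer and upsert into the cleaned layer with a MERGE keyed on a stable event ID. The sketch below runs such a merge against Snowflake; the table and column names (raw.events_stage, cleaned.events, event_id, ingested_at, and so on) are assumptions for illustration.

```python
import snowflake.connector

# Idempotent upsert from an append-only staging table into the cleaned layer.
# Re-running the same batch (retries, replays, late arrivals) updates rows in
# place instead of duplicating them. All object names are illustrative.
MERGE_SQL = """
MERGE INTO cleaned.events AS t
USING (
    -- collapse intra-batch duplicates, keeping the latest copy of each event
    SELECT *
    FROM raw.events_stage
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY event_id ORDER BY ingested_at DESC
    ) = 1
) AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET
    t.event_type  = s.event_type,
    t.occurred_at = s.occurred_at,
    t.payload     = s.payload,
    t.ingested_at = s.ingested_at
WHEN NOT MATCHED THEN INSERT (event_id, event_type, occurred_at, payload, ingested_at)
    VALUES (s.event_id, s.event_type, s.occurred_at, s.payload, s.ingested_at)
"""

conn = snowflake.connector.connect(
    account="insightloop",   # placeholder connection details;
    user="etl_user",         # use a secrets manager in practice
    password="...",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
)
conn.cursor().execute(MERGE_SQL)
```

This is the same behavior dbt's incremental materialization with a unique_key produces on Snowflake, which may suit the team's existing dbt skills better than hand-written merges. Schema evolution is also easier if the raw payload is kept as a VARIANT column and new fields are promoted into the cleaned layer as they stabilize.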
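For orchestration, a managed Airflow deployment (MWAA, assuming Airflow 2.4+) is one option the team could run without heavy operational overhead. The DAG below sketches the load -> validate -> model chain; the DAG id, schedule, scripts, and dbt command are illustrative assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch of the scheduled batch path: load staged files, run data-quality
# checks, then build warehouse models with dbt. Names and schedule are illustrative.
with DAG(
    dag_id="insightloop_hourly_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    load_raw = BashOperator(
        task_id="load_raw",
        bash_command="python jobs/load_s3_to_snowflake.py",   # hypothetical loader script
    )
    validate = BashOperator(
        task_id="validate",
        bash_command="python jobs/run_quality_checks.py",     # hypothetical checks script
    )
    model = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build",                             # builds and tests all dbt models
    )

    load_raw >> validate >> model  # validation gates warehouse modeling
```

Keeping validation as an explicit task means a failed quality check blocks the dbt build rather than publishing bad data.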
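For freshness monitoring, the simplest check compares the newest loaded timestamp in a cleaned or curated table against the latency targets above. A minimal sketch, assuming a loaded_at TIMESTAMP_TZ column and placeholder connection details:

```python
from datetime import datetime, timedelta, timezone

import snowflake.connector

# Freshness check on the cleaned events table; the 2-minute threshold mirrors
# the streaming latency target. Table, column, and connection details are assumed.
FRESHNESS_SLA = timedelta(minutes=2)

conn = snowflake.connector.connect(account="insightloop", user="monitor_user",
                                    password="...", warehouse="MONITOR_WH")
latest_load = conn.cursor().execute(
    "SELECT MAX(loaded_at) FROM analytics.cleaned.events"
).fetchone()[0]

lag = datetime.now(timezone.utc) - latest_load
if lag > FRESHNESS_SLA:
    # Swap the print for a Slack/PagerDuty/CloudWatch alert in a real deployment.
    print(f"ALERT: cleaned.events is {lag} behind (SLA {FRESHNESS_SLA})")
```

dbt's built-in `dbt source freshness` command expresses the same check declaratively, which may be a better fit for the team's dbt skills; throughput and failed-load counts can come from metrics on the ingestion services themselves.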
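For 90-day replays, a partition-overwrite pattern keeps backfills idempotent: each run deletes and rebuilds one event-date's worth of data inside a transaction, so rerunning a day can never double-count it. A minimal sketch, reusing the illustrative table names from the merge sketch:

```python
from datetime import date, timedelta

import snowflake.connector

# Rebuild one day at a time from the raw layer. Delete+insert per day inside a
# transaction is idempotent, so repeated or failed runs cannot corrupt the table.
conn = snowflake.connector.connect(account="insightloop", user="etl_user",
                                    password="...", warehouse="TRANSFORM_WH",
                                    database="ANALYTICS")
cur = conn.cursor()

start = date.today() - timedelta(days=90)
for offset in range(90):
    day = start + timedelta(days=offset)
    cur.execute("BEGIN")
    cur.execute("DELETE FROM cleaned.events WHERE CAST(occurred_at AS DATE) = %s", (day,))
    cur.execute(
        """
        INSERT INTO cleaned.events (event_id, event_type, occurred_at, payload, ingested_at)
        SELECT event_id, event_type, occurred_at, payload, ingested_at
        FROM raw.events
        WHERE CAST(occurred_at AS DATE) = %s
        QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at DESC) = 1
        """,
        (day,),
    )
    cur.execute("COMMIT")
```

Downstream dbt models can then be rebuilt for the same window, since they read from the already-deduplicated cleaned layer.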
Constraints
- Cloud environment is AWS; Snowflake is already the company warehouse.
- Team has strong SQL/dbt skills, moderate Spark experience, and limited Kafka operations expertise.
- Incremental platform budget is capped at $35K/month.
- Must meet SOC 2 controls and support deletion requests within 7 days for regulated customer data (a deletion-handling sketch follows).
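For the 7-day deletion SLA, one approach is a small compliance job driven by a queue of requests: it removes the customer's rows from the warehouse layers and records completion for the SOC 2 audit trail. A minimal sketch, assuming a customer_id column on the listed tables and a compliance.deletion_requests tracking table (all assumptions); purging raw copies in the S3 lake would be handled separately, for example by rewriting the affected files or partitions.

```python
import snowflake.connector

# Process one deletion request across warehouse layers and record completion.
# Table and column names are illustrative; S3/lake purging is out of scope here.
TABLES = ["cleaned.events", "curated.fct_product_usage", "curated.dim_accounts"]

def process_deletion(customer_id: str) -> None:
    conn = snowflake.connector.connect(account="insightloop", user="compliance_user",
                                        password="...", warehouse="TRANSFORM_WH",
                                        database="ANALYTICS")
    cur = conn.cursor()
    for table in TABLES:
        cur.execute(f"DELETE FROM {table} WHERE customer_id = %s", (customer_id,))
    cur.execute(
        "UPDATE compliance.deletion_requests "
        "SET completed_at = CURRENT_TIMESTAMP() WHERE customer_id = %s",
        (customer_id,),
    )
```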