Context
PulseCart, a mid-size e-commerce marketplace, currently relies on hourly batch ETL jobs to move application events from PostgreSQL, along with application logs, into Snowflake. Product, fraud, and operations teams now need sub-minute visibility into orders, payments, inventory changes, and user activity to support live dashboards and alerting.
You are asked to design a real-time data processing system that complements the existing batch platform and produces analytics-ready data with strong reliability guarantees.
Scale Requirements
- Sources: Web/mobile events, order service, payment service, inventory service
- Throughput: 250K events/second at peak, 60K events/second sustained average
- Event size: 1-3 KB JSON/Avro
- Latency target: P95 end-to-end under 60 seconds; P99 under 2 minutes
- Daily volume: ~12 TB of raw data
- Retention: 30 days in streaming layer, 1 year in object storage, curated warehouse tables retained indefinitely
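(As a rough cross-check: 60K events/second sustained at ~2-2.5 KB per event works out to roughly 10-13 TB/day, consistent with the ~12 TB figure above.)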
Requirements
- Ingest events from multiple services with ordered processing per entity where needed (for example, order_id or user_id); see the keyed-producer sketch after this list.
- Validate schemas, deduplicate retries, and quarantine malformed or poison messages without blocking the main pipeline; a validate-and-dead-letter sketch follows this list.
- Perform real-time transformations such as enrichment, sessionization, payment status normalization, and inventory state aggregation; the core sessionization logic is sketched after this list.
- Land raw and processed data in durable storage for replay and backfills.
- Load queryable tables into Snowflake for BI dashboards and operational reporting.
- Orchestrate downstream transformations and support replay, reprocessing, and schema evolution; an Airflow replay DAG is sketched after this list.
- Provide monitoring for freshness, throughput, data quality, and cost.
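Per-entity ordering typically falls out of the partitioning key rather than any global ordering guarantee. A minimal sketch, assuming Kafka (e.g. Amazon MSK) as the streaming layer; the broker address and topic name are placeholders, and the same idea maps directly to Kinesis partition keys:

```python
import json
from confluent_kafka import Producer

# Broker endpoint and topic are placeholders for the real MSK configuration.
producer = Producer({
    "bootstrap.servers": "broker:9092",
    "acks": "all",                 # wait for full ISR acknowledgement
    "enable.idempotence": True,    # suppress broker-side duplicates on retry
})

def publish_order_event(event: dict) -> None:
    # Keying by order_id pins every event for one order to one partition,
    # which is exactly the per-entity guarantee: events for the same order
    # are consumed in publish order, while different orders interleave freely.
    producer.produce(
        topic="orders.events",
        key=event["order_id"],
        value=json.dumps(event).encode("utf-8"),
    )

publish_order_event({"order_id": "o-123", "event_type": "order_placed"})
producer.flush()
```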
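For validation and quarantine, one common shape is a thin consume-validate-route loop in front of the transform stage. A sketch under the same Kafka assumption, with an inline schema standing in for a schema-registry lookup; topic names are illustrative:

```python
import json
from confluent_kafka import Consumer, Producer
from jsonschema import Draft7Validator

# Inline schema for illustration; a real pipeline would fetch this from a
# schema registry so producers and validators stay in sync.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["event_id", "order_id", "event_type", "occurred_at"],
}
validator = Draft7Validator(ORDER_SCHEMA)

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-validator",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["orders.events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())        # JSONDecodeError is a ValueError
        error = next(validator.iter_errors(event), None)
        if error is not None:
            raise ValueError(error.message)
        producer.produce("orders.validated", key=msg.key(), value=msg.value())
    except ValueError as exc:
        # Quarantine rather than crash: the poison message goes to a
        # dead-letter topic with the failure reason attached, and the main
        # pipeline keeps flowing.
        producer.produce("orders.dlq", key=msg.key(), value=msg.value(),
                         headers=[("error", str(exc).encode("utf-8"))])
```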
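Sessionization would normally run as a session window in the stream processor (Flink and Spark both ship one), but the underlying logic is just gap detection over per-user state. A plain-Python sketch of that logic, assuming a 30-minute inactivity gap:

```python
SESSION_GAP_SECONDS = 30 * 60   # assumed inactivity gap; tune per product needs

# Per-user open-session state; in a real job this lives in the stream
# processor's keyed state, not a process-local dict.
open_sessions: dict[str, dict] = {}

def assign_session(user_id: str, event_ts: float) -> tuple[str, str | None]:
    """Return (session_id for this event, id of any session just closed)."""
    closed = None
    sess = open_sessions.get(user_id)
    if sess and event_ts - sess["last_ts"] > SESSION_GAP_SECONDS:
        closed = sess["session_id"]       # gap exceeded: close the old session
        sess = None
    if sess is None:
        sess = {"session_id": f"{user_id}:{int(event_ts)}", "last_ts": event_ts}
        open_sessions[user_id] = sess
    sess["last_ts"] = max(sess["last_ts"], event_ts)
    return sess["session_id"], closed
```

Note that this sketch assumes roughly in-order events per user; a production session window would also handle watermarks and out-of-order arrival.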
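Since the platform already runs Airflow 2.x, replay and backfill can be modeled as a manually triggered DAG over the raw S3 zone. A sketch using the TaskFlow API; the bucket layout and parameter names are assumptions:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(
    schedule_interval=None,        # replay runs are triggered manually
    start_date=datetime(2024, 1, 1),
    catchup=False,
    params={"start_hour": "2024-01-01T00", "end_hour": "2024-01-01T23"},
)
def replay_orders():
    @task
    def list_raw_partitions(**context) -> list[str]:
        # Resolve the raw S3 hour partitions covered by the requested window
        # (assumed layout: s3://pulsecart-raw/orders/dt=YYYY-MM-DD/hr=HH/).
        start = context["params"]["start_hour"]
        end = context["params"]["end_hour"]
        return [start, end]        # placeholder for a real S3 listing

    @task
    def reprocess(partitions: list[str]) -> None:
        # Re-run the same validation/transform code over the raw files into
        # the processed zone; idempotent event keys make reruns safe.
        for partition in partitions:
            print(f"reprocessing {partition}")

    reprocess(list_raw_partitions())

replay_orders()
```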
Constraints
- AWS is the required cloud; existing platform already uses S3, Snowflake, and Airflow 2.x.
- Team size is 5 data engineers; operational complexity should be reasonable.
- Budget target is under $35K/month incremental spend.
- Must support PCI-sensitive payment events by masking or excluding restricted fields before analytics storage; see the scrubbing sketch below.
- Design for at-least-once ingestion, but minimize duplicates in downstream tables through idempotent processing; see the MERGE sketch below.
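For the PCI constraint, restricted fields should be dropped or masked in the stream, before anything lands in S3 or Snowflake. A minimal sketch; the field lists are placeholders for whatever the actual PCI scope review produces:

```python
# Placeholder policy: the real field lists come from the PCI scope review.
DROP_FIELDS = {"card_number", "cvv", "magstripe_data"}
MASK_FIELDS = {"cardholder_name"}

def scrub_payment_event(event: dict) -> dict:
    """Return a copy of the event that is safe for analytics storage."""
    clean = {}
    for key, value in event.items():
        if key in DROP_FIELDS:
            continue                        # excluded entirely
        if key in MASK_FIELDS and isinstance(value, str):
            clean[key] = "*" * len(value)   # masked, length-preserving
        else:
            clean[key] = value
    return clean
```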
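At-least-once delivery means replays will reach the warehouse, so the Snowflake load should key on a unique event id. A sketch of an idempotent load via MERGE; the table and column names are illustrative:

```python
import snowflake.connector

MERGE_SQL = """
MERGE INTO analytics.order_events AS t
USING staging.order_events_batch AS s
  ON t.event_id = s.event_id
WHEN NOT MATCHED THEN
  INSERT (event_id, order_id, event_type, occurred_at, payload)
  VALUES (s.event_id, s.order_id, s.event_type, s.occurred_at, s.payload)
"""

def load_batch(conn) -> None:
    # The staging table may contain rows already loaded on a previous
    # attempt; matching on event_id makes the load a no-op for replays,
    # so duplicates never reach the curated table.
    with conn.cursor() as cur:
        cur.execute(MERGE_SQL)

# conn = snowflake.connector.connect(account="...", user="...", password="...")
# load_batch(conn)
```

One caveat: WHEN NOT MATCHED compares against the target, not other source rows, so if a single staging batch can itself carry duplicates, dedupe the source first (for example with ROW_NUMBER() and QUALIFY) before the MERGE.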