Context
NovaRetail runs its operations on several systems: a PostgreSQL order database, Salesforce CRM, Stripe billing, and application logs emitted from Kubernetes services. Today, analysts rely on nightly CSV exports and ad hoc scripts, which results in stale dashboards, duplicate records, and frequent pipeline failures.
You are asked to design a production-grade data platform that supports both batch ETL and near-real-time ingestion so business and engineering teams can trust operational and analytics data.
Scale Requirements
- Sources: PostgreSQL (~150 tables), Salesforce (~20M records), Stripe webhooks, application logs
- Throughput: 20K log events/sec peak, 5K CDC row changes/sec peak
- Latency: < 2 minutes for streaming data, < 1 hour for batch-loaded SaaS data
- Storage: 12 TB raw data retained for 1 year; curated warehouse tables retained indefinitely
- Consumers: 80 analysts, 12 data scientists, 15 internal dashboards
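As a rough sizing check on the figures above, the sketch below converts the stated peak rates into a combined rate, a per-latency-window count, and an average daily raw volume. The constant and function names are illustrative. Two assumptions are baked in: the 12 TB figure is read as the total raw volume held across the one-year window, and Stripe webhook traffic is left out because no rate is given for it.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumption: 12 TB is the total raw volume held across the 1-year window.
# Assumption: Stripe webhook volume is excluded (no rate is stated for it).

LOG_EVENTS_PER_SEC_PEAK = 20_000
CDC_CHANGES_PER_SEC_PEAK = 5_000
STREAMING_LATENCY_SEC = 120          # the "< 2 minutes" budget
RAW_RETAINED_TB = 12
RETENTION_DAYS = 365


def events_in_window(rate_per_sec: int, window_sec: int) -> int:
    """Events that arrive during one latency window at a given rate."""
    return rate_per_sec * window_sec


def avg_daily_raw_gb(total_tb: float, days: int) -> float:
    """Average raw volume per day in GB (decimal units, 1 TB = 1000 GB)."""
    return total_tb * 1000 / days


peak_events_per_sec = LOG_EVENTS_PER_SEC_PEAK + CDC_CHANGES_PER_SEC_PEAK
in_window_at_peak = events_in_window(peak_events_per_sec, STREAMING_LATENCY_SEC)
daily_gb = avg_daily_raw_gb(RAW_RETAINED_TB, RETENTION_DAYS)

print(f"combined peak rate: {peak_events_per_sec:,} events/sec")
print(f"events arriving within one latency window at peak: {in_window_at_peak:,}")
print(f"average raw volume: {daily_gb:.1f} GB/day")
```

Peak rates are not sustained rates, so these numbers bound the buffering a streaming path must absorb rather than predict daily totals.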
Requirements
- Design ingestion for both batch and streaming sources using a consistent raw-to-curated pattern.
- Capture PostgreSQL changes incrementally without full reloads and preserve delete events.
- Ingest Stripe and application events in near real time and make them queryable within 2 minutes.
- Build transformation layers for standardized entities such as customers, orders, payments, and service_events.
- Implement orchestration with dependency management, retries, backfills, and idempotent reruns.
- Add data quality checks for schema drift, null-rate spikes, duplicate primary keys, and freshness SLA violations.
- Define monitoring, alerting, and failure recovery for ingestion, transformation, and warehouse load stages.
- Explain how you would support downstream analytics without breaking existing batch consumers during migration.
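Two of the requirements above, preserving delete events and supporting idempotent reruns, can be illustrated together. The following is a minimal in-memory sketch, not a prescribed design: the `ChangeEvent` fields, the `_lsn` and `_deleted` bookkeeping columns, and the `apply_changes` helper are all hypothetical names, and a real warehouse would more likely express the same logic as a merge statement keyed on the primary key.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One CDC row change. Field names are hypothetical."""
    pk: int
    op: str                      # "insert", "update", or "delete"
    lsn: int                     # monotonically increasing log position
    row: Optional[Dict[str, Any]] = None


def apply_changes(state: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[int, Dict[str, Any]]:
    """Merge change events into curated state, keyed by primary key.

    An event is applied only if its lsn is newer than the one stored
    for that key, so replaying a batch leaves the state unchanged.
    Deletes are kept as tombstones rather than dropped.
    """
    for ev in events:
        current = state.get(ev.pk)
        if current is not None and current["_lsn"] >= ev.lsn:
            continue  # already applied or stale: skip for idempotency
        if ev.op == "delete":
            previous = current or {}
            state[ev.pk] = {**previous, "_lsn": ev.lsn, "_deleted": True}
        else:
            state[ev.pk] = {**(ev.row or {}), "_lsn": ev.lsn, "_deleted": False}
    return state


# Illustrative events only; values are made up for the example.
batch = [
    ChangeEvent(pk=1, op="insert", lsn=10, row={"status": "new"}),
    ChangeEvent(pk=1, op="update", lsn=11, row={"status": "paid"}),
    ChangeEvent(pk=2, op="insert", lsn=12, row={"status": "new"}),
    ChangeEvent(pk=2, op="delete", lsn=13),
]

orders: Dict[int, Dict[str, Any]] = {}
apply_changes(orders, batch)
first_pass = {k: dict(v) for k, v in orders.items()}
apply_changes(orders, batch)     # rerun of the same batch

print(orders[1])
print(orders[2])
print(orders == first_pass)
```

Keeping the tombstone row (with `_deleted` set) instead of removing the key lets downstream consumers see that a delete happened, and the log-position comparison is what makes a retried or backfilled batch safe to replay.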
Constraints
- Infrastructure must remain AWS-based.
- Team size is 3 data engineers and 1 platform engineer.
- Monthly incremental platform budget is capped at $18K.
- PII is present; the design must support encryption, access controls, and deletion requests within 7 days.
- Minimize operational complexity; avoid introducing more than one custom service.
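The 7-day deletion constraint implies some way of tracking open requests against a deadline. The sketch below shows one possible shape for that check, assuming a hypothetical `DeletionRequest` record and an arbitrary 2-day warning window; neither comes from the requirements, and the sample dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

DELETION_SLA = timedelta(days=7)     # from the stated constraint


@dataclass
class DeletionRequest:
    """A PII deletion request. Field names are hypothetical."""
    subject_id: str
    received_at: datetime
    completed_at: Optional[datetime] = None

    @property
    def due_at(self) -> datetime:
        return self.received_at + DELETION_SLA


def at_risk(requests: List[DeletionRequest], now: datetime,
            warn_before: timedelta = timedelta(days=2)) -> List[DeletionRequest]:
    """Open requests that are overdue or due within `warn_before`."""
    return [r for r in requests
            if r.completed_at is None and now >= r.due_at - warn_before]


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


now = utc(2024, 1, 10)
requests = [
    DeletionRequest("cust-1", utc(2024, 1, 2)),   # due Jan 9: overdue
    DeletionRequest("cust-2", utc(2024, 1, 4)),   # due Jan 11: inside warning window
    DeletionRequest("cust-3", utc(2024, 1, 9)),   # due Jan 16: not yet at risk
    DeletionRequest("cust-4", utc(2024, 1, 1),
                    completed_at=utc(2024, 1, 5)),  # already completed
]

flagged = [r.subject_id for r in at_risk(requests, now)]
print(flagged)
```

A check like this could run as a scheduled task inside the existing orchestrator rather than as a separate service, which keeps it within the limit of one custom service.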