Context
BrightCart, a mid-sized retail marketplace, currently builds executive dashboards from manually refreshed CSV exports and ad hoc SQL queries. Finance, operations, and growth teams need automated daily and hourly reports with consistent KPI definitions, auditability, and failure recovery.
The company already stores transactional data in PostgreSQL, application events in Amazon S3, and CRM data in Salesforce. Your task is to design a maintainable pipeline that ingests these sources, transforms them into analytics-ready models, and powers dashboards in a BI tool with reliable refreshes.
Scale Requirements
- Sources: PostgreSQL OLTP (~250 GB), S3 event logs (~1.2 TB/day JSON), Salesforce (~8M records)
- Ingestion volume: ~150M new event rows/day, ~12M transactional updates/day
- Latency targets: hourly dashboard freshness for operations, daily refresh by 6:00 AM UTC for executive reporting
- Storage: 3 years of curated warehouse history, 90 days raw landing retention
- Concurrency: 300 BI users, 40 scheduled dashboard/report refreshes per hour
Requirements
- Build automated ingestion from PostgreSQL, S3, and Salesforce into a centralized warehouse.
- Support both hourly incremental loads and daily full reconciliation for critical finance metrics.
- Create curated data models for revenue, orders, refunds, customer cohorts, and marketing attribution.
- Ensure idempotent reruns, backfills for historical periods, and lineage from source to dashboard.
- Add data quality checks for freshness, row-count anomalies, null spikes, and referential integrity.
- Orchestrate dependencies so BI refreshes only run after successful model completion.
- Provide monitoring, alerting, and clear failure handling for partial loads and upstream outages.
Constraints
- AWS-first environment; prefer managed services over self-hosted infrastructure.
- Team size is 3 data engineers, so operational complexity should stay low.
- Budget cap is $18K/month incremental spend.
- Finance reports must be reproducible and support SOX-style audit review.
- Source PostgreSQL cannot tolerate heavy read load during business hours.