Context
BrightCart, a mid-sized e-commerce company, runs daily marketing and sales reporting from Shopify, Google Ads, Meta Ads, and a PostgreSQL order database. Today, analysts manually trigger scripts and rerun failed jobs, causing inconsistent dashboards and delayed reporting.
You need to design an operationally efficient data pipeline platform that standardizes ingestion, transformation, scheduling, and monitoring. The goal is to reduce manual intervention and provide reliable daily and hourly datasets for finance and growth teams.
Scale Requirements
- Sources: 4 primary systems, expanding to 12 within 12 months
- Batch volume: ~250 GB/day raw data
- Records: ~180M rows/day across all sources
- Latency: hourly refresh for ad spend, daily refresh for finance tables
- Retention: 2 years in warehouse, 90 days raw landing storage
- SLA: 99.5% successful scheduled pipeline runs per month
Requirements
- Design a pipeline that ingests API and database data into a centralized platform with minimal manual operations.
- Orchestrate dependencies between ingestion, staging, transformation, and data quality checks (see the DAG sketch after this list).
- Support incremental loads, idempotent reruns, and backfills for the last 90 days (see the partition-replace sketch below).
- Implement validation for schema drift, null spikes, duplicate records, and row-count anomalies (see the check sketch below).
- Expose curated tables for BI dashboards and downstream analysts.
- Define monitoring, alerting, and operational ownership for failures and SLA breaches (see the alerting sketch below).
- Explain how you would choose and use tools to manage operations across scheduling, lineage, logging, and recovery.
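To make the orchestration requirement concrete, here is a minimal Airflow 2.x sketch of the dependency chain (Airflow runs as a managed service on AWS via Amazon MWAA). Task bodies are placeholders, and the DAG, task, and source names are illustrative assumptions rather than a prescribed implementation:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 2,                        # retry transient failures before alerting
    "retry_delay": timedelta(minutes=5),
}

def _noop():
    """Placeholder body; real tasks would call extractors, loaders, or dbt."""

with DAG(
    dag_id="daily_reporting",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # One ingestion task per source; all must land before staging begins.
    ingest = [
        PythonOperator(task_id=f"ingest_{src}", python_callable=_noop)
        for src in ("shopify", "google_ads", "meta_ads", "postgres_orders")
    ]
    stage = PythonOperator(task_id="load_staging", python_callable=_noop)
    transform = PythonOperator(task_id="run_transformations", python_callable=_noop)
    dq_checks = PythonOperator(task_id="data_quality_checks", python_callable=_noop)

    ingest >> stage >> transform >> dq_checks
```

Keeping quality checks as the final task means a failed check blocks the curated tables rather than silently publishing bad data.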
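For the incremental, idempotent, and backfill requirement, one common pattern is to make each run own exactly one date partition and replace it wholesale, so rerunning any day (or replaying all 90 days) converges to the same state. A minimal sketch, assuming a DB-API connection and illustrative landing/staging table names:

```python
from datetime import date, timedelta

def load_orders_partition(conn, ds: date) -> None:
    """Idempotently (re)load one day's partition: delete-then-insert inside a
    single transaction, so any rerun converges to the same final state."""
    with conn.cursor() as cur:
        # Remove whatever a previous (possibly partial) run left for this date.
        cur.execute("DELETE FROM staging.orders WHERE order_date = %s", (ds,))
        # Reload the full slice from the raw landing zone.
        cur.execute(
            "INSERT INTO staging.orders "
            "SELECT * FROM landing.orders_raw WHERE order_date = %s",
            (ds,),
        )
    conn.commit()  # both statements commit atomically (DB-API autocommit is off)

def backfill_last_90_days(conn) -> None:
    """Replay the retention window one partition at a time."""
    for offset in range(90):
        load_orders_partition(conn, date.today() - timedelta(days=offset))
```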
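For the validation requirement, a lightweight approach is a table of SQL probes executed after each load, failing the task when a threshold is breached. The queries, table names, and thresholds below are illustrative assumptions (a managed framework such as Great Expectations, or dbt tests, is an equally valid choice):

```python
# Each check maps a name to (SQL returning a single number, max allowed value).
CHECKS = {
    "duplicate_order_ids": (
        "SELECT COUNT(*) FROM (SELECT order_id FROM staging.orders "
        "GROUP BY order_id HAVING COUNT(*) > 1) AS dupes",
        0,
    ),
    "null_customer_ids": (
        "SELECT COUNT(*) FROM staging.orders WHERE customer_id IS NULL",
        100,  # tolerate a small, known trickle of guest checkouts
    ),
    "row_count_drop": (
        # Flag when today's volume falls below half the trailing 7-day mean.
        "SELECT CASE WHEN today < 0.5 * weekly_avg THEN 1 ELSE 0 END FROM ("
        "SELECT COUNT(*) FILTER (WHERE order_date = CURRENT_DATE) AS today, "
        "COUNT(*) / 7.0 AS weekly_avg FROM staging.orders "
        "WHERE order_date >= CURRENT_DATE - 7) t",
        0,
    ),
}

def run_checks(conn) -> None:
    """Raise on the first breached threshold so the orchestrator marks the
    task failed and holds downstream dashboard refreshes."""
    with conn.cursor() as cur:
        for name, (sql, max_allowed) in CHECKS.items():
            cur.execute(sql)
            (value,) = cur.fetchone()
            if value > max_allowed:
                raise ValueError(
                    f"Data quality check '{name}' failed: {value} > {max_allowed}"
                )
```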
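For monitoring and alerting, Airflow's failure callbacks can publish to an SNS topic that fans out to email or Slack, giving the on-call engineer the failing DAG, task, and run date. A minimal sketch; the topic ARN is a placeholder assumption:

```python
import boto3

# Placeholder assumption: an SNS topic wired to email/Slack for on-call.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts"

def notify_failure(context):
    """Airflow on_failure_callback: fires after all retries are exhausted."""
    ti = context["task_instance"]
    boto3.client("sns").publish(
        TopicArn=ALERT_TOPIC_ARN,
        Subject=f"Pipeline failure: {ti.dag_id}.{ti.task_id}",
        Message=(
            f"Task {ti.dag_id}.{ti.task_id} failed for run {context['ds']} "
            f"after {ti.try_number} attempts. Check the task log for details."
        ),
    )

# Attach once via default_args so every task in every DAG is covered.
default_args = {"on_failure_callback": notify_failure}
```

Pairing this with SLA-breach alerts and a documented on-call rotation covers the operational-ownership side of the requirement.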
Constraints
- Existing cloud footprint is AWS.
- Team size is 3 data engineers and 1 analytics engineer.
- The budget rules out large always-on clusters.
- PII from the orders and customer tables must be encrypted at rest and access-controlled (see the encryption sketch after this list).
- Prefer managed services wherever they lower operational overhead.
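To honor the PII constraint on AWS, default encryption with a customer-managed KMS key can be enforced on the raw landing bucket, with access control handled through the key policy plus IAM. A minimal sketch; the bucket name and key alias are illustrative assumptions:

```python
import boto3

s3 = boto3.client("s3")

# Enforce SSE-KMS as the bucket default so every object written by any
# ingestion job is encrypted at rest with the customer-managed key.
s3.put_bucket_encryption(
    Bucket="brightcart-raw-landing",  # illustrative bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/brightcart-pii",  # illustrative alias
                },
                "BucketKeyEnabled": True,  # reduce per-object KMS request costs
            }
        ]
    },
)
```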