Context
InsightLoop, a B2B SaaS company, currently screens candidates by asking about the data analysis tools they know, but the data team wants a more practical assessment: design a pipeline that consolidates data from multiple analytics tools into a governed warehouse. Today, the product, finance, and operations teams export CSVs from Google Analytics, Salesforce, Stripe, and an internal PostgreSQL database into ad hoc spreadsheets, which leads to inconsistent metrics and delayed reporting.
You are asked to design a production-grade batch-first analytics pipeline that standardizes these sources into a single warehouse and makes curated tables available for analysts.
Scale Requirements
- Sources: Google Analytics 4 export, Salesforce API, Stripe API, internal PostgreSQL
- Daily volume: 250 GB raw data/day, growing 8% month-over-month
- Tables/entities: ~120 source objects, 40 curated marts
- Latency target: Source data available in warehouse within 30 minutes of scheduled extraction
- Batch frequency: Hourly for PostgreSQL and Stripe, every 4 hours for Salesforce, daily backfill for GA4
- Retention: 2 years raw, 5 years curated (a rough storage projection follows this list)
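The volume and retention figures above drive the storage and cost story, so it is worth working the numbers before picking technology. A minimal back-of-envelope sketch, assuming uniform 8% month-over-month growth and a 30-day month, projects the raw footprint over the 2-year retention window:

```python
# Rough projection of raw-data storage under the stated assumptions:
# 250 GB/day today, 8% month-over-month growth, 2-year raw retention.
# A back-of-envelope sketch, not a sizing guarantee.

DAILY_GB_TODAY = 250
MONTHLY_GROWTH = 0.08
MONTHS = 24          # raw retention window
DAYS_PER_MONTH = 30  # approximation

total_raw_tb = 0.0
daily_gb = DAILY_GB_TODAY
for month in range(1, MONTHS + 1):
    total_raw_tb += daily_gb * DAYS_PER_MONTH / 1024  # GB -> TB
    daily_gb *= 1 + MONTHLY_GROWTH

print(f"Daily volume after {MONTHS} months: ~{daily_gb:,.0f} GB/day")
print(f"Raw data accumulated over {MONTHS} months: ~{total_raw_tb:,.0f} TB")
```

Under these assumptions the daily volume grows to roughly 1.6 TB/day and the 2-year raw window approaches half a petabyte, which is one reason to land raw data in cheap object storage (e.g. S3) and reserve warehouse-native storage for curated layers, given the $15K/month budget noted under Constraints.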
Requirements
- Build an ingestion framework that supports API extraction, database replication, and file-based loads.
- Standardize raw data into a warehouse using reproducible ELT transformations.
- Support incremental loads, idempotent reruns, and historical backfills (see the incremental-load sketch after this list).
- Create curated analyst-facing models for revenue, customer activity, and funnel reporting.
- Add data quality checks for schema drift, null spikes, duplicate primary keys, and freshness (see the quality-check sketch after this list).
- Orchestrate dependencies so downstream models run only after upstream loads succeed.
- Provide clear monitoring, alerting, and recovery procedures for failed jobs.
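To make the incremental-load requirement concrete, here is a minimal sketch of one idempotent pattern: each run owns a logical extraction window and overwrites its own raw partition, so reruns and backfills never create duplicates. The table name, `updated_at` column, and local paths are illustrative assumptions, with a local directory standing in for an S3 raw zone:

```python
# Sketch of an idempotent incremental load keyed by a logical extraction window.
# Assumptions (not from the prompt): rows carry an `updated_at` column, and the
# raw zone is partitioned by logical date so a rerun overwrites its own
# partition instead of appending duplicates.

import csv
import shutil
from datetime import datetime, timedelta
from pathlib import Path

RAW_ROOT = Path("raw/postgres/orders")  # stand-in for s3://raw/postgres/orders

def extract_window(rows, window_start, window_end):
    """Select only rows whose updated_at falls inside the logical window."""
    return [r for r in rows if window_start <= r["updated_at"] < window_end]

def load_partition(rows, logical_date: str):
    """Overwrite the partition for this logical date: reruns are idempotent."""
    partition = RAW_ROOT / f"dt={logical_date}"
    if partition.exists():
        shutil.rmtree(partition)          # drop any partial/previous output
    partition.mkdir(parents=True)
    with open(partition / "orders.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "amount", "updated_at"])
        writer.writeheader()
        for r in rows:
            writer.writerow({**r, "updated_at": r["updated_at"].isoformat()})

if __name__ == "__main__":
    # Fake source rows standing in for a PostgreSQL extraction.
    source = [
        {"id": 1, "amount": 120, "updated_at": datetime(2024, 5, 1, 9, 30)},
        {"id": 2, "amount": 80,  "updated_at": datetime(2024, 5, 1, 18, 5)},
        {"id": 3, "amount": 45,  "updated_at": datetime(2024, 5, 2, 7, 45)},
    ]
    window_start = datetime(2024, 5, 1)
    window_end = window_start + timedelta(days=1)
    rows = extract_window(source, window_start, window_end)
    load_partition(rows, window_start.date().isoformat())
    # Running this twice produces the same partition contents: no duplicates.
```

The same window-plus-overwrite idea carries through to the warehouse layer (delete-and-insert or merge keyed on the window), which turns historical backfills into replaying windows rather than writing one-off scripts.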
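Likewise for the quality-check and orchestration requirements, a minimal sketch of checks that run between load and transform; the function names and thresholds are illustrative assumptions. Raising an exception fails the task in whatever orchestrator is used, which is how downstream models are kept from running on bad or stale data:

```python
# Sketch of post-load data quality checks gating downstream models.
# In an orchestrated setup these run as a task between the load and the
# transformation; an unhandled exception fails the task and skips dependents.

from datetime import datetime, timedelta

def check_duplicate_keys(rows, key="id"):
    keys = [r[key] for r in rows]
    dupes = {k for k in keys if keys.count(k) > 1}
    if dupes:
        raise ValueError(f"duplicate primary keys: {sorted(dupes)}")

def check_null_spike(rows, column, max_null_ratio=0.05):
    nulls = sum(1 for r in rows if r.get(column) is None)
    ratio = nulls / max(len(rows), 1)
    if ratio > max_null_ratio:
        raise ValueError(f"null ratio {ratio:.1%} in {column} exceeds {max_null_ratio:.0%}")

def check_freshness(latest_loaded_at, max_lag=timedelta(minutes=30)):
    lag = datetime.utcnow() - latest_loaded_at
    if lag > max_lag:
        raise ValueError(f"data is stale by {lag}, target is {max_lag}")

if __name__ == "__main__":
    rows = [
        {"id": 1, "amount": 120},
        {"id": 2, "amount": None},
        {"id": 3, "amount": 45},
    ]
    check_duplicate_keys(rows)
    check_null_spike(rows, "amount", max_null_ratio=0.5)
    check_freshness(datetime.utcnow() - timedelta(minutes=10))
    print("all checks passed; downstream models may run")
```

Schema-drift detection follows the same shape: compare the extracted column set against an expected contract and fail loudly on mismatch before anything lands in curated tables.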
Constraints
- Existing stack is AWS-based, and the team prefers managed services over self-hosted infrastructure.
- Incremental budget is limited to $15K/month.
- Finance data must meet SOX-style auditability requirements.
- The team has 3 data engineers, so operational complexity should stay low.
Describe the end-to-end architecture, technology choices, orchestration strategy, and data model layers, and explain how you would evaluate a candidate's proficiency with these tools through their implementation decisions rather than through a simple list of software names.