Context
InsightLoop, a B2B SaaS company, currently screens candidates by asking about the data analysis tools they know, but the data team wants a more practical assessment: design a pipeline that consolidates data from multiple analytics tools into a governed warehouse. Today, the product, finance, and operations teams export CSVs from Google Analytics, Salesforce, Stripe, and an internal PostgreSQL database into ad hoc spreadsheets, which leads to inconsistent metrics and delayed reporting.
You are asked to design a production-grade batch-first analytics pipeline that standardizes these sources into a single warehouse and makes curated tables available for analysts.
Scale Requirements
- Sources: Google Analytics 4 export, Salesforce API, Stripe API, internal PostgreSQL
- Daily volume: 250 GB raw data/day, growing 8% month-over-month
- Tables/entities: ~120 source objects, 40 curated marts
- Latency target: Source data available in warehouse within 30 minutes of scheduled extraction
- Batch frequency: Hourly for PostgreSQL and Stripe, every 4 hours for Salesforce, daily backfill for GA4
- Retention: 2 years raw, 5 years curated (a rough storage projection follows this list)
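The volume and retention figures above drive the storage and cost story, so it is worth working the numbers before picking technology. A minimal back-of-envelope sketch, assuming uniform 8% month-over-month growth and a 30-day month, projects the raw footprint over the 2-year retention window:

```python
# Rough projection of raw-data storage under the stated assumptions:
# 250 GB/day today, 8% month-over-month growth, 2-year raw retention.
# A back-of-envelope sketch, not a sizing guarantee.

DAILY_GB_TODAY = 250
MONTHLY_GROWTH = 0.08
MONTHS = 24          # raw retention window
DAYS_PER_MONTH = 30  # approximation

total_raw_tb = 0.0
daily_gb = DAILY_GB_TODAY
for month in range(1, MONTHS + 1):
    total_raw_tb += daily_gb * DAYS_PER_MONTH / 1024  # GB -> TB
    daily_gb *= 1 + MONTHLY_GROWTH

print(f"Daily volume after {MONTHS} months: ~{daily_gb:,.0f} GB/day")
print(f"Raw data accumulated over {MONTHS} months: ~{total_raw_tb:,.0f} TB")
```

Under these assumptions the daily volume grows to roughly 1.6 TB/day and the 2-year raw window approaches half a petabyte, which is one reason to land raw data in cheap object storage (e.g. S3) and reserve warehouse-native storage for curated layers, given the $15K/month budget noted under Constraints.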
Requirements
- Build an ingestion framework that supports API extraction, database replication, and file-based loads.
- Standardize raw data into a warehouse using reproducible ELT transformations.
- Support incremental loads, idempotent reruns, and historical backfills (see the incremental-load sketch after this list).
- Create curated analyst-facing models for revenue, customer activity, and funnel reporting.
- Add data quality checks for schema drift, null spikes, duplicate primary keys, and freshness (see the quality-check sketch after this list).
- Orchestrate dependencies so downstream models run only after upstream loads succeed.
- Provide clear monitoring, alerting, and recovery procedures for failed jobs.
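To make the incremental-load requirement concrete, here is a minimal sketch of one idempotent pattern: each run owns a logical extraction window and overwrites its own raw partition, so reruns and backfills never create duplicates. The table name, `updated_at` column, and local paths are illustrative assumptions, with a local directory standing in for an S3 raw zone:

```python
# Sketch of an idempotent incremental load keyed by a logical extraction window.
# Assumptions (not from the prompt): rows carry an `updated_at` column, and the
# raw zone is partitioned by logical date so a rerun overwrites its own
# partition instead of appending duplicates.

import csv
import shutil
from datetime import datetime, timedelta
from pathlib import Path

RAW_ROOT = Path("raw/postgres/orders")  # stand-in for s3://raw/postgres/orders

def extract_window(rows, window_start, window_end):
    """Select only rows whose updated_at falls inside the logical window."""
    return [r for r in rows if window_start <= r["updated_at"] < window_end]

def load_partition(rows, logical_date: str):
    """Overwrite the partition for this logical date: reruns are idempotent."""
    partition = RAW_ROOT / f"dt={logical_date}"
    if partition.exists():
        shutil.rmtree(partition)          # drop any partial/previous output
    partition.mkdir(parents=True)
    with open(partition / "orders.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "amount", "updated_at"])
        writer.writeheader()
        for r in rows:
            writer.writerow({**r, "updated_at": r["updated_at"].isoformat()})

if __name__ == "__main__":
    # Fake source rows standing in for a PostgreSQL extraction.
    source = [
        {"id": 1, "amount": 120, "updated_at": datetime(2024, 5, 1, 9, 30)},
        {"id": 2, "amount": 80,  "updated_at": datetime(2024, 5, 1, 18, 5)},
        {"id": 3, "amount": 45,  "updated_at": datetime(2024, 5, 2, 7, 45)},
    ]
    window_start = datetime(2024, 5, 1)
    window_end = window_start + timedelta(days=1)
    rows = extract_window(source, window_start, window_end)
    load_partition(rows, window_start.date().isoformat())
    # Running this twice produces the same partition contents: no duplicates.
```

The same window-plus-overwrite idea carries through to the warehouse layer (delete-and-insert or merge keyed on the window), which turns historical backfills into replaying windows rather than writing one-off scripts.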
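Likewise for the quality-check and orchestration requirements, a minimal sketch of checks that run between load and transform; the function names and thresholds are illustrative assumptions. Raising an exception fails the task in whatever orchestrator is used, which is how downstream models are kept from running on bad or stale data:

```python
# Sketch of post-load data quality checks gating downstream models.
# In an orchestrated setup these run as a task between the load and the
# transformation; an unhandled exception fails the task and skips dependents.

from datetime import datetime, timedelta

def check_duplicate_keys(rows, key="id"):
    keys = [r[key] for r in rows]
    dupes = {k for k in keys if keys.count(k) > 1}
    if dupes:
        raise ValueError(f"duplicate primary keys: {sorted(dupes)}")

def check_null_spike(rows, column, max_null_ratio=0.05):
    nulls = sum(1 for r in rows if r.get(column) is None)
    ratio = nulls / max(len(rows), 1)
    if ratio > max_null_ratio:
        raise ValueError(f"null ratio {ratio:.1%} in {column} exceeds {max_null_ratio:.0%}")

def check_freshness(latest_loaded_at, max_lag=timedelta(minutes=30)):
    lag = datetime.utcnow() - latest_loaded_at
    if lag > max_lag:
        raise ValueError(f"data is stale by {lag}, target is {max_lag}")

if __name__ == "__main__":
    rows = [
        {"id": 1, "amount": 120},
        {"id": 2, "amount": None},
        {"id": 3, "amount": 45},
    ]
    check_duplicate_keys(rows)
    check_null_spike(rows, "amount", max_null_ratio=0.5)
    check_freshness(datetime.utcnow() - timedelta(minutes=10))
    print("all checks passed; downstream models may run")
```

Schema-drift detection follows the same shape: compare the extracted column set against an expected contract and fail loudly on mismatch before anything lands in curated tables.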
Constraints
- Existing stack is AWS-based, and the team prefers managed services over self-hosted infrastructure.
- Incremental budget is limited to $15K/month.
- Finance data must meet SOX-style auditability requirements.
- The team has 3 data engineers, so operational complexity should stay low.
Describe the end-to-end architecture, technology choices, orchestration strategy, and data model layers, and explain how you would evaluate a candidate's proficiency with these tools through their implementation decisions rather than through a simple list of software names.