Context
InsightLoop, a B2B SaaS analytics company, collects product events from web SDKs, backend services, and third-party CRM exports. Today, data arrives through a mix of nightly CSV uploads and ad hoc API pulls into PostgreSQL, causing inconsistent schemas, duplicate records, and 8-12 hour reporting delays.
You need to design a modern data collection and analysis pipeline that supports both near-real-time operational analytics and scheduled batch reporting. The system must standardize ingestion, validate data quality, and produce analytics-ready tables for BI and downstream data science use cases.
Scale Requirements
- Sources: 3 primary sources initially (web/mobile events, application database CDC, daily SaaS exports)
- Throughput: 120K events/sec at peak, 25K events/sec on average
- Payload size: 1-3 KB JSON per streaming event; batch files total 50-200 GB/day
- Latency: <2 minutes for streaming data to warehouse, <2 hours for batch sources
- Storage: 6 TB/day raw, 18 months retained in the data lake, curated warehouse tables retained indefinitely (a rough sizing check follows this list)
- Consumers: 150 BI users, 20 scheduled dashboards, 8 downstream ML/analytics jobs
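The storage figure is roughly consistent with the throughput figures. A quick back-of-the-envelope check, assuming a 2.5 KB average event (the midpoint of the stated 1-3 KB range) and batch files at the top of the stated 50-200 GB/day range:

```python
# Back-of-the-envelope sizing check; the 2.5 KB average event size is an
# assumption within the stated 1-3 KB range.
AVG_EVENTS_PER_SEC = 25_000      # stated average throughput
AVG_EVENT_BYTES = 2_500          # assumed midpoint of 1-3 KB
BATCH_BYTES_PER_DAY = 200e9      # upper end of the stated batch volume
SECONDS_PER_DAY = 86_400

streaming_bytes = AVG_EVENTS_PER_SEC * AVG_EVENT_BYTES * SECONDS_PER_DAY
total_tb = (streaming_bytes + BATCH_BYTES_PER_DAY) / 1e12

print(f"streaming:         {streaming_bytes / 1e12:.1f} TB/day")  # ~5.4 TB/day
print(f"streaming + batch: {total_tb:.1f} TB/day")                # ~5.6 TB/day
```

That lands just under the stated 6 TB/day raw once landing-format and metadata overhead are added, so the figures hang together.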
Requirements
- Design an ingestion layer for both streaming and batch data sources (a streaming producer sketch follows this list).
- Define how you would handle schema evolution, deduplication, and late-arriving data (see the merge sketch below).
- Build raw, cleaned, and curated layers suitable for analytics and backfills.
- Ensure idempotent processing for retries and replays (the same merge sketch applies).
- Orchestrate transformations and dependencies between ingestion, validation, and warehouse modeling (see the DAG sketch below).
- Propose monitoring for freshness, throughput, failed loads, and data quality (see the freshness check below).
- Support replay/backfill of the last 90 days without corrupting downstream tables (see the backfill sketch below).
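For the streaming side of the ingestion layer, one option that fits the limited-Kafka-operations constraint is Amazon Kinesis Data Streams, with Firehose delivering micro-batches to S3. The sketch below is a minimal producer wrapper, not a prescribed design; the stream name, region, and event fields are all illustrative assumptions.

```python
import json

import boto3

# Minimal streaming-producer sketch, assuming Kinesis Data Streams is chosen
# over self-managed Kafka. Stream name, region, and event shape are illustrative.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_event(event: dict) -> None:
    """Send one product event; partitioning by event_id keeps retries of the
    same event on the same shard, which simplifies downstream deduplication."""
    kinesis.put_record(
        StreamName="product-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["event_id"]),
    )

send_event({
    "event_id": "9f2c7b1e-demo-4a5d-8f3a-000000000001",  # client-generated ID (assumed)
    "event_type": "feature_used",
    "occurred_at": "2024-05-01T12:00:00Z",
    "properties": {"plan": "enterprise"},
})
```

Batch sources (SaaS exports, CDC extracts) can land directly in S3 and load on a schedule, so both paths converge on the same raw layer.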
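For deduplication, late-arriving data, and idempotent retries, a common pattern is to land everything append-only in the raw layer and upsert into the cleaned layer with a MERGE keyed on a stable event ID. The sketch below runs such a merge against Snowflake; the table and column names (raw.events_stage, cleaned.events, event_id, ingested_at, and so on) are assumptions for illustration.

```python
import snowflake.connector

# Idempotent upsert from an append-only staging table into the cleaned layer.
# Re-running the same batch (retries, replays, late arrivals) updates rows in
# place instead of duplicating them. All object names are illustrative.
MERGE_SQL = """
MERGE INTO cleaned.events AS t
USING (
    -- collapse intra-batch duplicates, keeping the latest copy of each event
    SELECT *
    FROM raw.events_stage
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY event_id ORDER BY ingested_at DESC
    ) = 1
) AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET
    t.event_type  = s.event_type,
    t.occurred_at = s.occurred_at,
    t.payload     = s.payload,
    t.ingested_at = s.ingested_at
WHEN NOT MATCHED THEN INSERT (event_id, event_type, occurred_at, payload, ingested_at)
    VALUES (s.event_id, s.event_type, s.occurred_at, s.payload, s.ingested_at)
"""

conn = snowflake.connector.connect(
    account="insightloop",   # placeholder connection details;
    user="etl_user",         # use a secrets manager in practice
    password="...",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
)
conn.cursor().execute(MERGE_SQL)
```

This is the same behavior dbt's incremental materialization with a unique_key produces on Snowflake, which may suit the team's existing dbt skills better than hand-written merges. Schema evolution is also easier if the raw payload is kept as a VARIANT column and new fields are promoted into the cleaned layer as they stabilize.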
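For orchestration, a managed Airflow deployment (MWAA, assuming Airflow 2.4+) is one option the team could run without heavy operational overhead. The DAG below sketches the load -> validate -> model chain; the DAG id, schedule, scripts, and dbt command are illustrative assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch of the scheduled batch path: load staged files, run data-quality
# checks, then build warehouse models with dbt. Names and schedule are illustrative.
with DAG(
    dag_id="insightloop_hourly_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    load_raw = BashOperator(
        task_id="load_raw",
        bash_command="python jobs/load_s3_to_snowflake.py",   # hypothetical loader script
    )
    validate = BashOperator(
        task_id="validate",
        bash_command="python jobs/run_quality_checks.py",     # hypothetical checks script
    )
    model = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build",                             # builds and tests all dbt models
    )

    load_raw >> validate >> model  # validation gates warehouse modeling
```

Keeping validation as an explicit task means a failed quality check blocks the dbt build rather than publishing bad data.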
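For freshness monitoring, the simplest check compares the newest loaded timestamp in a cleaned or curated table against the latency targets above. A minimal sketch, assuming a loaded_at TIMESTAMP_TZ column and placeholder connection details:

```python
from datetime import datetime, timedelta, timezone

import snowflake.connector

# Freshness check on the cleaned events table; the 2-minute threshold mirrors
# the streaming latency target. Table, column, and connection details are assumed.
FRESHNESS_SLA = timedelta(minutes=2)

conn = snowflake.connector.connect(account="insightloop", user="monitor_user",
                                    password="...", warehouse="MONITOR_WH")
latest_load = conn.cursor().execute(
    "SELECT MAX(loaded_at) FROM analytics.cleaned.events"
).fetchone()[0]

lag = datetime.now(timezone.utc) - latest_load
if lag > FRESHNESS_SLA:
    # Swap the print for a Slack/PagerDuty/CloudWatch alert in a real deployment.
    print(f"ALERT: cleaned.events is {lag} behind (SLA {FRESHNESS_SLA})")
```

dbt's built-in `dbt source freshness` command expresses the same check declaratively, which may be a better fit for the team's dbt skills; throughput and failed-load counts can come from metrics on the ingestion services themselves.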
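For 90-day replays, a partition-overwrite pattern keeps backfills idempotent: each run deletes and rebuilds one event-date's worth of data inside a transaction, so rerunning a day can never double-count it. A minimal sketch, reusing the illustrative table names from the merge sketch:

```python
from datetime import date, timedelta

import snowflake.connector

# Rebuild one day at a time from the raw layer. Delete+insert per day inside a
# transaction is idempotent, so repeated or failed runs cannot corrupt the table.
conn = snowflake.connector.connect(account="insightloop", user="etl_user",
                                    password="...", warehouse="TRANSFORM_WH",
                                    database="ANALYTICS")
cur = conn.cursor()

start = date.today() - timedelta(days=90)
for offset in range(90):
    day = start + timedelta(days=offset)
    cur.execute("BEGIN")
    cur.execute("DELETE FROM cleaned.events WHERE CAST(occurred_at AS DATE) = %s", (day,))
    cur.execute(
        """
        INSERT INTO cleaned.events (event_id, event_type, occurred_at, payload, ingested_at)
        SELECT event_id, event_type, occurred_at, payload, ingested_at
        FROM raw.events
        WHERE CAST(occurred_at AS DATE) = %s
        QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at DESC) = 1
        """,
        (day,),
    )
    cur.execute("COMMIT")
```

Downstream dbt models can then be rebuilt for the same window, since they read from the already-deduplicated cleaned layer.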
Constraints
- Cloud environment is AWS; Snowflake is already the company warehouse.
- Team has strong SQL/dbt skills, moderate Spark experience, and limited Kafka operations expertise.
- Incremental platform budget is capped at $35K/month.
- Must meet SOC 2 controls and support deletion requests within 7 days for regulated customer data (a deletion-handling sketch follows).
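For the 7-day deletion SLA, one approach is a small compliance job driven by a queue of requests: it removes the customer's rows from the warehouse layers and records completion for the SOC 2 audit trail. A minimal sketch, assuming a customer_id column on the listed tables and a compliance.deletion_requests tracking table (all assumptions); purging raw copies in the S3 lake would be handled separately, for example by rewriting the affected files or partitions.

```python
import snowflake.connector

# Process one deletion request across warehouse layers and record completion.
# Table and column names are illustrative; S3/lake purging is out of scope here.
TABLES = ["cleaned.events", "curated.fct_product_usage", "curated.dim_accounts"]

def process_deletion(customer_id: str) -> None:
    conn = snowflake.connector.connect(account="insightloop", user="compliance_user",
                                        password="...", warehouse="TRANSFORM_WH",
                                        database="ANALYTICS")
    cur = conn.cursor()
    for table in TABLES:
        cur.execute(f"DELETE FROM {table} WHERE customer_id = %s", (customer_id,))
    cur.execute(
        "UPDATE compliance.deletion_requests "
        "SET completed_at = CURRENT_TIMESTAMP() WHERE customer_id = %s",
        (customer_id,),
    )
```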