Context
A Databricks customer runs a serverless event-ingestion platform for web and mobile activity. Session state is currently reconstructed in ad hoc downstream jobs, causing inconsistent session boundaries, duplicate attribution, and poor latency for product analytics. You need to design a Databricks-native pipeline that maintains stateful user sessions in a serverless environment while keeping operations simple for a small platform team.
The company wants a near-real-time lakehouse pipeline built on Databricks Structured Streaming, Delta Lake, Delta Live Tables / Lakeflow Declarative Pipelines, and Databricks Workflows, rather than operating long-lived stateful services directly.
Scale Requirements
- Ingress: 250K events/sec peak, 60K events/sec average
- Event size: 1.5 KB average JSON
- Daily volume: ~18 TB raw
- Latency target: sessionized records queryable within 2 minutes of event arrival
- Session rule: 30-minute inactivity timeout, cross-device merge within 5 minutes after identity resolution
- Retention: raw events 180 days, session tables 2 years
- Availability: 99.9% pipeline SLA
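As a sanity check on the numbers above, a quick back-of-envelope calculation converts the stated event rates into byte throughput and daily volume. This is plain arithmetic over the figures in the requirements, not a sizing claim; note the ~18 TB/day raw figure sits between the average-rate and peak-rate extrapolations, which is consistent with some headroom for retries, duplicates, and envelope overhead.

```python
# Back-of-envelope capacity check using only the figures stated above.
avg_eps = 60_000    # average events/sec
peak_eps = 250_000  # peak events/sec
event_kb = 1.5      # average event size in KB

avg_mb_per_s = avg_eps * event_kb / 1024           # ~88 MB/s sustained
peak_mb_per_s = peak_eps * event_kb / 1024         # ~366 MB/s at peak
avg_tb_per_day = avg_mb_per_s * 86_400 / 1024**2   # ~7.2 TB/day at the average rate

# The stated ~18 TB/day raw is between the average-rate extrapolation (~7.2 TB)
# and the peak-rate extrapolation (~30 TB), leaving room for duplicates/replays.
print(f"{avg_mb_per_s:.1f} MB/s avg, {peak_mb_per_s:.1f} MB/s peak, "
      f"{avg_tb_per_day:.2f} TB/day at avg rate")
```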
Requirements
- Ingest ordered and unordered events from web SDKs, mobile SDKs, and backend APIs into the Databricks Lakehouse.
- Build a stateful sessionization pipeline that handles late events, duplicate events, and identity merges without relying on sticky application servers.
- Persist intermediate and final state in Delta tables so the pipeline can recover safely after serverless restarts.
- Support both streaming session tables for product analytics and daily backfills for corrected identity mappings.
- Ensure idempotent writes, replayability, and exactly-once semantics where feasible.
- Expose monitoring for lag, state growth, bad records, and session merge anomalies.
- Orchestrate streaming and batch correction jobs using Databricks-native tooling.
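To make the sessionization requirement concrete, the per-user state transition (30-minute inactivity timeout, in-session dedup, tolerance for modestly out-of-order events) can be modeled in plain Python. This is an illustrative sketch of the fold logic a Structured Streaming arbitrary-stateful operator would apply per key; the names `SessionState` and `update_session` are hypothetical, and late events older than the stream's watermark are assumed to be handled upstream (dropped or routed to the batch-correction path).

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

SESSION_GAP_S = 30 * 60  # 30-minute inactivity timeout from the session rule

@dataclass
class SessionState:
    session_id: int = 0              # monotonically increasing per user
    start_ts: Optional[float] = None
    last_ts: Optional[float] = None
    event_count: int = 0
    seen_ids: Set[str] = field(default_factory=set)  # per-session dedup set

def update_session(state: SessionState, event_id: str, ts: float) -> Tuple[SessionState, bool]:
    """Fold one event into per-user state; returns (state, started_new_session).

    Duplicate deliveries (same event_id within the session) are no-ops, and
    modestly out-of-order events extend the current session rather than
    reopening it."""
    if event_id in state.seen_ids:
        return state, False  # duplicate delivery: idempotent no-op
    new_session = state.last_ts is None or ts - state.last_ts > SESSION_GAP_S
    if new_session:
        state.session_id += 1
        state.start_ts = ts
        state.event_count = 0
        state.seen_ids = set()
    state.seen_ids.add(event_id)
    state.event_count += 1
    state.last_ts = ts if new_session else max(state.last_ts, ts)
    return state, new_session
```

In the real pipeline this state would live in the streaming engine's checkpointed state store (backed by cloud storage), with closed sessions emitted to a Delta table via idempotent MERGE, so a serverless restart replays from the checkpoint without double-counting.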
Constraints
- Use Databricks on AWS with serverless compute where possible.
- No external Redis or self-managed Kafka Streams state store.
- PII must be encrypted and support deletion within 30 days.
- Incremental cloud spend target: <$40K/month.
- Team size: 5 data engineers, 1 platform engineer.
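One common way to satisfy the 30-day PII deletion constraint without rewriting terabytes of raw history is crypto-shredding: encrypt each user's PII under a per-user key and delete only the key to render the data unrecoverable. The toy sketch below illustrates the pattern only; the `PiiVault` name is hypothetical, and the SHA-256 XOR keystream is a stand-in for real envelope encryption via a managed KMS (e.g., AWS KMS), which a production system would use instead.

```python
import hashlib
import secrets

class PiiVault:
    """Toy crypto-shredding sketch: one key per user; deleting the key makes
    that user's encrypted PII unrecoverable. The XOR keystream below is
    illustrative only -- production would use KMS-backed envelope encryption."""

    def __init__(self) -> None:
        self._keys = {}  # user_id -> 32-byte key

    def _keystream(self, key: bytes, n: int) -> bytes:
        # Deterministic keystream derived from the key via SHA-256 in counter mode.
        out = b""
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def encrypt(self, user_id: str, plaintext: bytes) -> bytes:
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        ks = self._keystream(key, len(plaintext))
        return bytes(a ^ b for a, b in zip(plaintext, ks))

    def decrypt(self, user_id: str, ciphertext: bytes) -> bytes:
        key = self._keys[user_id]  # raises KeyError once the user is shredded
        ks = self._keystream(key, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, ks))

    def shred(self, user_id: str) -> None:
        # Deletion request: dropping the key logically erases all of this
        # user's PII across raw and session tables within the SLA window.
        self._keys.pop(user_id, None)
```

The appeal for this design is operational: the key table is small and mutable, while the 180-day raw event history stays append-only.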
Describe the end-to-end architecture, state management strategy, table design, orchestration, monitoring, and failure recovery approach.