Context
A Databricks customer runs a serverless event-ingestion platform for web and mobile activity. Session state is currently reconstructed in ad hoc downstream jobs, causing inconsistent session boundaries, duplicate attribution, and poor latency for product analytics. You need to design a Databricks-native pipeline that maintains stateful user sessions in a serverless environment while keeping operations simple for a small platform team.
The company wants a near-real-time lakehouse pipeline built on Databricks Structured Streaming, Delta Lake, Delta Live Tables / Lakeflow Declarative Pipelines, and Databricks Workflows, rather than operating long-lived stateful services directly.
Scale Requirements
- Ingress: 250K events/sec peak, 60K events/sec average
- Event size: 1.5 KB average JSON
- Daily volume: ~18 TB raw
- Latency target: sessionized records queryable within 2 minutes of event arrival
- Session rule: 30-minute inactivity timeout, cross-device merge within 5 minutes after identity resolution
- Retention: raw events 180 days, session tables 2 years
- Availability: 99.9% pipeline SLA
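As a sanity check on the numbers above, a quick back-of-envelope calculation converts the stated event rates into byte throughput and daily volume. This is plain arithmetic over the figures in the requirements, not a sizing claim; note the ~18 TB/day raw figure sits between the average-rate and peak-rate extrapolations, which is consistent with some headroom for retries, duplicates, and envelope overhead.

```python
# Back-of-envelope capacity check using only the figures stated above.
avg_eps = 60_000    # average events/sec
peak_eps = 250_000  # peak events/sec
event_kb = 1.5      # average event size in KB

avg_mb_per_s = avg_eps * event_kb / 1024           # ~88 MB/s sustained
peak_mb_per_s = peak_eps * event_kb / 1024         # ~366 MB/s at peak
avg_tb_per_day = avg_mb_per_s * 86_400 / 1024**2   # ~7.2 TB/day at the average rate

# The stated ~18 TB/day raw is between the average-rate extrapolation (~7.2 TB)
# and the peak-rate extrapolation (~30 TB), leaving room for duplicates/replays.
print(f"{avg_mb_per_s:.1f} MB/s avg, {peak_mb_per_s:.1f} MB/s peak, "
      f"{avg_tb_per_day:.2f} TB/day at avg rate")
```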
Requirements
- Ingest ordered and unordered events from web SDKs, mobile SDKs, and backend APIs into the Databricks Lakehouse.
- Build a stateful sessionization pipeline that handles late events, duplicate events, and identity merges without relying on sticky application servers.
- Persist intermediate and final state in Delta tables so the pipeline can recover safely after serverless restarts.
- Support both streaming session tables for product analytics and daily backfills for corrected identity mappings.
- Ensure idempotent writes, replayability, and exactly-once semantics where feasible.
- Expose monitoring for lag, state growth, bad records, and session merge anomalies.
- Orchestrate streaming and batch correction jobs using Databricks-native tooling.
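To make the sessionization requirement concrete, the per-user state transition (30-minute inactivity timeout, in-session dedup, tolerance for modestly out-of-order events) can be modeled in plain Python. This is an illustrative sketch of the fold logic a Structured Streaming arbitrary-stateful operator would apply per key; the names `SessionState` and `update_session` are hypothetical, and late events older than the stream's watermark are assumed to be handled upstream (dropped or routed to the batch-correction path).

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

SESSION_GAP_S = 30 * 60  # 30-minute inactivity timeout from the session rule

@dataclass
class SessionState:
    session_id: int = 0              # monotonically increasing per user
    start_ts: Optional[float] = None
    last_ts: Optional[float] = None
    event_count: int = 0
    seen_ids: Set[str] = field(default_factory=set)  # per-session dedup set

def update_session(state: SessionState, event_id: str, ts: float) -> Tuple[SessionState, bool]:
    """Fold one event into per-user state; returns (state, started_new_session).

    Duplicate deliveries (same event_id within the session) are no-ops, and
    modestly out-of-order events extend the current session rather than
    reopening it."""
    if event_id in state.seen_ids:
        return state, False  # duplicate delivery: idempotent no-op
    new_session = state.last_ts is None or ts - state.last_ts > SESSION_GAP_S
    if new_session:
        state.session_id += 1
        state.start_ts = ts
        state.event_count = 0
        state.seen_ids = set()
    state.seen_ids.add(event_id)
    state.event_count += 1
    state.last_ts = ts if new_session else max(state.last_ts, ts)
    return state, new_session
```

In the real pipeline this state would live in the streaming engine's checkpointed state store (backed by cloud storage), with closed sessions emitted to a Delta table via idempotent MERGE, so a serverless restart replays from the checkpoint without double-counting.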
Constraints
- Use Databricks on AWS with serverless compute where possible.
- No external Redis or self-managed Kafka Streams state store.
- PII must be encrypted and support deletion within 30 days.
- Incremental cloud spend target: <$40K/month.
- Team size: 5 data engineers, 1 platform engineer.
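One common way to satisfy the 30-day PII deletion constraint without rewriting terabytes of raw history is crypto-shredding: encrypt each user's PII under a per-user key and delete only the key to render the data unrecoverable. The toy sketch below illustrates the pattern only; the `PiiVault` name is hypothetical, and the SHA-256 XOR keystream is a stand-in for real envelope encryption via a managed KMS (e.g., AWS KMS), which a production system would use instead.

```python
import hashlib
import secrets

class PiiVault:
    """Toy crypto-shredding sketch: one key per user; deleting the key makes
    that user's encrypted PII unrecoverable. The XOR keystream below is
    illustrative only -- production would use KMS-backed envelope encryption."""

    def __init__(self) -> None:
        self._keys = {}  # user_id -> 32-byte key

    def _keystream(self, key: bytes, n: int) -> bytes:
        # Deterministic keystream derived from the key via SHA-256 in counter mode.
        out = b""
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def encrypt(self, user_id: str, plaintext: bytes) -> bytes:
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        ks = self._keystream(key, len(plaintext))
        return bytes(a ^ b for a, b in zip(plaintext, ks))

    def decrypt(self, user_id: str, ciphertext: bytes) -> bytes:
        key = self._keys[user_id]  # raises KeyError once the user is shredded
        ks = self._keystream(key, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, ks))

    def shred(self, user_id: str) -> None:
        # Deletion request: dropping the key logically erases all of this
        # user's PII across raw and session tables within the SLA window.
        self._keys.pop(user_id, None)
```

The appeal for this design is operational: the key table is small and mutable, while the 180-day raw event history stays append-only.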
Describe the end-to-end architecture, state management strategy, table design, orchestration, monitoring, and failure recovery approach.