Context
A Databricks customer is migrating legacy Spark jobs to a unified Databricks Lakehouse pipeline built on Delta Lake, Databricks Workflows, and Delta Live Tables (now Lakeflow Declarative Pipelines). Several existing jobs still use low-level RDD transformations, while newer code uses DataFrames. The team wants a clear migration strategy for batch and streaming ETL, with strong data quality guarantees and lower operational overhead.
You are asked to design the target pipeline architecture and explain where Spark RDDs, DataFrames, and Datasets should or should not be used in Databricks production pipelines.
Scale Requirements
- Input sources: CDC from operational databases, JSON application logs, and hourly Parquet partner drops
- Throughput: 150K records/sec peak streaming ingest, 4 TB/day batch ingest
- Latency target: Bronze tables < 2 minutes from arrival; Silver tables < 10 minutes
- Retention: 180 days for raw data, 2 years for curated data
- Consumers: 40 BI dashboards, 12 ML feature pipelines, 20 downstream batch jobs
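
As a rough sizing aid, the figures above can be turned into per-day and per-window numbers. The sketch below is back-of-envelope arithmetic only; it assumes decimal units (1 TB = 10**12 bytes) and a batch volume spread evenly across the day, neither of which is stated in the requirements.

```python
# Back-of-envelope sizing from the stated scale requirements.
# Assumptions (not in the requirements): decimal units (1 TB = 10**12 bytes)
# and batch volume spread evenly over the day.

PEAK_RECORDS_PER_SEC = 150_000
BATCH_BYTES_PER_DAY = 4 * 10**12
SECONDS_PER_DAY = 24 * 60 * 60

# Upper bound on daily streaming records if the peak rate were sustained all day.
max_stream_records_per_day = PEAK_RECORDS_PER_SEC * SECONDS_PER_DAY

# Average batch throughput needed to keep up with 4 TB/day.
avg_batch_mb_per_sec = BATCH_BYTES_PER_DAY / SECONDS_PER_DAY / 10**6

# Records that can arrive within the 2-minute Bronze latency window at peak.
bronze_window_records = PEAK_RECORDS_PER_SEC * 2 * 60

print(f"{max_stream_records_per_day:,} records/day upper bound")
print(f"{avg_batch_mb_per_sec:.1f} MB/s average batch throughput")
print(f"{bronze_window_records:,} records per 2-minute Bronze window at peak")
```

The streaming figure is an upper bound, since the stated rate is a peak and not an average. Hourly partner drops will also arrive in bursts, so real batch throughput needs headroom above the daily average.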
Requirements
- Design a Databricks-native pipeline from ingestion to curated Delta tables using Bronze/Silver/Gold layers.
- Explain the trade-offs between RDDs, DataFrames, and Datasets for ETL transformations, schema enforcement, optimization, and maintainability.
- Identify which API you would standardize on for Databricks batch and streaming pipelines, and where exceptions are justified.
- Include data quality controls such as schema validation, null checks, deduplication, and quarantine handling.
- Show how orchestration, backfills, and monitoring would work in Databricks Workflows.
- Describe how your design supports both SQL-first analytics users and PySpark/Scala engineers.
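
To make the data quality requirement concrete, here is a minimal plain-Python sketch of the intended semantics: schema validation, null checks, deduplication, and quarantine routing. It is not a Databricks implementation. In a real pipeline the same rules would more likely be expressed as DataFrame operations or pipeline expectations. The field names (`event_id`, `event_ts`, `payload`) and the function names are hypothetical, chosen only for illustration.

```python
from typing import Any, Optional

# Hypothetical record contract for illustration only; real column names
# and types would come from the source schemas.
REQUIRED_FIELDS = {"event_id": str, "event_ts": int, "payload": dict}


def validate(record: dict) -> Optional[str]:
    """Return a rejection reason, or None if the record passes all checks."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            return f"missing field: {field}"      # schema validation
        if record[field] is None:
            return f"null value: {field}"         # null check
        if not isinstance(record[field], expected_type):
            return f"wrong type: {field}"         # type check
    return None


def apply_quality_controls(records: list) -> tuple:
    """Split records into (clean, quarantined).

    Clean records are deduplicated on event_id, keeping the row with the
    highest event_ts. Rejected records are kept with their reason so they
    can be audited and replayed instead of being silently dropped.
    """
    latest_by_id: dict = {}
    quarantined: list = []
    for record in records:
        reason = validate(record)
        if reason is not None:
            quarantined.append({"record": record, "reason": reason})
            continue
        key: Any = record["event_id"]
        current = latest_by_id.get(key)
        if current is None or record["event_ts"] > current["event_ts"]:
            latest_by_id[key] = record
    return list(latest_by_id.values()), quarantined
```

Keeping rejected rows together with a reason, instead of discarding them, is what lets the quarantine path support the auditability and replayability constraints.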
Constraints
- Platform must stay fully on Databricks with Delta Lake as the system of record.
- Team has mixed Python and Scala experience; most analysts are SQL-first.
- Compliance requires auditability, replayability, and lineage for all production tables.
- Minimize custom JVM code and operational complexity; avoid designs that block Photon/Catalyst optimizations.