Choose Spark APIs for Lakeflow

Medium

Pipelines

Context

A Databricks customer is migrating legacy Spark jobs to a unified Databricks Lakehouse pipeline built on Delta Lake, Databricks Workflows, and Delta Live Tables (now Lakeflow Declarative Pipelines). Several existing jobs still use low-level RDD transformations, while newer code uses DataFrames. The team wants a clear migration strategy for batch and streaming ETL, with strong data quality guarantees and lower operational overhead.

You are asked to design the target pipeline architecture and explain where Spark RDDs, DataFrames, and Datasets should or should not be used in Databricks production pipelines.

Scale Requirements

Input sources: CDC from operational databases, JSON application logs, and hourly Parquet partner drops
Throughput: 150K records/sec peak streaming ingest, 4 TB/day batch ingest
Latency target: Bronze tables < 2 minutes from arrival; Silver tables < 10 minutes
Retention: 180 days raw, 2 years curated
Consumers: 40 BI dashboards, 12 ML feature pipelines, 20 downstream batch jobs
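
A quick back-of-envelope pass over the figures above can help size the design before choosing APIs. The sketch below is only arithmetic on the stated numbers; the average record size (ASSUMED_RECORD_BYTES) is a hypothetical value that does not appear in the prompt, and TB is taken as decimal (10^12 bytes).

```python
# Figures taken directly from the Scale Requirements.
PEAK_RECORDS_PER_SEC = 150_000
BATCH_BYTES_PER_DAY = 4 * 10**12      # 4 TB/day, decimal TB assumed
RAW_RETENTION_DAYS = 180
BRONZE_SLA_SEC = 2 * 60               # Bronze < 2 minutes
SILVER_SLA_SEC = 10 * 60              # Silver < 10 minutes

# Assumption: not given in the prompt, chosen only for illustration.
ASSUMED_RECORD_BYTES = 1_000


def sizing() -> dict:
    """Derive rough capacity numbers from the stated requirements."""
    return {
        # Average batch ingest rate if the 4 TB arrived evenly over a day.
        "batch_avg_mb_per_sec": BATCH_BYTES_PER_DAY / 86_400 / 10**6,
        # Raw batch data held for the retention window, before compression.
        "raw_batch_retained_tb": BATCH_BYTES_PER_DAY * RAW_RETENTION_DAYS / 10**12,
        # Records that can accumulate at peak within each latency target.
        "peak_records_in_bronze_window": PEAK_RECORDS_PER_SEC * BRONZE_SLA_SEC,
        "peak_records_in_silver_window": PEAK_RECORDS_PER_SEC * SILVER_SLA_SEC,
        # Peak streaming bandwidth under the assumed record size.
        "peak_stream_mb_per_sec": PEAK_RECORDS_PER_SEC * ASSUMED_RECORD_BYTES / 10**6,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value:,.1f}")
```

The retained-volume figure covers batch drops only; streaming volume would add to it in proportion to whatever the real record size turns out to be.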

Requirements

Design a Databricks-native pipeline from ingestion to curated Delta tables using Bronze/Silver/Gold layers.
Explain the trade-offs between RDDs, DataFrames, and Datasets for ETL transformations, schema enforcement, optimization, and maintainability.
Identify which API you would standardize on for Databricks batch and streaming pipelines, and where exceptions are justified.
Include data quality controls such as schema validation, null checks, deduplication, and quarantine handling.
Show how orchestration, backfills, and monitoring would work in Databricks Workflows.
Describe how your design supports both SQL-first analytics users and PySpark/Scala engineers.
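
The data quality requirement (null checks, deduplication, quarantine handling) can be illustrated with a minimal sketch of the routing logic. It is written in plain Python so it runs without a cluster; in a Lakeflow pipeline the same rules would more likely be expressed as table expectations and DataFrame operations. The field names (event_id, event_ts, payload) and the "latest timestamp wins" deduplication rule are assumptions for illustration, not part of the prompt.

```python
# Hypothetical required columns; the prompt does not name a schema.
REQUIRED_FIELDS = ("event_id", "event_ts", "payload")


def violations(record: dict) -> list:
    """Return the names of the rules this record fails (empty if clean)."""
    failed = []
    for name in REQUIRED_FIELDS:
        if name not in record:
            failed.append(f"missing:{name}")   # schema validation
        elif record[name] is None:
            failed.append(f"null:{name}")      # null check
    return failed


def split_batch(records: list) -> tuple:
    """Split records into (valid, quarantined).

    Valid records are deduplicated on event_id, keeping the row with the
    largest event_ts. Failing records are kept, not dropped, together
    with the rules they failed, so they can be audited and replayed.
    """
    valid_by_key = {}
    quarantined = []
    for rec in records:
        failed = violations(rec)
        if failed:
            quarantined.append({"record": rec, "failed_rules": failed})
            continue
        current = valid_by_key.get(rec["event_id"])
        if current is None or rec["event_ts"] > current["event_ts"]:
            valid_by_key[rec["event_id"]] = rec
    return list(valid_by_key.values()), quarantined


if __name__ == "__main__":
    sample = [
        {"event_id": "a", "event_ts": 1, "payload": "v1"},
        {"event_id": "a", "event_ts": 2, "payload": "v2"},   # newer duplicate
        {"event_id": "b", "event_ts": 5, "payload": None},   # null payload
        {"event_id": "c", "event_ts": 7},                    # missing column
    ]
    valid, bad = split_batch(sample)
    print(valid)
    print(bad)
```

Keeping the failed rule names alongside each quarantined record is one way to support the auditability and replayability constraints, since a later fix can reprocess exactly the rows that failed a given rule.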

Constraints

Platform must stay fully on Databricks with Delta Lake as the system of record.
Team has mixed Python and Scala experience; most analysts are SQL-first.
Compliance requires auditability, replayability, and lineage for all production tables.
Minimize custom JVM code and operational complexity; avoid designs that block Photon/Catalyst optimizations.
