You are given a Delta Lake table docs(doc_id STRING, uri STRING, title STRING, body STRING, updated_at TIMESTAMP) containing raw knowledge-base documents that will be indexed into Databricks Vector Search for a RAG application.

Write a PySpark function that performs the ETL needed before indexing: normalize whitespace, split each document body into overlapping chunks of at most 500 characters with 50-character overlap, preserve doc_id, uri, and chunk order, and emit a DataFrame with schema (chunk_id, doc_id, uri, title, chunk_text, chunk_index, updated_at). The solution should be efficient on Databricks for large tables, avoid collecting data to the driver, and support incremental reprocessing by re-chunking only documents whose updated_at is newer than the last successful run timestamp. In your explanation, describe how this output would feed a Databricks Vector Search pipeline and why Spark is a good fit here.

Expected solution outline: Spark SQL / DataFrame transforms, a UDF or pandas UDF only where needed for chunking, deterministic chunk_id generation, filtering on updated_at, and discussion of partitioning/performance tradeoffs on Databricks.
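For illustration, here is a minimal sketch of the shape such a solution could take, not a reference answer. It assumes the source table is registered as "docs", that the last-run watermark is passed in as a hypothetical last_run_ts parameter, and that the helper names chunk_text and chunk_docs are purely illustrative. It uses a plain Python UDF plus posexplode for the chunking step and a SHA-256 hash of (doc_id, chunk_index) as the deterministic chunk_id, which keeps re-runs MERGE-friendly.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

CHUNK_SIZE = 500   # max characters per chunk
OVERLAP = 50       # characters shared between consecutive chunks

@F.udf(ArrayType(StringType()))
def chunk_text(body):
    """Normalize whitespace and split into overlapping character chunks."""
    if body is None:
        return []
    text = " ".join(body.split())          # collapse runs of whitespace
    if not text:
        return []
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def chunk_docs(spark, last_run_ts):
    """Re-chunk only documents updated since last_run_ts (assumed watermark)."""
    docs = (spark.table("docs")
                 .filter(F.col("updated_at") > F.lit(last_run_ts)))

    return (docs
        .withColumn("chunks", chunk_text(F.col("body")))
        .select(
            "doc_id", "uri", "title", "updated_at",
            F.posexplode("chunks").alias("chunk_index", "chunk_text"))
        # Deterministic chunk_id so reprocessed documents overwrite the same rows
        .withColumn("chunk_id",
            F.sha2(F.concat_ws("::", F.col("doc_id"), F.col("chunk_index")), 256))
        .select("chunk_id", "doc_id", "uri", "title",
                "chunk_text", "chunk_index", "updated_at"))
```

Under these assumptions, the resulting DataFrame would typically be MERGEd into a chunks Delta table keyed on chunk_id, which a Databricks Vector Search delta-sync index can then embed and serve; a pandas UDF or repartitioning by doc_id may be worth discussing as performance tradeoffs for very large tables.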