Context
You’re interviewing for a Data Engineering role on the Search & Recommendations team at a large e-commerce marketplace (think Etsy/eBay scale). The company has 65M monthly active buyers, 4M active sellers, and a catalog of ~1.2B listings across multiple locales. A new AI-powered experience—semantic search + “similar items” recommendations—is being rolled out. The product relies on a vector database to serve approximate nearest neighbor (ANN) queries over item and query embeddings.
Today, the catalog pipeline is mostly batch: sellers update listings (title, description, images, attributes), changes land in Kafka, and nightly Spark jobs build a denormalized “search document” table in Snowflake. This is sufficient for keyword search, but the AI experience has stricter requirements: embeddings must be refreshed quickly after edits, must be consistent across modalities (text + image), and the vector index must support safe backfills and model upgrades. A previous attempt produced stale vectors, duplicate records, and inconsistent metadata filters, causing relevance regressions and customer support escalations.
Your task is to design a complete data pipeline (streaming + batch) to generate embeddings and load them into a vector database for online serving, while also maintaining an analytics-friendly history in the warehouse.
Scale Requirements
- Event throughput: average 35K listing updates/sec, peak 120K/sec during promotions.
- Catalog size: 1.2B active listings, 6B total historical versions retained for audit.
- Embedding models:
  - Text encoder: 768-dim float32
  - Image encoder: 1024-dim float32
  - Optional fused embedding: 1536-dim float32
- Freshness SLO: P95 < 5 minutes from listing update to queryable vector.
- Availability: vector serving 99.9%; pipeline should degrade gracefully.
- Retention:
  - Raw events: 14 days in Kafka
  - Processed feature/embedding artifacts: 90 days in object storage
  - Warehouse history: 2+ years
- Compliance: GDPR/CCPA deletion within 72 hours; auditability of model version used.
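
From the catalog size and embedding dimensions above, a quick back-of-envelope sizing of raw vector storage (float32 only, before ANN index overhead or replication) helps frame the budget constraint:

```python
LISTINGS = 1.2e9          # active catalog size from the scale requirements
BYTES_PER_FLOAT32 = 4
DIMS = {"text": 768, "image": 1024, "fused": 1536}

for name, dim in DIMS.items():
    tb = LISTINGS * dim * BYTES_PER_FLOAT32 / 1e12
    print(f"{name:>5}: {tb:4.1f} TB raw")   # text ~3.7 TB, image ~4.9 TB, fused ~7.4 TB

# Roughly 16 TB of raw float32 vectors per model generation, before index
# overhead, replicas, or running two model versions side by side during a cutover.
```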
Data Characteristics
Sources
- ListingChangeEvent (Kafka): emitted on create/update/delete.
- MediaIngestEvent (Kafka): emitted when an image is uploaded or removed; the image-processing pipeline produces canonical image URLs.
- SellerPolicyEvent (Kafka): policy violations can suppress listings.
- Reference data (Snowflake): category taxonomy, locale normalization tables.
Example event schema (simplified)
| field | type | notes |
|---|---|---|
| event_id | string | globally unique UUID |
| listing_id | string | stable item id |
| event_ts | timestamp | producer timestamp |
| op | string | UPSERT/DELETE |
| title | string | may be null on partial updates |
| description | string | may be large (up to 20KB) |
| attributes | variant/json | sparse key-value |
| image_ids | array | may arrive later than text |
| seller_id | string | used for policy joins |
| locale | string | affects tokenization |
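
A minimal sketch of this schema as a Spark Structured Streaming StructType, assuming JSON-encoded Kafka message values (the variable name and nullability choices are illustrative):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType,
    MapType, ArrayType,
)

# Illustrative schema for ListingChangeEvent; attributes is modeled as a sparse
# string map, and most fields may be null on partial updates.
listing_change_schema = StructType([
    StructField("event_id",    StringType(),    nullable=False),
    StructField("listing_id",  StringType(),    nullable=False),
    StructField("event_ts",    TimestampType(), nullable=False),
    StructField("op",          StringType(),    nullable=False),  # UPSERT | DELETE
    StructField("title",       StringType(),    nullable=True),
    StructField("description", StringType(),    nullable=True),   # up to ~20KB
    StructField("attributes",  MapType(StringType(), StringType()), nullable=True),
    StructField("image_ids",   ArrayType(StringType()), nullable=True),
    StructField("seller_id",   StringType(),    nullable=True),
    StructField("locale",      StringType(),    nullable=True),
])
```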
Quality issues you must handle
- Out-of-order and late events (up to 2 hours late) due to mobile clients and retries.
- Partial updates (title changed but description omitted).
- Duplicates (at-least-once delivery; producers retry); see the deduplication sketch after this list.
- Model upgrades (v1 → v2) requiring re-embedding and safe cutover.
- Deletes and suppressions must remove vectors and prevent serving.
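
A hedged sketch of how the duplicate and late-event handling could look in Spark Structured Streaming, reusing `listing_change_schema` from the schema sketch above (broker address, topic, and app name are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.appName("listing-change-ingest").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "msk-broker-1:9092")  # illustrative MSK endpoint
    .option("subscribe", "listing-change-events")             # illustrative topic name
    .load()
    # listing_change_schema is defined in the schema sketch above.
    .select(from_json(col("value").cast("string"), listing_change_schema).alias("e"))
    .select("e.*")
    # Accept events up to 2 hours late (the stated lateness bound) before
    # their event-time state is dropped.
    .withWatermark("event_ts", "2 hours")
    # At-least-once producers: retries reuse the same event_id and event_ts,
    # so deduplicating on both keeps state bounded by the watermark.
    # (Spark 3.5+ also offers dropDuplicatesWithinWatermark.)
    .dropDuplicates(["event_id", "event_ts"])
)
```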
Requirements
Functional
- Build an ingestion and processing pipeline that creates a canonical “listing document” suitable for embedding.
- Generate text and image embeddings (and optionally fused embeddings) and write them to a vector database with correct metadata for filtering (category, locale, seller policy status, active flag).
- Support incremental updates (only re-embed when relevant fields change) and idempotent writes to the vector DB; see the content-hash sketch after this list.
- Handle late-arriving image events: if images arrive after text, update the fused embedding and metadata without corrupting the index.
- Provide a backfill/reindex mechanism for model upgrades and bug fixes, without downtime.
- Maintain an analytics/audit trail in Snowflake: which model version produced which embedding for each listing version.
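
As an illustration of the incremental-update and idempotency requirements above, a minimal sketch of content hashing plus deterministic vector IDs (the field set, function names, and ID format are assumptions, not a prescribed design):

```python
import hashlib
import json

# Fields whose changes should trigger re-embedding; attribute- or policy-only
# changes update metadata without recomputing vectors.
EMBED_FIELDS = ("title", "description", "image_ids", "locale")

def content_hash(doc: dict) -> str:
    """Stable hash over the embedding-relevant fields of the canonical listing document."""
    payload = {k: doc.get(k) for k in EMBED_FIELDS}
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    ).hexdigest()

def vector_id(listing_id: str, model_version: str) -> str:
    """Deterministic ID so replayed events overwrite rather than duplicate."""
    return f"{listing_id}:{model_version}"

def should_reembed(doc: dict, previous_hash: str | None) -> bool:
    # Skip the (expensive) encoder call when nothing embedding-relevant changed.
    return previous_hash is None or content_hash(doc) != previous_hash
```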
Non-functional
- Meet freshness SLO (P95 < 5 min) while controlling cost.
- Ensure exactly-once effect at the vector DB layer (no duplicates, no missing deletes).
- Implement data quality checks and observability: schema drift, null explosions, embedding distribution drift, and index load failures (a drift-check sketch follows this list).
- Ensure GDPR deletion propagates to object storage, warehouse, and vector DB.
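
For the embedding-distribution-drift check mentioned above, one minimal sketch compares vector-norm statistics of a sampled batch against a baseline captured when the current model version shipped (the threshold and alert hook are illustrative):

```python
import numpy as np

def norm_drift(sample_vectors: np.ndarray,
               baseline_mean: float,
               baseline_std: float,
               max_sigma: float = 3.0) -> bool:
    """Flag drift when the mean L2 norm of a sampled batch of embeddings moves
    more than max_sigma baseline standard deviations from the baseline mean."""
    norms = np.linalg.norm(sample_vectors, axis=1)
    z = abs(norms.mean() - baseline_mean) / max(baseline_std, 1e-9)
    return z > max_sigma

# Illustrative usage: sample fused embeddings from today's output and compare
# against the baseline recorded at model rollout.
# if norm_drift(todays_sample, baseline_mean=1.0, baseline_std=0.05):
#     alert_on_call("embedding norm drift exceeds 3 sigma")   # hypothetical hook
```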
Constraints
- Cloud: AWS. Data lake on S3, warehouse is Snowflake.
- Streaming platform: Kafka (MSK) already exists; team has moderate Kafka/Spark experience.
- Orchestration: Airflow 2.x is standard.
- Vector DB: choose one and justify (e.g., Pinecone, Weaviate, Milvus, or OpenSearch k-NN). Assume it supports upserts, deletes, namespaces/collections, and metadata filters.
- Budget: incremental $60K/month for embedding compute + vector storage/serving.
Interview Prompts
- Propose the end-to-end architecture (streaming + batch) and justify key choices: partitioning, state management, watermarking, and storage formats.
- Define the data model in the vector DB: IDs, namespaces (by model version?), metadata fields, and how you represent multiple embeddings per listing (text/image/fused); an illustrative record shape appears at the end of this brief.
- Explain how you ensure idempotency and correctness under duplicates/out-of-order events.
- Describe your reindex/backfill strategy for a new embedding model, including validation and cutover.
- Provide monitoring/alerting and failure handling strategies that you would run in production.
Your answer should be specific about technologies (Kafka, Spark, Airflow, dbt/Snowflake) and include concrete operational details (SLOs, retries, DLQs, and rollback).
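
For concreteness on the data-model prompt, one possible record shape, with the namespace encoding the model version so a new model can be backfilled side by side and cut over without downtime (every name and value below is illustrative, not a prescribed schema):

```python
# Illustrative vector-DB record: one record per (listing_id, embedding kind).
record = {
    "id": "listing_12345:text",        # deterministic: <listing_id>:<embedding_kind>
    "namespace": "listings_text_v2",   # model version in the namespace enables safe cutover
    "values": [0.012, -0.087, 0.153],  # truncated; the real text vector is 768-dim float32
    "metadata": {
        "listing_id": "listing_12345",
        "model_version": "text-v2",
        "category_id": "home_decor",
        "locale": "en-US",
        "policy_suppressed": False,
        "active": True,
        "embedded_at": "2025-01-15T10:22:31Z",
        "content_hash": "sha256-of-embedded-fields",  # ties back to the content-hash sketch
    },
}
```

Under a shape like this, deletes and GDPR erasure reduce to removing every record whose metadata `listing_id` matches, across all live namespaces, alongside the corresponding tombstones in S3 and Snowflake.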