Context
You’re interviewing for a Data Engineering role on the Search & Recommendations team at a large e-commerce marketplace (think Etsy/eBay scale). The company has 65M monthly active buyers, 4M active sellers, and a catalog of ~1.2B listings across multiple locales. A new AI-powered experience—semantic search + “similar items” recommendations—is being rolled out. The product relies on a vector database to serve approximate nearest neighbor (ANN) queries over item and query embeddings.
Today, the catalog pipeline is mostly batch: sellers update listings (title, description, images, attributes), changes land in Kafka, and nightly Spark jobs build a denormalized “search document” table in Snowflake. This is sufficient for keyword search, but the AI experience has stricter requirements: embeddings must be refreshed quickly after edits, must be consistent across modalities (text + image), and the vector index must support safe backfills and model upgrades. A previous attempt produced stale vectors, duplicate records, and inconsistent metadata filters, causing relevance regressions and customer support escalations.
Your task is to design a complete data pipeline (streaming + batch) to generate embeddings and load them into a vector database for online serving, while also maintaining an analytics-friendly history in the warehouse.
Scale Requirements
- Event throughput: average 35K listing updates/sec, peak 120K/sec during promotions.
- Catalog size: 1.2B active listings, 6B total historical versions retained for audit.
- Embedding models:
  - Text encoder: 768-dim float32
  - Image encoder: 1024-dim float32
  - Optional fused embedding: 1536-dim float32
- Freshness SLO: P95 < 5 minutes from listing update to queryable vector.
- Availability: vector serving 99.9%; pipeline should degrade gracefully.
- Retention:
  - Raw events: 14 days in Kafka
  - Processed feature/embedding artifacts: 90 days in object storage
  - Warehouse history: 2+ years
- Compliance: GDPR/CCPA deletion within 72 hours; auditability of model version used.
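
From the catalog size and embedding dimensions above, a quick back-of-envelope sizing of raw vector storage (float32 only, before ANN index overhead or replication) helps frame the budget constraint:

```python
LISTINGS = 1.2e9          # active catalog size from the scale requirements
BYTES_PER_FLOAT32 = 4
DIMS = {"text": 768, "image": 1024, "fused": 1536}

for name, dim in DIMS.items():
    tb = LISTINGS * dim * BYTES_PER_FLOAT32 / 1e12
    print(f"{name:>5}: {tb:4.1f} TB raw")   # text ~3.7 TB, image ~4.9 TB, fused ~7.4 TB

# Roughly 16 TB of raw float32 vectors per model generation, before index
# overhead, replicas, or running two model versions side by side during a cutover.
```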
Data Characteristics
Sources
- ListingChangeEvent (Kafka): emitted on create/update/delete.
- MediaIngestEvent (Kafka): emitted when an image is uploaded or removed; the image-processing pipeline produces canonical image URLs.
- SellerPolicyEvent (Kafka): policy violations can suppress listings.
- Reference data (Snowflake): category taxonomy, locale normalization tables.
Example event schema (simplified)
| field | type | notes |
|---|---|---|
| event_id | string | globally unique UUID |
| listing_id | string | stable item id |
| event_ts | timestamp | producer timestamp |
| op | string | UPSERT/DELETE |
| title | string | may be null on partial updates |
| description | string | may be large (up to 20KB) |
| attributes | variant/json | sparse key-value |
| image_ids | array | may arrive later than text |
| seller_id | string | used for policy joins |
| locale | string | affects tokenization |
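
A minimal sketch of this schema as a Spark Structured Streaming StructType, assuming JSON-encoded Kafka message values (the variable name and nullability choices are illustrative):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType,
    MapType, ArrayType,
)

# Illustrative schema for ListingChangeEvent; attributes is modeled as a sparse
# string map, and most fields may be null on partial updates.
listing_change_schema = StructType([
    StructField("event_id",    StringType(),    nullable=False),
    StructField("listing_id",  StringType(),    nullable=False),
    StructField("event_ts",    TimestampType(), nullable=False),
    StructField("op",          StringType(),    nullable=False),  # UPSERT | DELETE
    StructField("title",       StringType(),    nullable=True),
    StructField("description", StringType(),    nullable=True),   # up to ~20KB
    StructField("attributes",  MapType(StringType(), StringType()), nullable=True),
    StructField("image_ids",   ArrayType(StringType()), nullable=True),
    StructField("seller_id",   StringType(),    nullable=True),
    StructField("locale",      StringType(),    nullable=True),
])
```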
Quality issues you must handle
- Out-of-order and late events (up to 2 hours late) due to mobile clients and retries.
- Partial updates (title changed but description omitted).
- Duplicates (at-least-once delivery; producers retry); see the deduplication sketch after this list.
- Model upgrades (v1 → v2) requiring re-embedding and safe cutover.
- Deletes and suppressions must remove vectors and prevent serving.
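
A hedged sketch of how the duplicate and late-event handling could look in Spark Structured Streaming, reusing `listing_change_schema` from the schema sketch above (broker address, topic, and app name are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.appName("listing-change-ingest").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "msk-broker-1:9092")  # illustrative MSK endpoint
    .option("subscribe", "listing-change-events")             # illustrative topic name
    .load()
    # listing_change_schema is defined in the schema sketch above.
    .select(from_json(col("value").cast("string"), listing_change_schema).alias("e"))
    .select("e.*")
    # Accept events up to 2 hours late (the stated lateness bound) before
    # their event-time state is dropped.
    .withWatermark("event_ts", "2 hours")
    # At-least-once producers: retries reuse the same event_id and event_ts,
    # so deduplicating on both keeps state bounded by the watermark.
    # (Spark 3.5+ also offers dropDuplicatesWithinWatermark.)
    .dropDuplicates(["event_id", "event_ts"])
)
```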
Requirements
Functional
- Build an ingestion and processing pipeline that creates a canonical “listing document” suitable for embedding.
- Generate text and image embeddings (and optionally fused embeddings) and write them to a vector database with correct metadata for filtering (category, locale, seller policy status, active flag).
- Support incremental updates (only re-embed when relevant fields change) and idempotent writes to the vector DB; see the content-hash sketch after this list.
- Handle late-arriving image events: if images arrive after text, update the fused embedding and metadata without corrupting the index.
- Provide a backfill/reindex mechanism for model upgrades and bug fixes, without downtime.
- Maintain an analytics/audit trail in Snowflake: which model version produced which embedding for each listing version.
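
As an illustration of the incremental-update and idempotency requirements above, a minimal sketch of content hashing plus deterministic vector IDs (the field set, function names, and ID format are assumptions, not a prescribed design):

```python
import hashlib
import json

# Fields whose changes should trigger re-embedding; attribute- or policy-only
# changes update metadata without recomputing vectors.
EMBED_FIELDS = ("title", "description", "image_ids", "locale")

def content_hash(doc: dict) -> str:
    """Stable hash over the embedding-relevant fields of the canonical listing document."""
    payload = {k: doc.get(k) for k in EMBED_FIELDS}
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    ).hexdigest()

def vector_id(listing_id: str, model_version: str) -> str:
    """Deterministic ID so replayed events overwrite rather than duplicate."""
    return f"{listing_id}:{model_version}"

def should_reembed(doc: dict, previous_hash: str | None) -> bool:
    # Skip the (expensive) encoder call when nothing embedding-relevant changed.
    return previous_hash is None or content_hash(doc) != previous_hash
```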
Non-functional
- Meet freshness SLO (P95 < 5 min) while controlling cost.
- Ensure exactly-once effect at the vector DB layer (no duplicates, no missing deletes).
- Implement data quality checks and observability: schema drift, null explosions, embedding distribution drift, and index load failures (a drift-check sketch follows this list).
- Ensure GDPR deletion propagates to object storage, warehouse, and vector DB.
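
For the embedding-distribution-drift check mentioned above, one minimal sketch compares vector-norm statistics of a sampled batch against a baseline captured when the current model version shipped (the threshold and alert hook are illustrative):

```python
import numpy as np

def norm_drift(sample_vectors: np.ndarray,
               baseline_mean: float,
               baseline_std: float,
               max_sigma: float = 3.0) -> bool:
    """Flag drift when the mean L2 norm of a sampled batch of embeddings moves
    more than max_sigma baseline standard deviations from the baseline mean."""
    norms = np.linalg.norm(sample_vectors, axis=1)
    z = abs(norms.mean() - baseline_mean) / max(baseline_std, 1e-9)
    return z > max_sigma

# Illustrative usage: sample fused embeddings from today's output and compare
# against the baseline recorded at model rollout.
# if norm_drift(todays_sample, baseline_mean=1.0, baseline_std=0.05):
#     alert_on_call("embedding norm drift exceeds 3 sigma")   # hypothetical hook
```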
Constraints
- Cloud: AWS. Data lake on S3, warehouse is Snowflake.
- Streaming platform: Kafka (MSK) already exists; team has moderate Kafka/Spark experience.
- Orchestration: Airflow 2.x is standard.
- Vector DB: choose one and justify (e.g., Pinecone, Weaviate, Milvus, or OpenSearch k-NN). Assume it supports upserts, deletes, namespaces/collections, and metadata filters.
- Budget: incremental $60K/month for embedding compute + vector storage/serving.
Interview Prompts
- Propose the end-to-end architecture (streaming + batch) and justify key choices: partitioning, state management, watermarking, and storage formats.
- Define the data model in the vector DB: IDs, namespaces (by model version?), metadata fields, and how you represent multiple embeddings per listing (text/image/fused); an illustrative record shape appears at the end of this brief.
- Explain how you ensure idempotency and correctness under duplicates/out-of-order events.
- Describe your reindex/backfill strategy for a new embedding model, including validation and cutover.
- Provide monitoring/alerting and failure handling strategies that you would run in production.
Your answer should be specific about technologies (Kafka, Spark, Airflow, dbt/Snowflake) and include concrete operational details (SLOs, retries, DLQs, and rollback).
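
For concreteness on the data-model prompt, one possible record shape, with the namespace encoding the model version so a new model can be backfilled side by side and cut over without downtime (every name and value below is illustrative, not a prescribed schema):

```python
# Illustrative vector-DB record: one record per (listing_id, embedding kind).
record = {
    "id": "listing_12345:text",        # deterministic: <listing_id>:<embedding_kind>
    "namespace": "listings_text_v2",   # model version in the namespace enables safe cutover
    "values": [0.012, -0.087, 0.153],  # truncated; the real text vector is 768-dim float32
    "metadata": {
        "listing_id": "listing_12345",
        "model_version": "text-v2",
        "category_id": "home_decor",
        "locale": "en-US",
        "policy_suppressed": False,
        "active": True,
        "embedded_at": "2025-01-15T10:22:31Z",
        "content_hash": "sha256-of-embedded-fields",  # ties back to the content-hash sketch
    },
}
```

Under a shape like this, deletes and GDPR erasure reduce to removing every record whose metadata `listing_id` matches, across all live namespaces, alongside the corresponding tombstones in S3 and Snowflake.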