Context
You’re joining MercuryPay, a global fintech with 38M MAUs and 6.5M DAUs across North America and Europe. MercuryPay is rolling out an in-app “AI Financial Assistant” that answers questions like “Can I afford this purchase?”, “Why was my card declined?”, and “Summarize my spending this week.” The assistant is powered by a large language model (LLM) hosted in a GPU inference cluster.
The product requirement is strict: P95 end-to-end latency < 250 ms for short responses, and the assistant must be available during peak traffic (paydays, holidays). The LLM itself is only part of the story—latency and correctness depend heavily on retrieval-augmented generation (RAG) and real-time user context (recent transactions, account balances, merchant metadata, risk flags, and user preferences). Today, MercuryPay’s data platform is mostly batch: nightly Spark jobs land data in a lake (S3) and publish curated tables to Snowflake. This introduces 12–24 hour staleness, which is unacceptable for a conversational assistant that must reflect recent transactions and fraud signals.
Your task is to design the data engineering pipelines that make low-latency LLM serving possible at scale: ingesting events, building online features, updating vector indexes, enforcing data quality, and providing safe fallbacks. You are not being asked to design the model; you are being asked to design the pipelines and data architecture that feed and monitor it.
Scale Requirements
- Inference traffic: 120K requests/sec peak globally; 35K req/sec average.
- Context fetch latency budget (everything before the model): P95 < 60 ms.
- Streaming events:
  - Transactions: 25K events/sec peak (auths, settlements, chargebacks)
  - User activity: 80K events/sec peak (app events, assistant interactions)
  - Fraud/risk signals: 10K events/sec peak
- Data volume: ~40 TB/day raw events; 2 PB retained in the lake (S3) for 13 months.
- Freshness:
  - Transaction context in online store: < 5 seconds from event time (a streaming upsert sketch follows this list)
  - Vector index updates for new merchant descriptions/policies: < 5 minutes
- Correctness: Exactly-once where possible; otherwise effectively-once with idempotency.
- Compliance: PCI/PII handling, GDPR deletion within 30 days, auditability for model outputs.
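To make the freshness and effectively-once targets concrete, here is a minimal streaming-upsert sketch. Kafka via confluent-kafka, Redis as the online store, and the topic/key names (`transactions`, `user:{id}:recent_txns`) are all assumptions for illustration, not a mandated stack.

```python
# Sketch: land transaction context in the online store within seconds.
# Assumptions (illustrative, not mandated): confluent-kafka consumer,
# Redis as the online store, topic "transactions", key "user:{id}:recent_txns".
import json

import redis
from confluent_kafka import Consumer

store = redis.Redis(host="feature-store", port=6379)
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "online-feature-writer",
    "enable.auto.commit": False,  # commit only after the store write succeeds
})
consumer.subscribe(["transactions"])

RECENT_TTL_S = 90 * 24 * 3600  # assumed retention for per-user online context

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    evt = json.loads(msg.value())
    key = f"user:{evt['user_id']}:recent_txns"
    # Idempotent upsert: the member is event_id, so producer retries and
    # CDC replays collapse into the same entry instead of duplicating it.
    store.zadd(key, {evt["event_id"]: evt["event_ts"]})
    store.hset(f"txn:{evt['event_id']}", mapping={
        "amount": evt["amount"], "currency": evt["currency"],
        "merchant_id": evt["merchant_id"], "event_type": evt["event_type"],
    })
    store.expire(key, RECENT_TTL_S)
    # At-least-once delivery + idempotent writes = effectively-once.
    consumer.commit(message=msg, asynchronous=False)
```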
Data Characteristics
Key entities and example schemas
- Transaction event (JSON/Avro in Kafka; an Avro sketch follows this list):
  event_id (uuid), user_id, account_id, merchant_id, amount, currency, event_type (AUTH|SETTLEMENT|REVERSAL), event_ts, ingest_ts, risk_score, mcc_code, country
- User profile / preferences (CDC from Postgres):
  user_id, locale, risk_tier, kyc_status, notification_prefs, updated_at
- Knowledge documents (merchant help pages, card policy docs):
  doc_id, doc_type, source_url, title, body, updated_at, language
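For concreteness, the transaction event could be pinned down as an Avro schema along the following lines. The field types (decimal amount, millisecond timestamps, nullable risk_score/mcc_code) are assumptions, and fastavro is shown only as one way to parse it.

```python
# Sketch: the transaction event as an Avro schema (field types are assumptions).
from fastavro import parse_schema

TRANSACTION_EVENT_SCHEMA = parse_schema({
    "type": "record",
    "name": "TransactionEvent",
    "namespace": "com.mercurypay.events",
    "fields": [
        {"name": "event_id",    "type": {"type": "string", "logicalType": "uuid"}},
        {"name": "user_id",     "type": "string"},
        {"name": "account_id",  "type": "string"},
        {"name": "merchant_id", "type": "string"},
        {"name": "amount",      "type": {"type": "bytes", "logicalType": "decimal",
                                          "precision": 18, "scale": 2}},
        {"name": "currency",    "type": "string"},
        {"name": "event_type",  "type": {"type": "enum", "name": "EventType",
                                          "symbols": ["AUTH", "SETTLEMENT", "REVERSAL"]}},
        {"name": "event_ts",    "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "ingest_ts",   "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "risk_score",  "type": ["null", "double"], "default": None},
        {"name": "mcc_code",    "type": ["null", "string"], "default": None},
        {"name": "country",     "type": "string"},
    ],
})
```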
Quality issues you must handle
- Late arriving events: settlements can arrive hours after auth; chargebacks days later.
- Duplicates: retries from upstream producers; CDC replays.
- Out-of-order: cross-region ingestion and producer clock skew.
- Schema evolution: new fields (e.g., installment_plan_id) added without coordinated deploys; a validation/DLQ sketch follows this list.
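One way the quality gates could look in practice: validate required fields, let unknown additions (such as installment_plan_id) pass through untouched, and send failures to a DLQ topic with enough metadata to replay later. Topic names and the required-field set are assumptions.

```python
# Sketch: schema-tolerant validation with DLQ routing (names are assumptions).
import json
import time

from confluent_kafka import Producer

REQUIRED = {"event_id", "user_id", "amount", "currency", "event_type", "event_ts"}
producer = Producer({"bootstrap.servers": "kafka:9092"})

def handle(evt: dict) -> None:
    """Placeholder for the downstream write (see the online-store sketch above)."""
    ...

def process(raw: bytes) -> None:
    try:
        evt = json.loads(raw)
        missing = REQUIRED - evt.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        # Unknown fields (e.g., a newly added installment_plan_id) are kept
        # as-is rather than rejected, so uncoordinated deploys don't break us.
        handle(evt)
    except Exception as exc:
        # The DLQ record carries the original payload plus error metadata,
        # so a replay job can re-submit once the producer or schema is fixed.
        producer.produce("transactions.dlq", value=json.dumps({
            "error": str(exc),
            "failed_at": time.time(),
            "payload": raw.decode("utf-8", errors="replace"),
        }))
```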
Requirements
Functional
- Build a pipeline that maintains an online feature store for LLM context (recent transactions, aggregates, risk flags) queryable in < 20 ms P95.
- Build a pipeline that maintains a vector index for RAG (merchant/policy docs) with update latency < 5 minutes (an incremental upsert sketch follows this list).
- Provide a batch backfill mechanism to recompute features and embeddings for historical corrections (e.g., bug fix, new aggregation definition).
- Implement data quality gates (schema validation, deduplication, anomaly detection) and route bad records to a DLQ with replay.
- Provide auditable inference telemetry: store prompts, retrieved context IDs, model version, latency, and safety outcomes for investigations.
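Against the < 5 minute vector-index target, the natural move is incremental upserts rather than full rebuilds. A minimal sketch, assuming an OpenSearch-style client and a placeholder embed() call (neither is prescribed by the brief):

```python
# Sketch: incremental vector-index updates for changed docs (< 5 min target).
# embed() and the OpenSearch-style client are placeholders, not a chosen stack.
import hashlib

def embed(text: str) -> list[float]:
    """Placeholder for the embedding model call."""
    raise NotImplementedError

def upsert_doc(index_client, doc: dict, last_seen_hash: dict) -> None:
    body_hash = hashlib.sha256(doc["body"].encode()).hexdigest()
    if last_seen_hash.get(doc["doc_id"]) == body_hash:
        return  # unchanged: skip a wasteful re-embed (cost-aware)
    index_client.index(
        index="merchant-docs-v1",  # versioned index name enables blue/green swaps
        id=doc["doc_id"],          # stable id makes the upsert idempotent
        body={
            "vector": embed(doc["title"] + "\n" + doc["body"]),
            "doc_type": doc["doc_type"],
            "language": doc["language"],
            "updated_at": doc["updated_at"],
        },
    )
    last_seen_hash[doc["doc_id"]] = body_hash
```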
Non-functional
- High availability across 3 AZs; tolerate single-AZ failure with minimal degradation.
- Secure handling of PII: encryption in transit/at rest, access controls, and tokenization where needed.
- Cost-aware design: GPU inference is expensive; pipelines should reduce unnecessary context fetches and avoid wasteful recomputation (a short-TTL cache sketch follows this list).
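As one concrete cost lever for the item above, consecutive turns in the same conversation can reuse a short-lived context snapshot instead of re-fetching features on every request. The TTL and key layout are assumptions; the TTL is deliberately kept below the 5-second transaction-freshness budget.

```python
# Sketch: short-TTL context cache to cut redundant fetches (values assumed).
import json

import redis

cache = redis.Redis(host="feature-store", port=6379)
CONTEXT_TTL_S = 3  # below the 5 s freshness budget, so staleness stays bounded

def fetch_context_from_stores(user_id: str) -> dict:
    """Placeholder for the expensive feature-store and vector-store reads."""
    raise NotImplementedError

def get_context(user_id: str) -> dict:
    cache_key = f"ctx:{user_id}"
    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # second turn in a conversation: free
    ctx = fetch_context_from_stores(user_id)
    cache.set(cache_key, json.dumps(ctx), ex=CONTEXT_TTL_S)
    return ctx
```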
Constraints
- Cloud: AWS (existing S3 data lake, EMR, Snowflake). You may introduce managed services if justified.
- Team: 6 data engineers (strong Spark/dbt), 2 platform engineers (Kubernetes), limited prior Flink experience.
- Existing orchestration: Airflow 2.x.
- Existing warehouse: Snowflake used by analytics and compliance.
- You must support blue/green deployments for feature definitions and embedding models (a version-pointer sketch follows below).
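To make the blue/green constraint concrete: one common pattern resolves the "current" feature-definition version and embedding index through a pointer that readers look up per request, so cutover and rollback are each a single atomic write. The key and index names here are illustrative.

```python
# Sketch: blue/green cutover via an atomic version pointer (names illustrative).
import redis

registry = redis.Redis(host="feature-store", port=6379)

def active_embedding_index() -> str:
    # Readers resolve the alias on every request; a cutover is one SET.
    return (registry.get("alias:merchant-docs") or b"merchant-docs-v1").decode()

def cut_over(new_index: str) -> None:
    # Pre-condition: new_index was fully built and validated offline.
    # Rollback is the same SET with the previous index name; keep the old
    # index around until the bake period ends.
    registry.set("alias:merchant-docs", new_index)
```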
What you should deliver in your answer
- A concrete end-to-end architecture (streaming + batch) with named components.
- Data modeling for online features and offline history (including keys, TTLs, and versioning).
- How you’ll handle late data, deduplication, idempotency, and schema evolution.
- Orchestration strategy for streaming jobs, batch backfills, and index rebuilds (an Airflow skeleton follows this list).
- Monitoring/alerting, SLOs, and failure recovery playbooks.
- Performance considerations to hit the latency budgets and throughput targets.
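Given the Airflow 2.x constraint above, a backfill could hang off a manually triggered, parameterized DAG. The skeleton below uses assumed task names and placeholder callables; it is a sketch, not a prescribed design.

```python
# Sketch: Airflow 2.x backfill skeleton (task names and callables assumed).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def recompute_features(**context):
    """Placeholder: submit the Spark job that recomputes the date partitions."""
    ...

def bulk_load_online_store(**context):
    """Placeholder: load recomputed features under a new version, then swap."""
    ...

with DAG(
    dag_id="feature_backfill",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually with a date range, not a cron
    catchup=False,
) as dag:
    recompute = PythonOperator(
        task_id="recompute_features", python_callable=recompute_features)
    load = PythonOperator(
        task_id="bulk_load_online_store", python_callable=bulk_load_online_store)
    recompute >> load
```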