Capacity-plan a multi-tenant GenAI recommendation platform with tight latency and GPU limits

Hard

ML System Design

Asked at 1 company

Also asked at

Databricks

Problem

A B2C media company wants to build a Databricks-based recommendation and retrieval platform that combines nightly embedding generation, vector search, and online reranking for 120 million users and 15 million content items.

The system must support 12,000 QPS average and 30,000 QPS peak for recommendation requests, P95 latency under 180 ms, an offline NDCG improvement of at least 8% over the current baseline, and a hard limit of 64 A100-equivalent GPUs plus $350k/month in total serving spend.

Ask the candidate to design the end-to-end architecture using Databricks components such as Mosaic AI Model Serving, Vector Search, Delta Lake, Lakeflow/streaming pipelines, and Unity Catalog, and to reason through tenant isolation, autoscaling, backfills, embedding refresh cadence, and failure modes. The candidate should explicitly estimate capacity, identify bottlenecks, and explain to a non-technical CFO why the chosen design meets both growth and cost targets better than a simpler always-on GPU fleet.
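One possible starting point for the capacity estimate is a back-of-envelope calculation from the stated targets. The sketch below is illustrative only, not a reference answer. The QPS, latency, GPU, budget, and catalog figures come from the problem statement; the 30-day month, the 256-dimension float32 embeddings, the use of P95 latency as a conservative service time in Little's law, and the assumption that an autoscaled fleet tracks load perfectly are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Targets:
    """Figures taken from the problem statement."""
    avg_qps: int = 12_000
    peak_qps: int = 30_000
    p95_latency_s: float = 0.180
    gpu_cap: int = 64
    monthly_budget_usd: float = 350_000.0
    users: int = 120_000_000
    items: int = 15_000_000


def monthly_requests(t: Targets, days: int = 30) -> int:
    # Assumption: a 30-day month served at the average QPS.
    return t.avg_qps * SECONDS_PER_DAY * days


def budget_per_million_requests(t: Targets, days: int = 30) -> float:
    # Total serving spend spread over all requests in the month.
    return t.monthly_budget_usd / (monthly_requests(t, days) / 1e6)


def peak_qps_per_gpu(t: Targets) -> float:
    # Worst case: every peak request reaches a GPU-backed reranker.
    return t.peak_qps / t.gpu_cap


def peak_in_flight(t: Targets) -> float:
    # Little's law (L = lambda * W), with P95 latency as a conservative W.
    return t.peak_qps * t.p95_latency_s


def embedding_gib(n_vectors: int, dim: int = 256, bytes_per_value: int = 4) -> float:
    # Assumption: dense float32 vectors; dim=256 is a placeholder, not from the source.
    return n_vectors * dim * bytes_per_value / 2**30


def always_on_gpu_hours(t: Targets, days: int = 30) -> int:
    # All GPUs under the cap running around the clock.
    return t.gpu_cap * 24 * days


def autoscaled_gpu_hours(t: Targets, days: int = 30) -> float:
    # Idealized assumption: the fleet scales linearly with load and tracks it perfectly.
    return always_on_gpu_hours(t, days) * t.avg_qps / t.peak_qps


if __name__ == "__main__":
    t = Targets()
    print("requests/month:", monthly_requests(t))
    print("USD per million requests:", round(budget_per_million_requests(t), 2))
    print("peak QPS per GPU:", peak_qps_per_gpu(t))
    print("peak in-flight requests:", peak_in_flight(t))
    print("item index GiB (assumed dim):", round(embedding_gib(t.items), 2))
    print("always-on GPU-hours:", always_on_gpu_hours(t))
    print("autoscaled GPU-hours (idealized):", autoscaled_gpu_hours(t))
```

The gap between the always-on and idealized autoscaled GPU-hours follows directly from the 12,000 to 30,000 average-to-peak ratio, which is the kind of figure a CFO-facing explanation could build on. A real fleet would land somewhere between the two because of warm-up time and headroom.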

Up next

Databricks: Design a low-latency fraud scoring platform on Databricks under a strict cloud budget (Medium)

OpenAI: Design a low-latency support copilot for a SaaS customer under strict budget (Easy)

OpenAI: Design a low-latency RAG assistant for OpenAI API docs and incident runbooks (Medium)