You are designing a feature store for a large consumer marketplace that powers multiple ML systems, including search retrieval, ranking, fraud detection, and personalized recommendations. Today, each team computes its features separately, which leads to inconsistent feature definitions across teams, slow experimentation, and frequent training-serving skew. You have been asked to build a shared feature platform that supports both offline training datasets and low-latency online inference. The goal is to make feature computation reusable, fresh, and reliable across models while reducing leakage, drift, and operational overhead.

| Signal | Value |
|---|---|
| DAU | 45M |
| Peak inference QPS across ML services | 220K |
| Models using the feature store | 120 |
| Distinct feature definitions | 8,000 |
| Daily raw events | 9B |
| Online feature freshness target | < 2 minutes |
| Per-request feature lookup budget (p99) | 15ms |
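
The signals above imply a few derived rates that are useful for sizing. The following is a minimal back-of-envelope sketch, not part of the original problem statement: the function name `derived_rates` is hypothetical, and `features_per_request=100` is an assumed illustrative value that the table does not provide. The in-flight estimate applies Little's law (concurrency = arrival rate × time in system) and treats the p99 budget as the time per lookup, so it is a pessimistic figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def derived_rates(daily_events, dau, peak_qps, lookup_budget_s, features_per_request):
    """Derive rough sizing numbers from the headline signals.

    features_per_request is an ASSUMPTION supplied by the caller; it is not
    given in the signals table.
    """
    return {
        # Average ingest rate; real peaks will be higher than the average.
        "avg_events_per_sec": daily_events / SECONDS_PER_DAY,
        # Average raw events generated per daily active user.
        "events_per_user_per_day": daily_events / dau,
        # Little's law: requests in flight if every lookup used the full budget.
        "in_flight_lookups_at_budget": peak_qps * lookup_budget_s,
        # Individual feature values read per second at peak (assumption-driven).
        "feature_values_read_per_sec": peak_qps * features_per_request,
    }


rates = derived_rates(
    daily_events=9_000_000_000,   # Daily raw events (from the table)
    dau=45_000_000,               # DAU (from the table)
    peak_qps=220_000,             # Peak inference QPS (from the table)
    lookup_budget_s=0.015,        # 15 ms p99 lookup budget (from the table)
    features_per_request=100,     # ASSUMED, for illustration only
)

for name, value in rates.items():
    print(f"{name}: {value:,.0f}")
```

Under these inputs, ingest averages roughly 104K events per second, or about 200 events per daily active user per day, and about 3,300 lookups would be in flight at peak if each consumed the full 15 ms budget. The feature-read rate scales linearly with the assumed features per request, so it should be recomputed once real per-model feature counts are known.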

How would you design this feature store end to end so that it supports both model training and online serving at this scale? Explain the architecture, data flow, consistency strategy, evaluation approach, and how you would handle drift, skew, backfills, and failures.