You are supporting a production ML ranking system for a high-traffic digital platform where model quality directly affects revenue and user engagement. The system uses a mix of real-time and batch features from user behavior, item metadata, and contextual signals, and the team has seen silent regressions caused by upstream schema changes and shifting user behavior. You have been asked to design an end-to-end approach for detecting, diagnosing, and responding to feature drift before it materially harms model performance. The solution should work for both fast-changing online features and slower batch-computed features.
| Signal | Value |
|---|---|
| DAU | 18M |
| Peak prediction QPS | 120K |
| Ranked items per request | 150 |
| Total model features | 650 |
| Real-time features | 220 |
| Batch features | 430 |
| Feature freshness SLA | real-time < 5 min; batch < 24 hr |
| p99 inference latency budget | 120 ms |
| Training data retained | 90 days |
How would you design the production ML system so feature drift monitoring is a first-class part of the end-to-end architecture, including how data is generated, served, evaluated, alerted on, and used to trigger mitigation or rollback decisions?