You are responsible for a large recommendation or search system with a retrieval stage and one or more ranking stages. The system serves live traffic, retrains regularly, and depends on both batch and real-time features.
What failure modes would you monitor in a large-scale retrieval and ranking system?
Monitoring across retrieval, ranking, and re-ranking
Feature drift and training-serving skew awareness
Online serving dependencies and fallback design
Product and model quality guardrails, not just infrastructure alerts
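
As one illustration of the feature drift and training-serving skew point, the sketch below computes a Population Stability Index (PSI) between a feature's training-time values and its logged serving-time values. This is a minimal sketch, not a prescribed method: the function name `psi`, the equal-width binning, the `eps` floor, and the 0.25 alert threshold are all assumptions chosen for illustration (0.25 is a common rule of thumb, not a value from the prompt above).

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a training sample (expected)
    and a serving sample (actual) of one numeric feature.

    Bin edges are equal-width over the training range (an assumption);
    serving values outside that range are clamped into the edge bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def histogram(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        total = len(values)
        # Floor each share at eps so the log ratio is always defined.
        return [max(c / total, eps) for c in counts]

    e = histogram(expected)
    a = histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: a feature as seen at training time vs. at serving time.
training_values = [i / 100 for i in range(100)]
serving_values = [v + 0.5 for v in training_values]  # simulated shift

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, tune per feature
if psi(training_values, serving_values) > ALERT_THRESHOLD:
    print("feature drift alert")
```

In practice a check like this would run per feature on a schedule, comparing the training snapshot against values logged at serving time, so that it catches both gradual drift and skew introduced by a divergent online feature pipeline.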