You are responsible for a large recommendation or search system with a retrieval stage and one or more ranking stages. The system serves live traffic, retrains regularly, and depends on both batch and real-time features.
What failure modes would you monitor in a large-scale retrieval and ranking system?