You are responsible for an automated workflow that moves and transforms operational data across internal systems. The workflow normally runs with little manual intervention, but you want a clear plan for when failures recur and affect a large number of records or downstream users.
What would you do if an automated workflow started producing errors at scale?