
You are reviewing a supervised learning pipeline and notice that model quality varies substantially from one retrain to the next. Some of the instability appears to come from bad records, noisy labels, and uneven performance across groups.
How would you actively identify and manage data issues such as outliers, noise, and biases?
- Outliers from impossible values and extreme but valid cases
- Feature noise from bad joins, stale values, and inconsistent units
- Label noise from delayed outcomes or manual review errors
- Bias from representation gaps and unequal model performance across groups
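
One possible starting point for the outlier and bias items is sketched below in Python, using only the standard library. It separates impossible values (outside a domain range) from extreme but valid ones (flagged by a modified z-score based on the median absolute deviation), and it computes accuracy per group so unequal performance becomes visible. The function names, the age range of 0 to 120, the 3.5 cutoff, and the sample data are all illustrative assumptions, not part of the original pipeline.

```python
from collections import defaultdict
from statistics import median


def split_outliers(values, lo, hi, z_cut=3.5):
    """Return (impossible, extreme) lists of indices into `values`.

    impossible: values outside the assumed domain range [lo, hi].
    extreme:    in-range values whose modified z-score exceeds z_cut.
    """
    impossible = [i for i, v in enumerate(values) if not (lo <= v <= hi)]
    valid = [v for v in values if lo <= v <= hi]
    if not valid:
        return impossible, []

    med = median(valid)
    mad = median(abs(v - med) for v in valid)  # median absolute deviation
    if mad == 0:
        # No spread among valid values, so the score is undefined.
        return impossible, []

    extreme = [
        i
        for i, v in enumerate(values)
        if lo <= v <= hi and 0.6745 * abs(v - med) / mad > z_cut
    ]
    return impossible, extreme


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} so gaps between groups can be compared."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {g: hits[g] / totals[g] for g in totals}


# Hypothetical sample data: ages with two impossible entries and one extreme one.
ages = [34, 29, 41, -5, 38, 97, 36, 230, 31]
impossible, extreme = split_outliers(ages, lo=0, hi=120)

# Hypothetical labels, predictions, and group membership.
acc = accuracy_by_group(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 1, 0, 1, 1],
    groups=["a", "a", "a", "b", "b", "b"],
)
```

Keeping the two outlier lists separate matters because they call for different handling: impossible values usually point to an upstream bug and are candidates for correction or removal, while extreme but valid cases may deserve review, capping, or a robust loss rather than deletion. The per-group accuracy is only a first look; small groups give noisy estimates, so the group size should be reported alongside each figure.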