RideWave is a large ride-hailing marketplace (~8M monthly active riders) that recently shipped a redesigned “Schedule a Ride” flow. The team ran a 14-day A/B test and also collected post-ride NPS surveys with an open-text prompt: “What’s the main reason for your score?”.
Leaders are debating whether the redesign improved the experience, and whether the evidence for that lies in the qualitative feedback (themes in the open text such as “confusing UI” or “pricing transparency”) or in the quantitative outcomes (conversion rate and NPS score). Your job is to connect the two: show that you understand the difference between qualitative and quantitative data, and demonstrate how to analyze both rigorously within a single experiment.
A text-analytics pipeline (human-labeled + model-assisted) assigns each comment a binary indicator for whether it mentions the theme “confusing UI”. You are told the classifier has precision = 0.90 (90% of flagged comments truly mention the theme) and recall = 0.80 (it catches 80% of the comments that do), and that performance is stable across variants.
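Precision and recall combine into a back-of-the-envelope correction for the flagged counts: flagged × precision estimates the true positives, and dividing by recall scales up for the mentions the classifier missed. A minimal sketch (the function name is illustrative, not part of any pipeline):

```python
def corrected_theme_count(flagged: int, precision: float, recall: float) -> float:
    """Estimate the true number of comments mentioning a theme.

    flagged * precision -> expected true positives among the flagged comments
    dividing by recall  -> scales up for the true mentions the classifier missed
    """
    return flagged * precision / recall

# With precision = 0.90 and recall = 0.80 the net correction factor is
# 0.90 / 0.80 = 1.125, so 2,190 flagged comments imply ~2,464 actual mentions:
print(round(corrected_theme_count(2190, 0.90, 0.80), 2))  # 2463.75
```

Because the factor is multiplicative and (by assumption) the same in both variants, it rescales the theme counts but cannot flip the direction of a between-variant difference.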
| Metric | Control (A) | Treatment (B) |
|---|---|---|
| Eligible sessions (n) | 120,480 | 119,920 |
| Completed scheduled rides (conversions) | 14,458 | 15,110 |
| NPS responses (m) | 18,240 | 18,010 |
| Mean NPS score | 34.6 | 36.1 |
| Std dev of NPS score | 28.0 | 27.5 |
| Comments flagged “confusing UI” (observed) | 2,190 | 1,845 |
Additional info: each session belongs to a unique user (no repeated measures), randomization is at the user level, and you can treat NPS scores as approximately continuous for inference.
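Since randomization is at the user level and each session is a unique user, the session-level conversions can be compared with a standard two-proportion z-test. A stdlib-only sketch using the table's numbers (`scipy` or `statsmodels` would give the same answer; the helper name is mine):

```python
from math import erf, sqrt

def two_prop_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p, normal approx)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) # standard error under H0
    z = (p2 - p1) / se
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

# Conversions: 14,458/120,480 (A) vs 15,110/119,920 (B)
z, p = two_prop_ztest(14458, 120480, 15110, 119920)
print(f"z = {z:.2f}, p = {p:.2g}")  # z ≈ 4.5: the ~0.6pp lift is unlikely to be noise
```

At these sample sizes the normal approximation is entirely safe; the practical question is whether a ~0.6 percentage-point absolute lift clears the team's minimum rollout threshold, not whether it is statistically detectable.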
You need to decide whether the redesign should roll out globally. Do this by (a) clearly distinguishing qualitative vs quantitative data in this setting, and (b) running appropriate statistical tests on the quantitative summaries while accounting for measurement error in the qualitative theme label.
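Part (b) can be sketched as follows, under stated assumptions: a Welch-style test on the NPS means built from the reported summary statistics (with m ≈ 18k per arm, the t distribution is effectively normal), plus theme rates rescaled by precision/recall. The 0.90/0.80 = 1.125 factor is identical across arms, so it cannot flip the sign of the difference, though this simple scaling understates the extra uncertainty the classifier introduces. All helper names are illustrative:

```python
from math import erf, sqrt

def welch_z(mean1: float, sd1: float, m1: int,
            mean2: float, sd2: float, m2: int) -> tuple[float, float]:
    """Difference in means over the unpooled SE; z ≈ t at these sample sizes."""
    se = sqrt(sd1**2 / m1 + sd2**2 / m2)
    z = (mean2 - mean1) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# NPS: A mean 34.6 (sd 28.0, m=18,240) vs B mean 36.1 (sd 27.5, m=18,010)
z_nps, p_nps = welch_z(34.6, 28.0, 18240, 36.1, 27.5, 18010)
print(f"NPS: z = {z_nps:.2f}, p = {p_nps:.2g}")

# "Confusing UI" rates after correcting counts by precision / recall:
rate_a = 2190 * 0.90 / 0.80 / 18240
rate_b = 1845 * 0.90 / 0.80 / 18010
print(f"corrected theme rate: A = {rate_a:.3f}, B = {rate_b:.3f}")
```

Read together, the quantitative outcomes (conversion, NPS) and the corrected qualitative signal (fewer “confusing UI” mentions in B) should point in the same direction before recommending a global rollout; if they diverge, that divergence itself is the finding to investigate.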