You are designing an ML-assisted medical coding system for a healthcare workflow platform. Coders review patient charts, discharge summaries, physician notes, lab results, and structured encounter data, and the system should recommend diagnosis and procedure codes to speed up review while maintaining high accuracy. The business goal is to reduce manual coding time, improve coding consistency, and minimize downstream claim denials and compliance risk. The system must support both real-time suggestions during chart review and batch pre-coding for newly completed encounters.

| Signal | Value |
|---|---|
| Daily active coders | 18,000 |
| Encounters processed per day | 2.5M |
| Peak online suggestion QPS | 1,200 |
| Historical labeled encounters | 1.1B |
| Active code catalog | 95,000 diagnosis/procedure codes |
| Avg chart text per encounter | 8-20 pages (equivalent) |
| p99 latency budget for online suggestions | 800 ms |

How would you design this system end to end so it can generate accurate code recommendations at scale, balancing retrieval, ranking, and final re-ranking or validation while handling both batch and online use cases? Explain the architecture, model choices, evaluation approach, monitoring, and the main failure modes you would plan for.
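
Before choosing an architecture, it may help to turn the signals in the table into a few derived quantities. The sketch below is only back-of-envelope arithmetic on the stated numbers. The function name `size_estimates` and its output keys are hypothetical, and two simplifications are assumed: encounters arrive evenly over 24 hours, and the labeled history accumulated at today's daily volume. Neither is stated in the prompt, and the per-code figure is a mean that says nothing about how skewed the code distribution is.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def size_estimates(
    daily_active_coders: int = 18_000,
    encounters_per_day: int = 2_500_000,
    labeled_encounters: int = 1_100_000_000,
    catalog_size: int = 95_000,
) -> dict[str, float]:
    """Derive rough sizing figures from the signals table (defaults copied from it)."""
    return {
        # Assumes a flat arrival rate over the day; real traffic is likely bursty.
        "avg_encounters_per_second": encounters_per_day / SECONDS_PER_DAY,
        # Average workload if every encounter were reviewed by one active coder.
        "encounters_per_coder_per_day": encounters_per_day / daily_active_coders,
        # Mean only; rare codes will have far fewer examples than this.
        "mean_labeled_encounters_per_code": labeled_encounters / catalog_size,
        # Assumes history was collected at the current daily volume.
        "days_of_history_at_current_volume": labeled_encounters / encounters_per_day,
    }


estimates = size_estimates()
for name, value in estimates.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions, batch pre-coding averages roughly 29 encounters per second, well below the 1,200 peak online QPS. That gap suggests the online path, with its 800 ms p99 budget, is the tighter serving constraint, while batch work can run on heavier models.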