Product Context
Design the deployment architecture for machine learning models that personalize next-best actions inside Voya Learn, Voya Retire, and related participant digital experiences. For each user session, the system should decide which educational content, retirement guidance prompts, and plan recommendations to surface.
Scale
| Signal | Value |
|---|---|
| Registered users across participant surfaces | 9M |
| Daily active users (DAU) | 1.2M |
| Peak recommendation QPS | 18K |
| Eligible content + offers catalog | 2.5M items |
| New/updated items per day | 40K |
| Per-request latency budget (p99) | 180ms |
| Training events per day | 220M impressions/clicks/conversions |
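
The figures in the table can be turned into rough average rates with a few lines of arithmetic. The sketch below assumes load is spread evenly across 24 hours, which real traffic is not, so the results are averages rather than peaks. The constant and function names are illustrative only.

```python
# Back-of-envelope rates derived from the Scale table.
# Assumption: load is spread evenly over 24 hours, so these are averages, not peaks.
SECONDS_PER_DAY = 24 * 60 * 60

REGISTERED_USERS = 9_000_000
DAILY_ACTIVE_USERS = 1_200_000
CATALOG_ITEMS = 2_500_000
ITEM_UPDATES_PER_DAY = 40_000
TRAINING_EVENTS_PER_DAY = 220_000_000


def per_second(daily_count: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return daily_count / SECONDS_PER_DAY


avg_training_events_per_sec = per_second(TRAINING_EVENTS_PER_DAY)  # about 2,546 events/s
avg_item_updates_per_sec = per_second(ITEM_UPDATES_PER_DAY)        # about 0.46 updates/s
daily_catalog_churn = ITEM_UPDATES_PER_DAY / CATALOG_ITEMS         # 0.016, i.e. 1.6% per day
daily_active_share = DAILY_ACTIVE_USERS / REGISTERED_USERS         # about 0.133
events_per_active_user = TRAINING_EVENTS_PER_DAY / DAILY_ACTIVE_USERS  # about 183 per day
```

Read this way, the catalog changes slowly compared with the event stream, which may be worth keeping in mind when sizing index refreshes separately from event ingestion.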
Task
You are the AI Architect responsible for choosing the tools and technologies used to deploy these models in production. Rather than listing favorite tools, design the end-to-end ML serving system and explain where specific technologies fit.
- Clarify the product, ML, and compliance requirements for personalized recommendations in Voya participant channels.
- Propose a multi-stage architecture for online serving, including candidate retrieval, ranking, and re-ranking or policy layers.
- Choose the deployment stack for training, model registry, feature storage, real-time inference, and batch scoring, and justify why each technology fits Voya's constraints.
- Define how you would handle model rollout, rollback, offline evaluation, online experimentation, and monitoring.
- Identify key failure modes such as feature drift, training-serving skew, stale features, and service degradation, and explain mitigations.
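
One possible shape for the multi-stage serving path named in the task (candidate retrieval, ranking, then a policy layer) is sketched below. It is a toy illustration under stated assumptions, not a prescribed design: the `Item` fields, the function names, and the default stage sizes are all hypothetical, and a real retrieval stage would query an index rather than sort the whole catalog.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Item:
    item_id: str
    kind: str               # hypothetical categories: "education", "guidance", "plan_offer"
    retrieval_score: float  # cheap score, assumed to be precomputed


def retrieve(catalog: Iterable[Item], k: int) -> list[Item]:
    """Stage 1: cut the catalog down to k candidates using the cheap score."""
    return sorted(catalog, key=lambda it: it.retrieval_score, reverse=True)[:k]


def rank(candidates: list[Item], score_fn: Callable[[Item], float], k: int) -> list[Item]:
    """Stage 2: re-order candidates with a more expensive model score."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


def apply_policy(ranked: list[Item], is_allowed: Callable[[Item], bool], slots: int) -> list[Item]:
    """Stage 3: drop items the policy layer rejects, then fill the display slots."""
    return [it for it in ranked if is_allowed(it)][:slots]


def recommend(
    catalog: Iterable[Item],
    score_fn: Callable[[Item], float],
    is_allowed: Callable[[Item], bool],
    retrieve_k: int = 500,
    rank_k: int = 50,
    slots: int = 3,
) -> list[Item]:
    """Run the three stages in order and return the items to show."""
    candidates = retrieve(catalog, retrieve_k)
    ranked = rank(candidates, score_fn, rank_k)
    return apply_policy(ranked, is_allowed, slots)
```

Keeping the policy stage as a separate function after ranking means eligibility and compliance rules can be audited and changed without retraining a model, which is one reason to split the stages this way.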
Constraints
- Recommendations may use behavioral, account, and plan-level features, but must respect financial-services privacy, auditability, and access controls.
- Some features are near-real-time (recent clicks, contribution changes), while others refresh daily (account balances, plan metadata).
- The system must support both online low-latency inference and overnight batch scoring for outbound campaigns.
- Cost matters: GPU usage should be limited to stages where it materially improves business value.
- The design should be resilient enough to serve participant-facing traffic at 99.95% availability, with clearly defined fallback behavior.
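
To make the availability and fallback constraint concrete, the sketch below converts an availability target into a downtime budget and wraps a recommender call in a deadline with a fallback. It is a minimal illustration under assumptions: the function names, the shared thread pool, and the 0.18 s default (taken from the 180ms p99 budget) are hypothetical choices, and a production service would more likely enforce deadlines at the RPC layer.

```python
import concurrent.futures as cf
from typing import Callable, List

# Shared worker pool for primary calls (size is an illustrative assumption).
_POOL = cf.ThreadPoolExecutor(max_workers=4)


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Minutes of allowed unavailability over a window of the given length."""
    return (1.0 - availability) * days * 24 * 60


def recommend_with_fallback(
    primary: Callable[[], List[str]],
    fallback: Callable[[], List[str]],
    timeout_s: float = 0.18,
) -> List[str]:
    """Return the primary result if it arrives in time; otherwise use the fallback.

    Any exception from the primary call, including a timeout, triggers the fallback.
    """
    future = _POOL.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        future.cancel()  # best effort; a call that is already running is not interrupted
        return fallback()


# 99.95% availability over a 30-day window leaves roughly 21.6 minutes of downtime.
budget_30d = downtime_budget_minutes(0.9995, 30)
```

A fallback might return a precomputed, non-personalized list, so that a slow or failing ranking stage degrades the experience rather than breaking it.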