Product Context
FinFlow is a digital payments platform that must make a real-time decision on every card transaction: approve, decline, or send for step-up verification. The service is used by merchants and consumers globally, and model quality directly affects fraud loss, conversion, and customer trust.
Scale
| Signal | Value |
|---|
| DAU | 18M transacting users |
| Peak transaction QPS | 45K requests/sec |
| Average QPS | 18K requests/sec |
| Merchants | 250K active merchants |
| Historical transactions | 9B over 24 months |
| Feature freshness target | < 2 seconds for streaming counters |
| End-to-end latency budget | 80ms p99 |
Task
Design a low-latency, real-time ML decision service on AWS using Python and SageMaker. Your design should address:
- How requests flow from API ingress through feature lookup, candidate decisioning, model scoring, policy checks, and response.
- What should run online vs batch, including feature computation, model training, model deployment, and backfills.
- Which models you would use at each stage, and why a multi-stage architecture is preferable to a single model.
- How you would evaluate the system offline and online, including delayed labels for fraud outcomes.
- How you would handle monitoring, feature drift, training-serving skew, and safe rollback.
- What failure modes you expect at this scale and how the system should degrade gracefully.
Constraints
- The service must return a decision within 80ms p99; hard timeout at 100ms.
- False declines are expensive, but missed fraud has direct financial cost; the platform wants tunable thresholds by merchant risk tier.
- Some labels are delayed by days to weeks due to chargebacks, so training data is partially censored.
- PII and payment data must remain compliant with PCI requirements; minimize sensitive data movement.
- The system must continue serving during partial outages in feature infrastructure or model endpoints.
- Cost matters: the online path should avoid heavyweight per-request deep models unless justified by measurable lift.