Design Checkout Risk Decision Service

Product Context

ShopSphere is a global e-commerce marketplace. During checkout, every payment attempt must be scored in real time to decide whether to approve, step-up authenticate, or block the transaction; downtime or bad predictions directly impact revenue and fraud losses.

Scale

Signal	Value
DAU	45M shoppers
Peak checkout QPS	28K payment attempts/sec
Average QPS	9K/sec
Merchants	1.2M
Active cards / payment instruments	180M
Historical transactions for training	9B over 24 months
End-to-end decision latency budget (p99)	120ms
Availability target	99.99%
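
The scale figures above imply some back-of-envelope numbers worth having in hand before designing. The sketch below derives the downtime allowed by the 99.99% target and the peak-to-average load ratio; the function and variable names are illustrative and not part of the problem statement.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values taken from the scale table.
PEAK_QPS = 28_000
AVG_QPS = 9_000
AVAILABILITY_TARGET = 0.9999


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Minutes of total unavailability allowed over a window of `days` days."""
    return (1.0 - availability) * days * SECONDS_PER_DAY / 60.0


# How much headroom capacity planning needs above the average load.
peak_to_avg_ratio = PEAK_QPS / AVG_QPS

if __name__ == "__main__":
    print(f"Downtime budget per 30 days: {downtime_budget_minutes(AVAILABILITY_TARGET, 30):.2f} min")
    print(f"Downtime budget per 365 days: {downtime_budget_minutes(AVAILABILITY_TARGET, 365):.2f} min")
    print(f"Peak-to-average QPS ratio: {peak_to_avg_ratio:.2f}x")
```

A budget of only a few minutes per month suggests that regional failover and fallback decisioning need to be automatic rather than operator-driven.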

Task

Design a highly available ML system for checkout risk scoring. Your design should address both prediction quality and graceful degradation when dependencies fail.

Clarify the product objective, decision actions, and error costs (false approve vs false decline).
Propose an end-to-end architecture covering feature generation, candidate policy evaluation, ML scoring, and final decisioning.
Define the online and offline paths: training data, feature store, model training cadence, deployment, and rollback.
Explain how you would meet strict availability and latency requirements across regions while avoiding training-serving skew.
Define offline and online evaluation, including business guardrails and segment-level monitoring.
Identify major failure modes, especially dependency outages, feature drift, stale features, and model degradation during traffic spikes.
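
One way to make "graceful degradation" concrete is a final decision step that still returns an action when the model score is missing. The following is a minimal sketch, not a reference answer: the thresholds, the amount-based fallback rule, and the reason codes are all hypothetical placeholders that a real design would tune against the costs of false approves and false declines.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    APPROVE = "approve"
    STEP_UP = "step_up"   # step-up authentication
    BLOCK = "block"


# Hypothetical thresholds, chosen only for illustration.
STEP_UP_THRESHOLD = 0.30
BLOCK_THRESHOLD = 0.85
FALLBACK_STEP_UP_AMOUNT = 500.0


def decide(score: Optional[float], amount: float, on_blocklist: bool) -> Tuple[Action, str]:
    """Return (action, reason_code); `score` is None when ML scoring is unavailable."""
    # Hard policy rules run first and do not depend on the model.
    if on_blocklist:
        return Action.BLOCK, "policy:blocklist"

    # Degraded mode: the model or its features timed out, so fall back to a simple rule.
    if score is None:
        if amount >= FALLBACK_STEP_UP_AMOUNT:
            return Action.STEP_UP, "fallback:high_amount"
        return Action.APPROVE, "fallback:low_amount"

    # Normal mode: map the risk score to one of the three actions.
    if score >= BLOCK_THRESHOLD:
        return Action.BLOCK, "model:high_risk"
    if score >= STEP_UP_THRESHOLD:
        return Action.STEP_UP, "model:medium_risk"
    return Action.APPROVE, "model:low_risk"
```

Returning a reason code alongside the action keeps degraded-mode decisions distinguishable in logs and monitoring, which also helps with the explainability constraint.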

Constraints

The service is in the critical path of checkout: if it is unavailable, carts are abandoned within seconds.
Some labels are delayed: confirmed fraud may arrive days to weeks later via chargebacks.
Regulatory and audit obligations demand explainable decisions and immutable decision logs.
Cross-border traffic creates feature sparsity and cold-start for new merchants, devices, and payment instruments.
Cost matters: the online path must primarily run on CPU, with limited room for expensive deep models.
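
The immutable-log constraint can be approximated in several ways. As one illustration, the sketch below chains each decision record to the previous one with a SHA-256 hash, which makes later edits detectable (tamper-evident rather than truly immutable). The record fields and helper names are assumptions made for this example; a production system would more likely pair this idea with write-once storage.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder hash preceding the first record


def _digest(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(log: List[Dict[str, str]], decision: Dict[str, Any]) -> Dict[str, str]:
    """Append a decision, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    payload = json.dumps(decision, sort_keys=True)  # canonical serialization
    entry = {"prev_hash": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify(log: List[Dict[str, str]]) -> bool:
    """Recompute the chain; any modified or reordered entry breaks it."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log: List[Dict[str, str]] = []
    append_record(log, {"txn_id": "t1", "action": "approve", "reason": "model:low_risk"})
    append_record(log, {"txn_id": "t2", "action": "block", "reason": "policy:blocklist"})
    print(verify(log))
```

Storing the reason code in each record ties the audit trail to the explanation given for the decision.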
