Business Context
BidFlow runs real-time ad auctions for a mobile commerce app, serving roughly 8 million ad impressions per day. The ads team wants a policy that chooses bid levels sequentially to maximize long-term revenue under a daily budget, and they need a candidate to explain how Markov Decision Processes (MDPs) formalize this problem and connect to reinforcement learning.
Dataset
You are given an offline interaction log generated by a historical bidding policy. Each row represents one decision step in an auction session and includes the current state, chosen action, observed reward, and next state.
| Feature Group | Count | Examples |
|---|---|---|
| State features | 9 | hour_of_day, remaining_budget_pct, user_segment, device_type, recent_ctr_10m, auction_competitiveness |
| Action features | 1 | bid_level in {low, medium, high} |
| Reward fields | 2 | click_value, spend |
| Transition fields | 3 | next_remaining_budget_pct, next_recent_ctr_10m, terminal_flag |
- Size: 420K state-action transitions across 30 days
- Target: Learn a policy that maximizes expected discounted cumulative reward
- Reward: `reward = click_value - 0.4 * spend`
- Missing data: ~6% missing in recent CTR features, ~3% missing in auction competitiveness due to logging gaps
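The reward definition and missing-value handling above can be sketched in Python. This is a minimal illustration; the fallback-imputation strategy is an assumption, not something the log specifies:

```python
import math

SPEND_WEIGHT = 0.4  # from the reward definition above


def step_reward(click_value: float, spend: float) -> float:
    """Per-step reward as defined in the dataset spec."""
    return click_value - SPEND_WEIGHT * spend


def impute(value, fallback):
    """Replace a missing feature value (None or NaN in the log) with a
    fallback such as a training-set median -- the fallback is a design choice."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return fallback
    return value
```

In practice the fallback would be computed per feature from the training split so that test-time imputation does not leak information.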
Success Criteria
A strong solution should:
- Correctly define the MDP components: states, actions, transition dynamics, rewards, and discount factor.
- Train a baseline value-based RL model (e.g., Q-learning) on the offline log and derive a greedy policy from it.
- Show that the learned policy improves expected return over a heuristic baseline by at least 8% on a held-out test set.
- Explain why reinforcement learning solves MDPs by learning value functions or policies from interaction data.
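One way to meet the value-model criterion is batch Q-learning over the logged transitions. A minimal tabular sketch, assuming states have already been discretized into hashable keys; the discount factor and learning rate shown are illustrative choices, not values from the spec:

```python
from collections import defaultdict

ACTIONS = ["low", "medium", "high"]
GAMMA = 0.95   # discount factor -- illustrative choice
ALPHA = 0.1    # learning rate -- illustrative choice


def q_learning_from_log(transitions, epochs=5):
    """Batch Q-learning over logged (state, action, reward, next_state, done)
    tuples, replayed for several epochs."""
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    for _ in range(epochs):
        for s, a, r, s_next, done in transitions:
            # Bootstrap from the best next-state value unless terminal.
            target = r if done else r + GAMMA * max(Q[s_next].values())
            Q[s][a] += ALPHA * (target - Q[s][a])
    return Q


def greedy_policy(Q):
    """Derive a policy by picking the highest-valued action in each state."""
    return {s: max(qs, key=qs.get) for s, qs in Q.items()}
```

Because the data is purely offline, actions the logging policy never took in a state keep their initial value of zero; a real solution should discuss this coverage limitation.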
Constraints
- Inference must complete in under 20 ms per auction.
- The solution should be simple enough to explain to product and ads stakeholders.
- Training is offline only; no online exploration is allowed in the interview setting.
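A table-lookup policy clears the 20 ms budget with large headroom, but a fitted model should be checked the same way. A sketch for measuring per-decision latency (the policy and states here are placeholders):

```python
import time


def mean_inference_ms(policy_fn, states):
    """Average wall-clock latency per decision, in milliseconds."""
    start = time.perf_counter()
    for s in states:
        policy_fn(s)
    return 1000.0 * (time.perf_counter() - start) / len(states)
```

For a production check, tail latency (p99) over realistic state payloads matters more than the mean shown here.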
Deliverables
- Define the MDP for this bidding problem.
- Build a tabular or fitted Q-learning baseline from the logged data.
- Evaluate the learned policy against a fixed heuristic policy.
- Explain the relationship between MDPs and reinforcement learning, including key assumptions and limitations.
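For the evaluation deliverable, a simple direct-method comparison can be sketched as follows. The always-medium heuristic is a hypothetical baseline, and `Q` is a state-to-action-value mapping like the one a tabular learner produces; importance-sampling estimators would be more rigorous than this proxy:

```python
def estimated_return(Q, policy_fn, start_states):
    """Direct-method estimate of a policy's value: average the learned
    Q-value of the action the policy would take in each start state."""
    return sum(Q[s][policy_fn(s)] for s in start_states) / len(start_states)


def heuristic_policy(state):
    """Hypothetical fixed baseline: always bid 'medium'."""
    return "medium"
```

The 8% improvement target would then be checked by comparing `estimated_return` for the learned policy against the heuristic on held-out start states, ideally alongside an off-policy estimator that corrects for the logging policy's action distribution.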