Business Context
BidFlow runs real-time ad auctions for a mobile commerce app, serving roughly 8 million ad impressions per day. The ads team wants a policy that chooses bid levels sequentially to maximize long-term revenue under a daily budget, and they need a candidate to explain how Markov Decision Processes (MDPs) formalize this problem and connect to reinforcement learning.
Dataset
You are given an offline interaction log generated by a historical bidding policy. Each row represents one decision step in an auction session and includes the current state, chosen action, observed reward, and next state.
| Feature Group | Count | Examples |
|---|---|---|
| State features | 9 | hour_of_day, remaining_budget_pct, user_segment, device_type, recent_ctr_10m, auction_competitiveness |
| Action features | 1 | bid_level in {low, medium, high} |
| Reward fields | 2 | click_value, spend |
| Transition fields | 3 | next_remaining_budget_pct, next_recent_ctr_10m, terminal_flag |
- Size: 420K state-action transitions across 30 days
- Target: Learn a policy that maximizes expected discounted cumulative reward
- Reward: `reward = click_value - 0.4 * spend`
- Missing data: ~6% missing in recent CTR features, ~3% missing in auction competitiveness due to logging gaps
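The reward definition and missing-value handling above can be sketched in Python. This is a minimal illustration; the fallback-imputation strategy is an assumption, not something the log specifies:

```python
import math

SPEND_WEIGHT = 0.4  # from the reward definition above


def step_reward(click_value: float, spend: float) -> float:
    """Per-step reward as defined in the dataset spec."""
    return click_value - SPEND_WEIGHT * spend


def impute(value, fallback):
    """Replace a missing feature value (None or NaN in the log) with a
    fallback such as a training-set median -- the fallback is a design choice."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return fallback
    return value
```

In practice the fallback would be computed per feature from the training split so that test-time imputation does not leak information.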
Success Criteria
A strong solution should:
- Correctly define the MDP components: states, actions, transition dynamics, rewards, and discount factor.
- Train a baseline value-based RL model (e.g., Q-learning) on the offline log and derive a greedy policy from it.
- Show that the learned policy improves expected return over a heuristic baseline by at least 8% on a held-out test set.
- Explain why reinforcement learning solves MDPs by learning value functions or policies from interaction data.
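One way to meet the value-model criterion is batch Q-learning over the logged transitions. A minimal tabular sketch, assuming states have already been discretized into hashable keys; the discount factor and learning rate shown are illustrative choices, not values from the spec:

```python
from collections import defaultdict

ACTIONS = ["low", "medium", "high"]
GAMMA = 0.95   # discount factor -- illustrative choice
ALPHA = 0.1    # learning rate -- illustrative choice


def q_learning_from_log(transitions, epochs=5):
    """Batch Q-learning over logged (state, action, reward, next_state, done)
    tuples, replayed for several epochs."""
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    for _ in range(epochs):
        for s, a, r, s_next, done in transitions:
            # Bootstrap from the best next-state value unless terminal.
            target = r if done else r + GAMMA * max(Q[s_next].values())
            Q[s][a] += ALPHA * (target - Q[s][a])
    return Q


def greedy_policy(Q):
    """Derive a policy by picking the highest-valued action in each state."""
    return {s: max(qs, key=qs.get) for s, qs in Q.items()}
```

Because the data is purely offline, actions the logging policy never took in a state keep their initial value of zero; a real solution should discuss this coverage limitation.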
Constraints
- Inference must complete in under 20 ms per auction.
- The solution should be simple enough to explain to product and ads stakeholders.
- Training is offline only; no online exploration is allowed in the interview setting.
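A table-lookup policy clears the 20 ms budget with large headroom, but a fitted model should be checked the same way. A sketch for measuring per-decision latency (the policy and states here are placeholders):

```python
import time


def mean_inference_ms(policy_fn, states):
    """Average wall-clock latency per decision, in milliseconds."""
    start = time.perf_counter()
    for s in states:
        policy_fn(s)
    return 1000.0 * (time.perf_counter() - start) / len(states)
```

For a production check, tail latency (p99) over realistic state payloads matters more than the mean shown here.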
Deliverables
- Define the MDP for this bidding problem.
- Build a tabular or fitted Q-learning baseline from the logged data.
- Evaluate the learned policy against a fixed heuristic policy.
- Explain the relationship between MDPs and reinforcement learning, including key assumptions and limitations.
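For the evaluation deliverable, a simple direct-method comparison can be sketched as follows. The always-medium heuristic is a hypothetical baseline, and `Q` is a state-to-action-value mapping like the one a tabular learner produces; importance-sampling estimators would be more rigorous than this proxy:

```python
def estimated_return(Q, policy_fn, start_states):
    """Direct-method estimate of a policy's value: average the learned
    Q-value of the action the policy would take in each start state."""
    return sum(Q[s][policy_fn(s)] for s in start_states) / len(start_states)


def heuristic_policy(state):
    """Hypothetical fixed baseline: always bid 'medium'."""
    return "medium"
```

The 8% improvement target would then be checked by comparing `estimated_return` for the learned policy against the heuristic on held-out start states, ideally alongside an off-policy estimator that corrects for the logging policy's action distribution.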