Business Context
OpenAI is training a large supervised model on a multi-terabyte dataset, and a single run can take 2-5 days across multiple GPUs. Your task is to design and implement checkpoint management so training can resume safely after preemption, node failure, or manual interruption without losing meaningful progress or corrupting model state.
Dataset
You are given a text classification training corpus used for internal moderation research. The ML task itself is standard supervised learning, but the interview focus is robust training-state management during long-running jobs.
| Feature Group | Count | Examples |
|---|---|---|
| Text inputs | 1 | prompt_text |
| Numeric metadata | 6 | token_count, language_confidence, prior_report_rate |
| Categorical metadata | 4 | language, source_surface, policy_area, region |
| Labels | 1 | violation_class |
- Size: 42M examples, ~1.8 TB tokenized training data, 11 classes
- Target: Multiclass classification — policy violation class
- Class balance: Long-tailed; largest class 41%, smallest class 0.3%
- Missing data: ~8% missing metadata fields; text always present
Success Criteria
A strong solution should:
- Resume training from the latest valid checkpoint with no more than 15 minutes of lost progress
- Restore model weights, optimizer state, scheduler state, gradient scaler, RNG state, and data-loader progress
- Detect partial or corrupted checkpoints and never treat them as valid resume points
- Demonstrate that resumed training produces comparable validation loss and macro-F1 to uninterrupted training
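A minimal sketch of what "restore everything" can look like in PyTorch. The function names, the `sampler_state` payload, and the CPU-only RNG handling are illustrative assumptions, not a prescribed API; production code would also capture `random`/NumPy RNG state and per-rank state under DDP.

```python
import torch


def save_checkpoint(path, step, model, optimizer, scheduler, scaler, sampler_state):
    # Bundle every piece of training state needed for bit-for-bit resumption.
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "scaler": scaler.state_dict(),
        "rng": {
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else [],
        },
        # e.g. {"epoch": 3, "batches_consumed": 41000} so the loader can skip ahead
        "sampler": sampler_state,
    }, path)


def load_checkpoint(path, model, optimizer, scheduler, scaler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    scaler.load_state_dict(ckpt["scaler"])
    torch.set_rng_state(ckpt["rng"]["torch"])
    if torch.cuda.is_available() and ckpt["rng"]["cuda"]:
        torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
    return ckpt["step"], ckpt["sampler"]
```

Restoring RNG and data-loader position, not just weights, is what makes resumed validation loss comparable to an uninterrupted run.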
Constraints
- Training runs on preemptible GPU instances
- Checkpoint writes must not stall training for more than a few seconds per save
- Storage budget is limited; you cannot keep every checkpoint forever
- Validation should run every 10K steps; checkpoints every 2K-5K steps
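One way to reconcile frequent saves with the limited storage budget is a retention policy: keep the last few checkpoints for fast recovery, plus periodic "milestone" checkpoints for longer-range rollback. A plain-Python sketch; the `ckpt_step<N>.pt` naming and the `prune_checkpoints` helper are hypothetical choices for this example.

```python
import re
from pathlib import Path


def prune_checkpoints(ckpt_dir, keep_last=3, milestone_every=50_000):
    """Delete checkpoints except the most recent `keep_last` and any
    whose step is a multiple of `milestone_every` (a kept milestone)."""
    pattern = re.compile(r"ckpt_step(\d+)\.pt$")
    ckpts = sorted(
        (int(m.group(1)), p)
        for p in Path(ckpt_dir).iterdir()
        if (m := pattern.match(p.name))
    )
    recent = {step for step, _ in ckpts[-keep_last:]}
    for step, path in ckpts:
        if step not in recent and step % milestone_every != 0:
            path.unlink()
```

Running this after every save bounds storage at roughly `keep_last` checkpoints plus one per `milestone_every` steps, while the most recent checkpoint always survives.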
Deliverables
- Build a training loop in PyTorch with periodic checkpoint save and resume support.
- Implement atomic checkpoint writing, retention, and "latest" checkpoint discovery.
- Show how to validate checkpoint integrity before resuming.
- Evaluate resumed-vs-fresh training using concrete metrics.
- Explain tradeoffs between checkpoint frequency, storage cost, and recovery point objective.
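The atomic-write, integrity-check, and latest-discovery deliverables can be sketched with standard file primitives: write to a temp file, `fsync`, then `os.replace` (atomic on POSIX), and record a SHA-256 checksum in a sidecar so torn writes are detected before resume. The file layout and helper names below are illustrative assumptions, not a fixed interface.

```python
import hashlib
import json
import os
from pathlib import Path


def atomic_write(path, data: bytes):
    """Write to a temp file, fsync, then rename. The rename is atomic on
    POSIX, so readers never observe a partially written file."""
    tmp = str(path) + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def write_checkpoint(ckpt_dir, step, payload: bytes):
    """Atomically write the checkpoint payload plus a checksum sidecar."""
    path = Path(ckpt_dir) / f"ckpt_step{step:09d}.pt"
    atomic_write(path, payload)
    meta = {"step": step, "sha256": hashlib.sha256(payload).hexdigest()}
    atomic_write(path.with_suffix(".json"), json.dumps(meta).encode())


def latest_valid_checkpoint(ckpt_dir):
    """Return the newest checkpoint whose bytes match the recorded
    checksum, falling back to older ones if the newest is corrupt."""
    for meta_path in sorted(Path(ckpt_dir).glob("ckpt_step*.json"), reverse=True):
        meta = json.loads(meta_path.read_text())
        payload_path = meta_path.with_suffix(".pt")
        if payload_path.exists():
            actual = hashlib.sha256(payload_path.read_bytes()).hexdigest()
            if actual == meta["sha256"]:
                return payload_path
    return None
```

Writing the sidecar only after the payload rename succeeds means the presence of a matching sidecar doubles as a commit marker: a crash mid-save leaves either no sidecar or a checksum mismatch, and discovery skips to the previous checkpoint.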