Business Context
OpenAI is training a large supervised model on a multi-terabyte dataset, and a single run can take 2-5 days across multiple GPUs. Your task is to design and implement checkpoint management so training can resume safely after preemption, node failure, or manual interruption without losing meaningful progress or corrupting model state.
Dataset
You are given a text classification training corpus used for internal moderation research. The ML task itself is standard supervised learning, but the interview focus is robust training-state management during long-running jobs.
| Feature Group | Count | Examples |
|---|---|---|
| Text inputs | 1 | prompt_text |
| Numeric metadata | 6 | token_count, language_confidence, prior_report_rate |
| Categorical metadata | 4 | language, source_surface, policy_area, region |
| Labels | 1 | violation_class |
- Size: 42M examples, ~1.8 TB tokenized training data, 11 classes
- Target: Multiclass classification — policy violation class
- Class balance: Long-tailed; largest class 41%, smallest class 0.3%
- Missing data: ~8% missing metadata fields; text always present
Success Criteria
A strong solution should:
- Resume training from the latest valid checkpoint with no more than 15 minutes of lost progress
- Restore model weights, optimizer state, scheduler state, gradient scaler, RNG state, and data-loader progress
- Detect partial or corrupted checkpoints and never treat them as valid resume points
- Demonstrate that resumed training produces comparable validation loss and macro-F1 to uninterrupted training
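A minimal sketch of what "restore everything" can look like in PyTorch. The function names, the `sampler_state` payload, and the CPU-only RNG handling are illustrative assumptions, not a prescribed API; production code would also capture `random`/NumPy RNG state and per-rank state under DDP.

```python
import torch


def save_checkpoint(path, step, model, optimizer, scheduler, scaler, sampler_state):
    # Bundle every piece of training state needed for bit-for-bit resumption.
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "scaler": scaler.state_dict(),
        "rng": {
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else [],
        },
        # e.g. {"epoch": 3, "batches_consumed": 41000} so the loader can skip ahead
        "sampler": sampler_state,
    }, path)


def load_checkpoint(path, model, optimizer, scheduler, scaler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    scaler.load_state_dict(ckpt["scaler"])
    torch.set_rng_state(ckpt["rng"]["torch"])
    if torch.cuda.is_available() and ckpt["rng"]["cuda"]:
        torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
    return ckpt["step"], ckpt["sampler"]
```

Restoring RNG and data-loader position, not just weights, is what makes resumed validation loss comparable to an uninterrupted run.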
Constraints
- Training runs on preemptible GPU instances
- Checkpoint writes must not stall training for more than a few seconds per save
- Storage budget is limited; you cannot keep every checkpoint forever
- Validation should run every 10K steps; checkpoints every 2K-5K steps
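One way to reconcile frequent saves with the limited storage budget is a retention policy: keep the last few checkpoints for fast recovery, plus periodic "milestone" checkpoints for longer-range rollback. A plain-Python sketch; the `ckpt_step<N>.pt` naming and the `prune_checkpoints` helper are hypothetical choices for this example.

```python
import re
from pathlib import Path


def prune_checkpoints(ckpt_dir, keep_last=3, milestone_every=50_000):
    """Delete checkpoints except the most recent `keep_last` and any
    whose step is a multiple of `milestone_every` (a kept milestone)."""
    pattern = re.compile(r"ckpt_step(\d+)\.pt$")
    ckpts = sorted(
        (int(m.group(1)), p)
        for p in Path(ckpt_dir).iterdir()
        if (m := pattern.match(p.name))
    )
    recent = {step for step, _ in ckpts[-keep_last:]}
    for step, path in ckpts:
        if step not in recent and step % milestone_every != 0:
            path.unlink()
```

Running this after every save bounds storage at roughly `keep_last` checkpoints plus one per `milestone_every` steps, while the most recent checkpoint always survives.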
Deliverables
- Build a training loop in PyTorch with periodic checkpoint save and resume support.
- Implement atomic checkpoint writing, retention, and "latest" checkpoint discovery.
- Show how to validate checkpoint integrity before resuming.
- Evaluate resumed-vs-fresh training using concrete metrics.
- Explain tradeoffs between checkpoint frequency, storage cost, and recovery point objective.
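The atomic-write, integrity-check, and latest-discovery deliverables can be sketched with standard file primitives: write to a temp file, `fsync`, then `os.replace` (atomic on POSIX), and record a SHA-256 checksum in a sidecar so torn writes are detected before resume. The file layout and helper names below are illustrative assumptions, not a fixed interface.

```python
import hashlib
import json
import os
from pathlib import Path


def atomic_write(path, data: bytes):
    """Write to a temp file, fsync, then rename. The rename is atomic on
    POSIX, so readers never observe a partially written file."""
    tmp = str(path) + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def write_checkpoint(ckpt_dir, step, payload: bytes):
    """Atomically write the checkpoint payload plus a checksum sidecar."""
    path = Path(ckpt_dir) / f"ckpt_step{step:09d}.pt"
    atomic_write(path, payload)
    meta = {"step": step, "sha256": hashlib.sha256(payload).hexdigest()}
    atomic_write(path.with_suffix(".json"), json.dumps(meta).encode())


def latest_valid_checkpoint(ckpt_dir):
    """Return the newest checkpoint whose bytes match the recorded
    checksum, falling back to older ones if the newest is corrupt."""
    for meta_path in sorted(Path(ckpt_dir).glob("ckpt_step*.json"), reverse=True):
        meta = json.loads(meta_path.read_text())
        payload_path = meta_path.with_suffix(".pt")
        if payload_path.exists():
            actual = hashlib.sha256(payload_path.read_bytes()).hexdigest()
            if actual == meta["sha256"]:
                return payload_path
    return None
```

Writing the sidecar only after the payload rename succeeds means the presence of a matching sidecar doubles as a commit marker: a crash mid-save leaves either no sidecar or a checksum mismatch, and discovery skips to the previous checkpoint.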