Project Context
At OpenAI, a research team has early evidence that a new post-training experiment improves model helpfulness on a narrow internal benchmark. The result was produced on a small-scale run using 64 GPUs and a lightly curated dataset. Leadership wants to know within 8 weeks whether the approach should be scaled into a larger experiment on shared training infrastructure and evaluated for possible use in a future ChatGPT model iteration.
You are the program manager partnering with 4 research scientists, 5 research engineers, 1 data engineer, and a shared infrastructure team. The work is urgent because the next model planning review is in 9 weeks, and this experiment must either show credible signal or be deprioritized.
Key Stakeholders
- Research lead: wants maximum experimental flexibility and fast iteration.
- Infrastructure lead: wants predictable cluster usage and fewer failed jobs on shared capacity.
- Safety evaluation lead: requires expanded red-team and policy checks before any large-scale run.
- Finance: scrutinizing compute spend after two recent over-budget training projects.
Constraints
- Timeline: 8 weeks to recommendation, 9 weeks to executive review
- Compute budget: 18,000 H100 GPU-hours total
- Team capacity: no new headcount; infrastructure support limited to 0.5 FTE
- Data dependency: refreshed preference dataset arrives at end of Week 2
- Shared cluster: no single job may reserve more than 20% of training capacity for over 24 hours
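The budget and reservation constraints above can be sanity-checked for any candidate run shape. A minimal sketch, assuming a hypothetical shared-cluster size of 2,048 GPUs (the brief does not state one) and illustrative job shapes:

```python
# Sanity-check whether a proposed job fits the stated constraints.
# CLUSTER_GPUS and the example job shapes are illustrative assumptions,
# not figures from the brief.

BUDGET_GPU_HOURS = 18_000      # total compute budget (from the brief)
CLUSTER_GPUS = 2_048           # assumed shared-cluster size (hypothetical)
MAX_RESERVE_FRACTION = 0.20    # no single job may hold >20% of capacity...
MAX_RESERVE_HOURS = 24         # ...for more than 24 hours (from the brief)

def fits_constraints(gpus: int, hours: float) -> tuple[bool, str]:
    """Return (ok, reason) for a single proposed job."""
    if gpus * hours > BUDGET_GPU_HOURS:
        return False, (
            f"needs {gpus * hours:,.0f} GPU-hours, "
            f"budget is {BUDGET_GPU_HOURS:,}"
        )
    if gpus > MAX_RESERVE_FRACTION * CLUSTER_GPUS and hours > MAX_RESERVE_HOURS:
        return False, "reserves >20% of the cluster for more than 24 hours"
    return True, "fits budget and reservation policy"

# A 512-GPU job for 30 hours fits the GPU-hour budget (15,360 < 18,000)
# but violates the reservation cap (512 > 0.20 * 2048 = 409.6 for >24h).
print(fits_constraints(512, 30))
# A 256-GPU job for 48 hours fits both (12,288 GPU-hours, under the cap).
print(fits_constraints(256, 48))
```

Under these assumptions, long jobs above roughly 410 GPUs must either be split into sub-24-hour segments or checkpointed and resumed, which is worth agreeing with the infrastructure lead before Week 4.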
Complications
- A parallel multimodal training project has already pre-booked 35% of cluster capacity in Weeks 4-6.
- The small-scale result is not yet reproduced; one prior rerun showed only half the original gain.
- Safety eval tooling in OpenAI Evals is missing coverage for one key failure mode the policy team cares about.
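The capacity pre-booking above interacts with the 20% per-job cap: it limits how many capped jobs the team can run concurrently in Weeks 4-6. A rough planning sketch, assuming all non-pre-booked capacity is schedulable (an illustrative simplification):

```python
import math

# Per-week concurrency headroom under the 20% per-job cap, given the
# multimodal project's 35% pre-booking in Weeks 4-6. The assumption that
# all remaining capacity is schedulable by this team is illustrative.

JOB_CAP = 0.20  # per-job reservation cap for jobs running >24 hours
PREBOOKED = {week: (0.35 if 4 <= week <= 6 else 0.0) for week in range(1, 9)}

def max_capped_jobs(week: int) -> int:
    """Max number of 20%-cap jobs that fit in the week's free capacity."""
    free = 1.0 - PREBOOKED[week]
    return math.floor(free / JOB_CAP + 1e-9)  # epsilon guards float rounding

for week in sorted(PREBOOKED):
    print(f"Week {week}: up to {max_capped_jobs(week)} concurrent 20% jobs")
```

By this arithmetic, concurrency drops from five capped jobs to three in Weeks 4-6, which argues for front-loading sweep-heavy work into Weeks 1-3 and reserving Weeks 4-6 for fewer, larger confirmation runs.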
Your Task
- Build an 8-week execution plan to scale the experiment from pilot to decision-ready evidence.
- Define how you would collaborate with research engineers on sequencing, ownership, and compute trade-offs.
- Propose success criteria, stage gates, and a rollback or stop condition if results are weak.
- Identify the top risks and mitigations across compute, data, evaluation, and stakeholder alignment.
- Recommend what to present at the Week 9 executive review: scale further, narrow scope, or stop.