Project Context
At OpenAI, a research team has early evidence that a new post-training experiment improves model helpfulness on a narrow internal benchmark. The result was produced on a small-scale run using 64 GPUs and a lightly curated dataset. Leadership wants to know within 8 weeks whether the approach should be scaled into a larger experiment on shared training infrastructure and evaluated for possible use in a future ChatGPT model iteration.
You are the program manager partnering with 4 research scientists, 5 research engineers, 1 data engineer, and a shared infrastructure team. The work is urgent because the next model planning review is in 9 weeks, and this experiment must either show credible signal or be deprioritized.
Key Stakeholders
- Research lead: wants maximum experimental flexibility and fast iteration.
- Infrastructure lead: wants predictable cluster usage and fewer failed jobs on shared capacity.
- Safety evaluation lead: requires expanded red-team and policy checks before any large-scale run.
- Finance: scrutinizing compute spend after two recent over-budget training projects.
Constraints
- Timeline: 8 weeks to recommendation, 9 weeks to executive review
- Compute budget: 18,000 H100 GPU-hours total
- Team capacity: no new headcount; infrastructure support limited to 0.5 FTE
- Data dependency: refreshed preference dataset arrives at end of Week 2
- Shared cluster: no single job may reserve more than 20% of training capacity for over 24 hours
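The budget and reservation constraints above can be sanity-checked for any candidate run shape. A minimal sketch, assuming a hypothetical shared-cluster size of 2,048 GPUs (the brief does not state one) and illustrative job shapes:

```python
# Sanity-check whether a proposed job fits the stated constraints.
# CLUSTER_GPUS and the example job shapes are illustrative assumptions,
# not figures from the brief.

BUDGET_GPU_HOURS = 18_000      # total compute budget (from the brief)
CLUSTER_GPUS = 2_048           # assumed shared-cluster size (hypothetical)
MAX_RESERVE_FRACTION = 0.20    # no single job may hold >20% of capacity...
MAX_RESERVE_HOURS = 24         # ...for more than 24 hours (from the brief)

def fits_constraints(gpus: int, hours: float) -> tuple[bool, str]:
    """Return (ok, reason) for a single proposed job."""
    if gpus * hours > BUDGET_GPU_HOURS:
        return False, (
            f"needs {gpus * hours:,.0f} GPU-hours, "
            f"budget is {BUDGET_GPU_HOURS:,}"
        )
    if gpus > MAX_RESERVE_FRACTION * CLUSTER_GPUS and hours > MAX_RESERVE_HOURS:
        return False, "reserves >20% of the cluster for more than 24 hours"
    return True, "fits budget and reservation policy"

# A 512-GPU job for 30 hours fits the GPU-hour budget (15,360 < 18,000)
# but violates the reservation cap (512 > 0.20 * 2048 = 409.6 for >24h).
print(fits_constraints(512, 30))
# A 256-GPU job for 48 hours fits both (12,288 GPU-hours, under the cap).
print(fits_constraints(256, 48))
```

Under these assumptions, long jobs above roughly 410 GPUs must either be split into sub-24-hour segments or checkpointed and resumed, which is worth agreeing with the infrastructure lead before Week 4.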
Complications
- A parallel multimodal training project has already pre-booked 35% of cluster capacity in Weeks 4-6.
- The small-scale result is not yet reproduced; one prior rerun showed only half the original gain.
- Safety eval tooling in OpenAI Evals is missing coverage for one key failure mode the policy team cares about.
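The capacity pre-booking above interacts with the 20% per-job cap: it limits how many capped jobs the team can run concurrently in Weeks 4-6. A rough planning sketch, assuming all non-pre-booked capacity is schedulable (an illustrative simplification):

```python
import math

# Per-week concurrency headroom under the 20% per-job cap, given the
# multimodal project's 35% pre-booking in Weeks 4-6. The assumption that
# all remaining capacity is schedulable by this team is illustrative.

JOB_CAP = 0.20  # per-job reservation cap for jobs running >24 hours
PREBOOKED = {week: (0.35 if 4 <= week <= 6 else 0.0) for week in range(1, 9)}

def max_capped_jobs(week: int) -> int:
    """Max number of 20%-cap jobs that fit in the week's free capacity."""
    free = 1.0 - PREBOOKED[week]
    return math.floor(free / JOB_CAP + 1e-9)  # epsilon guards float rounding

for week in sorted(PREBOOKED):
    print(f"Week {week}: up to {max_capped_jobs(week)} concurrent 20% jobs")
```

By this arithmetic, concurrency drops from five capped jobs to three in Weeks 4-6, which argues for front-loading sweep-heavy work into Weeks 1-3 and reserving Weeks 4-6 for fewer, larger confirmation runs.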
Your Task
- Build an 8-week execution plan to scale the experiment from pilot to decision-ready evidence.
- Define how you would collaborate with research engineers on sequencing, ownership, and compute trade-offs.
- Propose success criteria, stage gates, and a rollback or stop condition if results are weak.
- Identify the top risks and mitigations across compute, data, evaluation, and stakeholder alignment.
- Recommend what to present at the Week 9 executive review: scale further, narrow scope, or stop.