On June 17, 2026, OpenAI published LifeSciBench, a benchmark built with 173 practicing scientists to answer a question that glossier AI evals skip: can this model do the messy work of drug discovery, not just recite biology?
The same week brought Deployment Simulation for pre-launch safety and OpenAI's positioning of GPT-Rosalind in life sciences. LifeSciBench is the scoreboard for that bet.
Headline result: GPT-Rosalind reaches a 36.1% strict pass rate versus 25.7% for GPT-5.5. That is meaningful progress, but most tasks remain unsolved.
TL;DR
| Metric | Value |
|---|---|
| Tasks | 750 expert-authored |
| Scientists | 173 contributors, 453 reviewers |
| Artifacts | 1,062 files (figures, PDFs, sequences, etc.) |
| Rubric criteria | 19,020 (~25 per task) |
| GPT-Rosalind strict pass rate | 36.1% |
| GPT-5.5 strict pass rate | 25.7% |
| Tacit Labs | Nicole Fitzgerald's applied lab for AI × biology |
| Announcement | openai.com/index/introducing-life-sci-bench |
Why Life Science Needs Its Own Benchmark
Most science benchmarks test isolated skills:
- Multiple-choice biology
- Single-step predictions
- Clean reference answers
Real biotech work looks like:
- Interpreting incomplete Phase 1/2 data for an FDA Type B meeting
- Reconciling conflicting assay readouts
- Designing CRISPR donors under operational constraints
- Explaining uncertainty to a skeptical reviewer
LifeSciBench tasks read like emails to a senior scientist collaborator: each has a prompt, context artifacts, and a free-response answer that is graded against a rubric.
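As a rough illustration of that task shape, here is a minimal Python sketch. The class names, field names, file names, workflow label, and rubric wording are all hypothetical and chosen for readability; they are not taken from the benchmark's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class RubricCriterion:
    """One gradable statement (illustrative structure, not OpenAI's schema)."""
    description: str
    points: float


@dataclass
class Task:
    """Hypothetical container for the four parts of a task described above."""
    workflow: str                                         # e.g. one of the seven workflows
    prompt: str                                           # the email-style request
    artifacts: list[str] = field(default_factory=list)    # figures, PDFs, sequences
    rubric: list[RubricCriterion] = field(default_factory=list)

    @property
    def needs_artifacts(self) -> bool:
        # True when the prompt text alone is not enough to answer
        return len(self.artifacts) > 0

    @property
    def max_points(self) -> float:
        # Total points available across all rubric criteria
        return sum(c.points for c in self.rubric)


# Invented example values, for illustration only
example = Task(
    workflow="Translation",
    prompt="Critique this accelerated-approval package before our FDA meeting.",
    artifacts=["western_blot.png", "nsaa_summary.pdf"],
    rubric=[
        RubricCriterion("Flags assay specificity concerns", 2.0),
        RubricCriterion("Questions external control bias", 1.0),
    ],
)
print(example.needs_artifacts, example.max_points)
```

Keeping the rubric attached to the task, rather than a single reference answer, reflects the free-response grading the benchmark describes.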
Seven Workflows Measured
OpenAI grouped industry survey responses into seven recurring workflows:
| Workflow | Example demand |
|---|---|
| Evidence handling | Extract, reconcile, audit papers and records |
| Analysis | Quantitative interpretation with caveats |
| Design, optimization & prediction | Experiments, constructs, assays |
| Scientific reasoning | Mechanism, hypothesis, conflict resolution |
| Validation & operations | Lab execution, troubleshooting |
| Translation | Bench → bedside, regulatory framing |
| Scientific communication | Expert-facing writeups |
79% of tasks need multiple reasoning steps (4 on average), and 53% require artifacts rather than prompt text alone.
Dataset Construction (Why Experts Trust It)
Rigor signals:
- 173 task authors with PhD training + biotech/pharma experience
- ~6 automated review cycles per task (avg)
- ≥2 expert review rounds
- ≥90% reviewer agreement in-domain
- 453 independent validators (97% hold PhD+)
Reviewer agreement on benchmark quality: 96%+ in all categories (real-world relevance, reasoning test, grounding, usefulness).
This is closer to contract research organization (CRO) review than crowd-sourced QA.
Grading: 19,020 Rubric Criteria
Pass threshold: 70% rubric score per task.
Science rarely reduces to one correct string, so rubrics score several things:
- Correct claims (points awarded)
- Missing assay limitations (an implicit failure)
- Wrong evidence weighting
- Format expected by regulators or PI review
Partial credit matters: on ~14% of tasks, models earn ≥50% of the rubric score yet miss the 70% pass threshold. Those answers are useful but not deployable on their own.
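The pass rule and the partial-credit band can be sketched in a few lines of Python. This is a simplified model, not OpenAI's grader: it assumes each criterion carries a non-negative point value, that a task's score is points earned divided by points available, and that the 70% threshold is inclusive. All function names are hypothetical.

```python
PASS_THRESHOLD = 0.70   # from the text: 70% rubric score per task
PARTIAL_FLOOR = 0.50    # from the text: the >=50% partial-credit band


def rubric_score(earned: list[float], possible: list[float]) -> float:
    """Fraction of available rubric points earned on one task."""
    total = sum(possible)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    return sum(earned) / total


def classify(score: float) -> str:
    """Bucket a task score into pass, partial, or fail."""
    if score >= PASS_THRESHOLD:
        return "pass"
    if score >= PARTIAL_FLOOR:
        return "partial"   # useful draft, but below the strict pass bar
    return "fail"


def strict_pass_rate(scores: list[float]) -> float:
    """Share of tasks whose score clears the pass threshold."""
    return sum(s >= PASS_THRESHOLD for s in scores) / len(scores)


# Invented scores for four tasks, for illustration only
scores = [0.82, 0.64, 0.31, 0.70]
print([classify(s) for s in scores], strict_pass_rate(scores))
```

A strict pass rate counts only the first bucket, which is why a model can produce many partially useful answers and still post a low headline number.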
Example Task Flavor (DMD Gene Therapy)
LifeSciBench publishes a critique task on a Duchenne muscular dystrophy (DMD) accelerated-approval package for a micro-dystrophin AAV9 therapy, with Western blot, immunofluorescence, and North Star Ambulatory Assessment (NSAA) functional data.
A strong answer flags:
- Assay specificity (MANEX1A epitope sharing)
- Invalid standards (138 kDa vs full-length dystrophin)
- Revertant fiber confounding
- External control bias on NSAA
- Surrogate endpoint validity
Outputs from GPT-Rosalind and similar models must pressure-test the package like a skeptical FDA reviewer, not summarize the press release.
Results: Where GPT-Rosalind Wins and Loses
Overall
| Model | Pass rate | Notes |
|---|---|---|
| GPT-Rosalind | 36.1% | Life-science-tuned |
| GPT-5.5 | 25.7% | General frontier |
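The headline comparison is simple arithmetic, worked here in Python using only the two pass rates from the table. The variable names are illustrative.

```python
rosalind = 36.1   # strict pass rate, percent
gpt55 = 25.7      # strict pass rate, percent

absolute_gain = rosalind - gpt55         # percentage points
relative_gain = absolute_gain / gpt55    # improvement as a fraction of the baseline
unsolved = 100 - rosalind                # share of tasks GPT-Rosalind still fails

print(f"{absolute_gain:.1f} points, {relative_gain:.0%} relative, {unsolved:.1f}% unsolved")
```

The absolute gap is 10.4 points, which is about a 40% relative improvement over GPT-5.5, while roughly 64% of tasks remain unsolved.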
Strongest workflows (Rosalind gains)
| Workflow or task category | GPT-5.5 | GPT-Rosalind |
|---|---|---|
| Scientific Communication | 56.3% | 71.1% (n=9, small sample) |
| Translation | 36.8% | 57.7% |
| Expert-useful outputs | 29.1% | 44.7% |
| Uncertainty handling | 29.3% | 44.8% |
Models do best when tasks have clear evidence boundaries and need structured judgment.
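To see why the n=9 caveat on Scientific Communication matters, a Wilson score interval gives a rough sense of the uncertainty. This sketch makes several assumptions that the source does not confirm: each task is an independent pass/fail trial, the same nine tasks underlie both models' figures, and a 95% level (z = 1.96) is appropriate.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion observed on n items."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Scientific Communication pass rates from the table, with the assumed n=9
rosalind_lo, rosalind_hi = wilson_interval(0.711, 9)
gpt55_lo, gpt55_hi = wilson_interval(0.563, 9)

print(f"GPT-Rosalind: {rosalind_lo:.2f} to {rosalind_hi:.2f}")
print(f"GPT-5.5:      {gpt55_lo:.2f} to {gpt55_hi:.2f}")
```

Under those assumptions the two intervals overlap heavily, so the Scientific Communication lead is suggestive rather than conclusive.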
Weakest areas
| Challenge | GPT-Rosalind pass |
|---|---|
| Design / Optimization / Prediction | ~30.7% |
| Analysis | ~30.3% |
| With artifacts/URLs | 28.1% (vs 45.1% text-only) |
| Exact numeric outputs | 14.8% |
| Sequence/structure outputs | 24.0% |
| Construct generation | 27.3% |
The artifact gap is the story: models struggle to read complex figures, parse large sequence files, and synthesize them into decisions, which is exactly the material wet labs produce daily.
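A quick consistency check shows how much the artifact gap drives the overall score. The Python sketch below assumes that the 53% of tasks described as requiring artifacts is the same group as the "With artifacts/URLs" row, which the source does not state explicitly.

```python
artifact_share = 0.53        # share of tasks requiring artifacts (from the text)
pass_with_artifacts = 28.1   # GPT-Rosalind pass rate on those tasks, percent
pass_text_only = 45.1        # GPT-Rosalind pass rate on text-only tasks, percent

# Weighted average of the two subgroups
blended = artifact_share * pass_with_artifacts + (1 - artifact_share) * pass_text_only
gap = pass_text_only - pass_with_artifacts

print(f"blended {blended:.1f}%, gap {gap:.1f} points")
```

The blended figure comes out at about 36.1%, matching the reported overall pass rate, and the gap between text-only and artifact tasks is 17.0 points. If the assumption holds, closing that gap would lift the headline number more than any single workflow improvement.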
GPT-Rosalind and Tacit Labs
OpenAI pairs the benchmark with GPT-Rosalind, a life-sciences-oriented model line (see also Rosalind Biodefense product threads from May 2026).
Nicole Fitzgerald (@ninklefitz), formerly of Microsoft Research and Databricks Mosaic AI, announced Tacit Labs the same day. It is an applied research lab for AI and autonomous biotech tooling.
LifeSciBench measures models; Tacit Labs builds systems that might sit in real R&D workflows. The two are complementary, not redundant.
Benchmark Politics on X
@scaling01 noted that OpenAI compares against xAI and Google but not Anthropic, framing it as a shift in the competitive narrative.
Community context, per Wired reporting: Anthropic restricts API use for competitor benchmarking, which makes cross-vendor LifeSciBench tables asymmetric.
@teortaxesTex criticized the inclusion of Grok in the charts. Methodology debates will continue as labs pick favorable comparators.
Relation to Deployment Simulation
The two were released one day apart (June 16–17):
| Tool | Question |
|---|---|
| Deployment Simulation | How will the model behave in ChatGPT traffic? |
| LifeSciBench | Can the model do PhD-level biotech tasks? |
Together they pair operational safety forecasting with domain capability measurement, forming OpenAI's pre-release stack for high-stakes verticals.
Contrast this with ALE, which covers agent autonomy, and the Fable 5 cyber evals, which concern security politics.
Limitations (OpenAI's)
- Not live lab validation — tasks are self-contained
- No multi-week iterative science
- Specialty coverage incomplete
- Exact-output tasks brittle (formatting vs science)
- Benchmark ≠ discovery impact
Next step: deployment studies in real research programs.
Who Should Care
| Audience | Takeaway |
|---|---|
| Biotech / pharma AI teams | Rubric-heavy evals match how QA and regulatory think |
| Model labs | Artifact-heavy multimodal science remains wide open |
| Investors | A 36% pass rate is far from automating drug discovery |
| Developers | GPT-Rosalind API access is available via OpenAI's contributor program |
OpenAI invites scientist contributors and GPT-Rosalind access requests via the announcement page.
Summary
LifeSciBench is the most serious public attempt yet to grade AI on industry-shaped biology work—FDA skepticism, assay traps, translation—not textbook drills.
GPT-Rosalind leads GPT-5.5 by 10.4 points on strict pass rate but still fails most tasks. What separates it from production use is artifact handling, exact construct generation, and live iteration.
Tacit Labs signals that OpenAI is not stopping at benchmarks: it wants tools inside labs, not just chatbots that read Nature abstracts.
Related Reading
- OpenAI Deployment Simulation
- GPT-5.6 Release Guide
- Agents' Last Exam (ALE)
- Why AI Works — Mechanistic Interpretability
- US Government Bans Fable 5
Benchmark statistics cited from OpenAI LifeSciBench announcement (June 17, 2026).