What is OpenAI LifeSciBench?

LifeSciBench is an expert-written benchmark announced June 17, 2026, with 750 tasks across seven biological research workflows and seven domains. It measures whether AI can support real biotech/pharma work—evidence handling, experimental design, translation, communication—not just biology trivia. Developed with 173 PhD-level scientists; graded with 19,020 rubric criteria.

How does GPT-Rosalind perform on LifeSciBench?

GPT-Rosalind achieves a 36.1% strict pass rate (a task passes at a 70% rubric score) versus 25.7% overall for GPT-5.5. The strongest gains appear in the Scientific Communication and Translation workflows. The weakest areas include Design/Optimization/Prediction (~31% pass) and artifact-heavy tasks (28.1% pass with files vs 45.1% text-only).

What is Tacit Labs?

Tacit Labs is a new company founded by ex-Microsoft Research scientist Nicole Fitzgerald (@ninklefitz), announced alongside LifeSciBench. It focuses on applied research at the intersection of AI and biology, building tools for autonomous biotech lab workflows that complement OpenAI's GPT-Rosalind push.

What makes LifeSciBench different from other science benchmarks?

Tasks mirror requests to a knowledgeable collaborator: scientific prompts, attached artifacts (figures, PDFs, sequences, structures), free-response answers, and granular rubrics averaging 25 criteria per task. 79% require multiple reasoning steps; 53% need interpreting attached files—not prompt text alone.

Can Anthropic models be benchmarked on LifeSciBench?

OpenAI built LifeSciBench with industry scientists; third-party comparisons to Claude or other labs' models depend on API access policies. Community context on X notes that Anthropic restricts API use for benchmarking competitors, which affects cross-vendor leaderboard narratives.

What are LifeSciBench limitations?

Self-contained tasks do not capture iterative lab research over weeks. Strong benchmark scores do not prove downstream discovery impact. Models still fail most exact-output tasks (numeric, sequence/structure, and construct outputs pass only ~15–27% of the time), which are critical for CRISPR donors and siRNA design. Deployment validation in live R&D settings is the stated next step.

LifeSciBench: GPT-Rosalind Life Science AI Benchmark | explainx.ai Blog

On June 17, 2026, OpenAI published LifeSciBench — a benchmark built with 173 practicing scientists to answer a question glossier AI evals skip: Can this model do the messy work of drug discovery—not just recite biology?

The same week brought Deployment Simulation for pre-launch safety and GPT-Rosalind positioning in life sciences. LifeSciBench is the scoreboard for that bet.

Headline result: GPT-Rosalind reaches a 36.1% strict pass rate versus 25.7% for GPT-5.5. That is meaningful progress, but most tasks remain unsolved.

TL;DR

Metric	Value
Tasks	750 expert-authored
Scientists	173 contributors, 453 reviewers
Artifacts	1,062 files (figures, PDFs, sequences, etc.)
Rubric criteria	19,020 (~25 per task)
GPT-Rosalind pass	36.1%
GPT-5.5 pass	25.7%
Tacit Labs	Nicole Fitzgerald — AI × biology applied lab
Paper	openai.com/index/introducing-life-sci-bench

Why Life Science Needs Its Own Benchmark

Most science benchmarks test isolated skills:

Multiple-choice biology
Single-step predictions
Clean reference answers

Real biotech work looks like:

Interpreting incomplete Phase 1/2 data for an FDA Type B meeting
Reconciling conflicting assay readouts
Designing CRISPR donors under operational constraints
Explaining uncertainty to a skeptical reviewer

LifeSciBench tasks read like emails to a senior scientist collaborator: each has a prompt, context artifacts, and a free-response answer that is graded against a rubric.

Seven Workflows Measured

OpenAI grouped industry survey responses into seven recurring workflows:

Workflow	Example demand
Evidence handling	Extract, reconcile, audit papers and records
Analysis	Quantitative interpretation with caveats
Design, optimization & prediction	Experiments, constructs, assays
Scientific reasoning	Mechanism, hypothesis, conflict resolution
Validation & operations	Lab execution, troubleshooting
Translation	Bench → bedside, regulatory framing
Scientific communication	Expert-facing writeups

79% of tasks need multiple reasoning steps (avg 4 steps). 53% require artifacts—not prompt text alone.

Dataset Construction (Why Experts Trust It)

Rigor signals:

173 task authors with PhD training + biotech/pharma experience
~6 automated review cycles per task (avg)
≥2 expert review rounds
≥90% reviewer agreement in-domain
453 independent validators (97% hold PhD+)

Reviewer agreement on benchmark quality: 96%+ in all categories (real-world relevance, reasoning test, grounding, usefulness).

This is closer to contract research organization (CRO) review than crowd-sourced QA.

Grading: 19,020 Rubric Criteria

Pass threshold: 70% rubric score per task.

Science rarely reduces to one correct string. Rubric criteria cover:

Correct claims (points earned)
Missing assay limitations (points lost implicitly when the criterion goes unmet)
Wrong evidence weighting
The format expected by regulators or PI review

Partial credit matters: on ~14% of tasks, models earn ≥50% of the rubric score while still missing the pass threshold. Those answers are useful but not deployable alone.
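The pass rule and the partial-credit band described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's grader: the article gives only the 70% pass threshold and the ≥50% partial band, so the point weights, the `Criterion` class, the "fraction of points earned" scoring rule, and the example criteria are all assumptions made for the sketch.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70     # pass threshold stated in the article
PARTIAL_THRESHOLD = 0.50  # lower edge of the "useful but not deployable" band


@dataclass
class Criterion:
    description: str
    points: float  # hypothetical weight; the article does not publish weights
    met: bool      # whether the grader judged the criterion satisfied


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of available rubric points earned (assumed scoring rule)."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def classify(score: float) -> str:
    """Bucket a task score using the two thresholds quoted in the article."""
    if score >= PASS_THRESHOLD:
        return "pass"
    if score >= PARTIAL_THRESHOLD:
        return "partial"
    return "fail"


# Hypothetical rubric for a single task (illustrative criteria only).
example = [
    Criterion("States the correct primary claim", 3, True),
    Criterion("Notes the assay's specificity limitation", 2, False),
    Criterion("Weights conflicting evidence appropriately", 3, True),
    Criterion("Uses the format a reviewer expects", 2, False),
]
score = rubric_score(example)
print(f"{score:.2f} -> {classify(score)}")
```

In this made-up example the answer earns 6 of 10 points, so it lands in the partial band: it gets the main claims right but misses the assay caveat and the expected format, which is the kind of answer the ~14% figure describes.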

Example Task Flavor (DMD Gene Therapy)

LifeSciBench publishes one example task: a critique of a Duchenne muscular dystrophy accelerated-approval package for a micro-dystrophin AAV9 gene therapy, supported by Western blot, immunofluorescence, and North Star Ambulatory Assessment (NSAA) functional data.

A strong answer flags:

Assay specificity (MANEX1A epitope sharing)
Invalid standards (138 kDa vs full-length dystrophin)
Revertant fiber confounding
External control bias on NSAA
Surrogate endpoint validity

GPT-Rosalind-style outputs must pressure-test like a skeptical FDA reviewer—not summarize the press release.

Results: Where GPT-Rosalind Wins and Loses

Overall

Model	Pass rate	Notes
GPT-Rosalind	36.1%	Life-science-tuned
GPT-5.5	25.7%	General frontier
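To put the two headline numbers in context, the short sketch below derives the absolute gap, the relative gain, and the share of tasks still failing. It uses only the pass rates quoted in the table; the function name `compare_pass_rates` and the dictionary keys are illustrative choices, not anything from the LifeSciBench release.

```python
def compare_pass_rates(model_pct: float, baseline_pct: float) -> dict[str, float]:
    """Derive simple comparison figures from two pass rates given in percent."""
    absolute_gap = model_pct - baseline_pct
    return {
        "absolute_gap_points": absolute_gap,            # percentage points
        "relative_gain": absolute_gap / baseline_pct,   # fraction of baseline
        "unsolved_pct": 100.0 - model_pct,              # tasks not passed
    }


# Pass rates quoted in the article: GPT-Rosalind 36.1%, GPT-5.5 25.7%.
summary = compare_pass_rates(36.1, 25.7)
print(f"absolute gap:  {summary['absolute_gap_points']:.1f} points")
print(f"relative gain: {summary['relative_gain']:.1%}")
print(f"still failing: {summary['unsolved_pct']:.1f}% of tasks")
```

The gap is about 10.4 points, or roughly a 40% improvement relative to GPT-5.5, while close to 64% of tasks still fail the strict threshold.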

Strongest workflows (Rosalind gains)

Workflow / category	GPT-5.5	GPT-Rosalind
Scientific Communication	56.3%	71.1% (n=9, small sample)
Translation	36.8%	57.7%
Expert-useful outputs	29.1%	44.7%
Uncertainty handling	29.3%	44.8%

Models do best when tasks have clear evidence boundaries and need structured judgment.
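The "n=9" caveat on Scientific Communication deserves a number. The sketch below computes a Wilson score interval for a 71.1% pass rate over 9 tasks. It assumes each task is an independent pass/fail trial and treats the reported rate as a simple proportion; the article does not say how the rate was aggregated (for example, across repeated runs), so read the result as an illustration of small-sample uncertainty rather than an official confidence interval.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Scientific Communication: 71.1% pass rate reported on n=9 tasks.
low, high = wilson_interval(0.711, 9)
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 40% to 90%, which is wide enough that the Scientific Communication lead should be treated as suggestive rather than settled.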

Weakest areas

Challenge	GPT-Rosalind pass
Design / Optimization / Prediction	~30.7%
Analysis	~30.3%
With artifacts/URLs	28.1% (vs 45.1% text-only)
Exact numeric outputs	14.8%
Sequence/structure outputs	24.0%
Construct generation	27.3%

The artifact gap is the story: models struggle to read complex figures and large sequence files and to synthesize them into decisions, which is exactly the material wet labs produce daily.
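One way to see how much the artifact gap drives the headline score is to blend the two subgroup pass rates by their share of the benchmark. The sketch below assumes the 53% of tasks that "require artifacts" are the same tasks as the "With artifacts/URLs" row, which the article does not state explicitly; `blended_pass_rate` is a hypothetical helper name.

```python
def blended_pass_rate(share_with_artifacts: float,
                      rate_with: float,
                      rate_without: float) -> float:
    """Weighted average of two subgroup pass rates (rates in percent)."""
    if not 0.0 <= share_with_artifacts <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return (share_with_artifacts * rate_with
            + (1.0 - share_with_artifacts) * rate_without)


# Figures from the article: 53% of tasks need artifacts;
# GPT-Rosalind passes 28.1% with artifacts vs 45.1% text-only.
blended = blended_pass_rate(0.53, 28.1, 45.1)
gap = 45.1 - 28.1
print(f"blended pass rate: {blended:.1f}%")
print(f"artifact gap:      {gap:.1f} points")
```

The blend comes out at about 36.1%, matching the reported overall pass rate, so the subgroup numbers are at least consistent with the assumption. It also shows the lever: closing the 17-point artifact gap would raise the overall score toward the text-only rate.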

GPT-Rosalind and Tacit Labs

OpenAI pairs the benchmark with GPT-Rosalind — a life-sciences-oriented model line (see also Rosalind Biodefense product threads from May 2026).

Nicole Fitzgerald (@ninklefitz), formerly Microsoft Research and Databricks Mosaic AI, announced Tacit Labs the same day—an applied research lab for AI + autonomous biotech tooling.

LifeSciBench measures models; Tacit Labs builds systems that might sit in real R&D workflows—complementary, not redundant.

Benchmark Politics on X

@scaling01 noted that OpenAI compared against xAI and Google, not Anthropic, and framed this as a shift in competitive narrative.

Community context (Wired reporting): Anthropic restricts API use for competitor benchmarking—making cross-vendor LifeSciBench tables asymmetric.

@teortaxesTex criticized including Grok in charts—methodology debates will continue as labs pick favorable comparators.

Relation to Deployment Simulation

Released one day apart (June 16–17):

Tool	Question
Deployment Simulation	How will the model behave in ChatGPT traffic?
LifeSciBench	Can the model do PhD-level biotech tasks?

Together: operational safety forecasting + domain capability measurement — OpenAI's pre-release stack for high-stakes verticals.

Contrast with ALE (agent autonomy) and Fable 5 cyber evals (security politics).

Limitations (OpenAI's)

Not live lab validation — tasks are self-contained
No multi-week iterative science
Specialty coverage incomplete
Exact-output tasks brittle (formatting vs science)
Benchmark ≠ discovery impact

Next step: deployment studies in real research programs.

Who Should Care

Audience	Takeaway
Biotech / pharma AI teams	Rubric-heavy evals match how QA and regulatory think
Model labs	Artifact-heavy multimodal science remains wide open
Investors	36% pass = far from automating drug discovery
Developers	GPT-Rosalind API access via OpenAI contributor program

OpenAI invites scientist contributors and GPT-Rosalind access requests via the announcement page.

Summary

LifeSciBench is the most serious public attempt yet to grade AI on industry-shaped biology work—FDA skepticism, assay traps, translation—not textbook drills.

GPT-Rosalind leads GPT-5.5 by 10+ points on pass rate but fails most tasks. The gap to production is artifacts, exact constructs, and live iteration.

Tacit Labs, announced the same day, signals that the push is not stopping at benchmarks: the goal is tools inside labs, not just chatbots that read Nature abstracts.
