
Anthropic VirBench: Why Biological Agents Need Deterministic Tools Like gget virus (2026)

Anthropic VirBench: biological agents trail coding agents until you add gget virus. Deterministic NCBI retrieval lifted every agent to at least 90% accuracy on viral queries, from a low of 16.9% without it to a peak of 99.7% with it.

13 min read · Yash Thakker
AI Agents · Life Sciences · Anthropic · Bioinformatics · Agent Harness

On June 8, 2026, Anthropic published "Paving the way for agents in biology"—an essay by Laura Luebbert (Broad Institute, FutureHouse) arguing that biological data infrastructure must be redesigned for agents, not just human browser clicks.

The case study is sharp: task state-of-the-art scientific agents (Claude, GPT, Biomni, Edison Analysis) with retrieving viral sequences from NCBI Virus—the database behind outbreak surveillance, diagnostic assay design, and protein model training data. Even frontier models failed reproducibility tests. Accuracy jumped to nearly 100% once the team added gget virus, a deterministic retrieval layer built with NCBI collaborators.

The lesson extends far beyond virology: agents need boring, reliable tools underneath creative reasoning—the same pattern loop engineering and harness engineering teach for coding agents.

TL;DR

| Question | Answer |
|---|---|
| What was tested? | VirBench: 120 queries, 40 pathogens, manually verified ground-truth counts. |
| Worst agent accuracy (alone)? | 16.9% mean (Claude Sonnet 4). Best: 91.3% (GPT-5.5), still not enough for science. |
| With gget virus? | ≥90% for all agents; 99.7% peak (GPT-5.5). Variability largely gone. |
| Why it matters now? | May 2026 Bundibugyo Ebola outbreak in DRC: genomic answers depend on correct sequence retrieval first. |
| Broader lesson? | Build deterministic execution layers; let models hypothesize, not reinvent pagination. |
| Full paper? | arXiv:2606.06749 (Nasri et al., 2026). |

The hill town problem: biology wasn't built for agents

Laura Luebbert opens with an analogy: using AI agents on today's biological data is like driving through an old Italian hill town designed before cars—beautiful, thoughtful, but full of narrow winding streets (idiosyncratic file formats, scattered databases, one-off scripts).

Software, by contrast, was built for cars:

  • Paved roads → version control
  • Clear lanes → documented APIs
  • Standardized signals → package managers
  • Fast start-to-finish travel → testable outputs (a GitHub patch that passes CI)

Coding agents advanced quickly because the infrastructure matches agent needs. Biological agents lag because retrieval and validation layers are brittle, heterogeneous, and process-dependent, and because biology offers few simple, verifiable rewards comparable to "tests pass."

The bottleneck is not only reasoning. It is the absence of widespread deterministic execution layers for querying biological data. A scientist can express intent ("find all human kinases with this domain and pull their structures"), but agents lack a dependable, repeatable path to the databases.

In biology, small retrieval errors have severe downstream consequences:

  • Wrong genome build → invalid coordinates
  • Mixing RefSeq and GenBank unintentionally
  • Treating partial genomes as complete
  • Confusing segment names in segmented viruses
  • Missing records due to inconsistent metadata fields

It does not matter how powerful the model is if the route depends on local knowledge hidden in a web UI.
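One of the errors listed above, unintentionally mixing RefSeq and GenBank records, can be caught mechanically. The sketch below is a minimal illustration, not part of gget virus: it relies on the RefSeq convention that accessions begin with a two-letter prefix and an underscore (for example `NC_`), while INSDC/GenBank nucleotide accessions contain no underscore. The function names are hypothetical, and the accessions appear only to illustrate the two formats.

```python
import re

# RefSeq accessions start with two letters and an underscore (e.g. NC_002549.1).
# INSDC/GenBank nucleotide accessions have no underscore (e.g. KJ660346.2).
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def split_by_source(accessions):
    """Partition accessions into (refseq, other) by accession format."""
    refseq, other = [], []
    for acc in accessions:
        (refseq if REFSEQ_PREFIX.match(acc) else other).append(acc)
    return refseq, other


def assert_single_source(accessions):
    """Raise if a result set mixes RefSeq and non-RefSeq accessions."""
    refseq, other = split_by_source(accessions)
    if refseq and other:
        raise ValueError(
            f"mixed sources: {len(refseq)} RefSeq and {len(other)} non-RefSeq records"
        )


if __name__ == "__main__":
    print(split_by_source(["NC_002549.1", "KJ660346.2"]))
```

A check like this does not decide which source is correct for a given analysis; it only makes the mix visible so that it becomes a deliberate choice.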


Karpathy's "click tax" — the same pain in software

This mismatch is not unique to biology. Luebbert cites Andrej Karpathy's talk on software in the AI era: he vibe-coded a small web app quickly, then lost a week on authentication, payments, and deployment—clicking through browser dashboards.

"The code was the easiest part! Most of the work was in the browser, clicking things."

Documentation kept saying "go to this URL, click this dropdown." Karpathy's conclusion: nobody should have to do this—we must build for agents.

Anthropic's virology case study is the biological version of that complaint. Long before agents, computational biologists built partial fixes—Biopython, BioPerl, Entrez Direct, BioMart, gget—to move data out of browsers into scriptable workflows. But biological data still lives in a messy network of roads, each with its own identifiers, conventions, and degree of programmatic access.


Case study: NCBI Virus and the May 2026 Ebola outbreak

NCBI Virus aggregates viral sequence records from GenBank, RefSeq, and the international INSDC ecosystem (NCBI, ENA, DDBJ), including Pathoplexus—behind a searchable web interface.

Virology labs pass around long lists of complex filters that users manually reproduce in the browser. Exactly the workflow Karpathy described—except the stakes are public health.

Bundibugyo virus, DRC, May 2026

On May 14, 2026, INRB Kinshasa analyzed 13 blood samples and confirmed Bundibugyo virus disease in eight the next day. An Ebola outbreak was declared. By May 29, WHO reported 1,000+ confirmed and suspected cases and 200+ deaths in the DRC.

Researchers generated the first near-complete outbreak genomes, establishing a new spillover event. Public health officials need immediate answers:

  1. How different is this virus from prior Ebola viruses?
  2. Do existing diagnostics still detect it?
  3. Will existing therapeutics still protect patients?

Answering these requires comparing new genomes against historical Ebola records in NCBI Virus and Pathoplexus. The first steps should be automatable. Instead, they often involve manual browser filtering, hoping the dataset is complete and correct.

Much of NCBI Virus's filtering logic lives only in the web interface. A seasoned virologist might take a few clicks for SARS-CoV-2 surface glycoprotein sequences from 2025. Programmatically, it can require a multi-hundred-line script gluing REST, Datasets, and E-utilities APIs—paginating, reconciling identifiers, downloading hundreds of gigabytes, then filtering locally.

Even when APIs exist, agents struggle when:

  • API filtering ≠ web UI semantics
  • Metadata fields are poorly documented
  • Identifiers differ across sources
  • "The right answer" depends on expert conventions machines must infer

VirBench: 120 queries, ground-truth counts, three runs each

To measure the gap, Luebbert's team built VirBench—documented in the preprint "Deterministic access to global viral sequence data enables robust agentic scientific discovery" (Nasri et al., 2026).

Benchmark design

| Property | Detail |
|---|---|
| Queries | 120 realistic viral sequence retrievals |
| Pathogens | 40, from broad family searches to accession lookups |
| Filters per query | 1–9 simultaneous (median 6); up to 16 filter types |
| Expected counts | 0 to 3,226 sequences (median 22) |
| Ground truth | Manually verified via NCBI Virus web interface |
| Use cases | Surveillance, diagnostic assay design, protein model training data |
| Contributors | 58 queries from Sabeti Lab diagnostics team |

Example query (Ebolavirus):

Retrieve viral sequences from NCBI for TaxID 3052462 (Orthoebolavirus zairense (ZEBOV)) with: host organism human; geographic location Africa; collected 01/01/2014–06/20/2014; minimum sequence length 15,200 bases; maximum 1,900 ambiguous characters (N's); exclude lab-passaged samples.

Agents tested

Evaluated February 26, 2026:

  • Claude Sonnet 4 and Claude Opus 4.7 (Anthropic Messages API)
  • GPT-5.2-pro and GPT-5.5 (OpenAI Responses API, web search + code execution)
  • Biomni OSS v0.0.8 (Claude Sonnet 4 backend)
  • Edison Analysis (Edison client SDK)

Each query ran three independent times per agent to test reproducibility.


What happened when agents tried alone

Performance varied widely—and even the best model was not reliably good enough for dataset construction, where the effective bar is 100%.

Accuracy without gget virus

| Agent | Mean accuracy | Stability (σ=μ threshold) |
|---|---|---|
| Claude Sonnet 4 | 16.9% | Low |
| Biomni OSS | 22.5% | Low |
| Edison Analysis | 40.0% | Moderate |
| GPT-5.2-pro | 67.1% | Moderate |
| Claude Opus 4.7 | 83.2% | 0.93 |
| GPT-5.5 | 91.3% | 1.00 |

Newer frontier models improved substantially—but residual errors remain consequential. A missing or incorrect record can determine whether a diagnostic assay appears to cover circulating diversity, or whether an outbreak is inferred to have started weeks earlier or later.

The reproducibility problem

The same model often returned different answers on identical prompts. For the example Ebolavirus query, Claude Sonnet 4 returned:

  • Run 1: 106 sequences (expected: 266)
  • Run 2: 15 sequences
  • Run 3: 5 sequences

That undermines both accuracy and reproducibility—requirements for any scientific workflow.
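The two requirements can be scored separately. The sketch below uses the three Claude Sonnet 4 counts quoted above and treats a run as correct only when its count equals the expected count exactly. That scoring rule is an assumption made for illustration; the preprint's own accuracy and stability definitions are not reproduced here, and `score_runs` is a hypothetical helper.

```python
def score_runs(expected, observed):
    """Score repeated runs of one query against a ground-truth count.

    accuracy:     fraction of runs whose count matches exactly (assumed rule)
    reproducible: True only if every run returned the same count
    """
    exact = [count == expected for count in observed]
    return {
        "accuracy": sum(exact) / len(observed),
        "reproducible": len(set(observed)) == 1,
        # Returned count as a fraction of the expected count. This is not
        # recall: the returned records could themselves be the wrong ones.
        "count_ratio": [round(count / expected, 3) for count in observed],
    }


if __name__ == "__main__":
    # Ebolavirus example from the text: expected 266; runs returned 106, 15, 5.
    print(score_runs(266, [106, 15, 5]))
```

Keeping the two scores apart matters because an agent can be perfectly reproducible and still wrong, or correct once and inconsistent afterwards.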

When wrong retrieval changes biology

Anthropic illustrated downstream impact with two analyses:

Phylogenetic trees (TMRCA): A manually curated NCBI dataset inferred a January 2014 time to most recent common ancestor for the 2014 West African Ebola epidemic—consistent with prior literature. Agent-retrieved sets produced trees pushing TMRCA to 1922, or shifting it to April 2014 by missing Guinea sequences—changing inferred outbreak timing.

Therapeutic epitopes: For antibody candidates maftivimab and MBP134, three Sonnet 4 runs produced three different impressions of mutation variability in target regions—because underlying sequence sets were incomplete or wrong.

Failure modes

Agents often understood the task but lacked machine-actionable execution:

  • Under-counted when pagination stopped early (Influenza A, HIV-1, SARS-CoV-2)
  • Over-counted when filters were applied incorrectly
  • Struggled with metadata fields whose meaning depends on context (e.g., geographic info stored in virusName rather than location)
  • Performance degraded beyond 3–4 simultaneous filters

Answers could look plausible while being wrong—especially dangerous because sequence retrieval is usually step one in a long pipeline.
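The pagination failure in particular has a simple mechanical guard: keep following page tokens until none remain, then compare the number of records collected with the total the server reports. The sketch below runs against an in-memory stand-in for a paginated API. The response fields (`records`, `total_count`, `next_page_token`) and both function names are assumptions for illustration, not the schema of any specific NCBI endpoint.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every page, then fail loudly if the result looks truncated."""
    records = []
    token = None
    while True:
        page = fetch_page(token, page_size)
        records.extend(page["records"])
        token = page.get("next_page_token")
        if not token:
            break
    total = page["total_count"]
    if len(records) != total:
        raise RuntimeError(f"collected {len(records)} of {total} records")
    return records


def make_fake_backend(n_records):
    """In-memory stand-in for a paginated API (hypothetical response shape)."""
    data = [f"SEQ{i:04d}" for i in range(n_records)]

    def fetch_page(token, page_size):
        start = int(token) if token else 0
        end = min(start + page_size, len(data))
        return {
            "records": data[start:end],
            "total_count": len(data),
            "next_page_token": str(end) if end < len(data) else None,
        }

    return fetch_page


if __name__ == "__main__":
    print(len(fetch_all(make_fake_backend(266), page_size=100)))
```

The point of the final check is that an early stop turns into an error instead of a plausible-looking short list.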


gget virus: the deterministic layer

The team developed gget virus in collaboration with NCBI researchers—not a simple API wrapper, but a system that reproduces NCBI Virus web-interface behavior across fragmented backends.

What it coordinates

  • NCBI Datasets REST API — lightweight metadata
  • NCBI Datasets CLI — cached bulk packages for SARS-CoV-2 and Influenza A
  • E-utilities — GenBank records for protein-level filters
  • Local filtering — when web UI semantics aren't exposed programmatically
  • Batching and retry logic — comprehensive retrieval without arbitrary cutoffs
  • Standardized outputs + logs — auditable, human- and machine-readable

The preprint reports >98% data transfer reduction for representative high-volume queries by applying metadata constraints before sequence download.
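The idea behind that reduction, filtering on lightweight metadata before downloading any sequence, can be sketched in a few lines. The example below is not gget virus code: the `Metadata` class, the toy catalog, and `plan_download` are all hypothetical, and the thresholds are borrowed from the example Ebolavirus query earlier in the post. The toy reduction figure it computes says nothing about the preprint's measurement.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Metadata:
    accession: str
    host: str
    length: int       # sequence length in bases
    ambiguous: int    # number of ambiguous characters (N's)
    collected: date


def passes(m):
    """Filters from the example query: human host, length, N's, date window."""
    return (
        m.host == "human"
        and m.length >= 15200
        and m.ambiguous <= 1900
        and date(2014, 1, 1) <= m.collected <= date(2014, 6, 20)
    )


def plan_download(catalog):
    """Return accessions worth downloading and the fraction of bases avoided."""
    selected = [m for m in catalog if passes(m)]
    total_bases = sum(m.length for m in catalog)
    kept_bases = sum(m.length for m in selected)
    avoided = 1 - kept_bases / total_bases if total_bases else 0.0
    return [m.accession for m in selected], avoided


# Toy catalog with made-up accessions, for illustration only.
CATALOG = [
    Metadata("TOY0001", "human", 18959, 0, date(2014, 3, 17)),
    Metadata("TOY0002", "human", 18959, 2500, date(2014, 4, 1)),   # too many N's
    Metadata("TOY0003", "bat", 18900, 10, date(2014, 2, 2)),       # wrong host
    Metadata("TOY0004", "human", 12000, 5, date(2014, 5, 5)),      # too short
    Metadata("TOY0005", "human", 18950, 0, date(2015, 1, 10)),     # outside window
]

if __name__ == "__main__":
    print(plan_download(CATALOG))
```

Only the accessions returned by `plan_download` would then be passed to the expensive sequence download step.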

Install and basic usage

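# Install the gget Python package, which provides the gget command-line tool
pip install gget

# Example: Zaire ebolavirus with filters. This mirrors the VirBench example
# query above, except that it does not set the lab-passaged exclusion.
# Flag names are shown as printed in this post; confirm them with
# `gget virus --help` before relying on them.
gget virus "Zaire ebolavirus" \
  --host human \
  --geo_location Africa \
  --collection_date_after 2014-01-01 \
  --collection_date_before 2014-06-20 \
  --min_seq_length 15200 \
  --max_n 1900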

Documentation: gget virus module (Pachter Lab)

Written by Ferdous Nasri; developed with Sarah Gurev, Patrick Varilly, Krithik Ramesh, Nuala A. O'Leary, Jonah Cool, Bernhard Y. Renard, Pardis Sabeti, and Laura Luebbert.


Results with gget virus: model choice mattered less

When agents were instructed to use gget virus, the picture changed dramatically:

| Agent | Accuracy without gget | Accuracy with gget |
|---|---|---|
| Claude Sonnet 4 | 16.9% | 92.8% |
| Biomni OSS | 22.5% | 90.0% |
| Edison Analysis | 40.0% | 93.1% |
| GPT-5.2-pro | 67.1% | 98.9% |
| Claude Opus 4.7 | 83.2% | 98.3% |
| GPT-5.5 | 91.3% | 99.7% |

Run-to-run variability was largely eliminated (stability 0.92–1.00), and the performance gap between models narrowed sharply. Adding a deterministic retrieval layer made model choice much less important: a cheaper model with the right tool can beat an expensive model fighting messy APIs alone. Claude Sonnet 4 with gget virus (92.8%), for example, outscored GPT-5.5 working without it (91.3%).
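The narrowing can be read directly off the table. The short calculation below uses only the twelve percentages reported above; the variable names are just for illustration.

```python
# (accuracy without gget, accuracy with gget), in percent, from the table above.
RESULTS = {
    "Claude Sonnet 4": (16.9, 92.8),
    "Biomni OSS": (22.5, 90.0),
    "Edison Analysis": (40.0, 93.1),
    "GPT-5.2-pro": (67.1, 98.9),
    "Claude Opus 4.7": (83.2, 98.3),
    "GPT-5.5": (91.3, 99.7),
}

# Percentage-point gain per agent.
gains = {agent: round(w - wo, 1) for agent, (wo, w) in RESULTS.items()}

# Gap between the best and worst agent, before and after adding the tool.
without = [wo for wo, _ in RESULTS.values()]
with_tool = [w for _, w in RESULTS.values()]
spread_without = round(max(without) - min(without), 1)
spread_with = round(max(with_tool) - min(with_tool), 1)

if __name__ == "__main__":
    print(gains)
    print(spread_without, spread_with)  # 74.4 9.7
```

The gap between the best and worst agent shrinks from 74.4 percentage points to 9.7, and the largest single gain (75.9 points) goes to the agent that started lowest.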

One notable run: GPT-5.5 independently discovered and used gget virus on one query despite not being prompted to, producing the only correct answer for that question among 360 runs.

Remaining errors

Residual failures shifted from "can't access data reliably" to "agent misused the tool":

  • Incorrect local filtering after download
  • Partial processing of large FASTA files
  • Reverting to alternative APIs despite instructions
  • Wrong parameters on gget calls

The retrieval layer worked; agent invocation and output preservation still need guardrails—echoing @mosyaseen's loop-engineering point: you need something that can say no.
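One such guardrail targets the partial-processing failure: before accepting an agent's answer, recount the records in the FASTA file it worked from and reject the answer if the numbers disagree. The sketch below counts header lines (lines starting with `>`), which is how FASTA delimits records. The function names are hypothetical and this is not part of gget virus.

```python
def count_fasta_records(fasta_text):
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def verify_reported_count(fasta_text, reported):
    """Reject an agent-reported sequence count that disagrees with the file."""
    actual = count_fasta_records(fasta_text)
    if actual != reported:
        raise ValueError(f"agent reported {reported} sequences, file has {actual}")
    return actual


if __name__ == "__main__":
    example = ">seq1\nACGT\n>seq2\nGGCC\nTTAA\n"
    print(verify_reported_count(example, 2))
```

For very large files the same count can be taken by streaming the file line by line instead of loading it into memory.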


The highway under the hill town

Anthropic returns to the city analogy: gget virus is a highway tunnel under pedestrian infrastructure—on-ramps, interchanges, exit numbers tied to known mile markers.

Karpathy's prescription applies directly: "make [genomic data] accessible to agents."

Creative work—hypothesis generation, experimental design, mechanism reasoning—should stay with models. The layer underneath must be boringly reliable:

  • Gene identifiers
  • Schemas
  • Retrieval logic
  • Coordinate systems
  • Metadata conventions
  • Data access paths

Broader ecosystem

gget virus joins a growing set of context engines for scientific agents:

| System | Role |
|---|---|
| ToolUniverse | Tool aggregation for biomedical agents |
| Edison Scientific Robin | Research agent with tool harness |
| Biomni | General-purpose biomedical agent |
| gget virus | Deterministic viral sequence retrieval |

The design question: where does determinism belong, and how do you build it so agents can invoke it without brittle post-processing?

As Nils Homer noted (cited in Anthropic's footnotes): "AI assistants need to work with your code, your outputs, and your analysis logic"—so agents can inspect how data was retrieved, not just what was returned.


Will better models make tools obsolete?

Anthropic addresses the obvious objection: if you extrapolate model curves, agents might eventually navigate messy portals alone.

Maybe. But even if an agent can fight through a confusing bioinformatics workflow, that does not mean it should every time:

  • Too expensive (token burn on pagination)
  • Too slow (multi-hour API gluing)
  • Too hard to audit (no retrieval logs)
  • Too hard to trust (plausible wrong counts)

If today's harnesses become obsolete, the lesson for database maintainers holds: design for agents as scaled users—explicit filtering semantics, stable identifiers, machine-readable logs, deterministic endpoints.

This parallels software agent discourse from the same week: Peter Steinberger's loop tweet argued engineers should design loops and skills, not re-prompt agents from scratch every time. Biology's version: design deterministic retrieval tools rather than letting each agent reinvent NCBI pagination.


Implications for agent builders

1. Separate reasoning from retrieval

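Let the model plan and interpret. Let deterministic tools fetch, filter, and log. VirBench shows that a model handling retrieval on its own falls short of the dataset-construction bar even at 91.3% mean accuracy, and that weaker agents also fail run-to-run reproducibility.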

2. Test at 100%, not "pretty good"

For dataset construction, 91% is not passing. Benchmark with ground-truth counts, multiple runs, and downstream analyses (phylogenetics, epitope mapping)—not just "did it return something."

3. Build agent-accessible interfaces

Biological databases need:

  • Filtering semantics matching web UIs
  • Documented metadata fields with examples
  • Pagination that cannot silently truncate
  • Logs showing how results were produced
  • Stable identifiers across sources
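As one illustration of the logging item, a retrieval layer can emit a small fingerprint for every query so that two runs can be compared without diffing the full outputs. The sketch below hashes a canonical form of the filters and of the returned accession set. The record layout and the name `retrieval_fingerprint` are assumptions for illustration, not an existing gget virus or NCBI log format.

```python
import hashlib
import json


def retrieval_fingerprint(filters, accessions):
    """Build an audit record that is identical for identical retrievals."""
    # Canonical JSON: sorted keys and fixed separators, so key order and
    # whitespace in the caller's dict do not change the hash.
    canonical_query = json.dumps(filters, sort_keys=True, separators=(",", ":"))
    # Sort and de-duplicate so the order of results does not matter.
    unique = sorted(set(accessions))
    canonical_result = "\n".join(unique)
    return {
        "filters": filters,
        "count": len(unique),
        "query_sha256": hashlib.sha256(canonical_query.encode()).hexdigest(),
        "result_sha256": hashlib.sha256(canonical_result.encode()).hexdigest(),
    }


if __name__ == "__main__":
    record = retrieval_fingerprint(
        {"host": "human", "min_seq_length": 15200},
        ["TOY0002", "TOY0001"],  # made-up accessions
    )
    print(json.dumps(record, indent=2))
```

Two runs of the same query that produce different `result_sha256` values are, by construction, not reproducible, which is exactly the signal a reviewer needs.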

4. Connectors and MCP for science

Software teams solve this with MCP servers and Claude connectors. Life sciences needs the same pattern: thin, deterministic tool surfaces agents call instead of browsing.

5. Cheaper models + right tool > frontier model alone

VirBench's most practical finding: gget virus democratized accuracy. Reliable science should not require the newest or most expensive model—or insider knowledge of which model handles which database best.


Summary

Anthropic's June 8, 2026 biology agents essay makes a precise claim: coding agents outran biological agents because software infrastructure was built for programmatic access, and biology wasn't.

VirBench proved it with numbers:

  • 120 queries, 40 pathogens, agents alone: 16.9%–91.3% accuracy with dangerous run-to-run variance
  • gget virus added: ≥90% for every agent, 99.7% peak, variability largely gone
  • Wrong retrieval changed phylogenetic outbreak dates and therapeutic epitope conclusions

The prescription is not "wait for smarter models." It is build deterministic execution layers—boring, auditable, repeatable—and let agents be creative on top.

For outbreak response in the DRC, diagnostic assay design, and protein model training data, that infrastructure is not a nice-to-have. It is the difference between a plausible-looking wrong answer and science you can trust.


Published June 9, 2026. VirBench metrics and outbreak statistics from Anthropic's June 8, 2026 post and Nasri et al. arXiv:2606.06749—verify against upstream before citing in research or public-health contexts.
