VirBench is a benchmark of 120 manually verified viral sequence retrieval queries spanning 40 pathogens, created by Laura Luebbert's team at Anthropic and collaborators. It tests whether AI agents can reproduce NCBI Virus web-interface filters programmatically—with ground-truth counts for surveillance, diagnostic assay design, and training-data construction.

How did AI agents perform on VirBench without gget virus?

Mean accuracy ranged from 16.9% (Claude Sonnet 4) to 91.3% (GPT-5.5) across Biomni OSS, Edison Analysis, Claude, and GPT agents. The same model often returned different counts on identical prompts across three runs—unacceptable for scientific dataset construction where the bar is effectively 100%.

What is gget virus and how much did it improve agent accuracy?

gget virus is a deterministic command-line tool that reproduces NCBI Virus filtering across REST, Datasets, and E-utilities APIs. When agents used it, accuracy rose to 90% or higher for every system and peaked at 99.7% for GPT-5.5, with run-to-run variability largely eliminated, according to the arXiv preprint (2606.06749).

Why do coding agents advance faster than biological agents?

Software infrastructure offers version control, documented APIs, package managers, and testable outputs (e.g., a patch that passes CI). Biological databases like NCBI Virus were designed for browser workflows with implicit conventions, scattered APIs, and filtering logic that lives only in web UIs.

What is the Karpathy connection in Anthropic's biology agents post?

Andrej Karpathy reported that vibe-coding a web app was easy but auth, payments, and deployment required a week of browser dashboard clicking. Anthropic uses this as a parallel to virologists manually reproducing complex NCBI Virus filters—environments built for humans, not agents.

How do I install and use gget virus?

Install gget via pip (pip install gget), then run gget virus with a taxon name or ID plus filter flags. Example: gget virus 'Zaire ebolavirus' --host human --geo_location Africa. See the Pachter Lab docs at pachterlab.github.io/gget/en/virus.html.

Anthropic VirBench: Why Biological Agents Need | explainx.ai Blog

On June 8, 2026, Anthropic published "Paving the way for agents in biology"—an essay by Laura Luebbert (Broad Institute, FutureHouse) arguing that biological data infrastructure must be redesigned for agents, not just human browser clicks.

The case study is sharp: task state-of-the-art scientific agents (Claude, GPT, Biomni, Edison Analysis) with retrieving viral sequences from NCBI Virus—the database behind outbreak surveillance, diagnostic assay design, and protein model training data. Even frontier models failed reproducibility tests. Accuracy jumped to nearly 100% once the team added gget virus, a deterministic retrieval layer built with NCBI collaborators.

The lesson extends far beyond virology: agents need boring, reliable tools underneath creative reasoning—the same pattern loop engineering and harness engineering teach for coding agents.

TL;DR

Question	Answer
What was tested?	VirBench — 120 queries, 40 pathogens, manually verified ground-truth counts.
Worst agent accuracy (alone)?	16.9% mean (Claude Sonnet 4). Best: 91.3% (GPT-5.5)—still not enough for science.
With gget virus?	≥90% all agents; 99.7% peak (GPT-5.5). Variability largely gone.
Why it matters now?	May 2026 Bundibugyo Ebola outbreak in DRC—genomic answers depend on correct sequence retrieval first.
Broader lesson?	Build deterministic execution layers; let models hypothesize, not reinvent pagination.
Full paper?	arXiv:2606.06749 — Nasri et al., 2026.

The hill town problem: biology wasn't built for agents

Laura Luebbert opens with an analogy: using AI agents on today's biological data is like driving through an old Italian hill town designed before cars—beautiful, thoughtful, but full of narrow winding streets (idiosyncratic file formats, scattered databases, one-off scripts).

Software, by contrast, was built for cars:

Paved roads → version control
Clear lanes → documented APIs
Standardized signals → package managers
Fast start-to-finish travel → testable outputs (a GitHub patch that passes CI)

Coding agents advanced quickly because the infrastructure matches agent needs. Biological agents lag because retrieval and validation layers are brittle, heterogeneous, and process-dependent, and because biology offers few simple, verifiable rewards comparable to a passing test suite.

The bottleneck is not only reasoning. It is the absence of widespread deterministic execution layers for querying biological data. A scientist can express intent ("find all human kinases with this domain and pull their structures"), but agents lack a dependable, repeatable path to the databases.

In biology, small retrieval errors have severe downstream consequences:

Wrong genome build → invalid coordinates
Mixing RefSeq and GenBank unintentionally
Treating partial genomes as complete
Confusing segment names in segmented viruses
Missing records due to inconsistent metadata fields

It does not matter how powerful the model is if the route depends on local knowledge hidden in a web UI.

Karpathy's "click tax" — the same pain in software

This mismatch is not unique to biology. Luebbert cites Andrej Karpathy's talk on software in the AI era: he vibe-coded a small web app quickly, then lost a week on authentication, payments, and deployment—clicking through browser dashboards.

"The code was the easiest part! Most of the work was in the browser, clicking things."

Documentation kept saying "go to this URL, click this dropdown." Karpathy's conclusion: nobody should have to do this—we must build for agents.

Anthropic's virology case study is the biological version of that complaint. Long before agents, computational biologists built partial fixes—Biopython, BioPerl, Entrez Direct, BioMart, gget—to move data out of browsers into scriptable workflows. But biological data still lives in a messy network of roads, each with its own identifiers, conventions, and degree of programmatic access.

Case study: NCBI Virus and the May 2026 Ebola outbreak

NCBI Virus aggregates viral sequence records from GenBank, RefSeq, and the international INSDC ecosystem (NCBI, ENA, DDBJ), including Pathoplexus—behind a searchable web interface.

Virology labs pass around long lists of complex filters that users manually reproduce in the browser. Exactly the workflow Karpathy described—except the stakes are public health.

Bundibugyo virus, DRC, May 2026

On May 14, 2026, INRB Kinshasa analyzed 13 blood samples and confirmed Bundibugyo virus disease in eight the next day. An Ebola outbreak was declared. By May 29, WHO reported 1,000+ confirmed and suspected cases and 200+ deaths in the DRC.

Researchers generated the first near-complete outbreak genomes, establishing a new spillover event. Public health officials need immediate answers:

How different is this virus from prior Ebola viruses?
Do existing diagnostics still detect it?
Will existing therapeutics still protect patients?

Answering these requires comparing new genomes against historical Ebola records in NCBI Virus and Pathoplexus. The first steps should be automatable. Instead, they often involve manual browser filtering, hoping the dataset is complete and correct.

Much of NCBI Virus's filtering logic lives only in the web interface. A seasoned virologist might take a few clicks for SARS-CoV-2 surface glycoprotein sequences from 2025. Programmatically, it can require a multi-hundred-line script gluing REST, Datasets, and E-utilities APIs—paginating, reconciling identifiers, downloading hundreds of gigabytes, then filtering locally.

Even when APIs exist, agents struggle when:

API filtering ≠ web UI semantics
Metadata fields are poorly documented
Identifiers differ across sources
"The right answer" depends on expert conventions machines must infer

VirBench: 120 queries, ground-truth counts, three runs each

To measure the gap, Luebbert's team built VirBench—documented in the preprint "Deterministic access to global viral sequence data enables robust agentic scientific discovery" (Nasri et al., 2026).

Benchmark design

Property	Detail
Queries	120 realistic viral sequence retrievals
Pathogens	40, from broad family searches to accession lookups
Filters per query	1–9 simultaneous (median 6); up to 16 filter types
Expected counts	0 to 3,226 sequences (median 22)
Ground truth	Manually verified via NCBI Virus web interface
Use cases	Surveillance, diagnostic assay design, protein model training data
Contributors	58 queries from Sabeti Lab diagnostics team

Example query (Ebolavirus):

Retrieve viral sequences from NCBI for TaxID 3052462 (Orthoebolavirus zairense (ZEBOV)) with: host organism human; geographic location Africa; collected 01/01/2014–06/20/2014; minimum sequence length 15,200 bases; maximum 1,900 ambiguous characters (N's); exclude lab-passaged samples.
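
One way to make a query like this machine-actionable is to write the filters down as data instead of prose. The sketch below is illustrative only: `QueryFilters`, `Record`, and every field name are hypothetical and do not reflect the NCBI Virus schema or the VirBench harness. It also assumes that a record without a collection date cannot satisfy a date-range filter, which may differ from the web interface's behavior.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class QueryFilters:
    """Filters from the example query; all field names are hypothetical."""
    host: str
    geo_region: str
    collected_from: date
    collected_to: date
    min_length: int
    max_ambiguous: int
    exclude_lab_passaged: bool = True


@dataclass(frozen=True)
class Record:
    """A simplified metadata record (not the NCBI Virus schema)."""
    accession: str
    host: str
    geo_region: str
    collected: Optional[date]
    length: int
    ambiguous: int
    lab_passaged: bool


def matches(rec: Record, f: QueryFilters) -> bool:
    """Return True only if the record satisfies every filter."""
    if rec.collected is None:
        # Assumption: undated records cannot satisfy a date range.
        return False
    return (
        rec.host.lower() == f.host.lower()
        and rec.geo_region.lower() == f.geo_region.lower()
        and f.collected_from <= rec.collected <= f.collected_to
        and rec.length >= f.min_length
        and rec.ambiguous <= f.max_ambiguous
        and not (f.exclude_lab_passaged and rec.lab_passaged)
    )


# The six constraints from the example query, expressed as data.
EBOLA_EXAMPLE = QueryFilters(
    host="human",
    geo_region="Africa",
    collected_from=date(2014, 1, 1),
    collected_to=date(2014, 6, 20),
    min_length=15_200,
    max_ambiguous=1_900,
)
```

Because every constraint is an explicit field, a missing or misapplied filter shows up as a diff in the query object instead of a silent change in the returned count.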

Agents tested

Evaluated February 26, 2026:

Claude Sonnet 4 and Claude Opus 4.7 (Anthropic Messages API)
GPT-5.2-pro and GPT-5.5 (OpenAI Responses API, web search + code execution)
Biomni OSS v0.0.8 (Claude Sonnet 4 backend)
Edison Analysis (Edison client SDK)

Each query ran three independent times per agent to test reproducibility.

What happened when agents tried alone

Performance varied widely—and even the best model was not reliably good enough for dataset construction, where the effective bar is 100%.

Accuracy without gget virus

Agent	Mean accuracy	Stability (σ=μ threshold)
Claude Sonnet 4	16.9%	Low
Biomni OSS	22.5%	Low
Edison Analysis	40.0%	Moderate
GPT-5.2-pro	67.1%	Moderate
Claude Opus 4.7	83.2%	0.93
GPT-5.5	91.3%	1.00

Newer frontier models improved substantially—but residual errors remain consequential. A missing or incorrect record can determine whether a diagnostic assay appears to cover circulating diversity, or whether an outbreak is inferred to have started weeks earlier or later.

The reproducibility problem

The same model often returned different answers on identical prompts. For the example Ebolavirus query, Claude Sonnet 4 returned:

Run 1: 106 sequences (expected: 266)
Run 2: 15 sequences
Run 3: 5 sequences

That undermines both accuracy and reproducibility—requirements for any scientific workflow.
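
The three runs above can be scored with a few lines of Python. This is a minimal sketch, not the VirBench scoring code: the post does not spell out how accuracy or stability is computed, so exact-count match and "all runs return the same count" are assumptions made here for illustration.

```python
from statistics import mean
from typing import List


def exact_match_rate(counts: List[int], expected: int) -> float:
    """Fraction of runs whose count equals the ground truth (assumed metric)."""
    return mean(1.0 if c == expected else 0.0 for c in counts)


def runs_agree(counts: List[int]) -> bool:
    """True when every run returned the same count."""
    return len(set(counts)) == 1


def relative_errors(counts: List[int], expected: int) -> List[float]:
    """Absolute error of each run as a fraction of the expected count."""
    if expected == 0:
        # A zero ground truth is either matched exactly or missed entirely.
        return [0.0 if c == 0 else float("inf") for c in counts]
    return [abs(c - expected) / expected for c in counts]


# Counts reported in the post for the example Ebolavirus query.
sonnet_runs = [106, 15, 5]
expected = 266

print(exact_match_rate(sonnet_runs, expected))  # 0.0
print(runs_agree(sonnet_runs))                  # False
print([round(e, 3) for e in relative_errors(sonnet_runs, expected)])
```

Tracking agreement separately from correctness matters: three identical wrong answers and three different wrong answers both score zero on accuracy, but only the second also fails reproducibility.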

When wrong retrieval changes biology

Anthropic illustrated downstream impact with two analyses:

Phylogenetic trees (TMRCA): A manually curated NCBI dataset inferred a January 2014 time to most recent common ancestor for the 2014 West African Ebola epidemic—consistent with prior literature. Agent-retrieved sets produced trees pushing TMRCA to 1922, or shifting it to April 2014 by missing Guinea sequences—changing inferred outbreak timing.

Therapeutic epitopes: For antibody candidates maftivimab and MBP134, three Sonnet 4 runs produced three different impressions of mutation variability in target regions—because underlying sequence sets were incomplete or wrong.

Failure modes

Agents often understood the task but lacked machine-actionable execution:

Under-counted when pagination stopped early (Influenza A, HIV-1, SARS-CoV-2)
Over-counted when filters were applied incorrectly
Struggled with metadata fields whose meaning depends on context (e.g., geographic info stored in virusName rather than location)
Performance degraded beyond 3–4 simultaneous filters

Answers could look plausible while being wrong—especially dangerous because sequence retrieval is usually step one in a long pipeline.
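
The early-stop pagination failure can be guarded against mechanically. The sketch below assumes a hypothetical paged backend that returns a list of accessions, a continuation token, and a reported total; `fetch_page` is a stand-in and not an NCBI client call. The point is that the loop refuses to return a result whose size disagrees with the reported total.

```python
from typing import Callable, List, Optional, Tuple

# (accessions on this page, token for the next page or None, total reported by backend)
Page = Tuple[List[str], Optional[str], int]


def fetch_all(
    fetch_page: Callable[[Optional[str]], Page],
    max_pages: int = 10_000,
) -> List[str]:
    """Follow continuation tokens to the end and fail loudly on truncation."""
    accessions: List[str] = []
    token: Optional[str] = None
    total: Optional[int] = None
    for _ in range(max_pages):
        items, token, total = fetch_page(token)
        accessions.extend(items)
        if token is None:
            break  # the backend says this was the last page
    else:
        # The loop ended without reaching a final page.
        raise RuntimeError("page limit reached before the last page")
    if total is not None and len(accessions) != total:
        raise RuntimeError(
            f"retrieved {len(accessions)} records but backend reported {total}"
        )
    return accessions
```

An agent that writes its own loop can easily stop after the first page and report a plausible number; a shared helper that raises on a count mismatch turns that silent error into a visible one.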

gget virus: the deterministic layer

The team developed gget virus in collaboration with NCBI researchers—not a simple API wrapper, but a system that reproduces NCBI Virus web-interface behavior across fragmented backends.

What it coordinates

NCBI Datasets REST API — lightweight metadata
NCBI Datasets CLI — cached bulk packages for SARS-CoV-2 and Influenza A
E-utilities — GenBank records for protein-level filters
Local filtering — when web UI semantics aren't exposed programmatically
Batching and retry logic — comprehensive retrieval without arbitrary cutoffs
Standardized outputs + logs — auditable, human- and machine-readable

The preprint reports >98% data transfer reduction for representative high-volume queries by applying metadata constraints before sequence download.
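
The batching and retry item in the list above can be sketched as follows. This is not gget's implementation: `fetch_batch` is a hypothetical callable, the batch size and backoff schedule are arbitrary, and only `ConnectionError` is treated as transient here.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def with_retry(call: Callable[[], list], attempts: int = 4,
               base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> list:
    """Retry a call with exponential backoff; re-raise after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def fetch_in_batches(accessions: Sequence[str],
                     fetch_batch: Callable[[List[str]], list],
                     size: int = 200, **retry_kwargs) -> list:
    """Fetch every accession in batches and verify nothing was dropped."""
    results: list = []
    for batch in batched(accessions, size):
        results.extend(with_retry(lambda: fetch_batch(batch), **retry_kwargs))
    if len(results) != len(accessions):
        raise RuntimeError(
            f"expected {len(accessions)} results, got {len(results)}"
        )
    return results
```

The final length check is the part that matters for reproducibility: a batch that fails permanently raises, and a batch that returns fewer records than requested is caught before the result is used downstream.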

Install and basic usage

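```bash
pip install gget

# Example: Zaire ebolavirus with filters.
# Flag names are as given in this post; confirm them against the gget virus
# documentation. The lab-passage exclusion from the example query is not shown.
gget virus "Zaire ebolavirus" \
  --host human \
  --geo_location Africa \
  --collection_date_after 2014-01-01 \
  --collection_date_before 2014-06-20 \
  --min_seq_length 15200 \
  --max_n 1900
```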

Documentation: gget virus module (Pachter Lab)

Written by Ferdous Nasri; developed with Sarah Gurev, Patrick Varilly, Krithik Ramesh, Nuala A. O'Leary, Jonah Cool, Bernhard Y. Renard, Pardis Sabeti, and Laura Luebbert.

Results with gget virus: model choice mattered less

When agents were instructed to use gget virus, the picture changed dramatically:

Agent	Accuracy without gget	Accuracy with gget
Claude Sonnet 4	16.9%	92.8%
Biomni OSS	22.5%	90.0%
Edison Analysis	40.0%	93.1%
GPT-5.2-pro	67.1%	98.9%
Claude Opus 4.7	83.2%	98.3%
GPT-5.5	91.3%	99.7%

Run-to-run variability was largely eliminated (stability 0.92–1.00). The performance gap between models narrowed dramatically. Adding a deterministic retrieval layer made model choice much less important: for example, Claude Sonnet 4 with gget virus (92.8%) outscored GPT-5.5 working alone (91.3%).
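
The narrowing can be read straight off the table. The short calculation below uses only the figures reported above; the variable and function names are illustrative.

```python
# Mean accuracy (%) per agent, copied from the table above.
without_gget = {
    "Claude Sonnet 4": 16.9,
    "Biomni OSS": 22.5,
    "Edison Analysis": 40.0,
    "GPT-5.2-pro": 67.1,
    "Claude Opus 4.7": 83.2,
    "GPT-5.5": 91.3,
}
with_gget = {
    "Claude Sonnet 4": 92.8,
    "Biomni OSS": 90.0,
    "Edison Analysis": 93.1,
    "GPT-5.2-pro": 98.9,
    "Claude Opus 4.7": 98.3,
    "GPT-5.5": 99.7,
}


def spread(scores: dict) -> float:
    """Gap between the best and worst agent, in percentage points."""
    return round(max(scores.values()) - min(scores.values()), 1)


# Per-agent gain from adding the tool, in percentage points.
gains = {
    agent: round(with_gget[agent] - without_gget[agent], 1)
    for agent in without_gget
}

print(spread(without_gget))  # 74.4
print(spread(with_gget))     # 9.7
print(gains["Claude Sonnet 4"], gains["GPT-5.5"])  # 75.9 8.4
```

The best-to-worst gap shrinks from 74.4 to 9.7 percentage points, and the weakest agents gain the most.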

One notable run: GPT-5.5 independently discovered and used gget virus on one query without being prompted to, producing the only correct answer for that question among 360 runs.

Remaining errors

Residual failures shifted from "can't access data reliably" to "agent misused the tool":

Incorrect local filtering after download
Partial processing of large FASTA files
Reverting to alternative APIs despite instructions
Wrong parameters on gget calls

The retrieval layer worked; agent invocation and output preservation still need guardrails—echoing @mosyaseen's loop-engineering point: you need something that can say no.
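
One possible guardrail for the "wrong parameters" failure is to validate an agent's requested filters before anything is executed. The sketch below is an assumption-laden illustration: `ALLOWED_FLAGS` is copied from the example command in this post and has not been checked against the real gget CLI, and `build_command` is a hypothetical helper that only constructs the argument list without running it.

```python
import shlex
from typing import Dict, List

# Flag names taken from the example command in this post; they are an
# assumption here and should be confirmed against the gget documentation.
ALLOWED_FLAGS = {
    "host",
    "geo_location",
    "collection_date_after",
    "collection_date_before",
    "min_seq_length",
    "max_n",
}


def build_command(taxon: str, filters: Dict[str, object]) -> List[str]:
    """Build a deterministic argv list, rejecting any unknown filter name."""
    unknown = sorted(set(filters) - ALLOWED_FLAGS)
    if unknown:
        raise ValueError(f"unsupported filters: {unknown}")
    argv = ["gget", "virus", taxon]
    # Sorted order makes the same request always produce the same command.
    for name in sorted(filters):
        argv += [f"--{name}", str(filters[name])]
    return argv


def loggable(argv: List[str]) -> str:
    """Shell-quoted form of the command, suitable for an audit log."""
    return shlex.join(argv)
```

For instance, `build_command("Zaire ebolavirus", {"host": "human", "max_n": 1900})` yields a fixed argument list, while a misspelled filter such as `min_length` raises before any call is made. That is the "something that can say no" in code form.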

The highway under the hill town

Anthropic returns to the city analogy: gget virus is a highway tunnel under pedestrian infrastructure—on-ramps, interchanges, exit numbers tied to known mile markers.

Karpathy's prescription applies directly: "make [genomic data] accessible to agents."

Creative work—hypothesis generation, experimental design, mechanism reasoning—should stay with models. The layer underneath must be boringly reliable:

Gene identifiers
Schemas
Retrieval logic
Coordinate systems
Metadata conventions
Data access paths

Broader ecosystem

gget virus joins a growing set of context engines for scientific agents:

System	Role
ToolUniverse	Tool aggregation for biomedical agents
Edison Scientific Robin	Research agent with tool harness
Biomni	General-purpose biomedical agent
gget virus	Deterministic viral sequence retrieval

The design question: where does determinism belong, and how do you build it so agents can invoke it without brittle post-processing?

As Nils Homer noted (cited in Anthropic's footnotes): "AI assistants need to work with your code, your outputs, and your analysis logic"—so agents can inspect how data was retrieved, not just what was returned.

Will better models make tools obsolete?

Anthropic addresses the obvious objection: if you extrapolate model curves, agents might eventually navigate messy portals alone.

Maybe. But even if an agent can fight through a confusing bioinformatics workflow, that does not mean it should every time:

Too expensive (token burn on pagination)
Too slow (multi-hour API gluing)
Too hard to audit (no retrieval logs)
Too hard to trust (plausible wrong counts)

If today's harnesses become obsolete, the lesson for database maintainers holds: design for agents as scaled users—explicit filtering semantics, stable identifiers, machine-readable logs, deterministic endpoints.

This parallels software agent discourse from the same week: Peter Steinberger's loop tweet argued engineers should design loops and skills, not re-prompt agents from scratch every time. Biology's version: design deterministic retrieval tools rather than letting each agent reinvent NCBI pagination.

Implications for agent builders

1. Separate reasoning from retrieval

Let the model plan and interpret. Let deterministic tools fetch, filter, and log. VirBench shows that reasoning without a deterministic retrieval layer falls short of the bar for dataset construction even at 91% mean accuracy.

2. Test at 100%, not "pretty good"

For dataset construction, 91% is not passing. Benchmark with ground-truth counts, multiple runs, and downstream analyses (phylogenetics, epitope mapping)—not just "did it return something."

3. Build agent-accessible interfaces

Biological databases need:

Filtering semantics matching web UIs
Documented metadata fields with examples
Pagination that cannot silently truncate
Logs showing how results were produced
Stable identifiers across sources
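
The "logs showing how results were produced" requirement can be illustrated with a small provenance record. This is a sketch under stated assumptions: the record layout and the `retrieval_record` helper are hypothetical, not a format used by NCBI or gget. It fingerprints the result set so that two retrievals can be compared without diffing full FASTA files.

```python
import hashlib
import json
from typing import Dict, Iterable


def retrieval_record(tool: str, params: Dict[str, object],
                     accessions: Iterable[str]) -> dict:
    """Summarize a retrieval: what was asked, how many came back, and a digest."""
    # Sort and de-duplicate so the digest does not depend on return order.
    unique = sorted(set(accessions))
    digest = hashlib.sha256("\n".join(unique).encode("utf-8")).hexdigest()
    return {
        "tool": tool,
        "params": dict(sorted(params.items())),
        "count": len(unique),
        "accessions_sha256": digest,
    }


def to_log_line(record: dict) -> str:
    """Serialize with sorted keys so identical records give identical lines."""
    return json.dumps(record, sort_keys=True)
```

Two runs that return the same accessions in a different order produce the same digest, so a mismatch between runs signals a real difference in the retrieved set.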

4. Connectors and MCP for science

Software teams solve this with MCP servers and Claude connectors. Life sciences needs the same pattern: thin, deterministic tool surfaces agents call instead of browsing.

5. Cheaper models + right tool > frontier model alone

VirBench's most practical finding: gget virus democratized accuracy. Reliable science should not require the newest or most expensive model—or insider knowledge of which model handles which database best.

Primary sources

Anthropic Science: Paving the way for agents in biology (June 8, 2026)
arXiv preprint: Deterministic access to global viral sequence data (2606.06749)
gget virus documentation (Pachter Lab)
Elliot Hershberg: How Software in the Life Sciences Actually Works — cited in Anthropic footnotes on fragmented bioinformatics tooling

Summary

Anthropic's June 8, 2026 biology agents essay makes a precise claim: coding agents outran biological agents because software infrastructure was built for programmatic access, and biology wasn't.

VirBench proved it with numbers:

120 queries, 40 pathogens, agents alone: 16.9%–91.3% accuracy with dangerous run-to-run variance
gget virus added: ≥90% for every agent, 99.7% peak, variability largely gone
Wrong retrieval changed phylogenetic outbreak dates and therapeutic epitope conclusions

The prescription is not "wait for smarter models." It is to build deterministic execution layers that are boring, auditable, and repeatable, and to let agents be creative on top.

For outbreak response in the DRC, diagnostic assay design, and protein model training data, that infrastructure is not a nice-to-have. It is the difference between a plausible-looking wrong answer and science you can trust.

Published June 9, 2026. VirBench metrics and outbreak statistics from Anthropic's June 8, 2026 post and Nasri et al. arXiv:2606.06749—verify against upstream before citing in research or public-health contexts.

Related: Long-Read Genome Sequencing and Rare Disease Diagnosis — the same data-quality bottleneck VirBench identified in viral queries applies to variant interpretation in genomics: AI models are only as accurate as the curated databases they access.
