What is Senior SWE-Bench?

Senior SWE-Bench is a coding-agent benchmark from Snorkel AI with Princeton and UW–Madison that evaluates agents on senior-engineer work — under-specified feature requests, runtime bug investigation, and code quality ("taste") beyond what instructions state. It ships 50 public and 50 private tasks sourced from real PRs in repos like PostHog, Gitea, and Better Auth. Official site: senior-swe-bench.snorkel.ai.

How is Senior SWE-Bench different from SWE-Bench Pro?

SWE-Bench Pro tasks often read like detailed specs, with a median of ~6,008 characters and dozens of named functions and interfaces. Senior SWE-Bench feature prompts have a median of ~639 characters and zero explicit code symbols, closer to Slack messages. Feature tasks touch an average of 11 files across multiple services. Scoring adds taste metrics (bloat, practice alignment, relative taste) on top of verifiers.

Who leads the Senior SWE-Bench leaderboard?

On the public leaderboard, Claude Opus 4.8 (max effort) leads with a 24.0% tasteful solve rate (pass@1). Claude Sonnet 5 (max) follows at 19.4%, then GPT-5.5 (xhigh) at 16.0%, Claude Opus 4.7 (max) at 14.1%, and GPT-5.4 (xhigh) at 14.0%. GLM-5.2 (max) reaches 12.5%, ahead of Kimi K2.6 and Gemini 3.1 Pro.

What counts as a tasteful solve?

A tasteful solve requires all of: verifiers pass, validation-agent tests pass, rubric score above 0.5, bloat under 2× reference size, practice score above 2/5, and relative taste above 2/5. Runtime correctness alone is insufficient — the benchmark penalizes over-engineered or stylistically wrong solutions that happen to pass tests.

How do I run Senior SWE-Bench?

The benchmark integrates with Harbor — the same agent-evaluation framework used by Terminal-Bench. The public dataset is on GitHub; agents run via Mini-SWE-Agent on the official leaderboard. See senior-swe-bench.snorkel.ai for the dataset link and Harbor run instructions.

Why do frontier models score so low?

Senior SWE-Bench tasks are long-horizon — even Opus 4.8 averages 117.1K output tokens per feature task. Instructions are deliberately vague, bugs require starting services and reading logs, and taste gates reject bloated patches. The site states top frontier models fail senior-level correctness and taste on more than 75% of tasks — a much harder bar than patch-and-pass benchmarks.

Senior SWE-Bench: Snorkel AI Coding Agent Benchmark (2026) | explainx.ai Blog

SWE-Bench Pro tells an agent exactly which function to add, which file to edit, and which test must pass. Real senior engineers get a Slack message: "Google Books should be a fallback metadata source for BookWorm."

Senior SWE-Bench — from Snorkel AI, Princeton, and UW–Madison — closes that gap. Its tagline: "We treat agents like senior engineers, so why evaluate them like junior engineers?"

The first public leaderboard is sobering: Claude Opus 4.8 leads at 24.0% tasteful solve rate. Even the best frontier model fails more than three out of four senior-level tasks when correctness and code taste both count.

TL;DR


Site	senior-swe-bench.snorkel.ai
Tasks	50 public + 50 private from real PRs
Repos	PostHog (8), Electric (6), Gitea (6), Better Auth (4), Harbor (4), +7
Types	Feature · bug · perf · migration
Stacks	Python services, Elixir, Go, SQL, TS libs, Rust, TS frontend, +4
Agent harness	Mini-SWE-Agent via Harbor
Leader	Claude Opus 4.8 max — 24.0% tasteful pass@1
Key insight	Top models miss senior-level correctness + taste on over 75% of tasks

Three Pillars: Why Junior Benchmarks Lie

1. Under-specified features, not spec documents

Senior SWE-Bench feature tasks read like natural messages — median 639 characters, zero explicit code symbols in the prompt.

Compare to a typical SWE-Bench Pro instruction: ~6,008 characters, ~39 named code symbols, function signatures, file paths, and interface tables spelled out line by line.

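Senior SWE-Bench's median instruction length is put at 31% of SWE-Bench Pro's; for feature prompts alone, the two medians above (639 vs. 6,008 characters) imply roughly 11%. Either way, this is closer to how a staff engineer actually delegates.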

Because solutions vary, Snorkel introduces a validation agent: expert-designed recipes generate behavioral tests that adapt to each submitted patch — not a single golden diff.

2. Bugs that need runtime investigation

Bug and performance tasks come from PRs that required real debugging — starting services, reading logs, profiling, reproducing flaky behavior from user reports.

This is the opposite of "apply this one-line fix from the issue description." Agents must investigate before they patch.

3. Taste scoring — ship the right code, not just passing code

A tasteful solve requires all of:

Gate	Threshold
Verifiers	Pass
Validation agent	Pass
Rubric	above 0.5
Bloat	under 2× reference size
Practice alignment	above 2/5
Relative taste	above 2/5
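
The gates in the table combine with a logical AND, so a patch that passes every test can still fail on bloat or taste. A minimal sketch of that check follows. The `TaskResult` fields, the function names, and the unit used for patch size are assumptions made for illustration, not the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Per-task scores. Field names are hypothetical, not the benchmark's schema."""
    verifiers_pass: bool
    validation_pass: bool
    rubric: float          # 0.0 to 1.0
    patch_size: int        # submitted patch size (unit assumed, e.g. changed lines)
    reference_size: int    # reference patch size, same unit
    practice: float        # 0 to 5
    relative_taste: float  # 0 to 5


def is_tasteful_solve(r: TaskResult) -> bool:
    """Return True only if every gate from the table holds."""
    return (
        r.verifiers_pass
        and r.validation_pass
        and r.rubric > 0.5
        and r.patch_size < 2 * r.reference_size
        and r.practice > 2
        and r.relative_taste > 2
    )


def tasteful_solve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks clearing all gates, assuming one attempt per task."""
    if not results:
        return 0.0
    return sum(is_tasteful_solve(r) for r in results) / len(results)


passing = TaskResult(True, True, 0.8, 150, 100, 4, 3)
bloated = TaskResult(True, True, 0.8, 250, 100, 4, 3)  # tests pass, but 2.5x reference size
print(is_tasteful_solve(passing), is_tasteful_solve(bloated))
```

The second result differs from the first only in patch size, which is enough to fail the whole solve.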

Verifiers can enforce load-bearing codebase practices never stated in the prompt — naming conventions, error-handling patterns, how sibling files structure imports.

The benchmark explicitly punishes agents that over-build, ignore local style, or ship diffs at least twice the reference size that technically pass tests.

Leaderboard: Opus 4.8 Leads, Everyone Else Struggles

Public results use Mini-SWE-Agent with model-specific effort settings. Tasteful solve rate (pass@1):

Rank	Model	Effort	Tasteful solve
1	Claude Opus 4.8	max	24.0%
—	Claude Sonnet 5	max	19.4%
2	GPT-5.5	xhigh	16.0%
3	Claude Opus 4.7	max	14.1%
4	GPT-5.4	xhigh	14.0%
5	GLM-5.2	max	12.5%
6	Kimi K2.6	default	8.2%
7	Claude Sonnet 4.6	high	8.2%
8	Gemini 3.1 Pro	high	6.1%
9	Gemini 3.5 Flash	medium	3.0%

Takeaways:

Opus 4.8 and Sonnet 5 dominate — Anthropic's stack owns the top two slots when taste gates apply.
GPT-5.5 has the third-highest score at 16.0% (rank 2 in the table, where Sonnet 5 is listed without a rank): strong, but not the leader here (contrast DeepSWE, where GPT-5.5 leads at 70%).
GLM-5.2 at 12.5% beats Kimi K2.6 and Gemini 3.1 Pro — open-weight competition is real but still far from frontier Anthropic/OpenAI tiers.
The site's scatter plot tracks solve rate vs. compute (output tokens + agent steps). More tokens do not automatically mean a higher tasteful solve rate.

The site states plainly: the top-performing frontier models fail to complete tasks with senior-level correctness and taste over 75% of the time.
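
The "over 75%" figure follows directly from the leaderboard: a failure rate is 100% minus the tasteful solve rate. A small sketch of that arithmetic, using only the numbers in the table above (the dictionary layout and variable names are illustrative):

```python
# Tasteful solve rates (pass@1, percent) copied from the leaderboard table.
leaderboard = {
    "Claude Opus 4.8 (max)": 24.0,
    "Claude Sonnet 5 (max)": 19.4,
    "GPT-5.5 (xhigh)": 16.0,
    "Claude Opus 4.7 (max)": 14.1,
    "GPT-5.4 (xhigh)": 14.0,
    "GLM-5.2 (max)": 12.5,
    "Kimi K2.6 (default)": 8.2,
    "Claude Sonnet 4.6 (high)": 8.2,
    "Gemini 3.1 Pro (high)": 6.1,
    "Gemini 3.5 Flash (medium)": 3.0,
}

# Failure rate is the complement of the tasteful solve rate.
failure_rates = {model: round(100.0 - rate, 1) for model, rate in leaderboard.items()}

best_model = max(leaderboard, key=leaderboard.get)
print(best_model, failure_rates[best_model])  # Claude Opus 4.8 (max) 76.0
```

Every model in the table, including the leader, fails more than 75% of tasks under the combined gates.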

Task Design vs. Other Coding Benchmarks

Senior SWE-Bench compares itself to SWE-Bench Pro, DeepSWE, and internal FrontierCode metrics:

Dimension	Senior SWE-Bench	SWE-Bench Pro	DeepSWE
Instruction style	Slack-style, under-specified	Detailed spec with symbols	Concise, original tasks
Median prompt length	~639 chars (features)	~6,008 chars	~half of SWE-Bench Pro
Files per feature task	~11 avg	~5 avg	~5 avg
Horizon	Hundreds of agent steps	Shorter patch scope	Long-horizon (668 LOC ref)
Scoring	Correctness + taste + validation agent	Verifier pass	Behavior verifiers
Opus 4.8 tokens/task	117.1K output (features)	—	97.0K (self-reported)

Tasks span libraries to multi-service applications, authored by engineers with hundreds of commits in each repo — not synthetic toy repos.

50 public tasks are released for community eval; 50 private tasks are held out to guard against contamination. The dataset lives on GitHub; runs go through Harbor (same framework as Terminal-Bench 2.0).

SWE-Bench Pro vs. Senior SWE-Bench: A Concrete Example

The Senior SWE-Bench site contrasts prompt styles side by side.

SWE-Bench Pro (excerpt — Open Library / BookWorm Google Books task):

40+ lines of Problem / Justify / Define Success / Proposal
Named tuples (STAGED_SOURCES), exact URL templates, function signatures (stage_from_google_books, fetch_google_book, BaseLookupWorker)
New public interfaces listed with inputs, outputs, and locations

Senior SWE-Bench equivalent:

#eng-platform

Engineer 10:42 AM
Google Books should be a fallback metadata source for BookWorm
for fallback/staging imports.

Same underlying feature. One prompt is a contract; the other is a message. Senior SWE-Bench argues agents that ace the contract may still fail the message — and production work looks like the message.

What the Validation Agent Does

Static golden tests break when valid implementations differ. Senior SWE-Bench's validation agent:

Takes the submitted solution
Applies expert-designed recipes per task type
Generates behavioral tests adapted to that specific patch
Scores rubric dimensions (correctness, bloat, practice fit, taste)

This mirrors how a senior reviewer would QA a PR — not just "CI green," but "would I merge this?"

Combined with verifiers (runtime tests) and taste metrics derived from observed codebase practices, the benchmark separates lucky passes from shippable work.
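
To see why grading against a single golden diff rejects valid work, consider a toy comparison. It is not the validation agent itself, which the article describes as generating tests adapted to each patch. The `slugify` function, both implementations, and the test cases are invented for illustration.

```python
# Reference implementation, standing in for a "golden" solution.
REFERENCE_SOURCE = '''
def slugify(title):
    return "-".join(title.lower().split())
'''

# A different but equally valid implementation an agent might submit.
SUBMITTED_SOURCE = '''
def slugify(title):
    words = [w for w in title.lower().split(" ") if w]
    return "-".join(words)
'''


def load_slugify(source: str):
    """Execute the source in a fresh namespace and return its slugify function."""
    namespace = {}
    exec(source, namespace)
    return namespace["slugify"]


def golden_match(source: str) -> bool:
    """Exact-match grading: passes only if the text equals the reference."""
    return source.strip() == REFERENCE_SOURCE.strip()


def behavioral_check(slugify) -> bool:
    """Behavior-based grading: passes if observable outputs are correct."""
    cases = {
        "Hello World": "hello-world",
        "  Senior  SWE Bench ": "senior-swe-bench",
    }
    return all(slugify(text) == expected for text, expected in cases.items())


print(golden_match(SUBMITTED_SOURCE))                    # False
print(behavioral_check(load_slugify(SUBMITTED_SOURCE)))  # True
```

The submitted code is rejected by exact matching yet behaves correctly on these cases, which is the gap behavioral tests are meant to close.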

Implications for Teams Picking Coding Agents

Do not read 24% as "Opus is bad." Read it as "senior work is hard and our old benchmarks were measuring something easier."

Practical guidance:

Treat public SWE scores as necessary, not sufficient — especially SWE-Bench Pro-style over-specified tasks. See Cursor reward-hacking on SWE-Bench for contamination risks.
Add taste and review gates in your harness — Senior SWE-Bench's bloat and practice scores align with what staff engineers actually reject in code review.
Stress under-specified prompts in private evals — if your internal agent tickets read like Senior SWE-Bench, benchmark them that way.
Budget for long horizons — Opus 4.8 averages 117K output tokens per feature task here. Token cost and latency matter as much as pass rate.
Use Harbor for reproducibility — same infra as Terminal-Bench; compare agent scaffolding, not just raw models.
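
For the long-horizon budgeting point above, a rough back-of-the-envelope estimate can help. The sketch below uses the 117.1K output-token average quoted for Opus 4.8 feature tasks. The price per million tokens is a placeholder, not a real rate, and the calculation ignores input tokens and assumes every task costs as much as a feature task.

```python
# Average output tokens per feature task for Opus 4.8, as quoted in this article.
OUTPUT_TOKENS_PER_FEATURE_TASK = 117_100


def estimate_output_cost(
    num_tasks: int,
    tokens_per_task: int,
    usd_per_million_output_tokens: float,
) -> float:
    """Estimate output-token spend in USD for one pass over num_tasks tasks."""
    total_tokens = num_tasks * tokens_per_task
    return total_tokens * usd_per_million_output_tokens / 1_000_000


# Hypothetical price: substitute your provider's actual output-token rate.
PLACEHOLDER_PRICE = 10.0

total_tokens = 50 * OUTPUT_TOKENS_PER_FEATURE_TASK
cost = estimate_output_cost(50, OUTPUT_TOKENS_PER_FEATURE_TASK, PLACEHOLDER_PRICE)
print(f"{total_tokens:,} output tokens, about ${cost:.2f} at the placeholder rate")
```

Input tokens, retries, and multiple attempts per task would all add to this, so treat the result as a floor rather than a forecast.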

For agent loop design — how to structure multi-step coding sessions — see loop engineering for Claude Code.

Limitations

Early leaderboard — few models, single harness (Mini-SWE-Agent); other scaffolds may reorder ranks.
50 public tasks — small sample; high variance on pass@1.
Taste metrics are codebase-specific — what counts as "good practice" in PostHog may not transfer to your monorepo.
Private holdout: public leaderboards can still be gamed, so the 50 private tasks matter for official claims.
Snorkel affiliation — Snorkel also co-created Terminal-Bench; evaluate independently where possible.
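
The small-sample caveat can be made concrete with a binomial standard error. The sketch below assumes one independent attempt per task on 50 tasks and uses a normal approximation. The actual leaderboard may aggregate over a different number of tasks or runs, so the interval is illustrative only.

```python
import math


def pass_rate_standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate p measured on n independent tasks."""
    return math.sqrt(p * (1 - p) / n)


p, n = 0.24, 50  # leader's tasteful solve rate; assumed task count
se = pass_rate_standard_error(p, n)
margin = 1.96 * se  # half-width of an approximate 95% interval

print(f"{p:.1%} +/- {margin:.1%} (roughly {p - margin:.1%} to {p + margin:.1%})")
```

Under these assumptions the margin is about 12 percentage points, wider than the gap between most adjacent rows of the leaderboard.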

How to Run It

Visit senior-swe-bench.snorkel.ai
Clone the dataset from GitHub (linked on site)
Run agents via Harbor with Mini-SWE-Agent config
Read the blog post on validation agent, taste scoring, and QC (linked from overview)
