Senior SWE-Bench: Snorkel AI's Benchmark for Under-Specified Tasks and Tasteful Code
Senior SWE-Bench from Snorkel AI, Princeton, and UW-Madison evaluates coding agents like senior engineers: realistic prompts, runtime bug investigation, and taste scoring. Opus 4.8 leads at a 24% tasteful solve rate; frontier models fail over 75% of the time.
SWE-Bench Pro tells an agent exactly which function to add, which file to edit, and which test must pass. Real senior engineers get a Slack message: "Google Books should be a fallback metadata source for BookWorm."
Senior SWE-Bench, from Snorkel AI, Princeton, and UW-Madison, closes that gap. Its tagline: "We treat agents like senior engineers, so why evaluate them like junior engineers?"
The first public leaderboard is sobering: Claude Opus 4.8 leads at 24.0% tasteful solve rate. Even the best frontier model fails more than three out of four senior-level tasks when correctness and code taste both count.
Top models miss senior-level correctness + taste on over 75% of tasks
Three Pillars: Why Junior Benchmarks Lie
1. Under-specified features, not spec documents
Senior SWE-Bench feature tasks read like natural messages: a median of 639 characters, with zero explicit code symbols in the prompt.
Compare to a typical SWE-Bench Pro instruction: ~6,008 characters, ~39 named code symbols, function signatures, file paths, and interface tables spelled out line by line.
Senior SWE-Bench median instruction length is 31% of SWE-Bench Pro's, closer to how a staff engineer actually delegates.
Because valid solutions vary, Snorkel introduces a validation agent: expert-designed recipes generate behavioral tests that adapt to each submitted patch, rather than comparing against a single golden diff.
2. Bugs that need runtime investigation
Bug and performance tasks come from PRs that required real debugging: starting services, reading logs, profiling, and reproducing flaky behavior from user reports.
This is the opposite of "apply this one-line fix from the issue description." Agents must investigate before they patch.
3. Taste scoring: ship the right code, not just passing code
A tasteful solve requires all of:
| Gate | Threshold |
| --- | --- |
| Verifiers | Pass |
| Validation agent | Pass |
| Rubric | above 0.5 |
| Bloat | under 2× reference size |
| Practice alignment | above 2/5 |
| Relative taste | above 2/5 |
Verifiers can enforce load-bearing codebase practices never stated in the prompt: naming conventions, error-handling patterns, how sibling files structure imports.
The benchmark explicitly punishes agents that over-build, ignore local style, or ship 2× bloated diffs that technically pass tests.
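The gates above combine with a logical AND, which a short sketch makes concrete. This is an illustration only, not the benchmark's scoring code: the field names, the 0-1 and 0-5 score ranges, and the use of a generic "size" for the bloat comparison are all assumptions, since the source gives only the thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Scores for one submitted patch. All field names are hypothetical."""
    verifiers_pass: bool
    validation_agent_pass: bool
    rubric: float              # assumed to lie in [0, 1]
    patch_size: int            # size of the submitted patch (unit is an assumption)
    reference_size: int        # size of the reference solution, same unit
    practice_alignment: float  # assumed to be scored out of 5
    relative_taste: float      # assumed to be scored out of 5


def is_tasteful_solve(r: TaskResult) -> bool:
    """A patch counts only if every gate in the table passes."""
    return (
        r.verifiers_pass
        and r.validation_agent_pass
        and r.rubric > 0.5
        and r.patch_size < 2 * r.reference_size  # "under 2x reference size"
        and r.practice_alignment > 2
        and r.relative_taste > 2
    )


def tasteful_solve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose single attempt cleared every gate (pass@1)."""
    if not results:
        return 0.0
    return sum(is_tasteful_solve(r) for r in results) / len(results)
```

Under these assumptions, a patch that passes every test but is exactly twice the reference size still fails, because the bloat gate is a strict "under 2×" comparison.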
Leaderboard: Opus 4.8 Leads, Everyone Else Struggles
Public results use Mini-SWE-Agent with model-specific effort settings. Tasteful solve rate (pass@1):
| Rank | Model | Effort | Tasteful solve |
| --- | --- | --- | --- |
| 1 | Claude Opus 4.8 | max | 24.0% |
| - | Claude Sonnet 5 | max | 19.4% |
| 2 | GPT-5.5 | xhigh | 16.0% |
| 3 | Claude Opus 4.7 | max | 14.1% |
| 4 | GPT-5.4 | xhigh | 14.0% |
| 5 | GLM-5.2 | max | 12.5% |
| 6 | Kimi K2.6 | default | 8.2% |
| 7 | Claude Sonnet 4.6 | high | 8.2% |
| 8 | Gemini 3.1 Pro | high | 6.1% |
| 9 | Gemini 3.5 Flash | medium | 3.0% |
Takeaways:
Opus 4.8 and Sonnet 5 dominate: Anthropic's stack holds the two highest scores when taste gates apply.
GPT-5.5 posts the third-highest score at 16.0%: strong, but not the leader here (contrast DeepSWE, where GPT-5.5 leads at 70%).
GLM-5.2 at 12.5% beats Kimi K2.6 and Gemini 3.1 Pro: open-weight competition is real but still far from the frontier Anthropic and OpenAI tiers.
The scatter plot tracks solve rate vs. compute (output tokens + agent steps). More tokens do not automatically mean a higher tasteful solve rate.
The site states plainly: the top-performing frontier models fail to complete tasks with senior-level correctness and taste over 75% of the time.
Task Design vs. Other Coding Benchmarks
Senior SWE-Bench compares itself to SWE-Bench Pro, DeepSWE, and internal FrontierCode metrics:
| Dimension | Senior SWE-Bench | SWE-Bench Pro | DeepSWE |
| --- | --- | --- | --- |
| Instruction style | Slack-style, under-specified | Detailed spec with symbols | Concise, original tasks |
| Median prompt length | ~639 chars (features) | ~6,008 chars | ~half of SWE-Bench Pro |
| Files per feature task | ~11 avg | ~5 avg | ~5 avg |
| Horizon | Hundreds of agent steps | Shorter patch scope | Long-horizon (668 LOC ref) |
| Scoring | Correctness + taste + validation agent | Verifier pass | Behavior verifiers |
| Opus 4.8 tokens/task | 117.1K output (features) | - | 97.0K (self-reported) |
Tasks span libraries to multi-service applications and are authored by engineers with hundreds of commits in each repo, not drawn from synthetic toy repos.
Fifty public tasks are released for community evaluation; 50 private tasks are held out to guard against contamination. The dataset lives on GitHub; runs go through Harbor (the same framework as Terminal-Bench 2.0).
SWE-Bench Pro vs. Senior SWE-Bench: A Concrete Example
The Senior SWE-Bench site contrasts prompt styles side by side.
SWE-Bench Pro (excerpt, Open Library / BookWorm Google Books task):
40+ lines of Problem / Justify / Define Success / Proposal
Named tuples (STAGED_SOURCES), exact URL templates, function signatures (stage_from_google_books, fetch_google_book, BaseLookupWorker)
New public interfaces listed with inputs, outputs, and locations
Senior SWE-Bench equivalent:
#eng-platform
Engineer 10:42 AM
Google Books should be a fallback metadata source for BookWorm
for fallback/staging imports.
Same underlying feature. One prompt is a contract; the other is a message. Senior SWE-Bench argues that agents that ace the contract may still fail the message, and production work looks like the message.
What the Validation Agent Does
Static golden tests break when valid implementations differ. Senior SWE-Bench's validation agent:
Takes the submitted solution
Applies expert-designed recipes per task type
Generates behavioral tests adapted to that specific patch
Scores rubric dimensions (correctness, bloat, practice fit, taste)
This mirrors how a senior reviewer would QA a PR: not just "CI green," but "would I merge this?"
Combined with verifiers (runtime tests) and taste metrics derived from observed codebase practices, the benchmark separates lucky passes from shippable work.
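To see why behavior-level checks tolerate differing implementations where a golden diff would not, consider a toy sketch. It is not the benchmark's validation agent, which generates tests per patch; every function and record here is hypothetical, loosely modeled on the "fallback metadata source" example above. Two structurally different patches are judged only on what they return.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[dict]]


def primary_source(isbn: str) -> Optional[dict]:
    """Stand-in primary metadata source that knows a single book."""
    return {"title": "Known Book", "source": "primary"} if isbn == "111" else None


def google_books(isbn: str) -> Optional[dict]:
    """Stand-in fallback source that knows a different book."""
    return {"title": "Fallback Book", "source": "google_books"} if isbn == "222" else None


def lookup_patch_a(isbn: str) -> Optional[dict]:
    """Patch A: explicit fallback branch."""
    record = primary_source(isbn)
    if record is None:
        record = google_books(isbn)
    return record


def lookup_patch_b(isbn: str) -> Optional[dict]:
    """Patch B: walks an ordered list of sources instead."""
    for source in (primary_source, google_books):
        record = source(isbn)
        if record is not None:
            return record
    return None


def behaves_like_fallback(lookup: Lookup) -> bool:
    """Behavioral check: primary wins, fallback fills gaps, unknown gives None."""
    def source_of(isbn: str) -> Optional[str]:
        record = lookup(isbn)
        return record["source"] if record else None

    observed = (source_of("111"), source_of("222"), source_of("333"))
    return observed == ("primary", "google_books", None)
```

Both patches satisfy `behaves_like_fallback`, while a submission that never consults the fallback (for example, passing `primary_source` directly) does not. A test pinned to one patch's internal structure would have rejected one of the two valid solutions.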
Implications for Teams Picking Coding Agents
Do not read 24% as "Opus is bad." Read it as "senior work is hard and our old benchmarks were measuring something easier."
Practical guidance:
Treat public SWE scores as necessary, not sufficient, especially for SWE-Bench Pro-style over-specified tasks. See Cursor reward-hacking on SWE-Bench for contamination risks.
Add taste and review gates in your harness: Senior SWE-Bench's bloat and practice scores align with what staff engineers actually reject in code review.
Stress under-specified prompts in private evals: if your internal agent tickets read like Senior SWE-Bench, benchmark them that way.
Budget for long horizons: Opus 4.8 averages 117K output tokens per feature task here. Token cost and latency matter as much as pass rate.
Use Harbor for reproducibility: it is the same infra as Terminal-Bench, so you can compare agent scaffolding, not just raw models.
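For the budgeting point, a back-of-the-envelope estimate can be sketched as follows. The function name and the price are placeholders, not real pricing; only the 117.1K output tokens per feature task and the 50 public tasks come from the figures above. Applying the feature-task average to every task is itself a simplifying assumption.

```python
def output_token_cost(
    tasks: int,
    tokens_per_task: float,
    usd_per_million_tokens: float,
    attempts: int = 1,
) -> float:
    """Estimate output-token spend for an eval run.

    Input tokens and tool or sandbox costs are deliberately ignored, so
    treat the result as a lower bound on the real bill.
    """
    total_tokens = tasks * attempts * tokens_per_task
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 50 tasks at 117,100 output tokens each is 5,855,000 tokens per pass.
# The price of 10.0 USD per million output tokens is purely hypothetical.
estimate = output_token_cost(tasks=50, tokens_per_task=117_100, usd_per_million_tokens=10.0)
print(f"Estimated output-token cost: ${estimate:.2f}")
```

Substituting your provider's actual output-token price and raising `attempts` for repeated runs shows how quickly a long-horizon benchmark multiplies spend.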