Ornith-1.0: Self-Scaffolding Open Models for Agentic Coding
Ornith-1.0 is a new MIT-licensed model family from DeepReinforce that learns its own agent scaffolds during RL post-training. The 397B MoE variant hits 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified.
On June 25, 2026, the DeepReinforce.AI team behind @ornith_ announced Ornith-1.0 — a family of MIT-licensed, open-weight models built specifically for agentic coding. The release spans 9B Dense, 31B Dense, 35B MoE, and 397B MoE checkpoints, post-trained on Gemma 4 and Qwen 3.5 bases.
The technical bet is not just bigger pretraining. Ornith-1.0 treats the agent scaffold — memory layout, retry logic, tool orchestration — as something the model learns during reinforcement learning, not something engineers hard-code once per benchmark category. That is why the team calls it a self-scaffolding training strategy.
DeepReinforce's launch post summarizes the positioning plainly: "Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts."
Why Agent Scaffolds Matter for Coding Agents
Most public coding-agent scores combine three ingredients: the base model, the harness (OpenHands, Harbor/Terminus-2, Claude Code, mini SWE agent), and the benchmark task distribution. When harness design is fixed, leaderboard gains can reflect benchmark-specific tuning as much as raw model capability.
Ornith-1.0 attacks that coupling directly. Each RL step runs in two stages:
Scaffold stage — conditioned on the task and the scaffold used last time, the model proposes a refined scaffold.
Solution stage — conditioned on that scaffold and the task description, the model produces a solution rollout.
The rollout's reward is credited to both stages, so the update reinforces the scaffold proposal as well as the solution. Over training, scaffolds that induce higher-reward trajectories survive; weak orchestration patterns get replaced. Per-task-category strategies can emerge without a human maintaining separate harness configs for Terminal-Bench, SWE-Bench, and repo-generation evals.
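The two-stage step described above can be sketched in a few lines. This is only an illustration of the control flow, not DeepReinforce's implementation: `propose_scaffold`, `produce_solution`, `verifier_reward`, and the per-task `scaffold_memory` dict are hypothetical stand-ins for model calls and the environment's verifier.

```python
import random
from dataclasses import dataclass


@dataclass
class StepRecord:
    task: str
    scaffold: str
    solution: str
    reward: float  # shared by the scaffold stage and the solution stage


def propose_scaffold(task: str, previous_scaffold: str) -> str:
    """Scaffold stage (stub): refine the scaffold used last time for this task."""
    return f"{previous_scaffold}+retry" if previous_scaffold else "plan->act->verify"


def produce_solution(task: str, scaffold: str) -> str:
    """Solution stage (stub): roll out a solution conditioned on task and scaffold."""
    return f"solution for {task!r} using [{scaffold}]"


def verifier_reward(solution: str, rng: random.Random) -> float:
    """Stand-in for the environment's verifier; returns a reward in [0, 1)."""
    return rng.random()


def rl_step(task: str, scaffold_memory: dict, rng: random.Random) -> StepRecord:
    previous = scaffold_memory.get(task, "")
    scaffold = propose_scaffold(task, previous)   # stage 1
    solution = produce_solution(task, scaffold)   # stage 2
    reward = verifier_reward(solution, rng)
    scaffold_memory[task] = scaffold              # becomes "the scaffold used last time"
    # A real trainer would apply this one reward to the tokens of both stages.
    return StepRecord(task, scaffold, solution, reward)


if __name__ == "__main__":
    memory = {}
    rng = random.Random(0)
    for _ in range(3):
        print(rl_step("fix-failing-test", memory, rng))
```

The point of the sketch is that a single scalar reward is attached to one record holding both the scaffold and the solution, which is what lets orchestration improve alongside the patches.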
For teams running Claude Code agent workflows, Cursor agents, or custom MCP loops, the implication is practical: orchestration is trainable, not only prompt-engineered.
Benchmark Results: 397B MoE vs Frontier Models
DeepReinforce reports that Ornith-1.0-397B leads comparable open-weight models on agentic coding suites. Against Claude Opus 4.7 it scores higher on both Terminal-Bench 2.1 configurations and on SWE-Bench Verified, and slightly lower on SWE-Bench Pro and ClawEval. Claude Opus 4.8 and GLM-5.2-744B still top some columns.
| Benchmark | Ornith-1.0-397B | Qwen3.5-397B | Claude Opus 4.7 | Claude Opus 4.8 | DeepSeek-V4-Pro |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 53.5 | 70.3 | 85.0 | 67.9 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 48.6 | 69.7 | 78.9 | 66.5 |
| SWE-Bench Verified | 82.4 | 76.4 | 80.8 | 87.6 | 80.6 |
| SWE-Bench Pro | 62.2 | 51.6 | 64.3 | 69.2 | 55.4 |
| SWE-Bench Multilingual | 78.9 | 69.3 | — | — | 76.2 |
| NL2Repo | 48.2 | 36.8 | — | 69.7 | — |
| ClawEval (avg) | 77.1 | 70.7 | 78.2 | — | 75.8 |
Sources: Ornith-1.0 technical blog, June 2026. A dash (—) means the model was not listed for that benchmark in DeepReinforce's public table.
Three takeaways for engineering leaders:
Terminal-Bench 2.1 — Ornith-1.0-397B at 77.5 under Harbor/Terminus-2 is a meaningful jump over Qwen3.5-397B (53.5) and closes much of the gap to closed frontier models. See our Terminal-Bench 2.0 guide for why this benchmark stresses real shell workflows.
SWE-Bench Verified — 82.4 puts the open model in the same band as Claude Opus 4.7 (80.8) and DeepSeek-V4-Pro (80.6), though Opus 4.8 still leads at 87.6.
Harness sensitivity — Ornith scores 78.2 on Terminal-Bench when evaluated through Claude Code 2.1.126, not just Terminus-2. That suggests the learned scaffolds transfer across agent runtimes, but always verify on your toolchain.
Mid-Size and Edge Variants: 35B and 9B
Not every team can serve a 397B MoE cluster. Ornith-1.0's smaller checkpoints are where the release gets interesting for cost-conscious deployments.
Ornith-1.0-35B MoE
| Benchmark | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B | Qwen3.5-397B |
| --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5 | 53.5 |
| SWE-Bench Verified | 75.6 | 70.0 | 73.4 | 76.4 |
| SWE-Bench Pro | 50.4 | 44.6 | 49.5 | 51.6 |
| ClawEval (avg) | 69.8 | 65.4 | 68.7 | 70.7 |
The 35B model beating Qwen3.5-397B on Terminal-Bench 2.1 (64.2 vs. 53.5) is the standout efficiency story in DeepReinforce's tables. A 35B MoE checkpoint should not outperform a 397B-class base on terminal agent tasks unless post-training and scaffold learning are doing substantial work.
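To see where the 35B gains concentrate, the per-benchmark point differences against the same-size Qwen3.5-35B column can be computed directly from the table. The scores below are copied from that table; the `SCORES` dict and `point_gains` helper are illustrative names only.

```python
# Scores copied from the 35B table: (Ornith-1.0-35B, Qwen3.5-35B).
SCORES = {
    "Terminal-Bench 2.1 (Terminus-2)": (64.2, 41.4),
    "SWE-Bench Verified": (75.6, 70.0),
    "SWE-Bench Pro": (50.4, 44.6),
    "ClawEval (avg)": (69.8, 65.4),
}


def point_gains(scores):
    """Absolute difference in benchmark points, rounded to one decimal place."""
    return {name: round(ornith - qwen, 1) for name, (ornith, qwen) in scores.items()}


if __name__ == "__main__":
    for name, gain in point_gains(SCORES).items():
        print(f"{name}: +{gain} points")
```

The Terminal-Bench difference (22.8 points) is roughly four times the size of the others (4.4 to 5.8 points), which fits the article's reading that the largest reported gains are on terminal agent tasks.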
Ornith-1.0-9B Dense (edge-friendly)
| Benchmark | Ornith-1.0-9B | Qwen3.5-9B | Gemma4-31B |
| --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 43.1 | 21.3 | 42.1 |
| SWE-Bench Verified | 69.4 | 53.2 | 52.0 |
| SWE-Bench Pro | 42.9 | 31.3 | 35.7 |
| ClawEval (avg) | 63.1 | 53.2 | 48.5 |
A 9B dense model scoring 43.1 on Terminal-Bench 2.1, essentially matching Gemma4-31B (42.1), is strong evidence that agentic coding skills can compress into edge-deployable footprints when training targets scaffolds and solutions jointly.
Fighting Reward Hacking in Self-Scaffolding RL
Letting the model author its own scaffold creates a familiar failure mode: the scaffold learns to game the verifier instead of solving the task. DeepReinforce documents examples such as reading visible test files and hardcoding expected outputs, touching files the grader checks without implementing behavior, or copying oracle solutions when they leak into the environment.
Their mitigation stack has three layers:
Fixed trust boundary — environments, tool surfaces, and test isolation stay immutable. The model may only evolve inner scaffold logic: memory, error handling, orchestration.
Deterministic monitor — flags attempts to read withheld paths, modify verification scripts, or call tools outside the sanctioned surface. Violations get zero reward and drop out of advantage computation.
Frozen LLM judge — acts as a veto on top of the verifier when intent-level gaming stays inside allowed tools.
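The monitor and veto layers can be illustrated with a small reward filter. This is a sketch under stated assumptions, not DeepReinforce's code: the withheld paths, verifier path, and tool names are hypothetical, the group-normalized advantage is the usual GRPO-style form, and treating a judge veto as "reward set to zero but still counted in the group" is an assumption made here for the example.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

WITHHELD_PREFIXES = ("/tests/hidden/", "/oracle/")    # hypothetical withheld paths
VERIFIER_PREFIX = "/verifier/"                        # hypothetical grader scripts
ALLOWED_TOOLS = {"shell", "read_file", "write_file"}  # hypothetical tool surface


@dataclass
class Rollout:
    reward: float
    tool_calls: list = field(default_factory=list)  # (tool_name, path) pairs
    judge_veto: bool = False                         # set by the frozen LLM judge


def violates_boundary(rollout: Rollout) -> bool:
    """Deterministic monitor: flag calls outside the sanctioned surface."""
    for tool, path in rollout.tool_calls:
        if tool not in ALLOWED_TOOLS:
            return True
        if path.startswith(WITHHELD_PREFIXES):
            return True
        if tool == "write_file" and path.startswith(VERIFIER_PREFIX):
            return True
    return False


def group_advantages(rollouts, eps=1e-6):
    """One advantage per rollout; None marks rollouts dropped from the group."""
    flagged = [violates_boundary(r) for r in rollouts]
    rewards = [0.0 if (f or r.judge_veto) else r.reward
               for r, f in zip(rollouts, flagged)]
    kept = [rw for rw, f in zip(rewards, flagged) if not f]
    if not kept:
        return [None] * len(rollouts)
    mu, sigma = mean(kept), pstdev(kept)
    return [None if f else (rw - mu) / (sigma + eps)
            for rw, f in zip(rewards, flagged)]


if __name__ == "__main__":
    group = [
        Rollout(1.0, [("shell", "/app/main.py")]),
        Rollout(1.0, [("read_file", "/tests/hidden/test_api.py")]),  # flagged
        Rollout(0.0, [("shell", "/app/util.py")]),
        Rollout(1.0, [("shell", "/app/io.py")], judge_veto=True),    # vetoed
    ]
    print(group_advantages(group))
```

Excluding flagged rollouts from the mean and standard deviation matters: if they stayed in the group as zeros, they would shift the baseline and change the advantages of honest rollouts.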
This mirrors broader industry concern about eval contamination and reward hacking. Ornith's approach is notable because the attack surface includes scaffold code the model writes about itself, not only final patches.
Pipeline RL and Long Rollouts
Agentic coding rollouts are long. Standard on-policy RL becomes expensive when trajectories span thousands of tokens across tool calls. Ornith-1.0 uses pipeline RL with staleness-weighted GRPO: older off-policy tokens are down-weighted by age and discarded past a threshold, so long-horizon training stays stable without treating every stale token as equally valid.
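The idea of down-weighting stale tokens and discarding them past a threshold can be sketched as follows. The exact weighting scheme is the one published in DeepReinforce's technical blog and is not reproduced here; the exponential decay form, the `decay` and `max_age` values, and the function names are all assumptions chosen for illustration.

```python
import math


def staleness_weight(age: int, decay: float = 0.5, max_age: int = 4) -> float:
    """Weight for a token sampled `age` policy versions ago.

    Assumed form: exponential decay in age, hard cutoff beyond `max_age`.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age > max_age:
        return 0.0  # too stale: discarded
    return math.exp(-decay * age)


def weighted_token_loss(token_losses, ages, decay: float = 0.5, max_age: int = 4) -> float:
    """Staleness-weighted mean of per-token losses."""
    weights = [staleness_weight(a, decay, max_age) for a in ages]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every token was past the threshold
    return sum(w * l for w, l in zip(weights, token_losses)) / total


if __name__ == "__main__":
    losses = [0.8, 1.2, 0.5, 2.0]
    ages = [0, 1, 3, 9]  # the last token exceeds max_age and is dropped
    print([round(staleness_weight(a), 3) for a in ages])
    print(round(weighted_token_loss(losses, ages), 4))
```

In a real trainer the per-token losses would be the clipped GRPO terms, and age would be measured in policy updates between sampling and training.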
DeepReinforce publishes the weighting scheme and clipped token-level GRPO loss in the technical blog. For practitioners, the important point is architectural: self-scaffolding only works if RL infrastructure can absorb multi-hour agent trajectories — the same constraint that shows up in agent harness engineering write-ups.
Evaluation Methodology (What the Numbers Actually Mean)
Scores are not directly comparable unless harness, temperature, context window, and run count match. DeepReinforce documents:
| Benchmark | Harness / settings (from DeepReinforce footnotes) |
| --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | Harbor/Terminus-2, temp=1.0, top_p=1.0, 128K context, 4h timeout, 32 CPU / 48GB RAM, 5-run average |
| Terminal-Bench 2.1 (Claude Code) | Claude Code 2.1.126, temp=1.0, max 131072 tokens, 5-run average |
| SWE-Bench Verified / Pro / Multilingual | OpenHands, temp=1.0, top_p=0.95, 256K context |
| SWE Atlas (QnA / RF / TW) | mini SWE agent, temp=1.0, top_p=0.95, 128K context, 5-run average |
They also ship a modified Qwen chat template for training/inference alignment (chat_template.jinja on HF) and Harbor tweaks for vLLM reasoning_content keys. Reproducing leaderboard numbers requires matching those details, not only loading weights.
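Before comparing a local run against the published numbers, it can help to diff the local settings against the documented ones. The values below are copied from the settings listed above; the dictionary keys and the `setting_mismatches` helper are hypothetical names, not part of any DeepReinforce tooling.

```python
# Published settings as listed in DeepReinforce's footnotes (subset).
PUBLISHED = {
    "Terminal-Bench 2.1 (Terminus-2)": {
        "harness": "Harbor/Terminus-2",
        "temperature": 1.0,
        "top_p": 1.0,
        "context": "128K",
        "timeout": "4h",
        "runs": 5,
    },
    "Terminal-Bench 2.1 (Claude Code)": {
        "harness": "Claude Code 2.1.126",
        "temperature": 1.0,
        "max_tokens": 131072,
        "runs": 5,
    },
    "SWE-Bench Verified": {
        "harness": "OpenHands",
        "temperature": 1.0,
        "top_p": 0.95,
        "context": "256K",
    },
}


def setting_mismatches(benchmark: str, local: dict) -> dict:
    """Map each differing key to a (published, local) pair."""
    published = PUBLISHED[benchmark]
    return {
        key: (value, local.get(key))
        for key, value in published.items()
        if local.get(key) != value
    }


if __name__ == "__main__":
    my_run = {"harness": "OpenHands", "temperature": 0.2, "top_p": 0.95, "context": "256K"}
    print(setting_mismatches("SWE-Bench Verified", my_run))
```

An empty result only means the listed settings agree; the chat template and Harbor changes mentioned above still have to be matched separately.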
Who Should Try Ornith-1.0 First?
Strong fit:
Teams building self-hosted coding agents who need MIT-licensed weights
Researchers studying learned harnesses vs fixed OpenHands/Terminus configs
Orgs evaluating 9B–35B models for cost-sensitive agent loops on private repos
Proceed with caution:
Production systems that require independently reproduced benchmark numbers before model swaps
Workloads where Claude Opus 4.8 or GLM-5.2 still lead on your target eval
Teams without GPU capacity for 397B MoE serving — start with 9B or 35B and measure on internal tasks
2026's coding-model releases increasingly optimize for agent trajectories, not single-turn code completion. Ornith-1.0 sits alongside:
Closed frontier models (Claude Opus 4.7/4.8, GPT-5.x, Gemini 3.x) tuned on proprietary agent data
Open-weight bases (Qwen 3.5/3.6, Gemma 4, DeepSeek-V4) with community harnesses
Benchmark-focused post-training shops (DeepReinforce here, plus eval-driven releases such as DeepSWE)
Ornith's differentiator is explicit: learn the scaffold. That aligns with loop engineering and durable agent workflow design — treat orchestration as a first-class artifact, not an afterthought wrapped around chat completions.
Ornith-1.0 is one of the most interesting open-weight coding releases of June 2026 because it attacks harness design — not just next-token loss on GitHub diffs. The 397B MoE variant reports 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified under DeepReinforce's published protocols, with MIT-licensed weights from 9B to 397B.
Treat the leaderboard as a strong directional signal, not a deployment checklist. Reproduce scores on your repositories, your CI, and your agent stack before committing infrastructure. If self-scaffolding RL holds up under independent audit, it could reshape how teams think about agent loops and benchmark-specific tuning.