Ornith-1.0 is an open-source family of large language models from DeepReinforce.AI, post-trained for agentic coding. Variants span 9B Dense, 31B Dense, 35B MoE, and 397B MoE, built on Gemma 4 and Qwen 3.5 bases and released under the MIT license.

What makes Ornith-1.0 different from other coding models?

Ornith-1.0 uses a self-improving RL framework where the model generates both a task-specific scaffold (the agent harness) and the solution rollout. Reward from the rollout trains both stages, so the model learns orchestration patterns instead of relying on a fixed human-written harness.

How does Ornith-1.0-397B compare to Claude Opus 4.7 on coding benchmarks?

On DeepReinforce's published tables, Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 (Terminus-2) and 82.4 on SWE-Bench Verified, versus 70.3 and 80.8 for Claude Opus 4.7. Claude Opus 4.8 still leads on several benchmarks, including 85.0 on Terminal-Bench 2.1 (Terminus-2) and 87.6 on SWE-Bench Verified.

Can Ornith-1.0 be used commercially?

Yes. DeepReinforce releases all Ornith-1.0 weights under the MIT license, which permits commercial and research use without copyleft requirements.

Where can I download Ornith-1.0 weights?

Weights and model cards are on Hugging Face in the deepreinforce-ai Ornith-1.0 collection. The technical blog at deep-reinforce.com documents evaluation harnesses, chat templates, and training details.

How does Ornith prevent reward hacking when the model writes its own scaffold?

DeepReinforce uses three layers: an immutable outer environment and tool boundary, a deterministic monitor that zero-rewards forbidden actions such as reading withheld test paths, and a frozen LLM judge that can veto trajectories that pass verifiers without doing real work.

Ornith-1.0: Self-Scaffolding Open Models for Agentic Coding | explainx.ai Blog

On June 25, 2026, the DeepReinforce.AI team behind @ornith_ announced Ornith-1.0 — a family of MIT-licensed, open-weight models built specifically for agentic coding. The release spans 9B Dense, 31B Dense, 35B MoE, and 397B MoE checkpoints, post-trained on Gemma 4 and Qwen 3.5 bases.

The technical bet is not just bigger pretraining. Ornith-1.0 treats the agent scaffold — memory layout, retry logic, tool orchestration — as something the model learns during reinforcement learning, not something engineers hard-code once per benchmark category. That is why the team calls it a self-scaffolding training strategy.

TL;DR: Ornith-1.0 at a Glance

Detail	Value
Release date	June 25, 2026
License	MIT (commercial + research)
Model sizes	9B Dense, 31B Dense, 35B MoE, 397B MoE
Base models	Gemma 4 and Qwen 3.5
Flagship scores (397B)	77.5 Terminal-Bench 2.1 (Terminus-2), 82.4 SWE-Bench Verified
Key training idea	Joint RL on scaffold generation + solution rollouts
Weights	Hugging Face collection
Technical blog	deep-reinforce.com/ornith_1_0.html

DeepReinforce's launch post summarizes the positioning plainly: "Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts."

Why Agent Scaffolds Matter for Coding Agents

Most public coding-agent scores combine three ingredients: the base model, the harness (OpenHands, Harbor/Terminus-2, Claude Code, mini SWE agent), and the benchmark task distribution. When harness design is fixed, leaderboard gains can reflect benchmark-specific tuning as much as raw model capability.

Ornith-1.0 attacks that coupling directly. Each RL step runs in two stages:

Scaffold stage — conditioned on the task and the scaffold used last time, the model proposes a refined scaffold.
Solution stage — conditioned on that scaffold and the task description, the model produces a solution rollout.

The rollout's reward serves as the training signal for both stages, so the scaffold is credited or penalized for the trajectory it induced. Over training, scaffolds that induce higher-reward trajectories survive and weak orchestration patterns get replaced. Per-task-category strategies can emerge without a human maintaining separate harness configs for Terminal-Bench, SWE-Bench, and repo-generation evals.
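The two-stage step can be sketched in a few lines of Python. This is a toy illustration, not DeepReinforce's implementation: the Scaffold class, the stub functions, the toy verifier, and the rule for carrying the best scaffold forward are all assumptions made for the example. What it shows is the data flow, where one rollout reward is attached to both the scaffold and the solution.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scaffold:
    """Hypothetical scaffold: a single orchestration knob, for illustration only."""
    max_retries: int


@dataclass(frozen=True)
class Sample:
    scaffold: Scaffold
    rollout: str
    reward: float


def propose_scaffold(task: str, previous: Scaffold, rng: random.Random) -> Scaffold:
    # Scaffold stage (stub): refine the previous scaffold. A real system would
    # sample this from the model, conditioned on the task and the prior scaffold.
    step = rng.choice([-1, 0, 1])
    return Scaffold(max_retries=max(0, previous.max_retries + step))


def run_rollout(task: str, scaffold: Scaffold) -> str:
    # Solution stage (stub): a real system would run the agent in a sandbox.
    return f"{task}: attempted with up to {scaffold.max_retries} retries"


def verify(scaffold: Scaffold) -> float:
    # Toy verifier (assumption): the task succeeds once two retries are allowed.
    return 1.0 if scaffold.max_retries >= 2 else 0.0


def rl_step(task: str, previous: Scaffold, rng: random.Random,
            group_size: int = 4) -> List[Sample]:
    samples = []
    for _ in range(group_size):
        scaffold = propose_scaffold(task, previous, rng)
        rollout = run_rollout(task, scaffold)
        samples.append(Sample(scaffold, rollout, verify(scaffold)))
    return samples


def training_signal(samples: List[Sample]) -> List[Tuple[str, str, float]]:
    # The same rollout reward is attached to the outputs of both stages.
    signal = []
    for s in samples:
        signal.append(("scaffold", repr(s.scaffold), s.reward))
        signal.append(("solution", s.rollout, s.reward))
    return signal


def next_scaffold(samples: List[Sample], fallback: Scaffold) -> Scaffold:
    # Assumption: the highest-reward scaffold becomes "the scaffold used last time".
    if not samples:
        return fallback
    return max(samples, key=lambda s: s.reward).scaffold


if __name__ == "__main__":
    rng = random.Random(0)
    batch = rl_step("fix failing test", Scaffold(max_retries=1), rng)
    for stage, text, reward in training_signal(batch):
        print(stage, reward, text)
```

In a real trainer the tuples from training_signal would feed a policy-gradient update; here they only make the shared credit assignment visible.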

For teams running Claude Code agent workflows, Cursor agents, or custom MCP loops, the implication is practical: orchestration is trainable, not only prompt-engineered.

Benchmark Results: 397B MoE vs Frontier Models

DeepReinforce reports that Ornith-1.0-397B leads comparable open-weight models on agentic coding suites and exceeds Claude Opus 4.7 on Terminal-Bench 2.1 and SWE-Bench Verified, while trailing it on SWE-Bench Pro and ClawEval. Claude Opus 4.8 and GLM-5.2-744B still top some columns; GLM-5.2-744B is not among the columns reproduced here.

Benchmark	Ornith-1.0-397B	Qwen3.5-397B	Claude Opus 4.7	Claude Opus 4.8	DeepSeek-V4-Pro
Terminal-Bench 2.1 (Terminus-2)	77.5	53.5	70.3	85.0	67.9
Terminal-Bench 2.1 (Claude Code)	78.2	48.6	69.7	78.9	66.5
SWE-Bench Verified	82.4	76.4	80.8	87.6	80.6
SWE-Bench Pro	62.2	51.6	64.3	69.2	55.4
SWE-Bench Multilingual	78.9	69.3	—	—	76.2
NL2Repo	48.2	36.8	—	69.7	—
ClawEval (avg)	77.1	70.7	78.2	—	75.8

Sources: Ornith-1.0 technical blog, June 2026. A dash means the model was not listed in DeepReinforce's public table for that benchmark.

Three takeaways for engineering leaders:

Terminal-Bench 2.1: Ornith-1.0-397B at 77.5 under Harbor/Terminus-2 is a 24-point jump over Qwen3.5-397B (53.5) and sits ahead of Claude Opus 4.7 (70.3), though 7.5 points behind Claude Opus 4.8 (85.0). See our Terminal-Bench 2.0 guide for why this benchmark stresses real shell workflows.
SWE-Bench Verified: 82.4 puts the open model in the same band as Claude Opus 4.7 (80.8) and DeepSeek-V4-Pro (80.6), though Opus 4.8 still leads at 87.6.
Harness sensitivity: Ornith scores 78.2 on Terminal-Bench 2.1 when evaluated through Claude Code 2.1.126, close to its 77.5 under Terminus-2. That suggests the trained behavior transfers across agent runtimes, but always verify on your own toolchain.
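The gaps quoted in these takeaways are simple differences over the table above. A small sketch makes them easy to recompute; the scores are copied from the table, while the function names gap and wins_against are made up for this example.

```python
from typing import Dict, List, Optional

MODELS = ("Ornith-1.0-397B", "Qwen3.5-397B", "Claude Opus 4.7",
          "Claude Opus 4.8", "DeepSeek-V4-Pro")

# Rows copied from the table; None marks a dash (model not listed).
ROWS = {
    "Terminal-Bench 2.1 (Terminus-2)": (77.5, 53.5, 70.3, 85.0, 67.9),
    "Terminal-Bench 2.1 (Claude Code)": (78.2, 48.6, 69.7, 78.9, 66.5),
    "SWE-Bench Verified": (82.4, 76.4, 80.8, 87.6, 80.6),
    "SWE-Bench Pro": (62.2, 51.6, 64.3, 69.2, 55.4),
    "SWE-Bench Multilingual": (78.9, 69.3, None, None, 76.2),
    "NL2Repo": (48.2, 36.8, None, 69.7, None),
    "ClawEval (avg)": (77.1, 70.7, 78.2, None, 75.8),
}

SCORES: Dict[str, Dict[str, Optional[float]]] = {
    bench: dict(zip(MODELS, values)) for bench, values in ROWS.items()
}


def gap(benchmark: str, other: str) -> Optional[float]:
    """Ornith-1.0-397B score minus the other model's score, in points."""
    ours = SCORES[benchmark]["Ornith-1.0-397B"]
    theirs = SCORES[benchmark][other]
    if ours is None or theirs is None:
        return None
    return round(ours - theirs, 1)


def wins_against(other: str) -> List[str]:
    """Benchmarks where Ornith-1.0-397B scores strictly higher than the other model."""
    result = []
    for bench in SCORES:
        delta = gap(bench, other)
        if delta is not None and delta > 0:
            result.append(bench)
    return result


if __name__ == "__main__":
    for model in MODELS[1:]:
        print(model, wins_against(model))
```

Score differences of a point or two should be read with the run-to-run variance in mind, since several of these figures are multi-run averages.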

Mid-Size and Edge Variants: 35B and 9B

Not every team can serve a 397B MoE model. Ornith-1.0's smaller checkpoints are where the release gets interesting for cost-conscious deployments.

Ornith-1.0-35B MoE

Benchmark	Ornith-1.0-35B	Qwen3.5-35B	Qwen3.6-35B	Qwen3.5-397B
Terminal-Bench 2.1 (Terminus-2)	64.2	41.4	52.5	53.5
SWE-Bench Verified	75.6	70.0	73.4	76.4
SWE-Bench Pro	50.4	44.6	49.5	51.6
ClawEval (avg)	69.8	65.4	68.7	70.7

The 35B model beating Qwen3.5-397B on Terminal-Bench 2.1 (64.2 vs. 53.5) is the standout efficiency story in DeepReinforce's tables. A 35B-class MoE checkpoint would not be expected to outperform a 397B-class base on terminal agent tasks unless post-training and scaffold learning were doing substantial work. On the other three rows the 35B model still trails Qwen3.5-397B, by about one point each.

Ornith-1.0-9B Dense (edge-friendly)

Benchmark	Ornith-1.0-9B	Qwen3.5-9B	Gemma4-31B
Terminal-Bench 2.1 (Terminus-2)	43.1	21.3	42.1
SWE-Bench Verified	69.4	53.2	52.0
SWE-Bench Pro	42.9	31.3	35.7
ClawEval (avg)	63.1	53.2	48.5

A 9B dense model scoring 43.1 on Terminal-Bench 2.1, slightly ahead of Gemma4-31B (42.1) and well ahead of it on the other three rows, suggests that agentic coding skills can compress into edge-deployable footprints when training targets scaffolds and solutions jointly.

Fighting Reward Hacking in Self-Scaffolding RL

Letting the model author its own scaffold creates a familiar failure mode: the scaffold learns to game the verifier instead of solving the task. DeepReinforce documents examples such as reading visible test files and hardcoding expected outputs, touching files the grader checks without implementing behavior, or copying oracle solutions when they leak into the environment.

Their mitigation stack has three layers:

Fixed trust boundary — environments, tool surfaces, and test isolation stay immutable. The model may only evolve inner scaffold logic: memory, error handling, orchestration.
Deterministic monitor — flags attempts to read withheld paths, modify verification scripts, or call tools outside the sanctioned surface. Violations get zero reward and drop out of advantage computation.
Frozen LLM judge — acts as a veto on top of the verifier when intent-level gaming stays inside allowed tools.
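A minimal sketch of how the three layers could combine into one reward decision follows. Every name here is hypothetical (the tool set, the path prefixes, the Outcome type), and the judge is a plain callable standing in for a frozen LLM. The source says monitor violations get zero reward and drop out of advantage computation; keeping judge-vetoed trajectories in the advantage computation is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    tool: str
    path: str = ""


@dataclass(frozen=True)
class Trajectory:
    actions: List[Action]
    verifier_passed: bool


@dataclass(frozen=True)
class Outcome:
    reward: float
    counted: bool  # False means: exclude from advantage computation


# Hypothetical fixed trust boundary; the model cannot change these.
ALLOWED_TOOLS = {"shell", "read_file", "write_file"}
WITHHELD_PREFIXES = ("/hidden_tests/",)
PROTECTED_PATHS = {"/grader/verify.sh"}


def monitor_violation(traj: Trajectory) -> bool:
    # Deterministic checks only. A real monitor would also inspect shell
    # commands; this toy version looks at structured file actions.
    for action in traj.actions:
        if action.tool not in ALLOWED_TOOLS:
            return True
        if action.tool == "read_file" and action.path.startswith(WITHHELD_PREFIXES):
            return True
        if action.tool == "write_file" and action.path in PROTECTED_PATHS:
            return True
    return False


def gated_reward(traj: Trajectory,
                 judge_vetoes: Callable[[Trajectory], bool]) -> Outcome:
    if monitor_violation(traj):
        return Outcome(reward=0.0, counted=False)
    if not traj.verifier_passed:
        return Outcome(reward=0.0, counted=True)
    # The judge only matters for trajectories the verifier accepted.
    if judge_vetoes(traj):
        return Outcome(reward=0.0, counted=True)  # assumption: still counted
    return Outcome(reward=1.0, counted=True)


if __name__ == "__main__":
    peek = Trajectory([Action("read_file", "/hidden_tests/test_a.py")], True)
    print(gated_reward(peek, lambda t: False))
```

Ordering the checks this way means the cheap deterministic monitor runs first and the judge is consulted only when the verifier has already passed.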

This mirrors broader industry concern about eval contamination and reward hacking. Ornith's approach is notable because the attack surface includes the scaffold code the model writes for itself, not only final patches.

Pipeline RL and Long Rollouts

Agentic coding rollouts are long. Standard on-policy RL becomes expensive when trajectories span thousands of tokens across tool calls. Ornith-1.0 uses pipeline RL with staleness-weighted GRPO: older off-policy tokens are down-weighted by age and discarded past a threshold, so long-horizon training stays stable without treating every stale token as equally valid.

DeepReinforce publishes the weighting scheme and clipped token-level GRPO loss in the technical blog. For practitioners, the important point is architectural: self-scaffolding only works if RL infrastructure can absorb multi-hour agent trajectories — the same constraint that shows up in agent harness engineering write-ups.
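To make the idea concrete, here is a rough sketch of staleness weighting combined with a clipped, group-relative objective. It is not the scheme from the technical blog: the exponential decay, the max_age cutoff, the clip value, and all function names are assumptions chosen for illustration. Age is taken to mean the number of policy updates between when a token was sampled and when it is trained on.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # GRPO-style advantage: normalize each reward against its sampling group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def staleness_weight(age: int, max_age: int = 4, decay: float = 0.5) -> float:
    # Assumed form: exponential down-weighting, hard discard past max_age.
    if age > max_age:
        return 0.0
    return decay ** age


def clipped_token_objective(ratio: float, advantage: float, clip: float = 0.2) -> float:
    # Standard clipped surrogate for one token; ratio = new_prob / old_prob.
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


def staleness_weighted_loss(ratios: List[float], ages: List[int],
                            advantage: float) -> float:
    # Negative weighted mean of the per-token objective for one rollout.
    weights = [staleness_weight(age) for age in ages]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every token was too stale to use
    objective = sum(w * clipped_token_objective(r, advantage)
                    for w, r in zip(weights, ratios))
    return -objective / total


if __name__ == "__main__":
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    loss = staleness_weighted_loss([1.1, 0.9, 1.3], [0, 2, 6], advantages[0])
    print(advantages, loss)
```

Normalizing by the sum of weights keeps the loss scale comparable between fresh and mostly stale rollouts; whether the published scheme does the same is not stated in this article.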

Evaluation Methodology (What the Numbers Actually Mean)

Scores are not directly comparable unless harness, temperature, context window, and run count match. DeepReinforce documents:

Benchmark	Harness / settings (from DeepReinforce footnotes)
Terminal-Bench 2.1 (Terminus-2)	Harbor/Terminus-2, temp=1.0, top_p=1.0, 128K context, 4h timeout, 32 CPU / 48GB RAM, 5-run average
Terminal-Bench 2.1 (Claude Code)	Claude Code 2.1.126, temp=1.0, max 131072 tokens, 5-run average
SWE-Bench Verified / Pro / Multilingual	OpenHands, temp=1.0, top_p=0.95, 256K context
SWE Atlas (QnA / RF / TW)	mini SWE agent, temp=1.0, top_p=0.95, 128K context, 5-run average
NL2Repo	temp=1.0, top_p=1.0, 400K context, 48K output, anti-hacking filters
ClawEval	Real-user task distribution, temp=0.6, 256K context

They also ship a modified Qwen chat template for training/inference alignment (chat_template.jinja on HF) and Harbor tweaks for vLLM reasoning_content keys. Reproducing leaderboard numbers requires matching those details, not only loading weights.
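One low-effort way to guard against apples-to-oranges comparisons is to record the published settings as data and diff them against a local run before comparing scores. The sketch below does that for two rows of the table; the EvalSettings class and the mismatches helper are hypothetical, and context sizes are kept as the strings written in the footnotes rather than converted to token counts.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class EvalSettings:
    harness: str
    temperature: float
    top_p: Optional[float]  # None where the footnote does not state it
    context: str            # kept as written in the footnote, e.g. "128K"
    runs: Optional[int]     # None where no run count is stated


# Values copied from the methodology table above.
TERMINUS_2_PUBLISHED = EvalSettings("Harbor/Terminus-2", 1.0, 1.0, "128K", 5)
SWE_BENCH_PUBLISHED = EvalSettings("OpenHands", 1.0, 0.95, "256K", None)


def mismatches(published: EvalSettings, local: EvalSettings) -> List[str]:
    """Names of settings that differ between the published and local runs."""
    return [f.name for f in fields(published)
            if getattr(published, f.name) != getattr(local, f.name)]


if __name__ == "__main__":
    # Hypothetical local run: lower temperature and a single run.
    local_run = EvalSettings("Harbor/Terminus-2", 0.6, 1.0, "128K", 1)
    diff = mismatches(TERMINUS_2_PUBLISHED, local_run)
    if diff:
        print("Not directly comparable; differing settings:", diff)
```

An empty list only means the recorded settings agree. Details the table does not capture, such as the chat template or hardware limits, still need a manual check.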

Who Should Try Ornith-1.0 First?

Strong fit:

Teams building self-hosted coding agents who need MIT-licensed weights
Researchers studying learned harnesses vs fixed OpenHands/Terminus configs
Orgs evaluating 9B–35B models for cost-sensitive agent loops on private repos

Proceed with caution:

Production systems that require independently reproduced benchmark numbers before model swaps
Workloads where Claude Opus 4.8 or GLM-5.2 still lead on your target eval
Teams without GPU capacity for 397B MoE serving — start with 9B or 35B and measure on internal tasks

Browse related open models in the explainx.ai LLM directory and agent tooling in the MCP server registry.

How Ornith Fits the 2026 Agentic Coding Landscape

2026's coding-model releases increasingly optimize for agent trajectories, not single-turn code completion. Ornith-1.0 sits alongside:

Closed frontier models (Claude Opus 4.7/4.8, GPT-5.x, Gemini 3.x) tuned on proprietary agent data
Open-weight bases (Qwen 3.5/3.6, Gemma 4, DeepSeek-V4) with community harnesses
Benchmark-focused post-training shops (DeepReinforce here, plus eval-driven releases like DeepSWE discussions)

Ornith's differentiator is explicit: learn the scaffold. That aligns with loop engineering and durable agent workflow design — treat orchestration as a first-class artifact, not an afterthought wrapped around chat completions.

Summary

Ornith-1.0 is one of the most interesting open-weight coding releases of June 2026 because it attacks harness design — not just next-token loss on GitHub diffs. The 397B MoE variant reports 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified under DeepReinforce's published protocols, with MIT-licensed weights from 9B to 397B.

Treat the leaderboard as a strong directional signal, not a deployment checklist. Reproduce scores on your repositories, your CI, and your agent stack before committing infrastructure. If self-scaffolding RL holds up under independent audit, it could reshape how teams think about agent loops and benchmark-specific tuning.

Sources: Ornith-1.0 technical blog (DeepReinforce.AI, June 2026), Hugging Face Ornith-1.0 collection, and @ornith_ on X. Benchmark figures reflect DeepReinforce's published tables as of that date; independent reproduction may differ.
