Column
Ornith-1.0 Deep Dive — DeepReinforce's June 26, 2026 MIT Open-Weights Family Specialized for Agentic Coding Three Sizes (9B Dense / 35B MoE / 397B MoE), All at 262K Context, Built on Qwen 3.5 + Gemma 4, Shipping in BF16 + FP8 + GGUF SWE-Bench Verified 82.4% (397B) / 75.6% (35B) / 69.4% (9B), SWE-Bench Pro 62.2%, Vendor-Reported SOTA Among Open Weights at Each Size Tier Reinforcement Learning Optimizes Both Solution Rollouts AND the Scaffolding That Drives Them — A 'Self-Improving' Design Compatible With OpenHands / Hermes Agent / OpenClaw, ClawEval Benchmark Published — Directly Relevant to Oflight's OpenClaw Service Users

AI2026-06-26

Ornith-1.0 Deep Dive — DeepReinforce's June 26, 2026 MIT Open-Weights Family Specialized for Agentic Coding Three Sizes (9B Dense / 35B MoE / 397B MoE), All at 262K Context, Built on Qwen 3.5 + Gemma 4, Shipping in BF16 + FP8 + GGUF SWE-Bench Verified 82.4% (397B) / 75.6% (35B) / 69.4% (9B), SWE-Bench Pro 62.2%, Vendor-Reported SOTA Among Open Weights at Each Size Tier Reinforcement Learning Optimizes Both Solution Rollouts AND the Scaffolding That Drives Them — A 'Self-Improving' Design Compatible With OpenHands / Hermes Agent / OpenClaw, ClawEval Benchmark Published — Directly Relevant to Oflight's OpenClaw Service Users

DeepReinforce released Ornith-1.0 on June 26, 2026 (official / Hugging Face collection). It is an MIT-licensed open-weights family specialized for agentic coding, with no regional restrictions.

Three sizes: Ornith-1.0-9B (dense, ~19GB BF16) / Ornith-1.0-35B (MoE) / Ornith-1.0-397B (MoE, built on Qwen 3.5 + Gemma 4). All sizes ship 262K context, with FP8 and GGUF quantizations released alongside.

Benchmarks (vendor-reported, claimed SOTA at each open-weights size tier):

| Benchmark | 9B | 35B | 397B |
|---|---|---|---|
| SWE-Bench Verified | 69.4% | 75.6% | 82.4% |
| SWE-Bench Pro | 42.9% | 50.4% | 62.2% |
| SWE-Bench Multilingual | — | — | 78.9% |
| Terminal-Bench 2.1 | 43.1% | 64.2% | 77.5-78.2% |
| NL2Repo | 27.2% | 34.6% | 48.2% |
| ClawEval | — | — | 77.1% |

Design thesis: Reinforcement learning optimizes both the solution rollouts and the scaffolding (the agent structure that drives them) itself — a 'self-improving' agentic-coding design. It sits naturally next to the Loop Engineering Maker-Checker paradigm. Reasoning is exposed via `<think>...</think>` blocks; function calling and tool use are first-class.

Distribution and ops: vLLM ≥ 0.19.1 / SGLang ≥ 0.5.9 / Transformers ≥ 5.8.1 / Docker + llama.cpp / Ollama. OpenAI-compatible endpoints. The 9B fits on a single 80GB GPU; 35B and 397B want an 8×80GB GPU node (TP=8). Agent-framework compatibility: OpenHands, Hermes Agent, and [OpenClaw](../services/openclaw-setup) (Oflight's own service line — and ClawEval is in DeepReinforce's published benchmark set).

DeepReinforce lineage: an RL-focused research organization that has previously shipped CUDA-L1 (avg 3.12× GPU speedup), CUDA-L2 (HGEMM kernels beating cuBLAS), and IterX (MLSys 2026 NVIDIA Track). Ornith-1.0 applies the same RL playbook to LLM self-improvement.

Positioning: alongside Kimi K2.7-Code (1T MoE / 32B active) and GLM-5.2 (Intelligence Index v4.1 = 51, open-weights leader), Ornith-1.0 is at the front of the June-2026 agentic-coding open-weights race. Against Chinese-origin models (Kimi / GLM), its differentiator is MIT license + no regional restrictions + a US-flag procurement story.

Caveat: benchmarks are DeepReinforce's own vendor-reported numbers. Independent third-party verification on public leaderboards has not yet appeared (as of June 26, 2026).

The article closes with three inquiry funnels for Ornith-1.0–era local-LLM evaluation, build, and ongoing maintenance.

Ornith DeepReinforce Open Weight Agentic Coding RL MIT License SWE-Bench OpenClaw

TL;DR — Ornith-1.0 in One Sentence

DeepReinforce released Ornith-1.0 on June 26, 2026 (official / Hugging Face collection).

Four points:

1. MIT open-weights LLM family specialized for agentic coding — no regional restrictions 2. Three sizes ship simultaneously — 9B Dense / 35B MoE / 397B MoE, all at 262K context, with FP8 and GGUF quantizations released alongside 3. SWE-Bench Verified 82.4% (397B) / 75.6% (35B) / 69.4% (9B), SWE-Bench Pro 62.2%, Terminal-Bench 2.1 77.5-78.2% — claimed SOTA at each open-weights size tier 4. Design signature: RL optimizes both solution rollouts AND the scaffolding that drives them — a self-improving design, with first-class compatibility for OpenHands / Hermes Agent / OpenClaw (ClawEval is in DeepReinforce's own benchmark suite)

This column sits next to our Kimi K2.7-Code, Local LLM June 2026 Update, and Loop Engineering coverage as part of the June-2026 agentic-coding cluster.

Release Overview — Three Sizes at Once

Item	Value
Release date	June 26, 2026
Org	DeepReinforce (`deepreinforce-ai`)
License	MIT (all variants, fully commercial, modifiable, redistributable, no regional limits)
Models	Ornith-1.0-9B (dense) / Ornith-1.0-35B (MoE) / Ornith-1.0-397B (MoE)
Quantization	397B FP8 / 35B GGUF / 9B GGUF released alongside
Base	Qwen 3.5 (35B / 397B) + Gemma 4 (397B only)
Context	262,144 tokens (all sizes)
Dtype	BF16 (with FP8 / GGUF variants)
Reasoning	`<think>...</think>` blocks
Tool use	OpenAI-compatible function calling

Who Is DeepReinforce?

DeepReinforce is an RL-focused research organization that has previously published:

- [CUDA-L1](https://github.com/deepreinforce-ai/CUDA-L1): an RL framework for CUDA optimization, with avg 3.12× speedup across 250 real-world GPU tasks (MarkTechPost, 2025-08) - [CUDA-L2](https://github.com/deepreinforce-ai/CUDA-L2): RL-based kernels surpassing cuBLAS on matrix multiply (RTX 3090 HGEMM in March 2026, A100 HGEMM in January 2026) - IterX: presented at MLSys 2026 NVIDIA Track with significant H100 / B200 speedups - Ornith-1.0: applies the same RL playbook to LLM self-improvement

Reading Ornith-1.0 as the next step in this RL-everywhere research line is the right frame.

Design Thesis — Jointly Optimizing Solution Rollouts and Scaffolding

The headline design choice in Ornith-1.0 is that reinforcement learning optimizes both the solution rollouts and the scaffolding (the agent structure that drives those rollouts) — not just one or the other.

Standard RLHF / RLAIF optimizes input → output in one shot. Ornith-1.0 instead trains over input → the agent's reasoning trajectory and tool-call sequence → final output as a single end-to-end RL objective. The consequences:

- The model itself learns how to decompose a problem - The decision policy for when to call which tool gets baked into the weights - Running it inside an agent framework (OpenHands / Hermes / OpenClaw), the scaffolding behavior is already learned, so fewer trials are needed to reach a correct answer

This is essentially the Loop Engineering Maker-Checker pattern internalized into the training objective. It's a different lineage from Sakana Fugu's orchestration model (which dispatches across multiple LLMs); Ornith learns the scaffolding inside a single LLM.

Benchmarks — Vendor-Reported, Claimed SOTA at Each Tier

Benchmark	9B Dense	35B MoE	397B MoE
SWE-Bench Verified	69.4%	75.6%	82.4%
SWE-Bench Pro	42.9%	50.4%	62.2%
SWE-Bench Multilingual	—	—	78.9%
Terminal-Bench 2.1 (Terminus-2)	43.1%	64.2%	77.5-78.2%
NL2Repo	27.2%	34.6%	48.2%
ClawEval	—	—	77.1%

What stands out:

- 397B at SWE-Bench Verified 82.4% — open-weights at that level is in the same neighborhood as Kimi K2.7-Code (which didn't publish SWE-Bench, only vendor-internal benches) - 35B at SWE-Bench Verified 75.6% — within striking distance of Claude Opus 4.8 / GPT-5.5, and runnable on a serious consumer GPU stack (e.g. 8×RTX 5090) - 9B Dense at SWE-Bench Verified 69.4% — a frontier-class agent on a single 80GB GPU or one RTX 5090 - ClawEval 77.1% on 397B — OpenClaw is Oflight's own service line, and seeing DeepReinforce treat ClawEval as a first-class benchmark is meaningful industry validation

Important caveat: these are DeepReinforce's own self-reported numbers. Registration on the public SWE-Bench leaderboard and third-party validations on Aider polyglot / LiveCodeBench / Cognition FrontierCode have not appeared as of June 26, 2026. Run a PoC on your own code before production adoption.

Architecture Details

Ornith-1.0-9B (Dense):

- 9B dense transformer - 262K context, BF16 - Single 80GB GPU (~19GB VRAM) or comfortable on one RTX 5090 (32GB) - Tensor parallelism supports multi-GPU sharding - Realistic production candidate for individual developers and SMBs

Ornith-1.0-35B (MoE):

- 35B Mixture-of-Experts - Built on Qwen 3.5 - 262K context, BF16 - Recommended: 8×80GB GPU node (TP=8); the GGUF quant lets you go lighter - DeepReinforce reports beating Qwen 3.5-35B and Gemma 4-31B on benchmarks - The mid-market production sweet spot

Ornith-1.0-397B (MoE):

- 397B Mixture-of-Experts - Built on Qwen 3.5 + Gemma 4 (hybrid composition) - 262K context, BF16 (FP8 quant available) - Recommended: 8×80GB GPU node (TP=8) - SWE-Bench Verified 82.4% / SWE-Bench Pro 62.2% / Terminal-Bench 2.1 77.5-78.2% — claimed open-weights SOTA at this size tier - The flagship for large-enterprise SI

Distribution and Ops

Recommended inference engines:

- vLLM ≥ 0.19.1 — production GPU serving - SGLang ≥ 0.5.9 — agent workflows (RadixAttention) - Transformers ≥ 5.8.1 — Hugging Face standard - Docker + llama.cpp — CPU / edge via GGUF - Ollama — personal PoC

API compatibility: OpenAI-compatible endpoints (via vLLM / SGLang). Drops into existing tools (Claude Code, Cursor, Aider, Cline, cmux) with config-only changes.

Agent-Framework Compatibility — Including OpenClaw

Ornith-1.0 is officially compatible with:

- OpenHands - Hermes Agent (Nous Research) - [OpenClaw](../services/openclaw-setup) (Oflight's own service line; ClawEval is in DeepReinforce's published benchmarks)

Impact for OpenClaw users: customers running our OpenClaw setup service can swap Ornith-1.0 in as the backend LLM with low effort — change the API key and inference endpoint and you get ClawEval-77.1%-class behavior on the 397B.

Our OpenClaw monthly maintenance plans (Light ¥9,800 / Standard ¥19,800 / Premium ¥49,800) include LLM-model swaps and API-spec updates — Ornith-1.0 migration falls within scope.

Competitive Positioning (June 2026 Agentic-Coding OSS)

Model	Size	License	SWE-Bench Verified	Published benchmarks	Origin
Ornith-1.0-397B	397B MoE	MIT	82.4% (vendor-reported)	SWE-Bench / Terminal-Bench / NL2Repo / ClawEval	US (presumed)
Kimi K2.7-Code	1T MoE / 32B active	Modified MIT	Not publicly reported	Vendor-internal only	China
GLM-5.2	TBD	MIT	TBD	Intelligence Index v4.1 = 51	China
MiniMax M3	TBD	OSS	—	SWE-Bench Pro 59.0%	China
Claude Opus 4.8	Closed	Commercial	~75-80%	Frontier	US
GPT-5.5	Closed	Commercial	~80%	Frontier	US

Ornith-1.0's differentiators:

1. Full three-size MIT open-weights line — same family from individual workstation (9B) to enterprise flagship (397B) 2. US (presumed) origin with no regional restrictions — sidesteps the Chinese-model cross-border-data scrutiny 3. Published benchmarks include the standard SWE-Bench suite — easier to compare than Kimi K2.7-Code's vendor-only numbers 4. RL-optimized scaffolding — a real-world agent-runtime edge by design 5. ClawEval treated as first-class — strong fit for OpenClaw adopters

Use Cases

- Large refactors and multi-file PRs (35B / 397B) - Automated code review inside CI / CD (9B / 35B) - Agent-style SWE-Bench-class problems (all sizes) - Terminal agents via Cline / Aider / cmux (35B recommended) - Multi-turn tool-use workflows - Agentic coding on OpenClaw / OpenHands / Hermes - On-prem, data-sovereignty-sensitive SI engagements

Oflight's View — How to Adopt Ornith-1.0

What we recommend in our AI consulting practice:

Step 1 — PoC on the 9B (from ¥198K, AI consulting assessment): 9B fits on a single RTX 5090 / 80GB GPU, so PoC hardware cost is minimal. Measure ROI on your own coding workloads (CI auto-review, internal SDK Q&A, bug triage).

Step 2 — Production on 35B / 397B (from ¥498K PoC build, ~¥5M+ full SI): 35B fits on a single 8×H100 / B200 node; 397B wants B200 ×4 / H200 ×8 class. Pair with a Japanese GPU cloud (Sakura HPC, GMO GPU, AWS Tokyo p5) or the Intec ¥5M+ SI pattern.

Step 3 — OpenClaw + Ornith for agent-based business automation (with [OpenClaw maintenance plan](../services/openclaw-setup) ¥9.8K–¥49.8K): customers already on OpenClaw can swap Ornith-1.0 in as the backend LLM and pick up ClawEval-77.1%-class performance. LLM swap and prompt tuning are in scope of OpenClaw monthly maintenance.

What We Could Not Officially Confirm

As of June 26, 2026, the following were not confirmable in our research:

- DeepReinforce's HQ location, executive team, and funding status (the official deep-reinforce.com returned 403 to our automated fetcher, so this was not confirmable via the web) - Hosted SaaS API availability and pricing (the product may be self-host / on-prem only) - A Japan-region endpoint - Plans for registration on the [public SWE-Bench leaderboard](https://www.swebench.com/) - Third-party scores on Cognition FrontierCode, Aider polyglot, LiveCodeBench, BigCodeBench - Official VS Code / JetBrains / IntelliJ extensions - AIME / GPQA-Diamond / MMLU and other general-knowledge benchmarks (potentially omitted on purpose, as the model is coding-specialized)

Verify with the Hugging Face collection and DeepReinforce GitHub before any production decision.

Talk to Us About Ornith-1.0 — Three Inquiry Funnels

We support Ornith-1.0 and other local LLMs across evaluation, build, and ongoing maintenance.

(1) Evaluation & Requirements (from ¥198,000)

"Does Ornith-1.0 fit our workload?" "Should we start with 9B, 35B, or 397B?" "How does it pair with OpenClaw?" — 1–2 weeks, written report deliverable.

👉 Contact us — AI consulting (evaluation)

(2) On-Prem Build & PoC (from ¥498,000)

Stand up Ornith-1.0 with PoC build, fine-tuning, inference-engine setup, and quantization in 4–8 weeks. GPU sizing (RTX 5090 single-card vs B200 node) and ROI measurement included.

👉 Contact us — PoC / production SI

(3) Ongoing Maintenance (¥9,800–¥80,000 / month)

Ongoing support for model-update tracking, quant re-tuning, new-release evaluation, KPI monitoring, and OpenClaw backend swaps for Ornith-1.0.

- For OpenClaw-deployed sites: OpenClaw maintenance — Light ¥9,800 / Standard ¥19,800 / Premium ¥49,800 per month - AI-consulting continuous support: Light ¥30,000/mo (monthly meeting + new-model tracking) / Standard ¥80,000/mo (bi-weekly + prompt tuning + monthly KPI review + training & FAQ updates) / Premium on request

👉 Contact us — OpenClaw maintenance

FAQ

Q1. Is Ornith-1.0 OK for commercial use? A. Yes — MIT. Fully commercial, modifiable, redistributable, no regional restrictions, no extra contract. Re-check the upstream Qwen 3.5 / Gemma 4 licenses for any inherited obligations.

Q2. Difference vs Chinese OSS (Kimi K2.7-Code / GLM-5.2)? A. The data-sovereignty story is different. Ornith-1.0 is MIT, no regional restrictions, and there's no Chinese-routed-API concern. SWE-Bench Verified is in the same band as Kimi K2.7-Code (82.4% on 397B), with a clearer US (presumed) + MIT procurement story. See our Kimi K2.7-Code coverage.

Q3. Can a single RTX 5090 run it? A. The 9B runs comfortably (~19GB at BF16 — half of the 5090's 32GB). 35B runs on a single card at GGUF Q4 / Q5, though full 262K context wants multiple cards. The 397B cannot run on a single card — you need an 8×80GB GPU node or 4×B200.

Q4. Does SWE-Bench Verified 82.4% really beat Claude Opus 4.8? A. It's DeepReinforce's self-reported number — not yet on the public SWE-Bench leaderboard, no independent validation. Mandatory PoC measurement on your own code before production. That said, claiming 69.4% even on the 9B is noteworthy.

Q5. How does it integrate with OpenClaw? A. Officially compatible — ClawEval is in DeepReinforce's published benchmark suite (77.1% on 397B). Existing customers on our OpenClaw setup service can swap Ornith-1.0 in as the backend LLM. Our [OpenClaw maintenance plans](../services/openclaw-setup) cover this migration work.

Q6. Which size — 9B, 35B, or 397B? A. 9B for individual / SMB PoC, 35B for mid-market production, 397B for enterprise flagship. 9B runs on one GPU so PoC cost is minimal; 35B is the price-performance sweet spot; 397B is SOTA but wants 8×H100-class.

Q7. What does "RL-optimized scaffolding" actually mean in practice? A. Imagine the model has internalized agent-runtime experience in its weights. For the same prompt, Ornith-1.0 has already learned how to decompose problems and when to call which tools — so in an agent framework, it should reach the right answer in fewer iterations than a bare LLM. The numerical evidence is currently DeepReinforce's own published claim; third-party comparative data is still thin.

Q8. Relationship to CUDA-L1 / CUDA-L2 / IterX? A. All products of the same RL-research organization, at different layers. CUDA-L1/L2 is GPU-kernel optimization; IterX is inference optimization; Ornith-1.0 is LLM self-improvement itself. Reading Ornith as DeepReinforce's RL playbook applied to the LLM layer is the clean mental model.

Bottom Line

Ornith-1.0 is DeepReinforce's MIT open-weights LLM family specialized for agentic coding, released June 26, 2026. Three sizes (9B / 35B / 397B), all at 262K context, RL-optimized scaffolding, SWE-Bench Verified 82.4% on the 397B (vendor-reported).

Why it matters: 1. MIT + no regional restrictions — fully sidesteps Chinese-OSS cross-border-data scrutiny 2. Complete three-size lineup — individual workstation to enterprise flagship in one family 3. Published on the standard SWE-Bench suite — easier to compare than Kimi K2.7-Code's vendor-only numbers 4. ClawEval treated as first-class — strong natural fit for our OpenClaw service customers

Caveat: vendor-reported benchmarks; independent validation still to come. PoC measurement is non-negotiable before production.

At Oflight, we cover Ornith-1.0 (and other local LLMs) end-to-end — evaluation, PoC, OpenClaw integration, and ongoing maintenance. Use the three inquiry funnels above to get in touch.

References

Primary: - DeepReinforce official Ornith-1.0 page - Hugging Face collection Ornith-1.0 - Hugging Face: Ornith-1.0-9B - Hugging Face: Ornith-1.0-35B - Hugging Face: Ornith-1.0-397B - DeepReinforce GitHub - DeepReinforce X (@deep_reinforce) DeepReinforce prior work: - CUDA-L1 GitHub - CUDA-L2 GitHub - CUDA-L1 official blog - MarkTechPost — CUDA-L1 coverage Benchmarks: - SWE-Bench official Related Oflight columns: - Kimi K2.7-Code (same-window competitor) - Local LLM June 2026 Update - Loop Engineering — scaffolding origin - Sakana Fugu — different orchestration approach - Cognition FrontierCode benchmark - PLaMo 3.0 Prime - Claude Code Agent View - Cursor Automations - cmux (Manaflow) - Liquid AI Japanese-specialized models Oflight services: - AI Consulting - OpenClaw setup - Software Development Inquiries: - AI consulting (evaluation / PoC) - OpenClaw + Ornith integration / maintenance - Custom software development / SI Note: DeepReinforce's HQ location, executive team, funding status, hosted SaaS API availability, Japan-region endpoint, planned SWE-Bench leaderboard registration, third-party scores on Cognition FrontierCode / Aider polyglot / LiveCodeBench, official VS Code / JetBrains extensions, and general-knowledge benchmark scores (AIME / GPQA-Diamond / MMLU) were not confirmable as of June 26, 2026. Verify with the Hugging Face model cards and DeepReinforce GitHub before production decisions.

Feel free to contact us