株式会社オブライト

Articles tagged "SWE-bench"

4 articles

Cognition AI's FrontierCode Explained: The Next-Gen Coding AI Benchmark That Asks 'Is It Mergeable?'

On June 8, 2026, Cognition AI unveiled **FrontierCode** — not a product, but a coding AI evaluation benchmark. It measures not just 'does it pass tests' but 'would an OSS maintainer actually merge this?' across six axes. This article covers its differences from SWE-bench Verified, the three-tier dataset (Diamond/Main/Extended), official results with Claude Opus 4.8 leading at 13.4% on Diamond, and its relevance to Japan's rigorous code-review culture.

Cognition AI, FrontierCode, SWE-bench

Claude Opus 4.7 Complete Guide — SWE-bench 87.6%, Vision 98.5% & New xhigh Effort Mode [April 16, 2026 Release]

Released April 16, 2026, Claude Opus 4.7 achieves SWE-bench Verified 87.6%, Vision accuracy 98.5%, and introduces the new xhigh Effort Control — all at the same price as Opus 4.6. This guide covers every major upgrade to Anthropic's latest flagship model.

Claude Opus 4.7, Anthropic, SWE-bench

GLM-5.1 Complete Guide — #1 SWE-bench Pro Open-Source LLM [April 2026]

GLM-5.1 by Z.ai (released April 7, 2026) is the first open-source LLM to top SWE-bench Pro at 58.4%, surpassing GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). This guide covers its 744B/40B-active MoE architecture, MIT license, 8-hour autonomous task capability, and setup via Ollama.

GLM-5.1, Z.ai, SWE-bench

MiniMax M2.5 Complete Guide — Lightning Attention Achieves 80.2% SWE-bench [2026]

MiniMax M2.5 achieves 80.2% on SWE-bench Verified using proprietary Lightning Attention in a 230B MoE model. This guide covers its architecture, benchmarks, license terms, and setup instructions.

MiniMax M2.5, SWE-bench, Lightning Attention