株式会社オブライト
AI · 2026-05-17

Benchmark

Also known as: Benchmark / AI evaluation metric / benchmark evaluation

A standardized set of tasks or datasets used to measure and compare AI model capabilities. MMLU, HumanEval, SWE-bench, and HLE are prominent examples covering reasoning, coding, and frontier difficulty.


Overview

Benchmarks are standardized tests that enable objective, repeatable comparison across models. Widely used examples include MMLU (knowledge and reasoning), HumanEval (code generation), SWE-bench (real GitHub bug fixes), MATH, and GPQA/HLE (doctoral-level science). As of 2026, frontier models score in the 80-90% range on SWE-bench Verified.
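
Coding benchmarks such as HumanEval report pass@k, the probability that at least one of k sampled completions passes the unit tests. A minimal sketch in plain Python of the unbiased estimator from the original HumanEval paper (Chen et al., 2021):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021).

        n: total completions sampled per problem
        c: completions that passed the unit tests
        k: sample budget counted toward success
        """
        if n - c < k:
            # Fewer than k failing samples exist, so every
            # k-subset must contain at least one pass.
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 23 passed the tests.
    print(pass_at_k(200, 23, 1))   # 0.115
    print(pass_at_k(200, 23, 10))  # ~0.714

The benchmark's headline score is this estimate averaged over all problems in the suite.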

Limitations

Benchmark contamination, where a model's training data includes the test set, inflates scores. Benchmark performance also does not always predict real-world utility. Evaluation harnesses that run models against task-specific custom evaluations are therefore important for production decision-making; a minimal sketch follows.
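
As an illustration, a harness can be as small as a list of task-specific cases, each with its own pass/fail check, run against whatever model client you use. A sketch in Python; `stub_model` and the example cases are hypothetical stand-ins, not a real API:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalCase:
        prompt: str
        check: Callable[[str], bool]  # task-specific pass/fail judgment

    def run_harness(model: Callable[[str], str], cases: list[EvalCase]) -> float:
        """Send every case's prompt to the model and return the pass rate."""
        passed = sum(case.check(model(case.prompt)) for case in cases)
        return passed / len(cases)

    cases = [
        EvalCase("Summarize ticket #123 in one sentence.",
                 check=lambda out: len(out.split()) <= 30),
        EvalCase("Extract the invoice total from: 'Total: $42.00'",
                 check=lambda out: "42.00" in out),
    ]

    def stub_model(prompt: str) -> str:
        # Trivial echo stand-in; swap in your real model client here.
        return prompt

    print(run_harness(stub_model, cases))  # 1.0 with this echo stub

Unlike a public benchmark, cases like these never appear in training data, so the score tracks the behavior your application actually depends on.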

Related Columns

AI
Qwen3.5-9B vs GPT-4o-mini vs Claude Haiku: 2026 SLM Comparison Guide
A comprehensive 2026 comparison of three leading SLMs: Qwen3.5-9B, GPT-4o-mini, and Claude 3.5 Haiku. Evaluates benchmarks (MMLU, HumanEval, math, vision), latency and throughput, cost analysis (API pricing vs local inference), Japanese language quality, multimodal capabilities, context windows, privacy, offline capability, and fine-tuning flexibility. Includes best-use-case recommendations for each model.
AI
Codex vs Claude Code vs Cursor vs Copilot — 2026 AI Coding Tool Comparison [Visual Guide]
In-depth comparison of OpenAI Codex, Claude Code, Cursor, and GitHub Copilot across pricing, features, SWE-bench scores, and use cases. Build your ideal AI coding stack with selection flowcharts and combination strategies.
AI
Claude Opus 4.7 Complete Guide — SWE-bench 87.6%, Vision 98.5% & New xhigh Effort Mode [April 16, 2026 Release]
Released April 16, 2026, Claude Opus 4.7 achieves SWE-bench Verified 87.6%, Vision accuracy 98.5%, and introduces the new xhigh Effort Control — all at the same price as Opus 4.6. This guide covers every major upgrade to Anthropic's latest flagship model.
AI
Local LLM Landscape April 2026 — Top 10 Open-Source Models Comprehensive Comparison [Ollama Guide]
Comprehensive comparison of the top 10 local LLMs as of April 2026. Covers SWE-bench scores, Japanese language performance, VRAM requirements, Ollama commands, and licensing for Gemma 4, Llama 4, Qwen 3.5, GLM-5.1, Kimi K2.5, MiniMax M2.5, and more.
