株式会社オブライト
AI | 2026-05-08

OpenAI GPT-Realtime-2 and the Three New Voice Models — A Practitioner's 2026 Look at Reasoning Voice Agents, Live Translation, and Streaming Whisper

On May 7, 2026, OpenAI released a trio of new voice models: GPT-Realtime-2 (the first voice model with GPT-5-class reasoning), GPT-Realtime-Translate (live translation across 70+ input / 13 output languages), and GPT-Realtime-Whisper (streaming speech-to-text). This article summarizes their capabilities, benchmark deltas versus 1.5, pricing, when to pick which model, and whether the upgrade from 1.5 is worth it, based on official information.


Overview — three new voice models on May 7, 2026

On May 7, 2026, OpenAI released three new voice models on the Realtime API. This article summarizes the trio:

- GPT-Realtime-2 — billed as the first voice model with GPT-5-class reasoning
- GPT-Realtime-Translate — live translation across 70+ input languages and 13 output languages
- GPT-Realtime-Whisper — streaming speech-to-text that transcribes as the speaker talks

For the previous-generation 1.5, see our existing post on OpenAI gpt-realtime-1.5 and realtime-voice-component. This article focuses on two questions: is the upgrade from 1.5 worth it, and when should you use which of the three new models?

GPT-Realtime-2 — reported gains over 1.5

From OpenAI's announcement and the major coverage:

Metric | GPT-Realtime-1.5 | GPT-Realtime-2 | Delta
Big Bench Audio (high reasoning) | 81.4% | 96.6% | +15.2pt
Audio MultiChallenge instruction-following (xhigh) | 34.7% | 48.5% | +13.8pt
Hardest adversarial benchmark — call success after prompt tuning | 69% | 95% | +26pt
Context window | 32K | 128K | 4x

Key takeaways:

- The "GPT-5-class reasoning, in voice" framing is backed by sizeable benchmark gains.
- Instruction following improves roughly 14 points — meaningful for workflows where the model must do as told.
- The 128K context window handles long conversations, large prompts, and tool-call histories without falling apart.
- Reasoning level (high / xhigh) is selectable, so quality and cost can be tuned per use case; a configuration sketch follows below.

Numbers are as of announcement time — verify the current state on OpenAI's official release page and model docs.
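Since this article doesn't document the new models' API surface, here is a minimal configuration sketch. It assumes the new models keep the current Realtime API's WebSocket and session.update convention and expose a session-level reasoning field; the model id and field name are assumptions, not confirmed API surface.

```typescript
// Sketch: selecting the reasoning level at session setup.
// ASSUMPTIONS: the model id "gpt-realtime-2" and a session-level
// "reasoning" field taking "high" | "xhigh" are illustrative; the
// session.update envelope mirrors the existing Realtime API pattern.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      reasoning: "high", // assumption: "xhigh" for the hardest flows
      instructions: "You are a support agent. Follow the runbook exactly.",
    },
  }));
});
```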

How each of the three models is positioned

GPT-Realtime-2 — flagship reasoning voice agent

- Designed to reason, call tools, handle interruptions, and keep the conversation moving without dropping it.
- Speech-to-speech end to end.
- Strong fit for customer support, education, personal assistants, and virtual receptionists.

GPT-Realtime-Translate — live translation specialist

- Translates while you speak: simultaneous interpretation delivered through an API.
- 70+ input languages → 13 output languages at announcement.
- Fits cross-border e-commerce, global support, international events, education, and media.

GPT-Realtime-Whisper — streaming STT

- Text streams as the person speaks; built for low-latency captions, meeting notes, and command input (a caption-consumer sketch follows below).
- Differs from batch-style Whisper by prioritizing latency over post-hoc fidelity.
- Fits live captions for streams, call-center agent assist, and medical / legal dictation pipelines.
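For the caption use case, the consuming side stays small. A minimal sketch, assuming a "gpt-realtime-whisper" model id and an incremental "transcript.delta" event; both names are illustrative, not confirmed API surface.

```typescript
// Sketch: consuming streaming transcripts for live captions.
// ASSUMPTIONS: the model id and event name are illustrative only.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "transcript.delta") {
    process.stdout.write(event.delta); // append to the live caption line
  }
});
```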

Pricing (May 2026, as published)

Pricing on the OpenAI material:

Model | Unit | Price
GPT-Realtime-2 — audio input | 1M tokens | $32
GPT-Realtime-2 — cached audio input | 1M tokens | $0.40
GPT-Realtime-2 — audio output | 1M tokens | $64
GPT-Realtime-Translate | per minute | $0.034
GPT-Realtime-Whisper | per minute | $0.017

Observations:

- Token rates sit in the same band as 1.5; the gain is effectively a price drop per unit of capability.
- Per-minute pricing on Translate / Whisper makes budgeting straightforward.
- Cached input ($0.40 per 1M tokens) is roughly 1/80 of regular input — meaningful for long-running agent sessions with large reused prompts.
- Pricing changes; check the official OpenAI pricing page before adopting. A rough cost estimator follows below.
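To make budgeting concrete, the estimator below plugs in the published unit prices from the table above. The tokens-per-minute figures in the usage example are sizing assumptions, not measured values.

```typescript
// Rough cost estimator from the published unit prices above.
const PRICES = {
  audioInPer1M: 32.0,     // GPT-Realtime-2 audio input, USD per 1M tokens
  cachedInPer1M: 0.4,     // cached audio input
  audioOutPer1M: 64.0,    // audio output
  translatePerMin: 0.034, // GPT-Realtime-Translate
  whisperPerMin: 0.017,   // GPT-Realtime-Whisper
};

function realtime2SessionCost(
  inTokens: number,
  cachedTokens: number,
  outTokens: number,
): number {
  return (
    (inTokens / 1e6) * PRICES.audioInPer1M +
    (cachedTokens / 1e6) * PRICES.cachedInPer1M +
    (outTokens / 1e6) * PRICES.audioOutPer1M
  );
}

// Example: a 10-minute support call, assuming ~800 audio tokens per
// minute in each direction and a 2,000-token cached system prompt.
console.log(realtime2SessionCost(8_000, 2_000, 8_000).toFixed(4)); // ~$0.77
console.log((10 * PRICES.translatePerMin).toFixed(2)); // 10 min of Translate: $0.34
```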

When to pick which

1. Hard conversation / reasoning → GPT-Realtime-2. Customer-support escalation, complex procedural guidance, sales role-play, education tutors. First choice when conversation quality directly drives satisfaction.
2. Cross-border communication → GPT-Realtime-Translate. Overseas e-commerce chat, global trade shows, foreign customer reception, international online classes. The $0.034/min unit cost is far below typical interpreter costs.
3. Captions, meeting notes, voice commands → GPT-Realtime-Whisper. Live-stream captions, real-time meeting transcription, hands-free input on the floor, medical / legal dictation. $0.017/min keeps running cost very low.
4. Keep 1.5 where it makes sense. Low-impact existing integrations and finely tuned prompts. 1.5 stays available, so a hybrid (move important workloads to 2, leave the rest on 1.5) is realistic.

A routing sketch follows this list.
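The rule of thumb compresses to a small routing function. The workload taxonomy and tags are illustrative, not a prescribed scheme.

```typescript
// Sketch: routing workloads to models per the guidance above.
type Workload =
  | "complex-conversation"
  | "translation"
  | "transcription"
  | "simple-faq";

function pickModel(workload: Workload): string {
  switch (workload) {
    case "complex-conversation":
      return "gpt-realtime-2"; // reasoning-heavy dialogue
    case "translation":
      return "gpt-realtime-translate"; // cross-border channels
    case "transcription":
      return "gpt-realtime-whisper"; // captions, notes, commands
    case "simple-faq":
      return "gpt-realtime-1.5"; // hybrid: keep tuned 1.5 flows
  }
}
```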

Should we upgrade from 1.5?

Move now if:

- Instruction-following or branching logic is wobbling on 1.5.
- Long sessions hit context limits frequently — 128K is a real upgrade.
- You want to lift call-success rates in customer support — the +26pt at the hard end of the benchmark is large.
- You can switch within the existing budget envelope (token pricing is comparable).

Take your time if:

- The workload is simple FAQ-class.
- Prompts are deeply tuned for 1.5 (re-tuning cost is non-trivial).
- Latency is already acceptable and the new features aren't on your critical path.

Watch on switch:

- Plan for prompt re-optimization (the high / xhigh selection changes behavior).
- Output style may shift; A/B-log against KPIs (resolution rate, average handle time, escalation rate). A minimal record schema is sketched below.
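For the A/B step, a minimal record shape is enough to compare the KPIs named above. Field names and storage are up to you; this is a sketch, not a schema recommendation.

```typescript
// Sketch: per-call A/B record for the 1.5 vs 2 comparison.
interface CallRecord {
  model: "gpt-realtime-1.5" | "gpt-realtime-2";
  resolved: boolean;     // feeds resolution rate
  handleSeconds: number; // feeds average handle time
  escalated: boolean;    // feeds escalation rate
}

function summarize(records: CallRecord[], model: CallRecord["model"]) {
  const rs = records.filter((r) => r.model === model);
  const n = Math.max(rs.length, 1); // avoid divide-by-zero on empty buckets
  return {
    calls: rs.length,
    resolutionRate: rs.filter((r) => r.resolved).length / n,
    avgHandleSeconds: rs.reduce((s, r) => s + r.handleSeconds, 0) / n,
    escalationRate: rs.filter((r) => r.escalated).length / n,
  };
}
```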

Realistic business scenarios

Concrete patterns we put in front of clients:

- Field operations (construction, logistics), hands-free: Whisper for dictation and structuring, Realtime-2 for interactive support.
- Cross-border e-commerce support: Translate covers 70+ languages first, escalating to Realtime-2 only when needed.
- Real-time call-center assist: Whisper captions the call, Realtime-2 proposes the next response to the operator.
- Medical / legal back office: dictation → automatic structuring → integration into existing systems.
- Education and training: Realtime-2 as a tutor; Translate to caption multilingual lectures live.
- Better meeting experiences: Whisper for live notes, Realtime-2 for facilitation assist.

Within our AI BPO and AI Consulting services, these patterns are embedded under the "humans in front, AI in the back" frame. For the prior-generation realtime-voice-component UI, see the existing post on 1.5.

Trade-offs and watch-outs

- Cloud-only: not for projects that disallow external transmission of confidential data. Pair with on-prem options (DGX Spark + a local LLM) when needed.
- Token / minute consumption is hard to predict: idle silences and looping responses can spike costs. Set timeouts and a max session length; a guard sketch follows below.
- high vs xhigh: defaulting to xhigh raises latency and cost. A/B test by workload to find the sweet spot.
- Translate covers 13 output languages: Japanese is included, but uncovered targets need fallback solutions.
- "Verbatim" guarantees: Whisper streams aren't designed to serve as legally verbatim records. Re-run a batch transcription for sources of truth.
- Re-optimization cost: prompts deeply tuned for 1.5 may shift behavior on 2 — budget validation time.
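For the consumption point specifically, a minimal session guard looks like the sketch below; the limits are illustrative and should be tuned per workload.

```typescript
// Sketch: hard session cap plus idle timeout to contain runaway costs.
const MAX_SESSION_MS = 15 * 60 * 1000; // hard cap on total session length
const IDLE_TIMEOUT_MS = 45 * 1000;     // close after prolonged silence

function guardSession(close: () => void) {
  const hardCap = setTimeout(close, MAX_SESSION_MS);
  let idle = setTimeout(close, IDLE_TIMEOUT_MS);
  return {
    // Call on every inbound audio event to reset the idle timer.
    onAudioActivity() {
      clearTimeout(idle);
      idle = setTimeout(close, IDLE_TIMEOUT_MS);
    },
    dispose() {
      clearTimeout(hardCap);
      clearTimeout(idle);
    },
  };
}
```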

How Oflight uses these

We design voice-AI integrations in three layers:

1. Model selection: pick gpt-realtime-2 / Translate / Whisper per workload; route confidential workloads to local stacks.
2. Orchestration: drive business actions from voice events with our OpenClaw agent platform.
3. System integration: connect to your CRM, ticketing, internal DBs, and telephony.

The abstraction lives at the model-selection layer, so when a new model lands (like this one), customer integrations swap the model underneath without surface-level changes; a sketch follows below. See AI Consulting or AI BPO for engagements.
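A sketch of what that model-selection seam can look like; the interface and registry names are illustrative, not the actual OpenClaw API.

```typescript
// Sketch: integrations depend on this interface, so a new model
// lands as one more implementation behind a config change.
interface VoiceBackend {
  id: string; // e.g. "gpt-realtime-2"
  start(sessionConfig: Record<string, unknown>): Promise<void>;
  sendAudio(chunk: Uint8Array): void;
  close(): void;
}

// Registry keyed by workload; swapping models is a config change,
// not an integration rewrite.
const registry = new Map<string, () => VoiceBackend>();

function backendFor(workload: string): VoiceBackend {
  const make = registry.get(workload);
  if (!make) throw new Error(`no backend registered for ${workload}`);
  return make();
}
```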

FAQ

Q1: Can we keep using 1.5?
A: Yes, for now. But for complex, instruction-heavy workloads, the gap shows up directly in business KPIs — worth evaluating an upgrade.

Q2: Does Translate replace human interpreters?
A: Not where verbatim accuracy is binding (legal, medical, contractual negotiations). For everyday cross-border operations and tier-1 inquiry intake, it's already a usable layer.

Q3: Streaming Whisper vs batch Whisper?
A: Batch is for high-fidelity post-event transcripts; streaming is for text that flows as you speak. Use both in series when you need an official record.

Q4: Is it OK to send confidential calls to the cloud?
A: Avoid it on workloads with hard external-transmission rules. For "minimize what leaves" patterns, combine with local LLMs (e.g., DGX Spark) and send only sanitized metadata.

Q5: Is $0.034/min cheap?
A: Compared to interpreter rates, yes — by a wide margin. But long sessions across many channels compound; build daily aggregate forecasts.

Q6: high vs xhigh — how to pick?
A: Default to high for general conversation. Try xhigh on complex branching, multi-tool flows, or adversarial inputs (financial product guidance, medical / legal triage).

Q7: How does this relate to realtime-voice-component?
A: The browser-side React patterns from the previous-generation post still apply — change only the backend model identifier.

Feel free to contact us
