Oflight Inc.
AI · 2026-04-28

OpenAI gpt-realtime-1.5 and the Official realtime-voice-component — A Practitioner's Look at the New Voice-Agent Stack [2026]

OpenAI released the gpt-realtime-1.5 audio model on February 26, 2026, and openai/realtime-voice-component on GitHub provides an official React reference for voice UIs. This article summarizes the documented gains (+5% audio reasoning, +10.23% transcription, +7% instruction following), pricing, the component's positioning as a reference implementation, and practical considerations for business adoption.


The 2026 voice-agent stack at a glance

OpenAI released the gpt-realtime-1.5 voice model on February 26, 2026. Alongside it, openai/realtime-voice-component on GitHub serves as the official React reference for browser voice UIs. This article covers both — the model side (gpt-realtime-1.5) and the UI side (realtime-voice-component) — and lays out the practitioner's question: "What do I take off the shelf, and what do I write myself?"

What gpt-realtime-1.5 is

gpt-realtime-1.5 is the audio-focused model on OpenAI's Realtime API, designed for native speech-to-speech (audio in → audio out). The Realtime API itself is a low-latency channel that handles audio, image, and text inputs plus audio and text outputs; the 1.5 release is an iteration over the prior gpt-realtime. OpenAI positions it as their flagship audio model for production voice agents and customer support.

Reported gains

Per OpenAI, the published improvements over the prior version are:

| Metric | Improvement | What it means |
| --- | --- | --- |
| Big Bench Audio (audio reasoning) | +5% | Reasoning quality on what was said |
| Transcription | +10.23% | Audio → text fidelity |
| Instruction following | +7% | Steerability via system prompts |

Tool calling has also been strengthened, which matters for agents that act on the user's behalf — booking, lookups, ticket creation, and so on.

Pricing (as published at release)

OpenAI's published prices at release (unchanged from the prior version):

- Text input: $4 / 1M tokens (cached $0.40)
- Text output: $16 / 1M tokens
- Audio input: $32 / 1M tokens (cached $0.40)
- Audio output: $64 / 1M tokens

Note the unit: audio is metered in audio tokens, whose count grows with clip duration. Pricing can change. Verify the current numbers on the OpenAI pricing page before committing to production.

What realtime-voice-component is

openai/realtime-voice-component is OpenAI's official React reference for browser voice UIs that talk to the Realtime API (Apache-2.0). The package centers on:

- `createVoiceControlController()`: a reusable controller that owns the session
- `useVoiceControl()`: a hook that binds the controller to React
- `VoiceControlWidget`: a launcher UI for triggering voice
- `GhostCursorOverlay`: an overlay for visible command progress
- A pattern that puts tools on the app side

It's specifically aimed at "tool-constrained UIs" — pressing this button, filling this form — operated by voice.
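To make the division of labor concrete, here is a minimal sketch of how those pieces might compose. The exported names come from the repo; every option and prop shape below (`sessionEndpoint`, `tools`, the widget's props) is an assumption for illustration, so check the repository's README for the actual signatures.

```tsx
// Sketch only: option/prop shapes are assumptions, not the repo's documented API.
import {
  createVoiceControlController,
  useVoiceControl,
  VoiceControlWidget,
} from "realtime-voice-component"; // not on npm; vendored from the GitHub repo

// The controller owns the Realtime session. Create it once, outside React,
// so dev-mode remounts never tear it down (see "Things to watch" below).
const controller = createVoiceControlController({
  sessionEndpoint: "/session", // your backend relay; the API key stays server-side
  tools: [
    {
      name: "set_form_field",
      description: "Fill a form field identified by its aria-label",
      // App-side tool: the UI executes this when the model requests it.
      execute: async ({ label, value }: { label: string; value: string }) => {
        const el = document.querySelector<HTMLInputElement>(`[aria-label="${label}"]`);
        if (el) el.value = value;
        return { ok: Boolean(el) };
      },
    },
  ],
});

export function App() {
  useVoiceControl(controller); // bind the externally-owned controller to React
  return <VoiceControlWidget controller={controller} />;
}
```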

Important: how to position realtime-voice-component

The repository explicitly frames itself as an open-source reference implementation. It's useful for education, demos, and informing local adoption — but it is not a long-term-supported, production-ready UI kit. The `package.json` is currently `private`, so the package is not on npm. For production, the realistic choices are:

- Build your own UI using the reference as a guide: read the repo, then implement React components owned by your codebase. Most flexible.
- Talk to the Realtime API directly: when you need lower-level transport / session control, custom audio handling, non-React runtimes, or to design state from scratch.
- Use openai-agents-js: a broader headless SDK for agent orchestration, handoffs, and richer hosted-tool / MCP flows.
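For the third option, the realtime classes in openai-agents-js look roughly like the sketch below. The class names follow the SDK's realtime documentation; treat the option details and the model string as assumptions and verify against the current README.

```ts
// Sketch of the openai-agents-js route. Verify names/options against the SDK docs.
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

const agent = new RealtimeAgent({
  name: "Support triage",
  instructions: "Answer tier-1 questions; escalate anything involving billing.",
});

const session = new RealtimeSession(agent, {
  model: "gpt-realtime-1.5", // model name as given in this article
});

// In the browser, pass an ephemeral client secret minted by your backend,
// so the long-lived API key never ships to the client.
await session.connect({ apiKey: "<ephemeral-client-secret>" });
```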

Layered view — what sits where

(Diagram omitted in this text version: browser voice UI → your app backend's `/session` relay → OpenAI Realtime API running gpt-realtime-1.5, with tools executed on the app side.)

Where voice agents pay off

Areas where voice agents tend to add real value:

- Customer support: tier-1 intake, FAQ answers, smart escalation to humans
- Internal helpdesk: "how do I use this system?" answered conversationally with RAG over internal docs
- Field work (construction, logistics, medical): hands-busy environments where voice → structured data is a real win
- Accessibility: an alternative for users who can't operate keyboards comfortably
- Phone / call workflows: tier-1 intake combined with human approval at the end

At Oflight, AI BPO increasingly fields requests for voice-driven workflows in a hybrid local + cloud design — "humans at the front, AI in the back," with voice as one input channel.

Things to watch in implementation

Voice-specific pitfalls:

- You need a backend `/session` endpoint: realtime-voice-component requires a server-side endpoint that relays the browser's SDP and session config to OpenAI's Realtime API, so your API key never reaches the browser (see the sketch after this list).
- Default VAD and interrupt settings: the controller uses Realtime's `server_vad` by default with `interrupt_response: false`. That default matters when the assistant audio is not played by the UI — tune it to fit your case.
- Controller lifecycle: never destroy externally-owned controllers from a leaf-component cleanup. In React dev-mode remounts, that leaves a "dead controller" behind and the connection silently fails.
- Mic permissions and browser quirks: Safari / Chrome / Firefox each behave a little differently. HTTPS is mandatory.
- Latency: network + model + TTS delays stack up. Far-away regions feel sluggish.
- Background noise: in real workplaces, take VAD (voice activity detection) and echo cancellation seriously.
- PII handling: voice / call data is often PII. Decide retention policy, encryption, and consent up front.
- Tool-call safety: don't let "delete" or "send" be invoked by voice without a confirmation step.
- Cost: audio is billed in audio tokens that accrue with duration. Idle silence and looping responses can spike the bill — set timeouts and caps.
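For orientation, a minimal `/session` relay might look like the sketch below (Express, Node 18+). It assumes the beta-style WebRTC SDP exchange (`POST https://api.openai.com/v1/realtime?model=...` with `Content-Type: application/sdp`); the exact endpoint and how session config is attached have changed across API revisions, so verify against the current Realtime API docs.

```ts
// Minimal SDP relay sketch: endpoint details are assumptions; check current docs.
import express from "express";

const app = express();
app.use(express.text({ type: "application/sdp" })); // raw SDP offer from the browser

app.post("/session", async (req, res) => {
  // The API key stays server-side; the browser only ever exchanges SDP with us.
  const upstream = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-realtime-1.5",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/sdp",
      },
      body: req.body, // forward the browser's SDP offer unchanged
    }
  );
  if (!upstream.ok) {
    res.status(502).send(await upstream.text());
    return;
  }
  res.type("application/sdp").send(await upstream.text()); // OpenAI's SDP answer
});

app.listen(3000);
```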

How Oflight uses this stack

When a project needs a voice agent, our default approach is:

1. Model: start with gpt-realtime-1.5 in the cloud. If on-prem-only is required, switch to a different audio model with local serving.
2. UI: study openai/realtime-voice-component's patterns, but ship our own React components in production (since the package is not on npm).
3. Orchestration: openai-agents-js or the Realtime API directly, depending on agent complexity.
4. Middleware: combine with OpenClaw so the connections to your CRM / ticketing / internal DB are abstracted.

We handle scoping through delivery in AI Consulting.

FAQ

Q1: Is it worth migrating from the prior gpt-realtime?
A: Transcription is ~10% better and instruction following improved, so it's worth evaluating. Pricing is unchanged, which lowers the cost-side risk.

Q2: I want realtime-voice-component on npm.
A: It's not currently published to npm. The realistic path is to study the repo and ship your own React components, or vendor it via Git submodule / copy.

Q3: What about Japanese transcription accuracy?
A: OpenAI's published gains are English-centric. Verify Japanese accuracy on your own data and environment before committing.

Q4: Can it run fully on-premise?
A: gpt-realtime-1.5 is cloud-only. For fully on-premise, combine open-source STT + TTS components in a hybrid design.

Q5: Rough cost per minute?
A: A back-of-envelope per-minute exchange (30 s audio in + 30 s audio out + ~3,000 tokens of system prompt) lands in the single-digit-yen-to-low-tens-of-yen range. Real costs vary widely by usage pattern — measure in a PoC.
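To show where Q5's figure comes from, here is the arithmetic against the published per-token prices. The audio-tokens-per-second conversion is an assumption (~10 tokens/s); pull real token counts from your API usage reports before trusting any estimate.

```ts
// Back-of-envelope check of Q5. AUDIO_TOKENS_PER_SEC is an assumed conversion,
// not an official figure; real counts come from API usage data.
const USD_PER_M = { textIn: 4, audioIn: 32, audioOut: 64 };
const AUDIO_TOKENS_PER_SEC = 10;

const audioIn = 30 * AUDIO_TOKENS_PER_SEC;  // 30 s of user speech
const audioOut = 30 * AUDIO_TOKENS_PER_SEC; // 30 s of assistant speech
const textIn = 3000;                        // system prompt tokens

const usd =
  (audioIn * USD_PER_M.audioIn +
   audioOut * USD_PER_M.audioOut +
   textIn * USD_PER_M.textIn) / 1_000_000;

console.log(`$${usd.toFixed(4)} per exchange`); // ≈ $0.04 → single-digit yen
```

Under these assumptions one exchange costs a few yen; longer prompts, retries, and idle VAD-triggered turns push real usage toward the higher end of Q5's range.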
