OpenAI killed the Realtime API beta on May 12 — three days after shipping its replacement. If you have voice agents running on the old `OpenAI-Beta: realtime=v1` interface, they are already broken. The new models — gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper — launched May 9 and represent a genuine generational jump for voice AI. Here is what changed, what to build with it, and what you need to fix.
Three Models, Three Jobs
OpenAI split the voice API into purpose-built models rather than forcing one model to do everything poorly.
gpt-realtime-2 is the flagship: speech-to-speech with GPT-5-class reasoning, a 128K context window (up from 32K), and reliable tool calling during live conversations. Cost runs roughly $0.077 per minute of input audio.
gpt-realtime-translate handles live speech-to-speech translation across 70+ input languages into 13 output languages. It is trained on professional interpreter audio, auto-detects source language, and adapts the translated voice to match the original speaker’s tone and pitch. Cost: $0.034 per minute. This model does not reason or call APIs — it translates, full stop.
gpt-realtime-whisper is streaming transcription: speech in, text out, live. At $0.017 per minute it is the cheapest path to real-time captions or any workflow where you only need the transcript.
| Model | Job | Cost/min (input audio) | Reasoning |
|---|---|---|---|
| gpt-realtime-2 | Speech-to-speech | ~$0.077 | Yes (5 levels) |
| gpt-realtime-translate | Live translation | $0.034 | No |
| gpt-realtime-whisper | Transcription | $0.017 | No |
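
The prices make the routing decision mostly mechanical: pick the cheapest model that does the job, and only pay for reasoning when the call needs it. Here is a quick TypeScript sketch that turns the per-minute figures above into per-call estimates — the prices are the numbers quoted above; the helper itself is illustrative, not an SDK call:

```ts
// Per-minute input-audio prices quoted above, in USD.
const PRICE_PER_MINUTE: Record<string, number> = {
  "gpt-realtime-2": 0.077,
  "gpt-realtime-translate": 0.034,
  "gpt-realtime-whisper": 0.017,
};

// Rough audio cost of a call of a given length, per model.
function estimateCallCost(minutes: number): void {
  for (const [model, price] of Object.entries(PRICE_PER_MINUTE)) {
    console.log(`${model}: ~$${(price * minutes).toFixed(2)} for a ${minutes}-minute call`);
  }
}

estimateCallCost(12); // e.g. a 12-minute support call: ~$0.92, ~$0.41, ~$0.20
```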
Why Reasoning in Voice Changes the Category
The headline feature on gpt-realtime-2 is configurable reasoning effort: five levels — minimal, low, medium, high, and xhigh — that let you tune the latency-versus-intelligence tradeoff per session. This sounds like a settings menu. It is actually a category shift.
Previous realtime voice models skipped reasoning to stay fast. That made them good at simple back-and-forth and bad at anything requiring multi-step logic or parallel tool calls. A customer support agent that needed to look up an account, verify identity, check eligibility, and process a change reliably in a single voice session was not viable. Now it is.
At the high setting, gpt-realtime-2 scores 96.6% on Big Bench Audio. At xhigh, it hits 48.5% average pass rate on Audio MultiChallenge — which tests exactly the complex, multi-turn instruction-following where previous models fell apart. The tradeoff is real latency; the default is low for a reason. Start there and move up only if your use case requires it.
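
The effort level is per-session configuration, so you can dial it per workload rather than per product. A minimal sketch of what that looks like over the Realtime WebSocket, assuming the setting lives in a `reasoning.effort` field on the session object — that field name, the `"realtime"` session type value, and the beta-era WebSocket URL are assumptions here; check the model page for the exact GA schema:

```ts
import WebSocket from "ws";

// Open a realtime session against the GA endpoint (model name from the launch).
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  // Start at the default "low" effort and raise it only if the task needs multi-step logic.
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        type: "realtime",             // GA requires session.type; "realtime" is the assumed value
        reasoning: { effort: "low" }, // assumed field name for the effort setting
      },
    })
  );
});
```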
The 128K context window is the underreported part of this launch. At roughly 40 tokens per second of audio, 128K fits approximately 53 minutes of conversation per side. A real enterprise support call that escalates through departments and references a months-long support history now fits inside a single session context. That was not possible before.
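
The 53-minute figure follows directly from the token math above; a quick back-of-envelope check:

```ts
// Back-of-envelope check of the context math above.
const contextTokens = 128_000;      // gpt-realtime-2 context window
const tokensPerSecondAudio = 40;    // rough audio tokenization rate
const minutesPerSide = contextTokens / tokensPerSecondAudio / 60;
console.log(minutesPerSide.toFixed(1)); // ~53.3 minutes of audio per side
```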
Real deployments confirm the improvement. According to OpenAI’s launch post, Zillow’s call success rate went from 69% to 95% after moving to gpt-realtime-2 for home valuation and financing calls. Glean reported a 42.9% relative increase in helpfulness in internal evals. Genspark’s Call for Me Agent saw a 26% improvement in effective conversation rate.
The Migration: Five Changes Required
The beta API is gone and the GA interface is not wire-compatible. If your agents stopped responding on May 12, here is what needs to change:
- Remove the beta header. Delete `OpenAI-Beta: realtime=v1` from all requests.
- Update ephemeral key generation. Move to `POST /v1/realtime/client_secrets` for browser and mobile clients.
- Switch the WebRTC endpoint. Use `/v1/realtime/calls` for SDP setup.
- Add `session.type`. Session initialization now requires this field.
- Update event names. `response.text.delta` becomes `response.output_text.delta`, `response.audio.delta` becomes `response.output_audio.delta`, and `conversation.item.created` splits into `conversation.item.added` and `conversation.item.done`. Any handler listening for the old names is silently broken.
Also move output audio configuration under `session.audio.output` rather than the top-level `session` object. The Voice Agents guide and the Azure AI Foundry migration notes cover the complete schema changes if you need the full picture.
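
The event rename is the change most likely to break silently, because a handler keyed to the old names simply stops firing with no error. A minimal sketch of a message handler updated to the GA names — the event names are the ones listed above; the surrounding handler structure and helper functions are illustrative:

```ts
import WebSocket from "ws";

// Stand-ins for your real audio/UI pipeline; the names are illustrative only.
const playAudioChunk = (base64Audio: string): void => { /* feed your audio output */ };
const trackItem = (item: unknown): void => { /* update conversation state */ };

// Dispatch on the GA event names; the old beta names never arrive anymore.
function attachHandlers(ws: WebSocket): void {
  ws.on("message", (raw) => {
    const event = JSON.parse(raw.toString());
    switch (event.type) {
      case "response.output_text.delta":   // was response.text.delta
        process.stdout.write(event.delta);
        break;
      case "response.output_audio.delta":  // was response.audio.delta
        playAudioChunk(event.delta);
        break;
      case "conversation.item.added":      // conversation.item.created split into added + done
      case "conversation.item.done":
        trackItem(event.item);
        break;
    }
  });
}
```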
The Bottom Line
Voice AI has been stuck in demo mode: impressive in a five-minute presentation, unreliable under production load. The limiting factor was never audio quality — it was the inability to reason, hold long context, and call tools reliably during a live conversation. gpt-realtime-2 removes those limitations at a price point that makes production deployments viable.
It is not complete: fine-tuning is not available yet, xhigh mode has real latency costs, and ChatGPT Voice Mode itself has not been upgraded to this model. But for developers building voice agents on the API, this is the version worth building on. If your agents are already broken from the May 12 cutoff, the migration is mechanical and documented. If you are starting fresh, start with the gpt-realtime-2 model page.