
OpenAI Realtime API Goes GA: Three New Voice Models

Image: OpenAI Realtime API goes GA with GPT-Realtime-2, Translate, and Whisper

OpenAI’s Realtime API exited beta on May 7. It now ships in general availability with three new models: GPT-Realtime-2, which brings GPT-5-class reasoning to voice; GPT-Realtime-Translate, which handles live speech across 70+ languages; and GPT-Realtime-Whisper, which streams transcription in real time. The old beta endpoint was deprecated May 12 and is gone. If you were building on it, you either migrated or your app broke.

GPT-Realtime-2: What Changed Under the Hood

The headline model is GPT-Realtime-2. The most significant technical change is the context window: up from 32k to 128k tokens, which translates to roughly one to two hours of dense back-and-forth audio. For customer service or healthcare consultation use cases where sessions run long, this matters.

Reasoning is now configurable. You can set reasoning effort to minimal, low, medium, high, or very high — with low as the default. The trade-off is latency versus quality, and OpenAI is letting developers make that call per use case. On the benchmark side, GPT-Realtime-2 hits 96.6% on Big Bench Audio, a 15.2 percentage point jump over its predecessor.
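
Here is what that per-session knob might look like over the WebSocket connection. This is a minimal sketch: the reasoning_effort field name is an assumption for illustration, so check the GA session.update schema before copying it.

```typescript
// Sketch: setting per-session reasoning effort over the Realtime WebSocket.
// The "reasoning_effort" field name is an assumption, not confirmed API.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      // Trade latency for quality per use case:
      // minimal | low | medium | high | very high (low is the default).
      reasoning_effort: "high",
    },
  }));
});
```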

One capability worth calling out specifically: the model can now run parallel tool calls and narrate what it is doing while it does them. No dead air during multi-step tasks. For voice agents that need to query a database, book an appointment, and confirm availability in a single turn, this is the kind of thing that separates a demo from a product.
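
To make that concrete, here is a minimal sketch of registering two tools the model could call in parallel within a single turn. The tool names and schemas are hypothetical, and ws is assumed to be the open Realtime socket from the previous sketch.

```typescript
// Sketch: two tools the model can invoke in parallel in one turn.
// Tool names and schemas are hypothetical; "ws" is the open socket.
const toolConfig = {
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "query_availability",
        description: "Check open appointment slots for a given date",
        parameters: {
          type: "object",
          properties: { date: { type: "string" } },
          required: ["date"],
        },
      },
      {
        type: "function",
        name: "book_appointment",
        description: "Book a slot once availability is confirmed",
        parameters: {
          type: "object",
          properties: { slot_id: { type: "string" } },
          required: ["slot_id"],
        },
      },
    ],
  },
};
ws.send(JSON.stringify(toolConfig));
```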

The Other Two Models

GPT-Realtime-Translate is the multilingual play. It handles live speech-to-speech translation from 70+ input languages into 13 output languages while staying in sync with the speaker. At $0.034 per minute, it is economically viable at meaningful scale — useful for multilingual customer support, healthcare consultations with patients who speak different languages, and cross-border business communication.

GPT-Realtime-Whisper is streaming transcription only — not speech-to-speech. It transcribes as the speaker talks, making it useful for live captioning, meeting notes, broadcast subtitles, and accessibility tooling. The price point is $0.017 per minute, which is low enough to run continuously in broadcast environments.
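
For a sense of what live captioning could look like, here is a rough sketch. It assumes the transcription deltas follow the naming convention the Realtime API uses for other streaming events; treat the exact event type as an assumption.

```typescript
// Sketch: live captioning with the streaming transcription model.
// The transcription event type below is an assumption.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Print partial transcripts as they stream in.
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    process.stdout.write(event.delta);
  }
});
```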

Three New Capabilities That Expand What You Can Build

Beyond the new models, OpenAI shipped three capabilities that change the architecture of what is possible with the Realtime API.

  • MCP server support. Voice agents can now connect to external tools and APIs mid-conversation using the Model Context Protocol, without writing custom integration code for each data source (see the sketch after this list).
  • SIP phone calling. The Realtime API connects directly to the public switched telephone network, PBX systems, and desk phones via SIP. This removes the need for a separate telephony layer when building AI call center agents. TechCrunch called it production-ready plumbing for one of the most entrenched business channels: the telephone.
  • Image input. You can now pass screenshots and photos into a Realtime API session alongside audio. A user can share a screenshot of an error and ask about it verbally, and the model has full visual context.
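
Here is a rough sketch of the first and third capabilities together: attaching an MCP server, then adding a screenshot to the conversation. The field shapes mirror what OpenAI uses elsewhere (MCP tool entries, input_image content parts), but verify them against the GA reference; the server label, URL, and screenshotB64 variable are hypothetical, and ws is assumed to be an open Realtime socket.

```typescript
// Sketch: MCP server attach + image input. "ws" is an open Realtime
// WebSocket; the server URL and screenshotB64 are hypothetical.

// 1) Point the session at an external MCP server as a tool source.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [{
      type: "mcp",
      server_label: "billing-db",            // hypothetical label
      server_url: "https://mcp.example.com", // hypothetical URL
    }],
  },
}));

// 2) Add an image item so the model has visual context for the next turn.
ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [
      { type: "input_image", image_url: `data:image/png;base64,${screenshotB64}` },
      { type: "input_text", text: "Here is the error I mentioned." },
    ],
  },
}));
```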

Breaking Changes: What to Fix

GA is not wire-compatible with the beta interface. This is not a minor version bump — it requires code changes. Here is what to address (a before-and-after sketch follows the list):

  • Remove the OpenAI-Beta: realtime=v1 header from all requests
  • Rename event handlers: response.text.delta becomes response.output_text.delta
  • Replace conversation.item.created with conversation.item.added and conversation.item.done
  • Remove the temperature parameter — it no longer exists in GA
  • Update any references to the deprecated beta model IDs (gpt-4o-realtime-preview and variants)
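
In a WebSocket message handler, the event renames look roughly like this. handleItem is a placeholder for your own item-processing logic.

```typescript
// Sketch: beta-to-GA event renames in a message handler.
// "handleItem" is a placeholder for your own logic.
ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  switch (event.type) {
    case "response.output_text.delta": // beta: response.text.delta
      process.stdout.write(event.delta);
      break;
    case "conversation.item.added": // beta fired conversation.item.created once;
    case "conversation.item.done":  // GA splits it into added + done
      handleItem(event.item);
      break;
  }
});
```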

A field-tested migration guide on GitHub covers the WebRTC connection flow, session config, and every event name change, all documented from production debugging. Worth bookmarking before you touch your code.

The Pricing Math

GPT-Realtime-2 is billed by token: $32 per million audio input tokens and $64 per million output tokens. Audio encodes at one token per 100ms of user speech and one token per 50ms of assistant speech. In practice, this works out to $0.18 to $0.46 per minute; the in-practice figure sits above the raw token rates because each response re-processes the accumulated conversation as input, so cost per minute climbs as a session runs long. Caching your system prompt drops the input cost to $0.40 per million cached tokens — significant if your prompts are stable across sessions.
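
The raw rates are easy to sanity-check with the numbers above. A quick back-of-envelope:

```typescript
// Back-of-envelope cost per minute at the listed GA rates. Token
// encoding rates are taken straight from the article; real sessions
// land higher because each response re-processes accumulated context.
const INPUT_PER_M = 32;  // USD per 1M audio input tokens
const OUTPUT_PER_M = 64; // USD per 1M audio output tokens

const inputTokensPerMin = 60_000 / 100; // 1 token per 100 ms -> 600
const outputTokensPerMin = 60_000 / 50; // 1 token per 50 ms  -> 1,200

const baseCostPerMin =
  (inputTokensPerMin / 1e6) * INPUT_PER_M +
  (outputTokensPerMin / 1e6) * OUTPUT_PER_M;

console.log(baseCostPerMin.toFixed(4)); // ~0.0960 USD for one fresh
// minute of speech in each direction; the $0.18-$0.46 range reflects
// context growth across turns.
```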

The honest comparison: a cascaded stack of Deepgram for transcription, your own LLM, and ElevenLabs for voice synthesis will run three to five times cheaper per minute. The trade-off is engineering complexity, latency tuning across multiple vendors, and the absence of native reasoning in the voice path. For early-stage projects or anywhere engineering speed matters more than per-minute cost, the integrated approach is defensible. For high-volume deployments — call center scale, millions of minutes per month — run the numbers carefully before committing.

One thing that should be a given but often is not: pin your model version and watch the model documentation for updates. GA does not mean the API is frozen. OpenAI has a history of deprecating endpoints faster than teams expect.
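
In practice that means hardcoding a dated snapshot ID rather than a floating alias. The snapshot name below is made up for illustration; use whatever the model docs actually list.

```typescript
// Pin a dated snapshot instead of the floating alias.
// The snapshot ID below is hypothetical.
const MODEL = "gpt-realtime-2-2025-05-07"; // not just "gpt-realtime-2"
const url = `wss://api.openai.com/v1/realtime?model=${MODEL}`;
```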
