
The OpenAI Realtime API beta is gone. As of May 12, 2026, sending the OpenAI-Beta: realtime=v1 header returns a beta_api_shape_disabled error and nothing else. If your voice integration has been silent since then, that is probably why. The replacement — GPT-Realtime-2 — is not just a migration target. It is a materially different model with configurable reasoning, a 128K token context window, and reliable tool use. The migration takes 30–60 minutes. Here is exactly what to change.
Four Changes That Break Existing Integrations
The GA interface is not a rename of the beta. Four things changed, and missing any one of them can break your integration, sometimes without a clear error:
- Remove the beta header. Delete OpenAI-Beta: realtime=v1 from every request. This alone unblocks most broken builds.
- New ephemeral key endpoint. Browser and mobile clients now mint short-lived keys via POST /v1/realtime/client_secrets. Keys expire in one minute.
- New WebRTC SDP endpoint. The SDP exchange for WebRTC connections moves to POST /v1/realtime/calls. The old endpoint returns a 404.
- Add session.type. Omitting this field causes session creation to fail without a clear error message.
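As a rough sketch of the first and last items, a small helper can drop the retired beta header from an existing headers object and make sure a session object carries a type before the request goes out. The helper names (toGaHeaders, withSessionType) and the "realtime" default value are assumptions for illustration, not part of any SDK; check the accepted values against the GA schema.

```javascript
// Hypothetical helpers; the names are illustrative, not part of any SDK.

// Return a copy of `headers` without the retired realtime beta header.
// Other OpenAI-Beta values are left alone.
function toGaHeaders(headers) {
  const out = {};
  for (const [name, value] of Object.entries(headers)) {
    const isRealtimeBeta =
      name.toLowerCase() === "openai-beta" && String(value).includes("realtime=v1");
    if (!isRealtimeBeta) out[name] = value;
  }
  return out;
}

// Return a copy of `session` with `type` filled in when it is missing.
// Assumption: "realtime" is the value expected for voice sessions.
function withSessionType(session, type = "realtime") {
  return { type, ...session };
}
```

Both functions return copies, so existing request-building code can adopt them without mutating shared config objects.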
Event Names That Changed
Three event renames are the most common migration gotcha. They produce no errors — just missing data in your handlers:
- response.text.delta → response.output_text.delta
- conversation.item.created → conversation.item.added and conversation.item.done (now two events)
- Legacy content types replaced by output_text and output_audio
Check every dc.addEventListener("message", ...) handler in your codebase for the old event names before you consider the migration done.
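One way to run that check is a small script that scans source text for the old names. The sketch below assumes only the two event renames listed above; the RENAMED map and findLegacyEvents are hypothetical names, and reading files from disk is left to the caller.

```javascript
// Old event name -> GA replacement(s), taken from the list above.
const RENAMED = {
  "response.text.delta": ["response.output_text.delta"],
  "conversation.item.created": ["conversation.item.added", "conversation.item.done"],
};

// Report every line of `source` that still mentions an old event name.
function findLegacyEvents(source) {
  const hits = [];
  source.split("\n").forEach((text, index) => {
    for (const [oldName, newNames] of Object.entries(RENAMED)) {
      if (text.includes(oldName)) {
        hits.push({ line: index + 1, oldName, newNames });
      }
    }
  });
  return hits;
}
```

Running it over each file that touches the data channel gives a list of line numbers to fix, which is more reliable than reading handlers by eye because a missed rename throws nothing at runtime.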
Building a WebRTC Session With GPT-Realtime-2
The WebRTC flow has two halves: your server mints the key, your browser uses it. Your API key never reaches the browser — only the ephemeral token does.
Server (Node.js) — mint the ephemeral key:
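// Runs on your server; the API key never leaves it.
const res = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json"
  },
  // The GA schema also requires session.type (see the list above);
  // add it in the shape the official docs specify.
  body: JSON.stringify({
    model: "gpt-realtime-2",
    voice: "alloy",
    reasoning: { effort: "low" }
  })
});
if (!res.ok) {
  throw new Error(`client_secrets request failed: ${res.status}`);
}
const { client_secret } = await res.json();
// Send client_secret.value to the browser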
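Because ephemeral keys expire in one minute, the browser should request one right before connecting rather than reuse a stored value. A minimal sketch of that freshness check follows; the 60-second lifetime comes from the text above, while the mintedAtMs bookkeeping, the safety margin, and the helper name are assumptions for illustration.

```javascript
const KEY_TTL_MS = 60_000;       // ephemeral keys expire in one minute
const SAFETY_MARGIN_MS = 10_000; // assumption: leave room for the SDP round trip

// True if a key minted at `mintedAtMs` is still safe to use at `nowMs`.
function isKeyUsable(mintedAtMs, nowMs = Date.now()) {
  return nowMs - mintedAtMs < KEY_TTL_MS - SAFETY_MARGIN_MS;
}
```

If the check fails, the simplest recovery is to ask the server for a new key and retry the connection once.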
Browser — set up the WebRTC peer connection:
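const pc = new RTCPeerConnection();
// Play the model's audio as soon as the remote track arrives.
const audio = document.createElement("audio");
audio.autoplay = true;
pc.ontrack = (e) => { audio.srcObject = e.streams[0]; };
// Send microphone audio to the model.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(stream.getTracks()[0]);
// Create the data channel before the offer so it is part of the SDP.
const dc = pc.createDataChannel("oai-events");
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
// ephemeralKey is the client_secret.value your server returned.
// Use the absolute API URL; a relative path would hit your own origin.
const sdpResponse = await fetch("https://api.openai.com/v1/realtime/calls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${ephemeralKey}`,
    "Content-Type": "application/sdp"
  },
  body: offer.sdp
});
await pc.setRemoteDescription({ type: "answer", sdp: await sdpResponse.text() });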
The oai-events data channel carries all JSON events — session config, tool calls, transcripts, and turn completions all flow through it.
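Since every event arrives as JSON on that one channel, a small router keyed on the event's type field keeps handlers separate and makes unhandled (possibly renamed) events visible. This is a sketch rather than an official pattern: createEventRouter is a hypothetical name, and the delta field read in the usage comment is an assumption about the event payload.

```javascript
// Build a listener for data channel "message" events that routes by event type.
function createEventRouter(handlers, onUnhandled = () => {}) {
  return function onMessage(messageEvent) {
    let event;
    try {
      event = JSON.parse(messageEvent.data);
    } catch {
      return; // ignore payloads that are not valid JSON
    }
    if (!event || typeof event.type !== "string") return;
    if (Object.prototype.hasOwnProperty.call(handlers, event.type)) {
      handlers[event.type](event);
    } else {
      onUnhandled(event);
    }
  };
}

// Usage in the browser, where dc is the "oai-events" data channel:
// dc.addEventListener("message", createEventRouter(
//   { "response.output_text.delta": (e) => console.log(e.delta) }, // e.delta is assumed
//   (e) => console.debug("unhandled event", e.type)
// ));
```

Logging unhandled event types during migration is a cheap way to spot a handler that is still registered under a beta name.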
Reasoning Effort: What It Actually Means in Production
GPT-Realtime-2 supports five reasoning levels: minimal, low, medium, high, and xhigh. The model can now pause and think before responding to complex questions instead of immediately generating the first token.
Here is the thing to know before reading OpenAI’s benchmark results: every published number was produced at high or xhigh reasoning effort. Production defaults to low. Start at low — it keeps latency tolerable for conversational flow — then increase effort only for specific turns that need multi-step reasoning or tool orchestration. The context window expanded from 32K to 128K tokens, so long sessions no longer get truncated mid-conversation.
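The "start low, raise only when needed" advice can be expressed as a small per-turn policy. The sketch below is a hypothetical heuristic, not OpenAI's guidance: the multiStep and usesTools flags, the thresholds, and the function names are assumptions, and the returned fragment simply mirrors the reasoning field used in the server example above.

```javascript
// The five levels named in the text.
const EFFORT_LEVELS = ["minimal", "low", "medium", "high", "xhigh"];

// Hypothetical policy: stay at "low" unless the turn is flagged as complex.
function pickReasoningEffort({ multiStep = false, usesTools = false } = {}) {
  if (multiStep && usesTools) return "high";
  if (multiStep || usesTools) return "medium";
  return "low";
}

// Build a config fragment, rejecting values outside the five known levels.
function reasoningConfig(effort) {
  if (!EFFORT_LEVELS.includes(effort)) {
    throw new RangeError(`unknown reasoning effort: ${effort}`);
  }
  return { reasoning: { effort } };
}
```

Validating the level before sending it avoids a typo silently falling back to whatever the default happens to be.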
The Three Realtime Models Compared
| Model | Use case | Price/min |
|---|---|---|
| GPT-Realtime-2 | Voice agents with reasoning and tool use | $0.18–$0.46 (uncached) |
| GPT-Realtime-Translate | Live speech translation, 70+ languages → 13 outputs | $0.034 flat |
| GPT-Realtime-Whisper | Streaming transcription with partial real-time results | $0.017 flat |
GPT-Realtime-Translate adapts to the source speaker’s voice tone and pitch rather than layering a synthetic voice on top. GPT-Realtime-Whisper returns provisional partial transcripts as speech arrives, then revises them for high final accuracy. Use GPT-Realtime-2 for agentic workflows that need reasoning; use the other two for narrow tasks where cost matters more than intelligence.
Prompt caching applies to GPT-Realtime-2 and cuts the uncached audio input rate by roughly 98.75% on repeated context, bringing effective cost down to $0.05–$0.10 per minute for sessions with substantial repeated context.
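To turn the per-minute prices into a budget figure, multiply by expected minutes. The sketch below uses only the uncached rates from the table; the function name and the 10,000-minute volume are illustrative assumptions, and caching discounts are not modeled.

```javascript
// Per-minute prices in USD, copied from the table above.
const PRICE_PER_MINUTE = {
  "GPT-Realtime-2": { min: 0.18, max: 0.46 }, // uncached range
  "GPT-Realtime-Translate": { min: 0.034, max: 0.034 },
  "GPT-Realtime-Whisper": { min: 0.017, max: 0.017 },
};

// Return the low and high cost estimate in USD for a given number of minutes.
function estimateCost(model, minutes) {
  const price = PRICE_PER_MINUTE[model];
  if (!price) throw new Error(`no price listed for ${model}`);
  return { low: price.min * minutes, high: price.max * minutes };
}

// Example: 10,000 minutes of GPT-Realtime-2 comes to roughly $1,800 to $4,600 uncached.
const monthly = estimateCost("GPT-Realtime-2", 10000);
```

The same volume on GPT-Realtime-Whisper comes to about $170, which is the gap the table's "narrow tasks" advice is pointing at.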
What to Do This Week
- Audit for the beta header. Search your codebase for realtime=v1. Remove it. Test against the GA endpoint.
- Update your event handlers. Check for the three renamed events listed above. Missed renames produce silent data loss with no exceptions thrown.
- Set reasoning effort explicitly. Default is low; setting it in your session config makes behavior predictable when you tune it later.
The field-tested WebRTC migration repo on GitHub documents every endpoint change, event rename, and session schema update from real production debugging. The official OpenAI Realtime API docs have the canonical GA event schemas. For cost projections before you commit to GPT-Realtime-2 at scale, the realtime cost guide breaks down token math for both WebRTC and WebSocket sessions.













