
Google’s Gemini Live API moved from preview to general availability on Vertex AI at I/O 2026 — and this upgrade is more than a status change. You now get production SLAs, multi-region failover, enterprise compliance support, and Gemini 2.5 Flash Native Audio under the hood. If you’ve been waiting to build real-time voice or vision agents without stitching together separate speech-to-text, LLM, and text-to-speech services, the wait is over.
The Number That Changes Decisions
Before getting into what it can do, here’s the comparison every team will be asked about in its next meeting.
| API | Audio Input Price | Vision | Voices |
|---|---|---|---|
| Gemini 2.5 Flash Live | $0.00165/min | Yes | 30 HD |
| OpenAI Realtime mini | $0.084/min | No | 9 |
| OpenAI Realtime standard | $0.30/min | No | 9 |
At 100,000 minutes per month — a modest workload for a production voice agent — Gemini Live costs around $165. The OpenAI Realtime mini equivalent runs ~$8,400. That gap doesn’t just affect operating costs; it changes which use cases are worth building at all.
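The arithmetic behind those figures is easy to rerun for your own volume. Here is a minimal sketch, assuming the per-minute audio input prices in the table are the only cost (a real bill would also include output audio and any other charges). The names `PRICE_PER_MIN` and `monthly_cost` are illustrative, not part of any SDK.

```python
# Per-minute audio input prices copied from the comparison table (USD).
PRICE_PER_MIN = {
    "Gemini 2.5 Flash Live": 0.00165,
    "OpenAI Realtime mini": 0.084,
    "OpenAI Realtime standard": 0.30,
}


def monthly_cost(minutes: int, price_per_min: float) -> float:
    """Audio input cost in USD for one month, rounded to cents."""
    return round(minutes * price_per_min, 2)


if __name__ == "__main__":
    # Print the monthly audio input cost for each API at 100,000 minutes.
    for api, price in PRICE_PER_MIN.items():
        print(f"{api}: ${monthly_cost(100_000, price):,.2f}")
```

Swap in your own monthly minutes to see where the gap starts to matter for your budget.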
What GA Actually Means
Google moved Gemini Live to GA on Vertex AI with multi-region support, which means two things. First, you get the availability guarantees required for production workloads — this is no longer experimental infrastructure. Second, enterprise data residency and compliance features are now live, so regulated industries (finance, healthcare) can actually deploy it without a legal fight.
Companies already running production workloads on it include Shopify (Sidekick), United Wholesale Mortgage (Mia), and SightCall. UWM’s Mia has generated over 14,000 loans and doubled underwriter productivity since launching on the platform. That’s the kind of social proof that gets internal budget approvals.
What It Can Do That Competitors Can’t
The headline capability is end-to-end native audio — no separate STT or TTS pipeline. Audio goes in, audio comes out, with 30 HD voices across 24+ languages and a 70-language understanding range. That alone cuts ~100–200ms of latency per turn versus an STT → LLM → TTS chain.
But the real differentiator is vision. Gemini Live can process a camera feed and audio simultaneously — no other real-time conversational API does this. Send frames at up to 1 FPS alongside audio and the model can see your screen, interpret a live video feed, or discuss a diagram while talking with you. This enables agent patterns that simply don’t exist on competing APIs.
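Because the frame rate is capped at roughly 1 FPS, a camera that captures at 30 FPS needs throttling before frames are forwarded. A minimal sketch of that throttle follows. It only decides which frames to keep; the `FrameThrottle` and `select_frames` names are hypothetical, and how you encode and send each kept frame depends on your SDK or framework.

```python
from typing import Iterable, List, Optional


class FrameThrottle:
    """Decide which captured frames to forward, capping the rate at max_fps."""

    def __init__(self, max_fps: float = 1.0) -> None:
        self.min_interval = 1.0 / max_fps
        self.last_sent: Optional[float] = None  # timestamp of last forwarded frame

    def should_send(self, timestamp: float) -> bool:
        """Return True if enough time has passed since the last forwarded frame."""
        if self.last_sent is None or timestamp - self.last_sent >= self.min_interval:
            self.last_sent = timestamp
            return True
        return False


def select_frames(timestamps: Iterable[float], max_fps: float = 1.0) -> List[float]:
    """Filter capture timestamps (in seconds) down to at most max_fps."""
    throttle = FrameThrottle(max_fps)
    return [t for t in timestamps if throttle.should_send(t)]
```

In a capture loop, you would call `should_send(time.monotonic())` per frame and drop the frame when it returns `False`.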
Two other features worth knowing:
- Affective dialog — The model detects emotional tone (pitch, pace, expressed sentiment) and adapts its response style in real time.
- Proactive audio — The model distinguishes “is this directed at me?” from ambient conversation and stays quiet when it should. This is what ambient AI needs to not be annoying.
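Both features are opt-in through the session config. The sketch below builds that config as a plain dictionary. The field names `enable_affective_dialog` and `proactivity` are assumptions based on Google’s Live API documentation rather than anything stated in this article, and they may require a preview API version or a specific model, so verify them against the current reference before relying on them. `build_live_config` is a hypothetical helper.

```python
def build_live_config(affective: bool = False, proactive: bool = False) -> dict:
    """Build a Live session config dict with optional behavior flags."""
    config: dict = {"response_modalities": ["AUDIO"]}
    if affective:
        # Assumed field name: adapt response style to the speaker's tone.
        config["enable_affective_dialog"] = True
    if proactive:
        # Assumed field name: let the model stay silent on ambient speech.
        config["proactivity"] = {"proactive_audio": True}
    return config
```

Keeping the flags behind a helper makes it easy to turn them off if a given model or region rejects them.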
Architecture: What You Actually Need to Decide
Gemini Live uses a persistent WebSocket connection, not REST. Two patterns are supported:
Server-to-server (recommended for most apps): Your backend manages the WebSocket to Gemini. Clients stream to your server, your server forwards to the API. API key stays server-side. This is the right default for production.
Client-to-server (direct frontend connection): Lower latency (one fewer network hop), but requires ephemeral tokens — short-lived credentials your server issues to the client. Never put an API key in frontend code.
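The server-to-server pattern boils down to two concurrent pumps: one forwarding client audio upstream, one forwarding model output back down. Here is a minimal sketch of that shape, using `asyncio.Queue` objects as stand-ins for the client connection and the Gemini WebSocket. The `pump`, `relay`, and `demo` names are illustrative, and a real relay would also need authentication, backpressure, and error handling.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward chunks from source to sink until the END sentinel is seen."""
    while True:
        chunk = await source.get()
        await sink.put(chunk)
        if chunk is END:
            break


async def relay(client_up, api_up, api_down, client_down) -> None:
    """Run both directions concurrently, as a backend relay would."""
    await asyncio.gather(pump(client_up, api_up), pump(api_down, client_down))


def drain(queue: asyncio.Queue) -> list:
    """Collect everything currently in a queue (demo helper)."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


async def demo():
    client_up, api_up, api_down, client_down = (asyncio.Queue() for _ in range(4))
    for chunk in (b"mic-1", b"mic-2", END):
        client_up.put_nowait(chunk)  # audio arriving from the client
    for chunk in (b"reply-1", END):
        api_down.put_nowait(chunk)  # audio arriving from the model
    await relay(client_up, api_up, api_down, client_down)
    return drain(api_up), drain(client_down)
```

Running `asyncio.run(demo())` returns what reached the API side and what reached the client side, which is a quick way to check the wiring before swapping in real sockets.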
Session limits: 15 minutes for audio-only, 2 minutes for audio plus video. For longer interactions, use session resumption: the API can restore session history when you reconnect, so a conversation can span multiple connections.
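In practice that means tracking session age and reconnecting before the limit hits. A minimal sketch of that check, using the 15-minute and 2-minute limits quoted above; the 30-second safety margin and the function names are assumptions for illustration.

```python
AUDIO_ONLY_LIMIT_S = 15 * 60  # audio-only session limit from the text
AUDIO_VIDEO_LIMIT_S = 2 * 60  # audio-plus-video session limit from the text


def session_limit(has_video: bool) -> int:
    """Return the session limit in seconds for the given modality mix."""
    return AUDIO_VIDEO_LIMIT_S if has_video else AUDIO_ONLY_LIMIT_S


def should_rotate(elapsed_s: float, has_video: bool, margin_s: float = 30.0) -> bool:
    """True once the session is within margin_s of its limit."""
    return elapsed_s >= session_limit(has_video) - margin_s
```

Call `should_rotate` from your receive loop with the time since connect, and start the reconnect-and-resume flow as soon as it returns `True`.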
Starting in Ten Lines
The official examples repo has implementations in Python, JavaScript, and Node.js. The minimal Python session using the GenAI SDK looks like this:
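```python
import asyncio

from google import genai
from google.genai import types  # needed for types.Blob

client = genai.Client(api_key="YOUR_API_KEY")
model = "gemini-3.1-flash-live-preview"

# Placeholder input: 100 ms of silence as 16-bit, 16 kHz mono PCM.
audio_chunk = b"\x00\x00" * 1600


def play(pcm: bytes) -> None:
    """Placeholder: hand the PCM bytes to your audio output device."""


async def main():
    async with client.aio.live.connect(
        model=model,
        config={"response_modalities": ["AUDIO"]},
    ) as session:
        await session.send_realtime_input(
            audio=types.Blob(data=audio_chunk, mime_type="audio/pcm;rate=16000")
        )
        async for response in session.receive():
            content = response.server_content
            # model_turn can be absent on messages that carry no model output
            if content and content.model_turn:
                for part in content.model_turn.parts or []:
                    if part.inline_data:
                        play(part.inline_data.data)  # 24kHz PCM output

asyncio.run(main())
```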
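Real microphone input arrives as a continuous stream, so you would normally slice it into short chunks and send each one in turn. A minimal sketch of that slicing, assuming 16-bit mono samples at the 16 kHz rate used in the MIME type above; the 100 ms chunk length and the `chunk_pcm` name are illustrative choices, not API requirements.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000  # matches "audio/pcm;rate=16000"
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> List[bytes]:
    """Split raw PCM bytes into chunks of chunk_ms milliseconds each."""
    size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

Each returned chunk can then be wrapped in a blob and passed to the send call, one per loop iteration. The final chunk may be shorter than the rest.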
Partner integrations are available for LiveKit, Pipecat, and Firebase AI SDK (for mobile and web), so you don’t need to write WebSocket handling from scratch if you’re already on one of those frameworks.
The fastest path to test without writing code: Vertex AI Studio’s multimodal live console lets you try the API in the browser. Full API pricing is listed in Google’s official pricing documentation.













