Gemini-SQL2: Google’s Text-to-SQL Clears 80% on BIRD

Figure: Gemini-SQL2 by Google Research, the first text-to-SQL model to clear 80% on the BIRD benchmark.

Google Research dropped Gemini-SQL2 on June 12: a text-to-SQL system built on Gemini 3.1 Pro, and the first model to clear 80% on the BIRD benchmark. It posted 80.04% execution accuracy, up from a previous record of roughly 77.2%. There’s no public API yet, but the integration targets are BigQuery Studio, AlloyDB AI, and Cloud SQL Studio. This one is heading straight into Google’s enterprise data stack.

Why 80% on BIRD Actually Means Something

BIRD replaced Spider as the industry standard for text-to-SQL evaluation, and it’s legitimately hard. The dataset covers 12,751 question-SQL pairs across 95 databases and 37 professional domains — and the critical detail is how it scores: execution accuracy. Your generated SQL has to run and return the correct result set. Syntactically beautiful SQL that returns the wrong rows scores zero.
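To make the scoring rule concrete, here is a minimal sketch of an execution-accuracy check using Python's built-in sqlite3 module. The `orders` table, the sample queries, and the helper names are made up for illustration, and comparing rows as an unordered multiset is a simplifying assumption; BIRD's official scorer may differ in details such as how it treats duplicates.

```python
import sqlite3
from collections import Counter

def run(conn, sql):
    """Execute a query and return its rows as a multiset, or None if it fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def execution_match(conn, predicted_sql, gold_sql):
    """True only if the predicted query runs and returns the same rows as the gold query."""
    predicted = run(conn, predicted_sql)
    gold = run(conn, gold_sql)
    return predicted is not None and predicted == gold

# Hypothetical toy database, not part of BIRD.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'EU', 5.0), (3, 'US', 7.0);
""")

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
good = "SELECT o.region, SUM(o.amount) FROM orders AS o GROUP BY o.region"
wrong = "SELECT region, MAX(amount) FROM orders GROUP BY region"  # valid SQL, wrong rows

print(execution_match(conn, good, gold))   # differently written, same result set
print(execution_match(conn, wrong, gold))  # runs fine, but the rows differ
```

The point of the design: the text of the SQL never gets compared, only what comes back from the database.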

That’s a much higher bar than older benchmarks, which is why 80% means something. Human performance on BIRD sits at 92.96%, so Gemini-SQL2 still trails humans by 12.92 points. The previous best, at ~77.2%, was the original Gemini-SQL, meaning Google now holds both top spots on the single-model leaderboard. GPT-5.5-xhigh lands at 72.8%. Claude Opus 4.6 at 70.9%. Google is more than 7 points clear of the nearest non-Google model.

Model	BIRD Execution Accuracy
Gemini-SQL2	80.04%
Gemini-SQL	~77.2%
GPT-5.5-xhigh	72.8%
Claude Opus 4.6	70.9%
Human	92.96%
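The gaps quoted around this table are simple subtraction, and a quick sketch makes them easy to re-check. The numbers are copied from the table; the Gemini-SQL score is approximate in the source, so anything derived from it is approximate too.

```python
# Scores copied from the table above (execution accuracy, in percent).
scores = {
    "Gemini-SQL2": 80.04,
    "Gemini-SQL": 77.2,   # listed as ~77.2 in the source
    "GPT-5.5-xhigh": 72.8,
    "Claude Opus 4.6": 70.9,
}
human = 92.96

# Distance still left to human performance.
gap_to_human = round(human - scores["Gemini-SQL2"], 2)

# Improvement over the previous record holder.
gain_over_predecessor = round(scores["Gemini-SQL2"] - scores["Gemini-SQL"], 2)

# Lead over the best model that is not a Gemini variant.
best_non_google = max(v for k, v in scores.items() if not k.startswith("Gemini"))
lead_over_field = round(scores["Gemini-SQL2"] - best_non_google, 2)

print(gap_to_human, gain_over_predecessor, lead_over_field)
```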

What Gemini-SQL2 Actually Does

Gemini-SQL2 is not a new foundation model — it’s specialized post-training and scaffolding on top of Gemini 3.1 Pro. The key capability: it handles the stuff that makes real-world text-to-SQL hard. BIRD databases contain dirty values, ambiguous column names, and external knowledge requirements that older benchmarks ignored. Gemini-SQL2 works through these without fine-tuning on your specific schema — it uses the 1M-token context window to ingest large schemas at inference time and reasons through the ambiguity.
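Google hasn't published how Gemini-SQL2's scaffolding works, so the following is only a rough sketch of the general idea of putting a schema into the context at inference time instead of fine-tuning on it. The prompt layout, the `accounts` table, and the helper names are all assumptions, and no model is called.

```python
import sqlite3

def schema_ddl(conn):
    """Return the CREATE statements for every user table and view."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND sql IS NOT NULL "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql for (sql,) in rows)

def build_prompt(conn, question, hints=""):
    """Assemble schema, optional data notes, and the question into one prompt string."""
    parts = ["-- Schema", schema_ddl(conn)]
    if hints:
        # External knowledge the schema alone does not reveal.
        parts += ["-- Notes on the data", hints]
    parts += ["-- Question", question,
              "-- Write one SQL query that answers the question."]
    return "\n".join(parts)

# Hypothetical database with an ambiguous, coded column.
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE accounts (id INTEGER, region TEXT, status TEXT);")

prompt = build_prompt(
    conn,
    "How many churned accounts are there per region?",
    hints="accounts.status uses 'C' for churned and 'A' for active.",
)
print(prompt)
```

With a large context window, the same approach scales from one table to a whole warehouse schema, at least in principle.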

For a developer, this means queries like “MRR by region for accounts that churned within 90 days of their upgrade” — which requires joins, window functions, and date arithmetic — are increasingly something you hand off rather than write. Not always. Not at 92.96%. But at 80%, the failure rate is low enough to make the copilot model viable for a significant slice of routine analytical work.
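For a sense of what that question turns into, here is one hand-written reading of it against a made-up two-table schema, run through sqlite3. The tables, column names, and sample rows are hypothetical. This reading takes each account's first upgrade and first churn, which gets by with CTEs and date arithmetic; a schema with repeated upgrades per account would likely need the window functions mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, region TEXT, mrr REAL);
    CREATE TABLE events (account_id INTEGER, kind TEXT, happened_on TEXT);
    INSERT INTO accounts VALUES
        (1, 'EU', 100.0), (2, 'EU', 50.0), (3, 'US', 80.0), (4, 'US', 40.0);
    INSERT INTO events VALUES
        (1, 'upgrade', '2024-01-01'), (1, 'churn', '2024-02-15'),  -- churned after 45 days
        (2, 'upgrade', '2024-01-01'), (2, 'churn', '2024-06-01'),  -- churned after 152 days
        (3, 'upgrade', '2024-03-01'), (3, 'churn', '2024-05-15'),  -- churned after 75 days
        (4, 'upgrade', '2024-01-10');                              -- never churned
""")

QUERY = """
WITH upgrades AS (
    SELECT account_id, MIN(happened_on) AS upgraded_on
    FROM events WHERE kind = 'upgrade' GROUP BY account_id
),
churns AS (
    SELECT account_id, MIN(happened_on) AS churned_on
    FROM events WHERE kind = 'churn' GROUP BY account_id
)
SELECT a.region, SUM(a.mrr) AS mrr
FROM accounts AS a
JOIN upgrades AS u ON u.account_id = a.id
JOIN churns AS c ON c.account_id = a.id
WHERE julianday(c.churned_on) - julianday(u.upgraded_on) BETWEEN 0 AND 90
GROUP BY a.region
ORDER BY a.region
"""

result = conn.execute(QUERY).fetchall()
print(result)
```

Note how many decisions the one-line question hides: which upgrade counts, whether MRR means the value before or after churn, and whether day 90 is included. Those are exactly the ambiguities a text-to-SQL system has to guess at.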

The Catch: No API, No Timeline

As of June 12, Google has released no public API, no model card, and no technical report for Gemini-SQL2. What they have said is that it’s “heading into its data services next,” with BigQuery Studio, AlloyDB AI, and Cloud SQL Studio named as integration targets. No timeline. If you’re waiting to build something on top of it, you’re waiting.

The practical implication: if you’re already a BigQuery user, you’ll get this upgrade for free when it ships. The existing NL2SQL feature in BigQuery Studio already uses Gemini, and Gemini-SQL2’s roughly 3-point BIRD improvement (80.04% versus ~77.2%) would meaningfully upgrade that experience. You don’t migrate to Gemini-SQL2; Gemini-SQL2 migrates to you.

What This Means for Your Team

The near-term impact is on who can query your databases without writing SQL. Business analysts, product managers, and operations teams can ask questions in natural language and get results that are correct most of the time. The standard recommendation remains: expose 5-10 curated, governed views rather than raw warehouse tables. The 80% figure comes from benchmark databases; on production schemas with thousands of columns and half-deprecated views, expect accuracy to come in below it.
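As a rough illustration of the curated-views recommendation, the sketch below hides a messy raw table behind one clearly named view and describes only allow-listed views to whatever generates the SQL. The table, view, and function names are invented for the example, and sqlite3 stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw table with cryptic names, mixed units, and test rows.
    CREATE TABLE raw_acct_v3 (acct_id INTEGER, rgn_cd TEXT, mrr_cents INTEGER, is_test INTEGER);
    INSERT INTO raw_acct_v3 VALUES (1, 'EU', 10000, 0), (2, 'US', 5000, 1);

    -- Curated view: readable names, one unit, test accounts filtered out.
    CREATE VIEW account_mrr AS
        SELECT acct_id AS account_id, rgn_cd AS region, mrr_cents / 100.0 AS mrr
        FROM raw_acct_v3
        WHERE is_test = 0;
""")

def exposed_schema(conn, allowed_views):
    """Describe only allow-listed views as {view_name: [column names]}."""
    described = {}
    for view in allowed_views:
        # Confirm the view exists before interpolating its name into a PRAGMA.
        found = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'view' AND name = ?", (view,)
        ).fetchone()
        if found is None:
            raise ValueError(f"unknown view: {view}")
        columns = conn.execute(f'PRAGMA table_info("{view}")').fetchall()
        described[view] = [column[1] for column in columns]
    return described

schema_for_model = exposed_schema(conn, ["account_mrr"])
rows = conn.execute("SELECT account_id, region, mrr FROM account_mrr").fetchall()
print(schema_for_model, rows)
```

The model never sees `raw_acct_v3`, so it cannot pick the wrong column or forget the test-account filter.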

Professional SQL engineers aren’t going anywhere. The 12.92-point gap to human performance is real, and complex analytical work — performance-critical queries, migrations, stored procedures — still requires someone who knows what they’re doing. What changes is the volume of boilerplate that lands in the ticket queue.

The Competitive Read

The question “can AI write SQL?” is settled. The question now is “whose AI writes it best?” Google’s answer is structural, not just technical. They own BigQuery, AlloyDB, and Cloud SQL — the databases where Gemini-SQL2 will ship natively. OpenAI and Anthropic don’t own databases; they compete through APIs and third-party integrations. A 7-point lead on BIRD combined with first-party database distribution is a meaningful moat.

Watch for the model card and technical report — Google hasn’t published either yet, which limits the community’s ability to understand what’s driving the improvement. And watch for whether the 80% BIRD score holds on your specific schema. The full benchmark analysis from The Decoder covers the gap between BIRD scores and production-grade enterprise schemas.
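One way to check whether a headline score holds on your own schema is a small golden set: questions paired with SQL you trust, scored by comparing result sets. The sketch below assumes a hypothetical `generate_sql` function standing in for whichever text-to-SQL system you end up testing; here it is a canned lookup, since Gemini-SQL2 has no public API.

```python
import sqlite3

def result_set(conn, sql):
    """Run a query and return its rows as a set, or None if it fails."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def local_accuracy(conn, golden, generate_sql):
    """Fraction of golden questions whose generated SQL returns the trusted rows."""
    hits = 0
    for question, trusted_sql in golden:
        expected = result_set(conn, trusted_sql)
        actual = result_set(conn, generate_sql(question))
        if expected is not None and actual == expected:
            hits += 1
    return hits / len(golden)

# Hypothetical schema and golden set.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (id INTEGER, team TEXT, open INTEGER);
    INSERT INTO tickets VALUES (1, 'data', 1), (2, 'data', 0), (3, 'ops', 1);
""")
golden = [
    ("How many open tickets are there?", "SELECT COUNT(*) FROM tickets WHERE open = 1"),
    ("Which teams have tickets?", "SELECT DISTINCT team FROM tickets"),
]

# Stand-in for a real model: one correct answer, one plausible but wrong one.
canned = {
    "How many open tickets are there?": "SELECT SUM(open) FROM tickets",
    "Which teams have tickets?": "SELECT team FROM tickets WHERE open = 1 AND team = 'ops'",
}
accuracy = local_accuracy(conn, golden, canned.get)
print(accuracy)
```

A few dozen such pairs drawn from real tickets will say more about your schema than any leaderboard number.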

ByteBot
