
Sakana AI opened beta access to Fugu this week — a multi-agent orchestration system built on a counterintuitive premise: a 7-billion-parameter model can outperform GPT-5 on difficult benchmarks not by being smarter, but by being a better manager. The research behind it was accepted at ICLR 2026 and showed state-of-the-art results on GPQA-Diamond, LiveCodeBench, and AIME25. Now developers can apply for API access, and the integration is deliberately frictionless: it is an OpenAI-compatible endpoint that replaces a single model call with coordinated multi-model execution.
What Fugu Is (and What It Is Not)
Fugu is not a frontier model. It does not try to out-think GPT-5, Claude Sonnet 4, or Gemini 2.5 Pro. Instead, it orchestrates them. The underlying research model is a Qwen2.5-7B base, trained via reinforcement learning to design collaboration strategies across a pool of more powerful workers. Sakana calls this the RL Conductor.
The commercial product comes in two variants: Fugu Mini for latency-sensitive production applications, and Fugu Ultra for maximum performance on complex reasoning and coding tasks. Both are accessible via the same OpenAI-compatible API format.
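At the API level, the split is just a model name. A quick sketch (hedged: `fugu-ultra` is the model name from Sakana's integration example below, while `fugu-mini` is an assumed identifier inferred from the product name):

```python
from openai import OpenAI

# Sketch only: "fugu-mini" is an assumed identifier; "fugu-ultra" and the
# endpoint come from Sakana's integration example later in this article.
client = OpenAI(api_key="your-fugu-key", base_url="https://api.sakana.ai/fugu/v1")

def ask(prompt: str, latency_sensitive: bool = False) -> str:
    model = "fugu-mini" if latency_sensitive else "fugu-ultra"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```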
How the RL Conductor Works
Most multi-agent frameworks — LangGraph, CrewAI, AutoGen — require developers to manually define which model handles which subtask. You write the routing logic. You decide that Claude handles creative reasoning and GPT handles code output. You maintain that logic as models evolve.
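For contrast, here is roughly what that hand-written routing burden looks like, as a hypothetical sketch in plain Python rather than any specific framework (the task labels, model names, and fallback are all illustrative):

```python
# Hypothetical hand-rolled router: the developer owns this mapping and
# must revise it every time a provider ships a new model.
ROUTING_TABLE = {
    "creative_reasoning": ("anthropic", "claude-sonnet-4"),
    "code_generation": ("openai", "gpt-5"),
    "long_context_planning": ("google", "gemini-2.5-pro"),
}

def route(task_type: str) -> tuple[str, str]:
    # Edge cases accumulate here: fallbacks, retries, new task types.
    return ROUTING_TABLE.get(task_type, ("openai", "gpt-5"))
```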
The Conductor skips all of that. It was trained via GRPO (Group Relative Policy Optimization) on 960 problems spanning math, science, and coding, and it learned two things simultaneously: how to design communication topologies between agents, and how to prompt-engineer each worker model to maximize its individual strengths. The routing is not programmed — it is a skill the model acquired through training.
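The article does not specify the Conductor's output format, but conceptually the learned plan has two parts: a communication topology and per-worker instructions. Purely as an illustration, with every identifier and string below invented for the example:

```python
# Conceptual illustration only, not the real Conductor's output format.
# The point: both the topology and each worker's prompt are produced by
# the learned policy, not written by the developer.
conductor_plan = {
    "topology": [
        ("planner", "coder"),    # planner's output feeds the coder
        ("coder", "verifier"),   # coder's output feeds the verifier
    ],
    "workers": {
        "planner": {"model": "gemini-2.5-pro",
                    "prompt": "Decompose the problem into subtasks..."},
        "coder": {"model": "gpt-5",
                  "prompt": "Write the final optimized code for..."},
        "verifier": {"model": "claude-sonnet-4",
                     "prompt": "Check the solution against the constraints..."},
    },
}
```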
On coding benchmarks, this produced specific, observable behavior. Gemini 2.5 Pro and Claude Sonnet 4 were typically assigned as high-level planners; GPT-5 was brought in at the end to write the final optimized code. In some cases, the Conductor handed the entire planning process to Gemini 2.5 Pro and let it dictate subtasks for the rest of the pool. It was not following a script — it was making judgment calls.
There is also a recursive dimension. When Fugu is allowed to call itself, it reads its own prior output, evaluates whether the strategy worked, and spins up a corrective workflow on the fly. Sakana AI co-founder David Ha (hardmaru) described this as a new axis for inference-time compute scaling: instead of running a bigger model longer, you run a coordination layer that catches and corrects its own mistakes.
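From the outside, that loop can be approximated with ordinary API calls. A hedged sketch, assuming the OpenAI-compatible endpoint shown later in this article and a VALID/flaw critique convention invented for the example (Fugu's internal recursion is not exposed through the API):

```python
from openai import OpenAI

client = OpenAI(api_key="your-fugu-key", base_url="https://api.sakana.ai/fugu/v1")

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="fugu-ultra",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def solve_with_self_check(problem: str, max_rounds: int = 3) -> str:
    # External approximation of the self-correcting loop described above;
    # not Fugu's internal mechanism.
    answer = ask(problem)
    for _ in range(max_rounds):
        critique = ask(
            f"Problem: {problem}\nProposed answer: {answer}\n"
            "Did this strategy work? Reply VALID, or describe the flaw."
        )
        if critique.strip().startswith("VALID"):
            break
        answer = ask(f"Problem: {problem}\nRevise the answer to fix: {critique}")
    return answer
```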
Benchmark Results
| Benchmark | Score | Notes |
|---|---|---|
| AIME25 (math) | 93.3% | ~3% above best individual worker |
| GPQA-Diamond | 87.5% | SOTA at ICLR 2026 publication |
| LiveCodeBench | 83.93% | SOTA at ICLR 2026 publication |
The ~3% gain over the best individual frontier model might sound modest. The paper’s authors put it in context: that margin is consistent with the performance gap between entire generations of frontier models. Getting it from a coordination layer — not a larger model — is the point. You can read the full benchmark analysis on the GPQA Diamond leaderboard.
Integration: Two Lines of Code
The practical pitch is straightforward. If you are already calling OpenAI’s API, integrating Fugu is a base URL change:
```python
from openai import OpenAI

# Before: single-model call
client = OpenAI(api_key="your-openai-key")
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Solve this problem..."}],
)

# After: Fugu orchestrates across GPT-5, Claude Sonnet 4, Gemini 2.5 Pro
client = OpenAI(api_key="your-fugu-key", base_url="https://api.sakana.ai/fugu/v1")
response = client.chat.completions.create(
    model="fugu-ultra",
    messages=[{"role": "user", "content": "Solve this problem..."}],
)
```
Your existing code, request format, and response parsing stay the same. Fugu handles model selection, routing, and result aggregation internally. You stop managing separate API keys for each provider.
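Continuing from the snippet above, reading the result follows the standard OpenAI SDK shape:

```python
# Parsing is unchanged; the orchestration that produced the answer is
# invisible to the caller.
print(response.choices[0].message.content)
```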
The Build vs. Buy Case
If you are still hand-coding routing logic for multi-model workflows in 2026, that choice deserves examination. LangGraph and CrewAI are capable frameworks, but they put the routing burden on the developer — a burden that compounds as your model mix changes and edge cases accumulate.
Fugu’s case is that learned orchestration outperforms hand-coded orchestration on hard tasks, and does so with fewer API calls than competing pipelines. That matters at scale: the cost dynamics of multi-agent systems mean workflows that are cheap in testing can become expensive in production. Fugu helps on the “fewer calls” side, but total compute cost still depends on the models it calls.
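A back-of-envelope sketch of those cost dynamics (the per-million-token prices below are placeholders, not real provider pricing; fill them in from each provider's pricing page):

```python
def workflow_cost(calls: list[dict]) -> float:
    # Placeholder prices in USD per 1M tokens, illustrative only.
    price = {
        "gpt-5": (2.00, 8.00),             # (input, output) -- placeholder
        "claude-sonnet-4": (3.00, 15.00),  # placeholder
        "gemini-2.5-pro": (1.25, 10.00),   # placeholder
    }
    total = 0.0
    for call in calls:
        p_in, p_out = price[call["model"]]
        total += (call["in_tokens"] * p_in + call["out_tokens"] * p_out) / 1e6
    return total

# A multi-call pipeline that costs fractions of a cent per request in
# testing scales linearly with request volume in production.
```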
Beta access is open. If your application involves reasoning, coding, or scientific analysis — and a single frontier model is not delivering — Fugu is the most principled attempt yet to make orchestration something you buy rather than build. Apply and read the full technical writeup at sakana.ai/fugu-beta/, and review the RL Conductor research for implementation details.