
Sakana AI opened beta access to Fugu this week — a multi-agent orchestration system built on a counterintuitive premise: a 7-billion-parameter model can outperform GPT-5 on difficult benchmarks not by being smarter, but by being a better manager. The research behind it was accepted at ICLR 2026 and showed state-of-the-art results on GPQA-Diamond, LiveCodeBench, and AIME25. Now developers can apply for API access, and the integration is deliberately frictionless: it is an OpenAI-compatible endpoint that replaces a single model call with coordinated multi-model execution.
What Fugu Is (and What It Is Not)
Fugu is not a frontier model. It does not try to out-think GPT-5, Claude Sonnet 4, or Gemini 2.5 Pro. Instead, it orchestrates them. The underlying research model is a Qwen2.5-7B base, trained via reinforcement learning to design collaboration strategies across a pool of more powerful workers. Sakana calls this the RL Conductor.
The commercial product comes in two variants: Fugu Mini for latency-sensitive production applications, and Fugu Ultra for maximum performance on complex reasoning and coding tasks. Both are accessible via the same OpenAI-compatible API format.
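At the API level, the split is just a model name. A quick sketch (hedged: `fugu-ultra` is the model name from Sakana's integration example below, while `fugu-mini` is an assumed identifier inferred from the product name):

```python
from openai import OpenAI

# Sketch only: "fugu-mini" is an assumed identifier; "fugu-ultra" and the
# endpoint come from Sakana's integration example later in this article.
client = OpenAI(api_key="your-fugu-key", base_url="https://api.sakana.ai/fugu/v1")

def ask(prompt: str, latency_sensitive: bool = False) -> str:
    model = "fugu-mini" if latency_sensitive else "fugu-ultra"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```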
How the RL Conductor Works
Most multi-agent frameworks — LangGraph, CrewAI, AutoGen — require developers to manually define which model handles which subtask. You write the routing logic. You decide that Claude handles creative reasoning and GPT handles code output. You maintain that logic as models evolve.
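For contrast, here is roughly what that hand-written routing burden looks like, as a hypothetical sketch in plain Python rather than any specific framework (the task labels, model names, and fallback are all illustrative):

```python
# Hypothetical hand-rolled router: the developer owns this mapping and
# must revise it every time a provider ships a new model.
ROUTING_TABLE = {
    "creative_reasoning": ("anthropic", "claude-sonnet-4"),
    "code_generation": ("openai", "gpt-5"),
    "long_context_planning": ("google", "gemini-2.5-pro"),
}

def route(task_type: str) -> tuple[str, str]:
    # Edge cases accumulate here: fallbacks, retries, new task types.
    return ROUTING_TABLE.get(task_type, ("openai", "gpt-5"))
```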
The Conductor skips all of that. It was trained via GRPO (Group Relative Policy Optimization) on 960 problems spanning math, science, and coding, and it learned two things simultaneously: how to design communication topologies between agents, and how to prompt-engineer each worker model to maximize its individual strengths. The routing is not programmed — it is a skill the model acquired through training.
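The article does not specify the Conductor's output format, but conceptually the learned plan has two parts: a communication topology and per-worker instructions. Purely as an illustration, with every identifier and string below invented for the example:

```python
# Conceptual illustration only, not the real Conductor's output format.
# The point: both the topology and each worker's prompt are produced by
# the learned policy, not written by the developer.
conductor_plan = {
    "topology": [
        ("planner", "coder"),    # planner's output feeds the coder
        ("coder", "verifier"),   # coder's output feeds the verifier
    ],
    "workers": {
        "planner": {"model": "gemini-2.5-pro",
                    "prompt": "Decompose the problem into subtasks..."},
        "coder": {"model": "gpt-5",
                  "prompt": "Write the final optimized code for..."},
        "verifier": {"model": "claude-sonnet-4",
                     "prompt": "Check the solution against the constraints..."},
    },
}
```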
On coding benchmarks, this produced specific, observable behavior. Gemini 2.5 Pro and Claude Sonnet 4 were typically assigned as high-level planners; GPT-5 was brought in at the end to write the final optimized code. In some cases, the Conductor handed the entire planning process to Gemini 2.5 Pro and let it dictate subtasks for the rest of the pool. It was not following a script — it was making judgment calls.
There is also a recursive dimension. When Fugu is allowed to call itself, it reads its own prior output, evaluates whether the strategy worked, and spins up a corrective workflow on the fly. Sakana AI co-founder David Ha (hardmaru) described this as a new axis for inference-time compute scaling: instead of running a bigger model longer, you run a coordination layer that catches and corrects its own mistakes.
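From the outside, that loop can be approximated with ordinary API calls. A hedged sketch, assuming the OpenAI-compatible endpoint shown later in this article and a VALID/flaw critique convention invented for the example (Fugu's internal recursion is not exposed through the API):

```python
from openai import OpenAI

client = OpenAI(api_key="your-fugu-key", base_url="https://api.sakana.ai/fugu/v1")

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="fugu-ultra",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def solve_with_self_check(problem: str, max_rounds: int = 3) -> str:
    # External approximation of the self-correcting loop described above;
    # not Fugu's internal mechanism.
    answer = ask(problem)
    for _ in range(max_rounds):
        critique = ask(
            f"Problem: {problem}\nProposed answer: {answer}\n"
            "Did this strategy work? Reply VALID, or describe the flaw."
        )
        if critique.strip().startswith("VALID"):
            break
        answer = ask(f"Problem: {problem}\nRevise the answer to fix: {critique}")
    return answer
```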
Benchmark Results
| Benchmark | Score | Notes |
|---|---|---|
| AIME25 (math) | 93.3% | ~3% above best individual worker |
| GPQA-Diamond | 87.5% | SOTA at ICLR 2026 publication |
| LiveCodeBench | 83.93% | SOTA at ICLR 2026 publication |
The ~3% gain over the best individual frontier model might sound modest. The paper’s authors put it in context: that margin is consistent with the performance gap between entire generations of frontier models. Getting it from a coordination layer — not a larger model — is the point. You can read the full benchmark analysis on the GPQA Diamond leaderboard.
Integration: Two Lines of Code
The practical pitch is straightforward. If you are already calling OpenAI’s API, integrating Fugu is a base URL change:
```python
from openai import OpenAI

# Before: single-model call
client = OpenAI(api_key="your-openai-key")
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Solve this problem..."}],
)

# After: Fugu orchestrates across GPT-5, Claude Sonnet 4, Gemini 2.5 Pro
client = OpenAI(api_key="your-fugu-key", base_url="https://api.sakana.ai/fugu/v1")
response = client.chat.completions.create(
    model="fugu-ultra",
    messages=[{"role": "user", "content": "Solve this problem..."}],
)
```
Your existing code, request format, and response parsing stay the same. Fugu handles model selection, routing, and result aggregation internally. You stop managing separate API keys for each provider.
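Continuing from the snippet above, reading the result follows the standard OpenAI SDK shape:

```python
# Parsing is unchanged; the orchestration that produced the answer is
# invisible to the caller.
print(response.choices[0].message.content)
```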
The Build vs. Buy Case
If you are still hand-coding routing logic for multi-model workflows in 2026, that choice deserves examination. LangGraph and CrewAI are capable frameworks, but they put the routing burden on the developer — a burden that compounds as your model mix changes and edge cases accumulate.
Fugu’s case is that learned orchestration outperforms hand-coded orchestration on hard tasks, and does so with fewer API calls than competing pipelines. That matters at scale: the cost dynamics of multi-agent systems mean workflows that are cheap in testing can become expensive in production. Fugu helps on the “fewer calls” side, but total compute cost still depends on the models it calls.
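A back-of-envelope sketch of those cost dynamics (the per-million-token prices below are placeholders, not real provider pricing; fill them in from each provider's pricing page):

```python
def workflow_cost(calls: list[dict]) -> float:
    # Placeholder prices in USD per 1M tokens, illustrative only.
    price = {
        "gpt-5": (2.00, 8.00),             # (input, output) -- placeholder
        "claude-sonnet-4": (3.00, 15.00),  # placeholder
        "gemini-2.5-pro": (1.25, 10.00),   # placeholder
    }
    total = 0.0
    for call in calls:
        p_in, p_out = price[call["model"]]
        total += (call["in_tokens"] * p_in + call["out_tokens"] * p_out) / 1e6
    return total

# A multi-call pipeline that costs fractions of a cent per request in
# testing scales linearly with request volume in production.
```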
Beta access is open. If your application involves reasoning, coding, or scientific analysis — and a single frontier model is not delivering — Fugu is the most principled attempt yet to make orchestration something you buy rather than build. Apply and read the full technical writeup at sakana.ai/fugu-beta/, and review the RL Conductor research for implementation details.