
NVIDIA’s Nemotron 3 Ultra is live as of today: 550 billion parameters, 55 billion active per token, a 1 million token context window, and a commercial-use license. It’s available right now on NVIDIA NIM, HuggingFace, and OpenRouter — no waitlist, no preview access request. Announced by Jensen Huang at Computex 2026 on June 1, Ultra is the largest model NVIDIA has shipped and the most capable open-weight model available in the US or EU today.
The Benchmark Reality — Don’t Skip This Part
Nemotron 3 Ultra scores 48 on the Artificial Analysis Intelligence Index. That puts it ahead of every other US open-weight model — Gemma 4 31B sits at 39, Nemotron 3 Super at 36. But it trails China’s Kimi K2.6 at 54, and the proprietary frontier — Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro — all sit at 57.
That 9-point gap from the frontier is real. For the most demanding reasoning tasks, you’ll feel it. But two things soften the gap in practice. First, Kimi K2.6 is geo-restricted and practically inaccessible for US and EU enterprise deployments. Second, for most agent workloads (code generation, tool use, long-context analysis), Ultra’s other numbers matter more than the headline score.
| Model | Intelligence Index | Open Weights | Commercial Use |
|---|---|---|---|
| Claude Opus 4.7 / GPT-5.4 / Gemini 3.1 Pro | 57 | No | API only |
| Kimi K2.6 | 54 | Yes | Geo-restricted |
| Nemotron 3 Ultra | 48 | Yes | Yes |
| Gemma 4 31B | 39 | Yes | Yes |
| Nemotron 3 Super | 36 | Yes | Yes |
Why 300+ Tokens Per Second Changes the Math
The architecture is where Ultra gets interesting. NVIDIA built the model around three complementary techniques that collectively explain the performance claims.
LatentMoE is NVIDIA’s take on Mixture-of-Experts routing. Expert computation happens in a compressed latent dimension, cutting all-to-all communication overhead by roughly 4x compared to standard MoE designs. The result: a 550B-parameter model that scales across GPUs without collapsing under its own routing cost.
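A rough way to see where a figure like 4x could come from: in expert-parallel MoE, each routed token's activation vector is shipped to the GPUs hosting its experts and the result is shipped back, so the all-to-all payload scales with the width of the vector being dispatched. The sketch below compares full-width dispatch against dispatch at a latent width one quarter the size. The hidden width, latent width, top-k, and batch size are illustrative assumptions, not published Nemotron 3 Ultra values, and the model ignores routing metadata and overlap with compute.

```python
def all_to_all_bytes(tokens: int, width: int, top_k: int, bytes_per_value: int = 2) -> int:
    """Bytes moved in one dispatch + combine round trip for a batch of tokens.

    Each token is sent to top_k experts and each expert output is sent back,
    so the payload is 2 * tokens * top_k * width values.
    """
    return 2 * tokens * top_k * width * bytes_per_value


# Hypothetical sizes, chosen only to illustrate the ratio.
HIDDEN = 8192          # assumed model hidden width
LATENT = HIDDEN // 4   # assumed compressed latent width
TOKENS = 4096          # tokens in one micro-batch
TOP_K = 8              # assumed number of experts per token

standard = all_to_all_bytes(TOKENS, HIDDEN, TOP_K)
latent = all_to_all_bytes(TOKENS, LATENT, TOP_K)
reduction = standard / latent

print(f"standard: {standard / 2**20:.0f} MiB per layer")
print(f"latent:   {latent / 2**20:.0f} MiB per layer")
print(f"reduction: {reduction:.1f}x")
```

Under these assumptions the reduction equals the ratio of hidden width to latent width, which is why the compression factor of the latent space matters more than the raw parameter count for cross-GPU scaling.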
Mamba-2 hybrid layers handle the 1 million token context. Standard Transformer attention scales quadratically with sequence length — which makes genuinely long context windows financially brutal at this scale. Mamba-2 provides linear-complexity sequence modeling, making it practical to ingest an entire large codebase, a year of email threads, or a 900-page compliance document in a single context.
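The scaling argument can be made concrete with a back-of-the-envelope comparison. The sketch below counts abstract work units only (token pairs for full attention, per-token state updates for a linear-time layer) and ignores constant factors, memory traffic, and the fact that a hybrid model still contains some attention layers, so real costs fall somewhere between the two curves. The 8,000-token baseline is an arbitrary assumption.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Token-pair interactions for full self-attention: grows as n^2."""
    return n_tokens * n_tokens


def linear_step_count(n_tokens: int) -> int:
    """State updates for a linear-time sequence layer: grows as n."""
    return n_tokens


short_ctx, long_ctx = 8_000, 1_000_000   # baseline is an assumed reference point

token_growth = long_ctx / short_ctx
attn_growth = attention_pair_count(long_ctx) / attention_pair_count(short_ctx)
linear_growth = linear_step_count(long_ctx) / linear_step_count(short_ctx)

print(f"{token_growth:.0f}x more tokens")
print(f"attention work grows {attn_growth:,.0f}x")
print(f"linear-layer work grows {linear_growth:,.0f}x")
```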
Multi-Token Prediction (MTP) is how NVIDIA gets 300+ tokens per second. Two MTP layers baked into the checkpoint act as a built-in speculative decoder: the model drafts multiple tokens in parallel and verifies them, which accelerates generation substantially. GPT-5.4 on shared inference averages 80 to 120 tokens per second, so Ultra at 300+ is roughly 2.5 to 3.75 times faster. For multi-agent pipelines running dozens of parallel sessions, that throughput difference translates directly to lower latency or lower cost.
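The draft-and-verify idea can be illustrated with a toy loop. This is a minimal sketch of generic greedy speculative decoding, not NVIDIA's MTP implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and the cheap draft heads, and the draft length of 2 simply mirrors the two MTP layers mentioned above.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Stand-in for the full model: a deterministic toy rule."""
    return (context[-1] * 3 + len(context)) % 11


def draft_next(context: List[int]) -> int:
    """Stand-in for a cheap draft head: wrong on every fourth position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 11


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one full-model step per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int = 2) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them; returns (tokens, verify passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # Draft k tokens cheaply, each conditioned on the drafts before it.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft_next(out + drafted))

        # Verify. A real system scores all drafts in one batched forward pass;
        # here the target is called per position for clarity.
        verify_passes += 1
        accepted: List[int] = []
        for tok in drafted:
            expected = target_next(out + accepted)
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, drop later drafts
        else:
            # Every draft matched, so the verify pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal], verify_passes


prompt = [1]
baseline = greedy_decode(prompt, 20)
spec_tokens, passes = speculative_decode(prompt, 20)
print("outputs match:", spec_tokens == baseline)
print("full-model passes:", passes, "vs", 20, "for plain greedy decoding")
```

The output is identical to plain greedy decoding by construction, because every kept token is the one the target would have produced. The speedup comes only from needing fewer full-model passes, so it depends on how often the drafts are accepted.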
Agent-First Post-Training
Throughput and context are only useful if the model can do the work. Nemotron 3 Ultra was post-trained using multi-environment reinforcement learning through NVIDIA’s NeMo Gym, an open-source RL library with environments spanning competitive coding, competition math, and long-horizon tool use. That training focus helps explain why Ultra scores 91% on PinchBench agent productivity, matching Kimi K2.6 on that specific benchmark despite the overall intelligence gap.
The model also supports inference-time reasoning budget control, letting you dial computation up or down based on query complexity. For agent orchestration frameworks managing mixed-complexity tasks, this is a meaningful operational lever.
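On the orchestration side, using that lever means deciding a budget per request before the call is made. The sketch below shows one possible heuristic; the function name, thresholds, and token counts are all invented for illustration, and how the budget is actually passed to the model is not specified here, so the result is just a number an orchestrator could map onto whatever control the serving API exposes.

```python
def pick_reasoning_budget(prompt: str, tool_calls_expected: int = 0) -> int:
    """Return a hypothetical reasoning-token budget for one request.

    All thresholds are illustrative assumptions, not tuned values.
    """
    budget = 256                          # floor for simple lookups
    if len(prompt) > 2_000:               # long prompts tend to need more planning
        budget += 1_024
    budget += 512 * tool_calls_expected   # extra headroom per expected tool call
    return min(budget, 8_192)             # hard cap to bound cost and latency


simple = pick_reasoning_budget("What is the capital of France?")
agentic = pick_reasoning_budget("x" * 3_000, tool_calls_expected=2)
print(simple, agentic)
```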
How to Get Access
Three paths depending on your situation:
NIM API (easiest): Sign up for the NVIDIA Developer Program at build.nvidia.com. You get 1,000 free inference credits on signup. The endpoint is OpenAI-compatible — swap your base URL and API key, keep your existing SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-YOUR_KEY_HERE",
)

completion = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=4096,
)
print(completion.choices[0].message.content)
```
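To check the throughput claim against your own prompts, a small timing wrapper is enough to get started. This is a minimal sketch: `measure_tokens_per_second` and `fake_generate` are hypothetical helpers, and the stub stands in for a real request so the example runs offline without an API key.

```python
import time
from typing import Callable, Tuple


def measure_tokens_per_second(generate: Callable[[], Tuple[str, int]]) -> float:
    """Time one call to `generate`, which returns (text, completion_token_count)."""
    start = time.perf_counter()
    _text, n_tokens = generate()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def fake_generate() -> Tuple[str, int]:
    """Offline stand-in for a real API call: 30 tokens after a 50 ms delay."""
    time.sleep(0.05)
    return "stub output", 30


rate = measure_tokens_per_second(fake_generate)
print(f"{rate:.0f} tokens/s")
```

With the real client, `generate` would wrap the `client.chat.completions.create(...)` call and return the message content together with `completion.usage.completion_tokens`, assuming the endpoint populates the `usage` field. Note that this measures end-to-end time, including network latency and prompt processing, so it will read lower than a pure decode-speed figure.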
HuggingFace weights: The quantized NVFP4 version is at nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4, with a full BF16 version also available. Running 550B locally requires datacenter hardware — this path is for teams with the infrastructure or cloud GPU access.
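The hardware requirement follows from simple arithmetic on the weight storage alone. The sketch below is a lower bound: it ignores KV cache, activations, and the per-block scale factors that 4-bit formats carry, and the 80 GB accelerator size is an assumption chosen for illustration.

```python
import math

PARAMS = 550e9  # total parameter count, from the model name


def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (10^9 bytes) at a given precision."""
    return params * bits_per_weight / 8 / 1e9


GPU_GB = 80  # assumed memory per accelerator

bf16_gb = weight_gigabytes(PARAMS, 16)
nvfp4_gb = weight_gigabytes(PARAMS, 4)   # ignores scale-factor overhead

bf16_gpus = math.ceil(bf16_gb / GPU_GB)
nvfp4_gpus = math.ceil(nvfp4_gb / GPU_GB)

print(f"BF16:  {bf16_gb:.0f} GB of weights, at least {bf16_gpus} GPUs")
print(f"NVFP4: {nvfp4_gb:.0f} GB of weights, at least {nvfp4_gpus} GPUs")
```

Although only 55B parameters are active per token, all 550B must be resident in memory, because any expert can be selected for any token.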
OpenRouter and ModelScope: Also live today, useful if you’re already routing through either platform.
What NVIDIA Is Actually Doing Here
NVIDIA is not trying to out-compete OpenAI or Anthropic on model quality. The 9-point gap from the frontier is not an accident: it reflects a deliberate choice to optimize for inference throughput and to release the weights openly. When developers adopt Ultra, they run it on NIM. NIM runs best on NVIDIA GPUs. As agent workloads scale, NVIDIA’s inference platform scales with them.
This is the CUDA playbook applied to inference infrastructure. The open model is the hook; the hardware stack is the product. That doesn’t diminish the model’s value — but it does clarify what you’re opting into when you standardize on it.
Bottom Line
If you’re building AI agents and can’t justify proprietary frontier costs at production scale, Nemotron 3 Ultra is your model. Start with the NIM free tier and benchmark on your actual workloads — not the headline index. Before shipping to production, check the NVIDIA Open Model License: commercial use is permitted, but using the weights to train competing foundation models is not.













