MiMo UltraSpeed Hits 1,000 Tokens/Sec on Stock GPUs

Xiaomi released MiMo-V2.5-Pro-UltraSpeed on June 9, 2026 — a 1-trillion-parameter model that decodes over 1,000 tokens per second on a single standard 8-GPU node. No Groq chips. No Cerebras wafer-scale silicon. Just software. That distinction matters more than the speed number itself.

Why Commodity GPUs Change the Story

Groq and Cerebras built their businesses on a premise: reaching 1,000 tokens per second (TPS) with large models requires custom silicon. Groq’s Language Processing Unit hits 300–750 TPS depending on the model. Cerebras achieved 969 TPS on Meta’s Llama 3.1 405B using wafer-scale integration that cost hundreds of millions to develop. MiMo-V2.5-Pro-UltraSpeed reaches 1,000–1,200 TPS on a 1-trillion-parameter model, roughly two and a half times the size of Llama 3.1 405B, using hardware any team can rent on AWS or Azure today.

The open-source element compounds this. Xiaomi is releasing the FP4 model checkpoint on HuggingFace and select TileRT inference runtime modules on GitHub. If the technique is generalizable — and early analysis suggests it is — other labs could apply the same methodology to their own models. That is a more significant development than one fast model from one company.

How Three Stacked Techniques Get to 1,000 TPS

No single optimization achieves this speed. Three co-designed systems stack multiplicatively.

First, MXFP4 selective quantization targets only the Mixture-of-Experts expert weight blocks — where most of the model’s parameters live but which tolerate quantization best. Attention layers, embeddings, and output heads stay at higher precision. Quantization-Aware Training preserves near-baseline quality. Memory bandwidth drops sharply without the usual accuracy penalty of aggressive full-model quantization.
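
To make the memory saving concrete, here is a minimal sketch of MXFP4-style block quantization in plain Python. It assumes the general Microscaling layout (32 elements sharing one power-of-two scale, each element stored as a 4-bit E2M1 value) and uses simple round-to-nearest. The function names are illustrative, and nothing here is taken from Xiaomi's checkpoint or training code.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements that share one scale in the MX layout
E2M1_EMAX = 2     # exponent of the largest power of two in E2M1 (4.0 == 2**2)

# 4 bits per element plus one 8-bit shared scale amortized over the block.
BITS_PER_WEIGHT = 4 + 8 / BLOCK_SIZE


def quantize_block(block):
    """Return (shared power-of-two scale, list of signed E2M1 values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Pick the scale so the largest element lands in the top of the E2M1 range.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - E2M1_EMAX)
    quantized = []
    for x in block:
        scaled = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to 6.0
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - scaled))
        quantized.append(math.copysign(nearest, x))
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate weights from the shared scale and E2M1 values."""
    return [scale * q for q in quantized]
```

At 4.25 bits per weight, the quantized expert blocks take a bit over a quarter of the space of 16-bit weights, which is where the bandwidth saving comes from. The selective part of the scheme would amount to applying a routine like this only to expert weights and leaving attention, embeddings, and output heads alone.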

Second, DFlash speculative decoding attacks the serial bottleneck of one-token-at-a-time generation. A small draft model proposes blocks of up to 8 tokens at once using Sliding Window Attention. The main model then verifies the whole block in parallel rather than generating tokens one at a time. Average acceptance per verification round: 6.30 tokens for coding tasks, 5.56 for math and reasoning, 4.29 for agent workflows. For coding specifically, the highest-value use case for most developers, roughly six tokens clear every verification cycle instead of one. MarkTechPost’s technical analysis confirms the co-design approach is what makes these acceptance rates possible at scale.
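
The acceptance rule behind block-level speculative decoding can be sketched in a few lines. This is a toy greedy version, not DFlash itself: `verify_block` and `target_next_token` are hypothetical names, and the loop only imitates what a real system would score in one batched forward pass of the main model.

```python
def verify_block(draft_tokens, target_next_token, context):
    """Accept the longest draft prefix the target model agrees with.

    target_next_token(context) stands in for the main model's greedy choice.
    On the first disagreement, the target's own token replaces the draft
    token and the rest of the block is discarded.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    return accepted


# Average accepted tokens per verification round, as reported in the article.
ACCEPTED_PER_ROUND = {"coding": 6.30, "math_reasoning": 5.56, "agent": 4.29}


def target_passes_per_1000_tokens(accepted_per_round):
    """Main-model verification passes needed to emit 1,000 tokens."""
    return 1000 / accepted_per_round
```

With the reported coding figure, 1,000 output tokens need about 159 verification passes of the main model instead of 1,000 single-token passes. That ratio ignores the draft model's own cost, so it is an upper bound on the speedup rather than a measured one.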

Third, TileRT’s Persistent Engine Kernel keeps the entire compute pipeline resident on GPU continuously. Traditional inference runtimes launch discrete kernels per operation, incurring overhead that is invisible at 100 tokens per second but catastrophic at 1,000. Warp Specialization distributes memory movement, compute, and communication across coordinated GPU warp groups, each operating independently but in precise synchronization. At this speed, operations like RMSNorm and RoPE — normally negligible — become meaningful bottlenecks requiring micro-optimization.
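
A back-of-the-envelope calculation shows why launch overhead stops being invisible. The launch count and per-launch cost below are made-up illustrative inputs, not measurements of TileRT or any other runtime. The point is only that a fixed overhead eats a larger share of the budget as the per-token budget shrinks.

```python
def launch_overhead_fraction(tokens_per_sec, launches_per_token, launch_cost_us):
    """Fraction of the per-token time budget spent on kernel launch overhead."""
    budget_us = 1_000_000 / tokens_per_sec  # microseconds available per token
    return launches_per_token * launch_cost_us / budget_us


# Hypothetical workload: 100 kernel launches per token, 5 microseconds each.
LAUNCHES_PER_TOKEN = 100
LAUNCH_COST_US = 5.0

for tps in (100, 1000):
    share = launch_overhead_fraction(tps, LAUNCHES_PER_TOKEN, LAUNCH_COST_US)
    print(f"{tps:>5} TPS: launch overhead is {share:.0%} of the per-token budget")
```

With these assumed inputs the same 0.5 ms of overhead is 5% of the budget at 100 TPS and 50% at 1,000 TPS. A persistent kernel avoids paying that cost on every operation by staying resident on the GPU.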

What 1,000 Tokens Per Second Actually Enables

The practical shift is in how AI agents can be structured. At standard inference speeds, running Best-of-N (generate five candidate implementations, pick the best) one candidate after another takes five times longer than a single generation. At 1,000 TPS, that same Best-of-5 fits within the same wall-clock budget as a single inference pass at 200 TPS. Parallel reasoning paths, tree search, and iterative self-correction become viable within latency budgets that previously covered only a single pass. For latency-critical applications such as real-time fraud detection, medical imaging triage, and trading signal generation, a 1-trillion-parameter model becomes a realistic option where only smaller, less capable models were practical before.
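
The Best-of-N arithmetic is easy to check. The sketch below assumes candidates are generated one after another at the single-stream decode rate and that each candidate is the same length. The 2,000-token candidate size is a hypothetical figure chosen for illustration.

```python
def wall_clock_seconds(tokens_per_candidate, n_candidates, tokens_per_sec):
    """Time to generate n candidates sequentially at a fixed decode rate."""
    return tokens_per_candidate * n_candidates / tokens_per_sec


TOKENS_PER_CANDIDATE = 2000  # hypothetical size of one candidate implementation

single_at_200 = wall_clock_seconds(TOKENS_PER_CANDIDATE, 1, 200)
best_of_5_at_200 = wall_clock_seconds(TOKENS_PER_CANDIDATE, 5, 200)
best_of_5_at_1000 = wall_clock_seconds(TOKENS_PER_CANDIDATE, 5, 1000)

print(f"single pass @ 200 TPS:  {single_at_200:.0f} s")
print(f"best-of-5  @ 200 TPS:   {best_of_5_at_200:.0f} s")
print(f"best-of-5  @ 1000 TPS:  {best_of_5_at_1000:.0f} s")
```

Under those assumptions, five candidates at 1,000 TPS take the same 10 seconds as one candidate at 200 TPS, while Best-of-5 at 200 TPS takes 50 seconds.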

Developer reaction on Hacker News (571 points) reflects this: excitement centers not on the speed benchmark itself but on the workflow architecture it unlocks. The concern that follows is equally representative — fast generation doesn’t eliminate compile times, test cycles, or the cognitive overhead of reviewing agent output.

The Caveats Worth Stating Plainly

MiMo-V2.5-Pro-UltraSpeed still trails Anthropic’s and OpenAI’s frontier models on complex multi-step reasoning. Speed does not compensate for capability gaps on hard tasks. The trial window is June 9–23, 2026 only, application-based, with priority given to enterprise and professional developers — most readers will not get access immediately. Cost is 3× the base MiMo rate for roughly 10× the speed, which is a reasonable trade-off for the right use case but not cheap.

Key Takeaways

  • MiMo-V2.5-Pro-UltraSpeed achieves 1,000–1,200 tokens/second on a 1-trillion-parameter model using only commodity 8-GPU hardware — no custom silicon required
  • Three co-designed techniques stack multiplicatively: MXFP4 selective quantization, DFlash block-level speculative decoding (6.30 avg accepted tokens per round for coding), and TileRT Persistent Engine Kernel
  • At 1,000 TPS, Best-of-N sampling and parallel reasoning paths become viable within normal latency budgets — that changes agent architecture decisions
  • Quality still trails frontier models from Anthropic and OpenAI; this is not a replacement for complex reasoning tasks
  • The FP4 checkpoint and select TileRT runtime modules are being released as open source on HuggingFace and GitHub; the technique may be applicable beyond Xiaomi’s model family
ByteBot