
Gemma 4 MTP: How Google’s 3x Inference Boost Works

Google Gemma 4 MTP drafters deliver up to 3x faster inference through speculative decoding

Google just made your local Gemma 4 deployment dramatically faster without touching the model weights, the hardware, or the quality of outputs. On May 5, they shipped Multi-Token Prediction (MTP) drafters for the entire Gemma 4 family — lightweight companion models that deliver up to 3x faster token generation on the same GPU you’re already running. The catch? Mathematically, there isn’t one.

Why LLM Inference Is Slow in the First Place

Every token an LLM generates requires a full forward pass through billions of parameters. One token at a time, sequentially. That’s not an implementation flaw — it’s how autoregressive language models work. It’s also why running a 31B model locally has always felt sluggish compared to API services, even when you’ve got capable hardware.
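To make that concrete, here's a stripped-down sketch of the loop every autoregressive decoder runs. The `model` call stands in for a full forward pass over billions of parameters, and each pass buys you exactly one token:

def generate(model, prompt_ids, max_new_tokens):
    # `model` is a stand-in for a full forward pass over billions
    # of parameters; each call is the expensive part.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)               # one full forward pass...
        ids.append(int(logits.argmax()))  # ...yields exactly one token
    return ids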

Closed API providers solved this years ago with speculative decoding: use a small, fast draft model to predict several tokens ahead, then let the large model verify the batch in parallel. The math works out so the verification is lossless — the final token distribution is provably identical to standard generation. No quality trade-off. Just speed.
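That losslessness isn't hand-waving; it falls out of a rejection-sampling rule from the speculative decoding literature. Here's a minimal Python sketch of the per-token test (production kernels differ in engineering, but the math is this):

import numpy as np

def accept_or_resample(p_target, p_draft, draft_token, rng):
    # Accept the drafted token with probability min(1, p/q); on
    # rejection, resample from the normalized residual max(p - q, 0).
    # This makes the output distribution provably identical to
    # sampling from the target model alone.
    p, q = p_target[draft_token], p_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False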

The problem for the open model ecosystem was that nobody handed you production-grade drafters. You either built your own, ran without them, or used workarounds. Google just changed that.

What Gemma 4 MTP Actually Does

Each MTP drafter is a 4-layer transformer: three sliding-window attention layers and one full global attention layer — a compressed mirror of Gemma 4’s hybrid attention design. Small enough to be nearly free to run, specialized enough to draft coherently.

The key architectural decision is how tightly the drafter integrates with the target model. It shares the input embedding table with its paired target and builds directly on the target model’s last-layer activations rather than running independent inference from scratch. It also shares the target’s KV cache, which means it doesn’t recompute context the target already calculated. The drafter is less a separate model and more a prediction head bolted onto the existing forward pass.
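Google hasn't published reference code alongside the announcement, so treat the following as an illustrative sketch of that design rather than the actual implementation; every name and shape in it is an assumption:

import numpy as np

class MTPDrafterSketch:
    # Illustrative only. The real drafter is a 4-layer transformer
    # (3 sliding-window + 1 global attention layer) that also shares
    # the target's KV cache; a single projection stands in for all
    # of that here.
    def __init__(self, target_embeddings, head_weights):
        self.embed = target_embeddings  # shared with the target model
        self.head = head_weights        # the drafter's own small stack

    def draft(self, target_last_hidden, n):
        # Start from the target's last-layer activations instead of
        # re-encoding the context from scratch.
        h, tokens = target_last_hidden, []
        for _ in range(n):
            h = np.tanh(h @ self.head)
            logits = h @ self.embed.T   # tied to the shared embeddings
            tokens.append(int(logits.argmax()))
            h = h + self.embed[tokens[-1]]  # feed the drafted token back
        return tokens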

The loop works like this: the drafter predicts N tokens ahead; the target model verifies them in one parallel forward pass. If the target agrees with the draft, you get the entire sequence — plus one bonus token the target generates — in roughly the same time it would have taken to produce a single token normally. When the target disagrees starting at token k, it rejects k onward and resamples from its own distribution. You never get output worse than standard generation. You just sometimes get it much faster.
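Stitched together with the accept_or_resample sketch above, one full cycle looks roughly like this (schematic, not vLLM's actual scheduler; `drafter.draft_tokens`, `drafter.dist`, and the indexing conventions are my assumptions):

def speculative_step(target, drafter, ids, n_draft, rng):
    draft = drafter.draft_tokens(ids, n_draft)  # n_draft cheap guesses
    p_all = target(ids + draft)    # ONE parallel verification pass:
                                   # next-token dists for every position
    accepted = []
    for k, tok in enumerate(draft):
        p = p_all[len(ids) + k - 1]        # target's dist at position k
        q = drafter.dist(ids + draft[:k])  # drafter's dist at position k
        tok, ok = accept_or_resample(p, q, tok, rng)
        accepted.append(tok)
        if not ok:                 # reject k onward, keep the resample
            return ids + accepted
    p_bonus = p_all[len(ids) + n_draft - 1]  # whole draft accepted: the
    bonus = int(rng.choice(len(p_bonus), p=p_bonus))  # same pass also
    return ids + accepted + [bonus]          # yields one bonus token free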

The Actual Numbers

The official headline is "up to 3x faster." That’s the ceiling under ideal conditions. Here’s what real deployments look like:

  • Gemma 4 31B Dense benchmark: 11.43 tokens/sec → 22.05 tokens/sec with MTP enabled (~1.93x)
  • Community-observed average on consumer hardware: 1.5x to 2.2x
  • Coding and structured output tasks: up to 40% latency reduction (tokens are predictable, acceptance rates hit 70–84%)
  • Creative/open-ended generation: lower gains (less predictable tokens, lower acceptance rate)

A 1.5–2x speedup sounds less exciting than 3x, but context matters. That’s the difference between a model that feels like it’s lagging and one that keeps pace with how fast you read. For developers running Gemma 4 31B on a single A100 or high-end consumer GPU, this is the difference between "workable" and "actually good."
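You can sanity-check those ranges with the standard speculative decoding arithmetic, which gives the expected number of tokens produced per target forward pass as a function of the per-token acceptance rate and the draft length. A quick back-of-envelope (this ignores the drafter's own small cost, so treat the outputs as upper bounds):

def expected_tokens_per_pass(alpha, gamma):
    # Standard speculative-decoding estimate: expected tokens accepted
    # per target forward pass, for per-token acceptance rate `alpha`
    # and draft length `gamma`. Drafter overhead is ignored.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.84):
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 2))
# 0.5 -> 1.94, 0.7 -> 2.77, 0.84 -> 3.64: the 70-84% acceptance rates
# seen on structured tasks land right around the advertised ~3x ceiling
# once the drafter's overhead shaves a bit off.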

Turning It On

Google released drafter models for all four Gemma 4 variants (E2B, E4B, 26B MoE, 31B Dense) under the Apache 2.0 license. Available on Hugging Face and Kaggle. The naming pattern is straightforward: google/gemma-4-[variant]-it-assistant.

With vLLM — the recommended production serving stack — it’s a single config flag:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --speculative-config '{
    "method": "mtp",
    "model": "google/gemma-4-31B-it-assistant",
    "num_speculative_tokens": 4
  }'

For the 31B model, 4–8 speculative tokens is the recommended range. For E2B and E4B, 2–4. vLLM's MTP documentation covers the full configuration options. Hugging Face Transformers and MLX (for Apple Silicon) also support MTP out of the box. llama.cpp support is still being worked on via a community PR.
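For a quick local test without vLLM, Transformers' generic assisted-generation hook looks like this. One caveat: `assistant_model` is the library's standard speculative decoding entry point, and whether the MTP drafters (which lean on the target's activations) route through it exactly this way is my assumption, not something spelled out above:

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
target = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it")
drafter = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it-assistant")

prompt = tok("Write a JSON schema for a user profile:", return_tensors="pt")
out = target.generate(**prompt, assistant_model=drafter, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))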

What This Means for Self-Hosted AI

The inference efficiency gap between open models and closed API services has been a persistent argument against self-hosting. OpenAI and Anthropic run speculative decoding, batching optimizations, and custom inference kernels that developers running open models couldn’t easily match. MTP for Gemma 4 narrows that gap meaningfully.

The commonly cited self-hosting break-even point sits around 500,000 tokens per day. Double your effective throughput with MTP on the same hardware, and that break-even drops significantly. A GPU setup that was barely adequate for production becomes a reasonable option.
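The arithmetic is straightforward using the benchmark numbers above (idealized full utilization, no batching effects or idle time, so real capacity will differ):

SECONDS_PER_DAY = 86_400
for label, tps in [("baseline", 11.43), ("with MTP", 22.05)]:
    print(f"{label}: {tps * SECONDS_PER_DAY / 1e6:.2f}M tokens/day")
# baseline: 0.99M tokens/day; with MTP: 1.91M tokens/day. The same
# card clears the 500k/day break-even with nearly 4x headroom instead of 2x.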

That’s not a recommendation to abandon cloud APIs — for most workloads, hosted inference is still cheaper when you factor in engineering time and ops overhead. But for teams committed to self-hosting, or building applications where data privacy rules out external APIs, MTP changes the calculation.

The Caveats Worth Knowing

A few things the announcement post skips over:

  • 3x is the ceiling, not the average. Expect 1.5–2x in real workloads and be pleasantly surprised if you get more.
  • Task type matters. Coding, JSON generation, and structured outputs see the biggest gains. Free-form creative generation benefits less because token predictions are harder.
  • vLLM + DFlash incompatibility. There’s an open issue (GitHub #42068) where MTP-specific backend propagation conflicts with DFlash attention on some configurations.
  • llama.cpp is still catching up. Official MTP support depends on the community PR landing.

None of these are dealbreakers. They’re just the parts of the story that get left out of the launch announcement.

Getting Started

Google’s MTP overview guide covers the architecture and available drafters in detail. The vLLM MTP documentation has production serving examples. The drafter model pages on Hugging Face include benchmark numbers for each variant so you can set realistic expectations before you run it.

If you’re running Gemma 4 and haven’t enabled MTP yet, there’s no real reason not to try it. Same model, same GPU, Apache 2.0 licensed drafter, and a meaningful throughput improvement that costs nothing but a config change.

