NewsAI & Development

Gemma 4 QAT Cuts E2B to Under 1GB — Deploy It Now

Gemma 4 QAT featured image showing memory reduction from BF16 to under 1GB for on-device AI deployment

Google DeepMind released Gemma 4 QAT checkpoints on June 5, 2026, and the headline number is worth pausing on: the Gemma 4 E2B text-only model now runs in under 1GB of RAM. For context, the same model in full BF16 precision requires 9.6GB. That’s not a modest optimization — it’s a 90% reduction that moves the model from “requires a beefy laptop” to “fits in a phone.” The reason this works better than standard quantization is that QAT (Quantization-Aware Training) bakes compression simulation directly into the training loop, rather than compressing a finished model and hoping for the best.

Gemma 4 QAT: The Memory Numbers

The clearest way to understand this release is through the comparison table Google published alongside it. For the E2B model: BF16 baseline is 9.6GB, Q4_0 QAT format drops it to 3.2GB, and the mobile-specialized QAT format brings it under 1GB. For E4B: BF16 at 15GB becomes 5GB in Q4_0 QAT. The 12B model in Q4_0 QAT lands around 7GB — workable on any GPU sold in the last three years.

In the real world, LiteRT-LM (Google’s mobile inference runtime) runs the E2B multimodal model at a 607MB active RAM footprint on Apple mobile CPUs, delivering 56 tokens per second on iOS Metal and 52 tokens per second on Android OpenCL. The multimodal version handles text, images, and audio — this isn’t a stripped-down demo model. For purely text applications, the sub-1GB figure applies.

Related: Gemma 4: Google’s Open-Source Model That Runs on a 16GB Laptop

Why QAT Outperforms Post-Training Quantization

Standard post-training quantization (PTQ) takes a fully trained model and compresses its weights after the fact — from BF16 down to INT4. The model was never designed for that precision, so quality drops. How much it drops depends on the architecture and quantization scheme, but you’re always fighting against the grain.

QAT does something different: it simulates quantization noise on every forward pass during training. The model sees fake-quantized weights and learns to work within the precision constraints before they’re real. Google’s own data shows QAT cut the Q4_0 perplexity drop by 54% compared to PTQ on Gemma 3. PyTorch research on Llama 3 found QAT recovers up to 96% of hellaswag accuracy degradation versus PTQ. Unsloth’s dynamic QAT method applied to Gemma 4 26B-A4B improved accuracy from 70.2% to 85.6% over naive conversion.

The practical takeaway: when QAT checkpoints are available, use them. Same file size. Better quality. The tradeoff — that QAT is expensive to produce — is Google’s problem, not yours.

Ollama, llama.cpp, and What Works Today

All the major local inference tools have day-one support. The GGUF Q4_0 checkpoints work with llama.cpp, Ollama, and LM Studio directly. Compressed-tensors format covers vLLM for server-side deployments. MLX handles Apple Silicon natively. Transformers.js brings it to the browser. The mobile format runs on LiteRT-LM for Android and iOS.

However, Ollama has an active tool-calling bug with Gemma 4 models. When combining a system prompt, think:false, and tools, the parser drops tool calls into the content field instead of the tool_calls field. For simple chat and text generation it’s irrelevant. For agent workflows that rely on structured tool calling, use llama.cpp directly until Ollama ships a complete fix.


# llama.cpp — reliable for tool-calling agent workflows
./llama-cli -m gemma-4-e2b-it-qat-q4_0.gguf -c 4096 -p "Your prompt here"

# Ollama — fine for chat and text generation
ollama run gemma4:2b-qat

What Gemma 4 QAT Unlocks for Developers

The mobile story here is real. An app that previously needed to proxy inference requests to a cloud backend can now run Gemma 4 E2B on-device, with full multimodal support, using under 1GB of active RAM. That’s a meaningful threshold for privacy-sensitive apps, offline-capable assistants, and anything where cloud round-trip latency is a problem.

On the desktop side, the 12B model at ~7GB in Q4_0 QAT is now within range of mid-range consumer GPUs. For edge hardware, the E2B text-only model entering the 1GB range opens deployment targets that were previously off the table — industrial controllers, constrained IoT devices, and Raspberry Pi 4 variants with 4GB RAM. The full technical breakdown is on MarkTechPost if you want the complete quantization format comparison.

Related: Apple Foundation Models Framework: On-Device AI for iOS, No API Key

Key Takeaways

  • Gemma 4 E2B now runs in under 1GB RAM (text-only, mobile format) — down from 9.6GB in BF16, a 90% reduction
  • QAT checkpoints are measurably better than PTQ at the same file size: 54% less perplexity drop, up to 96% accuracy recovery in benchmarks
  • Day-one support across llama.cpp, Ollama, LM Studio, vLLM, MLX, LiteRT-LM, and Transformers.js
  • Ollama has an active tool-calling bug with Gemma 4 — use llama.cpp for agent workflows that require structured tool calls
  • Mobile deployment via LiteRT-LM: 607MB active RAM on iOS, 56 tok/s on Metal, full text-image-audio multimodal support
ByteBot
I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.

    You may also like

    Leave a reply

    Your email address will not be published. Required fields are marked *

    More in:News