Cloud AI inference costs add up fast. If you’re running a chatbot with 10,000 daily users, you’re looking at $200-500/month in API fees—forever. Google’s Gemma 4, released in April 2026, flips that model. The 2.3B-parameter E2B variant runs on edge devices in under 1.5GB of RAM, with output quality that beats models 12x its size, while eliminating per-request fees entirely. Combined with LiteRT-LM, Google’s production-ready edge inference framework, you can deploy AI on everything from Raspberry Pi to flagship smartphones. This tutorial takes you from zero to running your first inference in under 5 minutes.
What You’re Getting: Gemma 4 E2B and LiteRT-LM
Gemma 4 E2B is a 2.3-billion-parameter model purpose-built for edge deployment. It runs in under 1.5GB RAM thanks to 2-bit and 4-bit quantization, yet benchmarks show it beating Gemma 3 27B despite being 12x smaller. Google’s Per-Layer Embeddings architecture squeezes frontier-level performance into a package small enough for a Raspberry Pi.
LiteRT-LM is the deployment engine. It’s a production-ready, open-source inference framework built on top of LiteRT (the evolution of TensorFlow Lite). Cross-platform support means the same model runs on Android, iOS, Linux, macOS, Windows (via WSL), and IoT devices like Raspberry Pi. Hardware acceleration is baked in: CPU everywhere, GPU on mobile and desktop, NPU on Android with Qualcomm chips.
Performance scales with hardware. A Raspberry Pi 5 generates about 7.6 tokens/second on CPU—slow but functional for offline assistants. A flagship Android phone with GPU acceleration hits 3,808 tokens/second prefill, making real-time interactions viable. Apple’s M4 MacBook Pro pushes 7,835 tokens/second prefill with GPU, beating typical cloud round-trip latency for local workflows.
The model is Apache 2.0 licensed. No usage restrictions, no ambiguous terms—production deployments are explicitly allowed.
Installation: Two Paths, Same Destination
LiteRT-LM ships as a Python package with a CLI tool. You can install it two ways:
Option 1: uv (recommended)
If you have uv installed, this is the fastest path:
```shell
uv tool install litert-lm
```
This installs the litert-lm CLI as a user-wide tool on your PATH. No virtual environments, no activation steps.
Option 2: pip (classic)
Standard Python workflow:
```shell
python3 -m venv .venv
source .venv/bin/activate
pip install litert-lm
```
Platform support covers Linux, macOS, Windows (via WSL), and Raspberry Pi. You’ll need Python 3.x and about 3GB disk space for the model download.
Running Your First Inference
Once installed, you can pull a model from Hugging Face and run inference in one command:
```shell
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"
```
Here’s what happens:
- The CLI downloads the Gemma 4 E2B model from Hugging Face (about 2.6GB)
- LiteRT-LM loads it into memory with quantization (under 1.5GB RAM)
- Your prompt gets processed locally—no API calls, no network requests
- The response appears in your terminal along with performance metrics
The first run takes longer because of the model download. Subsequent runs skip that step and start inference immediately. On a MacBook Pro M4, you’ll see responses in under a second. On a Raspberry Pi 5, expect 3-5 seconds for short prompts.
This is the shift: you go from paying per request to paying nothing after initial setup. If you’re running more than a few thousand inferences per month, the economics favor edge deployment heavily.
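The break-even point is easy to estimate. A minimal sketch of the arithmetic, using the illustrative figures from the introduction ($200-500/month in API fees) and an assumed ~$150 hardware cost — these are placeholder numbers, not measurements:

```python
def breakeven_months(hardware_cost: float, monthly_api_fee: float) -> float:
    """Months until a one-time edge-hardware cost beats recurring API fees."""
    return hardware_cost / monthly_api_fee

# Assumed numbers: a ~$150 Raspberry Pi 5 kit vs this article's
# $200-500/month API-fee range for a 10,000-daily-user chatbot.
heavy = breakeven_months(150, 500)  # 0.3 months at the high end
light = breakeven_months(150, 200)  # 0.75 months at the low end
print(f"Break-even: {heavy:.2f} to {light:.2f} months")
```

Even with pessimistic assumptions, the one-time cost is recovered within the first month at this usage level.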
Integrating with Python: Building Real Applications
The CLI is great for testing, but real applications need the Python API. Here’s the basic pattern:
```python
import litert_lm

# Load the model (one-time operation per session)
engine = litert_lm.Engine("gemma-4-E2B-it.litertlm")

# Create a conversation context
with engine.create_conversation() as conversation:
    response = conversation.send_message("What is the capital of France?")
    print(f"Response: {response['content'][0]['text']}")
```
The engine loads the model once and keeps it in memory. Subsequent send_message() calls reuse the loaded model, so you’re not paying the load penalty on every request. For multi-turn conversations, the context persists within the conversation object.
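The multi-turn pattern extends naturally into a reusable helper. A minimal sketch, assuming the `send_message()` call and `response['content'][0]['text']` shape shown above — the helper only depends on that interface, so it works against the real conversation object or a stub in tests:

```python
def run_turns(conversation, messages):
    """Send each message in order and collect the text replies.

    `conversation` is any object exposing send_message() that returns
    the {'content': [{'text': ...}]} shape from the example above.
    """
    replies = []
    for msg in messages:
        response = conversation.send_message(msg)
        replies.append(response['content'][0]['text'])
    return replies

# Usage with the real engine (API shape assumed from the example above):
# engine = litert_lm.Engine("gemma-4-E2B-it.litertlm")
# with engine.create_conversation() as conversation:
#     answers = run_turns(conversation, ["Hi!", "Summarize what I said."])
```

Because the conversation context persists between calls, later messages in the list can refer back to earlier ones.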
Use cases stack up fast. A desktop app with a local AI assistant. A Raspberry Pi powering a smart home hub that works offline. A prototype for a mobile app before committing to native SDKs. The Python API bridges the gap between “cool demo” and “shipping product.”
Hardware Acceleration: Making It Fast
LiteRT-LM supports three backends: CPU (universal), GPU (mobile/desktop), and NPU (Android with Qualcomm chips). The performance gaps are massive:
| Device | Backend | Prefill Speed (tokens/sec) | Speedup vs CPU |
|---|---|---|---|
| Raspberry Pi 5 | CPU | 133 | 1x (baseline) |
| Android S26 Ultra | CPU | 557 | 4.2x |
| Android S26 Ultra | GPU | 3,808 | 28.6x |
| Qualcomm IQ8 | NPU | 3,700 | 27.8x |
| MacBook Pro M4 | GPU | 7,835 | 58.9x |
GPU acceleration on a flagship phone is a 28.6x speedup over the Raspberry Pi baseline, and roughly 7x over the phone’s own CPU. That’s the difference between a 5-second response and a near-instant one. NPU acceleration on Android with Qualcomm chips provides similar gains at lower power consumption.
For most developers, the default CPU backend works fine for prototyping. When you move to production, GPU acceleration becomes critical if you’re targeting real-time interactions. The framework handles backend selection automatically based on available hardware.
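The table translates directly into prompt-processing latency, which is what users feel before the first token appears. A quick sketch using the prefill figures above (the 512-token prompt length is an assumption for illustration):

```python
# Prefill speeds from the benchmark table above (tokens/sec).
PREFILL_TOKENS_PER_SEC = {
    "Raspberry Pi 5 (CPU)": 133,
    "Android S26 Ultra (CPU)": 557,
    "Android S26 Ultra (GPU)": 3808,
    "MacBook Pro M4 (GPU)": 7835,
}

def prefill_latency_ms(prompt_tokens: int, tokens_per_sec: float) -> float:
    """Time to process the full prompt before generation starts."""
    return prompt_tokens / tokens_per_sec * 1000

for device, speed in PREFILL_TOKENS_PER_SEC.items():
    ms = prefill_latency_ms(512, speed)
    print(f"{device}: {ms:.0f} ms to prefill a 512-token prompt")
```

The same 512-token prompt that takes nearly four seconds to prefill on the Pi’s CPU is processed in well under 200 ms on an accelerated phone or laptop, which is why GPU/NPU backends matter for interactive use.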
Why This Matters: Use Cases and Next Steps
Edge AI deployment solves three problems cloud APIs can’t: cost, privacy, and latency.
Cost becomes predictable. Cloud inference scales linearly with usage—10x users means 10x bills. Edge deployment has a fixed upfront cost (hardware + setup time) and zero marginal cost per inference. If you’re running high-volume applications, edge pays for itself in weeks.
Privacy is absolute. Medical apps analyzing patient data, legal tools processing contracts, personal assistants handling sensitive information—all run locally without exposing data to third parties. That simplifies GDPR, HIPAA, and other compliance requirements.
Latency drops to milliseconds. No network round trips, no API rate limits, no downtime when your provider has an outage. For real-time robotics, AR/VR, or gaming, edge inference is the only viable path.
If you’re ready to deploy to mobile, Android offers the smoothest path via the AICore Developer Preview and ML Kit GenAI API. iOS support comes through MediaPipe’s LLM Inference SDK. Both wrap LiteRT-LM under the hood.
Advanced features like tool calling (function execution), multi-modal capabilities (vision + audio), and fine-tuning for domain-specific tasks are all supported. The official documentation covers those in depth. Check out the Google AI Edge Gallery app for hands-on examples of what’s possible.
The shift from cloud to edge isn’t universal—high-complexity models still need datacenter GPUs. But for a huge class of applications, Gemma 4 E2B and LiteRT-LM deliver production-quality AI at zero marginal cost. That changes what’s economically viable.