AI & DevelopmentOpen SourceMachine Learning

NVIDIA Nemotron 3 Nano Omni: Run It Locally Now

NVIDIA Nemotron 3 Nano Omni open multimodal model with GPU chip and data streams for video audio image and text processing
NVIDIA Nemotron 3 Nano Omni: 30B-parameter open multimodal model with 3B active parameters per inference pass

NVIDIA just released Nemotron 3 Nano Omni, and if you’re building AI agents that need to process video, audio, images, and text in a single call, it’s the open model you’ve been waiting for. It runs 30 billion parameters but activates only 3 billion per inference pass. It fits on an RTX 4090. And the weights are fully open for commercial use. Let’s get into what it actually is and how to run it.

The Architecture: Why 30B-A3B Actually Works

The model name tells you the key fact: 30B total parameters, 3B active per token. That’s a hybrid Mamba-2 + Transformer Mixture-of-Experts (MoE) backbone. Each MoE layer contains 128 experts plus one shared expert, with only 6 experts activated per token. The router selects which experts fire; the rest are idle. Inference compute is bounded by the 3B active count, not the 30B total.

What Mamba-2 adds is memory efficiency for long contexts. Standard transformer attention scales quadratically with sequence length. Mamba handles long-range dependencies with linear scaling, which is why NVIDIA Nemotron 3 Nano Omni supports up to a 1-million-token context window on multi-GPU setups (256K on a single GPU). For document analysis agents processing lengthy contracts or audio transcriptions, that’s not a footnote — it’s the point.

Four Modalities, One Forward Pass

Before Nano Omni, a typical multimodal agent pipeline meant chaining separate models: a vision model, an ASR model, a reasoning LLM, and glue code to handle format mismatches. Nemotron 3 Nano Omni handles video, audio, images, and text in a unified architecture. You send a mixed input, you get a coherent output.

Practically, this flattens your agent stack. Document intelligence agents that parse PDFs with embedded charts? One model call. Video agents that summarize meeting recordings and extract action items from both speech and screen content? One model call. The latency savings and complexity reduction are real — and they compound as your pipeline grows.

How to Run It

There are three paths depending on your hardware and use case.

vLLM (Recommended for GPU Deployments)

vLLM is the production path for server-grade hardware. The NVIDIA team published a dedicated vLLM integration guide. The setup is a single command:

pip install vllm
python3 -m vllm.entrypoints.openai.api_server   --model "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"   --served-model-name nemotron

Hardware requirements by precision:

PrecisionVRAM RequiredRuns on
BF16~64 GBH100, 2x A100
FP8~32 GBA100 40GB, RTX 6000 Ada
4-bit GGUF~25 GBRTX 4090 (tight), RTX Pro

The server exposes an OpenAI-compatible API — drop it into any existing agent framework without changing your client code.

GGUF via llama.cpp or LM Studio (Consumer GPU Path)

The community has quantized Nano Omni to GGUF format. The 4-bit variant needs around 25 GB of VRAM — an RTX 4090 can handle it, though tightly. Grab the weights from Unsloth’s GGUF repository on Hugging Face or the LM Studio model library.

One important caveat: multimodal GGUF does not work in Ollama. Ollama cannot handle the separate mmproj vision projection files that multimodal GGUF models require. For multimodal inference, use llama-mtmd-cli from llama.cpp, or LM Studio with GPU offloading enabled. Ollama works for text-only inference via the nemotron3 tag — just don’t expect vision or audio support there.

Cloud API (Zero Hardware Path)

If you want to evaluate the model before committing hardware, OpenRouter offers a free tier: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free. NVIDIA’s own NIM microservice API is OpenAI-compatible and the fastest path to production without managing infrastructure. Amazon SageMaker JumpStart also supports one-click deployment using the FP8 model — the right call for existing AWS shops.

Where Nano Omni Wins — and Where It Doesn’t

Independent MediaPerf benchmarks put Nano Omni ahead of GPT 5.1 and Gemini 3.0 Pro on multimodal throughput. In an iterative 5-round video tagging workflow, it completes in 8.3 hours versus 18.4 hours for GPT 5.1 and 33.6 hours for Gemini 3.0 Pro. On H200, it delivers 3.3x higher throughput than Qwen3-30B-A3B on equivalent workloads.

Where it falls short: coding. Claude Sonnet 4.6 leads Nano Omni on code generation benchmarks (66.4 vs 53.5 average). If your agent’s primary job is writing or reviewing code, Nano Omni isn’t the right tool. For multimodal perception — understanding documents, video, audio, mixed media at scale — it’s the best open option at this weight class.

The Verdict

Nemotron 3 Nano Omni is a production-grade multimodal perception engine you can run privately, fine-tune on domain data, and integrate into OpenAI-compatible pipelines with minimal friction. The Ollama multimodal limitation will trip people up, and BF16 still requires serious hardware — but the 4-bit GGUF path brings this to a single high-end consumer GPU, which is new ground for open omni models. If you’re building agents that need to see, hear, and read, the weights are on Hugging Face now.

ByteBot
I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.

    You may also like

    Leave a reply

    Your email address will not be published. Required fields are marked *