ZAYA1-8B: Run a Frontier Reasoning Model Without NVIDIA

ZAYA1-8B sparse MoE reasoning model trained on AMD MI300X hardware - circuit board with neural network visualization

ZAYA1-8B: Frontier reasoning trained entirely on AMD Instinct MI300X hardware

Zyphra trained a reasoning model that beats DeepSeek-R1 on hard math benchmarks — and not one GPU in the training cluster had NVIDIA’s name on it. ZAYA1-8B activates fewer than 1 billion parameters per token, ships under Apache 2.0, and is available for download today. The bigger story is not the model — it is the precedent: CUDA dependency is now a design choice, not a law of physics.

A Different Kind of 8B Model

ZAYA1-8B is a sparse Mixture-of-Experts (MoE) model with 8.4 billion total parameters but only around 760 million active parameters per token. In practice, roughly 90% of the model sits idle while answering any given query — only the most relevant experts activate. The result is inference speed closer to a 1B model with reasoning capability that punches several weight classes higher.

The model was built by Zyphra, a San Francisco AI lab founded in 2024 and backed by Jaan Tallinn — the same investor behind early Anthropic and DeepMind funding rounds. Apache 2.0 licensing means no RAIL restrictions, no Gemma-style acceptable-use clauses. Use it in your product, in your pipeline, commercially, without asking permission.

The Benchmark Numbers Worth Knowing

Against comparably-sized models, ZAYA1-8B wins cleanly. On AIME’26 it scores 89.1, on HMMT February 2026 it scores 71.6, and on GPQA-Diamond it scores 71.0 — outperforming both Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it across every math and coding category tested. This from a model activating under one billion parameters.

With Markovian RSA extended compute — Zyphra’s novel test-time scaling method — ZAYA1-8B reaches 89.6 on HMMT’25. Claude 4.5 Sonnet scores 88.3 on the same benchmark. A local 8B model edging out Anthropic’s flagship is not a sentence you would have written a year ago.

How Markovian RSA Works (The Short Version)

Standard chain-of-thought reasoning has a problem: the longer the reasoning chain, the bigger the context window, and context windows are expensive. Markovian RSA sidesteps this by generating multiple short reasoning traces in parallel, where each new trace is conditioned on a fixed-length summary of all previous traces rather than the full context. The model accumulates reasoning depth without the context window growing unboundedly. The practical effect is that you can trade compute for accuracy in a controlled, memory-efficient way — exactly what you want for hard math and coding tasks. Zyphra published the full methodology in their technical report on arXiv.

The AMD Story Matters More Than the Benchmarks

ZAYA1-8B is the first production MoE model pretrained, midtrained, and fine-tuned entirely on AMD Instinct MI300X hardware — 1,024 nodes in a custom cluster built with IBM, connected via AMD Pensando Pollara interconnect. No NVIDIA. No CUDA. The entire training stack ran on ROCm.

This matters for reasons beyond GPU preferences. NVIDIA H100 units were priced at $25,000–30,000 each through most of 2025 and into 2026; AMD MI300X units run around $15,000. At cloud spot pricing, the gap is narrower but real. More importantly, enterprises building serious AI infrastructure now have an actual proof of concept that AMD-only training pipelines can produce frontier-class results. ROCm is within 10–20% of CUDA for most training workloads in 2026 — Zyphra’s result suggests that gap is closing further for memory-bound MoE architectures where AMD’s 192GB HBM3 per card is a direct advantage.

How to Run ZAYA1-8B Locally

The main caveat: ZAYA1-8B uses a custom architecture (Compressed Convolutional Attention + MLP router) that does not work with stock vLLM or stock Hugging Face Transformers. You need Zyphra’s forks. This is not unusual for cutting-edge models — it is a temporary friction point while upstreaming happens.

Path A: vLLM (Production / Server)

This is the recommended path for running ZAYA1-8B as an OpenAI-compatible API server:

pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1-pr"

vllm serve Zyphra/ZAYA1-8B   --port 8010   --mamba-cache-dtype float32   --dtype bfloat16   --reasoning-parser qwen3   --enable-auto-tool-choice   --tool-call-parser zaya_xml

VRAM requirements: BF16 full precision needs roughly 18–22GB with overhead. A 4-bit quantized version drops to around 6–9GB, putting it within reach of a single RTX 4090 or a well-specced Apple Silicon machine.

Path B: Transformers (Notebooks / Experimentation)

pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"

Path C: Community GGUF (Ollama / llama.cpp)

For users who live in Ollama or llama.cpp, community quantizations exist at lainlives/ZAYA1-8B-GGUF on Hugging Face. These are unofficial and may not behave identically to the original weights, but native llama.cpp support is being tracked at issue #22776 — follow it if you want first-class support before switching.

Should You Use It Now?

ZAYA1-8B is worth running today if you need a local reasoning model for math- or code-heavy workloads, you are evaluating AMD infrastructure as an alternative to NVIDIA, or you need a commercially permissive model without use restrictions. The vLLM fork requirement is real but manageable friction — if you have deployed open-source models before, installing a pip package from a GitHub branch is not a blocker.

If your use case is general-purpose chat rather than structured reasoning, or if you specifically need native Ollama support out of the box, waiting a few weeks for the llama.cpp PR to merge will give you a smoother experience.

The larger point stands either way: the idea that frontier-class AI reasoning requires NVIDIA hardware just took a direct hit from a well-funded lab with a peer-reviewed technical report and reproducible results. ZAYA1-8B is not a one-off research curiosity — it is a commercial-grade, Apache-licensed model that developers can put in production today.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.