Mellum2: JetBrains Open-Sources a Coding Model Built for Agent Pipelines

Mellum2 architecture diagram showing focal model concept with interconnected expert nodes in ByteIota blue and white color scheme

Mellum2: JetBrains' 12B Mixture-of-Experts coding model for agent pipelines

JetBrains just open-sourced Mellum2 — a 12B Mixture-of-Experts coding model released under Apache 2.0. It is explicitly not trying to compete with Claude Fable 5 or GPT-5. Instead, it is designed to sit inside your agent pipeline and handle the fast, repetitive sub-tasks that currently drain your frontier model budget. That framing matters more than any benchmark number.

The Focal Model Concept

JetBrains calls Mellum2 a “focal model” — a fast, specialized component built for high-frequency tasks inside multi-model AI systems, not a standalone replacement for frontier models. The premise is simple: not every step in your agent pipeline needs Claude. Routing a prompt, compressing a retrieved document, classifying a tool call, validating a plan — these steps happen dozens or hundreds of times per user session. Paying frontier rates for each one is a cost problem masquerading as an architecture problem.

Mellum2 is the answer to that problem. Fast, cheap, self-hostable, commercially licensed. You keep the frontier model for the hard reasoning. You let Mellum2 handle everything else.

The Architecture Behind the Speed

The 12B parameter count is the headline, but the operative number is 2.5B — the active parameters per token. Mellum2’s Mixture-of-Experts design routes each token through 8 of 64 experts, keeping per-token compute equivalent to a 2.5B dense model while retaining the capacity of a much larger network. The result is an inference profile that beats similarly-sized dense models under the concurrent loads that production systems actually face.

At 64 concurrent requests, Mellum2 runs 79% faster than Qwen3-8B. At single-request throughput it is roughly tied. The gap widens exactly when it matters: when your agent pipeline is fielding real traffic. A built-in Multi-Token Prediction head also enables speculative decoding, shaving additional latency off each response without changing your call interface.

Context window: 128K tokens, extended via layer-selective YaRN before post-training. The model was trained from scratch on roughly 10.6 trillion tokens using a three-phase curriculum that progressively shifts toward curated code and math data. This is not a fine-tune of Qwen or Llama — it is a purpose-built software engineering model.

Two Variants: Pick the Right One

JetBrains ships two post-trained checkpoints, and choosing the wrong one for your use case is the most avoidable mistake you can make.

Instruct answers directly with no visible chain of thought. Use it for routing, tool selection, code classification, and any task where you need a fast answer and reasoning transparency is irrelevant.

Thinking emits an explicit reasoning trace inside <think>...</think> blocks before its final answer. Use it for complex debugging assistance, multi-step agent planning, and tasks where step-by-step reasoning matters. The Thinking variant is compatible with vLLM’s --reasoning-parser qwen3 flag — a practical detail if you are already running Qwen3 in your stack.

Deployment in Three Commands

For production inference, vLLM handles Mellum2 natively:

vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking   --max-model-len 131072   --reasoning-parser qwen3   --enable-auto-tool-choice   --tool-call-parser hermes

For local development or air-gapped environments, a GGUF-quantized variant (Q4_K_M) is available and runs on 8GB VRAM. Pull it with Ollama:

ollama run MrScratchcat22/Mellum2:Q4_K_M

SGLang, Hugging Face Transformers, Docker Model Runner, and KTransformers are all supported. The full model catalog — Base, Instruct, Thinking, and SFT checkpoints for both — lives on the JetBrains Hugging Face organization.

Benchmarks: Where It Wins and Where It Doesn’t

Mellum2 Thinking scores 78.4% on EvalPlus (the combined HumanEval+ and MBPP+ benchmark), beating Qwen3.5-9B at 71.8%. On LiveCodeBench v6, it scores 69.9%. For code generation at this parameter tier, Mellum2 is best in class.

The honest caveat: on AIME 2025+2026, Mellum2 Thinking scores 58.4%, behind Qwen3.5-4B at 68.3%. A model with fewer active parameters — trained on more general data — outperforms it on competition math. If your agent pipeline has heavy math reasoning requirements, Mellum2 is not your focal model. Use it for what it is good at.

Who Should Use This

Mellum2 is the right call in three specific situations. First, you are building a multi-model agent pipeline and want a fast, cheap specialist for intermediate steps — routing, summarization, validation — without paying frontier rates for each one. Second, you are in a regulated or air-gapped environment where sending code to external APIs is not an option. Third, you want a fine-tuneable base that you own entirely, with no usage restrictions.

It is not the right call if you need a single model that handles everything, or if your pipeline’s bottleneck is reasoning quality rather than throughput and cost. For those cases, keep your frontier model. Mellum2 is the specialist you put in front of it.

The technical report is on arXiv (2605.31268). The model is on Hugging Face. The JetBrains blog post covers the full roadmap context.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.