
MiniMax shipped weights on June 7 for M3, a 428-billion-parameter mixture-of-experts model that scores 59.0% on SWE-Bench Pro, edging past GPT-5.5’s 58.6%, at $0.30 per million input tokens. That is more than 16 times cheaper than GPT-5.5 ($5.00) and about 50 times cheaper than Claude Opus 4.8 (roughly $15). There’s a catch, though: MiniMax’s launch comparison used Opus 4.7 and left out Opus 4.8, which had been released three days earlier and scores 69.2% on the same benchmark. M3 isn’t a clean sweep, but at that price it doesn’t need to be.
What M3 Actually Is
M3 has 428 billion total parameters, of which roughly 23 billion (about 5%) are active per token thanks to mixture-of-experts (MoE) routing. The headline innovation is MiniMax Sparse Attention (MSA), a replacement for standard full attention that pre-filters the relevant KV-cache blocks instead of attending across all tokens. At one million tokens of context, MiniMax reports 9x faster prefill, 15x faster decoding, and one-twentieth the compute per token versus M2. The model also supports image and video input natively, and can operate a desktop computer in agentic tasks.
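To make the block pre-filtering idea concrete, here is a toy sketch in plain Python. It is not MiniMax's MSA algorithm, whose details are not described here. It assumes a generic scheme: summarize each KV-cache block by the mean of its keys, keep the top-scoring blocks for a query, and run ordinary attention over those blocks only. The function names and the `block_size` and `top_blocks` parameters are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention(q, keys, values):
    """Standard attention for one query: score every cached key."""
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def block_sparse_attention(q, keys, values, block_size, top_blocks):
    """Attend only to the top-scoring KV blocks.

    Assumption: blocks are ranked by the dot product between the query
    and the mean key of the block. Real systems may rank differently.
    """
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(q, mean_key), start))
    # Keep the best blocks, then restore positional order.
    chosen = sorted(start for _, start in sorted(scored, reverse=True)[:top_blocks])
    idx = [i for s in chosen for i in range(s, min(s + block_size, len(keys)))]
    out = full_attention(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, idx


# Tiny demo: 8 cached tokens in 4 blocks, keep only the single best block.
demo_keys = [[0, 1], [0, 1], [5, 0], [5, 0], [0, 1], [0, 1], [-1, 0], [-1, 0]]
demo_values = [[float(i)] for i in range(8)]
demo_out, demo_idx = block_sparse_attention([1.0, 0.0], demo_keys, demo_values, 2, 1)
print("attended token indices:", demo_idx)
```

The saving comes from the second step: the expensive per-token scoring runs over `top_blocks * block_size` tokens instead of the whole cache, at the price of one cheap score per block.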
Prior long-context models could technically fit a million tokens. M3 makes it economical to actually use them. That’s the architectural bet MiniMax is making: as agents need to hold entire codebases, conversation histories, and document sets in memory simultaneously, the efficiency of the attention mechanism stops being academic. MiniMax’s technical report shows M3 autonomously reproducing a research paper in 12 hours with 18 code commits, and optimizing a CUDA kernel from 7.6% to 71.3% hardware utilization across 147 iterations.
The Benchmark Picture—Honest Version
Here’s what the numbers actually say:
| Model | SWE-Bench Pro | BrowseComp | PostTrainBench | Input ($/M tokens) |
|---|---|---|---|---|
| Claude Opus 4.8 | 69.2% | ~79% | 0.42 (1st) | ~$15 |
| MiniMax M3 | 59.0% | 83.5% | 0.37 (3rd) | $0.30 |
| GPT-5.5 | 58.6% | N/A | 0.39 (2nd) | $5.00 |
M3 beats GPT-5.5 on coding by a narrow margin (0.4 points) and posts the highest BrowseComp score in the table, which measures autonomous web agent tasks (GPT-5.5 has no reported figure). But Opus 4.8 leads SWE-Bench Pro by about 10 points, and PostTrainBench (general instruction following) puts M3 in third. MiniMax’s own launch benchmarks compared against Opus 4.7 rather than 4.8. For an independent read on real-world performance, check OpenRouter’s live latency and throughput stats.
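The trade-off in the table can be checked with a few lines of arithmetic. The sketch below uses only the SWE-Bench Pro scores and input prices quoted above; the Opus price is approximate, and the dictionary layout and function name are made up for illustration.

```python
# Figures copied from the table above; the Opus 4.8 price is approximate.
MODELS = {
    "Claude Opus 4.8": {"swe_bench_pro": 69.2, "input_usd_per_m": 15.00},
    "MiniMax M3": {"swe_bench_pro": 59.0, "input_usd_per_m": 0.30},
    "GPT-5.5": {"swe_bench_pro": 58.6, "input_usd_per_m": 5.00},
}


def compare_to(models, baseline="MiniMax M3"):
    """Return each model's price multiple and score gap relative to the baseline."""
    base = models[baseline]
    rows = {}
    for name, m in models.items():
        if name == baseline:
            continue
        rows[name] = {
            # How many times more expensive than the baseline, per input token.
            "price_ratio": m["input_usd_per_m"] / base["input_usd_per_m"],
            # Positive means the other model scores higher than the baseline.
            "score_gap": m["swe_bench_pro"] - base["swe_bench_pro"],
        }
    return rows


for name, row in compare_to(MODELS).items():
    print(f"{name}: {row['price_ratio']:.1f}x the input price, "
          f"{row['score_gap']:+.1f} points on SWE-Bench Pro")
```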
The Price Math for Production
For teams running AI agents in production, token costs compound fast. Consider a coding agent processing 10 million input tokens per day:
- M3: $3/day
- GPT-5.5: $50/day
- Opus 4.8: $150/day
Over 30 days, M3 costs $90 for the same volume that costs $1,500 on GPT-5.5 or $4,500 on Opus. For high-volume bug-fix pipelines, automated code review, or eval harnesses, that gap is hard to ignore. M3 is the default choice when you need frontier-range coding ability and Opus-tier polish isn’t strictly required.
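To rerun this estimate for your own volume, a small calculator is enough. The sketch below reproduces the figures above and covers input tokens only, since those are the only prices quoted here; the `input_cost` helper is a hypothetical name, and the Opus price is the approximate $15 figure from the table.

```python
# USD per million input tokens, as quoted in this article.
PRICES = {"M3": 0.30, "GPT-5.5": 5.00, "Opus 4.8": 15.00}


def input_cost(tokens_per_day, price_per_million, days=1):
    """Input-token spend in USD for a given daily volume over a number of days."""
    return tokens_per_day / 1_000_000 * price_per_million * days


DAILY_TOKENS = 10_000_000  # the coding-agent workload from the example above

for name, price in PRICES.items():
    daily = input_cost(DAILY_TOKENS, price)
    monthly = input_cost(DAILY_TOKENS, price, days=30)
    print(f"{name}: ${daily:,.2f}/day, ${monthly:,.2f} over 30 days")
```

Output tokens are billed separately and are not included, so real bills will be higher for all three models.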
Using M3 Today
The integration path is minimal. M3 exposes an OpenAI-compatible endpoint, so existing OpenAI SDK code needs only a new base URL, API key, and model name:
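```python
from openai import OpenAI

# Point the standard OpenAI client at MiniMax's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.minimax.io/v1",
    api_key="YOUR_MINIMAX_API_KEY",
)

# The request shape is unchanged; only the model name differs.
response = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[{"role": "user", "content": "Refactor this function..."}],
)
print(response.choices[0].message.content)
```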
LangChain integration works identically via ChatOpenAI pointed at the MiniMax base URL. Weights are live on Hugging Face for self-hosting. You’ll need roughly 440GB of storage for the FP8 checkpoint and at least eight high-end GPUs with tensor parallelism. SGLang has official M3 support; vLLM works with MSA support enabled. Mac Studio deployments via llama.cpp are possible—expect practical context limits below the 1M maximum.
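A quick back-of-the-envelope check on the self-hosting numbers: FP8 stores one byte per parameter, so 428 billion parameters come to about 428 GB of raw weights. The gap to the quoted 440GB is plausibly tensors kept at higher precision plus file metadata, but that is a guess. The sketch below also assumes weights are sharded evenly across GPUs under tensor parallelism; both helper names are hypothetical.

```python
def checkpoint_size_gb(total_params, bytes_per_param):
    """Raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def weights_per_gpu_gb(total_params, bytes_per_param, num_gpus):
    """Weight memory per GPU, assuming an even split across all GPUs."""
    return checkpoint_size_gb(total_params, bytes_per_param) / num_gpus


TOTAL_PARAMS = 428_000_000_000  # 428B, from the article
FP8_BYTES = 1                   # one byte per parameter

print(f"FP8 weights: {checkpoint_size_gb(TOTAL_PARAMS, FP8_BYTES):.0f} GB")
print(f"Per GPU (8-way): {weights_per_gpu_gb(TOTAL_PARAMS, FP8_BYTES, 8):.1f} GB")
```

The per-GPU figure covers weights only. KV cache and activations need additional headroom, and that requirement grows with context length.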
Two Things to Know Before You Commit
First, the license. M3 ships under the MiniMax Community License, not Apache 2.0. Commercial use restrictions may apply to your use case. Read the terms before building a product on the weights—“open weights” and “fully open source” are not the same thing.
Second, context discipline. A one-million-token window is not an invitation to stuff everything in. Every token costs money. Filling the context for tasks that don’t need it inflates cost without improving output. Use the window when the task demands it—long document analysis, full-codebase context, multi-hour agentic runs. For standard coding tasks, a shorter context at the same model delivers the same result at a fraction of the cost.
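To see what context discipline is worth, compare the input cost of one request at different context sizes, then cap what you send. This is a minimal sketch: the greedy packing strategy and both function names are illustrative assumptions, and token counts are expected to come from whatever tokenizer you already use.

```python
M3_INPUT_USD_PER_M = 0.30  # input price quoted in this article


def request_input_cost(context_tokens, price_per_million=M3_INPUT_USD_PER_M):
    """Input cost in USD of a single request with the given context size."""
    return context_tokens / 1_000_000 * price_per_million


def pack_context(chunks, max_tokens):
    """Greedily keep chunks until the token budget is reached.

    `chunks` is a list of (text, token_count) pairs, already sorted with
    the most relevant first. Chunks that would overflow the budget are
    skipped, so a smaller chunk later in the list can still fit.
    """
    kept, used = [], 0
    for text, tokens in chunks:
        if used + tokens > max_tokens:
            continue
        kept.append(text)
        used += tokens
    return kept, used


print(f"Full 1M-token context: ${request_input_cost(1_000_000):.3f} per request")
print(f"20k-token context:     ${request_input_cost(20_000):.3f} per request")
```

Across thousands of agent calls per day, that per-request difference is what decides whether the million-token window pays for itself.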
Bottom Line
M3 is the right call for production coding agents and long-horizon agentic pipelines where Opus 4.8 sets the quality ceiling but the budget can’t cover it. It’s not a replacement for Opus on general instruction tasks. The benchmark cherry-picking at launch is a yellow flag: MiniMax knew the table looked cleaner without Opus 4.8 in it. But the underlying value holds: frontier-range coding at more than 16x lower input cost than GPT-5.5, open weights with self-hosting options, and a million-token context window that’s architecturally efficient rather than just technically possible. For cost-sensitive teams building on top of LLMs, that’s worth testing today.