AI & DevelopmentMachine Learning

MiniMax M3: Open-Weight Frontier Model at 5% of Opus Cost

MiniMax M3 sparse attention architecture visualization with blue neural network connections on dark background
MiniMax M3: Open-Weight Frontier Model with MSA Sparse Attention

MiniMax just released an open-weight model that costs roughly 5% of Claude Opus per task, hits 59% on SWE-Bench Pro, and runs a 1-million-token context window at one-twentieth the compute of its predecessor. Those numbers deserve scrutiny   —   they are all vendor-published   —   but MiniMax M3 is now the most cost-competitive option for developers building agentic systems, and it is worth understanding how it works before deciding whether to trust it.

What Makes 1M Context Actually Practical

Long context claims are common. Most models slow to a crawl or produce degraded output as they approach their limits. MiniMax M3 addresses this with a new attention architecture called MSA   —   MiniMax Sparse Attention   —   which selects relevant key-value blocks rather than computing full attention across the entire context window.

The result: at 1 million tokens, per-token compute drops to one-twentieth of the previous M2 generation, delivering 9x faster prefill and 15x faster decode at max context. MiniMax demonstrated this with a 24-hour autonomous CUDA kernel optimization task (improving hardware utilization from 7.6% to 71.3%) and a 12-hour paper reproduction task, both completed without human intervention. Those are the kinds of long-horizon workflows where a large context window actually earns its keep.

The Benchmarks: What the Numbers Say (and Don’t)

MiniMax M3 scores 59.0% on SWE-Bench Pro, 66.0% on Terminal-Bench 2.1, and 83.5 on BrowseComp   —   figures that reportedly put it above GPT-5.5 and Gemini 3.1 Pro on coding, and above Claude Opus 4.7 on autonomous browsing. The MSA technical report was published on arXiv on June 11.

The necessary caveat: every one of those results was run by MiniMax, on MiniMax infrastructure, using MiniMax agent scaffolding. Independent third-party benchmarks were not available at launch on June 1. The comparison baseline is Claude Opus 4.7, not the more recently released 4.8   —   so the gap to the current frontier is larger than the launch materials suggest. Benchmark your actual workflows before committing to production.

The Cost Breakdown

The economics are where M3 makes its strongest case. Standard API pricing is $0.60 per million input tokens and $2.40 per million output tokens. A task consuming 500,000 input tokens and 100,000 output tokens runs at roughly $0.27 with M3, versus $5.00 with Claude Opus   —   about 5% of the cost. The promotional rate cuts that further, though you should plan against standard rates for production budgeting.

ModelInput / 1M tokensOutput / 1M tokens
MiniMax M3 (standard)$0.60$2.40
MiniMax M3 (promo)$0.30$1.20
Claude Opus 4.7$5.00$25.00
GPT-5.5~$10.00~$30.00

Three Ways to Use It

MiniMax M3 is available through three paths depending on how much control you want:

MiniMax API   —   First-party, OpenAI-compatible endpoint at api.minimax.io. Native multimodal support, thinking mode toggle, and the full 1M context window. This is the path with the data governance considerations discussed below.

OpenRouter   —   Fastest route for testing. Set the model to minimax/minimax-m3 using existing OpenRouter credentials and you are running within minutes.

Self-hosted open weights   —   The model is on Hugging Face (MiniMaxAI/MiniMax-M3) with SGLang, vLLM, and Transformers support. Quantized builds for llama.cpp, Ollama, and LM Studio are available. Fair warning: the full BF16 checkpoint is a 427B-parameter model, with 23B parameters active per inference. That is a real infrastructure commitment.

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "minimax/minimax-m3", "messages": [{"role": "user", "content": "Review this code for bugs and performance issues."}]}'

One Question Worth Answering Before You Deploy

MiniMax is a Shanghai-based company subject to China’s 2017 National Intelligence Law, which requires cooperation with state intelligence operations. For the hosted API, this is a compliance question that deserves an explicit answer depending on what you are sending through the model.

Self-hosting the open weights eliminates this concern entirely. If your workloads involve regulated data or sensitive IP, the self-hosted path is the right call regardless of the cost story.

Bottom Line

MiniMax M3 is a technically serious model with a cost structure that changes the economics of agentic workflows. It is not a drop-in replacement for Claude Opus on sensitive workloads, and its benchmark numbers need independent verification. But for long-horizon coding agents, large-document pipelines, and high-volume inference where the numbers check out on your tasks, it has earned a real evaluation   —   not just a footnote.

Run it on one real workflow first. The official launch post covers the architecture in depth, and the MSA technical report has the attention mechanism specifics for those who want to go deeper. For a warts-and-all practical evaluation, the agentic workflow evaluation on Medium is worth reading before you commit.

ByteBot
I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.

    You may also like

    Leave a reply

    Your email address will not be published. Required fields are marked *