AI & DevelopmentDeveloper Tools

MiniMax M3: Open-Weight 1M-Context Frontier Model Guide

MiniMax M3 neural network sparse attention architecture visualization with 1M token context window
MiniMax M3: The first open-weight model combining frontier coding, 1M-token context, and native multimodality

MiniMax M3 launched June 1 and broke a ceiling that has defined the open-source LLM space for two years: frontier-level coding performance, a genuine 1M-token context window, and native multimodality — all in one open-weight model you can download and run on your own hardware. The weights are on HuggingFace. The API starts at $0.60 per million input tokens, currently $0.30 on promotional pricing. This is not another model that is only "open" in name while hiding behind a sales call. It is the real thing, and if you are building long-context agentic workflows, it belongs on your evaluation list.

Why the 1M Context Window Is Different This Time

Every major model release in the past 18 months has claimed a million-token context. What most skip over is that context length at speed is the hard problem. At a million tokens, naive full attention becomes prohibitively expensive — compute grows quadratically with input length, which means latency explodes and costs follow.

MiniMax solved this with a new architecture called MiniMax Sparse Attention (MSA). Instead of comparing every token against every other token, MSA selects relevant key-value blocks and runs attention only on those. The result at 1M context: prefill is 9.7x faster and decoding is 15.6x faster compared to the prior generation. Per-token compute drops to 1/20th of what the previous model required at that context length. Four times faster than the best open-source sparse attention alternatives, per MiniMax’s ablations.

The architecture is a Mixture-of-Experts model with 229.9 billion total parameters and 9.8 billion active per token across 256 fine-grained experts. It was trained with native multimodality from the start — text, image, and video input are not bolted on as an afterthought.

Benchmark Numbers — Read the Fine Print

M3 scores 59.0% on SWE-bench Pro, 66.0% on Terminal-Bench 2.1, and 83.5 on BrowseComp. On coding, it edges out GPT-5.5 (58.6% on SWE-bench Pro) and beats Claude Opus 4.7 on browsing tasks. Those are strong numbers by any current standard.

BenchmarkMiniMax M3GPT-5.5Claude Opus 4.7
SWE-bench Pro59.0%58.6%Above M3
Terminal-Bench 2.166.0%
BrowseComp83.579.3
MCP Atlas74.2%

Here is the caveat worth stating plainly: every one of these scores comes from MiniMax’s internal testing environment using agent scaffolding of their own design. Independent third-party replication is still in progress. Treat the benchmarks as directional — they suggest M3 is a serious model — but do not make production architecture decisions based on self-reported numbers alone. MiniMax acknowledges this in their technical report; most coverage glosses over it.

Self-Hosting: Real, But Budget Accordingly

The weights are live on HuggingFace at MiniMaxAI/MiniMax-M3. Unsloth has published GGUF quantizations. AMD has day-0 support on Instinct GPUs. The serving stack is vLLM or SGLang, both of which have official MSA support.

The hardware reality: FP8 precision requires around 230 GB VRAM — two H200 SXM5 GPUs or four H100 SXM5 GPUs as a minimum viable setup. Running at full 1M-token context adds approximately 120 GB of KV cache overhead on top of that. This is not a laptop experiment.

For teams with sustained high-volume workloads, the economics shift. At spot pricing on two H200 GPUs running FP8, cost comes out to roughly $1.26 per million tokens — cheaper than the hosted API at scale. The breakeven is approximately 420 tokens per second of sustained throughput.

A minimal vLLM launch for testing:

python -m vllm.entrypoints.openai.api_server   --model ./minimax-m3   --tensor-parallel-size 2   --quantization fp8   --enable-expert-parallel   --max-model-len 131072   --kv-cache-dtype fp8_e5m2   --gpu-memory-utilization 0.92   --port 8000

One non-obvious trap: set --max-model-len to your workload’s 90th-percentile context length, not to 1M. Setting it to the theoretical maximum on two H200s in FP8 will exhaust VRAM and block concurrent requests. See the official local deployment docs for the full configuration reference.

The Cost Case Is the Real Story

Benchmarks aside, the pricing gap is the number that changes decisions. MiniMax M3 at standard API rates runs $0.60 per million input tokens. GPT-5.5 and Claude Opus 4.8 are roughly $15 to $30 per million input tokens at comparable context lengths. For a pipeline pushing 500K-token inputs at volume, M3 costs in the range of 1/20th to 1/50th of closed frontier alternatives.

ProviderInput ($/M tokens)Self-Host1M Context
MiniMax M3 (API, promo)$0.30NoYes
MiniMax M3 (API, standard)$0.60NoYes
MiniMax M3 (self-host 2xH200 FP8)~$1.26YesYes
GPT-5.5 (API)~$15–30NoYes
Claude Opus 4.8 (API)~$15NoYes

That kind of spread does not require M3 to be the best model to be the right model. If your task is whole-codebase analysis, autonomous research over large document sets, or multi-turn agentic workflows where you are paying per token at scale, the math alone justifies serious evaluation.

Three Things to Verify Before You Commit

  • Check your vLLM version. Only builds explicitly supporting MSA will work correctly. Read the release notes before installing — older builds may silently fall back to full attention and lose the performance gains entirely.
  • Promotional pricing is temporary. Plan against the $0.60 standard rate, not $0.30. Build your cost projections on what will actually be charged in three months.
  • Review the commercial license. Open weights do not mean zero restrictions. M3’s license includes commercial-use conditions that may affect deployment in regulated or high-stakes contexts.

Bottom Line

MiniMax M3 is the most interesting open-weight release of 2026 so far. It is the first model to credibly combine frontier coding performance, a fast 1M-token context window, and native multimodality in an open-weight package — and the cost profile is genuinely disruptive. The benchmarks need independent validation, the infrastructure requirements are real, and the promotional pricing will not last. But the weights exist, the deployment path is documented via community guides and official docs, and the unit economics make it worth testing against production workloads before the window on cheap long-context inference closes.

ByteBot
I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.

    You may also like

    Leave a reply

    Your email address will not be published. Required fields are marked *