
MiniMax committed to open-sourcing M3’s weights within 10 days of its June 1 API launch. Today is day ten. The weights are on Hugging Face, Ollama has the model ready to pull, and the question shifts from “when” to “how do I actually run this.” If you have been waiting for a frontier-class open-weight model with a genuine 1M-token context window you can run on your own hardware, here is how to get there in the next 20 minutes.
Why MSA Makes 1M Context Actually Usable
Every model announces a long context window. Most of them are technically accurate and practically useless — the cost and latency at 500K tokens makes them unworkable in real agent loops. MiniMax Sparse Attention (MSA) is the architecture that changes this math.
Instead of running full attention across every token pair, MSA uses a lightweight index branch that scans incoming tokens and selects which blocks of the key-value cache are actually relevant. Expensive attention runs only on those selected blocks. The result at 1M tokens: 15.6x faster decoding and 9.7x faster prefill compared to M2’s full-attention baseline, with per-token compute cut to roughly 1/20th. Unlike DeepSeek’s Multi-head Latent Attention, which compresses key-value pairs into a lower-dimensional space, MSA operates on uncompressed keys and values — the selections are sparse, not approximate. That is a meaningful distinction when output quality on long-context tasks matters.
The Quickest Path: Ollama
Two commands to get M3 running locally:
ollama pull minimax-m3
ollama run minimax-m3
If you want to test before committing to the hardware requirements, Ollama offers a cloud-hosted variant with zero local setup and zero data retention:
ollama run minimax-m3:cloud
The :cloud tag routes your requests to Ollama’s US-hosted M3 instance. It is not a local run, but it gives you a valid benchmark of whether M3 fits your use case before you commit to a multi-GPU server build.
Production Path: vLLM with OpenAI-Compatible Endpoint
For teams already running OpenAI-compatible infrastructure, the migration story is as clean as it gets. MiniMax confirmed vLLM support from day one of weight release. Launch the server:
pip install vllm
python -m vllm.entrypoints.openai.api_server --model minimax/minimax-m3 --tensor-parallel-size 4 --max-model-len 1048576 --port 8000
Then point your existing client at it:
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="not-needed"
Any existing code that calls OpenAI or Anthropic’s SDK works unchanged. Set --tensor-parallel-size to match the number of GPUs on your machine. Drop --max-model-len to a smaller value if you want to reduce memory pressure without losing the model.
Hardware Reality: What You Actually Need
M3 is a large MoE model. Consumer single-GPU setups are not going to work for local inference. Here is the honest breakdown:
| Hardware | Quantization | Est. Speed | Best For |
|---|---|---|---|
| Mac Studio M4 Ultra 192GB | Q4_K_M (GGUF) | 15–30 t/s | Development |
| 4× NVIDIA H100 80GB | BF16 | 60–100+ t/s | Production |
| 4× NVIDIA RTX 4090 | Q4_K_M | 20–40 t/s | Hobbyist cluster |
| Single NVIDIA RTX 4090 | — | N/A | Use the API instead |
For Mac users, Q4_K_M GGUF files are the path. Community quantizations typically appear on Hugging Face within hours of weight release. Download and run:
huggingface-cli download minimax/minimax-m3-GGUF minimax-m3-Q4_K_M.gguf
./llama-server -m minimax-m3-Q4_K_M.gguf -c 65536 --n-gpu-layers 80
If you are in a compliance-driven environment — healthcare, finance, air-gapped infrastructure — M3 is the first open-weight model where the capability-to-compliance trade-off genuinely tips toward self-hosting. The API at $0.30 per million input tokens is also compelling if hardware costs do not pencil out.
On the Benchmarks: Run Your Own Evals
M3 scores 59.0% on SWE-Bench Pro, edging GPT-5.5 (58.6%) and beating Gemini 3.1 Pro (54.2%). It trails Claude Opus 4.8’s 69.2% — and that comparison matters because the MiniMax launch used Opus 4.7 (64.3%) as the reference point. The gap is smaller than the launch materials implied.
More importantly: every M3 benchmark was produced on MiniMax’s own infrastructure using agent scaffolding that includes Claude Code and Mini-SWE-Agent. That is not unusual — most labs benchmark this way — but it means the 59.0% is a ceiling measured under favorable conditions, not a floor on your real codebase. Run M3 on a representative sample of your actual work before committing to a production migration.
When to Use M3 vs. Staying on Claude or GPT
Self-host M3 if: you need data sovereignty, you are running high-volume workloads where even $0.30/1M adds up, or you are doing research that benefits from true 1M-token context at speed.
Stay on Claude or GPT if: you are building production agents where benchmark accuracy on hard tasks matters, your team is already integrated with the Claude ecosystem, or you do not have the hardware budget. The price gap is real, but the capability gap at the frontier is also real — M3 trails Opus 4.8 by 10 points on the hardest coding tasks. Both facts can be true at once.













