AI & DevelopmentOpen SourceDeveloper Tools

MiniMax M3 Open Weights Are Live: Run It Locally Now

Visualization of MiniMax M3 open-weight model running locally on GPU hardware with neural network patterns
MiniMax M3 open weights are live — run the 1M-context frontier model locally with Ollama or vLLM

MiniMax committed to open-sourcing M3’s weights within 10 days of its June 1 API launch. Today is day ten. The weights are on Hugging Face, Ollama has the model ready to pull, and the question shifts from “when” to “how do I actually run this.” If you have been waiting for a frontier-class open-weight model with a genuine 1M-token context window you can run on your own hardware, here is how to get there in the next 20 minutes.

Why MSA Makes 1M Context Actually Usable

Every model announces a long context window. Most of them are technically accurate and practically useless — the cost and latency at 500K tokens makes them unworkable in real agent loops. MiniMax Sparse Attention (MSA) is the architecture that changes this math.

Instead of running full attention across every token pair, MSA uses a lightweight index branch that scans incoming tokens and selects which blocks of the key-value cache are actually relevant. Expensive attention runs only on those selected blocks. The result at 1M tokens: 15.6x faster decoding and 9.7x faster prefill compared to M2’s full-attention baseline, with per-token compute cut to roughly 1/20th. Unlike DeepSeek’s Multi-head Latent Attention, which compresses key-value pairs into a lower-dimensional space, MSA operates on uncompressed keys and values — the selections are sparse, not approximate. That is a meaningful distinction when output quality on long-context tasks matters.

The Quickest Path: Ollama

Two commands to get M3 running locally:

ollama pull minimax-m3
ollama run minimax-m3

If you want to test before committing to the hardware requirements, Ollama offers a cloud-hosted variant with zero local setup and zero data retention:

ollama run minimax-m3:cloud

The :cloud tag routes your requests to Ollama’s US-hosted M3 instance. It is not a local run, but it gives you a valid benchmark of whether M3 fits your use case before you commit to a multi-GPU server build.

Production Path: vLLM with OpenAI-Compatible Endpoint

For teams already running OpenAI-compatible infrastructure, the migration story is as clean as it gets. MiniMax confirmed vLLM support from day one of weight release. Launch the server:

pip install vllm
python -m vllm.entrypoints.openai.api_server   --model minimax/minimax-m3   --tensor-parallel-size 4   --max-model-len 1048576   --port 8000

Then point your existing client at it:

export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="not-needed"

Any existing code that calls OpenAI or Anthropic’s SDK works unchanged. Set --tensor-parallel-size to match the number of GPUs on your machine. Drop --max-model-len to a smaller value if you want to reduce memory pressure without losing the model.

Hardware Reality: What You Actually Need

M3 is a large MoE model. Consumer single-GPU setups are not going to work for local inference. Here is the honest breakdown:

HardwareQuantizationEst. SpeedBest For
Mac Studio M4 Ultra 192GBQ4_K_M (GGUF)15–30 t/sDevelopment
4× NVIDIA H100 80GBBF1660–100+ t/sProduction
4× NVIDIA RTX 4090Q4_K_M20–40 t/sHobbyist cluster
Single NVIDIA RTX 4090N/AUse the API instead

For Mac users, Q4_K_M GGUF files are the path. Community quantizations typically appear on Hugging Face within hours of weight release. Download and run:

huggingface-cli download minimax/minimax-m3-GGUF minimax-m3-Q4_K_M.gguf
./llama-server -m minimax-m3-Q4_K_M.gguf -c 65536 --n-gpu-layers 80

If you are in a compliance-driven environment — healthcare, finance, air-gapped infrastructure — M3 is the first open-weight model where the capability-to-compliance trade-off genuinely tips toward self-hosting. The API at $0.30 per million input tokens is also compelling if hardware costs do not pencil out.

On the Benchmarks: Run Your Own Evals

M3 scores 59.0% on SWE-Bench Pro, edging GPT-5.5 (58.6%) and beating Gemini 3.1 Pro (54.2%). It trails Claude Opus 4.8’s 69.2% — and that comparison matters because the MiniMax launch used Opus 4.7 (64.3%) as the reference point. The gap is smaller than the launch materials implied.

More importantly: every M3 benchmark was produced on MiniMax’s own infrastructure using agent scaffolding that includes Claude Code and Mini-SWE-Agent. That is not unusual — most labs benchmark this way — but it means the 59.0% is a ceiling measured under favorable conditions, not a floor on your real codebase. Run M3 on a representative sample of your actual work before committing to a production migration.

When to Use M3 vs. Staying on Claude or GPT

Self-host M3 if: you need data sovereignty, you are running high-volume workloads where even $0.30/1M adds up, or you are doing research that benefits from true 1M-token context at speed.

Stay on Claude or GPT if: you are building production agents where benchmark accuracy on hard tasks matters, your team is already integrated with the Claude ecosystem, or you do not have the hardware budget. The price gap is real, but the capability gap at the frontier is also real — M3 trails Opus 4.8 by 10 points on the hardest coding tasks. Both facts can be true at once.

ByteBot
I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.

    You may also like

    Leave a reply

    Your email address will not be published. Required fields are marked *