Moonshot AI dropped Kimi K2.7-Code on June 25. It scores higher on every benchmark the company publishes and uses 30% fewer thinking tokens than its predecessor. That second part is the one worth paying attention to: in a year when every model improvement has come with a larger inference bill, K2.7-Code moves in the opposite direction.
Whether that claim holds up outside Moonshot’s own test suites is a different question, and one that doesn’t have an answer yet. But the efficiency claim is plausible, the weights are on HuggingFace under a Modified MIT license, and the API is live at rates that undercut Claude Sonnet 4.6 by roughly a factor of four on output tokens. Here’s what you actually need to know.
What Changed from K2.6
K2.7-Code is a coding-specialized fine-tune on top of the same 1-trillion-parameter Mixture-of-Experts architecture that powered K2.6. The headline numbers Moonshot published:
- +21.8% on Kimi Code Bench v2 (50.9 → 62.0)
- +11.0% on Program Bench
- +31.5% on MLS Bench Lite
- 81.1% on MCPMark Verified tool invocation (vs Claude Opus 4.8 at 76.4%)
A caveat that belongs up front: every benchmark above is a Moonshot-designed proprietary suite. There are no independent results on SWE-Bench Verified, SWE-Bench Pro, or Terminal-Bench 2.0 as of this writing. K2.6, by contrast, had public SWE-Bench Verified results (80.2%). That K2.7-Code skipped those at launch is notable, and the practitioner community has said so explicitly.
Treat “21.8% better than our last model on our own coding eval” as a credible signal, not a settled fact.
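The percentage gains in the list are relative improvements, not percentage points. A minimal sketch of that arithmetic, using the only pair of raw scores Moonshot published (Kimi Code Bench v2); the function name is illustrative:

```python
def relative_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Raw Kimi Code Bench v2 scores quoted above (K2.6 -> K2.7-Code).
old_score, new_score = 50.9, 62.0

points = new_score - old_score
print(f"absolute: +{points:.1f} points")
print(f"relative: +{relative_gain(old_score, new_score):.1f}%")
```

An 11.1-point jump reads as +21.8% because the baseline is 50.9. The other benchmarks are quoted without raw scores, so the same check cannot be repeated for them.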
The Architecture: 1 Trillion Parameters, 32 Billion Active
The efficiency story makes more sense once you understand the Mixture-of-Experts structure. K2.7-Code has 1 trillion total parameters spread across 384 expert networks, but only 8 experts fire on any given token, so about 32 billion parameters are active per forward pass. That’s roughly 3% of the total. The model is expensive to store, since all 1 trillion parameters must stay in memory, but cheap to run relative to a dense model of the same size: per-token compute is closer to that of a dense 32B model.
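To make the sparsity concrete, here is a toy sketch of top-k expert routing plus the parameter arithmetic. The random scores stand in for a learned gating network, and the `route` helper is hypothetical, not Moonshot’s implementation:

```python
import heapq
import random

TOTAL_PARAMS = 1_000_000_000_000   # 1T total, per the article
ACTIVE_PARAMS = 32_000_000_000     # ~32B active per token, per the article
NUM_EXPERTS = 384
EXPERTS_PER_TOKEN = 8


def route(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Assumption: random numbers replace the real router's learned scores.
rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_EXPERTS)]
chosen = route(scores, EXPERTS_PER_TOKEN)

active_share = ACTIVE_PARAMS / TOTAL_PARAMS
expert_share = EXPERTS_PER_TOKEN / NUM_EXPERTS

print(f"experts used for this token: {sorted(chosen)}")
print(f"share of parameters active:  {active_share:.1%}")
print(f"share of experts selected:   {expert_share:.1%}")
```

The active-parameter share (about 3.2%) is higher than the share of experts selected (about 2.1%). A plausible explanation, not confirmed by the source, is that attention layers, embeddings, and any shared components run for every token regardless of routing.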
The 30% thinking token reduction comes from post-training improvements to how the model plans its reasoning chains — not from architectural changes. Moonshot retrained the model to reach correct conclusions with fewer intermediate steps. Whether that matters in practice depends heavily on your workload.
The Catch: No Non-Thinking Mode
K2.7-Code always runs with extended reasoning enabled. There is no fast path, no non-thinking mode, no way to skip the chain-of-thought. If you want quick completions, simple autocomplete, or low-latency responses for interactive coding, K2.6 is still the better option.
K2.7-Code is built for the opposite use case: long-horizon agentic sessions, multi-file refactoring, complex debugging chains, and tool-use workflows that can sustain 12+ hours of continuous execution. Moonshot’s own Kimi Code plugin supports agent swarms of up to 300 parallel sub-agents on higher tiers. This is not a model you reach for to quickly generate a function — it’s a model you point at a codebase and leave running.
Cost and Access
The pricing is competitive with open-weight alternatives and aggressive relative to frontier proprietary models:
| Model | Input $/M | Output $/M |
|---|---|---|
| K2.7-Code (Kimi API) | $0.95 | $4.00 |
| K2.7-Code (OpenRouter) | $0.75 | $3.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
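On output tokens, where agent loop costs actually accumulate, K2.7-Code is roughly 4x cheaper than Sonnet 4.6 ($4.00 or $3.50 versus $15.00 per million). The 30% reduction in thinking tokens is measured against K2.6, not against Sonnet, so it compounds with the lower price only if K2.7-Code’s reasoning traces are no longer than those of the model you are replacing. For high-volume agentic workloads, that is worth measuring on your own tasks.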
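A rough cost comparison using the prices in the table. The token counts and the share of output spent on thinking are hypothetical placeholders; substitute numbers from your own agent logs:

```python
# (input $/M tokens, output $/M tokens), copied from the table above.
PRICES = {
    "K2.7-Code (Kimi API)": (0.95, 4.00),
    "K2.7-Code (OpenRouter)": (0.75, 3.50),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def session_cost(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one session at per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


# Hypothetical workload: 2M input tokens, 1M output tokens.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 1_000_000

costs = {
    name: session_cost(INPUT_TOKENS, OUTPUT_TOKENS, p_in, p_out)
    for name, (p_in, p_out) in PRICES.items()
}
for name, cost in costs.items():
    print(f"{name:24s} ${cost:6.2f}")

# Output-price ratio between Sonnet 4.6 and the Kimi API.
output_ratio = PRICES["Claude Sonnet 4.6"][1] / PRICES["K2.7-Code (Kimi API)"][1]
print(f"output price ratio: {output_ratio:.2f}x")

# Assumption: 80% of output tokens are thinking tokens, and those shrink by 30%.
THINKING_SHARE = 0.8
reduced_output = OUTPUT_TOKENS * (1 - 0.30 * THINKING_SHARE)
reduced_cost = session_cost(INPUT_TOKENS, reduced_output, *PRICES["K2.7-Code (Kimi API)"])
print(f"Kimi API with 30% fewer thinking tokens: ${reduced_cost:.2f}")
```

With input tokens included, the gap on this made-up workload is closer to 3.5x than 4x, because the input price ratio is smaller than the output price ratio.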
If you’re already using Claude Code, you can route it through Moonshot’s API with two environment variables:
```bash
export ANTHROPIC_BASE_URL="https://api.moonshot.ai/v1"
export ANTHROPIC_AUTH_TOKEN="your_kimi_api_key"
```
Your Claude Code workflow stays identical; the backend switches to K2.7-Code. Retrieve your key from platform.kimi.ai.
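If you would rather not export those variables globally, one option is to set them only for a single child process. A minimal sketch, assuming the base URL quoted above and a Claude Code CLI installed as `claude`; the `kimi_env` helper is hypothetical:

```python
import os


def kimi_env(api_key, base_env=None):
    """Copy an environment and point the Anthropic variables at Moonshot."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = "https://api.moonshot.ai/v1"  # URL as given in the article
    env["ANTHROPIC_AUTH_TOKEN"] = api_key
    return env


# Build the environment from a small explicit base so the example is reproducible.
env = kimi_env("your_kimi_api_key", base_env={"PATH": "/usr/bin"})
print(env["ANTHROPIC_BASE_URL"])

# To launch the CLI with this environment (assumes `claude` is on PATH):
# import subprocess
# subprocess.run(["claude"], env=kimi_env(os.environ["KIMI_API_KEY"]))
```

This keeps the override scoped to one session, so other terminals keep talking to the default backend.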
For self-hosting: the full INT4-quantized weights are at moonshotai/Kimi-K2.7-Code on HuggingFace under a Modified MIT license that permits commercial use. Realistic hardware requirement is an 8-GPU node with roughly 640GB of VRAM or more. Eight 80GB cards reach exactly 640GB; eight H200s at 141GB each give about 1.1TB and leave more room for long contexts. Community GGUF builds from Unsloth work with llama.cpp, Ollama, and LM Studio for more modest setups. Deployment uses vLLM 0.19.1+ or SGLang with the kimi_k2 tool-call parser flag.
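A back-of-the-envelope check on the memory figure. It assumes exactly 4 bits per weight and ignores quantization scales, KV cache, and activations, so treat the result as a lower bound on weight memory rather than a sizing guide:

```python
TOTAL_PARAMS = 1_000_000_000_000  # 1T parameters, per the article
BITS_PER_WEIGHT = 4               # INT4; scale/zero-point overhead ignored

# Weight memory in decimal gigabytes.
weight_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9


def node_vram_gb(num_gpus, gb_per_gpu):
    """Total VRAM of a node with identical GPUs."""
    return num_gpus * gb_per_gpu


QUOTED_VRAM_GB = 640  # figure quoted in the article
headroom_gb = QUOTED_VRAM_GB - weight_gb

print(f"INT4 weights:          ~{weight_gb:.0f} GB")
print(f"headroom within 640GB: ~{headroom_gb:.0f} GB")
print(f"8 x 80 GB GPUs:         {node_vram_gb(8, 80)} GB")
print(f"8 x 141 GB GPUs (H200): {node_vram_gb(8, 141)} GB")
```

Roughly 500GB goes to weights alone, which leaves about 140GB of a 640GB budget for everything else. Long agentic sessions with large contexts will press on that margin.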
Who Should Try It Now
If you’re running cost-sensitive agent loops that currently use Sonnet 4.6 or another proprietary model, K2.7-Code is worth evaluating immediately. The price delta is large enough to justify the test even without independent benchmark validation.
If you need confidence from independent SWE-Bench or Terminal-Bench results before committing to a production switch, wait a few weeks. Community evaluations usually follow an open-weight release, but none have been published as of this writing.
If your use case is anything other than long-horizon agentic coding, K2.6 or a different model is probably a better fit. K2.7-Code’s “always thinking” constraint makes it the wrong tool for fast, lightweight tasks — no matter how good the benchmark headlines look.
The open weights are there, the license is permissive, and the efficiency gains appear real. The benchmark claims need outside verification before they become gospel. Both things are true simultaneously.