
Moonshot AI dropped Kimi K2.7 Code on June 12 — a 1-trillion-parameter open-source coding model that burns 30% fewer reasoning tokens than its predecessor and outscores Claude Opus 4.8 on MCP tool-calling benchmarks. The weights are on Hugging Face under a Modified MIT license. The API runs at $0.95 per million input tokens. For teams running agentic coding loops at scale, the token math alone makes this worth a serious look.
The Token Efficiency Case
The headline number — 30% fewer reasoning tokens — compounds. A 12-hour autonomous coding session that consumed two million reasoning tokens on K2.6 now uses roughly 1.4 million on K2.7 Code. At API prices, that difference is real money, and the savings grow with session length and parallelism.
K2.7 Code introduces a mandatory preserve_thinking mode that retains the model’s reasoning chain across multi-turn interactions instead of resetting it with each message. That continuity enables better coherence on long-horizon tasks — the model builds on its own previous reasoning rather than starting fresh every turn.
There is a trade-off worth flagging: K2.6 had an optional instant mode that let you skip the thinking trace entirely for simple completions. K2.7 Code removes that option. Every call goes through full reasoning. Teams that relied on instant mode for cheap, fast completions will find K2.7 Code more expensive for that use case.
Tool Use: The Benchmark That Actually Matters
On MCPMark Verified — a benchmark testing real MCP servers including Notion, GitHub, Postgres, Filesystem, and Playwright — K2.7 Code scores 81.1%. Claude Opus 4.8 scores 76.4% on the same test.
That gap matters because MCPMark tests real tool-calling behavior against real APIs, not synthetic code generation on curated problems. In 2026, most production AI coding setups run through MCP agents. A 4.7-point lead on the benchmark closest to production reality is a meaningful signal.
What is missing: independent SWE-bench Verified numbers. Claude Opus 4.8 sits at 88.6% on SWE-bench Verified. Kimi K2.7 Code has published no equivalent third-party score as of this writing. Moonshot’s vendor benchmark improvements are plausible but unverified by outside parties. Early practitioners have noted the public claims do not always replicate in production harnesses. Wait for independent evaluation before making this your primary model.
How to Use It
Access is not the bottleneck. Kimi K2.7 Code is available through Cloudflare Workers AI, OpenRouter (moonshotai/kimi-k2.7-code), and directly via the Moonshot API at platform.moonshot.ai. Both OpenAI-compatible (/v1) and Anthropic-compatible (/anthropic) endpoints are available, so teams on either SDK can swap the base URL and model name without touching the rest of their code.
The architecture: 1T total parameters, 32B active per token via Mixture-of-Experts, 384 experts per layer with 8 selected plus one shared. MLA attention compresses the KV cache, which is how a 256K context window stays manageable in practice. A 400-million-parameter MoonViT encoder handles multimodal inputs — pass screenshots or diagrams and the model generates code from them.
The open weights are on Hugging Face under a Modified MIT license permitting commercial use with attribution. Self-hosting requires a minimum of eight H100 GPUs for INT4 quantization (~500 GB VRAM), or eight H200s for FP8. Recommended inference stack: vLLM with tensor parallelism 8, SGLang, or KTransformers. At $0.95 per million input tokens on the managed API, self-hosting only pencils out at tens of billions of tokens per month.
The Bottom Line
Kimi K2.7 Code is the right model to run experiments on if token cost is a genuine budget line in your agentic coding workflows. The 30% token reduction is Moonshot’s most defensible claim, and the OpenAI-compatible API makes a trial frictionless. The open weights under a permissive license are a genuine asset for teams that need deployment flexibility.
What it is not yet is a proven replacement for Claude Opus 4.8 on complex coding tasks. The missing SWE-bench data is a real gap, not a technicality. Run it in parallel with your current setup, evaluate on your actual workloads, and make the switch based on evidence rather than vendor benchmarks. The DevOps.com coverage covers the production considerations in detail if you want a second opinion before committing.













