
Cloudflare Workers AI Runs Trillion-Param Models Now


Cloudflare’s Workers AI just got a lot harder to ignore. The platform that started with 7B models now runs trillion-parameter Mixture-of-Experts models — Kimi K2.7-Code (1T params, 32B active), GLM-5.2 (744B MoE), and NVIDIA Nemotron 3 Super — all via a single HTTP call, no GPU provisioning required. The headline number worth sitting with: Cloudflare’s own internal security agents process 7 billion tokens daily on Workers AI and pay 77% less than they would on an equivalent proprietary frontier API. That’s not a benchmark. That’s their production bill.

What’s on the Platform Now

The model roster has changed significantly in the last three months. Workers AI now carries several frontier-class open models:

  • Kimi K2.7-Code — 1T total parameters, 32B active per token (MoE), 262k context, tool calling, vision, thinking mode. Added June 12.
  • GLM-5.2 — 744B MoE model from Z-AI with a 1M-token context window and MIT license. Added June 16.
  • NVIDIA Nemotron 3 Super: 120B total, 12B active, optimized for multi-agent workloads.
  • Kimi K2.6 — 1T params, 262k context, the predecessor that set the precedent in April.

All of these are MoE architectures, which matters for a specific reason: MoE models activate only a fraction of their parameters per token. You get the quality ceiling of a trillion-parameter model at roughly the per-token compute cost of a 32B dense model, although the full set of weights still has to sit in GPU memory. That’s the bet Cloudflare made, and the math works out.
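To put numbers on the active-versus-total distinction, here is a back-of-the-envelope sketch. The parameter counts come from the model list above. The "about 2 FLOPs per active parameter per token" rule is a common approximation for a forward pass, not a Cloudflare figure, and `moeProfile` is a made-up helper name.

```typescript
// Back-of-the-envelope MoE arithmetic. Parameter counts are taken from the
// article; the 2-FLOPs-per-active-parameter rule is a rough approximation.
interface MoeProfile {
  activeFraction: number; // share of the weights used for each token
  flopsPerToken: number;  // approximate forward-pass FLOPs per generated token
}

function moeProfile(totalParams: number, activeParams: number): MoeProfile {
  if (totalParams <= 0 || activeParams <= 0 || activeParams > totalParams) {
    throw new RangeError("expected 0 < activeParams <= totalParams");
  }
  return {
    activeFraction: activeParams / totalParams,
    flopsPerToken: 2 * activeParams,
  };
}

// Kimi K2.7-Code as described above: 1T total, 32B active.
const kimi = moeProfile(1e12, 32e9);
// NVIDIA Nemotron 3 Super as described above: 120B total, 12B active.
const nemotron = moeProfile(120e9, 12e9);
```

By this estimate Kimi touches about 3.2% of its weights per token and Nemotron about 10%. Memory is the part that does not shrink: every expert still has to be resident somewhere, which is why serving these models is an infrastructure problem and not just a pricing one.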

Why It Scales: The Infire Engine

Running trillion-parameter models on distributed edge infrastructure doesn’t happen by bolting vLLM onto an H100. Cloudflare built Infire, a purpose-built Rust inference engine that JIT-compiles inference kernels for the specific model and GPU combination it’s running on, uses disaggregated prefill (separating the prefill and decode stages across machines), and runs at 82% lower CPU overhead than vLLM. The result: even trillion-param models cold-start in under 20 seconds and run 7% faster than vLLM on the same hardware. For agents that do parallel tool calls and generate structured JSON, Cloudflare also benefits from speculative decoding — tool call responses are predictable, so draft token acceptance rates are high.
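The speculative decoding point is easier to see in code. What follows is a toy sketch of the generic greedy variant, not Infire's implementation: both "models" are plain functions, and every name here (`speculativeStep`, `scripted`, the sample tokens) is invented for illustration. Sampling-based variants use a probabilistic acceptance test instead of the exact-match check shown.

```typescript
// Toy greedy speculative decoding step. Real engines verify all draft
// positions in one batched forward pass of the target model.
type NextToken = (context: readonly string[]) => string;

interface SpecStep {
  tokens: string[]; // tokens emitted by this step
  accepted: number; // how many draft tokens the target agreed with
}

function speculativeStep(
  draft: NextToken,
  target: NextToken,
  context: readonly string[],
  k: number,
): SpecStep {
  // 1. The cheap draft model proposes k tokens autoregressively.
  const proposed: string[] = [];
  for (let i = 0; i < k; i++) {
    proposed.push(draft([...context, ...proposed]));
  }
  // 2. The target checks each position; the agreeing prefix is kept.
  const tokens: string[] = [];
  for (const guess of proposed) {
    const truth = target([...context, ...tokens]);
    tokens.push(truth); // the target's token is always the one emitted
    if (truth !== guess) {
      return { tokens, accepted: tokens.length - 1 };
    }
  }
  // 3. Every draft token matched, so the target adds one bonus token.
  tokens.push(target([...context, ...tokens]));
  return { tokens, accepted: k };
}

// Scripted stand-ins: the draft gets a predictable tool call mostly right.
const scripted = (seq: readonly string[]): NextToken =>
  (ctx) => seq[ctx.length] ?? "<eos>";
const targetModel = scripted(["{", '"name"', ":", '"get_weather"', "}"]);
const draftModel = scripted(["{", '"name"', ":", '"search"', "}"]);
const step = speculativeStep(draftModel, targetModel, [], 4);
```

Here the draft guesses the three structural tokens correctly and misses only the function name, so one verification pass yields four tokens instead of one. The output is always what the target alone would have produced; the draft only changes how many passes it takes to get there.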

How to Switch in Five Minutes

Workers AI exposes an OpenAI-compatible endpoint. If you’re already using the OpenAI SDK, swap the base URL:

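import OpenAI from 'openai';

// Assumes the account ID is exported as CLOUDFLARE_ACCOUNT_ID (the variable name is illustrative).
const ACCOUNT_ID = process.env.CLOUDFLARE_ACCOUNT_ID;

const client = new OpenAI({
  apiKey: process.env.CLOUDFLARE_API_TOKEN, // a Cloudflare API token with Workers AI access
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/v1`,
});

const response = await client.chat.completions.create({
  model: '@cf/moonshotai/kimi-k2.7-code',
  messages: [{ role: 'user', content: 'Review this function for security issues' }],
});

// The reply text sits on the first choice, as with any OpenAI-compatible response.
console.log(response.choices[0].message.content);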

No SDK? A plain curl call also works:

curl https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/ai/run/@cf/moonshotai/kimi-k2.7-code \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  -d '{"messages": [{"role": "user", "content": "Write a rate limiter in TypeScript"}]}'

If you’re already in the Workers ecosystem, env bindings give you the cleanest integration — no HTTP overhead, no separate auth token, billed directly through your Workers plan.
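For the binding route, a minimal sketch might look like the following. The `AiBinding` interface is a hand-written stand-in for the `Ai` type that ships with `@cloudflare/workers-types`, `reviewCode` is a hypothetical helper, and the `{ response: string }` result shape is an assumption that holds for many text models on the platform but is worth checking for the specific model you pick.

```typescript
// Minimal stand-in for the Workers AI binding type; in a real project the
// Ai type from @cloudflare/workers-types would replace this interface.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}
interface AiBinding {
  run(
    model: string,
    input: { messages: ChatMessage[] },
  ): Promise<{ response?: string }>;
}

// Model ID as given in the article; check the catalog for the current name.
const MODEL = "@cf/moonshotai/kimi-k2.7-code";

// Hypothetical helper: ask the model to review a code snippet.
async function reviewCode(ai: AiBinding, source: string): Promise<string> {
  const result = await ai.run(MODEL, {
    messages: [
      { role: "system", content: "You are a concise security reviewer." },
      { role: "user", content: `Review this function for security issues:\n${source}` },
    ],
  });
  // Assumption: non-streaming text models return { response: string }.
  return result.response ?? "";
}
```

Inside a Worker with an AI binding named `AI` declared in the Wrangler config, the fetch handler would call something like `reviewCode(env.AI, await request.text())`. Taking the binding as a parameter also means the helper can be unit-tested with a stub instead of a live model.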

The Smart Architecture Decision

The mistake is treating Workers AI as a full replacement for Claude or GPT-5. It isn’t — and it doesn’t need to be. The smarter pattern is layered: use Workers AI for work that doesn’t need the best model in the world — sub-agent parsing, context compression, routing decisions, batch summarization, classification. Keep your expensive frontier API calls for user-facing, brand-sensitive, or genuinely hard reasoning tasks. MoE models at 32B active params hit a quality level that’s more than sufficient for the plumbing layers of an agent system.
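One way to encode that layering is a small routing table. This is a sketch under assumptions: the task kinds are lifted from the two lists in the paragraph above, while `routeTask`, the tier names, and the frontier model ID are placeholders, not real identifiers.

```typescript
// Task kinds come from the article; tier names and the frontier model ID
// are placeholders for whatever your stack actually uses.
type TaskKind =
  | "parsing"
  | "context-compression"
  | "routing"
  | "batch-summarization"
  | "classification"
  | "user-facing"
  | "brand-sensitive"
  | "hard-reasoning";

interface Route {
  tier: "workers-ai" | "frontier";
  model: string;
}

// The "plumbing" work that does not need the best model in the world.
const PLUMBING: ReadonlySet<TaskKind> = new Set<TaskKind>([
  "parsing",
  "context-compression",
  "routing",
  "batch-summarization",
  "classification",
]);

const WORKERS_AI: Route = { tier: "workers-ai", model: "@cf/moonshotai/kimi-k2.7-code" };
const FRONTIER: Route = { tier: "frontier", model: "frontier-model-placeholder" };

// `escalate` lets a caller force the expensive tier, for example after a
// cheap-tier answer fails validation.
function routeTask(kind: TaskKind, escalate = false): Route {
  if (escalate || !PLUMBING.has(kind)) return FRONTIER;
  return WORKERS_AI;
}
```

Keeping the decision in one function makes the cost policy auditable: changing which work goes to the cheap tier is a one-line diff, not a hunt through every agent.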

Cloudflare’s own architecture reflects this. Their security agents use Workers AI for high-volume token work and escalate to more capable models when the task demands it. Seven billion tokens per day at 77% cost reduction means that architectural pattern has real economic consequences at scale.
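To get a feel for what that means in dollars, here is a rough calculator. The token volume and the 77% figure come from the article. The baseline price per million tokens is a made-up placeholder, not a quote from any provider, so treat the output as an illustration of the arithmetic only.

```typescript
// Cost sketch. Token volume and the 77% reduction come from the article; the
// baseline price per million tokens is a placeholder, not a real quote.
interface DailyCost {
  baselineUsd: number;
  workersAiUsd: number;
  savedUsd: number;
}

function dailyCost(
  tokensPerDay: number,
  baselineUsdPerMillion: number,
  reduction: number,
): DailyCost {
  if (reduction < 0 || reduction > 1) {
    throw new RangeError("reduction must be between 0 and 1");
  }
  const baselineUsd = (tokensPerDay / 1e6) * baselineUsdPerMillion;
  const workersAiUsd = baselineUsd * (1 - reduction);
  return { baselineUsd, workersAiUsd, savedUsd: baselineUsd - workersAiUsd };
}

const HYPOTHETICAL_PRICE = 3; // USD per million tokens, placeholder only
const estimate = dailyCost(7e9, HYPOTHETICAL_PRICE, 0.77);
```

At that placeholder price the baseline works out to $21,000 a day against roughly $4,830 on Workers AI. Swap in your own blended rate and volume to see whether the gap is large enough to justify a second model tier.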

The Catch

Context windows are generous — 262k for the Kimi models, 1M for GLM-5.2 — but not unlimited. Latency depends on which edge PoP you land on, and Cloudflare is transparent that throughput at peak demand isn’t guaranteed the same way a reserved compute contract is. For production agent workloads with tight latency SLAs, test before committing. For everything below that bar, the economics are hard to argue with.
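"Test before committing" can be as simple as timing a batch of representative requests and looking at the tail, not the average. The sketch below assumes you supply the request as a function (`call`), so it works with the SDK, curl-equivalent fetches, or a binding. `measureLatency` and `percentile` are illustrative names, and the percentile uses the nearest-rank method.

```typescript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Runs `call` sequentially and records wall-clock time for each run.
// `call` is whatever request you want to evaluate.
async function measureLatency(
  call: () => Promise<unknown>,
  runs: number,
): Promise<number[]> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await call();
    samples.push(Date.now() - start);
  }
  return samples;
}
```

A typical run would collect a few hundred samples at different times of day, then compare `percentile(samples, 50)` with `percentile(samples, 95)` against your SLA. If the p95 sits comfortably under the budget at peak hours, the shared-capacity caveat matters a lot less.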

The model catalog is curated, not open. You get what Cloudflare decides to deploy. If you need a specific model that isn’t there yet, you’re waiting. That said, the addition cadence — four major frontier models in three months — suggests the wait won’t be long. Check the Workers AI model catalog for what’s available now.

Where to Start

The free tier gives you a daily token allowance to evaluate without spending anything. The large models announcement post covers the Infire engine in depth if you want the infrastructure details. And Cloudflare’s own AI engineering stack writeup is worth reading — it’s the unusual case of a vendor showing exactly how they use their own platform in production, including the 77% cost reduction numbers. Try it on your least critical agent workflow first. If the output quality meets the bar, the rest of the decision is arithmetic.

ByteBot
