NewsAI & DevelopmentCloud & DevOpsOpen Source

GLM-5.2 on Cloudflare Workers AI: 1M Context, Open Weights

GLM-5.2 AI coding model on Cloudflare Workers AI with 1M token context window
GLM-5.2 lands on Cloudflare Workers AI with 1M context and MIT open weights

Cloudflare added GLM-5.2 to Workers AI on June 16 — one day after Z.ai released the model publicly. The model is a 744-billion-parameter Mixture-of-Experts coding model with a 1-million-token context window, function calling, and two reasoning modes. Open weights ship this week under an MIT license. If you are already using Cloudflare Workers AI for inference, switching is a model name change.

What Changed from GLM-5.1

Z.ai’s previous release, GLM-5.1, scored 58.4 on SWE-bench Pro — enough to earn a writeup here for coding eight hours straight. GLM-5.2 moves that number to 62.1, which puts it above GPT-5.5 (58.6) on the same benchmark. On FrontierSWE — a long-horizon test that measures whether an agent can complete open-ended technical projects over hours — GLM-5.2 hit 74.4%, trailing Claude Opus 4.8 (75.1%) by one point and beating GPT-5.5 (72.6%).

The more consequential change is context. GLM-5.1 handled roughly 200K tokens. GLM-5.2 handles 1 million — enough to load an entire mid-sized codebase, test suite, configuration files, and conversation history into a single session without truncation. The output cap also quadrupled to 131K tokens, so the model can generate entire files or multi-file refactors in one shot rather than forcing you to stitch outputs together. VentureBeat put the cost comparison plainly: GLM-5.2 beats GPT-5.5 on long-horizon benchmarks at roughly one-sixth the cost.

How to Use It on Workers AI

The model identifier on Cloudflare’s Workers AI is @cf/zai-org/glm-5.2. Three access paths are available: the Workers AI binding, the REST API, and AI Gateway. The binding is the most straightforward if you are already in the Cloudflare ecosystem:

// wrangler.toml: add [ai] binding = "AI"
const response = await env.AI.run("@cf/zai-org/glm-5.2", {
  messages: [
    { role: "user", content: "Refactor this function to use async/await..." }
  ]
});

For REST API access:

curl https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/ai/run/@cf/zai-org/glm-5.2 \
  --header "Authorization: Bearer $CF_API_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{"messages": [{"role": "user", "content": "Fix this bug..."}]}'

One caveat worth knowing: Workers AI currently caps GLM-5.2’s context at 262,144 tokens rather than the full 1 million. Cloudflare has stated it plans to increase this. The 256K limit is still five times what most models offer on the platform, and it covers the majority of real-world coding sessions.

High vs Max: Picking the Right Thinking Mode

GLM-5.2 ships with two thinking-effort settings. High mode produces balanced reasoning with faster response times and roughly half the token output. It is the right choice for most coding tasks, code review, and routine generation. Max mode runs extended chain-of-thought, generates up to 85K output tokens per task, and adds 30–80% to first-token latency. Z.ai recommends Max as the default for serious coding work — complex refactors, architecture decisions, multi-step debugging sessions where accuracy matters more than speed.

In practice, a hybrid approach works well: Max for the initial architecture pass, High for iteration. Thinking mode adds cost, so keep an eye on Neuron consumption on the free tier (10,000 per day) if you are running extended sessions.

Agent Tool Compatibility

GLM-5.2 ships with an OpenAI-compatible endpoint, which means it drops into agent tools without custom integration code. Claude Code, Cline, OpenCode, Roo Code, Goose, and OpenClaw all connect with a base URL swap and a model name change. Early community reports note that it plugged into agent environments immediately and produces clean code. The 1M context window is particularly useful here — agents can load entire repositories into working memory instead of summarizing and truncating on every turn.

Open Weights and Self-Hosting

Open weights land on Hugging Face under the zai-org/GLM-5.2 organization this week, MIT licensed. Ollama users can pull it with ollama pull glm5:latest once available. The catch: GLM-5.2 is a 744B MoE model. Running it locally requires 256GB or more of RAM for the 2-bit quantization. That is a serious-hardware requirement, not a laptop experiment. Most developers will use it via Workers AI or the Z.ai API — but the MIT license matters for teams that need air-gapped deployment or want to fine-tune without vendor restrictions.

According to the Cloudflare changelog entry from June 16, GLM-5.2 is available now via Workers AI binding, REST API, and AI Gateway. For teams already running inference on Workers AI, this is worth evaluating immediately — particularly if long-horizon coding tasks or large-codebase agents are part of your stack.

ByteBot
I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.

    You may also like

    Leave a reply

    Your email address will not be published. Required fields are marked *

    More in:News