
Alibaba’s Qwen team launched Qwen3.7-Plus on June 2 — and the benchmark numbers warrant attention. The model scored 79.0 on ScreenSpot Pro, the standard measure of GUI grounding accuracy, beating GPT-5.4’s 67.4 and Claude-Opus-4.6’s 49.5 by significant margins. The API is live now through Alibaba Cloud Model Studio, priced at $0.40 per million input tokens. If you are building anything that involves computer-use agents or visual automation pipelines, this is the release to evaluate this week.
More Than a Vision Model
Qwen3.7-Plus is not just Qwen3.7 with image input bolted on. The “Plus” designation signals five added capabilities that together constitute an agentic loop: deep reasoning, self-programming, tool invocation, verification and testing, and autonomous iteration.
Most vision models stop at describing what they see. Qwen3.7-Plus loops. It writes code to address what it observed, executes that code, checks whether the output matched the goal, and retries if it did not. Alibaba describes practical demonstrations where the model manages small-scale project lifecycles — requirement analysis through iterative debugging — without human intervention between steps.
The autonomous iteration capability is the most consequential addition. The difference between a model that shows you what is wrong with a UI screenshot and one that fixes it, runs the fix, and verifies the result is not incremental. It is architectural.
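The write, execute, check, retry cycle described above can be sketched generically. This is a minimal illustration, not Alibaba's implementation: `iterate_until_verified`, `Attempt`, and the three callables are hypothetical names, and the toy stand-ins at the bottom replace what would be model calls and a real test harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    code: str      # candidate fix proposed on this iteration
    output: str    # what running that candidate produced
    passed: bool   # whether the verifier accepted the output


def iterate_until_verified(
    propose: Callable[[Optional[Attempt]], str],
    execute: Callable[[str], str],
    verify: Callable[[str], bool],
    max_iters: int = 5,
) -> List[Attempt]:
    """Propose, run, check, and retry until verification passes or the budget runs out."""
    history: List[Attempt] = []
    last: Optional[Attempt] = None
    for _ in range(max_iters):
        code = propose(last)        # the previous attempt is fed back as context
        output = execute(code)
        last = Attempt(code, output, verify(output))
        history.append(last)
        if last.passed:
            break                   # stop as soon as the goal is met
    return history


# Toy stand-ins (hypothetical): the first proposal is wrong, the second is right.
def toy_propose(last: Optional[Attempt]) -> str:
    return "button.width = 0" if last is None else "button.width = 120"


def toy_execute(code: str) -> str:
    return "visible" if code.endswith("120") else "hidden"


history = iterate_until_verified(toy_propose, toy_execute, lambda out: out == "visible")
print(len(history), history[-1].passed)  # 2 True
```

The `max_iters` budget is the important design choice in a loop like this: without it, a verifier that can never be satisfied would keep the agent retrying indefinitely.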
Why GUI Grounding Numbers Matter at Scale
ScreenSpot Pro measures a single, specific ability: look at a screenshot and identify the exact pixel coordinates of the target element. It is the bottleneck capability for everything called “computer use” in 2026.
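To make the metric concrete, here is a minimal sketch of how a grounding score of this kind can be computed. It assumes hit-test scoring, where a prediction counts as correct when the predicted click point lands inside the target element's bounding box; the function names and sample data are illustrative, and the benchmark's actual harness may differ in detail.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]               # (x, y) in pixels
Box = Tuple[float, float, float, float]   # (left, top, right, bottom) in pixels


def is_hit(point: Point, box: Box) -> bool:
    """True when the predicted click point falls inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(preds: Sequence[Point], targets: Sequence[Box]) -> float:
    """Percentage of predictions that land inside their target box."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(preds, targets))
    return 100.0 * hits / len(preds)


# Made-up sample: two of the three predicted points land on their targets.
sample_preds = [(50, 20), (300, 400), (10, 10)]
sample_targets = [(40, 10, 120, 30), (0, 0, 100, 100), (0, 0, 20, 20)]
print(round(grounding_accuracy(sample_preds, sample_targets), 1))  # 66.7
```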
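At 79.0, Qwen3.7-Plus correctly identifies roughly four out of five UI targets on the first attempt. Consider what that means across a 50-step agentic workflow. Per-step errors compound, so no model at these accuracy levels is likely to finish 50 steps without a single miss: even at 79% per step, a clean single-pass run is rare. What changes is how well the iteration loop can recover. Given a bounded number of retries per step, a model that misses one target in five is much more likely to clear every step than one that misses one in three. The benchmark gap between 79 and 67 is not marketing; it is a large part of the difference between usable computer-use agents and proof-of-concept demos.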
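The compounding argument above can be checked with a few lines of arithmetic. The sketch below rests on simplifying assumptions that are not from the benchmark: every step succeeds independently with the first-try probability, retries are independent too, and the verifier always detects a miss. The function names are illustrative.

```python
def step_success(p_first_try: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries hits the target."""
    return 1.0 - (1.0 - p_first_try) ** attempts


def workflow_success(p_first_try: float, steps: int, attempts: int = 1) -> float:
    """Chance that every step of the workflow succeeds within its retry budget."""
    return step_success(p_first_try, attempts) ** steps


for p in (0.67, 0.79):
    single = workflow_success(p, steps=50)
    retried = workflow_success(p, steps=50, attempts=3)
    print(f"p={p:.2f}  single pass: {single:.1e}  up to 3 attempts per step: {retried:.2f}")
```

Under these assumptions a single-pass run almost never completes at either accuracy level. With up to three attempts per step, the 50-step completion rate comes out at roughly 63% for a 0.79 model versus roughly 16% for a 0.67 model, which is where the 12-point benchmark gap becomes visible.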
Terminal-Bench 2.0 reinforces the picture: Qwen3.7-Plus scored 70.3, ahead of DeepSeek-V4-Pro Max (67.9) and Gemini-3.1 Pro (63.5). Terminal-Bench measures iterative code execution (run code, observe output, fix, and retry), which makes it a direct proxy for how useful a model is in developer automation pipelines.
| Model | ScreenSpot Pro | Input (per M tokens) | Open Weights |
|---|---|---|---|
| Qwen3.7-Plus | 79.0 | $0.40 | No |
| GPT-5.4 | 67.4 | ~$2.50 | No |
| Claude-Opus-4.6 | 49.5 | $15.00 | No |
| Gemini-3.1 Pro | — | $1.25 | No |
Getting API Access
The model runs through Alibaba Cloud Model Studio on an OpenAI-compatible endpoint. If you are already using the OpenAI Python SDK, the switch comes down to three values: the API key, the base URL, and the model name.
```python
from openai import OpenAI

# Point the standard OpenAI client at the Model Studio compatible-mode endpoint.
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Send a screenshot plus an instruction in a single user message.
response = client.chat.completions.create(
    model="qwen3.7-plus",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://your-screenshot.png"}},
            {"type": "text", "text": "What is broken on this UI? Write a fix."},
        ],
    }],
)
```
Create an account at Model Studio, activate the service, and generate an API key. New accounts receive a one-time free quota sufficient to run meaningful tests. One caveat: keys are region-scoped. A Singapore-provisioned key will not authenticate against the Beijing endpoint, so pick your region at account creation.
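Because keys are region-scoped, it can help to keep the key and the endpoint paired in one place. The sketch below is an assumption-laden illustration: the region labels, the `client_settings` helper, and the `DASHSCOPE_API_KEY` environment variable name are choices made here, and the Beijing URL is the commonly documented mainland endpoint rather than something stated in this article. Confirm both URLs against your Model Studio console before relying on them.

```python
import os
from typing import Dict, Mapping

# Assumed region-to-endpoint mapping; verify against the Model Studio console.
BASE_URLS: Dict[str, str] = {
    "singapore": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    "beijing": "https://dashscope.aliyuncs.com/compatible-mode/v1",
}


def client_settings(region: str, env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the api_key/base_url pair for the region the key was provisioned in."""
    if region not in BASE_URLS:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(BASE_URLS)}")
    api_key = env.get("DASHSCOPE_API_KEY")  # variable name is an assumption
    if not api_key:
        raise RuntimeError("set DASHSCOPE_API_KEY to a key provisioned in this region")
    return {"api_key": api_key, "base_url": BASE_URLS[region]}


# Usage with the OpenAI SDK (not executed here):
#   from openai import OpenAI
#   client = OpenAI(**client_settings("singapore"))
```

Failing fast on an unknown region or a missing key surfaces the mismatch at startup instead of as an authentication error on the first request.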
The Catch: No Open Weights
Qwen built its community reputation on releasing open weights. Qwen 2.5 and Qwen 3.6 were both publicly available for self-hosting and fine-tuning. Qwen3.7-Plus breaks that pattern — it is API-only, proprietary, and there is no path to running it on your own infrastructure today.
For developers working in regulated environments or air-gapped deployments, or under data residency requirements, this is a hard stop. Alibaba has floated an open-weight variant for Q3 2026, but it is unconfirmed. If open weights are a requirement now, Step 3.7 Flash (Apache 2.0) is the current alternative.
The pricing helps soften the tradeoff. At $0.40 per million input tokens and $1.60 per million output tokens, Qwen3.7-Plus costs roughly one-sixth of GPT-5.4's input price and about one thirty-seventh of Claude-Opus-4.6's, while outperforming both on GUI grounding.
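The input-price comparison is simple arithmetic on the table's figures, sketched below. The prices come from the table above (the GPT-5.4 figure is approximate there); the 200-million-token monthly workload is a made-up number for illustration, and output-token costs are left out because only Qwen3.7-Plus's output price is given.

```python
# Input prices in USD per million tokens, taken from the comparison table.
INPUT_PRICE_PER_M = {
    "Qwen3.7-Plus": 0.40,
    "GPT-5.4": 2.50,          # listed as approximate
    "Claude-Opus-4.6": 15.00,
}


def input_cost(model: str, input_tokens: int) -> float:
    """Input-token cost in USD for a given token count."""
    return INPUT_PRICE_PER_M[model] * input_tokens / 1_000_000


def price_ratio(model: str, baseline: str = "Qwen3.7-Plus") -> float:
    """How many times more expensive `model` is than `baseline` on input."""
    return INPUT_PRICE_PER_M[model] / INPUT_PRICE_PER_M[baseline]


monthly_tokens = 200_000_000  # hypothetical workload
for name in INPUT_PRICE_PER_M:
    print(f"{name}: ${input_cost(name, monthly_tokens):,.2f} per month, "
          f"{price_ratio(name):.2f}x the Qwen3.7-Plus input price")
```

On that hypothetical workload the input bill works out to $80 for Qwen3.7-Plus, $500 for GPT-5.4, and $3,000 for Claude-Opus-4.6, which is the 6.25x and 37.5x spread behind the rounded ratios.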
What to Do Now
If you are building computer-use agents or visual automation workflows, three steps apply immediately. First, benchmark your current pipeline against Qwen3.7-Plus using the free quota — the ScreenSpot Pro delta is wide enough to produce measurable differences on real tasks. Second, scope your data requirements and confirm whether Alibaba Cloud’s data handling meets your compliance posture before committing. Third, watch for the open-weight announcement in Q3; if Alibaba follows through, the access question becomes moot for teams that need self-hosting.
The full launch details are on MarkTechPost. The API guide with region-specific setup notes is at apidog.com.













