
NVIDIA VP Ian Buck personally drove to Anthropic’s San Francisco office on May 18 and handed over the first Vera CPUs. Then OpenAI. Then SpaceXAI. Then Oracle. This wasn’t a logistics story; it was a calculated signal. NVIDIA just shipped its first data center CPU built on its own custom-designed cores, and the reason it exists is the exact problem most developers building AI agents keep hitting: the CPU side of the stack is the actual bottleneck.
Your AI Agent Is Waiting on the CPU, Not the GPU
Here’s the uncomfortable truth about agentic workloads: your GPU is often idle. Every tool call, every RAG retrieval, every subprocess your agent spawns, every orchestration decision — that’s CPU work. And the numbers are ugly. Retrieval-heavy RAG systems spend 81 to 89% of total latency on retrieval. Coding agents burn 25 to 65% of latency in Bash or Python execution. Tool-heavy workflows can put 88% of total latency on CPU-side tool processing.
Traditional AI servers shipped at an 8:1 GPU-to-CPU ratio. Agentic deployments are forcing that ratio toward 1:1. The hardware world is catching up to what agentic software already demands.
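Before taking anyone's latency percentages on faith, it is worth measuring your own agent loop. Here is a minimal sketch in Python, assuming you can wrap each phase of a request in a timer. The phase labels and the `fake_retrieval` / `fake_model_call` stand-ins are hypothetical placeholders for your real retrieval and inference calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase (phase labels are hypothetical).
totals = defaultdict(float)

@contextmanager
def timed(phase):
    """Add the wall-clock time of the enclosed block to totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[phase] += time.perf_counter() - start

def latency_shares(phase_totals):
    """Return each phase's fraction of the total measured time."""
    grand_total = sum(phase_totals.values())
    if grand_total == 0:
        return {}
    return {phase: secs / grand_total for phase, secs in phase_totals.items()}

# Stand-ins for real agent steps; swap in your own calls.
def fake_retrieval():
    time.sleep(0.02)  # pretend CPU-side retrieval

def fake_model_call():
    time.sleep(0.01)  # pretend GPU-backed inference

with timed("retrieval"):
    fake_retrieval()
with timed("model"):
    fake_model_call()

for phase, share in latency_shares(totals).items():
    print(f"{phase}: {share:.0%} of measured latency")
```

Wall-clock timing like this will not separate CPU time from I/O wait, but it is usually enough to tell you whether the model call or everything around it dominates.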
What the NVIDIA Vera CPU Actually Is
Vera runs 88 custom “Olympus” cores built on the Armv9.2-A architecture. NVIDIA designed the cores itself rather than licensing an off-the-shelf Arm core design. Each core uses NVIDIA’s Spatial Multithreading, which physically partitions core resources between its two threads rather than having them contend for shared resources, as in conventional simultaneous multithreading such as Intel’s Hyper-Threading. That gives you 176 threads with more predictable latency under concurrent workloads.
The headline spec is memory bandwidth: 1.2 TB/s via LPDDR5X across a 1,024-bit interface and eight SOCAMM modules. Per core, that’s roughly 14 GB/s (1.2 TB/s divided by 88 cores), about 3x what traditional data center CPUs provision per core. For agents constantly shuffling context windows, embedding lookups, and long chains of tool outputs through memory, that bandwidth difference matters in practice. The chip also connects to Rubin GPUs via NVLink-C2C at 1.8 TB/s of coherent bandwidth, more than 10x what a PCIe 5.0 x16 link can sustain (about 128 GB/s bidirectional).
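The per-core figure is simple division, and it is easy to check. The sketch below uses only the numbers quoted above (1.2 TB/s, 88 cores, 176 threads) and treats 1 TB/s as 1,000 GB/s. The "implied baseline" is just what the "about 3x" claim works out to, not a measured figure for any specific competing CPU.

```python
# Figures quoted in the article; decimal units (1 TB/s = 1000 GB/s) assumed.
TOTAL_BANDWIDTH_GB_S = 1200.0
CORES = 88
THREADS = 176

def bandwidth_share(total_gb_s, units):
    """Evenly divide total memory bandwidth across cores or threads."""
    return total_gb_s / units

per_core = bandwidth_share(TOTAL_BANDWIDTH_GB_S, CORES)
per_thread = bandwidth_share(TOTAL_BANDWIDTH_GB_S, THREADS)

# What "about 3x" implies for a traditional CPU's per-core bandwidth.
implied_baseline = per_core / 3.0

print(f"per core:   {per_core:.1f} GB/s")
print(f"per thread: {per_thread:.1f} GB/s")
print(f"implied traditional per-core baseline: {implied_baseline:.1f} GB/s")
```

An even split is an idealization: real workloads rarely saturate every core at once, so a single busy core may see more than its nominal share.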
The Performance Numbers (Read the Fine Print)
NVIDIA’s own benchmarks against AMD EPYC Turin and Intel Xeon 6 Granite Rapids show 1.5x higher agentic sandbox performance, 2x efficiency, and 50% faster sandbox execution. That’s vendor data, so treat it as a ceiling, not a floor.
The more interesting numbers come from Redpanda, which tested Vera on Kafka-compatible streaming workloads. Their results: 5.6x lower latency than AMD EPYC Turin and 2.7x lower than Intel Xeon 6 Granite Rapids on a triple-replicated 24-core cluster, plus 73% higher ring-shuffle SQL throughput at 64 cores. These numbers are striking, but Redpanda is a partner, and the benchmark post omits TDP and full configuration details. Keep that in mind before you build a budget around them.
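A quick way to sanity-check what a faster CPU buys you end to end is Amdahl's law: only the CPU-bound share of latency gets faster. The sketch below assumes, purely for illustration, that the vendor's 1.5x figure applies uniformly to the whole CPU-side portion of a request and that the rest is unchanged. Neither assumption comes from NVIDIA's benchmarks.

```python
def overall_speedup(cpu_fraction, cpu_speedup):
    """Amdahl's law: end-to-end speedup when only the CPU-bound
    fraction of latency is accelerated by cpu_speedup."""
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)

# CPU-side latency shares taken from the ranges quoted earlier in the article.
for fraction in (0.25, 0.65, 0.88):
    speedup = overall_speedup(fraction, 1.5)
    print(f"CPU share {fraction:.0%}: about {speedup:.2f}x end to end")
```

The takeaway is that the more of your latency sits on the CPU side, the closer the end-to-end gain gets to the headline number, and it never exceeds it.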
When Developers Actually Get Access
Right now, Vera lives at Anthropic, OpenAI, SpaceXAI, and Oracle Cloud Infrastructure. OCI is the first hyperscaler deploying at scale — they’ve committed to hundreds of thousands of units starting in 2026. CoreWeave is confirmed as the first cloud customer for standalone Vera CPU access. Broader availability across AWS, Google Cloud, Azure, Lambda, and Nebius is expected in H2 2026.
Pricing estimates: $15–25/hr on-demand at hyperscalers; specialty clouds (CoreWeave, Lambda) typically come in 40–50% lower once allocations arrive. If you’re on OCI today, you’ll likely see Vera-backed compute options before year end. Everyone else waits.
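For budgeting, the two estimates above combine into a rough specialty-cloud range. This is a sketch on the article's own estimated figures, not published prices. The function name is hypothetical, and the pairing of the deepest discount with the lowest price (and the shallowest with the highest) is an assumption chosen to give the widest plausible range.

```python
def discounted_range(low, high, min_discount, max_discount):
    """Return (lowest, highest) hourly price after applying a discount range.

    The deepest discount is applied to the low price and the shallowest
    to the high price, which yields the widest possible range.
    """
    return low * (1.0 - max_discount), high * (1.0 - min_discount)

# Article estimates: $15-25/hr at hyperscalers, 40-50% lower at specialty clouds.
cheapest, priciest = discounted_range(15.0, 25.0, 0.40, 0.50)
print(f"specialty cloud estimate: ${cheapest:.2f} to ${priciest:.2f} per hour")
```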
The Bigger Play
Context: NVIDIA’s last major attempt at a custom CPU core was Project Denver in 2014. It failed. This one is structurally different. Vera isn’t competing on CPU benchmarks alone; it’s designed to be paired with Rubin GPUs over NVLink-C2C. NVIDIA is building an end-to-end AI factory compute stack where CPU and GPU share a coherent memory space. If you deploy on OCI or CoreWeave, you may end up on Vera whether you specifically chose it or not.
For developers, the near-term implication is pricing: as Anthropic and OpenAI deploy Vera at scale, per-token inference costs should drop as CPU-side overhead shrinks. The next milestone to watch is Jensen Huang’s COMPUTEX keynote on June 1, which will likely detail deployment timelines further; it matters if you’re planning cloud infra decisions for Q3. And if you want the full technical breakdown, NVIDIA’s technical blog post covers the architecture in depth.













