
AMD CEO Lisa Su showed up to COMPUTEX 2026 this week with good news and a warning. The good news: EPYC Venice—the first 2nm HPC CPU in volume production—is shipping. The warning: high-bandwidth memory has replaced advanced packaging as the constraint that limits how fast anyone in the industry can scale AI chips. That second part applies equally to AMD, NVIDIA, and every team currently booking GPU capacity for 2026.
HBM Is Now the Wall
Su was explicit about it, which is unusual. CEOs tend to highlight what is working, not flag supply risks at developer conferences. She named HBM, specifically, as the binding constraint—and the reason is physics. Producing one gigabyte of HBM3E or HBM4 consumes roughly three times the wafer capacity required to produce the equivalent in DDR5. You cannot spin up more fab lines overnight. The three manufacturers who make essentially all of the world’s HBM—SK Hynix, Samsung, and Micron—have already committed their entire 2026 output. Epoch AI’s chip cost tracker shows HBM now makes up 63% of AI accelerator component costs, up from 52% in early 2024. SK Hynix had booked every 2026 HBM wafer before mid-2025. All of Micron’s 2026 HBM supply is already allocated. Samsung’s HBM4 qualification may slip to late 2026. The shortage, per multiple producers, runs through 2027 at minimum.
The strategic subtext in Su’s statement matters. AMD has been signaling for months that advanced packaging—the CoWoS interposers that connect GPU dies—was the previous constraint. Su is now saying AMD is satisfied with CoWoS supply. That is progress. But it means the bottleneck has moved upstream to memory, which is harder to solve on a short timeline and affects every accelerator vendor simultaneously.
Why 432 GB vs 192 GB Actually Matters for Inference
AMD’s MI455X carries 432 GB of HBM4 at 19.6 TB/s bandwidth. NVIDIA’s B200 has 192 GB at 8 TB/s. The memory gap matters more than raw compute numbers for teams running large models. When your model fits in a single GPU’s memory, you avoid the latency penalty and communication overhead of model sharding across multiple cards. At 432 GB, the MI455X can hold a quantized 405B-parameter model—or an unsharded 70B at FP16—without splitting it. The bandwidth gap translates directly to inference throughput: more bandwidth per GPU means fewer GPUs needed to hit a tokens-per-second target.
AMD’s MLPerf Inference 6.0 results project lower cost-per-token than NVIDIA at high-interactivity operating points (60+ tokens/sec/user). That is a meaningful claim if it holds in production deployments. ROCm 6.4 now supports PyTorch 2.5/2.6, vLLM, and SGLang with native Hugging Face integration, closing the software gap that previously kept most teams on NVIDIA by default.
Venice Is Real, and So Is the Helios Timeline
AMD officially confirmed on May 20 that EPYC Venice is now in volume production on TSMC’s 2nm process—making it the first HPC CPU in the industry to reach this node. Venice brings 256 cores and 512 threads under a Zen 6 architecture, with AMD claiming 70%+ performance and efficiency gains over the current EPYC Turin. It is also the host CPU for the Helios rack-scale platform, so its production confirmation means the Helios H2 2026 timeline is holding.
Helios is AMD’s answer to NVIDIA’s DGX SuperPOD: 72 MI455X GPUs across 18 compute trays, 31 TB of total HBM4, and 2.9 exaFLOPS of FP4 compute in a double-wide rack. Early adopters include Oracle and TCS India. Engineering samples ship H2 2026; mass production ramps Q2 2027.
What AMD’s $10B Taiwan Bet Is Actually Saying
Alongside the Venice announcement, Su committed more than $10 billion in investment across Taiwan’s electronics ecosystem—targeting advanced packaging capacity at TSMC and an HBM4 supply agreement with Samsung. AMD is also expanding Venice production to TSMC Arizona, providing geographic diversification now standard for any chip company operating at scale.
The AMD/Samsung MoU is the telling detail. Samsung is the second-largest HBM producer and is currently qualifying HBM4. AMD locking in Samsung supply is a hedge against SK Hynix’s dominant position—and signals how long AMD expects the scarcity window to last: long enough to justify bilateral supply agreements stretching into 2027 and 2028.
What Developers Should Do Now
- Lock in 2026–2027 GPU contracts now. GPU spot pricing will not drop meaningfully until HBM capacity expansions complete—production forecasts put that in late 2026 at the earliest, and 2027 more reliably. Reserved instances outperform spot economics right now for predictable inference load.
- Run the AMD benchmark yourself. AMD’s AI Endpoint APIs are OpenAI-compatible, run on MI350X hardware, and include 25 million free tokens for developers. ROCm 6.4 supports vLLM and SGLang natively. The software parity gap is narrower than it was a year ago.
- Size models for memory budgets, not just compute. As HBM stays scarce, the cost of memory-inefficient inference remains high. FP4 quantization on MI455X and B200 is production-grade now—consider it if you are not already running quantized weights in production.
Su’s HBM warning is not catastrophizing. She is describing a supply curve every accelerator vendor is working within. The teams that plan for it—locking compute contracts, evaluating AMD’s growing memory advantage, and keeping model footprints lean—will be in better shape when the constraint eventually eases. That is currently forecasted for 2027. Plan accordingly.













