Groq Raises $650M for Inference Neocloud After Nvidia Deal

Groq LPU inference chip data streams representing high token throughput for AI neocloud

Groq announced a $650 million funding round this week. That sounds routine until you know what happened six months ago: Nvidia paid $20 billion to license Groq’s core chip architecture and walked away with Groq’s founder, its CEO, and most of its senior engineering team. The $650M is not a victory lap. It is the company’s attempt to survive as an AI inference neocloud after the industry’s dominant player stripped it for parts.

What the Nvidia Deal Actually Was

In December 2025, Nvidia and Groq announced what both parties carefully avoided calling an acquisition. Nvidia paid approximately $20 billion for a non-exclusive, perpetual license to Groq’s Language Processing Unit (LPU) architecture. Alongside that license, Jonathan Ross — Groq’s founder and CEO — and president Sunny Madra joined Nvidia, along with the majority of Groq’s engineering leadership.

The legal structure matters here. A license is not an acquisition. Groq remains an independent company and retains the right to use its own intellectual property. However, Senators Warren and Blumenthal were unconvinced by the framing, formally calling it a “reverse acquihire” in March 2026 and urging the DOJ and FTC to investigate whether the deal was structured to avoid antitrust review under Hart-Scott-Rodino thresholds. That inquiry remains open.

Meanwhile, the Groq 3 LPU appeared inside Nvidia’s Vera Rubin platform at GTC 2026. Nvidia is already commercializing Groq’s architecture through its own product lines — developed by the team it hired from Groq.

The LPU Speed That Made This Worth $20 Billion

To understand why Nvidia wanted Groq’s inference neocloud technology badly enough to pay $20 billion for a license, look at the throughput numbers. Llama 4 Scout runs at over 460 tokens per second on GroqCloud. The same model on an Nvidia H100 delivers roughly 100 to 150 tokens per second. For Llama 3.3 70B, Groq achieves around 800 tokens per second against 120 to 140 on GPU alternatives — a 3-4x advantage on every single inference call.

The architecture behind this is a set of deliberate tradeoffs. GPUs optimize for training workloads, using DRAM and HBM as primary storage with cache hierarchies that introduce memory latency on every weight fetch. The LPU stores model weights directly in on-chip SRAM at 80 terabytes per second of bandwidth, eliminating that memory wall. Execution is statically scheduled at compile time, removing runtime arbitration overhead. The compiler pre-computes the entire execution graph down to individual clock cycles before a single inference request runs.

These tradeoffs matter most at latency-sensitive boundaries. For voice agents, GPU-based inference consumes 400 to 600 milliseconds on the LLM call alone, leaving almost no budget for speech-to-text and text-to-speech within a natural conversation turn. Groq reduces that to 100 to 150 milliseconds. For agentic workflows that chain 5 to 20 LLM calls sequentially, a 3x per-call speedup compounds across the entire pipeline. The Groq 3 LPU, shown at GTC 2026, targets 1,500 tokens per second specifically for agentic workloads.

The Neocloud Bet: Inference-as-a-Service on Proprietary Hardware

Groq’s pivot is straightforward in concept: stop building chips for sale, start selling inference from those chips. GroqCloud is now positioned as an AI inference neocloud — a cloud provider differentiated by proprietary hardware optimized for a single workload rather than general-purpose compute. The model exists elsewhere; CoreWeave built a significant business running GPU clusters. Groq’s version is built on the LPU throughput advantage.

The numbers behind the platform are substantial. More than 3.5 million developers currently use GroqCloud, with enterprise customers including Dropbox, Volkswagen, and Riot Games running production workloads. Pricing starts at $0.05 per million input tokens for Llama 8B models, with a free tier offering 14,400 API requests per day at no cost. The company commits to beating published competitor pricing while maintaining the throughput advantage. Batch processing is available at 50% discount for async workloads. The model catalog includes Llama 4, Llama 3.3 70B, DeepSeek V4, Qwen3 32B, and Whisper Large.

The $650 million will primarily fund infrastructure expansion — more LPU clusters to handle growing inference demand — and the engineering rebuild Groq needs after losing its founding team to Nvidia. It is also, effectively, a defense budget against the company’s central structural threat.

The Risk Nvidia’s Presence Creates

Groq’s neocloud strategy has a structural vulnerability that $650 million does not eliminate. Nvidia already has the LPU architecture. The license is non-exclusive, so Groq can continue using it — but so can Nvidia. Moreover, Nvidia has distribution at scale: AWS, Azure, and Google Cloud partnerships, a CUDA developer ecosystem built over fifteen years, and hardware supply chains that no neocloud can match. The Vera Rubin Ultra platform, expected in the second half of 2027, will combine Nvidia’s silicon with HBM4E memory and LPU inference acceleration.

When Nvidia begins offering LPU-speed inference through its existing cloud partnerships, Groq’s speed advantage becomes table stakes rather than a differentiator. The company has an estimated 18 to 24 months to build enough enterprise contracts, infrastructure scale, and developer lock-in to survive as a viable alternative rather than a historical footnote.

What Developers Should Do Right Now

If your application needs fast open-source model inference today, GroqCloud is still the strongest option on the market. The throughput numbers are not close to any alternative, and the pricing is competitive. Specifically, voice agents, real-time streaming chat interfaces, and agentic loops with multiple sequential calls all benefit substantially from the LPU speed advantage.

The limitations are real: Groq only hosts open-source models. If your stack requires GPT-5, Claude, or Gemini, GroqCloud is not applicable. Fine-tuning and vision workloads beyond its current catalog are also off the table.

The longer question is whether the neocloud strategy succeeds. If Groq deploys its $650 million effectively and builds durable infrastructure before Nvidia commoditizes the LPU advantage, developers get a lasting fast-inference alternative. If it fails, the December 2025 deal will look, in retrospect, like the moment Nvidia removed the last serious competitor in AI inference hardware — and bought time by calling it a license.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.