NVIDIA RTX Spark: Running a 120B Model Locally


NVIDIA RTX Spark — Grace Blackwell superchip with 128 GB unified memory for local frontier-scale AI inference

NVIDIA just made “run it locally” a real answer for frontier-scale language models. The RTX Spark, announced June 1 at Computex, is a Grace Blackwell superchip designed for consumer laptops and compact desktops. It pairs a 20-core ARM CPU with a Blackwell GPU inside a single 128 GB unified memory pool. That last number is the story.

What RTX Spark Actually Is

RTX Spark is built around the GB10 superchip — the same silicon that powers NVIDIA’s DGX Spark workstation, now repackaged for OEM laptops from ASUS, Dell, HP, Lenovo, Microsoft Surface, and MSI. Two chip dies — a Grace CPU and a Blackwell GPU — sit on the same TSMC 3nm-class package, joined by NVLink-C2C, NVIDIA’s chip-to-chip interconnect. The GPU brings 6,144 CUDA cores and one petaFLOP of FP4 AI compute. The CPU contributes 20 ARM cores. Both share the same 128 GB pool of LPDDR5X memory.

That shared memory is not a marketing detail. It is the entire premise of the product.

Why 128 GB Unified Memory Is the Real Story

A 120-billion-parameter model at 4-bit quantization needs roughly 60–70 GB just to load: about 60 GB for the weights themselves (120 billion parameters at half a byte each), plus overhead for quantization scales and runtime buffers. On a discrete GPU, even a high-end 24 GB GDDR7 card, you cannot load a model that size without splitting it across multiple devices or quantizing it so aggressively that quality degrades. You make tradeoffs. RTX Spark removes that constraint.
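The sizing arithmetic above can be sketched in a few lines. This is a rough estimate rather than a measurement: the helper names `weight_memory_gb` and `fits` are made up for illustration, and the 15% overhead figure is an assumption standing in for quantization scales and runtime buffers, not a number from NVIDIA.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.0) -> float:
    """Estimate the memory needed to hold a model's weights, in GB.

    params_billion  -- parameter count in billions (e.g. 120 for a 120B model)
    bits_per_weight -- quantization width (16 for FP16, 4 for 4-bit)
    overhead        -- assumed fractional overhead for scales and buffers
    """
    raw_gb = params_billion * bits_per_weight / 8  # billions of bytes == GB
    return raw_gb * (1.0 + overhead)


def fits(required_gb: float, capacity_gb: float) -> bool:
    """True when the estimated footprint fits in the given memory pool."""
    return required_gb <= capacity_gb


# Assumed overhead of 15%; real runtimes vary.
need = weight_memory_gb(120, 4, overhead=0.15)
print(f"120B at 4-bit: ~{need:.0f} GB")
print("fits in a 24 GB discrete GPU:", fits(need, 24))
print("fits in a 128 GB unified pool:", fits(need, 128))
```

The same function shows why 16-bit weights are out of reach even here: 120 billion parameters at 2 bytes each is 240 GB before any overhead.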

The 128 GB pool is coherently shared between the CPU and GPU via NVLink-C2C’s 600 GB/s bidirectional link. The GPU does not need to request a memory transfer across PCIe to access system RAM. It addresses the entire 128 GB directly. This is architecturally different from “a GPU with a lot of VRAM” — it is one address space, one pool, shared natively by both processors. At 273 GB/s of memory bandwidth, it is also fast enough to make that access practical.

The CUDA Advantage Nobody Is Talking About Enough

Apple Silicon has been the default developer recommendation for local AI inference since the M3 generation. The MLX framework delivers solid performance. But CUDA is where the tooling lives. PyTorch’s CUDA backend, TensorRT-LLM, llama.cpp’s CUDA backend, NIM containers: most of the major inference optimizations of the last five years landed on CUDA first. On Apple, you adapt. On RTX Spark, you don’t.

NVIDIA and Microsoft rebuilt the CUDA runtime for AArch64 Windows (CUDA 13.0, WDDM 3.2). x86 CUDA binaries run without recompilation. The same PyTorch model that runs on an H100 in your cloud cluster runs on RTX Spark without a code change. Add WSL3 — announced at Build 2026 with near-native GPU passthrough — and your entire Linux-based ML toolchain works directly: Ollama, vLLM, LlamaFile, all of it, with full CUDA acceleration inside the Linux subsystem.
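Running "without a code change" assumes the code was written device-agnostically in the first place. A minimal sketch of that pattern follows; `pick_device` and `run_demo` are hypothetical helper names, and the snippet falls back to the CPU when PyTorch or a CUDA device is missing, so it says nothing about RTX Spark performance.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed; nothing to accelerate
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


def run_demo() -> str:
    """Run a tiny matrix multiply on whichever device was selected."""
    device = pick_device()
    if importlib.util.find_spec("torch") is not None:
        import torch
        x = torch.randn(4, 4, device=device)  # allocated on the chosen device
        y = x @ x.T                           # same code path on CPU or GPU
        assert y.shape == (4, 4)
    return device


print("running on:", run_demo())
```

Code that hard-codes `"cuda:0"` or assumes a particular GPU memory size would still need attention, whatever the platform.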

What You Can Run, and How Fast

In practice: Llama 4 Maverick and Qwen3 models at reasonable quantization fit cleanly in 128 GB. At the 70B tier with 4-bit quantization, TensorRT-LLM delivers around 62 tokens per second. Full 120B models are memory-bandwidth-constrained; expect 2–15 tokens per second depending on quantization and context length, which is workable for local agents and development workflows where you are not serving thousands of concurrent users.
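A back-of-the-envelope check on the bandwidth-bound figures: when decoding a single stream from a dense model, each generated token has to read every weight once, so memory bandwidth divided by weight size gives a rough ceiling on tokens per second. The sketch below uses the 273 GB/s figure quoted earlier; the function name is hypothetical, and the estimate ignores KV-cache reads, compute time, and runtime overhead.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, params_billion: float,
                             bits_per_weight: float) -> float:
    """Rough upper bound on single-stream decode speed for a dense model."""
    weight_gb = params_billion * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weight_gb


SPARK_BANDWIDTH_GB_S = 273.0  # memory bandwidth figure quoted in this article

for bits in (4, 8):
    rate = decode_tokens_per_second(SPARK_BANDWIDTH_GB_S, 120, bits)
    print(f"120B dense at {bits}-bit: at most ~{rate:.1f} tokens/s")
```

For a mixture-of-experts model, only the active parameters are read per token, so passing the active parameter count instead of the total gives a higher ceiling. That is one plausible reason the quoted range is as wide as it is.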

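Long-context work moves local too. Models that advertise context windows of up to 1 million tokens can run on the device, memory permitting, since the KV cache shares the same 128 GB pool as the weights. Large-codebase RAG, whole-corpus document ingestion, and extended reasoning chains no longer require a cloud API and the data transfer that comes with it.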

The Privacy Angle Is Underrated

Healthcare developers building on patient data. Legal teams processing privileged documents. Financial services running inference on non-public information. For all of them, cloud AI has always carried a compliance cost: Business Associate Agreements, data residency audits, HIPAA and GDPR exposure from every API call. Local inference on hardware you control takes the third-party processor out of the picture. The data does not leave the machine. There is no counterparty. Hardware at this capability tier makes that compliance posture practical for the first time at frontier model scales.

The Honest Numbers on Price

The N1X flagship — 20 CPU cores, full 128 GB — starts around $2,899. The base N1 variant starts around $1,799, but at reduced memory configurations that do not support the use cases NVIDIA is marketing. Microsoft’s Surface RTX Spark Dev Box is estimated at $3,000–$3,500. For comparison, a Mac Studio with M4 Max at 128 GB costs roughly $3,199 and is available today — RTX Spark devices arrive in fall 2026.

The pricing is not outrageous given what the hardware delivers, but NVIDIA is not helping itself by marketing lower-memory SKUs alongside the 128 GB tier. If you are buying this for local frontier-model inference, you are spending $2,899 minimum. Be clear about that going in.

Who Should Actually Pay Attention

Four groups stand out: AI engineers who need the CUDA ecosystem and cannot or will not rewrite their tooling for Apple MLX; developers building in regulated industries where cloud inference is a compliance liability; teams running coding agents or RAG pipelines that benefit from local inference with no network round trip; and researchers who want 120B-scale experimentation without cloud bills.

RTX Spark is a real hardware milestone. The unified memory architecture is not incremental — it breaks a constraint that has limited local AI inference for years. The price is real too. If the 128 GB tier fits your budget and your workflow, this is the most capable local AI development machine announced to date. If it does not, the Mac Studio M4 Max is already on shelves.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover the latest tech news and controversies and summarize them into byte-sized, easily digestible information.