MLX Distributed Training with JACCL: Multi-Mac LLM Clusters, Explained

MLX distributed training with JACCL enables multi-Mac LLM clusters via Thunderbolt 5 RDMA

Apple just made running trillion-parameter models a realistic option for any developer with a few Macs and a Thunderbolt cable. At WWDC 2026, Apple shipped JACCL — the Jack and Angelos’ Collective Communication Library — a distributed backend that runs MLX collectives over RDMA on Thunderbolt 5. The headline numbers: 50–60 Gbps throughput, sub-50 microsecond latency, and up to 3x inference speed-up with four Macs connected. The 1-trillion-parameter Kimi 2.6 model ran at 28+ tokens per second across four M3 Ultras in the WWDC demo. This shipped with macOS 26.2. It is not a research preview.

JACCL Is Apple’s NCCL — and That Name Is Deliberate

JACCL (pronounced “Jackal”) is a pun on NVIDIA’s NCCL, the collective communications library that underpins most GPU cluster training today. Apple is directly positioning JACCL as its answer to the GPU cluster networking problem — the same all-reduce, all-gather, and broadcast primitives, running over Thunderbolt 5 instead of NVLink or InfiniBand.
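To make the three primitives concrete, here is a minimal pure-Python simulation of what each one leaves in every node's buffer. It does not use MLX or JACCL; the function names and the list-of-lists representation of "nodes" are illustrative assumptions, and it models only the results, not how the data moves over the wire.

```python
# Pure-Python simulation: each inner list stands for one node's local buffer.

def all_reduce_sum(buffers):
    """Every node ends up with the element-wise sum of all buffers."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def all_gather(buffers):
    """Every node ends up with all buffers concatenated in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]

def broadcast(buffers, root=0):
    """Every node ends up with a copy of the root node's buffer."""
    return [list(buffers[root]) for _ in buffers]

nodes = [[1, 2], [10, 20], [100, 200]]  # three hypothetical nodes
reduced = all_reduce_sum(nodes)
gathered = all_gather(nodes)
copied = broadcast(nodes, root=1)

print(reduced[0])   # what node 0 holds after the all-reduce
print(gathered[0])  # what node 0 holds after the all-gather
print(copied[0])    # what node 0 holds after the broadcast from node 1
```

A real backend produces the same end state but exchanges the data over the interconnect in chunks, which is where latency and bandwidth start to matter.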

The “Jack” in the name is Jack Beasley, who led RDMA over Thunderbolt development at Apple. RDMA (Remote Direct Memory Access) moves data from one machine’s memory directly into another’s without routing each transfer through the CPU or the operating system’s network stack. That’s how JACCL achieves sub-50 µs latency: it removes the software overhead that made previous ring-allreduce backends on MLX an order of magnitude slower.

The full stack, from hardware to your Python code:

Thunderbolt 5 cables — the physical layer
RDMA over Thunderbolt — a macOS 26.2 OS feature
JACCL — collective communication primitives
MLX — Apple’s open-source ML framework
MLX LM — high-level LLM API
Your code — minimal changes needed

That last point matters. Your existing single-device MLX training or inference code needs almost no changes to run distributed. You initialize the world, and MLX handles the rest.

What You Can Actually Do With This

Distributed Inference

If you have models that exceed single-device memory, distributed inference is the answer. MLX LM splits large models across machines using tensor parallelism (splitting by width, better for throughput) or pipeline parallelism (splitting by depth, simpler communication). The WWDC demo ran Kimi 2.6 — a trillion-parameter model — across four M3 Ultras at 28+ tokens per second. Before JACCL, that was not happening locally.
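The difference between the two splitting strategies can be sketched with a toy two-layer model in plain Python. This is an illustration of the idea only, not how MLX LM shards a real model; the weights, the `matvec` helper, and the two-node split are all made up for the example.

```python
def matvec(weights, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# A toy two-layer model: x -> W1 -> W2
W1 = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 outputs, 2 inputs
W2 = [[1, 0, 1, 0], [0, 1, 0, 1]]      # 2 outputs, 4 inputs
x = [1, 1]

# Single-device reference result
reference = matvec(W2, matvec(W1, x))

# Tensor parallelism: split one layer "by width". Each of two nodes holds
# half of W1's output rows; the partial outputs are concatenated, which is
# the job an all-gather does between real machines.
node0_rows, node1_rows = W1[:2], W1[2:]
hidden = matvec(node0_rows, x) + matvec(node1_rows, x)  # list concatenation
tensor_parallel = matvec(W2, hidden)

# Pipeline parallelism: split the model "by depth". Node 0 holds layer 1 and
# node 1 holds layer 2; only the activation crosses the link between them.
activation_sent = matvec(W1, x)                  # computed on node 0
pipeline_parallel = matvec(W2, activation_sent)  # computed on node 1

print(reference, tensor_parallel, pipeline_parallel)
```

Both strategies reproduce the single-device result. What changes is the traffic pattern: tensor parallelism communicates inside every layer, while pipeline parallelism sends one activation per stage boundary.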

For context: a single Mac mini M4 Pro with 64GB of unified memory hits around 130 tokens per second on Qwen3-Coder-30B-A3B. With four such machines connected via Thunderbolt 5, you can exceed single-device memory limits entirely and tackle model sizes that were cloud-only territory.

Distributed Fine-Tuning

LoRA fine-tuning across multiple Macs is now practical without wrestling with MPI configurations. The approach is data parallelism on top of JACCL: your dataset is sharded across devices, each device computes gradients on its own partition, and the gradients are combined via all-reduce so every node applies the same update. The mlx.distributed_config utility automates the network interface setup that previously required manual hostfile configuration and SSH key juggling.
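A minimal sketch of that data-parallel pattern, in plain Python with a one-parameter model. The dataset, the `local_gradient` helper, and `all_reduce_mean` are hypothetical stand-ins; a real job would compute gradients with MLX and let the distributed backend do the reduction.

```python
def local_gradient(w, shard):
    """Mean-squared-error gradient for y ~ w * x over one node's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every node receives the mean."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
world_size = 2  # hypothetical two-Mac cluster
w = 1.0

# Each rank takes every world_size-th example, starting at its own rank.
shards = [dataset[rank::world_size] for rank in range(world_size)]
local_grads = [local_gradient(w, shard) for shard in shards]
synced_grads = all_reduce_mean(local_grads)

# Same gradient computed on one device over the full batch
full_batch_grad = local_gradient(w, dataset)

print(local_grads, synced_grads, full_batch_grad)
```

The averaged gradient matches the full-batch gradient here because the shards are the same size; with uneven shards the per-node gradients would need to be weighted by shard length.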

This is the “sovereign AI cluster” scenario: four Mac mini M4 Pros (64GB each) cost roughly $10,000 total and let you fine-tune and run 100B+ parameter models with no cloud dependency and no data leaving your hardware. For researchers and indie developers handling sensitive data, that’s a meaningful alternative to cloud GPU spend.
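A rough way to sanity-check the 100B+ figure is a weights-only memory estimate. The sketch below assumes 1 GB = 1e9 bytes and ignores the KV cache, activations, optimizer state, and the share of unified memory macOS keeps for itself, so treat the results as lower bounds rather than a sizing guide.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

cluster_memory_gb = 4 * 64  # four 64 GB Mac minis, as in the text

for bits in (16, 8, 4):
    needed = weight_memory_gb(100e9, bits)
    fits = needed < cluster_memory_gb
    print(f"100B params at {bits}-bit: {needed:.0f} GB of weights, "
          f"under {cluster_memory_gb} GB total: {fits}")
```

Under these assumptions a 100B model at 16-bit leaves little headroom on a 256 GB cluster, while 8-bit and 4-bit quantization leave substantial room for everything the estimate ignores.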

Setup: What It Actually Takes

The requirements are specific — be clear-eyed about them:

macOS 26.2 or later — RDMA support lives in the OS; currently in developer preview
Thunderbolt 5 ports on every Mac (M4 Pro, M4 Max, and M3 Ultra machines have them; base M4 models ship with Thunderbolt 4)
Fully-connected topology — every Mac must be directly cabled to every other; no daisy-chaining
Same Python environment — identical conda/venv setup across all nodes
SSH keys — configured for passwordless authentication between all machines

The fully-connected requirement is the real constraint. A mesh of n Macs needs n(n-1)/2 cables and n-1 free Thunderbolt 5 ports on every machine, so cabling grows quadratically and practical cluster size tops out at 4–6 nodes. You will not be building a 32-node Mac cluster. But 4 nodes is enough to run trillion-parameter models and serious fine-tuning jobs.
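The cabling arithmetic is easy to check. This small sketch counts cables and per-machine ports for a fully connected mesh; the function name is illustrative, and the node counts are just sample values.

```python
def mesh_requirements(num_nodes):
    """Cables and per-node ports for a fully connected mesh."""
    cables = num_nodes * (num_nodes - 1) // 2  # one cable per pair of nodes
    ports_per_node = num_nodes - 1             # one port per peer
    return cables, ports_per_node

for n in (2, 4, 6, 8, 32):
    cables, ports = mesh_requirements(n)
    print(f"{n:>2} nodes: {cables:>3} cables, {ports:>2} ports per node")
```

The per-node port count is usually the binding limit: every Mac needs one free Thunderbolt 5 port for each of its peers, which a 32-node mesh (31 ports per machine) could never satisfy.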

Setup flow:

Connect Macs with Thunderbolt 5 cables (fully connected mesh, not daisy-chained)
Run mlx.distributed_config to auto-configure network interfaces
Add a hostfile with Thunderbolt Bridge IP addresses
Use mlx.launch to execute your script across all nodes

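In code, going distributed takes only a couple of new lines:

import mlx.core as mx

# Initialize the distributed world (a single-process group if no backend is active)
world = mx.distributed.init()

# Your model, optimizer, and training loop are unchanged.
# Placeholder standing in for one gradient array from your own training step:
gradients = mx.ones((4,))
# all_sum returns the cross-node sum; divide by the world size to average it
gradients = mx.distributed.all_sum(gradients) / world.size()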

MLX handles topology selection automatically — mesh communication when latency matters, ring when bandwidth matters. The framework’s abstraction means you do not rewrite your training loop to go distributed.

The Honest Comparison With NVIDIA

JACCL is not competing with NVIDIA’s NVLink at scale. NVLink 4.0 delivers around 900 GB/s; JACCL over Thunderbolt 5 reaches 50–60 Gbps, which is roughly 6–7.5 GB/s, or more than 100 times less. NCCL scales to thousands of GPUs across a datacenter; JACCL tops out at a handful of Macs on a desk.
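Because the two figures are quoted in different units (gigabytes versus gigabits per second), it helps to convert before comparing. The sketch below does that and estimates how long a payload takes to cross each link; the 1 GB payload is a hypothetical example, and the model ignores latency and protocol overhead.

```python
def gbps_to_gigabytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8

def transfer_ms(payload_gb, link_gigabytes_per_s):
    """Idealized transfer time in milliseconds (bandwidth only)."""
    return payload_gb / link_gigabytes_per_s * 1000

nvlink = 900.0                          # GB/s, figure from the text
tb5_low = gbps_to_gigabytes_per_s(50)   # GB/s
tb5_high = gbps_to_gigabytes_per_s(60)  # GB/s
payload_gb = 1.0                        # hypothetical gradient buffer

print(f"Thunderbolt 5 via JACCL: {tb5_low}-{tb5_high} GB/s")
print(f"NVLink advantage: {nvlink / tb5_high:.0f}x to {nvlink / tb5_low:.0f}x")
print(f"1 GB over NVLink: {transfer_ms(payload_gb, nvlink):.2f} ms")
print(f"1 GB over Thunderbolt 5: {transfer_ms(payload_gb, tb5_high):.0f}"
      f"-{transfer_ms(payload_gb, tb5_low):.0f} ms")
```

The gap is one reason LoRA suits this setup: only the small adapter gradients need to cross the link on each step, not gradients for the full model.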

But that’s the wrong comparison. The right comparison is JACCL versus paying for cloud GPU instances. If you’re a researcher or independent developer who wants to fine-tune a 70B model privately, without sending your data to a cloud provider or burning through API credits, a 4-Mac Thunderbolt cluster is now a legitimate setup. Apple Silicon has crossed the line from “runs inference adequately” to “credible fine-tuning platform.”

The community has been building toward this — distributed LoRA experiments via MPI, the sovereign cluster movement, peer-to-peer inference tools. JACCL gives all of that a proper foundation at the OS level instead of bolted-on workarounds.

Where to Start

If you’re on M4 hardware with access to the macOS 26.2 developer preview, the WWDC 2026 session “Explore distributed inference and training with MLX” is the starting point. Apple’s MLX distributed documentation covers the full API. The DaveAldon starter repo on GitHub gives you a working distributed MLX project to fork. For single-device performance context, Apple’s M5 neural accelerator research post is worth reading alongside the distributed session.

macOS 26.2 is not shipping to the public yet, but the developer preview is live post-WWDC. If local LLM work on Apple Silicon is part of your roadmap, now is the time to get comfortable with the stack — not after the general release when everyone else is catching up.

ByteBot