Dan Woods released Flash-Moe in March 2026: a pure C/Metal inference engine that runs Qwen3.5-397B, a 397-billion-parameter model, on a MacBook Pro with 48GB of RAM at 4.4 tokens/second. This seemingly impossible feat works by streaming expert weights from SSD on demand while keeping only 5.5GB resident in memory. The 209GB model runs within 48GB thanks to three engineering breakthroughs: pruning active experts from 512 to 4 per token, applying 4-bit quantization, and leveraging hand-tuned Metal shaders for Apple Silicon.
The innovation shifts the feasibility boundary for local AI from 70B to 400B+ models on consumer hardware. For enterprises handling privacy-sensitive data, or for developers processing 2M+ tokens daily, Flash-Moe makes local deployment 2-3x cheaper than cloud APIs over the long term. This isn't just a demo: it ships production-ready with full tool-calling support, proof that careful engineering can beat brute-force hardware scaling.
How Flash-Moe Runs 397B Models on Mac: Three Breakthroughs
Flash-Moe overcomes the hardware limits through expert pruning, aggressive quantization, and a streaming architecture. Qwen3.5-397B uses a Mixture-of-Experts (MoE) architecture with 512 experts per layer but activates only 11 per token (10 routed plus 1 shared) in its standard configuration. Flash-Moe prunes this to just 4 active experts with no reported quality degradation, meaning less than 2% of expert weights are needed for any given token. Each expert weighs in at ~6.75MB after 4-bit quantization, making on-demand loading from SSD viable.
The implementation keeps 5.5GB of non-expert weights (embeddings, routing matrices) at full precision memory-mapped in RAM, while streaming the bulk of the model from SSD via parallel pread() calls with GCD dispatch groups. This division is strategic: routing decisions need precision to select the right experts, but expert weights themselves tolerate 4-bit quantization. Moreover, the macOS page cache naturally achieves ~71% hit rate for frequently accessed experts, which Flash-Moe leverages instead of implementing custom caching—custom logic was 38% slower in experiments.
Performance breakdown per layer reveals where time goes: GPU attention and delta-net operations take 1.22ms, SSD expert loading dominates at 2.41ms (56% of total), and expert computation completes in just 0.04ms through deferred GPU pipeline submission. The 7,000 lines of C code plus 1,200 lines of hand-optimized Metal GPU shaders include FMA-optimized dequantization that rearranges math to enable fused multiply-add operations, delivering 12% performance gains. On Apple Silicon, unified memory architecture eliminates CPU-GPU transfer overhead, but shared memory controller between SSD and GPU mandates serialized operations—attempts to overlap hurt performance by 73%.
Production Performance: Why 2-Bit Quantization Fails
Flash-Moe achieves 4.36 tokens/second with production-quality output, including full tool-calling support, at 4-bit quantization. That is 5-7x slower than cloud APIs such as GPT-4o (20-30 tok/s) but far ahead of previous local deployment methods, which couldn't handle 400B models at all. The project also tested 2-bit quantization extensively: it hits 5.74 tok/s, 30% faster than 4-bit. Unfortunately, 2-bit breaks tool calling by producing malformed JSON output: \name\ instead of "name". For agentic applications that require structured data, 2-bit is unusable despite the impressive speed numbers.
Dan Woods documented 58 failed optimization attempts in the project repository, a rare level of transparency about what doesn't work. LZ4 compression hurt performance by 13% due to decompression overhead. Speculative expert prefetching tanked performance by 73% through GPU contention. Custom Metal caching implementations performed 38% worse than simply trusting the OS page cache. Expert-prediction models achieved only 31% accuracy, worse than a temporal caching baseline. These documented failures save other developers months of dead-end work and point to three lessons: trust OS-level optimizations, don't over-engineer, and respect Apple Silicon's architectural constraints.
MoE Architecture: Why 397B Models Can Stream from SSD
MoE architecture is the enabling technology; without it, streaming 397B parameters from SSD would fail. Qwen3.5-397B has 60 transformer layers following the pattern 15 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)). Each MoE layer contains 512 experts but activates only a sparse subset per token. The standard configuration uses 10 routed experts plus 1 shared expert (11 total), but Flash-Moe shows that 4 active experts maintain quality. Loading 4 experts at ~6.75MB each from SSD takes 2.41ms per layer, acceptable latency for 4.36 tok/s throughput.
This wouldn't work for dense models like Llama, where all parameters activate for every token. Loading 209GB of dense weights from SSD would take seconds per token, making generation unusable. MoE's sparse activation, under 2% of weights per token, turns the problem from impossible to tractable. Additionally, the 3:1 ratio of Gated DeltaNet to standard Gated Attention blocks enables 8.6x to 19.0x decoding-throughput improvements over previous Qwen generations, though Flash-Moe's streaming architecture means real-world gains depend on SSD bandwidth rather than compute efficiency alone.
Cost Analysis: When Local Deployment Beats Cloud APIs
Flash-Moe requires $3,500-4,500 upfront (MacBook Pro M3 Max, 48GB RAM, 1TB SSD) and has negligible operating costs beyond electricity. Cloud APIs, meanwhile, run $35-200/month for daily heavy use, putting the breakeven point at 18-22 months. At scale, 2 million tokens daily or more, local deployment becomes 2-3x cheaper over the long term. Privacy-sensitive industries such as legal, healthcare, and finance get additional value from keeping data on-premises, avoiding cloud-provider access to confidential information. However, this isn't universally better than cloud.
Cloud APIs offer 5-7x faster inference (20-30 tok/s vs 4.36), higher capability on complex reasoning tasks, no upfront investment, and costs that scale with usage. For developers spending under $50/month on APIs, a 45+ month ROI makes Flash-Moe's hardware investment uneconomical. The decision criteria: choose local for consistent high volume, privacy requirements, and offline operation; stick with cloud for variable workloads, maximum speed, or tight budgets. Flash-Moe targets a specific niche; it isn't trying to replace cloud APIs universally.
Apple Silicon Lock-In: Why This Won’t Run on NVIDIA
Flash-Moe is Apple Silicon-exclusive because it relies on unified memory architecture (UMA) and the Metal GPU framework. UMA means the CPU, GPU, and Neural Engine share one high-bandwidth memory pool without PCIe transfer overhead; the 5.5GB of routing weights stay memory-mapped and accessible to both processors simultaneously. Metal, in turn, provides the low-level GPU control needed for custom shader optimization: FMA-optimized dequantization rearranges the 4-bit math to enable fused multiply-add operations, and the hand-tuned Metal shaders deliver 12% performance improvements through such hardware-specific techniques.
However, Apple Silicon's shared memory controller between SSD DMA and GPU compute prevents profitable parallelization. Attempts to overlap SSD loading with GPU computation failed because the two compete for the same memory-controller arbitration, and this constraint shaped Flash-Moe's serialized processing design. The optimization techniques are therefore deeply tied to Apple hardware; there is no straightforward port to NVIDIA GPUs or Windows/Linux platforms. Future M4/M5 chips with Neural Accelerators and higher memory bandwidth should benefit Flash-Moe directly, without code changes.
Key Takeaways
- Flash-Moe enables 400B models on consumer Macs through expert pruning (512 → 4 active), 4-bit quantization, and SSD streaming—fitting 209GB models into 48GB RAM at 4.36 tokens/second
- Production viability requires 4-bit quantization; 2-bit achieves 5.74 tok/s but breaks tool calling with malformed JSON output, making it unusable for agentic applications
- MoE architecture is critical—only 2% of weights activate per token, making SSD streaming viable; this wouldn’t work for dense models like Llama where all parameters activate
- Economics favor local deployment at scale (2M+ tokens/day) with 18-22 month breakeven vs cloud APIs ($3,500 hardware vs $35-200/month), plus privacy advantages for sensitive data
- Apple Silicon-exclusive optimization leverages unified memory and the Metal framework; the shared memory controller prevents GPU/SSD parallelization, an architectural constraint that shaped the design
Basic usage for developers:
cd metal_infer
make
./infer --prompt "Explain quantum computing" --tokens 500
./chat # Interactive mode with tool calling

