
PrismML emerged from stealth yesterday (March 31, 2026) with the world’s first commercially viable 1-bit large language models. Its flagship Bonsai 8B delivers performance comparable to 16-bit models like Llama 3 8B while being 14 times smaller (1.15GB vs 16GB), running 8 times faster, and consuming 4-5 times less energy. The breakthrough: every weight takes one of just three values (-1, 0, +1), eliminating expensive floating-point multiplications. Your iPhone can now run a powerful LLM at 44 tokens per second in just over 1GB of storage. The models are available for free download today under the Apache 2.0 license.
Ternary Quantization: -1, 0, +1
Unlike post-training quantization—where you compress an existing 16-bit model—Bonsai is trained natively with ternary weights from scratch (the “1-bit” label is shorthand; three values take log2 3 ≈ 1.58 bits per weight). This matters. When every weight is restricted to -1, 0, or +1, multiplication becomes trivial: multiplying by 0 is a no-op, by 1 is identity, by -1 is just a sign flip. No expensive floating-point math. The result: the 8B model shrinks from 16GB to 1.15GB while maintaining comparable benchmark performance.
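The arithmetic savings can be sketched in a few lines of Python. The weight matrix and activations below are illustrative, not Bonsai’s; the point is that a ternary matrix-vector product needs only adds, subtracts, and skips:

```python
def ternary_matvec(weights, x):
    """Matrix-vector product where every weight is -1, 0, or +1.

    No multiplications: each weight selects add, subtract, or skip.
    weights: list of rows drawn from {-1, 0, +1}; x: activation vector.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:        # identity: just add the activation
                acc += xi
            elif w == -1:     # sign flip: just subtract it
                acc -= xi
            # w == 0: no-op, skip entirely
        out.append(acc)
    return out

# Illustrative values (not from the Bonsai models):
W = [[1, 0, -1],
     [0, 1, 1]]
x = [0.5, -2.0, 3.0]
print(ternary_matvec(W, x))  # [-2.5, 1.0]
```

Real 1-bit runtimes pack these ternary weights into ~2 bits each and vectorize the add/subtract passes, but the inner loop above is the whole idea: the multiplier circuit disappears.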
The numbers tell the story. Bonsai 8B runs at 136 tokens per second on an M4 Pro Mac, compared to roughly 17 tokens/sec for standard 16-bit Llama 3. On an iPhone 17 Pro Max, it hits 44 tokens/sec. Energy consumption drops to 0.068 mWh per token on iPhone—nearly 5 times more efficient than 16-bit models. PrismML’s “intelligence density” metric (negative log of error rate divided by model size) scores 1.06/GB versus Qwen3 8B’s 0.10/GB. That’s a 10x improvement in intelligence per gigabyte.
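The intelligence-density metric is simple enough to write down directly. The announcement doesn’t specify the log base or the benchmark behind the error rate, so the natural log and the sample error rate below are assumptions for illustration only:

```python
import math

def intelligence_density(error_rate, size_gb):
    """PrismML's reported metric: -log(error rate) / model size in GB.

    Assumption: natural log (the announcement does not specify a base).
    """
    return -math.log(error_rate) / size_gb

# With a hypothetical 30% error rate and Bonsai 8B's 1.15GB footprint:
print(round(intelligence_density(0.30, 1.15), 2))
```

The metric rewards both lower error and smaller size, which is why shrinking a model 14x while holding benchmarks roughly flat moves the score by an order of magnitude.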
Download Now: Apache 2.0, Multi-Platform
PrismML released three models yesterday: Bonsai 8B (1.15GB), Bonsai 4B (0.5GB), and Bonsai 1.7B (0.24GB). All free under Apache 2.0 license—no restrictions on commercial use. They’re available on HuggingFace, with demo code on GitHub and pre-configured Google Colab notebooks. The iOS app Locally AI already supports Bonsai models, proving real-world viability from day one.
Framework support launched immediately. MLX handles Apple Silicon (Mac, iPhone, iPad), llama.cpp works with NVIDIA GPUs via CUDA, and bitnet.cpp targets CPU-only deployments. This isn’t vaporware—it’s production-ready code you can download right now. PrismML, founded by Caltech researchers, raised $16.25M in SAFE and seed funding from Khosla Ventures, Cerberus, and Google.
Efficiency vs Scale: Two AI Futures
Here’s the irony: The same day PrismML launched Bonsai, OpenAI closed a $122 billion funding round at an $852 billion valuation. Two opposing bets. OpenAI doubles down on scale—bigger models, massive data centers, $2 billion monthly revenue from cloud APIs charging $0.01-$0.06 per thousand tokens. PrismML bets on efficiency—smaller models, edge deployment, zero marginal cost after the initial download.
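The economic gap is easy to quantify. A back-of-envelope comparison using the per-token prices above and a hypothetical monthly volume:

```python
def cloud_cost(tokens, price_per_1k):
    """Monthly cloud API bill at a given price per thousand tokens."""
    return tokens / 1_000 * price_per_1k

# Hypothetical workload; the $0.01-$0.06 range comes from the figures above.
monthly_tokens = 50_000_000
for price in (0.01, 0.06):
    print(f"${cloud_cost(monthly_tokens, price):,.2f}/month at ${price}/1k tokens")
# Edge deployment: $0 marginal cost after the one-time ~1GB download.
```

At 50 million tokens a month that is $500 to $3,000 in API fees versus zero marginal cost on-device, which is the whole economic bet in two lines of arithmetic.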
The efficiency narrative isn’t just technical—it’s economic and regulatory. Privacy-sensitive applications (healthcare, finance) can’t send data to external APIs. Offline environments (remote sites, air-gapped systems) have no cloud access. Real-time applications (robotics, autonomous systems) can’t tolerate network latency. High-volume applications can’t afford per-token pricing. Edge AI solves all of this. Bonsai proves it’s not just viable—it’s competitive.
Limitations and Future Hardware
PrismML is honest about current limitations. Today’s 8x speedup comes primarily from reduced memory footprint, not from fully exploiting 1-bit computation during inference. Specialized hardware—chips optimized for ternary operations—could unlock “another order-of-magnitude” improvement, the company says. That means today’s 8x could become 80x with purpose-built silicon.
Quality trade-offs remain unclear. Benchmarks show “comparable” performance to 16-bit models, but real-world testing is just beginning. Bonsai launched yesterday—production stability is unknown, community validation is pending. Developers should test thoroughly for their specific use cases. This is version 1.0, not a mature platform.
What It Means for Edge AI
Edge AI is no longer theoretical. You can download a 1GB model today that runs on consumer hardware, processes 44 tokens per second on a phone, and costs nothing per inference. Privacy becomes a feature, not a limitation. Offline operation becomes practical. Battery life improves 5x. Latency drops to near-zero.
Cloud AI isn’t going away—OpenAI’s $122B funding round proves that. But edge AI now offers a viable alternative for applications where privacy, cost, latency, or offline operation matter. The efficiency narrative challenges the scale narrative. Sometimes 1GB beats 16GB. Sometimes smaller wins.
