
Andrej Karpathy released nanochat in October 2025: a complete LLM training pipeline, from raw text to deployed chat UI, for around $100 of rented cloud GPU time. Not a demo. It covers the same stages that labs like OpenAI and Anthropic run at scale, compressed into roughly 8,000 lines of readable Python and Rust. Nine months later it has 55,000+ GitHub stars, an autoresearch extension that lets AI agents improve the training recipe overnight, and a Llama 3 successor called nanollama that exports directly to llama.cpp. This is what the $100 actually gets you, and why the model you train matters less than what you learn building it.
nanochat vs nanoGPT: Why the Full Pipeline Changes Everything
Karpathy’s original nanoGPT (2022) covered pretraining. Clean, minimal, ~2,000 lines. It taught you how a transformer learns to predict text. nanochat extends that into the stages that actually define a model’s behavior: midtraining, supervised finetuning, optional reinforcement learning, and inference with a working chat UI.
The gap matters. Pretraining is where the model learns facts and patterns. Everything after — midtraining on SmolTalk conversation data, SFT on curated assistant examples, GRPO on math problems — is where the model learns to be useful. Most developers who work with LLMs every day have never seen what happens in those stages. nanochat shows you exactly what they do, in code you can read and modify.
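To make the GRPO step less abstract, here is a minimal sketch of the group-relative advantage it is usually described with: sample several answers to one problem, score each with a verifiable check, and rate every answer against the group's own average instead of a learned reward model. This is an illustration under stated assumptions, not nanochat's code; the function name and the 0/1 rewards are hypothetical, and some implementations skip the standard-deviation division.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to its own group's mean reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Answers better than the group average get a positive advantage,
    # worse ones get a negative advantage. No reward model is involved.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: four sampled answers to one math problem,
# rewarded 1.0 if the final answer checks out and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

The policy update then raises the probability of tokens in positively scored answers and lowers it for the rest, which is why the task needs an automatic correctness check.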
On MMLU benchmarks: nanoGPT-trained models score around 20%. The full nanochat pipeline hits ~40%. That 20-point gap is entirely the post-pretraining stack doing its job.
The $100 Cost Breakdown (and What You’re Actually Buying)
Here is the arithmetic: 8 H100 GPUs at ~$3/GPU/hr on-demand comes to about $24/hour. nanochat trains in roughly 4 hours. Total: about $96 on-demand, or $15–20 on spot. You get a 561M-parameter model trained on 11.2 billion tokens from FineWeb-EDU.
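The same arithmetic as a short script, using only the approximate figures quoted above. The tokens-per-parameter ratio is an extra check: it lands near 20, which matches the commonly cited compute-optimal rule of thumb, though that interpretation is an assumption about the recipe.

```python
GPUS = 8
PRICE_PER_GPU_HOUR = 3.00   # USD, approximate rate quoted above
TRAIN_HOURS = 4             # approximate wall-clock training time

node_rate = GPUS * PRICE_PER_GPU_HOUR    # cost of the whole node per hour
total_cost = node_rate * TRAIN_HOURS     # cost of one full training run

PARAMS = 561e6    # model size from the article
TOKENS = 11.2e9   # training tokens from the article
tokens_per_param = TOKENS / PARAMS

print(f"node rate: ${node_rate:.0f}/hour")
print(f"total: ${total_cost:.0f}")
print(f"tokens per parameter: {tokens_per_param:.1f}")
```

Swap in your own provider's hourly rate to see how far the total moves; the GPU price is the only input that varies much between runs.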
You are not buying a model you will ship. At 561M parameters and 40% MMLU, this model can hold a conversation and answer simple questions. It cannot compete with Llama 3, Qwen3, or any frontier model. That is not the point.
The point is the process. Here is what nanochat forces you to actually implement:
- A custom BPE tokenizer in Rust (65,536-token vocab, 4.8 chars/token compression)
- A depth-20 Transformer with 1,280 hidden channels and 10 attention heads
- The Muon optimizer for matmul weights alongside AdamW for embeddings
- Midtraining on SmolTalk to teach conversation format and tool use
- GRPO — reinforcement learning on verifiable math tasks with no reward model
- KV-caching inference with streaming and a web-based chat UI
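The dimensions in the list above are enough for a back-of-envelope check of the 561M figure. The sketch below assumes a plain decoder block (four attention projections, an ungated MLP with 4x expansion), no bias or norm parameters, and untied input and output embeddings. Those are assumptions for illustration, not a reading of nanochat's source.

```python
def estimate_params(depth, dim, vocab_size, mlp_ratio=4, tied_embeddings=False):
    """Rough weight count for a decoder-only Transformer (matrices only)."""
    attn = 4 * dim * dim               # Q, K, V and output projections
    mlp = 2 * mlp_ratio * dim * dim    # up and down projections
    blocks = depth * (attn + mlp)
    embed = vocab_size * dim           # token embedding table
    head = 0 if tied_embeddings else vocab_size * dim  # output projection
    return blocks + embed + head

total = estimate_params(depth=20, dim=1280, vocab_size=65_536)
head_dim = 1280 // 10   # hidden channels split across 10 attention heads

print(f"{total:,} parameters")   # 560,988,160, i.e. about 561M
print(f"head_dim = {head_dim}")  # 128
```

Under these assumptions roughly 30% of the weights sit in the two embedding tables, which is typical for a small model with a large vocabulary.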
Context: training a GPT-2 equivalent cost roughly $43,000 in 2019. In 2026, nanochat reaches the same benchmark for about $48 of GPU time, which is roughly two hours at the ~$24/hour node rate above. The democratization of ML training is real, and nanochat is the most readable proof of it. If you are curious about what Python’s concurrency story looks like on the hardware running these training jobs, see our post on Python 3.14’s free-threaded mode.
The 2026 Update: nanollama Modernizes the Architecture
nanochat uses a custom GPT-style architecture, intentionally, for clarity. But modern production models mostly follow the Llama-style stack: RoPE positional embeddings, SwiGLU activations, RMSNorm, grouped query attention (GQA), untied embeddings. If you want “learn from scratch” to mean something in 2026, you need to learn the current stack.
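Two of those components are small enough to sketch in plain Python. This is an illustrative, dependency-free version of RMSNorm and RoPE on a single vector, not code from nanochat or nanollama. The learned RMSNorm gain is omitted, the `base` value is an assumed default, and real implementations differ in how they pair up dimensions for the rotation.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by angles that grow with the position."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the distance between positions:
near = dot(rope(q, 5), rope(k, 3))
shifted = dot(rope(q, 2), rope(k, 0))
print(abs(near - shifted) < 1e-9)
```

That last property is the reason RoPE is applied to queries and keys instead of being added to the token embeddings: relative position falls out of the dot product.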
That is what nanollama delivers. Born from Karpathy’s nanochat discussion #557 and released as v0.1.0 in early 2026, nanollama trains Llama 3 from scratch and exports GGUF v3 — which means your trained model runs directly in llama.cpp, Ollama, or LM Studio. No conversion scripts. No compatibility headaches.
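If you want to sanity-check an exported file before handing it to llama.cpp, the fixed-size GGUF header is easy to read. The sketch below follows the published GGUF layout as I understand it (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, little-endian). It is not nanollama's exporter, and the bytes it parses are synthetic.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(stream):
    """Parse the fixed-size header at the start of a little-endian GGUF file."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    (version,) = struct.unpack("<I", stream.read(4))
    tensor_count, kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for demonstration only: version 3, 2 tensors, 5 metadata keys.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5))
header = read_gguf_header(fake_file)
print(header)
```

For a real file, pass `open(path, "rb")` instead of the `BytesIO` object; the metadata and tensor descriptors follow immediately after these 24 bytes.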
It ships eight model configs from nano (46M) to big (7B), all with head_dim=64. The inference engine is a 9MB pure Go binary with seven quantization formats and a built-in web chat UI. No runtime dependencies. This is what the educational baseline should have been from the start.
autoresearch: When nanochat Trains Itself
In March 2026, Karpathy released autoresearch: AI agents that run ML experiments on nanochat overnight without human intervention. The loop: an agent modifies the training code, runs a 5-minute training session, checks if validation loss improved, keeps the change or discards it, and repeats.
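The loop described above is a greedy keep-or-discard search, and it can be sketched in a few lines. Everything here is a stand-in: `run_short_training` is a hypothetical toy objective in place of a real 5-minute training run, and `propose_change` randomly nudges one made-up hyperparameter where the real system has an agent editing training code.

```python
import random

def run_short_training(config, rng):
    """Hypothetical stand-in for a short training run; returns validation loss."""
    # Toy objective: loss is lowest near lr=0.02, plus a little noise.
    return (config["lr"] - 0.02) ** 2 * 1000 + 3.0 + rng.gauss(0, 0.001)

def propose_change(config, rng):
    """Hypothetical stand-in for an agent's code modification."""
    candidate = dict(config)   # copy, so a rejected change leaves no trace
    candidate["lr"] = max(1e-4, config["lr"] * rng.uniform(0.7, 1.3))
    return candidate

def autoresearch_loop(config, n_experiments, seed=0):
    rng = random.Random(seed)
    best_loss = run_short_training(config, rng)   # baseline measurement
    kept = 0
    for _ in range(n_experiments):
        candidate = propose_change(config, rng)
        loss = run_short_training(candidate, rng)
        if loss < best_loss:        # keep the change only if it helped
            config, best_loss = candidate, loss
            kept += 1
    return config, best_loss, kept

start = {"lr": 0.1}
best_config, best_loss, kept = autoresearch_loop(start, n_experiments=50)
print(f"kept {kept} of 50 changes, final loss {best_loss:.4f}")
```

Even this toy version shows the main hazard: with a noisy metric, some accepted changes are luck, which is one reason to confirm that improvements transfer to larger models.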
One run used 35 agents across 333 experiments, evaluating roughly 700 changes. The result was an 11% improvement on the Time-to-GPT-2 benchmark, from 2.02 hours down to 1.80 hours, with changes that transferred cleanly from small to larger models. nanochat is not just a tutorial. It is a research harness you can extend. For context on what open-weight models this educational run competes against, see our coverage of MiniMax M3.
Who Should Actually Run nanochat
Run nanochat if you want to understand what SFT, GRPO, and midtraining actually do to a model — not in theory but in loss curves and benchmark scores. Run it if you are interviewing for ML engineering roles and want to say you have trained end-to-end. Run it if you want domain-specific intuition by substituting your own dataset for FineWeb-EDU.
Skip it if you need a model to deploy to users — use Llama 3.3 or Qwen3 instead. Skip it if you want to understand transformers conceptually without running anything; the code walkthrough guides cover that without the GPU bill.
The model nanochat produces is not the deliverable. The comprehension is. After 4 hours and $100, you will know why GRPO exists, what midtraining does to loss, why Muon converges faster for matmul weights, and what it means to export a model to GGUF. That knowledge is worth more than the 561M-parameter chat model sitting in your cloud storage. The live nanochat demo is there if you want to see the output before committing the budget.
