Every developer who has built a RAG pipeline over PDFs knows the loop: load document, slice into individual pages, send page one to OCR, clear the cache, send page two, stitch the results back together, and hope the seams don’t break your retrieval. That loop exists because LLM-based OCR decoders have a dirty secret — their memory consumption grows linearly with every token they output. Baidu’s Unlimited-OCR, released June 22 under an MIT license, fixes that at the architecture level: constant KV cache, single inference pass, up to 40 pages in one shot.
The Problem: Linear Memory Growth Kills Long Documents
Standard LLM decoders store a key-value cache for every token decoded. Process one page and the cache is manageable. Process ten pages and it’s grown tenfold. Process a 30-page research paper and you’re either running out of VRAM or watching throughput collapse as each new token attends to an ever-longer history. DeepSeek-OCR addressed this partially with Multi-Head Latent Attention, but the cache still grew. The workaround — page-by-page for-loops — breaks document coherence: tables split across pages lose their structure, and context from page three doesn’t inform the model reading page four.
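To make the linear growth concrete, here is a back-of-the-envelope estimate of KV cache size for a standard decoder. The layer count, KV head count, and head dimension are hypothetical placeholders, not the configuration of DeepSeek-OCR or Unlimited-OCR, and the formula ignores compression schemes such as Multi-Head Latent Attention. Only the shape of the calculation (one key and one value tensor per layer, one entry per token) is standard.

```python
def kv_cache_bytes(num_tokens, num_layers=24, num_kv_heads=16,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for a plain decoder without cache compression.

    All default dimensions are illustrative assumptions, not real model settings.
    bytes_per_value=2 corresponds to BF16/FP16 storage.
    """
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


if __name__ == "__main__":
    # The cache scales in direct proportion to the number of tokens held.
    for tokens in (1_000, 10_000, 30_000):
        mib = kv_cache_bytes(tokens) / 2**20
        print(f"{tokens:>6} tokens -> {mib:,.1f} MiB")
```

Whatever the real dimensions are, the per-token cost is a constant, so ten times the tokens means ten times the cache.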
What Changed: Reference Sliding Window Attention (R-SWA)
Unlimited-OCR replaces all attention layers in its LLM decoder with Reference Sliding Window Attention. The mechanism splits attention into two segments:
- Prefix segment: All visual tokens and the prompt — kept fixed throughout decoding. The full source document stays in context.
- Decode segment: A 128-token sliding window over generated output. The model only needs to remember what it just wrote, not everything it has written.
The result: the KV cache has a fixed ceiling instead of growing with output length. At 6,144 output tokens, Unlimited-OCR runs at roughly 7,847 tokens per second against DeepSeek-OCR’s 5,823, a 35% throughput advantage that compounds as documents get longer. Where DeepSeek-OCR’s latency climbs with sequence length, Unlimited-OCR’s stays flat. One developer on Hacker News reported transcribing a 200-page Japanese grammar PDF on a consumer 4090 in about an hour, with accurate mixed-language handling throughout.
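The two-segment rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not the authors' implementation: the function names are hypothetical, and the choice to count the current token as part of the 128-token window is an assumption.

```python
def rswa_visible_positions(prefix_len, step, window=128):
    """Absolute positions the decode token at index `step` may attend to.

    The prefix (visual tokens + prompt) occupies positions [0, prefix_len).
    Generated tokens follow it; `step` is 0-based within the generated output.
    Assumption: the window includes the current token itself.
    """
    prefix = list(range(prefix_len))           # always fully visible
    first = max(0, step - window + 1)          # oldest generated token still in the window
    recent = [prefix_len + i for i in range(first, step + 1)]
    return prefix + recent


def cache_entries(prefix_len, generated, window=128):
    """Number of KV entries that must be kept after `generated` output tokens."""
    return prefix_len + min(generated, window)


if __name__ == "__main__":
    prefix_len = 1000  # hypothetical prefix size, for illustration only
    for generated in (10, 128, 6144):
        print(generated, cache_entries(prefix_len, generated))
```

Once the output is longer than the window, `cache_entries` stops changing, which is the bounded-cache property the section describes.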
The Benchmark Numbers
Against DeepSeek-OCR on OmniDocBench v1.5, Unlimited-OCR’s improvements are consistent across document types:
- Overall score: 93.23% vs 87.01% (+6.2 points)
- Formulas: 92.61% vs 83.37% (+9.2 points)
- Tables: 90.93% vs 84.97% (+6.0 points)
- Text edit distance: 0.038 vs 0.073 (lower is better)
On OmniDocBench v1.6 it reaches 93.92%, beating DeepSeek-OCR 2 on seven of nine document subcategories. For 20-page documents, edit distance sits at 0.0572; at 40+ pages, it degrades to 0.1069 — usable, if not perfect.
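For readers unfamiliar with the edit distance figures, the sketch below shows one common way to compute a normalized edit distance: Levenshtein distance divided by the length of the longer string. Whether OmniDocBench normalizes in exactly this way is an assumption here; the point is only to show why 0 means a perfect transcription and values near 1 mean almost nothing matched.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1]; the normalization choice is an assumption."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))


if __name__ == "__main__":
    print(normalized_edit_distance("Unlimited-OCR", "Unlimited-0CR"))
```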
How to Run It
The model is 3B parameters, BF16, MIT licensed, and available on HuggingFace. It supports HuggingFace Transformers, vLLM, SGLang (both with OpenAI-compatible APIs), and Docker Model Runner. For PDF input, PyMuPDF handles the page-to-image conversion automatically.
# Multi-page inference, no loop required
from transformers import AutoModel, AutoTokenizer
import torch

# Load the model in BF16 and move it to the GPU
model = AutoModel.from_pretrained(
    'baidu/Unlimited-OCR', trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)

# Pass every page image in a single call
model.infer_multi(
    tokenizer,
    prompt='Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    image_size=1024,
    max_length=32768,
)
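
If you want to produce the page images yourself before calling the model, a minimal PyMuPDF sketch might look like the following. The helper names, the output file naming, and the 150 DPI setting are illustrative assumptions, not values taken from the Unlimited-OCR repository.

```python
from pathlib import Path


def page_image_path(out_dir, index):
    """Zero-padded file name so pages sort in reading order (hypothetical naming)."""
    return Path(out_dir) / f"page{index + 1:04d}.png"


def render_pdf_pages(pdf_path, out_dir, dpi=150):
    """Render each PDF page to a PNG and return the file paths in page order."""
    import fitz  # PyMuPDF; imported lazily so the naming helper works without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page at the chosen DPI
            target = page_image_path(out_dir, index)
            pix.save(str(target))
            paths.append(str(target))
    return paths
```

The returned list can be passed as `image_files` in the call above, for example `render_pdf_pages('paper.pdf', 'pages')` for a hypothetical `paper.pdf`.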
What’s Still Broken
The Hacker News thread drew 446 points and a predictable debate: is OCR solved? The honest answer is no, with asterisks. Joss82, a Parseur founder with a decade building OCR products, put it flatly: “OCR still sucks in 2026.” The failure modes are specific — scientific PDFs converted to LaTeX, electronics datasheets with merged table cells, handwriting, non-standard fonts. These don’t improve much regardless of the decoder architecture.
Unlimited-OCR also has a hard ceiling. The 32K context limit means roughly 40 pages is the practical maximum today — the paper acknowledges this and plans 128K context in future work. For most enterprise document use cases, 40 pages covers a lot, but for legal discovery or full book ingestion it’s still a constraint.
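For documents past that ceiling, one pragmatic fallback is to split the page list into batches of at most 40 and run one inference pass per batch. The helper below is a hypothetical sketch; the 40-page default simply mirrors the practical maximum quoted above, and batch boundaries reintroduce a seam between the last page of one batch and the first page of the next.

```python
def batch_pages(page_files, max_pages=40):
    """Split an ordered list of page images into consecutive batches."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [page_files[i:i + max_pages]
            for i in range(0, len(page_files), max_pages)]


if __name__ == "__main__":
    pages = [f"page{n:04d}.png" for n in range(1, 96)]  # a 95-page document
    for batch in batch_pages(pages):
        print(len(batch), batch[0], batch[-1])
```

This still cuts the number of seams sharply compared with page-by-page processing: a 95-page document has two boundaries instead of 94.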
Why This Matters Beyond OCR
R-SWA is described by the authors as a general-purpose parsing attention mechanism, applicable to automatic speech recognition and machine translation: any task where a decoder needs permanent access to a fixed source while generating a long output sequence. Given that the paper is public, the weights are MIT licensed, and the code integrates with existing inference stacks, expect derivative work quickly.
The for-loop workaround did not persist because the underlying problem was too hard to solve. It persisted because nobody prioritized the decoder memory architecture. Unlimited-OCR did. That’s the actual change.