Mistral OCR 4 Ships: Bounding Boxes, Not Just Text


Mistral AI released Mistral OCR 4 today, and the headline feature isn’t accuracy — it’s structure. Where earlier versions of Mistral OCR (and most competing services) returned a flat text dump, OCR 4 now returns bounding boxes for every content region, typed block classification, and inline confidence scores alongside the extracted text. That’s a meaningful shift: this is a document intelligence API, not just an OCR service.

The practical consequence is that developers building RAG pipelines, agentic document workflows, and compliance systems no longer need to layer a second LLM pass on top of OCR output to figure out what they’re reading. OCR 4 tells you whether a text block is a table cell, a contract clause, or an equation — before you do anything with it.

Mistral OCR 4 Adds Bounding Boxes, Block Types, and Confidence Scores

The structural upgrade in OCR 4 consists of three new outputs alongside the existing markdown text. Bounding boxes localize every content region spatially, which matters for in-context highlighting, clickable citations, and layout-aware chunking. Block classification tags each region with one of at least seven types, including title, paragraph, table, equation, signature, figure, and caption. Confidence scores are available at page or per-word granularity; they are off by default to keep response payloads small and opt-in when you need them.

These aren’t separate API calls. A single request to Mistral’s OCR endpoint with mistral-ocr-latest returns all of this together. You pass a bbox_annotation_format parameter with a Pydantic or Zod schema to shape structured field extraction, and set confidence_scores_granularity to "word" or "page" when you need reliability signals.

POST /v1/ocr
{
  "model": "mistral-ocr-latest",
  "document": { "type": "document_url", "document_url": "..." },
  "confidence_scores_granularity": "word",
  "bbox_annotation_format": { /* JSON Schema generated from a Pydantic/Zod model */ }
}
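A minimal Python sketch of assembling that request with only the standard library. The host `https://api.mistral.ai`, the bearer-token header, the shape of the `document` object, and the `REGION_SCHEMA` fields are assumptions for illustration; the parameter names come from the request above. The article does not show how the API expects the schema to be wrapped, so it is passed through unchanged here. Nothing is sent over the network until the returned request is handed to `urllib.request.urlopen`.

```python
import json
import urllib.request

API_URL = "https://api.mistral.ai/v1/ocr"  # assumed host; the path is from the request above

# Hypothetical JSON Schema describing fields to extract for each region.
# In practice this would be generated from a Pydantic or Zod model.
REGION_SCHEMA = {
    "type": "object",
    "properties": {
        "block_type": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["block_type"],
}


def build_ocr_request(document_url, api_key, granularity=None, bbox_schema=None):
    """Build (but do not send) a POST request for the OCR endpoint."""
    if granularity not in (None, "word", "page"):
        raise ValueError("granularity must be 'word', 'page', or None")
    payload = {
        "model": "mistral-ocr-latest",
        # Assumed document shape; check the API reference for your input type.
        "document": {"type": "document_url", "document_url": document_url},
    }
    # Confidence scores are opt-in, so the key is only added when requested.
    if granularity is not None:
        payload["confidence_scores_granularity"] = granularity
    if bbox_schema is not None:
        payload["bbox_annotation_format"] = bbox_schema
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Leaving `granularity` unset keeps the payload small, which mirrors the off-by-default behavior described above.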

Three Use Cases This Unlocks

Flat OCR output forced developers to write brittle parsing logic that broke on layout changes. OCR 4’s structured representation changes what’s practical to build. Three scenarios become straightforwardly solvable:

Human-in-the-loop document pipelines. Confidence scores let you auto-approve high-confidence extractions and route low-confidence regions to human review. For invoice processing, contract review, or compliance workflows, this is the threshold-based routing that previously required a second model pass.
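The threshold-based routing described above can be sketched in a few lines of Python. The `Block` shape, its field names, the 0-to-1 confidence range, and the 0.9 threshold are all assumptions for illustration, not the actual response format.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical shape for one extracted region; field names are assumptions."""
    block_type: str
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def route_blocks(blocks, threshold=0.9):
    """Split blocks into auto-approved and needs-review lists by confidence."""
    approved, review = [], []
    for block in blocks:
        (approved if block.confidence >= threshold else review).append(block)
    return approved, review


blocks = [
    Block("table", "Total: 1,240.00", 0.98),
    Block("signature", "J. Doe", 0.61),
]
approved, review = route_blocks(blocks)
print([b.block_type for b in approved])  # ['table']
print([b.block_type for b in review])    # ['signature']
```

In a real pipeline the threshold would likely be tuned per document type against a labeled sample rather than fixed globally.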

Semantic RAG chunking. Chunking PDFs by character count produces semantically incoherent retrieval units — half a table mixed with paragraph text. OCR 4’s block classification lets you chunk by block type: table cells to one handler, equations excluded from retrieval entirely, paragraph text to the embedding pipeline. Better chunks mean better retrieval.
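A rough sketch of block-type-aware chunking in Python follows. The input is assumed to be a list of dicts in reading order with hypothetical `type` and `text` keys; the real response shape may differ. Tables become their own chunks, equations are dropped, a title starts a new text chunk, and everything else is merged into the current text chunk.

```python
def chunk_by_block_type(blocks):
    """Group classified blocks into retrieval chunks.

    blocks: list of dicts with assumed 'type' and 'text' keys, in reading order.
    """
    chunks = []
    text_buffer = []

    def flush():
        # Emit the accumulated text as one chunk, then reset the buffer.
        if text_buffer:
            chunks.append({"kind": "text", "content": "\n\n".join(text_buffer)})
            text_buffer.clear()

    for block in blocks:
        kind = block["type"]
        if kind == "equation":
            continue  # excluded from retrieval entirely
        if kind == "table":
            flush()  # never mix table content with surrounding prose
            chunks.append({"kind": "table", "content": block["text"]})
        elif kind == "title":
            flush()  # a title opens a new section
            text_buffer.append(block["text"])
        else:
            text_buffer.append(block["text"])
    flush()
    return chunks
```

Text chunks would then go to the embedding pipeline, while table chunks could be routed to a separate handler.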

Agentic document workflows. Agents interacting with documents via flat text have to infer structure. With OCR 4’s output, an agent knows it’s reading a signature field versus a terms clause, and can act accordingly — triggering different tools or workflows based on block type.
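One way an agent might act on block types is a simple dispatch table, sketched below in Python. The tool names, handler functions, and the `type`/`text` keys are all hypothetical; they stand in for whatever tools a given agent framework exposes.

```python
def verify_signature(text):
    # Hypothetical tool call for signature fields.
    return {"tool": "signature_check", "input": text}


def review_clause(text):
    # Hypothetical tool call for contract or terms text.
    return {"tool": "clause_review", "input": text}


def parse_table(text):
    # Hypothetical tool call for tabular content.
    return {"tool": "table_parser", "input": text}


TOOLS_BY_BLOCK_TYPE = {
    "signature": verify_signature,
    "paragraph": review_clause,
    "table": parse_table,
}


def plan_action(block):
    """Pick a tool call based on the block's type; skip unknown types."""
    handler = TOOLS_BY_BLOCK_TYPE.get(block["type"])
    if handler is None:
        return {"tool": "skip", "input": block["text"]}
    return handler(block["text"])
```

Keeping the mapping in data rather than in branching logic makes it easy to add a handler when a new block type needs different treatment.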


The Pricing Case Against AWS Textract

Mistral OCR 4 costs $4 per 1,000 pages via the standard API and $2 per 1,000 pages via the batch API. Compare that to AWS Textract’s pricing for structured extraction: $15 per 1,000 pages for tables and $50 per 1,000 pages for forms. For the use cases where OCR 4 competes directly (structured extraction from invoices, contracts, and forms), Mistral’s standard API rate is roughly 4 to 12 times cheaper, and the batch rate doubles that gap.
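The ratios follow directly from the per-1,000-page rates quoted above. A small Python sketch of the arithmetic, assuming one billed page per document and using 50,000 pages per month purely as an example volume:

```python
# Rates in USD per 1,000 pages, as quoted above.
PRICE_PER_1K = {
    "mistral_api": 4.00,
    "mistral_batch": 2.00,
    "textract_tables": 15.00,
    "textract_forms": 50.00,
}


def monthly_cost(pages, *rate_names):
    """Cost of processing `pages` pages with the given features combined."""
    return pages / 1000 * sum(PRICE_PER_1K[name] for name in rate_names)


def price_ratio(expensive, cheap):
    """How many times more the first rate costs than the second."""
    return PRICE_PER_1K[expensive] / PRICE_PER_1K[cheap]


print(price_ratio("textract_tables", "mistral_api"))  # 3.75
print(price_ratio("textract_forms", "mistral_api"))   # 12.5

pages = 50_000  # example volume, one page per document assumed
print(monthly_cost(pages, "mistral_api"))                        # 200.0
print(monthly_cost(pages, "mistral_batch"))                      # 100.0
print(monthly_cost(pages, "textract_tables", "textract_forms"))  # 3250.0
```

Multi-page documents scale every figure by the average page count, so the ratios stay the same while the absolute gap widens.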

The math gets stark at volume. A 2026 document AI cost comparison puts 50,000 invoices per month at roughly $3,250 with AWS Textract, which matches the combined tables and forms rates above at one page per invoice. At the OCR 4 rates above, the same volume comes to about $200 via Mistral’s API or $100 via batch.

The self-hosted option (OCR 4 in a single container, available to enterprise customers) is a further differentiator. The standard managed offerings of AWS Textract, Azure Document Intelligence, and Google Document AI all involve sending documents to the provider’s cloud. For healthcare, finance, and legal teams with data residency requirements, self-hosting changes the conversation entirely.

Before You Switch: Two Caveats

Mistral reports strong benchmark numbers: 85.20 on OlmOCRBench (the top score among tested models), 93.07 on OmniDocBench, and a 72% win rate in independent annotator comparisons. These are impressive, but all three figures are reported by Mistral and come from its own evaluation harness. Third-party verification hasn’t happened yet because the release is too new. Mistral itself recommends running tests on your specific document types before committing, because benchmark artifacts from formatting variations and ground-truth errors affect scores across the board.

There’s also a specific gotcha for developers using Mistral OCR via the Azure marketplace: the confidence_scores_granularity parameter is not supported in the Azure-hosted version. If you’re integrating through Azure, you won’t get confidence scores until Microsoft adds support. This is a confirmed limitation documented in Microsoft’s own Q&A, not a rumor.


Key Takeaways

OCR 4 returns structured document representation — bounding boxes, block type classification, and confidence scores — not just markdown text. This changes what developers can build on top of document ingestion.
The three practical unlocks are human-in-the-loop pipelines (confidence-based routing), semantic RAG chunking (block-type-aware), and agentic document workflows (structure-aware agents).
Pricing for structured extraction is $4/1K pages API and $2/1K batch — 4–12x cheaper than AWS Textract for equivalent use cases. Self-hosted container available for regulated industries.
Benchmark numbers are strong but sourced from Mistral’s own evaluation. Test on your documents before switching pipelines.
If you’re using Mistral OCR via the Azure marketplace, confidence scores aren’t available yet — direct API access is required for full OCR 4 features.

The full release details are on Mistral’s blog. Given the pricing gap and the self-hosted option, this is worth testing against your current document processing stack — especially if you’re paying Textract rates for structured extraction.

ByteBot
