Baidu shipped Unlimited-OCR on June 22, 2026 — and it collected 1.8k GitHub stars in under 24 hours. The model tackles one of the most persistent pain points in document AI: parsing entire PDFs and multi-page scans in a single forward pass, without chunking the input or stitching the output back together afterward.
Mistral OCR 4 landed the next day (full guide) with the opposite trade-off: managed API, bounding boxes, block classification, and confidence scores — but not open weights. The two releases frame the June 2026 document-AI split: self-hosted long-horizon parsing vs structured managed extraction.
The arXiv paper dropped the same day. The model is live on Hugging Face and ModelScope, and the full inference code — including a bundled SGLang wheel — is at github.com/baidu/Unlimited-OCR.
What problem does it solve
Most OCR and document parsing pipelines have a hard limit: they process one page or one fixed-size image at a time, then glue the outputs together. That stitching step is where errors compound. A table that spans two pages gets split. A footnote reference loses its anchor. Layout context that spans multiple sections disappears.
Unlimited-OCR's central claim is long-horizon parsing — treating an entire document as a single sequence and maintaining structural context across pages. The project frames itself as pushing Deepseek-OCR further, building on the ngram-based repetition suppression that made Deepseek-OCR reliable on dense text.
Two inference modes: gundam and base
The model ships with two named configurations:
| Config | image_size | crop_mode | Best for |
|---|---|---|---|
| gundam | 640 | True | Single images, fast throughput |
| base | 1024 | False | Multi-page docs, PDFs, full fidelity |
Gundam trades resolution for speed by cropping aggressively. Base preserves full image size for documents where layout and density matter — scientific papers, financial reports, legal filings.
Claude for Work
Use Claude as a thought partner for writing, research & decisions — no coding required. 2 live sessions with Yash Thakker.
Claude for Work is a 2-day live workshop on using Claude to supercharge your daily work — writing, research, analysis, and decision-making — without any coding required. Learn how to set up Claude Projects with custom instructions, run deep-research sprints, co-write documents that sound like you, and build repeatable prompt systems for your team. August 1–2, 2026. Hosted by Yash Thakker, founder of AISOLO Technologies, instructor to 350,000+ students.
Includes 1-year access to all session recordings, a personal prompt library, Discord community access, and a certificate of completion. No coding or technical background required. Designed for managers, marketers, founders, and writers.
Running it with Transformers
The simplest path uses Hugging Face Transformers with bfloat16 on a CUDA GPU:
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)
model = AutoModel.from_pretrained(
'baidu/Unlimited-OCR',
trust_remote_code=True,
use_safetensors=True,
torch_dtype=torch.bfloat16,
).eval().cuda()
# Single image — gundam config
model.infer(
tokenizer,
prompt='<image>document parsing.',
image_file='your_image.jpg',
output_path='./output',
base_size=1024, crop_mode=True,
max_length=32768,
no_repeat_ngram_size=35, ngram_window=128,
save_results=True,
)
# Multi-page PDF
model.infer_multi(
tokenizer,
prompt='<image>Multi page parsing.',
image_files=['page1.png', 'page2.png', 'page3.png'],
output_path='./output',
image_size=1024,
max_length=32768,
no_repeat_ngram_size=35, ngram_window=1024,
save_results=True,
)
no_repeat_ngram_size=35 and ngram_window are the repetition-suppression parameters inherited from the Deepseek-OCR lineage — they are what stops the model from looping on dense repeated patterns in tables and forms.
PDF-native workflow
The repo ships a PyMuPDF helper that converts PDF pages to PNG at 300 DPI before feeding them to infer_multi:
import tempfile, fitz
def pdf_to_images(pdf_path, dpi=300):
doc = fitz.open(pdf_path)
tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
mat = fitz.Matrix(dpi / 72, dpi / 72)
paths = []
for i, page in enumerate(doc):
out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
page.get_pixmap(matrix=mat).save(out)
paths.append(out)
doc.close()
return paths
model.infer_multi(
tokenizer,
prompt='<image>Multi page parsing.',
image_files=pdf_to_images('your_doc.pdf', dpi=300),
output_path='./output',
image_size=1024,
max_length=32768,
no_repeat_ngram_size=35, ngram_window=1024,
save_results=True,
)
High-throughput with SGLang
For production workloads, the repository bundles an SGLang wheel that runs an OpenAI-compatible API server with streaming support and concurrent request handling:
# Start the server
python -m sglang.launch_server \
--model baidu/Unlimited-OCR \
--served-model-name Unlimited-OCR \
--attention-backend fa3 \
--context-length 32768 \
--enable-custom-logit-processor \
--host 0.0.0.0 \
--port 10000
Clients send streaming requests to http://localhost:10000/v1/chat/completions using standard multimodal message format. The server accepts images_config.image_mode (gundam or base) and custom_params for ngram_size and window_size.
For batch jobs, infer.py starts the SGLang server automatically and dispatches concurrent requests:
# Image directory
python infer.py \
--image_dir ./examples/images \
--output_dir ./outputs \
--concurrency 8 \
--image_mode gundam
# PDF
python infer.py \
--pdf ./examples/document.pdf \
--output_dir ./outputs \
--concurrency 8 \
--image_mode gundam
The --concurrency flag controls how many pages are processed in parallel — useful for large PDF batches.
What makes the ngram suppression significant
One of the recurring failures of long-context OCR models is repetition: the model starts looping on a header, a table row, or a footer as it loses track of what it has already generated. Deepseek-OCR introduced no_repeat_ngram_size as a hard constraint at the logit level. Unlimited-OCR inherits this and extends ngram_window — so the constraint is applied across a sliding window rather than the full context, which becomes important when documents are hundreds of pages long and exact repetition from chapter to chapter is legitimate.
Who is it for
Legal and compliance teams parsing dense contracts, regulatory filings, and multi-page agreements where a missed clause is a liability.
Finance and accounting extracting structured data from annual reports, balance sheets, and multi-table PDFs.
Research and academia digitising scanned papers, dissertations, and archival documents where standard OCR breaks on equations, footnotes, and mixed-column layouts.
Developers building document pipelines who need a reliable open-weight model they can self-host without per-page API costs.
How it compares to the alternatives
| Model | Multi-page support | Open weights | PDF native | Context length | Bboxes / confidence |
|---|---|---|---|---|---|
| Unlimited-OCR | ✅ infer_multi | ✅ MIT | ✅ via PyMuPDF | 32,768 | ❌ |
| Mistral OCR 4 | API per doc | Enterprise self-host | ✅ native | API | ✅ |
| Deepseek-OCR | Single image | ✅ | ❌ | Shorter | ❌ |
| Deepseek-OCR-2 | Limited | ✅ | ❌ | Longer | ❌ |
| GPT-4o Vision | Page-by-page API | ❌ | Via preprocessing | API limit | Partial |
The MIT licence and self-hostable weights make it a credible alternative for teams that cannot send documents to third-party APIs for compliance or cost reasons.
Getting started
# Install dependencies
pip install torch==2.10.0 torchvision==0.25.0 transformers==4.57.1
pip install Pillow matplotlib einops addict easydict pymupdf psutil
# Pull model and run
python -c "
from transformers import AutoModel, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)
model = AutoModel.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()
model.infer(tokenizer, prompt='<image>document parsing.', image_file='test.jpg', output_path='./out', base_size=1024, crop_mode=True, max_length=32768, no_repeat_ngram_size=35, ngram_window=128, save_results=True)
"
The paper is on arXiv at 2606.23050. The model is at baidu/Unlimited-OCR on Hugging Face. The code — including the SGLang wheel and infer.py batch runner — is at github.com/baidu/Unlimited-OCR.
For teams already using Deepseek-OCR or page-by-page vision APIs, Unlimited-OCR is worth a close look this week. If you need bounding boxes, typed blocks, and a managed Document AI layer instead, read Mistral OCR 4: bounding boxes and API guide. For visual retrieval without text extraction, see PixelRAG.
Related ExplainX guides
- Mistral OCR 4: bounding boxes, Document AI, and API — managed structured extraction (released June 23)
- PixelRAG: visual RAG from screenshots — skip text parsing entirely
- RAG vs agentic RAG — chunking strategies for parsed documents
- What are embeddings and vector search? — indexing extracted text
- Closed source vs open source AI alternatives — when to self-host vs use APIs