If your RAG pipeline still treats PDFs as "extract text with PyPDF and hope," you are leaving layout, tables, formulas, and multi-column structure on the floor. MinerU — OpenDataLab's document parsing engine with ~69.7k GitHub stars — exists to fix that: turn complex PDFs and Office documents into LLM-ready Markdown and JSON with headings, tables, formulas, and images preserved.
Version 3.4.0 landed June 18, 2026 with a focused upgrade: PP-OCRv6 for the pipeline backend (~11% OCR accuracy gain on OmniDocBench v1.6), roughly 100% faster OCR processing, and smarter model download / cache reuse. For agent builders, MinerU is increasingly the default ingestion layer before chunking, embedding, and retrieval.
TL;DR
| Detail | MinerU 3.4 |
|---|---|
| Repo | github.com/opendatalab/MinerU |
| Docs | opendatalab.github.io/MinerU |
| Latest release | mineru-3.4.0 (June 2026) |
| Stars / forks | ~69.7k / ~5.9k |
| Inputs | PDF, images, DOCX, PPTX, XLSX |
| Outputs | Markdown, JSON, multimodal formats |
| License | MinerU Open Source License (Apache 2.0–based) |
| Install | uv pip install -U "mineru[all]" |
| CLI | mineru -p <input> -o <output> |
| CPU path | -b pipeline |
Why MinerU Matters for RAG and Agents
Document ingestion is the silent failure mode in most RAG systems. Chunk a badly parsed PDF and you get:
- Tables split across chunks with no header context
- Formulas rendered as garbage Unicode
- Multi-column layouts read in wrong order
- Headers, footers, and page numbers polluting embeddings
MinerU addresses parsing before chunking. It removes headers/footers/page numbers, preserves document structure (headings, lists, paragraphs), converts formulas to LaTeX, tables to HTML, extracts images with captions, and detects scanned PDFs for automatic OCR.
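Because tables arrive as HTML, a chunker can treat each table as one atomic block instead of splitting it mid-row. The following is a minimal sketch of that idea, not part of MinerU: the function name `split_keeping_tables` and the `max_chars` limit are illustrative assumptions, and the regex does not handle nested tables.

```python
import re

# Matches one whole HTML table non-greedily (nested tables are not handled).
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def split_keeping_tables(markdown: str, max_chars: int = 800) -> list[str]:
    """Split Markdown into chunks without ever cutting through an HTML table."""
    # 1. Break the text into paragraph blocks and whole-table blocks.
    blocks: list[str] = []
    pos = 0
    for match in TABLE_RE.finditer(markdown):
        prose = markdown[pos:match.start()]
        blocks.extend(p.strip() for p in prose.split("\n\n") if p.strip())
        blocks.append(match.group(0))  # the table stays one atomic block
        pos = match.end()
    blocks.extend(p.strip() for p in markdown[pos:].split("\n\n") if p.strip())

    # 2. Greedily pack blocks into chunks of at most max_chars.
    #    A single oversized block (such as a large table) becomes its own chunk.
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A table larger than `max_chars` still lands in a single chunk here; whether to summarize or row-split such tables is a separate design decision.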
The project originated during InternLM pre-training — built to solve symbol conversion in scientific literature. That pedigree shows in formula and table handling, where generic text extractors fail.
June 2026 sits in a crowded document-AI moment: Baidu Unlimited-OCR targets one-shot long-horizon parsing; Mistral OCR 4 offers managed API extraction with bounding boxes. MinerU's position: full-stack open ingestion with multiple backends, local deployment, and production routing (mineru-router) — not a single-model demo.
Version 3.4: What Changed (June 18, 2026)
PP-OCRv6 upgrade
The pipeline backend's OCR model moved to PP-OCRv6, improving OCR accuracy by about 11% on OmniDocBench v1.6. Japanese, Traditional Chinese, English, and Latin were removed as separate OCR language options — those scenarios now route through the ch OCR model, simplifying configuration.
~100% OCR speed improvement
MinerU optimized the OCR inference and processing pipeline, roughly doubling OCR throughput — significant for batch document jobs and OCR-heavy scans.
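To translate a percentage speedup into wall-clock time, divide rather than subtract. This small sketch assumes "N% faster" means throughput is multiplied by (1 + N/100), which matches the "roughly doubling" reading above; the function name is illustrative.

```python
def time_after_speedup(baseline_seconds: float, speedup_percent: float) -> float:
    """Wall-clock time after a throughput gain of `speedup_percent`.

    Assumes "N% faster" multiplies throughput by (1 + N/100), so a
    100% gain doubles throughput and halves the time.
    """
    if speedup_percent <= -100:
        raise ValueError("speedup_percent must be greater than -100")
    return baseline_seconds / (1 + speedup_percent / 100)


# A 10-hour OCR batch at a 100% speedup takes 5 hours, not 0.
print(time_after_speedup(10.0, 100))  # 5.0
```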
Model download and cache
- Automatic model source selection on first install based on network environment (HuggingFace, ModelScope, etc.)
- Local cache priority — checks downloaded model files before remote requests
- Reduces repeated downloads across dev/staging/prod environments
See OpenDataLab's Model Source Documentation for configuration details.
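The "local cache priority" behavior follows a familiar cache-first pattern. The sketch below illustrates that pattern in general terms; it is not MinerU's implementation, and `resolve_model` and the caller-supplied `download` function are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def resolve_model(name: str, cache_dir: Path,
                  download: Callable[[str, Path], None]) -> Path:
    """Return the local path for `name`, downloading only on a cache miss."""
    target = cache_dir / name
    if target.exists():
        return target  # cache hit: no remote request is made
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(name, target)  # hypothetical fetcher supplied by the caller
    return target
```

Pointing dev, staging, and prod at a shared or pre-populated cache directory is what avoids the repeated downloads.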
Parsing Backends Compared
MinerU is not one model — it is an orchestration stack with backend selection:
| Backend | Accuracy (OmniDocBench v1.6 E2E) | CPU | GPU | Best for |
|---|---|---|---|---|
| pipeline | 86.47 | ✅ | Optional | Homelab, CPU-only, batch OCR |
| hybrid medium (default) | 95.26 | ❌ | 8GB+ VRAM | Daily production — speed/accuracy balance |
| hybrid high | 95.39 | ❌ | 8GB+ VRAM | Max accuracy, image analysis |
| vlm / vlm-http-client | 95.30 | ❌ | 2GB+ VRAM (client) | OpenAI-compatible remote servers |
Hybrid medium (added in v3.3, now the default) trails high by only 0.13 accuracy points (95.26 vs 95.39) while running 35–220% faster, depending on platform and document type:
| Platform | Text PDF speedup | OCR scenario speedup |
|---|---|---|
| Linux | ~80% | ~35% |
| Windows | ~90% | ~45% |
| macOS | ~220% | ~50% |
Medium does not support image analysis inside documents — switch to effort=high when you need that.
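The backend comparison above reduces to a few rules of thumb. Here is a hedged sketch that encodes them; the returned strings are the row labels from the table, not verified CLI values, and the function and parameter names are assumptions made for illustration.

```python
def choose_backend(vram_gb: float, needs_image_analysis: bool = False,
                   has_remote_vlm_server: bool = False) -> str:
    """Pick a backend label using the thresholds from the comparison table."""
    if vram_gb >= 8:
        # Medium is the default; high is needed for image analysis.
        return "hybrid high" if needs_image_analysis else "hybrid medium"
    if vram_gb >= 2 and has_remote_vlm_server:
        # A thin client that forwards to an OpenAI-compatible server.
        return "vlm-http-client"
    # The only backend listed as running on CPU.
    return "pipeline"
```

For example, `choose_backend(0)` falls back to `"pipeline"`, while `choose_backend(24, needs_image_analysis=True)` selects `"hybrid high"`.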
VLM model: MinerU2.5-Pro
The primary VLM is MinerU2.5-Pro-2605-1.2B (v3.3+) with native multilingual OCR, image/chart parsing, truncated paragraph merging, and cross-page table merging. v3.1.0 added native PPTX and XLSX parsing alongside PDF, DOCX, and images.
Key Features
- Multi-format input: PDF, PNG/JPG, DOCX, PPTX, XLSX
- Layout-aware output: reading order for single/multi-column and complex layouts
- Formula → LaTeX, table → HTML
- OCR: 109 languages; auto-detect scanned/garbled PDFs
- Outputs: NLP Markdown, multimodal Markdown, JSON by reading order, layout/span visualizations
- Interfaces: CLI, FastAPI (mineru-api), Gradio WebUI, mineru-router for multi-GPU load balancing
- Async tasks: POST /tasks for submit/status/result (v3.0+)
- Long documents: sliding-window parsing + streaming disk writes, handling tens of thousands of pages without manual splitting
- Thread-safe multi-threaded inference for high-concurrency production
Quick Start
Install
pip install --upgrade pip
pip install uv
uv pip install -U "mineru[all]"
mineru[all] is the recommended bundle for Windows, Linux, and macOS.
Parse a document (GPU path)
mineru -p document.pdf -o ./output
Parse on CPU only
mineru -p document.pdf -o ./output -b pipeline
Supports single files or directories. Outputs land in structured Markdown/JSON under the output path.
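To call the CLI from a Python ingestion script, a thin wrapper is usually enough. This sketch uses only the flags shown above (`-p`, `-o`, `-b pipeline`); the function names are illustrative, and because the exact output layout is not described here, it simply searches the output directory recursively for Markdown files.

```python
import subprocess
from pathlib import Path


def build_mineru_command(input_path: str, output_dir: str,
                         cpu_only: bool = False) -> list[str]:
    """Build the argument list for the `mineru` CLI."""
    cmd = ["mineru", "-p", input_path, "-o", output_dir]
    if cpu_only:
        cmd += ["-b", "pipeline"]  # CPU path from the quick start
    return cmd


def parse_document(input_path: str, output_dir: str,
                   cpu_only: bool = False) -> list[Path]:
    """Run MinerU and return the Markdown files found under output_dir."""
    # check=True raises CalledProcessError if the CLI exits non-zero.
    subprocess.run(build_mineru_command(input_path, output_dir, cpu_only),
                   check=True)
    return sorted(Path(output_dir).rglob("*.md"))
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.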
Docker
Docker deployment is documented for Linux and Windows WSL2 — macOS should use pip/uv install instead. See Docker deployment docs.
Production: mineru-router and Multi-GPU
mineru-router (v3.0+) provides unified entry deployment across multiple services and GPUs:
- Interfaces fully compatible with mineru-api
- Automatic task load balancing
- Designed for high-concurrency, high-throughput parsing farms
Combined with thread-safe concurrent inference and streaming writes, MinerU 3.x targets enterprise document pipelines — not just one-off CLI conversions. That aligns with Liquid AI LFM2.5-230M's data-extraction positioning: parse at scale upstream, route structured chunks to small edge models downstream.
Hardware Requirements (Summary)
| | pipeline | hybrid / vlm |
|---|---|---|
| OS | Linux 2019+, Windows, macOS 14+ | Same |
| Python | 3.10–3.13 (Windows: 3.10–3.12) | Same |
| RAM | Min 16GB, rec 32GB+ | Min 16GB |
| VRAM | 4GB optional | Min 8GB (hybrid), 2GB (http client) |
| Disk | Min 20GB, SSD recommended | Min 2GB (+ models) |
Pure CPU inference is pipeline-only. Apple Silicon supports GPU acceleration via MPS on supported backends.
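A quick preflight check can catch an unsupported interpreter before a long install. The sketch below encodes only the Python row of the table (3.10 to 3.13, capped at 3.12 on Windows); the function names are illustrative.

```python
import platform
import sys


def python_supported(version: tuple[int, int], system: str) -> bool:
    """Check a (major, minor) version against the Python row of the table."""
    # platform.system() reports "Windows", "Linux", or "Darwin" (macOS).
    upper = (3, 12) if system == "Windows" else (3, 13)
    return (3, 10) <= version <= upper


def check_current_interpreter() -> bool:
    """Apply the check to the interpreter running this script."""
    return python_supported(sys.version_info[:2], platform.system())
```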
MinerU vs Alternatives (June 2026)
| Tool | Strength | Trade-off |
|---|---|---|
| MinerU 3.4 | Full stack, multi-backend, Office formats, router | Heavy install, GPU for best accuracy |
| Unlimited-OCR | One-shot long PDFs, SGLang throughput | Vision-model path, different architecture |
| Mistral OCR 4 | Managed API, bounding boxes, confidence | Not self-hosted weights |
| Generic PyPDF | Fast, trivial | No layout, tables, or formulas |
For RAG specifically, parsed output quality directly affects chunking and retrieval strategy. MinerU's JSON-sorted-by-reading-order output is designed for downstream indexing — or wire parsed Markdown into a Langflow RAG pipeline for visual retriever tuning. At the extreme end of "hard documents," the Vesuvius Challenge applies a similar parse-then-verify loop to carbonized 2,000-year-old scrolls — with papyrologists, not chunkers, as the final gate.
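Reading-order JSON maps naturally onto index records that carry their section heading along. The sketch below assumes a hypothetical schema in which each block is a dict with `type` and `text` keys and headings use the type `title`; check MinerU's output documentation for the real field names before relying on it.

```python
def blocks_to_records(blocks: list[dict], source: str) -> list[dict]:
    """Flatten reading-order blocks into records ready for embedding.

    Assumes a hypothetical schema: {"type": ..., "text": ...} per block.
    """
    records = []
    heading = ""
    for order, block in enumerate(blocks):
        kind = block.get("type", "text")
        text = block.get("text", "").strip()
        if kind == "title":
            heading = text  # later blocks inherit the latest heading
        if not text:
            continue  # skip empty blocks but keep original positions
        records.append({
            "source": source,
            "order": order,
            "type": kind,
            "heading": heading,
            "text": text,
        })
    return records
```

Keeping `order` and `heading` on every record lets a retriever restore context around a hit without re-reading the whole document.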
License Evolution
v3.1.0 (April 2026) moved MinerU from AGPLv3 to the MinerU Open Source License — Apache 2.0–based with additional conditions. The change explicitly targets lower adoption friction for commercial deployments while keeping the codebase open.
v3.0 had already removed dependencies on AGPLv3 models (doclayoutyolo, mfd_yolov8) and a CC-BY-NC-SA layoutreader, cleaning the license stack for enterprise use.
Online Demos (Try Before Deploy)
| Demo | Notes |
|---|---|
| Official web app | Full features, login required |
| OpenDataLab | Same as official |
| ModelScope Gradio | Core parsing, no login |
| HuggingFace Gradio | Core parsing, no login |
MinerU's own docs recommend trying the online demos first: results on complex layouts, scans, and handwriting may still fall short of expectations.
Related ExplainX coverage
| Post | Connection |
|---|---|
| Baidu Unlimited-OCR | Alternative long-horizon parsing approach |
| Mistral OCR 4 | Managed document AI API comparison |
| RAG vs agentic RAG | What to do with parsed documents |
| Liquid AI LFM2.5-230M | Edge extraction after MinerU ingestion |
| arXiv AI-generated errors ban | Why grounded document pipelines matter |
| Vesuvius Challenge scroll read | Extreme document recovery — ML ink detection + human transcription |
Summary
MinerU 3.4 reinforces its role as the default open-source document ingestion engine for LLM workflows: PP-OCRv6 accuracy, doubled OCR speed, smarter model caching, 95%+ hybrid parsing, full Office format support, and mineru-router for production scale.
The ~69.7k stars reflect years of iteration, from InternLM's pre-training needs to today's agent/RAG stacks. If your agents read PDFs, MinerU is the layer to install before you embed a single chunk.
Last updated: June 26, 2026. Version details from github.com/opendatalab/MinerU release mineru-3.4.0 and project README.