olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

Published: 09 Jun 2025, Last Modified: 14 Jul 2025 · CODEML@ICML25 · CC BY 4.0
Keywords: pretraining data, vision language models, PDFs, OCR, content extraction
TL;DR: An open-source VLM and efficient toolkit for converting millions of PDFs to plain text at low cost, a dataset for fine-tuning VLMs, and a benchmark for PDF content extraction.
Abstract: PDF documents contain trillions of novel, high-quality tokens valuable for language model training, but their diverse formats and layouts complicate content extraction. Traditional open-source tools yield lower-quality results than vision-language models (VLMs), yet the best VLMs are costly (e.g., over 6,240 USD per million PDF pages for GPT-4o) or inaccessible when working with proprietary documents. We present olmOCR, an open-source toolkit for converting PDFs into clean, linearized plain text in natural reading order while preserving structure such as sections, tables, and equations. Our toolkit uses a fine-tuned 7B VLM trained on 260,000 pages from over 100,000 varied PDFs, including graphics, handwritten text, and poor scans. olmOCR is optimized for large-scale batch processing, converting a million pages for only 176 USD. We find olmOCR outperforms even top VLMs including GPT-4o, Gemini Flash 2, and Qwen-2.5-VL on OLMOCR-BENCH, a curated set of 1,400 challenging PDFs with fine-grained unit tests that remain difficult even for the best tools and VLMs. We openly release all components of olmOCR: our fine-tuned VLM model, training code and data, an efficient inference pipeline that supports vLLM and SGLang backends, and the benchmark.
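Because the released 7B model is reported to be a fine-tune of Qwen2-VL-7B-Instruct, a single page can be transcribed with standard Hugging Face tooling. The sketch below is a minimal illustration under stated assumptions, not the authors' pipeline: the checkpoint name, the plain-English prompt, and the pre-rendered page image are assumptions on my part, whereas the real toolkit builds a document-anchored prompt from the PDF's embedded text and runs batched inference through vLLM or SGLang.

```python
# Minimal single-page sketch (not the batch pipeline described in the paper).
# Assumptions: checkpoint name, prompt wording, and a pre-rendered page image.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "allenai/olmOCR-7B-0225-preview"  # assumed checkpoint name

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
).to(device).eval()
# Reuse the base model's processor (chat template + image preprocessing).
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Render one PDF page to an image beforehand (e.g., with pypdfium2 or poppler).
page_image = Image.open("page_1.png")

# Illustrative prompt only; the toolkit's actual prompt also anchors on text
# extracted from the PDF itself.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe this page into clean plain text "
                                  "in natural reading order."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[page_image], return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens and decode only the newly generated page text.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

For large collections, the 176 USD per million pages figure quoted in the abstract refers to the toolkit's batched inference pipeline over vLLM or SGLang backends, not per-page calls like the one above.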
Submission Number: 42