Open Source Document Parser including OCR

Docling is a popular open-source project contributed by IBM. It supports easy and fast parsing of PDFs and several other file types including images. This post gives a quick introduction to Docling.

Here are the high-level features from the Docling repo:

Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
Advanced PDF document understanding including page layout, reading order & table structures
Unified, expressive DoclingDocument representation format
Easy integration with LlamaIndex & LangChain for powerful RAG / QA applications
OCR support for scanned PDFs
Simple and convenient CLI

Setup

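Setup takes a virtual environment and a single pip install. The last command below converts a sample arXiv paper to JSON with OCR enabled.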

  
python3 -m venv venv
source venv/bin/activate
pip install docling

docling https://arxiv.org/pdf/2206.01062 --ocr --to json  

The models are downloaded automatically from Hugging Face and are relatively small. They run on either GPUs or CPUs.
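The same conversion can also be driven from Python. The following is a minimal sketch, assuming the DocumentConverter class and the export_to_dict method described in the Docling documentation. The function name convert_to_json and its converter parameter are hypothetical helpers introduced here for illustration.

```python
import json


def convert_to_json(source, converter=None):
    """Convert a document (path or URL) and return it as a JSON string.

    `converter` is a hypothetical injection point: any object with a
    convert(source) method whose result exposes `.document` will do.
    When it is omitted, Docling's DocumentConverter is created lazily,
    so Docling is only imported if it is actually needed.
    """
    if converter is None:
        # Assumes `pip install docling` has been run in this environment.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(source)
    # export_to_dict() returns the parsed document as plain Python data.
    return json.dumps(result.document.export_to_dict(), indent=2)
```

Calling convert_to_json("https://arxiv.org/pdf/2206.01062") would roughly correspond to the CLI call above with --to json. OCR settings are left at the library defaults in this sketch.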

OCR

Docling integrates pypdfium2, dlparse_v1 and dlparse_v2 as PDF backends. For OCR (Optical Character Recognition), it can use EasyOCR, Tesseract, RapidOCR and Mac OCR. The following screenshot shows how an image can be embedded in encoded form in a Markdown file and how text information is extracted to JSON.
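Choosing an OCR engine from Python might look like the sketch below. It assumes the option classes (EasyOcrOptions, TesseractOcrOptions, RapidOcrOptions, OcrMacOptions) and PdfPipelineOptions live in docling.datamodel.pipeline_options, as in the Docling examples at the time of writing. The short engine keys and both helper functions are hypothetical names made up for this post, and each engine may need its own extra installation.

```python
# Option class names in docling.datamodel.pipeline_options, keyed by
# short engine names chosen for this sketch (the keys are hypothetical).
OCR_OPTION_CLASSES = {
    "easyocr": "EasyOcrOptions",
    "tesseract": "TesseractOcrOptions",
    "rapidocr": "RapidOcrOptions",
    "ocrmac": "OcrMacOptions",
}


def ocr_option_class_name(engine):
    """Return the option class name for an engine, or raise ValueError."""
    try:
        return OCR_OPTION_CLASSES[engine]
    except KeyError:
        raise ValueError(f"unknown OCR engine: {engine!r}") from None


def build_ocr_converter(engine="easyocr"):
    """Build a DocumentConverter that runs the chosen OCR engine on PDFs."""
    # Imported lazily so the helpers above work without Docling installed.
    from docling.datamodel import pipeline_options as po
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = po.PdfPipelineOptions()
    options.do_ocr = True
    # Look up the option class by name and instantiate it with defaults.
    options.ocr_options = getattr(po, ocr_option_class_name(engine))()

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

A converter built with build_ocr_converter("tesseract") would then be used like any other, by calling its convert method with a path or URL.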

There is also a Python API for reading the document structure.
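As a rough illustration of walking that structure, the sketch below relies on the iterate_items method of the DoclingDocument object, which yields (item, level) pairs in reading order. The helper name outline is hypothetical, and the handling of items without text is an assumption about how non-text items such as tables and pictures behave.

```python
def outline(document):
    """Return (level, label, text) tuples in reading order.

    `document` is expected to behave like a DoclingDocument: its
    iterate_items() method yields (item, level) pairs.
    """
    rows = []
    for item, level in document.iterate_items():
        # Labels are enum members in Docling; fall back to str() otherwise.
        label = getattr(item.label, "value", str(item.label))
        # Assumption: only text-like items carry a .text attribute,
        # so anything else is recorded with an empty string.
        text = getattr(item, "text", "")
        rows.append((level, label, text))
    return rows
```

Given the result of a conversion, outline(result.document) would produce a flat list that is easy to print or filter, for example to keep only section headers.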

Tables

Tables can be read as well, including tables that span pages.
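One way to get the tables out is sketched below. It assumes the document object exposes a tables list and that each table offers export_to_dataframe, as in Docling's table export example. Newer releases may also accept the parent document as an argument to that method. The function name export_tables_to_csv and the file naming scheme are hypothetical.

```python
from pathlib import Path


def export_tables_to_csv(document, out_dir):
    """Write every table in the document to its own CSV file.

    Returns the list of written paths, in document order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    for index, table in enumerate(document.tables, start=1):
        # Each table is turned into a pandas DataFrame first.
        frame = table.export_to_dataframe()
        path = out_dir / f"table-{index}.csv"
        frame.to_csv(path)
        paths.append(path)
    return paths
```

With a converted document in hand, export_tables_to_csv(result.document, "tables") would leave one CSV file per detected table in the tables directory.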

Next Steps

The repo mentions other forthcoming features, such as code and equation extraction. Check out the documentation to learn more.