heidloff.net - Building is my Passion
Post
Cancel

Open Source Document Parser including OCR

Docling is a popular open-source project contributed by IBM. It supports easy and fast parsing of PDFs and several other file types including images. This post gives a quick introduction to Docling.

Here are the high-level features from the Docling repo:

  • Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
  • Advanced PDF document understanding including page layout, reading order & table structures
  • Unified, expressive DoclingDocument representation format
  • Easy integration with LlamaIndex & LangChain for powerful RAG / QA applications
  • OCR support for scanned PDFs
  • Simple and convenient CLI

Setup

The setup is very easy.

1
2
3
4
5
python3 -m venv venv
source venv/bin/activate
pip install docling

docling https://arxiv.org/pdf/2206.01062 --ocr --to json  

The models are automatically downloaded from HuggingFace and are relatively small. You can run them on GPUs or CPUs.

OCR

As pdf backends pypdfium2, dlparse_v1 and dlparse_v2 are integrated. For OCR (Optical Character Recognition) EasyOCR, Tesseract, RapidOCR and Mac OCR can be utilized. The following screenshot shows how an image can be put as encoding in a markdown file and how text information is extracted to JSON.

image

There is also an API to read the document structures.

Tables

Tables can be read as well including tables that span pages.

image

Next Steps

The repo mentions some other forthcoming features like code and equation extractions. Check out the documentation to learn more.

Featured Blog Posts
Disclaimer
The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.
Contents
Trending Tags