Data extractor for PDF invoices - invoice2data

A command line tool and Python library that automates the extraction of key information from invoices to support your accounting process. The template-based approach is flexible enough to be used on other types of business documents as well.

In essence, invoice2data simplifies getting data from invoices by:

  • Automating text extraction — no more manual copying and pasting.

  • Using templates for structure — handles different invoice layouts.

  • Providing structured output — data ready for analysis or further processing.

This makes it a valuable tool for businesses and developers dealing with a large volume of invoices, saving time and reducing manual-entry errors. It:

  1. extracts text from PDF files with a pluggable, cascading backend — pdfium (default, no system deps), pdftotext, text, pdfminer, pdfplumber, or OCR (tesseract, ocrmypdf, docTR, paddleocr, gvision).

  2. searches the extracted text with regular expressions defined in a YAML or JSON-based template system (with an optional AI fallback).

  3. saves results as CSV, JSON or XML, or renames PDF files to match the content.
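The cascading idea in step 1 can be sketched roughly as follows. This is a minimal illustration, not invoice2data's actual implementation: the `extract_text` helper and the fake backends are hypothetical, and it assumes a backend is skipped when it raises an error or returns no usable text.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a backend takes a file path and returns the extracted text.
Backend = Callable[[str], str]


def extract_text(
    path: str, backends: Iterable[Tuple[str, Backend]]
) -> Optional[Tuple[str, str]]:
    """Return (backend_name, text) from the first backend that yields text.

    Assumption for this sketch: a backend that raises or returns only
    whitespace is skipped and the next one in the cascade is tried.
    """
    for name, backend in backends:
        try:
            text = backend(path)
        except Exception:
            continue  # this backend failed, fall through to the next one
        if text and text.strip():
            return name, text
    return None  # no backend produced any text


# Stand-in backends, for illustration only.
def broken_backend(path: str) -> str:
    raise RuntimeError("engine not installed")


def empty_backend(path: str) -> str:
    return "   "  # e.g. a scanned PDF without a text layer


def ocr_backend(path: str) -> str:
    return "Invoice 42"


CASCADE = [("broken", broken_backend), ("empty", empty_backend), ("ocr", ocr_backend)]
```

Calling `extract_text("invoice.pdf", CASCADE)` here falls through the first two stand-ins and returns the text from the third.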

With the flexible template system you can:

  • precisely match the content of PDF files

  • use plugins to match line items and tables

  • define static fields that are the same for every invoice

  • define custom fields needed in your organisation or process

  • define multiple regexes per field (in case the layout or wording changes)

  • define the currency

  • extract invoice-items using the lines-plugin developed by Holger Brunn
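To make a few of these template features concrete, here is a simplified sketch in plain Python. The `TEMPLATE` dictionary, its keys, and the `apply_template` helper are hypothetical and do not reproduce the library's real template schema; they only illustrate keyword matching, static fields, a fixed currency, and several alternative regexes per field.

```python
import re

# Hypothetical, simplified template; not invoice2data's actual schema.
TEMPLATE = {
    "issuer": "Example Hosting",
    "keywords": ["Example Hosting"],  # all must appear for the template to apply
    "static": {"currency": "EUR", "cost_center": "IT"},  # same for every invoice
    "fields": {
        # Several alternatives per field; the first one that matches wins.
        "invoice_number": [r"Invoice No\.\s+(\d+)", r"Invoice #\s*(\d+)"],
        "amount": [r"Total\s+EUR\s+(\d+\.\d{2})"],
    },
}


def apply_template(template, text):
    """Return the extracted fields, or None if the template does not apply."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {"issuer": template["issuer"], **template["static"]}
    for name, patterns in template["fields"].items():
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                result[name] = match.group(1)
                break  # stop at the first alternative that matches
    return result


# Made-up sample texts showing an old and a new wording of the same layout.
OLD_LAYOUT = "Example Hosting\nInvoice No. 1001\nTotal EUR 12.50"
NEW_LAYOUT = "Example Hosting\nInvoice # 1002\nTotal EUR 8.00"
```

Both sample texts resolve to the same field names even though the wording of the invoice number label differs, which is the point of allowing more than one regex per field.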

Go from PDF files to this:

{'issuer': 'QualityHosting', 'amount': 34.73, 'date': datetime.datetime(2014, 5, 7, 0, 0), 'invoice_number': '30064443', 'currency': 'EUR', 'desc': 'Invoice 30064443 from QualityHosting', 'template_name': 'com.qualityhosting.yml'}
{'issuer': 'Amazon EU', 'amount': 35.24, 'date': datetime.datetime(2014, 6, 4, 0, 0), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'currency': 'EUR', 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'issuer': 'Amazon Web Services', 'amount': 4.11, 'date': datetime.datetime(2014, 8, 3, 0, 0), 'invoice_number': '42183017', 'currency': 'USD', 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'issuer': 'Envato', 'amount': 101.0, 'date': datetime.datetime(2015, 1, 28, 0, 0), 'invoice_number': '12429647', 'currency': 'USD', 'desc': 'Invoice 12429647 from Envato'}
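Because the `date` value is a `datetime` object, a result like the ones above cannot be passed to `json.dumps` unchanged. The following sketch shows one way to serialize it with the standard library; the `to_json` helper is hypothetical and independent of the tool's own JSON output.

```python
import datetime
import json

# One of the result dictionaries shown above.
record = {
    "issuer": "Envato",
    "amount": 101.0,
    "date": datetime.datetime(2015, 1, 28, 0, 0),
    "invoice_number": "12429647",
    "currency": "USD",
    "desc": "Invoice 12429647 from Envato",
}


def _encode(value):
    """Fallback encoder: turn dates into ISO 8601 strings."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def to_json(result):
    """Serialize an extraction result (hypothetical helper)."""
    return json.dumps(result, default=_encode, sort_keys=True)
```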

Quickstart

pip install invoice2data
invoice2data invoice.pdf                          # extract -> CSV
invoice2data --output-format json invoice.pdf     # or JSON / XML

As a Python library:

from invoice2data import extract_data

result = extract_data("invoice.pdf")
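For more than one file, a small wrapper along these lines may be convenient. It is only a sketch: `extract_many` is a hypothetical helper, and it assumes that a falsy return value from `extract_data` means nothing could be extracted, which should be checked against the documentation for the installed version.

```python
def extract_many(paths, extract=None):
    """Map each path to its extracted data, or None if nothing was extracted.

    `extract` defaults to invoice2data's `extract_data`; it is imported lazily
    so that another callable can be substituted, for example in tests.
    """
    if extract is None:
        from invoice2data import extract_data as extract
    results = {}
    for path in paths:
        data = extract(path)
        # Assumption: a falsy result means no data could be extracted.
        results[path] = data if data else None
    return results
```

A call such as `extract_many(["january.pdf", "february.pdf"])` would then return one entry per file.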

No system libraries are required by default — the pdfium backend bundles its own engine. Optional backends and extras (poppler, OCR, AI, …) are covered in the installation guide.

Documentation

Full documentation: https://invoice2data.readthedocs.io/

Development

If you are interested in improving this project, have a look at our contributor guide to get you started quickly.

Roadmap and open tasks

  • integrate with online OCR?

  • try to ‘guess’ parameters for new invoice formats

  • apply machine learning to guess new parameters / template creation

  • clean up data per field

  • advanced table parsing with pypdf_table_extraction

Contributors and Credits

Contributions are very welcome. To learn more, see the Contributor Guide.
