Data extractor for PDF invoices - invoice2data

A command line tool and Python library that automates the extraction of key information from invoices to support your accounting process. The library is very flexible and can be used on other types of business documents as well.

In essence, invoice2data simplifies getting data from invoices by:

Automating text extraction — no more manual copying and pasting.
Using templates for structure — handles different invoice layouts.
Providing structured output — data ready for analysis or further processing.

This makes it a valuable tool for businesses and developers dealing with a large volume of invoices, saving time and reducing manual-entry errors. It:

extracts text from PDF files with a pluggable, cascading backend — pdfium (default, no system deps), pdftotext, text, pdfminer, pdfplumber, or OCR (tesseract, ocrmypdf, docTR, paddleocr, gvision).
searches for regex in the result using a YAML or JSON-based template system (with an optional AI fallback).
saves results as CSV, JSON or XML, or renames PDF files to match the content.
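
The three steps above can be sketched in plain Python using only the standard library. This illustrates the idea and is not invoice2data's implementation: the backend functions, `first_text`, `FIELD_PATTERNS`, `extract_fields` and `to_csv` are hypothetical names, and the sample text is invented.

```python
import csv
import io
import re


# Hypothetical backends: each returns extracted text, or "" when it fails.
def empty_backend(path):
    return ""  # stands in for a backend that finds no text layer


def sample_backend(path):
    # Invented sample text standing in for real extractor output.
    return "Invoice No: 12345\nTotal: 99.50 EUR\n"


def first_text(path, backends):
    """Cascade: try each backend in order, return the first non-empty text."""
    for backend in backends:
        text = backend(path)
        if text.strip():
            return text
    return ""


# Hypothetical field patterns; real templates are YAML or JSON files.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice No:\s*(\d+)",
    "amount": r"Total:\s*(\d+\.\d{2})",
}


def extract_fields(text, patterns):
    """Run each regex over the text and keep the first capture group."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            fields[name] = match.group(1)
    return fields


def to_csv(rows):
    """Serialise a non-empty list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = first_text("invoice.pdf", [empty_backend, sample_backend])
fields = extract_fields(text, FIELD_PATTERNS)
print(to_csv([fields]))
```

Keeping each stage a plain function makes it easy to reorder or swap backends without touching the matching or output code.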

With the flexible template system you can:

precisely match content in PDF files
use plugins to match line items and tables
define static fields that are the same for every invoice
define custom fields needed in your organisation or process
have multiple regex per field (if layout or wording changes)
define currency
extract invoice-items using the lines-plugin developed by Holger Brunn
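
As a rough illustration of static fields and of several regexes per field, the sketch below applies a template to two layouts of the same invoice. The template is written as a Python dict for brevity; its key names, `TEMPLATE`, `apply_template` and the sample texts are assumptions made for this example and do not reproduce invoice2data's actual template schema.

```python
import re

# Hypothetical template, written as a dict for illustration only;
# the key names are assumptions, not invoice2data's template schema.
TEMPLATE = {
    # Static fields are copied into every result unchanged.
    "static": {"issuer": "Example Corp", "currency": "EUR"},
    "fields": {
        # Several patterns per field: tried in order, first match wins.
        "invoice_number": [
            r"Invoice No[.:]\s*(\w+)",
            r"Rechnung Nr[.:]\s*(\w+)",
        ],
        "amount": [
            r"Total:\s*(\d+\.\d{2})",
            r"Amount due:\s*(\d+\.\d{2})",
        ],
    },
}


def apply_template(text, template):
    """Return static fields plus the first regex match for each field."""
    result = dict(template["static"])
    for name, patterns in template["fields"].items():
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                result[name] = match.group(1)
                break  # stop at the first pattern that matches
    return result


# Invented sample texts: the same supplier with changed wording.
old_layout = "Invoice No: A17\nTotal: 12.00"
new_layout = "Rechnung Nr. B42\nAmount due: 8.50"

print(apply_template(old_layout, TEMPLATE))
print(apply_template(new_layout, TEMPLATE))
```

Listing alternatives per field means a supplier changing its wording only requires adding a pattern, not writing a second template.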

Go from PDF files to this:

{'issuer': 'QualityHosting', 'amount': 34.73, 'date': datetime.datetime(2014, 5, 7, 0, 0), 'invoice_number': '30064443', 'currency': 'EUR', 'desc': 'Invoice 30064443 from QualityHosting', 'template_name': 'com.qualityhosting.yml'}
{'issuer': 'Amazon EU', 'amount': 35.24, 'date': datetime.datetime(2014, 6, 4, 0, 0), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'currency': 'EUR', 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'issuer': 'Amazon Web Services', 'amount': 4.11, 'date': datetime.datetime(2014, 8, 3, 0, 0), 'invoice_number': '42183017', 'currency': 'USD', 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'issuer': 'Envato', 'amount': 101.0, 'date': datetime.datetime(2015, 1, 28, 0, 0), 'invoice_number': '12429647', 'currency': 'USD', 'desc': 'Invoice 12429647 from Envato'}
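
Because the `date` value is a `datetime` object, a record like the ones above cannot be passed to `json.dumps` directly. A minimal sketch of one way to handle that in your own post-processing, assuming the record is a plain dict (the `encode_date` helper is a hypothetical name, and the record is a shortened copy of the first example):

```python
import datetime
import json

# Shortened copy of the first example record shown above.
record = {
    "issuer": "QualityHosting",
    "amount": 34.73,
    "date": datetime.datetime(2014, 5, 7, 0, 0),
    "invoice_number": "30064443",
    "currency": "EUR",
}


def encode_date(value):
    """Fallback for json.dumps: turn dates into ISO 8601 strings."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    raise TypeError(f"Cannot serialise {type(value).__name__}")


line = json.dumps(record, default=encode_date)
print(line)
```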

Quickstart

pip install invoice2data
invoice2data invoice.pdf                          # extract -> CSV
invoice2data --output-format json invoice.pdf     # or JSON / XML

As a Python library:

from invoice2data import extract_data

result = extract_data("invoice.pdf")

No system libraries are required by default — the pdfium backend bundles its own engine. Optional backends and extras (poppler, OCR, AI, …) are covered in the installation guide.

Documentation

Full documentation: https://invoice2data.readthedocs.io/

How it works — the extraction pipeline
Installation — backends, OCR and optional extras
Usage — all CLI options and common tasks
Template creation — write templates for your invoices
Recommended fields — the canonical output schema
AI features — optional LLM fallback & template generation
FAQ — including a comparison with other tools

Development

If you are interested in improving this project, have a look at our contributor guide to get you started quickly.

Roadmap and open tasks

integrate with online OCR?
try to ‘guess’ parameters for new invoice formats
apply machine learning to guess new parameters and create templates
data cleanup per field
advanced table parsing with pypdf_table_extraction

Maintainers

Contributors and Credits

Harshit Joshi: contributed as a Google Summer of Code student.
Holger Brunn: added support for parsing invoice items.

Contributions are very welcome. To learn more, see the Contributor Guide.

Used By

Odoo, OCA module account_invoice_import_invoice2data