Reference

API reference for using invoice2data as a Python library. For the command-line interface see the usage page.

Library API

invoice2data.extract_data(invoicefile, templates=None, input_module=None, ai_fallback=False, raise_on_error=False)

Extracts structured data from PDF/image invoices.

This function uses the text extracted from a PDF file or image and pre-defined regex templates to find structured data.

Loads the templates itself when none are passed. Required fields are taken from the matching template.

Parameters:
  • invoicefile (str) – Path to the invoice file (PDF, JPEG, or PNG).

  • templates (list[InvoiceTemplate] | None) – List of InvoiceTemplate instances. Templates are loaded with the read_templates function in loader.py.

  • input_module (Any, optional) – Backend used to extract text from the given invoicefile, as a module or a registry name (e.g. ‘pdftotext’, ‘pdfium’, ‘pdfminer’, ‘tesseract’, ‘text’). When None (the default), a cascade of backends (DEFAULT_INPUT_READERS) is tried in order until one yields a template match with all required fields.

  • ai_fallback (bool, optional) – When True and no template matches (or every match is incomplete) and OCR does not help, extract fields with the configured AI provider (see INVOICE2DATA_AI_* env vars). Result is tagged extraction_method: "ai". Opt-in; defaults to False.

  • raise_on_error (bool, optional) – When True, raise a typed InvoiceProcessingError on failure instead of returning {}: RequiredFieldsMissingError when a template matched but a required field could not be parsed, otherwise NoTemplateFoundError. Defaults to False (the historical {} contract).

Returns:

Extracted and matched fields, or an empty dict {} if text extraction fails or no template matches (unless raise_on_error is set).

Return type:

dict[str, Any]

Raises:

InvoiceProcessingError – When raise_on_error is True and extraction fails (RequiredFieldsMissingError or NoTemplateFoundError).

Notes

Import the required input_module when using invoice2data as a library. A template may pin the backend it was authored for with a top-level input_module: key; that backend is then used for that template regardless of which one matched it first.
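The cascade and the template pin described above can be pictured with a small stand-alone sketch. It does not import invoice2data and is not the library's implementation: the readers dict, the template dict layout, and the helpers parse_fields and first_complete_match are hypothetical, and every field is treated as required for simplicity.

```python
import re


def parse_fields(template, text):
    """Apply each field regex; keep the first capture group of every hit."""
    found = {}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


def first_complete_match(path, readers, templates):
    """Return (reader_name, fields) for the first complete match, else None."""
    for name, read in readers.items():
        text = read(path)
        for template in templates:
            if not all(keyword in text for keyword in template["keywords"]):
                continue
            # Assumption: a pinned template is parsed from its own backend's text.
            used = template.get("input_module", name)
            if used not in readers:
                used = name
            source = text if used == name else readers[used](path)
            fields = parse_fields(template, source)
            if set(fields) == set(template["fields"]):
                return used, fields
    return None


# Two fake backends: only "layout" produces text the regexes can parse.
readers = {
    "fast": lambda path: "ACME total 10",
    "layout": lambda path: "ACME\nInvoice: 42\nTotal: 10",
}
template = {
    "keywords": ["ACME"],
    "fields": {"invoice_number": r"Invoice: (\d+)", "amount": r"Total: (\d+)"},
}
result = first_complete_match("invoice.pdf", readers, [template])
pinned_result = first_complete_match(
    "invoice.pdf", readers, [{**template, "input_module": "layout"}]
)
```

In the unpinned case the incomplete match from the first reader is skipped and the second reader supplies the result; in the pinned case the template is parsed from its pinned reader as soon as its keywords match.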

See also

read_templates: Function to load templates.

InvoiceTemplate: Class representing a single invoice template.

Examples

When using invoice2data as a library:

>>> from invoice2data import extract_data
>>> from invoice2data.input import pdftotext
>>> extract_data("./tests/compare/oyo.pdf", None, pdftotext)
{'issuer': 'OYO', 'template_name': 'com.oyo.invoice.yml', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0), 'invoice_number': 'IBZY2087', 'currency': 'INR', 'hotel_details': ' OYO 4189 Resort Nanganallur', 'date_check_in': datetime.datetime(2017, 12, 31, 0, 0), 'date_check_out': datetime.datetime(2018, 1, 1, 0, 0), 'amount_rooms': 1.0, 'booking_id': 'IBZY2087', 'payment_method': 'Cash at Hotel', 'gstin': '06AABCO6063D1ZQ', 'cin': 'U63090DL2012PTC231770', 'desc': 'Invoice from OYO'}

Load templates with read_templates (documented under Extract → loader).

Exceptions

By default extract_data returns {} on failure. Pass raise_on_error=True to get a typed exception instead, so a caller can tell why extraction failed:

from invoice2data import extract_data, NoTemplateFoundError, RequiredFieldsMissingError

try:
    data = extract_data("invoice.pdf", raise_on_error=True)
except RequiredFieldsMissingError as exc:
    print("matched a template but missing:", exc.fields)
except NoTemplateFoundError:
    print("no template matched")

Typed exceptions for invoice2data (issue #190).

By default invoice2data.extract_data() returns {} on failure (the historical contract). Pass raise_on_error=True to get one of these instead, so a library caller can tell why extraction failed and show a useful message.

exception invoice2data.exceptions.InvoiceProcessingError

Base class for invoice2data extraction failures.

Only raised when extract_data(..., raise_on_error=True).

exception invoice2data.exceptions.NoTemplateFoundError

No template matched the document under any input backend.

exception invoice2data.exceptions.RequiredFieldsMissingError(fields, template_name=None)

A template matched but one or more required fields could not be parsed.

Subclasses ValueError so the input-backend cascade’s existing except ValueError retry handling keeps working unchanged.

Parameters:
  • fields (Iterable[str]) – Required field names that could not be parsed.

  • template_name (str | None) – The matched template’s name, when known.

Return type:

None

fields

The required field names that could not be parsed.

Type:

set[str]

template_name

The template that matched, when known.

Type:

str | None
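To make the documented hierarchy concrete, here is a stand-alone mirror of the three exceptions. These classes are stand-ins written for illustration, not the library's source; the base-class order of RequiredFieldsMissingError and the helper describe_failure are assumptions.

```python
class InvoiceProcessingError(Exception):
    """Stand-in for the documented base class."""


class NoTemplateFoundError(InvoiceProcessingError):
    """Stand-in: no template matched under any backend."""


class RequiredFieldsMissingError(InvoiceProcessingError, ValueError):
    """Stand-in: a template matched but required fields were not parsed."""

    def __init__(self, fields, template_name=None):
        self.fields = set(fields)  # documented attribute type: set[str]
        self.template_name = template_name
        super().__init__(f"missing required fields: {sorted(self.fields)}")


def describe_failure(exc):
    """Turn a failure into a short message a caller could show to a user."""
    if isinstance(exc, RequiredFieldsMissingError):
        missing = ", ".join(sorted(exc.fields))
        return f"{exc.template_name} matched, missing: {missing}"
    if isinstance(exc, NoTemplateFoundError):
        return "no template matched"
    return "extraction failed"


error = RequiredFieldsMissingError(
    ["date", "amount", "date"], template_name="com.example.yml"
)
```

Because the stand-in also derives from ValueError, an existing except ValueError handler still catches it, which is the compatibility point made above.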

Input modules

invoice2data resolves a backend by name or module object. When none is forced it tries an ordered cascade (see How It Works) and falls back to OCR. Backends expose a common interface; those backed by optional dependencies self-exclude via is_available().

Backend interface and registry

Input (text-extraction) backends and their registry.

See __interface__ for the backend contract. INPUT_MODULES maps the stable backend name (the --input-reader value) to its module.

invoice2data.input.INPUT_MODULES: dict[str, ModuleType] = {'doctr': <module 'invoice2data.input.doctr'>, 'gvision': <module 'invoice2data.input.gvision'>, 'hotpdf': <module 'invoice2data.input.hotpdf'>, 'ocrmypdf': <module 'invoice2data.input.ocrmypdf'>, 'paddleocr': <module 'invoice2data.input.paddleocr'>, 'pdfium': <module 'invoice2data.input.pdfium'>, 'pdfminer': <module 'invoice2data.input.pdfminer_wrapper'>, 'pdfoxide': <module 'invoice2data.input.pdfoxide'>, 'pdfplumber': <module 'invoice2data.input.pdfplumber'>, 'pdftotext': <module 'invoice2data.input.pdftotext'>, 'tesseract': <module 'invoice2data.input.tesseract'>, 'text': <module 'invoice2data.input.text'>}

Registry: backend name (the --input-reader value) -> backend module.

invoice2data.input.available_modules()

Return the registered backends whose dependencies are available.

Returns:

Subset of INPUT_MODULES usable in the current environment.

Return type:

dict[str, ModuleType]

invoice2data.input.extract_text(module, invoicefile, area=None)

Extract text with a backend, memoized per (backend, file, mtime, area).

Avoids re-parsing the same document within a run – e.g. when several template fields share one area, or the same full text is requested again. The file mtime is part of the key so a changed file is re-read.

Parameters:
  • module (ModuleType) – An input backend exposing to_text.

  • invoicefile (str) – Path to the document.

  • area (dict[str, Any] | None) – Optional area-restriction passed through.

Returns:

The extracted text.

Return type:

str
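The memoization key described for extract_text can be sketched with the standard library alone. This is an illustration of the idea, not the library's code: cached_text, _text_cache and counting_reader are hypothetical names, and the area dict is turned into a sorted tuple so it can be part of a dictionary key.

```python
import os
import tempfile

_text_cache = {}


def cached_text(backend_name, to_text, path, area=None):
    """Memoize to_text per (backend, file, mtime, area)."""
    area_key = tuple(sorted(area.items())) if area else None
    key = (backend_name, os.path.abspath(path), os.stat(path).st_mtime_ns, area_key)
    if key not in _text_cache:
        _text_cache[key] = to_text(path, area)
    return _text_cache[key]


calls = []


def counting_reader(path, area=None):
    """Fake backend that records every real read."""
    calls.append(area)
    with open(path, encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "invoice.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("first")
    first = cached_text("demo", counting_reader, sample)
    again = cached_text("demo", counting_reader, sample)
    calls_after_repeat = len(calls)

    # Rewrite the file and force a different mtime so the key changes.
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("second")
    stat = os.stat(sample)
    os.utime(sample, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
    changed = cached_text("demo", counting_reader, sample)
    calls_after_change = len(calls)
```

The second call is served from the cache, while the changed mtime produces a new key and a fresh read.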

invoice2data.input.is_available(module)

Return whether a backend’s runtime dependency is available.

Parameters:

module (ModuleType) – An input backend module.

Returns:

The result of the backend’s is_available() if it defines one, otherwise True (the backend is assumed always available).

Return type:

bool

invoice2data.input.supports_area(module)

Return whether a backend supports area-restricted extraction.

Parameters:

module (ModuleType) – An input backend module.

Returns:

True if the backend declares SUPPORTS_AREA = True.

Return type:

bool
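The three registry helpers above follow a simple duck-typing pattern, which the following sketch reproduces on fake backends. The function names backend_is_available, backend_supports_area and usable_backends, and the SimpleNamespace objects standing in for modules, are hypothetical; the real helpers may differ in detail.

```python
from types import SimpleNamespace


def backend_is_available(module):
    """Use the backend's own is_available() when it defines one, else True."""
    check = getattr(module, "is_available", None)
    return bool(check()) if callable(check) else True


def backend_supports_area(module):
    """True only when the backend declares SUPPORTS_AREA = True."""
    return getattr(module, "SUPPORTS_AREA", False) is True


def usable_backends(registry):
    """Subset of the registry whose backends report themselves available."""
    return {name: mod for name, mod in registry.items() if backend_is_available(mod)}


# Fake "modules": one without a check, one with a missing dependency,
# and one that is available and supports area extraction.
registry = {
    "always": SimpleNamespace(to_text=lambda path: ""),
    "missing_dep": SimpleNamespace(to_text=lambda path: "", is_available=lambda: False),
    "regional": SimpleNamespace(
        to_text=lambda path: "", is_available=lambda: True, SUPPORTS_AREA=True
    ),
}
usable = usable_backends(registry)
```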

pdfium (default)

pypdfium2 input module for invoice2data.

A fast, dependency-light PDF text backend (Google’s PDFium bindings). Its text order/spacing differs from poppler’s pdftotext -layout (PDFium has no layout mode), so layout-sensitive templates should still pin input_module: pdftotext. Area (region) extraction is supported in-process via PDFium’s get_text_bounded; note its output is not identical to pdftotext’s area output, so an area template targets one backend’s text, not both.

invoice2data.input.pdfium.SUPPORTS_AREA = True

PDFium can extract a bounded region in-process (see _crop_pages).

invoice2data.input.pdfium.is_available()

Return whether the optional pypdfium2 package is importable.

Returns:

True if pypdfium2 is installed.

Return type:

bool

invoice2data.input.pdfium.to_text(path, area_details=None, **kwargs)

Extract text from a PDF using pypdfium2.

Parameters:
  • path (str) – Path to the PDF file.

  • area_details (dict[str, Any] | None) – Restrict extraction to a region. Keys (pixels at r dpi, top-left origin): f/l (first/last page), x/y (top-left), W/H (size), r (dpi). Defaults to None (whole document).

  • **kwargs (Any) – Ignored; accepted for backend compatibility.

Returns:

The extracted text, pages joined by newlines.

Return type:

str

pdftotext

Poppler pdftotext input module for invoice2data.

Full-page extraction shells out to pdftotext -layout. Area (region) extraction no longer re-runs pdftotext per area: word positions are read once via pdftotext -bbox-layout (cached per file) and the requested rectangle is cropped in Python, so several area fields on one document cost a single parse.

invoice2data.input.pdftotext.is_available()

Return whether the poppler pdftotext binary is on the PATH.

Returns:

True if pdftotext can be run.

Return type:

bool

invoice2data.input.pdftotext.to_text(path, area_details=None)

Extract text from a PDF file using pdftotext.

Parameters:
  • path (str) – Path to the PDF file.

  • area_details (dict[str, Any] | None, optional) – Restrict extraction to a region. Keys (pixels at r dpi): f/l (first/last page), x/y (top-left), W/H (size), r (resolution dpi). Defaults to None (whole document).

Returns:

The extracted text.

Return type:

str

Raises:
  • FileNotFoundError – If the specified PDF file is not found.

  • OSError – If pdftotext is not installed.

text

text input module for invoice2data.

invoice2data.input.text.to_text(path)

Reads the content of a text file.

Parameters:

path (str) – The path to the text file.

Returns:

The content of the text file.

Return type:

str

pdfplumber

pdfplumber input module for invoice2data.

invoice2data.input.pdfplumber.is_available()

Return whether the optional pdfplumber package is importable.

Returns:

True if pdfplumber is installed.

Return type:

bool

invoice2data.input.pdfplumber.to_text(path, **kwargs)

Extract text from PDF using pdfplumber.

Parameters:
  • path (str) – Path to the PDF file.

  • **kwargs (dict[str, Any]) – Keyword arguments to be passed to pdfplumber.

Returns:

Extracted text from the PDF.

Return type:

str

Raises:

ImportError – If the optional pdfplumber dependency is not installed.

pdfminer

pdfminer input module for invoice2data.

invoice2data.input.pdfminer_wrapper.is_available()

Return whether the optional pdfminer.six package is importable.

Returns:

True if pdfminer is installed.

Return type:

bool

invoice2data.input.pdfminer_wrapper.to_text(path, **kwargs)

Wrapper around pdfminer to extract text from PDF.

Parameters:
  • path (str) – Path to the PDF file.

  • **kwargs (dict[str, Any]) – Keyword arguments to be passed to pdfminer.

Returns:

Extracted text from the PDF.

Return type:

str

pdfoxide

pdf-oxide input module for invoice2data.

A fast Rust-based PDF text backend (pdf_oxide). Its text order/spacing differs from poppler’s pdftotext -layout, so templates tuned for pdftotext may need adjustment.

invoice2data.input.pdfoxide.is_available()

Return whether the optional pdf-oxide package is importable.

Returns:

True if pdf_oxide is installed.

Return type:

bool

invoice2data.input.pdfoxide.to_text(path, **kwargs)

Extract text from a PDF using pdf-oxide.

Parameters:
  • path (str) – Path to the PDF file.

  • **kwargs (dict[str, Any]) – Ignored; accepted for backend compatibility.

Returns:

The extracted text, pages joined by newlines.

Return type:

str

hotpdf

hotpdf input module for invoice2data.

hotpdf is a fast pdfminer.six-based reader. Its plain-text output runs words together more than pdftotext -layout, so templates tuned for pdftotext may need adjustment.

invoice2data.input.hotpdf.is_available()

Return whether the optional hotpdf package is importable.

Returns:

True if hotpdf is installed.

Return type:

bool

invoice2data.input.hotpdf.to_text(path, **kwargs)

Extract text from a PDF using hotpdf.

Parameters:
  • path (str) – Path to the PDF file.

  • **kwargs (dict[str, Any]) – Ignored; accepted for backend compatibility.

Returns:

The extracted text, pages joined by newlines.

Return type:

str

tesseract (OCR)

Tesseract OCR input module for invoice2data.

invoice2data.input.tesseract.is_available()

Return whether the tesseract and ImageMagick binaries are present.

Returns:

True if both tesseract and convert are on the PATH.

Return type:

bool

invoice2data.input.tesseract.to_text(path, area_details=None)

Extract text from image using tesseract OCR.

Parameters:
  • path (str) – Path to the image file.

  • area_details (dict[str, Any] | None, optional) – Specific area in the image to extract text from. Defaults to None (extract from the entire image).

Returns:

The extracted text.

Return type:

str

Raises:
  • FileNotFoundError – If the specified image file is not found.

  • OSError – If Tesseract OCR fails to extract text.

ocrmypdf (OCR)

OCRmyPDF input module for invoice2data.

invoice2data.input.ocrmypdf.RECOMMENDED_SCAN_OPTIONS = {'clean': True, 'deskew': True, 'rotate_pages': True}

Common OCRmyPDF pre-processing knobs. Any of these may be passed through input_reader_config / pre_conf (they are forwarded verbatim to ocrmypdf.ocr) to clean up noisy scans: deskew, clean, clean_final, rotate_pages, remove_background, optimize (0-3, image/size optimization) and oversample (target DPI). The value shown above is a recommended starting set for scanned receipts; spread it into input_reader_config.
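Spreading the recommended set into a configuration might look like this. The dictionary is copied here from the documented value so the snippet runs without invoice2data installed; in real use it would be imported from invoice2data.input.ocrmypdf. The extra optimize and oversample values are arbitrary illustrations.

```python
# Copied from the documented value of RECOMMENDED_SCAN_OPTIONS.
RECOMMENDED_SCAN_OPTIONS = {"clean": True, "deskew": True, "rotate_pages": True}

# Start from the recommended set and override or add individual knobs.
input_reader_config = {
    **RECOMMENDED_SCAN_OPTIONS,
    "optimize": 1,      # 0-3, image/size optimization
    "oversample": 300,  # target DPI
}

# The merged dict is what would be handed to the backend, for example:
# to_text("scan.pdf", input_reader_config=input_reader_config)
```

The spread creates a new dict, so the shared constant is left unchanged.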

invoice2data.input.ocrmypdf.is_available()

Backend availability check (see input.__interface__).

Return type:

bool

invoice2data.input.ocrmypdf.ocrmypdf_available()

Checks if the ocrmypdf module is available.

Returns:

True if ocrmypdf is available, False otherwise.

Return type:

bool

invoice2data.input.ocrmypdf.pre_process_pdf(path, pre_conf=None)

Pre-process a PDF with ocrmypdf, returning the cleaned PDF path.

The output is a deskewed/cleaned/optimized, text-layered PDF – usually smaller than the original. Callers (e.g. an Odoo integration) can use the returned path to attach or replace the stored file for size savings, not just to feed pdftotext. Writes to a unique temp file unless pre_conf sets output_file. Logs a warning if ocrmypdf is not available.

Parameters:
  • path (str) – Path to the PDF invoice file.

  • pre_conf (dict[str, Any] | None, optional) – Settings forwarded to ocrmypdf.ocr (merged over OPTIONS_DEFAULT); pass pre-processing knobs here (see RECOMMENDED_SCAN_OPTIONS). Defaults to None.

Returns:

Path to the processed (cleaned, smaller) PDF, or None if processing fails.

Return type:

str | None

invoice2data.input.ocrmypdf.to_text(path, area_details=None, input_reader_config=None)

Pre-processes PDF files with ocrmypdf before pdftotext parsing.

Ensures OCRmyPDF is installed before attempting to use it. If OCRmyPDF is not available, logs a warning and returns an empty string.

Parameters:
  • path (str) – Path to the PDF invoice file.

  • area_details (dict[str, Any] | None, optional) – Details about the area to extract. Defaults to None.

  • input_reader_config (dict[str, Any] | None, optional) – Settings forwarded to ocrmypdf.ocr – e.g. pre-processing knobs like deskew / clean / rotate_pages / optimize (see RECOMMENDED_SCAN_OPTIONS). Defaults to None.

Returns:

Extracted text from the PDF, or an empty string if OCRmyPDF is not available or processing fails.

Return type:

str

docTR (deep-learning OCR)

docTR (deep-learning OCR) input module for invoice2data.

Local, trained OCR that handles scanned/photographed documents well, usually without manual pre-processing (issue #526). Optional: install with pip install invoice2data[doctr] (pulls in docTR + a torch backend; the model weights download on first use). The OCR predictor is cached after the first call.

docTR reads PDFs and images directly (PDF rendering via pypdfium2 under the hood), so this backend OCRs the whole document and has no area-restricted mode.

invoice2data.input.doctr.SUPPORTS_AREA = False

docTR OCRs the whole document; it has no area-restricted mode.

invoice2data.input.doctr.doctr_available()

Return whether the optional python-doctr package is importable.

Returns:

True if docTR can be imported.

Return type:

bool

invoice2data.input.doctr.is_available()

Backend availability check (see input.__interface__).

Return type:

bool

invoice2data.input.doctr.to_text(path, area_details=None, **kwargs)

Extract text from a PDF or image with docTR OCR.

Parameters:
  • path (str) – Path to the PDF or image file.

  • area_details (dict[str, Any] | None) – Ignored (docTR has no area mode).

  • **kwargs (Any) – Ignored; accepted for backend-interface compatibility.

Returns:

The OCR’d text, or an empty string if docTR is not available.

Return type:

str

PaddleOCR (deep-learning OCR)

PaddleOCR (deep-learning OCR) input module for invoice2data.

Local, trained OCR with very broad language coverage (issue #526). Optional: install with pip install invoice2data[paddleocr] (pulls in paddleocr + paddlepaddle + pypdfium2; model weights download on first use). The OCR engine is cached after the first call.

PaddleOCR works on images, so PDFs are rendered to page images with pypdfium2 first. The whole document is OCR’d; there is no area-restricted mode. The result parser handles both the PaddleOCR 2.x ([box, (text, score)]) and 3.x ({"rec_texts": [...]}) shapes defensively.
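A defensive parser for the two result shapes mentioned above could be written along these lines. This is a sketch, not the module's parser: texts_from_result is a hypothetical helper, and the sample results are hand-written to match the documented shapes rather than produced by PaddleOCR.

```python
def texts_from_result(result):
    """Collect recognized strings from either documented result shape."""
    lines = []
    if isinstance(result, dict):
        # 3.x shape: {"rec_texts": [...]}
        lines.extend(str(text) for text in result.get("rec_texts", []))
    elif isinstance(result, (list, tuple)):
        is_entry = (
            len(result) == 2
            and isinstance(result[1], (list, tuple))
            and len(result[1]) > 0
            and isinstance(result[1][0], str)
        )
        if is_entry:
            # 2.x shape: [box, (text, score)]
            lines.append(result[1][0])
        else:
            # Pages or batches: recurse into each item; None and numbers yield [].
            for item in result:
                lines.extend(texts_from_result(item))
    return lines


box = [[0, 0], [10, 0], [10, 5], [0, 5]]
result_2x = [[[box, ("Invoice 42", 0.99)], [box, ("Total 10", 0.97)]]]
result_3x = [{"rec_texts": ["Invoice 42", "Total 10"]}]
text_2x = "\n".join(texts_from_result(result_2x))
text_3x = "\n".join(texts_from_result(result_3x))
```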

invoice2data.input.paddleocr.SUPPORTS_AREA = False

PaddleOCR OCRs the whole document; it has no area-restricted mode.

invoice2data.input.paddleocr.is_available()

Backend availability check (see input.__interface__).

Return type:

bool

invoice2data.input.paddleocr.paddleocr_available()

Return whether the optional paddleocr package is importable.

Returns:

True if PaddleOCR can be imported.

Return type:

bool

invoice2data.input.paddleocr.to_text(path, area_details=None, **kwargs)

Extract text from a PDF or image with PaddleOCR.

Parameters:
  • path (str) – Path to the PDF or image file.

  • area_details (dict[str, Any] | None) – Ignored (PaddleOCR has no area mode).

  • **kwargs (Any) – Optional lang (PaddleOCR language code, default “en”); other keys are ignored.

Returns:

The OCR’d text, or an empty string if PaddleOCR is not available.

Return type:

str

Google Vision (OCR)

Google Cloud Vision input module for invoice2data.

Uses Cloud Vision’s async DOCUMENT_TEXT_DETECTION staged through Google Cloud Storage, so a GCS bucket is required (set GOOGLE_CLOUD_BUCKET_NAME) plus GOOGLE_APPLICATION_CREDENTIALS.

A modern, bucket-free alternative is Google Document AI (an “OCR processor” run synchronously) — see the OCA module account_invoice_google_document_ai (OCA/account-invoicing) for that approach. Worth considering as a future backend that drops the GCS-bucket setup; it needs a Document AI processor id + the google-cloud-documentai client.

invoice2data.input.gvision.SUPPORTS_AREA = False

Google Vision OCRs the whole document; it has no area-restricted mode.

invoice2data.input.gvision.is_available()

Backend availability check (see input.__interface__).

Return type:

bool

invoice2data.input.gvision.to_text(path, bucket_name=None, language='en')

Sends PDF files to Google Cloud Vision for OCR.

Before using invoice2data, make sure you have the auth JSON path set as the environment variable GOOGLE_APPLICATION_CREDENTIALS.

Parameters:
  • path (str) – Path of the electronic invoice in JPG or PNG format.

  • bucket_name (str | None) – Name of the bucket to use for file storage and results cache. Defaults to “cloud-vision-84893”.

  • language (str, optional) – Language to use for OCR. Defaults to “en”.

Returns:

Extracted text from the image.

Return type:

str

Raises:

OSError – If the google cloud bucket_name is not set.

Output modules

csv

CSV output module for invoice2data.

invoice2data.output.to_csv.write_to_file(data, path, date_format='%Y-%m-%d', lines_mode='json')

Export extracted fields to CSV.

Appends .csv to path if missing and generates a CSV file in the specified directory, otherwise in the current directory.

Parameters:
  • data (list[dict[str, Any]]) – A list of dictionaries of extracted fields. If only a single file was processed, it must be passed as a single-element list.

  • path (str) – CSV file to save output to.

  • date_format (str) – Date format used in the generated file. Defaults to “%Y-%m-%d”.

  • lines_mode (str) – How to render line-item arrays. “json” (default) JSON-encodes lines/tax_lines cells; “explode” writes one row per lines item, repeating invoice-level fields.

Return type:

None

Notes

Provide a filename to the path parameter.

Examples

>>> import datetime
>>> import tempfile
>>> from pathlib import Path
>>> from invoice2data.output import to_csv
>>> data = [{'amount': 123.45, 'date': datetime.datetime(2024, 1, 1)}]
>>> path = Path(tempfile.mkdtemp()) / "invoice.csv"
>>> to_csv.write_to_file(data, str(path))
>>> path.exists()
True
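The difference between the two lines_mode values can be illustrated without the library. The helpers rows_for_csv and to_csv_text below are hypothetical and only approximate the documented behaviour: "json" keeps one row per invoice with list cells JSON-encoded, while "explode" writes one row per lines item and repeats the invoice-level fields.

```python
import csv
import io
import json


def rows_for_csv(invoice, lines_mode="json"):
    """Return the CSV rows for one invoice under the chosen lines_mode."""
    if lines_mode == "explode" and invoice.get("lines"):
        base = {k: v for k, v in invoice.items() if k != "lines"}
        rows = [{**base, **line} for line in invoice["lines"]]
    else:
        rows = [dict(invoice)]
    # Any remaining list/dict cell is JSON-encoded so it fits in one cell.
    return [
        {k: json.dumps(v) if isinstance(v, (list, dict)) else v for k, v in row.items()}
        for row in rows
    ]


def to_csv_text(rows):
    """Render rows with csv.DictWriter, using the first row's keys as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


invoice = {
    "invoice_number": "42",
    "lines": [{"name": "Widget", "qty": 2}, {"name": "Bolt", "qty": 5}],
}
json_rows = rows_for_csv(invoice, "json")
exploded_csv = to_csv_text(rows_for_csv(invoice, "explode"))
```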

json

JSON output module for invoice2data.

invoice2data.output.to_json.format_item(item, date_format)

Format an item for JSON serialization.

Parameters:
  • item (Any) – The item to format.

  • date_format (str) – The date format to use.

Returns:

The formatted item.

Return type:

Any

invoice2data.output.to_json.write_to_file(data, path, date_format='%Y-%m-%d')

Export extracted fields to JSON.

Appends .json to path if missing and generates JSON file in the specified directory, otherwise in the current directory.

Parameters:
  • data (list[dict[str, Any]]) – List of dictionaries of extracted fields.

  • path (str) – JSON file to save output to.

  • date_format (str) – Date format used in the generated file. Defaults to “%Y-%m-%d”.

Return type:

None

Notes

Provide a filename to the path parameter.

Examples

>>> import datetime
>>> import tempfile
>>> from pathlib import Path
>>> from invoice2data.output import to_json
>>> data = [{'amount': 123.45, 'date': datetime.datetime(2024, 1, 1)}]
>>> path = Path(tempfile.mkdtemp()) / "invoice.json"
>>> to_json.write_to_file(data, str(path))
>>> path.exists()
True

xml

XML output module for invoice2data.

invoice2data.output.to_xml.defusedxml_available()

Checks if the defusedxml module is available.

Returns:

True if defusedxml is available, False otherwise.

Return type:

bool

invoice2data.output.to_xml.dict_to_tags(parent, data, date_format)

Convert a dictionary to XML tags.

This function iterates through the dictionary and creates XML tags for each key-value pair. It handles different data types and formats dates according to the specified format.

Parameters:
  • parent (ElementTree.Element) – The parent element.

  • data (dict[str, Any]) – The dictionary to be converted.

  • date_format (str) – The date format to use.

Return type:

None

invoice2data.output.to_xml.prettify(elem)

Return a pretty-printed XML string for the Element.

Parameters:

elem (ElementTree.Element) – The Element to be pretty-printed.

Returns:

A pretty-printed XML string.

Return type:

Any

invoice2data.output.to_xml.write_to_file(data, path, date_format='%Y-%m-%d')

Export extracted fields to XML.

Appends .xml to path if missing and generates an XML file in the specified directory, otherwise in the current directory.

Parameters:
  • data (list[dict[str, Any]]) – List of dictionaries containing extracted fields.

  • path (str) – Path to save the generated XML file.

  • date_format (str, optional) – Date format used in generated file. Defaults to “%Y-%m-%d”.

Return type:

None

Notes

Provide a filename to the path parameter.

Examples

>>> import datetime
>>> import tempfile
>>> from pathlib import Path
>>> from invoice2data.output import to_xml
>>> data = [{'amount': 123.45, 'date': datetime.datetime(2024, 1, 1)}]
>>> path = Path(tempfile.mkdtemp()) / "invoice.xml"
>>> to_xml.write_to_file(data, str(path))
>>> path.exists()
True

Output streams

Output modules and the shared output-destination helper.

invoice2data.output.open_output(path, suffix, **open_kwargs)

Open the destination for a given --output-name.

Streams to stdout for - / /dev/stdout and stderr for /dev/stderr; otherwise writes to path with suffix appended when missing.

Parameters:
  • path (str) – The requested output name.

  • suffix (str) – File extension to ensure (e.g. .json) for file paths.

  • **open_kwargs (Any) – Extra arguments forwarded to Path.open for files.

Yields:

TextIO – The writable text stream (a standard stream is not closed here).

Return type:

Iterator[TextIO]
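The destination rule described for open_output can be sketched as a small context manager. open_destination is a hypothetical stand-in written for illustration; it follows the documented behaviour (standard streams are yielded and left open, file paths get the suffix appended when missing) but is not the library's code.

```python
import sys
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_destination(path, suffix, **open_kwargs):
    """Yield stdout/stderr for the special names, else an opened file."""
    if path in ("-", "/dev/stdout"):
        yield sys.stdout  # not closed here
    elif path == "/dev/stderr":
        yield sys.stderr  # not closed here
    else:
        target = Path(path)
        if target.suffix != suffix:
            target = target.with_name(target.name + suffix)
        with target.open("w", **open_kwargs) as handle:
            yield handle


with tempfile.TemporaryDirectory() as tmp:
    with open_destination(str(Path(tmp) / "result"), ".json", encoding="utf-8") as out:
        out.write("[]")
    written = sorted(p.name for p in Path(tmp).iterdir())

with open_destination("-", ".json") as out:
    used_stdout = out is sys.stdout
```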

Extract

loader

This module abstracts templates for invoice providers.

Templates are initially read from .yml or .json files and then kept as InvoiceTemplate objects.

invoice2data.extract.loader.ordered_load(stream, loader=<function loads>)

Parse templates from an in-memory string instead of from disk.

Useful when templates live outside the filesystem (e.g. a database column or an API payload): extract_data(file, templates=ordered_load(db_text)). For YAML data pass loader=yaml.safe_load.

Parameters:
  • stream (str) – Serialized templates – a JSON (default) or YAML array of template mappings.

  • loader (Callable[[str], Any], optional) – Callable turning stream into a list of template dicts. Defaults to json.loads; pass yaml.safe_load for YAML.

Returns:

Parsed, prepared templates (empty list on a parse error, which is logged).

Return type:

list[InvoiceTemplate]

invoice2data.extract.loader.prepare_template(tpl)

Prepare a template for use.

Parameters:

tpl (dict[str, Any]) – Template dictionary.

Returns:

Processed template dictionary.

Return type:

dict[str, Any] | None

invoice2data.extract.loader.read_templates(folder=None)

Load YAML templates from the template folder and return them as a list of InvoiceTemplate objects.

Use built-in templates if no folder is set.

Parameters:

folder (str | None) – User-defined folder where templates are stored. If None, uses built-in templates.

Returns:

List of InvoiceTemplate objects.

Return type:

list[InvoiceTemplate]

Examples

>>> from invoice2data.extract.loader import read_templates
>>> templates = read_templates("./src/invoice2data/extract/templates/au")
>>> len(templates)  # Check the number of loaded templates
2
>>> templates[0]['template_name']  # Check the name of the first template
'au.com.opal.yml'

InvoiceTemplate

class invoice2data.extract.invoice_template.InvoiceTemplate(*args, **kwargs)

Represents a single template file stored as a .yml file on disk.

Parameters:
  • args (Any)

  • kwargs (Any)

coerce_type(value, target_type)

Coerces a value to the specified target type.

Parameters:
  • value (str) – The value to be coerced.

  • target_type (str) – The target type to which the value should be coerced. Valid values: ‘int’, ‘float’, ‘date’.

Returns:

The coerced value.

Return type:

Any

Raises:

AssertionError – If the target_type is unknown.

extract(optimized_str, invoice_file, input_module)

Extracts data from the optimized string using the template.

Parameters:
  • optimized_str (str) – The optimized string.

  • invoice_file (str) – The path to the invoice file.

  • input_module (Any) – The input module used.

Returns:

The extracted data.

Return type:

dict[str, Any]

matches_input(extracted_str)

Check if the extracted string matches the template keywords.

Parameters:

extracted_str (str) – The extracted text from the invoice.

Returns:

True if the extracted string matches the template keywords, False otherwise.

Return type:

bool

parse_date(value)

Parses a date string and returns the parsed date.

Parameters:

value (str)

Return type:

Any

parse_number(value)

Parses a number from a string.

This function parses a numerical value from a string, handling different decimal separators and thousands separators based on locale.

Parameters:

value (str) – The string containing the number to be parsed.

Returns:

The parsed numerical value.

Return type:

float

prepare_input(extracted_str)

Takes the raw input string and applies the transformations set in the template file.

Parameters:

extracted_str (str)

Return type:

str

Note

:no-index: works around an autodoc quirk where members of a typing-generic OrderedDict[str, Any] subclass are emitted twice. The methods still render here; they’re internal — the public API is extract_data().

Canonical field schema

Canonical output-field vocabulary and (opt-in) validation.

The field names below mirror docs/recommended-template-fields.md and the OCA account_invoice_import_invoice2data Odoo module. This is the single source of truth used by validation (and, later, the benchmark quality scoring).

Validation is intentionally conservative: templates may legitimately emit custom fields, so validate_output() only flags an unrecognized field when it looks like a typo of a canonical name. A template can opt into strict checking and whitelist custom fields via its options (strict_fields / extra_fields).
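One way to picture the conservative check is a fuzzy comparison against the canonical names. The sketch below uses difflib from the standard library; that choice, the 0.8 cutoff, the strict flag, the small CANONICAL subset and the helper suspicious_fields are all assumptions made for illustration, not the behaviour of validate_output().

```python
import difflib

# Small subset of the canonical invoice-level names, for illustration only.
CANONICAL = frozenset(
    {"amount", "currency", "date", "invoice_number", "issuer", "lines"}
)


def suspicious_fields(output, extra_fields=(), strict=False):
    """Map each flagged field name to its likely canonical spelling (or None)."""
    known = CANONICAL | set(extra_fields)
    flagged = {}
    for name in output:
        # Recognized, whitelisted and auto-typed (amount*/date*) names pass.
        if name in known or name.startswith(("amount", "date")):
            continue
        close = difflib.get_close_matches(name, sorted(CANONICAL), n=1, cutoff=0.8)
        if close:
            flagged[name] = close[0]  # looks like a typo of a canonical name
        elif strict:
            flagged[name] = None  # unknown custom field, only flagged when strict
    return flagged


output = {
    "amount": 10.0,
    "invoice_numbr": "42",
    "date_check_in": "2017-12-31",
    "hotel_details": "OYO 4189",
}
default_report = suspicious_fields(output)
strict_report = suspicious_fields(output, strict=True)
whitelisted_report = suspicious_fields(output, extra_fields=("hotel_details",), strict=True)
```

By default only the near-miss spelling is reported; the custom field is reported only under the strict flag and disappears again once it is whitelisted.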

invoice2data.extract.schema.INVOICE_FIELDS: frozenset[str] = frozenset({'amount', 'amount_tax', 'amount_untaxed', 'bic', 'company_vat', 'country_code', 'currency', 'currency_symbol', 'date', 'date_due', 'date_end', 'date_start', 'desc', 'iban', 'incoterm', 'invoice_number', 'issuer', 'lines', 'mandate_id', 'mobile', 'narration', 'note', 'partner_city', 'partner_coc', 'partner_email', 'partner_name', 'partner_ref', 'partner_street', 'partner_street2', 'partner_street3', 'partner_website', 'partner_zip', 'payment_reference', 'payment_unece_code', 'siren', 'state_code', 'tax_lines', 'telephone', 'template_name', 'vat'})

Canonical top-level (invoice-level) field names.

invoice2data.extract.schema.LINE_FIELDS: frozenset[str] = frozenset({'barcode', 'code', 'date_end', 'date_start', 'discount', 'line_note', 'line_tax_amount', 'line_tax_percent', 'name', 'price_subtotal', 'price_total', 'price_unit', 'product', 'qty', 'sectionheader', 'taxes', 'unece_code', 'uom'})

Canonical per-line-item field names.

invoice2data.extract.schema.LINE_FIELD_ALIASES: dict[str, str] = {'description': 'name', 'tax_percent': 'line_tax_percent', 'unit_price': 'price_unit', 'unitprice': 'price_unit', 'vat_rate': 'line_tax_percent'}

Non-canonical line/tax-line field names mapped to their canonical equivalent. Applied to the output (not the templates) by normalize_line_fields(), so a template may keep these aliases and still produce the standard vocabulary. product is intentionally absent — it is a distinct field (product matching), not a synonym for name. description maps to name because Odoo reads a line’s label from name (description is only an invoice-level field for single-line imports).

invoice2data.extract.schema.TAX_LINE_FIELDS: frozenset[str] = frozenset({'line_tax_amount', 'line_tax_code', 'line_tax_percent', 'price_subtotal'})

Canonical per-rate tax-line field names.

invoice2data.extract.schema.normalize_line_fields(output)

Rename non-canonical keys in lines/tax_lines rows in place.

Maps each alias in LINE_FIELD_ALIASES to its canonical name so that extraction output uses one vocabulary regardless of the group names a template happens to use. An alias is only applied when the canonical key is not already present (so an explicit canonical value always wins).

Parameters:

output (dict[str, Any]) – The extracted-fields dictionary, mutated in place.

Return type:

None
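
As an illustration of the "alias only when the canonical key is absent" rule, here is a minimal standalone sketch. The alias table is copied from LINE_FIELD_ALIASES above; `normalize_rows_sketch` is a hypothetical name, and leaving a losing alias key untouched is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any

# Copied from the LINE_FIELD_ALIASES value documented above.
LINE_FIELD_ALIASES = {
    "description": "name",
    "tax_percent": "line_tax_percent",
    "unit_price": "price_unit",
    "unitprice": "price_unit",
    "vat_rate": "line_tax_percent",
}


def normalize_rows_sketch(output: dict[str, Any]) -> None:
    """Rename alias keys in lines/tax_lines rows in place (illustrative)."""
    for key in ("lines", "tax_lines"):
        rows = output.get(key)
        if not isinstance(rows, list):
            continue
        for row in rows:
            if not isinstance(row, dict):
                continue
            for alias, canonical in LINE_FIELD_ALIASES.items():
                # Rename only when the canonical key is absent, so an
                # explicit canonical value always wins.
                if alias in row and canonical not in row:
                    row[canonical] = row.pop(alias)


output = {
    "lines": [
        {"description": "Widget", "unit_price": 9.5},
        {"name": "Gadget", "description": "alias that loses"},
    ]
}
normalize_rows_sketch(output)
print(output["lines"])
```

The first row ends up with name and price_unit; the second keeps its explicit name.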

invoice2data.extract.schema.validate_output(output, extra_fields=())

Find unrecognized field names in an extracted output.

Checks top-level fields against INVOICE_FIELDS, lines items against LINE_FIELDS, and tax_lines items against TAX_LINE_FIELDS. Recognized names, auto-typed (amount*/date*) names, and names in extra_fields are ignored.

Parameters:
  • output (dict[str, Any]) – The extracted-fields dictionary.

  • extra_fields (Iterable[str]) – Custom field names to treat as known.

Returns:

(field_name, suggestion) pairs for each unrecognized field; suggestion is the closest canonical name if the field looks like a typo, else None.

Return type:

list[tuple[str, str | None]]
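
The shape of the result can be sketched as follows. This is an approximation under stated assumptions: the similarity test uses the standard library's difflib with a 0.8 cutoff, which may differ from how the library decides that a name "looks like a typo", and KNOWN_FIELDS is only a small subset of INVOICE_FIELDS. `find_unknown_fields` is a hypothetical name.

```python
import difflib
from typing import Any, Iterable, Optional

# A small subset of INVOICE_FIELDS, enough for the demonstration.
KNOWN_FIELDS = frozenset(
    {"amount", "currency", "date", "iban", "invoice_number", "issuer", "vat"}
)


def find_unknown_fields(
    output: dict[str, Any], extra_fields: Iterable[str] = ()
) -> list[tuple[str, Optional[str]]]:
    """Return (field_name, suggestion) pairs for unrecognized top-level fields."""
    known = KNOWN_FIELDS | set(extra_fields)
    findings: list[tuple[str, Optional[str]]] = []
    for name in output:
        # Recognized names and auto-typed amount*/date* names are skipped.
        if name in known or name.startswith(("amount", "date")):
            continue
        # Assumption: a typo is a name at least 80% similar to a canonical one.
        close = difflib.get_close_matches(name, sorted(KNOWN_FIELDS), n=1, cutoff=0.8)
        findings.append((name, close[0] if close else None))
    return findings


sample = {
    "invoice_numbr": "A17",
    "warehouse": "North",
    "amount_shipping": 5.0,
    "date": "2024-05-12",
}
print(find_unknown_fields(sample))
print(find_unknown_fields(sample, extra_fields=("warehouse",)))
```

Passing a custom name through extra_fields removes it from the findings, which mirrors the whitelist role described for extra_fields.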

Validators

Lightweight, offline validators for disambiguating extracted identifiers.

Field-type regexes overlap – an IBAN and an EU VAT number can match each other’s pattern – so a captured value is classified by validating it rather than by the regex alone: try the strong checksum first (IBAN mod-97), then the format checks (VAT, BIC). Pure-Python, no network, no heavy deps.

Used by the template-authoring suggestion layer (to assign a captured value to the right canonical field) and as an optional soft-validation of extracted fields. For full per-country VAT/IBAN checksum coverage, python-stdnum can be layered on later; this module stays dependency-free.

invoice2data.extract.validators.VALIDATORS = {'bic': <function validate_bic>, 'iban': <function validate_iban>, 'vat': <function validate_vat>}

Validators in discriminating order – strongest (checksum) first, so an IBAN is never mistaken for a VAT number.

invoice2data.extract.validators.classify_identifier(value)

Classify a captured value as one of the known identifier types.

Runs the validators in discriminating order (IBAN mod-97 checksum first, then the format-based VAT/BIC checks) so overlapping patterns resolve correctly – e.g. a value that passes the IBAN checksum is reported as "iban" even though it may also look VAT-shaped.

Parameters:

value (str) – The captured string to classify.

Returns:

The matching key (“iban”, “vat” or “bic”), or None if nothing validates.

Return type:

str | None
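
A minimal sketch of why the order matters. The IBAN check below is the standard ISO 13616 mod-97 test; the VAT and BIC checks are deliberately loose, hypothetical shape patterns (not the library's per-country formats), chosen so that a short IBAN also looks VAT-shaped. All function names here are illustrative.

```python
import re
from typing import Optional


def _normalise(value: str) -> str:
    # Spaces and case are ignored, as in the validators documented here.
    return re.sub(r"\s+", "", value).upper()


def iban_ok(value: str) -> bool:
    v = _normalise(value)
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", v):
        return False
    # Move the country code and check digits to the end, map letters to
    # 10..35 (int(ch, 36) does exactly that), then test the remainder.
    rearranged = v[4:] + v[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def vat_ok(value: str) -> bool:
    # Hypothetical loose shape: country prefix followed by 8 to 13 digits.
    return re.fullmatch(r"[A-Z]{2}\d{8,13}", _normalise(value)) is not None


def bic_ok(value: str) -> bool:
    # Common BIC shape: bank (4), country (2), location (2), optional branch (3).
    pattern = r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?"
    return re.fullmatch(pattern, _normalise(value)) is not None


# Dict order is the discriminating order: checksum first, formats after.
CHECKS = {"iban": iban_ok, "vat": vat_ok, "bic": bic_ok}


def classify(value: str) -> Optional[str]:
    for kind, check in CHECKS.items():
        if check(value):
            return kind
    return None


norwegian_iban = "NO93 8601 1117 947"
print(vat_ok(norwegian_iban), classify(norwegian_iban))
```

The Norwegian example passes both the loose VAT shape and the IBAN checksum; because the checksum runs first, it is reported as an IBAN.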

invoice2data.extract.validators.validate_bic(value)

Return whether value matches the SWIFT/BIC format (ISO 9362).

Parameters:

value (str) – Candidate BIC; spaces and case are ignored.

Returns:

True if it is a structurally valid 8- or 11-character BIC.

Return type:

bool

invoice2data.extract.validators.validate_iban(value)

Return whether value is a structurally valid IBAN (ISO 13616 mod-97).

Parameters:

value (str) – Candidate IBAN; spaces and case are ignored.

Returns:

True if the structure and the mod-97 checksum are both valid.

Return type:

bool

invoice2data.extract.validators.validate_vat(value)

Return whether value matches a known EU VAT number format.

Parameters:

value (str) – Candidate VAT number; spaces and case are ignored.

Returns:

True if it matches a supported country’s VAT format. This is a format check, not a per-country checksum.

Return type:

bool

Candidate extraction

Candidate extraction for guided / AI-assisted template authoring (AUTH-1).

Scans a document’s extracted text for typed candidate values – dates, monetary amounts and identifiers (IBAN/VAT/BIC) – each with its position in the text. The template-authoring layers (the copier-style CLI builder and AI template generation) consume these candidates to propose fields/regexes; the guided heuristics (AUTH-2) turn them into first-guess field assignments.

Pure-Python: reuses the offline validators to type identifiers and dateparser (already a dependency) to parse dates.

class invoice2data.extract.candidates.Candidate(kind, value, start, end, parsed)

A typed value found in document text.

Parameters:
  • kind (str)

  • value (str)

  • start (int)

  • end (int)

  • parsed (Any)

kind

Candidate type – “date”, “amount”, “iban”, “vat” or “bic”.

Type:

str

value

The raw substring as it appears in the text.

Type:

str

start

Start character offset in the source text.

Type:

int

end

End character offset in the source text.

Type:

int

parsed

Normalised value – a datetime for dates, a float for amounts, the whitespace-stripped identifier for iban/vat/bic.

Type:

Any

invoice2data.extract.candidates.find_amounts(text)

Find parseable monetary-amount candidates in text.

Parameters:

text (str) – The document’s extracted text.

Returns:

Amount candidates with a parsed float value.

Return type:

list[Candidate]

invoice2data.extract.candidates.find_candidates(text)

Find all typed candidates in text, sorted by position.

Amount candidates that fall inside a date (e.g. the 12.05 in 12.05.2024) are dropped to avoid double counting.

Parameters:

text (str) – The document’s extracted text.

Returns:

All date/amount/identifier candidates, ordered by start offset.

Return type:

list[Candidate]
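
The "drop amounts that fall inside a date" step can be sketched with plain regexes. The two patterns and the `Span` class are hypothetical simplifications (numeric dates only, two-decimal amounts only); the library's own detection, which uses dateparser, is not reproduced here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    kind: str
    value: str
    start: int
    end: int


# Hypothetical, deliberately narrow patterns for the demonstration.
DATE_RE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b")
AMOUNT_RE = re.compile(r"\b\d+[.,]\d{2}\b")


def find_spans(text: str) -> list[Span]:
    dates = [Span("date", m.group(), m.start(), m.end()) for m in DATE_RE.finditer(text)]
    amounts = [Span("amount", m.group(), m.start(), m.end()) for m in AMOUNT_RE.finditer(text)]
    # Keep an amount only if it is not fully contained in some date span.
    kept = [
        a for a in amounts
        if not any(d.start <= a.start and a.end <= d.end for d in dates)
    ]
    return sorted(dates + kept, key=lambda s: s.start)


text = "Date: 12.05.2024 Total: 99.50"
for span in find_spans(text):
    print(span.kind, span.value, span.start, span.end)
```

Without the containment filter, "12.05" would be reported as an amount in addition to the date it belongs to.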

invoice2data.extract.candidates.find_dates(text)

Find parseable date candidates in text.

Parameters:

text (str) – The document’s extracted text.

Returns:

Date candidates whose value dateparser could parse.

Return type:

list[Candidate]

invoice2data.extract.candidates.find_identifiers(text)

Find validated identifier candidates (IBAN/VAT/BIC) in text.

Each potential identifier is classified via validators.classify_identifier() so overlapping patterns resolve correctly (e.g. an IBAN is not mistaken for a VAT number).

Parameters:

text (str) – The document’s extracted text.

Returns:

Identifier candidates with kind set to the validated type and parsed to the whitespace-stripped value, sorted by position.

Return type:

list[Candidate]

Template suggestions

Guided heuristics that turn candidates into first-guess fields (AUTH-2).

Given the typed candidates from candidates, propose a first draft of field assignments using deterministic rules – the same “collect all values of a type, then pick by min/max/position” idea used by OCA’s account_invoice_import_simple_pdf:

  • of the captured dates, the earliest is most likely the invoice date and the latest the date_due;

  • the largest monetary amount is most likely the total amount;

  • the first validated iban / vat / bic is offered as-is.

These are suggestions for the authoring layers (the CLI builder and AI template generation) to present for confirmation – never authoritative extraction.

invoice2data.extract.suggestions.suggest_fields(candidates)

Propose first-guess field assignments from typed candidates.

Field keys use the canonical invoice vocabulary (date, date_due, amount, iban, vat, bic).

Parameters:

candidates (list[Candidate]) – Typed candidates from candidates.find_candidates().

Returns:

Canonical field name -> the chosen candidate. Only fields with a confident guess are included.

Return type:

dict[str, Candidate]
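
The min/max/position rules listed for this module can be sketched on hand-built candidates. `Cand` mirrors the documented Candidate attributes but is a local stand-in, `suggest` is a hypothetical name, and offering date_due only when it differs from the invoice date is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Cand:
    kind: str
    value: str
    start: int
    end: int
    parsed: Any


def suggest(cands: list[Cand]) -> dict[str, Cand]:
    picks: dict[str, Cand] = {}
    dates = [c for c in cands if c.kind == "date"]
    amounts = [c for c in cands if c.kind == "amount"]
    if dates:
        # Earliest date -> invoice date; latest date -> due date.
        picks["date"] = min(dates, key=lambda c: c.parsed)
        latest = max(dates, key=lambda c: c.parsed)
        if latest.parsed != picks["date"].parsed:  # assumption, see lead-in
            picks["date_due"] = latest
    if amounts:
        # Largest amount -> invoice total.
        picks["amount"] = max(amounts, key=lambda c: c.parsed)
    by_position = sorted(cands, key=lambda c: c.start)
    for kind in ("iban", "vat", "bic"):
        # First occurrence of each identifier type is offered as-is.
        first = next((c for c in by_position if c.kind == kind), None)
        if first is not None:
            picks[kind] = first
    return picks


cands = [
    Cand("date", "01.06.2024", 40, 50, datetime(2024, 6, 1)),
    Cand("date", "12.05.2024", 6, 16, datetime(2024, 5, 12)),
    Cand("amount", "19.00", 60, 65, 19.0),
    Cand("amount", "119.00", 70, 76, 119.0),
    Cand("bic", "DEUTDEFF", 90, 98, "DEUTDEFF"),
]
picks = suggest(cands)
print({field: c.value for field, c in picks.items()})
```

Fields without any candidate of the matching type (here iban and vat) are simply absent from the result.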

invoice2data.extract.suggestions.suggest_from_text(text)

Extract candidates from text and return first-guess field assignments.

Convenience wrapper around candidates.find_candidates() + suggest_fields() for the authoring layers.

Parameters:

text (str) – The document’s extracted text.

Returns:

Canonical field name -> the chosen candidate.

Return type:

dict[str, Candidate]

Template builder

Deterministic template drafting for the CLI builder (AUTH-3).

Turns the AUTH candidates/suggestions into a first-draft invoice2data template (issuer + keywords + field regexes) without any AI. The CLI builder (invoice2data --new-template) presents this draft for confirmation; the AI-assisted mode swaps this for ai.template_generator.generate_template().

invoice2data.extract.template_builder.field_regex(spec)

Return the regex of a field spec (a bare string or a field dict).

Parameters:

spec (str | dict[str, Any]) – A template field value.

Returns:

The field’s regex pattern.

Return type:

str

invoice2data.extract.template_builder.field_regex_from_candidate(text, candidate)

Build a field regex anchored on the label preceding a candidate value.

Uses the text before the value on its line as a literal anchor (with flexible whitespace) plus a typed capture group, e.g. Date:\s*(\d[\d/.\-]+\d).

Parameters:
  • text (str) – The full document text.

  • candidate (Candidate) – The candidate whose value to capture.

Returns:

A regex with one capturing group around the value.

Return type:

str
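
A minimal sketch of label anchoring, assuming the value's start offset and a typed capture pattern are already known. `label_anchored_regex` and its parameters are hypothetical; the real function takes a Candidate and chooses the capture group itself.

```python
import re


def label_anchored_regex(text: str, start: int, capture: str) -> str:
    """Build 'label + flexible whitespace + (capture)' for a value at `start`."""
    # The label is whatever precedes the value on the same line.
    line_start = text.rfind("\n", 0, start) + 1
    label = text[line_start:start].strip()
    # Escape each word literally and allow any whitespace run between words.
    anchor = r"\s+".join(re.escape(word) for word in label.split())
    return anchor + r"\s*(" + capture + ")"


text = "Invoice 42\nDate: 12.05.2024\n"
pattern = label_anchored_regex(text, text.index("12.05.2024"), r"\d[\d/.\-]+\d")
print(pattern)
print(re.search(pattern, "Date:    31.12.2023").group(1))
```

Because whitespace is matched flexibly, the same regex still captures the value when another invoice from the same issuer pads the label differently.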

invoice2data.extract.template_builder.preview_field(spec, text)

Return what a field spec captures from text, after any cleanup.

Parameters:
  • spec (str | dict[str, Any]) – A template field value (regex or dict).

  • text (str) – The sample text to match against.

Returns:

The captured (and replace-cleaned) value, or None if the regex does not match.

Return type:

str | None

invoice2data.extract.template_builder.set_field_regex(spec, regex)

Return spec with its regex replaced (keeping any cleanup/replace).

Parameters:
  • spec (str | dict[str, Any]) – The existing field spec.

  • regex (str) – The new regex pattern.

Returns:

The updated spec (same shape as the input).

Return type:

str | dict[str, Any]
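
The two spec shapes (bare string or field dict) can be handled as in the sketch below. It assumes the dict form stores its pattern under a `regex` key; the `replace` entry in the example is only a placeholder for "any cleanup" and its format is hypothetical, as are the function names.

```python
from typing import Any, Union

FieldSpec = Union[str, dict[str, Any]]


def get_regex(spec: FieldSpec) -> str:
    # A bare string is the regex itself; a dict carries it under "regex".
    return spec if isinstance(spec, str) else spec["regex"]


def with_regex(spec: FieldSpec, regex: str) -> FieldSpec:
    if isinstance(spec, str):
        return regex
    # Copy the dict so other keys are kept and the input is not mutated.
    return {**spec, "regex": regex}


plain = r"Total:\s*(\d+)"
rich = {"regex": r"Total:\s*(\d+)", "replace": [["O", "0"]]}
print(with_regex(plain, r"Sum:\s*(\d+)"))
print(with_regex(rich, r"Sum:\s*(\d+)"))
```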

invoice2data.extract.template_builder.suggested_template(text)

Draft a template from a sample’s deterministic candidates (no AI).

Parameters:

text (str) – The sample document’s extracted text.

Returns:

A template dict with issuer, keywords and fields (canonical field name -> regex).

Return type:

dict[str, Any]

invoice2data.extract.template_builder.to_yaml(template)

Serialise a template dict to YAML for writing to a .yml file.

Parameters:

template (dict[str, Any]) – The template to serialise.

Returns:

The YAML document.

Return type:

str

Date parsing

Tiered, cached date parsing.

Order, fastest applicable first:

  1. the template’s explicit date_formats via stdlib datetime.strptime (runs in microseconds, deterministic);

  2. dateutil (fast, fuzzy, English-centric);

  3. dateparser (slow, but multilingual / localized month names) – which is an optional dependency (pip install invoice2data[dateparser]).

With dateparser absent, localized month-name dates won’t parse, but numeric and English dates still do via tiers 1-2. Results are memoized (absolute-date parsing is deterministic for given inputs).

invoice2data.extract._dates.parse_date(value, date_formats=(), languages=())

Parse a date string using the tiered strategy (memoized).

Parameters:
  • value (str) – The date string to parse.

  • date_formats (tuple[str, ...]) – Template formats, tried first via strptime.

  • languages (tuple[str, ...]) – Language codes for the dateparser fallback.

Returns:

The parsed datetime, or None.

Return type:

datetime.datetime | None
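
The tiering and memoization can be sketched with the standard library alone. This is an illustration, not the module's code: `parse_date_sketch` is a hypothetical name, the dateutil tier is attempted only if that package happens to be importable, and the dateparser tier is left out entirely.

```python
from datetime import datetime
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=1024)
def parse_date_sketch(value: str, date_formats: tuple[str, ...] = ()) -> Optional[datetime]:
    # Tier 1: explicit formats via strptime, cheapest and unambiguous.
    for fmt in date_formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    # Tier 2: dateutil, only when it is installed.
    try:
        from dateutil import parser as dateutil_parser
    except ImportError:
        return None
    try:
        return dateutil_parser.parse(value)
    except (ValueError, OverflowError):
        return None


formats = ("%d.%m.%Y", "%d/%m/%Y")
print(parse_date_sketch("12.05.2024", formats))
print(parse_date_sketch("31/12/2023", formats))
```

Note that lru_cache needs hashable arguments, which tuples are and lists are not; that fits the tuple-typed date_formats and languages parameters in the signature above.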

Regex engine

Internal regex helpers with a compile-once cache.

All regex matching in invoice2data.extract goes through these thin wrappers so each pattern is compiled only once (via an LRU cache) instead of on every call. The engine is selected once at import time: the stdlib re by default, or the API-compatible third-party regex package when INVOICE2DATA_REGEX_ENGINE=regex is set in the environment.

invoice2data.extract._regex.ENGINE: str = 're'

Name of the active regex engine (“re” or “regex”).

invoice2data.extract._regex.compile(pattern, flags=0)

Compile a regex pattern, caching the result.

Parameters:
  • pattern (str) – The regular expression pattern.

  • flags (int) – Regex flags passed to the engine. Defaults to 0.

Returns:

The compiled pattern object. The active engine (re or the API-compatible regex) is treated as re for typing.

Return type:

re.Pattern[str]
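
The compile-once idea can be sketched with functools.lru_cache around the stdlib re. The cache size and the names `cached_compile` and `cached_search` are assumptions for illustration; engine selection via INVOICE2DATA_REGEX_ENGINE is not shown.

```python
import re
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=512)
def cached_compile(pattern: str, flags: int = 0) -> "re.Pattern[str]":
    # Compiled once per (pattern, flags) pair; later calls hit the cache.
    return re.compile(pattern, flags)


def cached_search(pattern: str, string: str, flags: int = 0) -> "Optional[re.Match[str]]":
    # Thin wrapper: every matching helper goes through cached_compile().
    return cached_compile(pattern, flags).search(string)


first = cached_search(r"No\.\s*(\d+)", "Invoice No. 4711")
second = cached_search(r"No\.\s*(\d+)", "Invoice No. 4712")
print(first.group(1), second.group(1))
print(cached_compile.cache_info())
```

The second search reuses the compiled pattern, which shows up as a cache hit in cache_info().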

invoice2data.extract._regex.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string.

Parameters:
  • pattern (str) – The regular expression pattern.

  • string (str) – The text to search.

  • flags (int) – Regex flags. Defaults to 0.

Returns:

A list of matches (strings or tuples of groups).

Return type:

Any

invoice2data.extract._regex.finditer(pattern, string, flags=0)

Iterate over all non-overlapping match objects of pattern in string.

Parameters:
  • pattern (str) – The regular expression pattern.

  • string (str) – The text to search.

  • flags (int) – Regex flags. Defaults to 0.

Returns:

A callable iterator yielding re.Match objects (or the active engine’s equivalent).

Return type:

Any

invoice2data.extract._regex.search(pattern, string, flags=0)

Search string for the first match of pattern.

Parameters:
  • pattern (str) – The regular expression pattern.

  • string (str) – The text to search.

  • flags (int) – Regex flags. Defaults to 0.

Returns:

A match object, or None if there is no match.

Return type:

re.Match[str] | None

invoice2data.extract._regex.split(pattern, string, maxsplit=0, flags=0)

Split string by occurrences of pattern.

Parameters:
  • pattern (str) – The regular expression pattern.

  • string (str) – The text to split.

  • maxsplit (int) – Maximum number of splits. Defaults to 0 (no limit).

  • flags (int) – Regex flags. Defaults to 0.

Returns:

A list of substrings.

Return type:

Any

invoice2data.extract._regex.sub(pattern, repl, string, count=0, flags=0)

Replace occurrences of pattern in string with repl.

Parameters:
  • pattern (str) – The regular expression pattern.

  • repl (str) – The replacement string.

  • string (str) – The text to operate on.

  • count (int) – Maximum number of replacements. Defaults to 0 (all).

  • flags (int) – Regex flags. Defaults to 0.

Returns:

The string with replacements applied.

Return type:

str

Plugins

tables

Plugin to extract tables from an invoice.

invoice2data.extract.plugins.tables.extract(self, content, output, invoice_file=None)

Try to extract tables from an invoice.

Parameters:
  • self (InvoiceTemplate) – The current instance of the class.

  • content (str) – The content of the invoice.

  • output (dict[str, Any]) – The output dictionary to update with the extracted data.

  • invoice_file (str | None) – Unused; accepted for plugin-interface compatibility (path-based plugins such as camelot need it).

Returns:

The extracted data as a list of dictionaries, or None if table parsing fails. Each dictionary represents a row in the table.

Return type:

list[Any] | None

lines

Plugin to extract individual lines from an invoice.

This plugin has been replaced by the “lines” parser. All new templates should use the parser instead. It’s provided for backward compatibility only.

invoice2data.extract.plugins.lines.extract(self, content, output, invoice_file=None)

Extract individual lines from an invoice.

This plugin has been replaced by the “lines” parser. All new templates should use the parser instead. It’s provided for backward compatibility only.

Parameters:
  • self (InvoiceTemplate) – The current instance of the class.

  • content (str) – The text content to parse.

  • output (dict[str, Any]) – A dictionary to store the extracted data.

  • invoice_file (str | None) – Unused; accepted for plugin-interface compatibility (path-based plugins such as camelot need it).

Return type:

None

camelot

Camelot table-extraction plugin (optional) for invoice2data.

Opt-in: requires the optional camelot-py dependency (pip install invoice2data[camelot]). Unlike the text parsers/plugins, camelot re-reads the PDF itself to detect ruled (lattice) or whitespace-aligned (stream) tables, so it needs the source file path.

Enable it with a top-level camelot: block in a template — either a single mapping or a list of them. Recognised camelot.read_pdf keys (pages, flavor, table_areas, columns, …) are forwarded as-is; the plugin-specific keys are:

  • field – output key to populate (default: lines)

  • header – use the table’s first row as the column names (default: true)

  • tables – which detected table to use: an index, or all (default: all)

Example:

camelot:
  flavor: lattice
  pages: "1"
  field: lines

invoice2data.extract.plugins.camelot.extract(self, content, output, invoice_file=None)

Detect tables with camelot and add their rows to the output.

Parameters:
  • self (Any) – The matched template (an InvoiceTemplate); provides the camelot: block.

  • content (str) – Unused — camelot reads the PDF directly.

  • output (dict[str, Any]) – Output dictionary to populate.

  • invoice_file (str | None) – Path to the source PDF (required).

Return type:

None

invoice2data.extract.plugins.camelot.is_available()

Return whether the optional camelot-py package is importable.

Returns:

True if camelot is installed.

Return type:

bool

Parsers

static

Pseudo-parser returning a static (predefined) value.

lines

Parser to extract individual lines from an invoice.

Initial work and maintenance by Holger Brunn @hbrunn

invoice2data.extract.parsers.lines.parse(template, field, settings, content)

Parse lines from the content based on the given settings.

Parameters:
  • template (InvoiceTemplate) – The template dictionary.

  • field (str) – The field name.

  • settings (dict[str, Any]) – The settings dictionary.

  • content (str) – The text content to parse.

Returns:

The parsed lines.

Return type:

list[dict[str, Any]]

invoice2data.extract.parsers.lines.parse_block(template, field, settings, content)

Parse a block of lines to extract data.

This function parses a block of lines from an invoice to extract data based on the provided template and settings. It handles different line types (first line, last line, regular lines) and can skip specific lines based on the configuration.

Parameters:
  • template (InvoiceTemplate) – The template containing extraction rules.

  • field (str) – The name of the field to extract.

  • settings (dict[str, Any]) – The settings for the extraction rule.

  • content (str) – The text content to parse.

Returns:

A list of dictionaries, where each dictionary represents an extracted row with field-value pairs.

Return type:

list[dict[str, Any]]

invoice2data.extract.parsers.lines.parse_by_rule(template, field, rule, content)

Parse lines from a block of text based on a rule.

Parameters:
  • template (InvoiceTemplate) – The template dictionary.

  • field (str) – The field name.

  • rule (dict[str, Any]) – The rule dictionary.

  • content (str) – The text content to parse.

Returns:

The parsed lines.

Return type:

list[dict[str, Any]]

invoice2data.extract.parsers.lines.parse_current_row(match, current_row)

Parse the current row data.

Parameters:
  • match (Match[str] | None) – The match object.

  • current_row (dict[str, Any]) – The current row dictionary.

Returns:

The updated current row dictionary.

Return type:

dict[str, Any]

invoice2data.extract.parsers.lines.parse_line(patterns, line)

Parse a line using a given pattern or list of patterns.

This function searches for a match in the given line using the provided pattern or list of patterns. If a match is found, it returns the match object; otherwise, it returns None.

Parameters:
  • patterns (str | list[str]) – The pattern(s) to search for.

  • line (str) – The line to parse.

Returns:

A match object if a match is found, otherwise None.

Return type:

Match[str] | None

regex

Parser extracting data using regexes.

One or more regexes can be specified using the “regex” setting. By default it ignores duplicates and returns:

  • a single value if there was only a single match

  • an array for multiple matches

For more detailed parsing “type” and “group” settings can be specified.

invoice2data.extract.parsers.regex.parse(template, field, settings, content, legacy=False)

Parse a field from the content using regular expressions.

Parameters:
  • template (Any) – The template object.

  • field (str) – The name of the field to extract.

  • settings (dict[str, Any]) – The settings for the field extraction.

  • content (str) – The text content to parse.

  • legacy (bool, optional) – Whether to use legacy parsing. Defaults to False.

Returns:

The extracted value(s) or None if parsing fails.

Return type:

Any
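
The default result shape (duplicates ignored, one value versus a list) can be sketched as below. `collapse_matches` is a hypothetical helper that handles only that shaping; the “type” and “group” settings are not covered, and each pattern is assumed to have exactly one capturing group.

```python
import re
from typing import Any


def collapse_matches(patterns: list[str], content: str) -> Any:
    found: list[str] = []
    for pattern in patterns:
        for match in re.findall(pattern, content):
            # Ignore duplicates while keeping first-seen order.
            if match not in found:
                found.append(match)
    if not found:
        return None
    # One distinct match -> the bare value; several -> a list.
    return found[0] if len(found) == 1 else found


print(collapse_matches([r"PO (\d+)"], "PO 123 / PO 123"))
print(collapse_matches([r"PO (\d+)"], "PO 123 / PO 456 / PO 123"))
print(collapse_matches([r"PO (\d+)"], "no purchase order"))
```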

AI (optional)

The AI subsystem is opt-in and provider-pluggable (cloud LLMs or a local Ollama). See AI features for configuration and usage. Requires the ai extra.

Configuration

AI provider configuration, resolved from environment variables.

All settings come from INVOICE2DATA_AI_* env vars so the core library never hard-codes credentials. The default provider is mock (no network), so nothing AI-related runs unless explicitly configured.

class invoice2data.ai.config.AIConfig(provider, model, base_url, api_key)

Resolved AI configuration.

Parameters:
  • provider (str)

  • model (str)

  • base_url (str | None)

  • api_key (str | None)

provider

Provider key – “mock” or a vendor name (openai/deepseek/mistral/gemini/ollama) routed to the OpenAI-compatible provider.

Type:

str

model

Model identifier passed to the provider.

Type:

str

base_url

API base URL (vendor default when unset).

Type:

str | None

api_key

API key; not required for local providers (Ollama).

Type:

str | None

invoice2data.ai.config.VENDOR_BASE_URLS: dict[str, str] = {'deepseek': 'https://api.deepseek.com/v1', 'gemini': 'https://generativelanguage.googleapis.com/v1beta/openai', 'mistral': 'https://api.mistral.ai/v1', 'ollama': 'http://localhost:11434/v1', 'openai': 'https://api.openai.com/v1'}

Default OpenAI-compatible base URLs for known vendors, so a user only needs to set the provider name + key (DeepSeek/Mistral/Ollama and Gemini’s compat endpoint all speak the OpenAI chat-completions API).

invoice2data.ai.config.load_config()

Load AI configuration from INVOICE2DATA_AI_* environment variables.

Reads INVOICE2DATA_AI_PROVIDER (default “mock”), ..._MODEL, ..._BASE_URL (falls back to the vendor default) and ..._API_KEY.

Returns:

The resolved configuration.

Return type:

AIConfig
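
A minimal sketch of env-driven resolution, using the variable names given above. `Config` and `load` are local stand-ins for AIConfig and load_config(); the empty-string model default and the injectable `env` mapping are assumptions made to keep the example testable, and only two vendor URLs are copied from VENDOR_BASE_URLS.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Two entries copied from VENDOR_BASE_URLS above.
VENDOR_BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "openai": "https://api.openai.com/v1",
}


@dataclass(frozen=True)
class Config:
    provider: str
    model: str
    base_url: Optional[str]
    api_key: Optional[str]


def load(env: Mapping[str, str] = os.environ) -> Config:
    prefix = "INVOICE2DATA_AI_"
    provider = env.get(prefix + "PROVIDER", "mock")  # default: no network
    return Config(
        provider=provider,
        model=env.get(prefix + "MODEL", ""),  # assumption: empty when unset
        # An explicit base URL wins; otherwise use the vendor default, if any.
        base_url=env.get(prefix + "BASE_URL") or VENDOR_BASE_URLS.get(provider),
        api_key=env.get(prefix + "API_KEY"),
    )


print(load({}))
print(load({"INVOICE2DATA_AI_PROVIDER": "ollama", "INVOICE2DATA_AI_MODEL": "llama3"}))
```

With nothing set, the provider resolves to mock and no URL or key is present, matching the "nothing AI-related runs unless explicitly configured" behaviour described above.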

Provider interface

The AIProvider contract and the provider registry.

Mirrors the input-backend seam (input/__interface__.py): a small structural contract plus a factory that instantiates the configured provider. Optional provider dependencies are imported lazily inside get_provider() so importing this module never requires the ai extra.

class invoice2data.ai.__interface__.AIProvider(*args, **kwargs)

Structural contract for an AI extraction provider.

name

Short provider identifier.

Type:

str

extract_structured(text, json_schema, *, instructions=None)

Extract structured fields from text, constrained to json_schema.

Parameters:
  • text (str) – The document’s extracted text.

  • json_schema (dict[str, Any]) – JSON Schema the result must match.

  • instructions (str | None) – Optional system prompt override.

Returns:

The structured fields.

Return type:

dict[str, Any]

is_available()

Return whether the provider is configured and its deps are present.

Returns:

True if extract_structured() can be called.

Return type:

bool

invoice2data.ai.__interface__.get_provider(config=None)

Instantiate the configured AI provider.

Parameters:

config (AIConfig | None) – Configuration to use; loaded from the environment when None.

Returns:

A MockProvider when the provider is “mock”, otherwise an OpenAICompatibleProvider (which covers every supported vendor).

Return type:

AIProvider
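
A structural contract of this kind can be expressed with typing.Protocol, as in the sketch below. Whether AIProvider is declared exactly this way is an assumption; `Provider` and `EchoProvider` are hypothetical names, and the echo behaviour is invented purely to have something offline to call.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class Provider(Protocol):
    name: str

    def extract_structured(
        self,
        text: str,
        json_schema: dict[str, Any],
        *,
        instructions: Optional[str] = None,
    ) -> dict[str, Any]: ...

    def is_available(self) -> bool: ...


class EchoProvider:
    """Hypothetical offline provider; note it does not subclass Provider."""

    name = "echo"

    def extract_structured(self, text, json_schema, *, instructions=None):
        # Return one empty slot per schema property; no model is called.
        return {key: None for key in json_schema.get("properties", {})}

    def is_available(self):
        return True


provider = EchoProvider()
print(isinstance(provider, Provider))
print(provider.extract_structured("Invoice 42", {"properties": {"amount": {}}}))
```

Any object with the right attributes satisfies the contract, so third-party providers need no import from the library to conform.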

LLM fallback extraction

Runtime LLM fallback extraction (AI-2).

Opt-in: when no template matches (or every match is missing required fields) and OCR doesn’t help, extract fields with the configured AIProvider, constrained to the canonical JSON schema and validated. Results are tagged extraction_method: "ai" so they are never confused with a deterministic template match. Off unless explicitly enabled.

invoice2data.ai.fallback.ai_fallback_extract(text, *, provider=None)

Extract fields from text via the configured AI provider (opt-in).

Parameters:
  • text (str) – The document’s extracted text.

  • provider (AIProvider | None) – Provider to use; the configured one (get_provider()) when None.

Returns:

Extracted fields tagged extraction_method: "ai", or an empty dict when text is empty, the provider is unavailable, or nothing was found.

Return type:

dict[str, Any]

AI template generation

AI-assisted template generation (AI-1).

Drafts an invoice2data template from a sample document’s text using the configured AIProvider, grounded with the deterministic candidates from invoice2data.extract.suggestions so the model has concrete values to anchor its regexes. preview_template then round-trips the draft against the same text so the user can see what it captures before saving it.

Authoring-time only – this never runs during normal extraction; the default path stays deterministic templates.

invoice2data.ai.template_generator.TEMPLATE_SCHEMA: dict[str, Any] = {'properties': {'exclude_keywords': {'items': {'type': 'string'}, 'type': 'array'}, 'fields': {'additionalProperties': {'type': 'string'}, 'type': 'object'}, 'issuer': {'type': 'string'}, 'keywords': {'items': {'type': 'string'}, 'type': 'array'}}, 'required': ['keywords', 'fields'], 'type': 'object'}

JSON Schema for the template the model must produce (not the invoice values).

invoice2data.ai.template_generator.generate_template(text, *, provider=None, issuer=None)

Draft an invoice2data template from a sample document’s text.

Parameters:
  • text (str) – The sample document’s extracted text.

  • provider (AIProvider | None) – Provider to use; the configured one (get_provider()) when None.

  • issuer (str | None) – Optional issuer name to force into the template.

Returns:

A template dict (issuer/keywords/fields) ready to review, preview and save.

Return type:

dict[str, Any]

invoice2data.ai.template_generator.preview_template(template, text)

Apply a template’s field regexes to text to preview what it captures.

A lightweight round-trip so the user can confirm the draft before saving it; the real engine applies these regexes with the template’s options at runtime.

Parameters:
  • template (dict[str, Any]) – A template dict with a fields mapping of field name -> regex string.

  • text (str) – The sample text to match against.

Returns:

Field name -> the first captured value (group 1 when the regex has a group, otherwise the whole match). Fields that do not match are omitted.

Return type:

dict[str, str]
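
The round-trip rule (group 1 if the regex has a group, else the whole match, non-matches omitted) can be sketched in a few lines. `preview` is a hypothetical stand-in that uses the stdlib re directly and ignores template options.

```python
import re
from typing import Any


def preview(template: dict[str, Any], text: str) -> dict[str, str]:
    captured: dict[str, str] = {}
    for name, pattern in template.get("fields", {}).items():
        match = re.search(pattern, text)
        if match is None:
            continue  # fields that do not match are omitted
        # Group 1 when the pattern defines a group, else the whole match.
        captured[name] = match.group(1) if match.re.groups else match.group(0)
    return captured


draft = {
    "keywords": ["ACME"],
    "fields": {
        "invoice_number": r"Invoice\s+No\.?\s*(\w+)",
        "currency": r"EUR|USD",
        "iban": r"IBAN:\s*(\S+)",
    },
}
sample_text = "ACME Ltd\nInvoice No. A17\nTotal: 99.50 EUR"
print(preview(draft, sample_text))
```

The iban field is absent from the result because the sample contains no IBAN line, which is how a reviewer would spot a regex that needs adjusting.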

JSON schema

Build a JSON Schema for invoice extraction from the canonical field registry.

The canonical vocabulary in invoice2data.extract.schema is the single source of truth; this turns it into a JSON Schema so an LLM’s structured-output mode can be constrained to the exact fields invoice2data understands.

invoice2data.ai.schema_json.invoice_json_schema()

Return a JSON Schema describing the canonical invoice output.

Top-level invoice fields plus a lines array of line items, typed from the canonical registries.

Returns:

A JSON Schema object suitable for structured-output APIs.

Return type:

dict[str, Any]
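
How a field registry might be turned into a JSON Schema is sketched below. The field tuples are small subsets of INVOICE_FIELDS and LINE_FIELDS, and the typing rule (names starting with amount, price or qty are numbers, everything else strings) is an assumption of this sketch, not the library's mapping; `build_schema` is a hypothetical name.

```python
import json
from typing import Any

# Small subsets of the canonical registries, for illustration only.
INVOICE = ("amount", "date", "invoice_number", "issuer")
LINE = ("name", "price_unit", "qty")


def _field_type(name: str) -> dict[str, Any]:
    # Assumed typing rule; see the lead-in.
    numeric = name.startswith(("amount", "price", "qty"))
    return {"type": "number" if numeric else "string"}


def build_schema() -> dict[str, Any]:
    properties: dict[str, Any] = {name: _field_type(name) for name in INVOICE}
    # Line items are an array of objects typed from the line registry.
    properties["lines"] = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {name: _field_type(name) for name in LINE},
        },
    }
    return {"type": "object", "properties": properties}


schema = build_schema()
print(json.dumps(schema, indent=2))
```

Deriving the schema from the registry means a field added to the vocabulary becomes available to structured-output calls without a second list to maintain.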