Reference¶
API reference for using invoice2data as a Python library. For the command-line interface see the usage page.
Library API¶
- invoice2data.extract_data(invoicefile, templates=None, input_module=None, ai_fallback=False, raise_on_error=False)¶
Extracts structured data from PDF/image invoices.
This function uses the text extracted from a PDF file or image and pre-defined regex templates to find structured data.
If no templates are passed, they are read automatically. The required fields are those declared by the matching template.
- Parameters:
invoicefile (str) – Path of the electronic invoice file (PDF, JPEG, or PNG).
templates (list[InvoiceTemplate] | None) – List of InvoiceTemplate instances. Templates are loaded with the read_templates function in loader.py.
input_module (Any, optional) – Backend used to extract text from the given invoicefile, as a module or a registry name (e.g. ‘pdftotext’, ‘pdfium’, ‘pdfminer’, ‘tesseract’, ‘text’). When None (the default), a cascade of backends (DEFAULT_INPUT_READERS) is tried in order until one yields a template match with all required fields.
ai_fallback (bool, optional) – When True and no template matches (or every match is incomplete) and OCR does not help, extract fields with the configured AI provider (see the INVOICE2DATA_AI_* env vars). The result is tagged extraction_method: "ai". Opt-in; defaults to False.
raise_on_error (bool, optional) – When True, raise a typed InvoiceProcessingError on failure instead of returning {}: RequiredFieldsMissingError when a template matched but a required field could not be parsed, otherwise NoTemplateFoundError. Defaults to False (the historical {} contract).
- Returns:
Extracted and matched fields, or an empty dict {} if text extraction fails or no template matches (unless raise_on_error is set).
- Return type:
dict[str, Any]
- Raises:
InvoiceProcessingError – When raise_on_error is True and extraction fails (RequiredFieldsMissingError or NoTemplateFoundError).
Notes
Import the required input_module when using invoice2data as a library. A template may pin the backend it was authored for with a top-level input_module: key; that backend is then used for that template regardless of which one matched it first.
See also
read_templates: Function to load templates. InvoiceTemplate: Class representing a single invoice template.
Examples
When using invoice2data as a library:
>>> from invoice2data import extract_data
>>> from invoice2data.input import pdftotext
>>> extract_data("./tests/compare/oyo.pdf", None, pdftotext)
{'issuer': 'OYO', 'template_name': 'com.oyo.invoice.yml', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0), 'invoice_number': 'IBZY2087', 'currency': 'INR', 'hotel_details': ' OYO 4189 Resort Nanganallur', 'date_check_in': datetime.datetime(2017, 12, 31, 0, 0), 'date_check_out': datetime.datetime(2018, 1, 1, 0, 0), 'amount_rooms': 1.0, 'booking_id': 'IBZY2087', 'payment_method': 'Cash at Hotel', 'gstin': '06AABCO6063D1ZQ', 'cin': 'U63090DL2012PTC231770', 'desc': 'Invoice from OYO'}
Load templates with read_templates
(documented under Extract → loader).
Exceptions¶
By default extract_data returns {} on failure. Pass raise_on_error=True to
get a typed exception instead, so a caller can tell why extraction failed:
from invoice2data import extract_data, NoTemplateFoundError, RequiredFieldsMissingError
try:
    data = extract_data("invoice.pdf", raise_on_error=True)
except RequiredFieldsMissingError as exc:
    print("matched a template but missing:", exc.fields)
except NoTemplateFoundError:
    print("no template matched")
Typed exceptions for invoice2data (issue #190).
By default invoice2data.extract_data() returns {} on failure (the
historical contract). Pass raise_on_error=True to get one of these instead,
so a library caller can tell why extraction failed and show a useful message.
- exception invoice2data.exceptions.InvoiceProcessingError¶
Base class for invoice2data extraction failures.
Only raised when extract_data(..., raise_on_error=True) is used.
- exception invoice2data.exceptions.NoTemplateFoundError¶
No template matched the document under any input backend.
- exception invoice2data.exceptions.RequiredFieldsMissingError(fields, template_name=None)¶
A template matched but one or more required fields could not be parsed.
Subclasses ValueError so the input-backend cascade’s existing except ValueError retry handling keeps working unchanged.
- Parameters:
fields (Iterable[str]) – Required field names that could not be parsed.
template_name (str | None) – The matched template’s name, when known.
- Return type:
None
- fields¶
The required field names that could not be parsed.
- Type:
set[str]
- template_name¶
The template that matched, when known.
- Type:
str | None
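The class relationships described above can be sketched with stand-in classes. This is a minimal illustration, not the library's source: the class bodies, the message format, and the `describe_failure` helper are assumptions; only the names, the `fields` / `template_name` attributes, and the `ValueError` inheritance come from the reference above.

```python
class InvoiceProcessingError(Exception):
    """Stand-in for the documented base class."""


class NoTemplateFoundError(InvoiceProcessingError):
    """Stand-in: no template matched under any backend."""


class RequiredFieldsMissingError(InvoiceProcessingError, ValueError):
    """Stand-in: a template matched but required fields were not parsed."""

    def __init__(self, fields, template_name=None):
        self.fields = set(fields)
        self.template_name = template_name
        super().__init__(f"missing required fields: {sorted(self.fields)}")


def describe_failure(exc):
    """Turn a failure into a user-facing message (hypothetical helper)."""
    if isinstance(exc, RequiredFieldsMissingError):
        name = exc.template_name or "unknown template"
        return f"{name}: could not parse {', '.join(sorted(exc.fields))}"
    if isinstance(exc, NoTemplateFoundError):
        return "no template matched this document"
    return f"extraction failed: {exc}"


error = RequiredFieldsMissingError(["date", "amount"], template_name="com.example.yml")
message = describe_failure(error)
```

Because the stand-in inherits from both bases, code that only knows `except ValueError` still catches it, while newer callers can catch `InvoiceProcessingError` for every typed failure.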
Input modules¶
invoice2data resolves a backend by name or module object. When none is forced it
tries an ordered cascade (see How It Works) and falls back to OCR. Backends
expose a common interface; those backed by optional dependencies self-exclude via
is_available().
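The cascade idea can be sketched as a loop over backends. This is a simplified illustration under stated assumptions, not the library's implementation: `first_complete_match`, the `match` callable, and the three stand-in backends are hypothetical, and the real cascade also handles the OCR fallback and template pinning.

```python
from types import SimpleNamespace


def first_complete_match(backends, invoicefile, match):
    """Try each available backend in order; return the first complete result.

    `backends` is an ordered list of objects exposing `to_text` and,
    optionally, `is_available`. `match` is a hypothetical callable that
    returns a dict of fields, or raises ValueError when a template matched
    but a required field is missing.
    """
    for backend in backends:
        check = getattr(backend, "is_available", None)
        if check is not None and not check():
            continue  # optional dependency missing: the backend self-excludes
        text = backend.to_text(invoicefile)
        if not text:
            continue
        try:
            result = match(text)
        except ValueError:
            continue  # incomplete match: retry with the next backend
        if result:
            return result
    return {}


# Hypothetical stand-ins for three backends and a template matcher.
unavailable = SimpleNamespace(is_available=lambda: False, to_text=lambda path: "unused")
sparse = SimpleNamespace(to_text=lambda path: "Invoice")
complete = SimpleNamespace(to_text=lambda path: "Invoice 42")


def match(text):
    if "42" not in text:
        raise ValueError("invoice_number missing")
    return {"invoice_number": "42"}


result = first_complete_match([unavailable, sparse, complete], "invoice.pdf", match)
```

Here the first backend is skipped as unavailable, the second yields an incomplete match, and the third supplies the result.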
Backend interface and registry¶
Input (text-extraction) backends and their registry.
See __interface__ for the backend contract. INPUT_MODULES maps the stable backend name (the --input-reader value) to its module.
- invoice2data.input.INPUT_MODULES: dict[str, ModuleType]¶
Registry: backend name (the --input-reader value) -> backend module. Registered names: 'doctr', 'gvision', 'hotpdf', 'ocrmypdf', 'paddleocr', 'pdfium', 'pdfminer' (module invoice2data.input.pdfminer_wrapper), 'pdfoxide', 'pdfplumber', 'pdftotext', 'tesseract', 'text'. Unless noted, each name maps to the invoice2data.input submodule of the same name.
- invoice2data.input.available_modules()¶
Return the registered backends whose dependencies are available.
- Returns:
Subset of INPUT_MODULES usable in the current environment.
- Return type:
dict[str, ModuleType]
- invoice2data.input.extract_text(module, invoicefile, area=None)¶
Extract text with a backend, memoized per (backend, file, mtime, area).
Avoids re-parsing the same document within a run – e.g. when several template fields share one area, or the same full text is requested again. The file mtime is part of the key so a changed file is re-read.
- Parameters:
module (ModuleType) – An input backend exposing to_text.
invoicefile (str) – Path to the document.
area (dict[str, Any] | None) – Optional area-restriction passed through.
- Returns:
The extracted text.
- Return type:
str
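A memoization keyed on (backend, file, mtime, area) might look like the sketch below. It is an illustration only: `cached_text`, the module-level `_CACHE`, the JSON serialization of the area, and the in-memory `fake` backend are assumptions, not the library's code.

```python
import json
import os
import tempfile
from types import ModuleType

_CACHE = {}


def cached_text(module, invoicefile, area=None):
    """Memoize `module.to_text` per (backend, file, mtime, area)."""
    # Dicts are unhashable, so the area is serialized into a stable string.
    area_key = json.dumps(area, sort_keys=True) if area else None
    key = (module.__name__, invoicefile, os.path.getmtime(invoicefile), area_key)
    if key not in _CACHE:
        if area is None:
            _CACHE[key] = module.to_text(invoicefile)
        else:
            _CACHE[key] = module.to_text(invoicefile, area)
    return _CACHE[key]


# Demonstration with a hypothetical in-memory backend that records its calls.
calls = []
fake = ModuleType("fake_backend")
fake.to_text = lambda path, area=None: calls.append(area) or "text"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")

cached_text(fake, handle.name)
cached_text(fake, handle.name)  # served from the cache
cached_text(fake, handle.name, {"x": 1, "y": 2})
cached_text(fake, handle.name, {"y": 2, "x": 1})  # same area, same key
os.remove(handle.name)
```

Four requests reach the backend only twice: once for the full text and once for the area. Including the mtime in the key means an edited file produces a new key rather than a stale hit.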
- invoice2data.input.is_available(module)¶
Return whether a backend’s runtime dependency is available.
- Parameters:
module (ModuleType) – An input backend module.
- Returns:
The result of the backend’s is_available() if it defines one, otherwise True (the backend is assumed always available).
- Return type:
bool
- invoice2data.input.supports_area(module)¶
Return whether a backend supports area-restricted extraction.
- Parameters:
module (ModuleType) – An input backend module.
- Returns:
True if the backend declares SUPPORTS_AREA = True.
- Return type:
bool
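Both helpers describe optional attributes with defaults, which can be sketched with `getattr`. The function names and the three in-memory modules below are hypothetical; the defaults (available unless the backend says otherwise, no area support unless declared) follow the descriptions above.

```python
from types import ModuleType


def backend_is_available(module):
    """Call the backend's own check, or assume it is always available."""
    check = getattr(module, "is_available", None)
    return bool(check()) if callable(check) else True


def backend_supports_area(module):
    """True only when the backend declares SUPPORTS_AREA = True."""
    return getattr(module, "SUPPORTS_AREA", False) is True


# Hypothetical backends built in memory for the demonstration.
plain = ModuleType("plain_backend")  # defines neither attribute
ocr = ModuleType("ocr_backend")
ocr.is_available = lambda: False
ocr.SUPPORTS_AREA = False
regional = ModuleType("regional_backend")
regional.SUPPORTS_AREA = True
```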
pdfium (default)¶
pypdfium2 input module for invoice2data.
A fast, dependency-light PDF text backend (Google’s PDFium bindings). Its text
order/spacing differs from poppler’s pdftotext -layout (PDFium has no layout
mode), so layout-sensitive templates should still pin input_module: pdftotext.
Area (region) extraction is supported in-process via PDFium’s
get_text_bounded; note its output is not identical to pdftotext’s area output,
so an area template targets one backend’s text, not both.
- invoice2data.input.pdfium.SUPPORTS_AREA = True¶
PDFium can extract a bounded region in-process (see _crop_pages).
- invoice2data.input.pdfium.is_available()¶
Return whether the optional pypdfium2 package is importable.
- Returns:
True if pypdfium2 is installed.
- Return type:
bool
- invoice2data.input.pdfium.to_text(path, area_details=None, **kwargs)¶
Extract text from a PDF using pypdfium2.
- Parameters:
path (str) – Path to the PDF file.
area_details (dict[str, Any] | None) – Restrict extraction to a region. Keys (pixels at r dpi, top-left origin): f/l (first/last page), x/y (top-left), W/H (size), r (dpi). Defaults to None (whole document).
**kwargs (Any) – Ignored; accepted for backend compatibility.
- Returns:
The extracted text, pages joined by newlines.
- Return type:
str
pdftotext¶
Poppler pdftotext input module for invoice2data.
Full-page extraction shells out to pdftotext -layout. Area (region) extraction
no longer re-runs pdftotext per area: word positions are read once via
pdftotext -bbox-layout (cached per file) and the requested rectangle is cropped
in Python, so several area fields on one document cost a single parse.
- invoice2data.input.pdftotext.is_available()¶
Return whether the poppler pdftotext binary is on the PATH.
- Returns:
True if pdftotext can be run.
- Return type:
bool
- invoice2data.input.pdftotext.to_text(path, area_details=None)¶
Extract text from a PDF file using pdftotext.
- Parameters:
path (str) – Path to the PDF file.
area_details (dict[str, Any] | None, optional) – Restrict extraction to a region. Keys (pixels at r dpi): f/l (first/last page), x/y (top-left), W/H (size), r (resolution dpi). Defaults to None (whole document).
- Returns:
The extracted text.
- Return type:
str
- Raises:
FileNotFoundError – If the specified PDF file is not found.
OSError – If pdftotext is not installed.
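Cropping a rectangle from cached word positions, as described for area extraction, can be sketched as follows. This is an illustration under assumptions: the word tuples are a hypothetical, already-parsed form of the bounding-box output (points, top-left origin, reading order), and selecting words by their centre point is one possible rule, not necessarily the one the backend uses.

```python
def area_to_points(area):
    """Convert an area given in pixels at `r` dpi to PDF points (72 per inch)."""
    scale = 72.0 / area["r"]
    x0, y0 = area["x"] * scale, area["y"] * scale
    return x0, y0, x0 + area["W"] * scale, y0 + area["H"] * scale


def crop_words(words, area):
    """Join the words whose centre lies inside the requested rectangle.

    `words` is a hypothetical list of (x_min, y_min, x_max, y_max, text)
    tuples in points, top-left origin, already in reading order.
    """
    x0, y0, x1, y1 = area_to_points(area)
    kept = []
    for x_min, y_min, x_max, y_max, text in words:
        cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            kept.append(text)
    return " ".join(kept)


words = [
    (72.0, 72.0, 120.0, 84.0, "Invoice"),
    (126.0, 72.0, 160.0, 84.0, "1234"),
    (72.0, 400.0, 110.0, 412.0, "Total"),
]
# A 600 x 100 px box starting at (250, 250) px, measured at 300 dpi.
area = {"f": 1, "l": 1, "x": 250, "y": 250, "W": 600, "H": 100, "r": 300}
header = crop_words(words, area)
```

Since the word list is parsed once, every further area on the same document is a cheap in-memory filter, which is the cost saving the module description refers to.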
text¶
text input module for invoice2data.
- invoice2data.input.text.to_text(path)¶
Reads the content of a text file.
- Parameters:
path (str) – The path to the text file.
- Returns:
The content of the text file.
- Return type:
str
pdfplumber¶
pdfplumber input module for invoice2data.
- invoice2data.input.pdfplumber.is_available()¶
Return whether the optional pdfplumber package is importable.
- Returns:
True if pdfplumber is installed.
- Return type:
bool
- invoice2data.input.pdfplumber.to_text(path, **kwargs)¶
Extract text from PDF using pdfplumber.
- Parameters:
path (str) – Path to the PDF file.
**kwargs (dict[str, Any]) – Keyword arguments to be passed to pdfplumber.
- Returns:
Extracted text from the PDF.
- Return type:
str
- Raises:
ImportError – If the optional pdfplumber dependency is not installed.
pdfminer¶
pdfminer input module for invoice2data.
- invoice2data.input.pdfminer_wrapper.is_available()¶
Return whether the optional pdfminer.six package is importable.
- Returns:
True if pdfminer is installed.
- Return type:
bool
- invoice2data.input.pdfminer_wrapper.to_text(path, **kwargs)¶
Wrapper around pdfminer to extract text from PDF.
- Parameters:
path (str) – Path to the PDF file.
**kwargs (dict[str, Any]) – Keyword arguments to be passed to pdfminer.
- Returns:
Extracted text from the PDF.
- Return type:
str
pdfoxide¶
pdf-oxide input module for invoice2data.
A fast Rust-based PDF text backend (pdf_oxide). Its text order/spacing
differs from poppler’s pdftotext -layout, so templates tuned for pdftotext
may need adjustment.
- invoice2data.input.pdfoxide.is_available()¶
Return whether the optional pdf-oxide package is importable.
- Returns:
True if pdf_oxide is installed.
- Return type:
bool
- invoice2data.input.pdfoxide.to_text(path, **kwargs)¶
Extract text from a PDF using pdf-oxide.
- Parameters:
path (str) – Path to the PDF file.
**kwargs (dict[str, Any]) – Ignored; accepted for backend compatibility.
- Returns:
The extracted text, pages joined by newlines.
- Return type:
str
hotpdf¶
hotpdf input module for invoice2data.
hotpdf is a fast pdfminer.six-based reader. Its plain-text output runs words
together more than pdftotext -layout, so templates tuned for pdftotext may
need adjustment.
- invoice2data.input.hotpdf.is_available()¶
Return whether the optional hotpdf package is importable.
- Returns:
True if hotpdf is installed.
- Return type:
bool
- invoice2data.input.hotpdf.to_text(path, **kwargs)¶
Extract text from a PDF using hotpdf.
- Parameters:
path (str) – Path to the PDF file.
**kwargs (dict[str, Any]) – Ignored; accepted for backend compatibility.
- Returns:
The extracted text, pages joined by newlines.
- Return type:
str
tesseract (OCR)¶
Tesseract OCR input module for invoice2data.
- invoice2data.input.tesseract.is_available()¶
Return whether the tesseract and ImageMagick binaries are present.
- Returns:
True if both tesseract and convert are on the PATH.
- Return type:
bool
- invoice2data.input.tesseract.to_text(path, area_details=None)¶
Extract text from image using tesseract OCR.
- Parameters:
path (str) – Path to the image file.
area_details (dict[str, Any] | None, optional) – Specific area in the image to extract text from. Defaults to None (extract from the entire image).
- Returns:
The extracted text.
- Return type:
str
- Raises:
FileNotFoundError – If the specified image file is not found.
OSError – If Tesseract OCR fails to extract text.
ocrmypdf (OCR)¶
OCRmyPDF input module for invoice2data.
- invoice2data.input.ocrmypdf.RECOMMENDED_SCAN_OPTIONS = {'clean': True, 'deskew': True, 'rotate_pages': True}¶
Common OCRmyPDF pre-processing knobs. Any of these may be passed through input_reader_config / pre_conf – they are forwarded verbatim to ocrmypdf.ocr – to clean up noisy scans: deskew, clean, clean_final, rotate_pages, remove_background, optimize (0-3, image/size optimization) and oversample (target DPI). This constant is a recommended starting set for scanned receipts (spread it into input_reader_config).
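Spreading the recommended set into a configuration could look like this. The dictionary values are copied from the constant shown above; the particular overrides are only an example.

```python
# Values copied from the documented constant, so the sketch runs standalone.
RECOMMENDED_SCAN_OPTIONS = {"clean": True, "deskew": True, "rotate_pages": True}

# Later keys win, so per-call overrides go after the spread.
input_reader_config = {**RECOMMENDED_SCAN_OPTIONS, "optimize": 1, "clean": False}
```

The resulting dictionary is what would be passed as the `input_reader_config` argument of `to_text` (or as `pre_conf` to `pre_process_pdf`).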
- invoice2data.input.ocrmypdf.is_available()¶
Backend availability check (see input.__interface__).
- Return type:
bool
- invoice2data.input.ocrmypdf.ocrmypdf_available()¶
Checks if the ocrmypdf module is available.
- Returns:
True if ocrmypdf is available, False otherwise.
- Return type:
bool
- invoice2data.input.ocrmypdf.pre_process_pdf(path, pre_conf=None)¶
Pre-process a PDF with ocrmypdf, returning the cleaned PDF path.
The output is a deskewed/cleaned/optimized, text-layered PDF – usually smaller than the original. Callers (e.g. an Odoo integration) can use the returned path to attach or replace the stored file for size savings, not just to feed pdftotext. Writes to a unique temp file unless pre_conf sets output_file. Logs a warning if ocrmypdf is not available.
- Parameters:
path (str) – Path to the PDF invoice file.
pre_conf (dict[str, Any] | None, optional) – Settings forwarded to ocrmypdf.ocr (merged over OPTIONS_DEFAULT); pass pre-processing knobs here (see RECOMMENDED_SCAN_OPTIONS). Defaults to None.
- Returns:
Path to the processed (cleaned, smaller) PDF, or None if processing fails.
- Return type:
str | None
- invoice2data.input.ocrmypdf.to_text(path, area_details=None, input_reader_config=None)¶
Pre-processes PDF files with ocrmypdf before PDFtotext parsing.
Ensures OCRmyPDF is installed before attempting to use it. If OCRmyPDF is not available, logs a warning and returns an empty string.
- Parameters:
path (str) – Path to the PDF invoice file.
area_details (dict[str, Any] | None, optional) – Details about the area to extract. Defaults to None.
input_reader_config (dict[str, Any] | None, optional) – Settings forwarded to ocrmypdf.ocr – e.g. pre-processing knobs like deskew / clean / rotate_pages / optimize (see RECOMMENDED_SCAN_OPTIONS). Defaults to None.
- Returns:
Extracted text from the PDF, or an empty string if OCRmyPDF is not available or processing fails.
- Return type:
str
docTR (deep-learning OCR)¶
docTR (deep-learning OCR) input module for invoice2data.
Local, trained OCR that handles scanned/photographed documents well, usually
without manual pre-processing (issue #526). Optional: install with
pip install invoice2data[doctr] (pulls in docTR + a torch backend; the model
weights download on first use). The OCR predictor is cached after the first call.
docTR reads PDFs and images directly (PDF rendering via pypdfium2 under the hood), so this backend OCRs the whole document and has no area-restricted mode.
- invoice2data.input.doctr.SUPPORTS_AREA = False¶
docTR OCRs the whole document; it has no area-restricted mode.
- invoice2data.input.doctr.doctr_available()¶
Return whether the optional python-doctr package is importable.
- Returns:
True if docTR can be imported.
- Return type:
bool
- invoice2data.input.doctr.is_available()¶
Backend availability check (see input.__interface__).
- Return type:
bool
- invoice2data.input.doctr.to_text(path, area_details=None, **kwargs)¶
Extract text from a PDF or image with docTR OCR.
- Parameters:
path (str) – Path to the PDF or image file.
area_details (dict[str, Any] | None) – Ignored (docTR has no area mode).
**kwargs (Any) – Ignored; accepted for backend-interface compatibility.
- Returns:
The OCR’d text, or an empty string if docTR is not available.
- Return type:
str
PaddleOCR (deep-learning OCR)¶
PaddleOCR (deep-learning OCR) input module for invoice2data.
Local, trained OCR with very broad language coverage (issue #526). Optional:
install with pip install invoice2data[paddleocr] (pulls in paddleocr +
paddlepaddle + pypdfium2; model weights download on first use). The OCR engine is
cached after the first call.
PaddleOCR works on images, so PDFs are rendered to page images with pypdfium2
first. The whole document is OCR’d; there is no area-restricted mode. The result
parser handles both the PaddleOCR 2.x ([box, (text, score)]) and 3.x
({"rec_texts": [...]}) shapes defensively.
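A defensive parser for the two result shapes might look like the sketch below. It is an assumption-laden illustration: `texts_from_result` is a hypothetical name, the outer list is assumed to hold one entry per page, and the 3.x entry is treated as a plain mapping.

```python
def texts_from_result(result):
    """Collect recognized strings from either result shape."""
    texts = []
    for page in result or []:
        if isinstance(page, dict):  # 3.x style: {"rec_texts": [...]}
            texts.extend(str(t) for t in page.get("rec_texts") or [])
            continue
        for item in page or []:  # 2.x style: [box, (text, score)]
            if isinstance(item, (list, tuple)) and len(item) == 2:
                text_score = item[1]
                if isinstance(text_score, (list, tuple)) and text_score:
                    texts.append(str(text_score[0]))
    return texts


# Hand-written sample results in each shape (not real OCR output).
box = [[0, 0], [10, 0], [10, 5], [0, 5]]
legacy = [[[box, ("Invoice", 0.99)], [box, ("Total 12.00", 0.95)]]]
modern = [{"rec_texts": ["Invoice", "Total 12.00"]}]

legacy_texts = texts_from_result(legacy)
modern_texts = texts_from_result(modern)
```

Empty pages and missing results fall through to an empty list instead of raising, which is the sense in which the parsing is defensive.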
- invoice2data.input.paddleocr.SUPPORTS_AREA = False¶
PaddleOCR OCRs the whole document; it has no area-restricted mode.
- invoice2data.input.paddleocr.is_available()¶
Backend availability check (see input.__interface__).
- Return type:
bool
- invoice2data.input.paddleocr.paddleocr_available()¶
Return whether the optional paddleocr package is importable.
- Returns:
True if PaddleOCR can be imported.
- Return type:
bool
- invoice2data.input.paddleocr.to_text(path, area_details=None, **kwargs)¶
Extract text from a PDF or image with PaddleOCR.
- Parameters:
path (str) – Path to the PDF or image file.
area_details (dict[str, Any] | None) – Ignored (PaddleOCR has no area mode).
**kwargs (Any) – Optional lang (PaddleOCR language code, default “en”); other keys are ignored.
- Returns:
The OCR’d text, or an empty string if PaddleOCR is not available.
- Return type:
str
Google Vision (OCR)¶
Google Cloud Vision input module for invoice2data.
Uses Cloud Vision’s async DOCUMENT_TEXT_DETECTION staged through Google Cloud
Storage, so a GCS bucket is required (set GOOGLE_CLOUD_BUCKET_NAME) plus
GOOGLE_APPLICATION_CREDENTIALS.
A modern, bucket-free alternative is Google Document AI (an “OCR processor”
run synchronously) — see the OCA module account_invoice_google_document_ai
(OCA/account-invoicing) for that approach. Worth considering as a future backend
that drops the GCS-bucket setup; it needs a Document AI processor id + the
google-cloud-documentai client.
- invoice2data.input.gvision.SUPPORTS_AREA = False¶
Google Vision OCRs the whole document; it has no area-restricted mode.
- invoice2data.input.gvision.is_available()¶
Backend availability check (see input.__interface__).
- Return type:
bool
- invoice2data.input.gvision.to_text(path, bucket_name=None, language='en')¶
Sends PDF files to Google Cloud Vision for OCR.
Before using invoice2data, make sure you have the auth JSON path set as the environment variable GOOGLE_APPLICATION_CREDENTIALS.
- Parameters:
path (str) – Path of the electronic invoice in JPG or PNG format.
bucket_name (str | None) – Name of the bucket to use for file storage and results cache. Defaults to “cloud-vision-84893”.
language (str, optional) – Language to use for OCR. Defaults to “en”.
- Returns:
Extracted text from the image.
- Return type:
str
- Raises:
OSError – If the google cloud bucket_name is not set.
Output modules¶
csv¶
CSV output module for invoice2data.
- invoice2data.output.to_csv.write_to_file(data, path, date_format='%Y-%m-%d', lines_mode='json')¶
Export extracted fields to CSV.
Appends .csv to path if missing and generates a CSV file in the specified directory, otherwise in the current directory.
- Parameters:
data (list[dict[str, Any]]) – A list of dictionaries of extracted fields. If only a single file was processed, it must be passed as a single-element list.
path (str) – CSV file to save output to.
date_format (str) – Date format used in the generated file. Defaults to “%Y-%m-%d”.
lines_mode (str) – How to render line-item arrays. “json” (default) JSON-encodes lines / tax_lines cells; “explode” writes one row per lines item, repeating invoice-level fields.
- Return type:
None
Notes
Provide a filename to the path parameter.
Examples
>>> import datetime
>>> import tempfile
>>> from pathlib import Path
>>> from invoice2data.output import to_csv
>>> data = [{'amount': 123.45, 'date': datetime.datetime(2024, 1, 1)}]
>>> path = Path(tempfile.mkdtemp()) / "invoice.csv"
>>> to_csv.write_to_file(data, str(path))
>>> path.exists()
True
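The difference between the two lines_mode values can be sketched with the standard library alone. This is a simplified illustration, not the module's code: `render_rows` and `to_csv_text` are hypothetical helpers, only the `lines` key is handled, and the way colliding column names are resolved (the line value wins) is an assumption.

```python
import csv
import io
import json


def render_rows(invoice, lines_mode="json"):
    """Return the CSV rows for one invoice under the given lines_mode."""
    lines = invoice.get("lines") or []
    if lines_mode == "explode" and lines:
        base = {k: v for k, v in invoice.items() if k != "lines"}
        # One row per line item; invoice-level fields repeat on every row.
        return [{**base, **line} for line in lines]
    row = dict(invoice)
    if "lines" in row:
        row["lines"] = json.dumps(row["lines"])
    return [row]


def to_csv_text(rows):
    """Write dict rows to CSV text, using the union of keys as the header."""
    header = []
    for row in rows:
        header.extend(k for k in row if k not in header)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


invoice = {
    "invoice_number": "A1",
    "lines": [{"name": "Widget", "qty": 2}, {"name": "Gadget", "qty": 1}],
}
exploded = to_csv_text(render_rows(invoice, "explode"))
encoded = render_rows(invoice, "json")
```

In "explode" mode the invoice number appears on both rows; in "json" mode there is a single row whose `lines` cell holds the JSON-encoded array.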
json¶
JSON output module for invoice2data.
- invoice2data.output.to_json.format_item(item, date_format)¶
Format an item for JSON serialization.
- Parameters:
item (Any) – The item to format.
date_format (str) – The date format to use.
- Returns:
The formatted item.
- Return type:
Any
- invoice2data.output.to_json.write_to_file(data, path, date_format='%Y-%m-%d')¶
Export extracted fields to JSON.
Appends .json to path if missing and generates JSON file in the specified directory, otherwise in the current directory.
- Parameters:
data (list[dict[str, Any]]) – List of dictionaries of extracted fields.
path (str) – Path of the JSON file to write.
date_format (str) – Date format used in the generated file. Defaults to “%Y-%m-%d”.
- Return type:
None
Notes
Provide a filename to the path parameter.
Examples
>>> import datetime
>>> import tempfile
>>> from pathlib import Path
>>> from invoice2data.output import to_json
>>> data = [{'amount': 123.45, 'date': datetime.datetime(2024, 1, 1)}]
>>> path = Path(tempfile.mkdtemp()) / "invoice.json"
>>> to_json.write_to_file(data, str(path))
>>> path.exists()
True
xml¶
XML output module for invoice2data.
- invoice2data.output.to_xml.defusedxml_available()¶
Checks if the defusedxml module is available.
- Returns:
True if defusedxml is available, False otherwise.
- Return type:
bool
- invoice2data.output.to_xml.dict_to_tags(parent, data, date_format)¶
Convert a dictionary to XML tags.
This function iterates through the dictionary and creates XML tags for each key-value pair. It handles different data types and formats dates according to the specified format.
- Parameters:
parent (ElementTree.Element) – The parent element.
data (dict[str, Any]) – The dictionary to be converted.
date_format (str) – The date format to use.
- Return type:
None
- invoice2data.output.to_xml.prettify(elem)¶
Return a pretty-printed XML string for the Element.
- Parameters:
elem (ElementTree.Element) – The Element to be pretty-printed.
- Returns:
A pretty-printed XML string.
- Return type:
Any
- invoice2data.output.to_xml.write_to_file(data, path, date_format='%Y-%m-%d')¶
Export extracted fields to xml.
Appends .xml to path if missing and generates the XML file in the specified directory; if no directory is given, the file is written to the root.
- Parameters:
data (list[dict[str, Any]]) – List of dictionaries containing extracted fields.
path (str) – Path to save the generated XML file.
date_format (str, optional) – Date format used in generated file. Defaults to “%Y-%m-%d”.
- Return type:
None
Notes
Provide a filename to the path parameter.
Examples
>>> import datetime
>>> import tempfile
>>> from pathlib import Path
>>> from invoice2data.output import to_xml
>>> data = [{'amount': 123.45, 'date': datetime.datetime(2024, 1, 1)}]
>>> path = Path(tempfile.mkdtemp()) / "invoice.xml"
>>> to_xml.write_to_file(data, str(path))
>>> path.exists()
True
Output streams¶
Output modules and the shared output-destination helper.
- invoice2data.output.open_output(path, suffix, **open_kwargs)¶
Open the destination for a given --output-name.
Streams to stdout for - or /dev/stdout and stderr for /dev/stderr; otherwise writes to path with suffix appended when missing.
- Parameters:
path (str) – The requested output name.
suffix (str) – File extension to ensure (e.g. .json) for file paths.
**open_kwargs (Any) – Extra arguments forwarded to Path.open for files.
- Yields:
TextIO – The writable text stream (a standard stream is not closed here).
- Return type:
Iterator[TextIO]
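A context manager with this behaviour could be written roughly as follows. This is a sketch, not the library's implementation: `open_destination` is a hypothetical name, and the default write mode and the exact suffix rule (append unless the name already ends with it) are assumptions.

```python
import sys
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_destination(path, suffix, **open_kwargs):
    """Yield stdout/stderr for the special names, else a file at `path`."""
    if path in ("-", "/dev/stdout"):
        yield sys.stdout  # left open: the caller does not own it
    elif path == "/dev/stderr":
        yield sys.stderr
    else:
        target = Path(path)
        if target.suffix != suffix:
            target = target.with_name(target.name + suffix)
        open_kwargs.setdefault("mode", "w")
        with target.open(**open_kwargs) as handle:
            yield handle


# Writing to a file name without extension gets the suffix appended.
out_dir = Path(tempfile.mkdtemp())
with open_destination(str(out_dir / "result"), ".json", encoding="utf-8") as stream:
    stream.write("{}")
written = out_dir / "result.json"

# The special name "-" yields the process's standard output.
with open_destination("-", ".json") as stream:
    is_stdout = stream is sys.stdout
```

Yielding the standard streams without closing them keeps later writes to the terminal working, which matches the "not closed here" remark above.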
Extract¶
loader¶
This module abstracts templates for invoice providers.
Templates are initially read from .yml or .json files and then kept as InvoiceTemplate instances.
- invoice2data.extract.loader.ordered_load(stream, loader=<function loads>)¶
Parse templates from an in-memory string instead of from disk.
Useful when templates live outside the filesystem (e.g. a database column or an API payload): extract_data(file, templates=ordered_load(db_text)). For YAML data pass loader=yaml.safe_load.
- Parameters:
stream (str) – Serialized templates – a JSON (default) or YAML array of template mappings.
loader (Callable[[str], Any], optional) – Callable turning stream into a list of template dicts. Defaults to json.loads; pass yaml.safe_load for YAML.
- Returns:
Parsed, prepared templates (empty list on a parse error, which is logged).
- Return type:
list[InvoiceTemplate]
- invoice2data.extract.loader.prepare_template(tpl)¶
Prepare a template for use.
- Parameters:
tpl (dict[str, Any]) – Template dictionary.
- Returns:
Processed template dictionary.
- Return type:
dict[str, Any] | None
- invoice2data.extract.loader.read_templates(folder=None)¶
Load YAML templates from template folder. Return list of dicts.
Use built-in templates if no folder is set.
- Parameters:
folder (str | None) – User-defined folder where templates are stored. If None, uses built-in templates.
- Returns:
List of InvoiceTemplate objects.
- Return type:
list[InvoiceTemplate]
Examples
>>> from invoice2data.extract.loader import read_templates
>>> templates = read_templates("./src/invoice2data/extract/templates/au")
>>> len(templates)  # Check the number of loaded templates
2
>>> templates[0]['template_name']  # Check the name of the first template
'au.com.opal.yml'
InvoiceTemplate¶
- class invoice2data.extract.invoice_template.InvoiceTemplate(*args, **kwargs)
Represents single template files that live as .yml files on the disk.
- Parameters:
args (Any)
kwargs (Any)
- coerce_type(value, target_type)
Coerces a value to the specified target type.
- Parameters:
value (str) – The value to be coerced.
target_type (str) – The target type to which the value should be coerced. Valid values: ‘int’, ‘float’, ‘date’.
- Returns:
The coerced value.
- Return type:
Any
- Raises:
AssertionError – If the target_type is unknown.
- extract(optimized_str, invoice_file, input_module)
Extracts data from the optimized string using the template.
- Parameters:
optimized_str (str) – The optimized string.
invoice_file (str) – The path to the invoice file.
input_module (Any) – The input module used.
- Returns:
The extracted data.
- Return type:
dict[str, Any]
- matches_input(extracted_str)
Check if the extracted string matches the template keywords.
- Parameters:
extracted_str (str) – The extracted text from the invoice.
- Returns:
True if the extracted string matches the template keywords, False otherwise.
- Return type:
bool
- parse_date(value)
Parses date and returns date after parsing.
- Parameters:
value (str)
- Return type:
Any
- parse_number(value)
Parses a number from a string.
This function parses a numerical value from a string, handling different decimal separators and thousands separators based on locale.
- Parameters:
value (str) – The string containing the number to be parsed.
- Returns:
The parsed numerical value.
- Return type:
float
- prepare_input(extracted_str)
Input raw string and do transformations, as set in template file.
- Parameters:
extracted_str (str)
- Return type:
str
Note
:no-index: works around an autodoc quirk where members of a typing-generic
OrderedDict[str, Any] subclass are emitted twice. The methods still render
here; they’re internal — the public API is extract_data().
Canonical field schema¶
Canonical output-field vocabulary and (opt-in) validation.
The field names below mirror docs/recommended-template-fields.md and the
OCA account_invoice_import_invoice2data Odoo module. This is the single
source of truth used by validation (and, later, the benchmark quality scoring).
Validation is intentionally conservative: templates may legitimately emit custom
fields, so validate_output() only flags an unrecognized field when it looks
like a typo of a canonical name. A template can opt into strict checking and
whitelist custom fields via its options (strict_fields / extra_fields).
- invoice2data.extract.schema.INVOICE_FIELDS: frozenset[str] = frozenset({'amount', 'amount_tax', 'amount_untaxed', 'bic', 'company_vat', 'country_code', 'currency', 'currency_symbol', 'date', 'date_due', 'date_end', 'date_start', 'desc', 'iban', 'incoterm', 'invoice_number', 'issuer', 'lines', 'mandate_id', 'mobile', 'narration', 'note', 'partner_city', 'partner_coc', 'partner_email', 'partner_name', 'partner_ref', 'partner_street', 'partner_street2', 'partner_street3', 'partner_website', 'partner_zip', 'payment_reference', 'payment_unece_code', 'siren', 'state_code', 'tax_lines', 'telephone', 'template_name', 'vat'})¶
Canonical top-level (invoice-level) field names.
- invoice2data.extract.schema.LINE_FIELDS: frozenset[str] = frozenset({'barcode', 'code', 'date_end', 'date_start', 'discount', 'line_note', 'line_tax_amount', 'line_tax_percent', 'name', 'price_subtotal', 'price_total', 'price_unit', 'product', 'qty', 'sectionheader', 'taxes', 'unece_code', 'uom'})¶
Canonical per-line-item field names.
- invoice2data.extract.schema.LINE_FIELD_ALIASES: dict[str, str] = {'description': 'name', 'tax_percent': 'line_tax_percent', 'unit_price': 'price_unit', 'unitprice': 'price_unit', 'vat_rate': 'line_tax_percent'}¶
Non-canonical line/tax-line field names mapped to their canonical equivalent. Applied to the output (not the templates) by normalize_line_fields(), so a template may keep these aliases and still produce the standard vocabulary. product is intentionally absent: it is a distinct field (product matching), not a synonym for name. description maps to name because Odoo reads a line’s label from name (description is only an invoice-level field for single-line imports).
- invoice2data.extract.schema.TAX_LINE_FIELDS: frozenset[str] = frozenset({'line_tax_amount', 'line_tax_code', 'line_tax_percent', 'price_subtotal'})¶
Canonical per-rate tax-line field names.
- invoice2data.extract.schema.normalize_line_fields(output)¶
Rename non-canonical keys in lines/tax_lines rows in place.
Maps each alias in LINE_FIELD_ALIASES to its canonical name so that extraction output uses one vocabulary regardless of the group names a template happens to use. An alias is only applied when the canonical key is not already present (so an explicit canonical value always wins).
- Parameters:
output (dict[str, Any]) – The extracted-fields dictionary, mutated in place.
- Return type:
None
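The renaming rule described above can be sketched in a few lines. This is an illustration only, not the library's implementation: normalize_rows_sketch is a hypothetical name, the alias table is copied from LINE_FIELD_ALIASES, and leaving an alias key untouched when the canonical key already exists is an assumption.

```python
from typing import Any

# Alias table copied from the LINE_FIELD_ALIASES entry in this reference.
LINE_FIELD_ALIASES = {
    "description": "name",
    "tax_percent": "line_tax_percent",
    "unit_price": "price_unit",
    "unitprice": "price_unit",
    "vat_rate": "line_tax_percent",
}


def normalize_rows_sketch(output: dict[str, Any]) -> None:
    """Rename alias keys in lines/tax_lines rows, mutating output in place."""
    for key in ("lines", "tax_lines"):
        rows = output.get(key)
        if not isinstance(rows, list):
            continue
        for row in rows:
            if not isinstance(row, dict):
                continue
            for alias, canonical in LINE_FIELD_ALIASES.items():
                # Rename only when the canonical key is absent, so an
                # explicit canonical value always wins (assumption: the
                # alias key is then left as-is).
                if alias in row and canonical not in row:
                    row[canonical] = row.pop(alias)


invoice = {
    "lines": [
        {"description": "Widget", "unit_price": 2.5},
        {"name": "Explicit", "description": "kept alias"},
    ]
}
normalize_rows_sketch(invoice)
```

The first row ends up with name and price_unit; the second keeps its explicit name.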
- invoice2data.extract.schema.validate_output(output, extra_fields=())¶
Find unrecognized field names in an extracted output.
Checks top-level fields against INVOICE_FIELDS, lines items against LINE_FIELDS, and tax_lines items against TAX_LINE_FIELDS. Recognized names, auto-typed (amount*/date*) names, and names in extra_fields are ignored.
- Parameters:
output (dict[str, Any]) – The extracted-fields dictionary.
extra_fields (Iterable[str]) – Custom field names to treat as known.
- Returns:
(field_name, suggestion) pairs for each unrecognized field; suggestion is the closest canonical name if the field looks like a typo, else None.
- Return type:
list[tuple[str, str | None]]
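A rough sketch of the "looks like a typo" check, assuming a similarity match from the standard-library difflib module. The function name, the 0.8 cutoff and the small CANONICAL subset are illustrative assumptions; the library's actual matching rule is not documented here.

```python
import difflib

# A small subset of INVOICE_FIELDS, enough for the illustration.
CANONICAL = frozenset(
    {"amount", "amount_tax", "date", "date_due", "invoice_number", "iban", "vat"}
)


def check_fields_sketch(fields, extra_fields=()):
    """Return (field_name, suggestion) pairs for unrecognized field names."""
    known = CANONICAL | set(extra_fields)
    findings = []
    for name in fields:
        # Recognized, whitelisted and auto-typed (amount*/date*) names pass.
        if name in known or name.startswith(("amount", "date")):
            continue
        # Assumption: a similarity ratio of 0.8 or more counts as a typo.
        close = difflib.get_close_matches(name, sorted(CANONICAL), n=1, cutoff=0.8)
        findings.append((name, close[0] if close else None))
    return findings


report = check_fields_sketch(["invoice_numbr", "my_custom", "amount_shipping", "date"])
```

Passing extra_fields=["my_custom"] would drop the second finding, which mirrors how a template whitelists custom fields.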
Validators¶
Lightweight, offline validators for disambiguating extracted identifiers.
Field-type regexes overlap – an IBAN and an EU VAT number can match each other’s pattern – so a captured value is classified by validating it rather than by the regex alone: try the strong checksum first (IBAN mod-97), then the format checks (VAT, BIC). Pure-Python, no network, no heavy deps.
Used by the template-authoring suggestion layer (to assign a captured value to the
right canonical field) and as an optional soft-validation of extracted fields. For
full per-country VAT/IBAN checksum coverage, python-stdnum can be layered on
later; this module stays dependency-free.
- invoice2data.extract.validators.VALIDATORS = {'bic': <function validate_bic>, 'iban': <function validate_iban>, 'vat': <function validate_vat>}¶
Validators in discriminating order – strongest (checksum) first, so an IBAN is never mistaken for a VAT number.
- invoice2data.extract.validators.classify_identifier(value)¶
Classify a captured value as one of the known identifier types.
Runs the validators in discriminating order (IBAN mod-97 checksum first, then the format-based VAT/BIC checks) so overlapping patterns resolve correctly – e.g. a value that passes the IBAN checksum is reported as "iban" even though it may also look VAT-shaped.
- Parameters:
value (str) – The captured string to classify.
- Returns:
- The matching key (“iban”, “vat” or “bic”), or None if nothing
validates.
- Return type:
str | None
- invoice2data.extract.validators.validate_bic(value)¶
Return whether value matches the SWIFT/BIC format (ISO 9362).
- Parameters:
value (str) – Candidate BIC; spaces and case are ignored.
- Returns:
True if it is a structurally valid 8- or 11-character BIC.
- Return type:
bool
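For orientation, a structural BIC check can be sketched with a single regular expression. The pattern below (four letters, a two-letter country code, two alphanumerics, optional three-character branch) is a common reading of the format and an assumption about what validate_bic accepts; looks_like_bic is a hypothetical name.

```python
import re

# 4-letter prefix, 2-letter country code, 2-char location, optional 3-char branch.
_BIC_RE = re.compile(r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")


def looks_like_bic(value: str) -> bool:
    """Return True for an 8- or 11-character BIC-shaped string."""
    compact = "".join(value.split()).upper()  # ignore spaces and case
    return _BIC_RE.fullmatch(compact) is not None


eight = looks_like_bic("DEUTDEFF")
eleven = looks_like_bic("deut deff 500")
too_short = looks_like_bic("DEUTDEF")
```

A format check like this is weak on its own (many eight-letter words pass), which is why the checksum-backed IBAN test runs first.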
- invoice2data.extract.validators.validate_iban(value)¶
Return whether value is a structurally valid IBAN (ISO 13616 mod-97).
- Parameters:
value (str) – Candidate IBAN; spaces and case are ignored.
- Returns:
True if the structure and the mod-97 checksum are both valid.
- Return type:
bool
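The mod-97 check works as follows: move the first four characters to the end, replace letters with numbers (A=10 … Z=35), and require the resulting integer to leave remainder 1 when divided by 97. A minimal sketch, with iban_checksum_ok as a hypothetical name and a generic 15 to 34 character length window instead of per-country lengths:

```python
def iban_checksum_ok(value: str) -> bool:
    """Return True if value has IBAN shape and passes the mod-97 check."""
    compact = "".join(value.split()).upper()  # ignore spaces and case
    if not (compact.isascii() and compact.isalnum()):
        return False
    if not 15 <= len(compact) <= 34:
        return False
    # Two-letter country code followed by two check digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


valid = iban_checksum_ok("GB82 WEST 1234 5698 7654 32")
tampered = iban_checksum_ok("GB82 WEST 1234 5698 7654 33")
```

Because a single changed digit breaks the checksum, a passing value is very unlikely to be a VAT number that merely looks IBAN-shaped.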
- invoice2data.extract.validators.validate_vat(value)¶
Return whether value matches a known EU VAT number format.
- Parameters:
value (str) – Candidate VAT number; spaces and case are ignored.
- Returns:
- True if it matches a supported country’s VAT format. This is a
format check, not a per-country checksum.
- Return type:
bool
Candidate extraction¶
Candidate extraction for guided / AI-assisted template authoring (AUTH-1).
Scans a document’s extracted text for typed candidate values – dates, monetary amounts and identifiers (IBAN/VAT/BIC) – each with its position in the text. The template-authoring layers (the copier-style CLI builder and AI template generation) consume these candidates to propose fields/regexes; the guided heuristics (AUTH-2) turn them into first-guess field assignments.
Pure-Python: reuses the offline validators to type identifiers and
dateparser (already a dependency) to parse dates.
- class invoice2data.extract.candidates.Candidate(kind, value, start, end, parsed)¶
A typed value found in document text.
- Parameters:
kind (str)
value (str)
start (int)
end (int)
parsed (Any)
- kind¶
Candidate type – “date”, “amount”, “iban”, “vat” or “bic”.
- Type:
str
- value¶
The raw substring as it appears in the text.
- Type:
str
- start¶
Start character offset in the source text.
- Type:
int
- end¶
End character offset in the source text.
- Type:
int
- parsed¶
Normalised value – a datetime for dates, a float for amounts, the whitespace-stripped identifier for iban/vat/bic.
- Type:
Any
- invoice2data.extract.candidates.find_amounts(text)¶
Find parseable monetary-amount candidates in text.
- Parameters:
text (str) – The document’s extracted text.
- Returns:
Amount candidates with a parsed float value.
- Return type:
list[Candidate]
- invoice2data.extract.candidates.find_candidates(text)¶
Find all typed candidates in text, sorted by position.
Amount candidates that fall inside a date (e.g. the 12.05 in 12.05.2024) are dropped to avoid double counting.
- Parameters:
text (str) – The document’s extracted text.
- Returns:
All date/amount/identifier candidates, ordered by start offset.
- Return type:
list[Candidate]
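The overlap rule can be illustrated with two deliberately simple patterns. Everything here is a stand-in: Span is a reduced version of Candidate, and DATE_RE / AMOUNT_RE are toy regexes, not the patterns the library uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Reduced stand-in for Candidate (no parsed value)."""

    kind: str
    value: str
    start: int
    end: int


# Toy patterns for the illustration only.
DATE_RE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")
AMOUNT_RE = re.compile(r"\d+[.,]\d{2}")


def find_spans(text: str) -> list:
    """Return date and amount spans, dropping amounts nested inside a date."""
    dates = [Span("date", m.group(), m.start(), m.end()) for m in DATE_RE.finditer(text)]
    amounts = [Span("amount", m.group(), m.start(), m.end()) for m in AMOUNT_RE.finditer(text)]
    kept = [
        a
        for a in amounts
        # An amount fully contained in a date span is part of that date.
        if not any(d.start <= a.start and a.end <= d.end for d in dates)
    ]
    return sorted(dates + kept, key=lambda s: s.start)


spans = find_spans("Date 12.05.2024 Total 99.50")
```

Without the containment filter, 12.05 would be reported as a second amount.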
- invoice2data.extract.candidates.find_dates(text)¶
Find parseable date candidates in text.
- Parameters:
text (str) – The document’s extracted text.
- Returns:
Date candidates whose value dateparser could parse.
- Return type:
list[Candidate]
- invoice2data.extract.candidates.find_identifiers(text)¶
Find validated identifier candidates (IBAN/VAT/BIC) in text.
Each potential identifier is classified via validators.classify_identifier() so overlapping patterns resolve correctly (e.g. an IBAN is not mistaken for a VAT number).
- Parameters:
text (str) – The document’s extracted text.
- Returns:
Identifier candidates with kind set to the validated type and parsed to the whitespace-stripped value, sorted by position.
- Return type:
list[Candidate]
Template suggestions¶
Guided heuristics that turn candidates into first-guess fields (AUTH-2).
Given the typed candidates from candidates, propose a first draft of field
assignments using deterministic rules – the same “collect all values of a type,
then pick by min/max/position” idea used by OCA’s account_invoice_import_simple_pdf:
- of the captured dates, the earliest is most likely the invoice date and the latest the date_due;
- the largest monetary amount is most likely the total amount;
- the first validated iban/vat/bic is offered as-is.
These are suggestions for the authoring layers (the CLI builder and AI template generation) to present for confirmation – never authoritative extraction.
- invoice2data.extract.suggestions.suggest_fields(candidates)¶
Propose first-guess field assignments from typed candidates.
Field keys use the canonical invoice vocabulary (date, date_due, amount, iban, vat, bic).
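The min/max/position rules from the module description can be sketched as below. Cand mirrors the documented Candidate attributes but is a local stand-in, suggest_sketch is a hypothetical name, and skipping date_due when only one distinct date exists is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Cand:
    """Local stand-in with the documented Candidate attributes."""

    kind: str
    value: str
    start: int
    end: int
    parsed: Any


def suggest_sketch(candidates) -> dict:
    """Pick first-guess fields: earliest/latest date, largest amount, first ids."""
    picked = {}
    dates = [c for c in candidates if c.kind == "date"]
    amounts = [c for c in candidates if c.kind == "amount"]
    if dates:
        picked["date"] = min(dates, key=lambda c: c.parsed)
        latest = max(dates, key=lambda c: c.parsed)
        # Assumption: only offer date_due when it differs from the invoice date.
        if latest.parsed != picked["date"].parsed:
            picked["date_due"] = latest
    if amounts:
        picked["amount"] = max(amounts, key=lambda c: c.parsed)
    by_position = sorted(candidates, key=lambda c: c.start)
    for kind in ("iban", "vat", "bic"):
        first = next((c for c in by_position if c.kind == kind), None)
        if first is not None:
            picked[kind] = first
    return picked


sample = [
    Cand("date", "01.05.2024", 0, 10, datetime(2024, 5, 1)),
    Cand("date", "31.05.2024", 20, 30, datetime(2024, 5, 31)),
    Cand("amount", "19.00", 40, 45, 19.0),
    Cand("amount", "119.00", 50, 56, 119.0),
    Cand("bic", "DEUTDEFF", 60, 68, "DEUTDEFF"),
]
guess = suggest_sketch(sample)
```

The result maps each canonical field name to the chosen candidate, so a caller can still show the raw value and its position for confirmation.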
- invoice2data.extract.suggestions.suggest_from_text(text)¶
Extract candidates from text and return first-guess field assignments.
Convenience wrapper around candidates.find_candidates() + suggest_fields() for the authoring layers.
- Parameters:
text (str) – The document’s extracted text.
- Returns:
Canonical field name -> the chosen candidate.
- Return type:
dict[str, Candidate]
Template builder¶
Deterministic template drafting for the CLI builder (AUTH-3).
Turns the AUTH candidates/suggestions into a first-draft invoice2data template
(issuer + keywords + field regexes) without any AI. The CLI builder
(invoice2data --new-template) presents this draft for confirmation; the
AI-assisted mode swaps this for ai.template_generator.generate_template().
- invoice2data.extract.template_builder.field_regex(spec)¶
Return the regex of a field spec (a bare string or a field dict).
- Parameters:
spec (str | dict[str, Any]) – A template field value.
- Returns:
The field’s regex pattern.
- Return type:
str
- invoice2data.extract.template_builder.field_regex_from_candidate(text, candidate)¶
Build a field regex anchored on the label preceding a candidate value.
Uses the text before the value on its line as a literal anchor (with flexible whitespace) plus a typed capture group, e.g. Date:\s*(\d[\d/.\-]+\d).
- Parameters:
text (str) – The full document text.
candidate (Candidate) – The candidate whose value to capture.
- Returns:
A regex with one capturing group around the value.
- Return type:
str
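One way to build such a label-anchored pattern, shown as a sketch. label_anchored_regex is a hypothetical helper that takes the value's start offset directly instead of a Candidate, and the default capture group is the date-shaped one from the example above; the library picks the capture by candidate type.

```python
import re


def label_anchored_regex(text: str, start: int, capture: str = r"(\d[\d/.\-]+\d)") -> str:
    """Build a regex from the label preceding the value that starts at start."""
    # The label is everything on the value's line before the value itself.
    line_start = text.rfind("\n", 0, start) + 1
    label = text[line_start:start]
    # Escape each label token literally and allow flexible whitespace between them.
    anchor = r"\s+".join(re.escape(token) for token in label.split())
    return anchor + r"\s*" + capture


sample = "ACME Ltd\nInvoice Date:   12/05/2024\nTotal: 99.50\n"
pattern = label_anchored_regex(sample, sample.index("12/05"))
match = re.search(pattern, sample)
```

Splitting the label on whitespace before escaping keeps the pattern tolerant of the irregular spacing that PDF text extraction tends to produce.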
- invoice2data.extract.template_builder.preview_field(spec, text)¶
Return what a field spec captures from text, after any cleanup.
- Parameters:
spec (str | dict[str, Any]) – A template field value (regex or dict).
text (str) – The sample text to match against.
- Returns:
The captured (and replace-cleaned) value, or None if the regex does not match.
- Return type:
str | None
- invoice2data.extract.template_builder.set_field_regex(spec, regex)¶
Return spec with its regex replaced (keeping any cleanup/replace).
- Parameters:
spec (str | dict[str, Any]) – The existing field spec.
regex (str) – The new regex pattern.
- Returns:
The updated spec (same shape as the input).
- Return type:
str | dict[str, Any]
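A sketch of how the two spec shapes might be handled. It assumes a field dict stores its pattern under a "regex" key holding a single string and treats the other keys (for example "replace") as opaque; get_regex and with_regex are hypothetical names.

```python
from typing import Any, Union

Spec = Union[str, dict]


def get_regex(spec: Spec) -> str:
    """Return the pattern of a bare-string or dict field spec."""
    # Assumption: dict specs keep a single pattern under the "regex" key.
    return spec if isinstance(spec, str) else spec["regex"]


def with_regex(spec: Spec, regex: str) -> Any:
    """Return a spec of the same shape with its pattern swapped."""
    if isinstance(spec, str):
        return regex
    updated = dict(spec)  # shallow copy so the caller's spec is not mutated
    updated["regex"] = regex
    return updated


bare = with_regex(r"Total:\s*(\d+)", r"Total:\s*([\d.]+)")
rich_spec = {"regex": r"Total:\s*(\d+)", "replace": [[",", "."]]}
rich = with_regex(rich_spec, r"Total:\s*([\d.,]+)")
```

Returning a copy rather than mutating keeps the original draft available if the user rejects the edit.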
- invoice2data.extract.template_builder.suggested_template(text)¶
Draft a template from a sample’s deterministic candidates (no AI).
- Parameters:
text (str) – The sample document’s extracted text.
- Returns:
A template dict with issuer, keywords and fields (canonical field name -> regex).
- Return type:
dict[str, Any]
- invoice2data.extract.template_builder.to_yaml(template)¶
Serialise a template dict to YAML for writing to a .yml file.
- Parameters:
template (dict[str, Any]) – The template to serialise.
- Returns:
The YAML document.
- Return type:
str
Date parsing¶
Tiered, cached date parsing.
Order, fastest applicable first:
1. the template’s explicit date_formats via stdlib datetime.strptime (microseconds, deterministic);
2. dateutil (fast, fuzzy, English-centric);
3. dateparser (slow, but multilingual / localized month names) – which is an optional dependency (pip install invoice2data[dateparser]).
With dateparser absent, localized month-name dates won’t parse, but numeric and English dates still do via tiers 1-2. Results are memoized (absolute-date parsing is deterministic for given inputs).
- invoice2data.extract._dates.parse_date(value, date_formats=(), languages=())¶
Parse a date string using the tiered strategy (memoized).
- Parameters:
value (str) – The date string to parse.
date_formats (tuple[str, ...]) – Template formats, tried first via strptime.
languages (tuple[str, ...]) – Language codes for the dateparser fallback.
- Returns:
The parsed datetime, or None.
- Return type:
datetime.datetime | None
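The tiered, memoized shape can be sketched with the standard library alone. parse_date_sketch is a hypothetical name, and datetime.fromisoformat is used purely as a stand-in for the dateutil/dateparser tiers, which are third-party. The tuple-typed parameters matter: functools.lru_cache needs hashable arguments.

```python
from datetime import datetime
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=1024)
def parse_date_sketch(value: str, date_formats: tuple = ()) -> Optional[datetime]:
    """Try explicit formats first, then a slower generic fallback."""
    # Tier 1: the template's explicit formats via strptime.
    for fmt in date_formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    # Stand-in for the later tiers (dateutil, then dateparser).
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


first = parse_date_sketch("12.05.2024", ("%d.%m.%Y",))
iso = parse_date_sketch("2024-05-12")
missing = parse_date_sketch("not a date")
```

A repeated call with the same arguments is served from the cache, which pays off when the same date string appears on every line of an invoice.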
Regex engine¶
Internal regex helpers with a compile-once cache.
All regex matching in invoice2data.extract goes through these thin
wrappers so each pattern is compiled only once (via an LRU cache) instead of on
every call. The engine is selected once at import time: the stdlib re by
default, or the API-compatible third-party regex package when
INVOICE2DATA_REGEX_ENGINE=regex is set in the environment.
- invoice2data.extract._regex.ENGINE: str = 're'¶
Name of the active regex engine (“re” or “regex”).
- invoice2data.extract._regex.compile(pattern, flags=0)¶
Compile a regex pattern, caching the result.
- Parameters:
pattern (str) – The regular expression pattern.
flags (int) – Regex flags passed to the engine. Defaults to 0.
- Returns:
- The compiled pattern object. The active engine
(re or the API-compatible regex) is treated as re for typing.
- Return type:
re.Pattern[str]
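A compile-once wrapper of this kind is typically a thin layer over functools.lru_cache. The sketch below uses the stdlib re engine only; cached_compile and cached_search are hypothetical names and the cache size is an arbitrary choice.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=512)
def cached_compile(pattern: str, flags: int = 0) -> "re.Pattern[str]":
    """Compile pattern once per (pattern, flags) pair."""
    return re.compile(pattern, flags)


def cached_search(pattern: str, string: str, flags: int = 0):
    """Search string using the cached compiled pattern."""
    # flags is always passed positionally so the cache key stays consistent.
    return cached_compile(pattern, flags).search(string)


found = cached_search(r"No\.\s*(\d+)", "Invoice No. 42")
same_object = cached_compile(r"\d+", 0) is cached_compile(r"\d+", 0)
```

Routing every helper through one cached compile function is also what makes a single engine switch possible.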
- invoice2data.extract._regex.findall(pattern, string, flags=0)¶
Return all non-overlapping matches of pattern in string.
- Parameters:
pattern (str) – The regular expression pattern.
string (str) – The text to search.
flags (int) – Regex flags. Defaults to 0.
- Returns:
A list of matches (strings or tuples of groups).
- Return type:
Any
- invoice2data.extract._regex.finditer(pattern, string, flags=0)¶
Iterate over all non-overlapping match objects of pattern in string.
- Parameters:
pattern (str) – The regular expression pattern.
string (str) – The text to search.
flags (int) – Regex flags. Defaults to 0.
- Returns:
A callable iterator yielding re.Match objects (or the active engine’s equivalent).
- Return type:
Any
- invoice2data.extract._regex.search(pattern, string, flags=0)¶
Search string for the first match of pattern.
- Parameters:
pattern (str) – The regular expression pattern.
string (str) – The text to search.
flags (int) – Regex flags. Defaults to 0.
- Returns:
A match object, or None if there is no match.
- Return type:
re.Match[str] | None
- invoice2data.extract._regex.split(pattern, string, maxsplit=0, flags=0)¶
Split string by occurrences of pattern.
- Parameters:
pattern (str) – The regular expression pattern.
string (str) – The text to split.
maxsplit (int) – Maximum number of splits. Defaults to 0 (no limit).
flags (int) – Regex flags. Defaults to 0.
- Returns:
A list of substrings.
- Return type:
Any
- invoice2data.extract._regex.sub(pattern, repl, string, count=0, flags=0)¶
Replace occurrences of pattern in string with repl.
- Parameters:
pattern (str) – The regular expression pattern.
repl (str) – The replacement string.
string (str) – The text to operate on.
count (int) – Maximum number of replacements. Defaults to 0 (all).
flags (int) – Regex flags. Defaults to 0.
- Returns:
The string with replacements applied.
- Return type:
str
Plugins¶
tables¶
Plugin to extract tables from an invoice.
- invoice2data.extract.plugins.tables.extract(self, content, output, invoice_file=None)¶
Try to extract tables from an invoice.
- Parameters:
self (InvoiceTemplate) – The current instance of the class.
content (str) – The content of the invoice.
output (dict[str, Any]) – The output dictionary, updated with the extracted data.
invoice_file (str | None) – Unused; accepted for plugin-interface compatibility (path-based plugins such as camelot need it).
- Returns:
- The extracted data as a list of dictionaries, or None if table parsing fails.
Each dictionary represents a row in the table.
- Return type:
list[Any] | None
lines¶
Plugin to extract individual lines from an invoice.
This plugin has been replaced by the “lines” parser. All new templates should use the parser instead. It’s provided for backward compatibility only.
- invoice2data.extract.plugins.lines.extract(self, content, output, invoice_file=None)¶
Extract individual lines from an invoice.
This plugin has been replaced by the “lines” parser. All new templates should use the parser instead. It’s provided for backward compatibility only.
- Parameters:
self (InvoiceTemplate) – The current instance of the class.
content (str) – The text content to parse.
output (dict[str, Any]) – A dictionary to store the extracted data.
invoice_file (str | None) – Unused; accepted for plugin-interface compatibility (path-based plugins such as camelot need it).
- Return type:
None
camelot¶
Camelot table-extraction plugin (optional) for invoice2data.
Opt-in: requires the optional camelot-py dependency
(pip install invoice2data[camelot]). Unlike the text parsers/plugins,
camelot re-reads the PDF itself to detect ruled (lattice) or
whitespace-aligned (stream) tables, so it needs the source file path.
Enable it with a top-level camelot: block in a template — either a single
mapping or a list of them. Recognised camelot.read_pdf keys (pages,
flavor, table_areas, columns, …) are forwarded as-is; the
plugin-specific keys are:
- field – output key to populate (default: lines)
- header – use the table’s first row as the column names (default: true)
- tables – which detected table to use: an index, or all (default: all)
Example:
camelot:
  flavor: lattice
  pages: "1"
  field: lines
- invoice2data.extract.plugins.camelot.extract(self, content, output, invoice_file=None)¶
Detect tables with camelot and add their rows to the output.
- Parameters:
self (Any) – The matched template (an InvoiceTemplate); provides the camelot: block.
content (str) – Unused – camelot reads the PDF directly.
output (dict[str, Any]) – Output dictionary to populate.
invoice_file (str | None) – Path to the source PDF (required).
- Return type:
None
- invoice2data.extract.plugins.camelot.is_available()¶
Return whether the optional camelot-py package is importable.
- Returns:
True if camelot is installed.
- Return type:
bool
Parsers¶
static¶
Pseudo-parser returning a static (predefined) value.
lines¶
Parser to extract individual lines from an invoice.
Initial work and maintenance by Holger Brunn @hbrunn
- invoice2data.extract.parsers.lines.parse(template, field, settings, content)¶
Parse lines from the content based on the given settings.
- Parameters:
template (InvoiceTemplate) – The template dictionary.
field (str) – The field name.
settings (dict[str, Any]) – The settings dictionary.
content (str) – The text content to parse.
- Returns:
The parsed lines.
- Return type:
list[dict[str, Any]]
- invoice2data.extract.parsers.lines.parse_block(template, field, settings, content)¶
Parse a block of lines to extract data.
This function parses a block of lines from an invoice to extract data based on the provided template and settings. It handles different line types (first line, last line, regular lines) and can skip specific lines based on the configuration.
- Parameters:
template (InvoiceTemplate) – The template containing extraction rules.
field (str) – The name of the field to extract.
settings (dict[str, Any]) – The settings for the extraction rule.
content (str) – The text content to parse.
- Returns:
- A list of dictionaries, where each dictionary
represents an extracted row with field-value pairs.
- Return type:
list[dict[str, Any]]
- invoice2data.extract.parsers.lines.parse_by_rule(template, field, rule, content)¶
Parse lines from a block of text based on a rule.
- Parameters:
template (InvoiceTemplate) – The template dictionary.
field (str) – The field name.
rule (dict[str, Any]) – The rule dictionary.
content (str) – The text content to parse.
- Returns:
The parsed lines.
- Return type:
list[dict[str, Any]]
- invoice2data.extract.parsers.lines.parse_current_row(match, current_row)¶
Parse the current row data.
- Parameters:
match (Match[str] | None) – The match object.
current_row (dict[str, Any]) – The current row dictionary.
- Returns:
The updated current row dictionary.
- Return type:
dict[str, Any]
- invoice2data.extract.parsers.lines.parse_line(patterns, line)¶
Parse a line using a given pattern or list of patterns.
This function searches for a match in the given line using the provided pattern or list of patterns. If a match is found, it returns the match object; otherwise, it returns None.
- Parameters:
patterns (str | list[str]) – The pattern(s) to search for.
line (str) – The line to parse.
- Returns:
A match object if a match is found, otherwise None.
- Return type:
Match[str] | None
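The "pattern or list of patterns" behaviour can be sketched as follows. first_match is a hypothetical name, and returning the first pattern that matches (in list order) is an assumption about precedence.

```python
import re
from typing import Optional, Union


def first_match(patterns: Union[str, list], line: str) -> Optional["re.Match[str]"]:
    """Return the first match among patterns, or None."""
    if isinstance(patterns, str):
        patterns = [patterns]  # normalise the single-pattern form
    for pattern in patterns:
        match = re.search(pattern, line)
        if match:
            return match
    return None


row_patterns = [
    r"(?P<qty>\d+)\s*x\s+(?P<name>.+?)\s+(?P<price_unit>\d+\.\d{2})$",
    r"(?P<name>.+?)\s+(?P<price_unit>\d+\.\d{2})$",
]
hit = first_match(row_patterns, "3 x Widget 2.50")
fallback = first_match(row_patterns, "Shipping 4.90")
```

Named groups make the match directly usable as a row: hit.groupdict() yields the field-value pairs.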
regex¶
Parser extracting data using regexes.
One or more regexes can be specified using the “regex” setting. By default it ignores duplicates and returns:
- single value if there was only a single match
- array for multiple matches
For more detailed parsing “type” and “group” settings can be specified.
- invoice2data.extract.parsers.regex.parse(template, field, settings, content, legacy=False)¶
Parse a field from the content using regular expressions.
- Parameters:
template (Any) – The template object.
field (str) – The name of the field to extract.
settings (dict[str, Any]) – The settings for the field extraction.
content (str) – The text content to parse.
legacy (bool, optional) – Whether to use legacy parsing. Defaults to False.
- Returns:
The extracted value(s) or None if parsing fails.
- Return type:
Any
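The default result shape (duplicates ignored, a single value for one match, a list for several) can be sketched like this. collect_values is a hypothetical name; type conversion and the "group" setting are left out, and taking group 1 when the pattern has a group is an assumption.

```python
import re


def collect_values(regexes, content: str):
    """Collect unique captures: None, a single value, or a list."""
    if isinstance(regexes, str):
        regexes = [regexes]
    unique = []
    for regex in regexes:
        for match in re.finditer(regex, content):
            # Assumption: use group 1 when present, else the whole match.
            value = match.group(1) if match.re.groups else match.group(0)
            if value not in unique:  # ignore duplicates, keep first-seen order
                unique.append(value)
    if not unique:
        return None
    return unique[0] if len(unique) == 1 else unique


single = collect_values(r"PO (\d+)", "PO 17 ... page 2: PO 17")
several = collect_values(r"PO (\d+)", "PO 17, PO 18, PO 17")
```

Collapsing duplicates matters for invoices that repeat the same number in a page header on every page.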
AI (optional)¶
The AI subsystem is opt-in and provider-pluggable (cloud LLMs or a local Ollama).
See AI features for configuration and usage. Requires the ai extra.
Configuration¶
AI provider configuration, resolved from environment variables.
All settings come from INVOICE2DATA_AI_* env vars so the core library never
hard-codes credentials. The default provider is mock (no network), so nothing
AI-related runs unless explicitly configured.
- class invoice2data.ai.config.AIConfig(provider, model, base_url, api_key)¶
Resolved AI configuration.
- Parameters:
provider (str)
model (str)
base_url (str | None)
api_key (str | None)
- provider¶
Provider key – “mock” or a vendor name (openai/deepseek/mistral/gemini/ollama) routed to the OpenAI-compatible provider.
- Type:
str
- model¶
Model identifier passed to the provider.
- Type:
str
- base_url¶
API base URL (vendor default when unset).
- Type:
str | None
- api_key¶
API key; not required for local providers (Ollama).
- Type:
str | None
- invoice2data.ai.config.VENDOR_BASE_URLS: dict[str, str] = {'deepseek': 'https://api.deepseek.com/v1', 'gemini': 'https://generativelanguage.googleapis.com/v1beta/openai', 'mistral': 'https://api.mistral.ai/v1', 'ollama': 'http://localhost:11434/v1', 'openai': 'https://api.openai.com/v1'}¶
Default OpenAI-compatible base URLs for known vendors, so a user only needs to set the provider name + key (DeepSeek/Mistral/Ollama and Gemini’s compat endpoint all speak the OpenAI chat-completions API).
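A sketch of how a vendor default could be resolved from this mapping. resolve_base_url is a hypothetical helper; letting an explicit base_url take precedence follows the base_url attribute description ("vendor default when unset"), while returning None for unknown providers is an assumption.

```python
from typing import Optional

# Copied from the VENDOR_BASE_URLS entry in this reference.
VENDOR_BASE_URLS = {
    "deepseek": "https://api.deepseek.com/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
    "mistral": "https://api.mistral.ai/v1",
    "ollama": "http://localhost:11434/v1",
    "openai": "https://api.openai.com/v1",
}


def resolve_base_url(provider: str, base_url: Optional[str] = None) -> Optional[str]:
    """Return the explicit base_url, else the vendor default, else None."""
    if base_url:
        return base_url
    return VENDOR_BASE_URLS.get(provider)


local = resolve_base_url("ollama")
proxied = resolve_base_url("openai", "https://proxy.example/v1")
offline = resolve_base_url("mock")
```

With such a table a user only has to name the provider; the endpoint follows from it.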
Provider interface¶
The AIProvider contract and the provider registry.
Mirrors the input-backend seam (input/__interface__.py): a small structural
contract plus a factory that instantiates the configured provider. Optional
provider dependencies are imported lazily inside get_provider() so importing
this module never requires the ai extra.
- class invoice2data.ai.__interface__.AIProvider(*args, **kwargs)¶
Structural contract for an AI extraction provider.
- name¶
Short provider identifier.
- Type:
str
- extract_structured(text, json_schema, *, instructions=None)¶
Extract structured fields from text, constrained to json_schema.
- Parameters:
text (str) – The document’s extracted text.
json_schema (dict[str, Any]) – JSON Schema the result must match.
instructions (str | None) – Optional system prompt override.
- Returns:
The structured fields.
- Return type:
dict[str, Any]
- is_available()¶
Return whether the provider is configured and its deps are present.
- Returns:
True if extract_structured() can be called.
- Return type:
bool
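A structural contract like this is usually expressed with typing.Protocol: any object with the right attributes qualifies, without inheriting from a base class. The sketch below is an illustration under that assumption; ProviderContract and CannedProvider are hypothetical names, and whether the library's protocol is runtime-checkable is not stated here.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class ProviderContract(Protocol):
    """Stand-in for the documented AIProvider members."""

    name: str

    def extract_structured(
        self, text: str, json_schema: dict, *, instructions: Optional[str] = None
    ) -> dict: ...

    def is_available(self) -> bool: ...


class CannedProvider:
    """Offline provider returning fixed fields, e.g. for tests."""

    name = "canned"

    def __init__(self, fields: dict) -> None:
        self._fields = dict(fields)

    def extract_structured(self, text, json_schema, *, instructions=None) -> dict:
        # Keep only the keys the schema declares.
        allowed = json_schema.get("properties", {})
        return {k: v for k, v in self._fields.items() if k in allowed}

    def is_available(self) -> bool:
        return True


provider = CannedProvider({"invoice_number": "A17", "internal": "x"})
schema = {"type": "object", "properties": {"invoice_number": {"type": "string"}}}
fields = provider.extract_structured("Invoice #A17", schema)
conforms = isinstance(provider, ProviderContract)
```

Note that CannedProvider never mentions ProviderContract; it satisfies the contract by shape alone.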
- invoice2data.ai.__interface__.get_provider(config=None)¶
Instantiate the configured AI provider.
- Parameters:
config (AIConfig | None) – Configuration to use; loaded from the environment when None.
- Returns:
A MockProvider when the provider is “mock”, otherwise an OpenAICompatibleProvider (which covers every supported vendor).
- Return type:
AIProvider
LLM fallback extraction¶
Runtime LLM fallback extraction (AI-2).
Opt-in: when no template matches (or every match is missing required fields) and
OCR doesn’t help, extract fields with the configured AIProvider,
constrained to the canonical JSON schema and validated. Results are tagged
extraction_method: "ai" so they are never confused with a deterministic
template match. Off unless explicitly enabled.
- invoice2data.ai.fallback.ai_fallback_extract(text, *, provider=None)¶
Extract fields from text via the configured AI provider (opt-in).
- Parameters:
text (str) – The document’s extracted text.
provider (AIProvider | None) – Provider to use; the configured one (get_provider()) when None.
- Returns:
Extracted fields tagged extraction_method: "ai", or an empty dict when text is empty, the provider is unavailable, or nothing was found.
- Return type:
dict[str, Any]
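The documented return contract can be sketched as a short guard chain. fallback_sketch and StubProvider are hypothetical; the schema is passed in explicitly and the validation step mentioned in the module description is omitted.

```python
class StubProvider:
    """Hypothetical offline provider used only for this illustration."""

    name = "stub"

    def __init__(self, available: bool, fields: dict) -> None:
        self._available = available
        self._fields = fields

    def is_available(self) -> bool:
        return self._available

    def extract_structured(self, text, json_schema, *, instructions=None) -> dict:
        return dict(self._fields)


def fallback_sketch(text: str, provider, json_schema: dict) -> dict:
    """Return AI-extracted fields tagged with their origin, or {}."""
    if not text.strip():
        return {}  # nothing to extract from
    if not provider.is_available():
        return {}  # provider not configured or deps missing
    fields = provider.extract_structured(text, json_schema)
    if not fields:
        return {}  # the model found nothing
    # Tag the result so it is never confused with a template match.
    return {**fields, "extraction_method": "ai"}


schema = {"type": "object", "properties": {"amount": {"type": "number"}}}
tagged = fallback_sketch("Total 10.00", StubProvider(True, {"amount": 10.0}), schema)
skipped = fallback_sketch("Total 10.00", StubProvider(False, {"amount": 10.0}), schema)
```

Downstream code can branch on the extraction_method key to route AI results to manual review.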
AI template generation¶
AI-assisted template generation (AI-1).
Drafts an invoice2data template from a sample document’s text using the
configured AIProvider, grounded with the deterministic candidates from
invoice2data.extract.suggestions so the model has concrete values to anchor
its regexes. preview_template then round-trips the draft against the same text
so the user can see what it captures before saving it.
Authoring-time only – this never runs during normal extraction; the default path stays deterministic templates.
- invoice2data.ai.template_generator.TEMPLATE_SCHEMA: dict[str, Any] = {'properties': {'exclude_keywords': {'items': {'type': 'string'}, 'type': 'array'}, 'fields': {'additionalProperties': {'type': 'string'}, 'type': 'object'}, 'issuer': {'type': 'string'}, 'keywords': {'items': {'type': 'string'}, 'type': 'array'}}, 'required': ['keywords', 'fields'], 'type': 'object'}¶
JSON Schema for the template the model must produce (not the invoice values).
- invoice2data.ai.template_generator.generate_template(text, *, provider=None, issuer=None)¶
Draft an invoice2data template from a sample document’s text.
- Parameters:
text (str) – The sample document’s extracted text.
provider (AIProvider | None) – Provider to use; the configured one (get_provider()) when None.
issuer (str | None) – Optional issuer name to force into the template.
- Returns:
- A template dict (issuer/keywords/fields) ready to review,
preview and save.
- Return type:
dict[str, Any]
- invoice2data.ai.template_generator.preview_template(template, text)¶
Apply a template’s field regexes to text to preview what it captures.
A lightweight round-trip so the user can confirm the draft before saving it; the real engine applies these regexes with the template’s options at runtime.
- Parameters:
template (dict[str, Any]) – A template dict with a fields mapping of field name -> regex string.
text (str) – The sample text to match against.
- Returns:
- Field name -> the first captured value (group 1 when the
regex has a group, otherwise the whole match). Fields that do not match are omitted.
- Return type:
dict[str, str]
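The round-trip described above is small enough to sketch in full. preview_sketch is a hypothetical name; it follows the documented rule (group 1 when the regex has a group, otherwise the whole match, non-matching fields omitted) and ignores template options.

```python
import re


def preview_sketch(template: dict, text: str) -> dict:
    """Map each field name to the first value its regex captures in text."""
    captured = {}
    for name, pattern in template.get("fields", {}).items():
        match = re.search(pattern, text)
        if match is None:
            continue  # fields that do not match are omitted
        # Group 1 when the pattern defines a group, else the whole match.
        captured[name] = match.group(1) if match.re.groups >= 1 else match.group(0)
    return captured


draft = {
    "keywords": ["ACME"],
    "fields": {
        "invoice_number": r"Invoice\s+#(\w+)",
        "currency": r"EUR|USD",
        "iban": r"IBAN:\s*(\S+)",
    },
}
preview = preview_sketch(draft, "ACME Invoice #A17 total 10.00 EUR")
```

The missing iban key in the result tells the author that this regex needs another look before the template is saved.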
JSON schema¶
Build a JSON Schema for invoice extraction from the canonical field registry.
The canonical vocabulary in invoice2data.extract.schema is the single
source of truth; this turns it into a JSON Schema so an LLM’s structured-output
mode can be constrained to the exact fields invoice2data understands.
- invoice2data.ai.schema_json.invoice_json_schema()¶
Return a JSON Schema describing the canonical invoice output.
Top-level invoice fields plus a lines array of line items, typed from the canonical registries.
- Returns:
A JSON Schema object suitable for structured-output APIs.
- Return type:
dict[str, Any]
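To show the general shape of such a schema, the sketch below builds one from two small field sets. It is a simplification: schema_sketch is a hypothetical name, every field is typed as a string (the library types fields from its registries), and closing the objects with additionalProperties: False is an assumption.

```python
import json

# Small subsets of INVOICE_FIELDS and LINE_FIELDS for the illustration.
INVOICE_SUBSET = frozenset({"invoice_number", "issuer", "currency"})
LINE_SUBSET = frozenset({"name", "qty", "price_unit"})


def schema_sketch(invoice_fields, line_fields) -> dict:
    """Build an object schema with top-level fields and a lines array."""
    properties = {name: {"type": "string"} for name in sorted(invoice_fields)}
    properties["lines"] = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {name: {"type": "string"} for name in sorted(line_fields)},
            "additionalProperties": False,
        },
    }
    return {"type": "object", "properties": properties, "additionalProperties": False}


schema = schema_sketch(INVOICE_SUBSET, LINE_SUBSET)
as_json = json.dumps(schema, indent=2)
```

Deriving the schema from the registries keeps the model's output vocabulary in step with the canonical field names.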