Input-backend benchmark (B2)

Speed vs. extraction accuracy of each text-extraction backend, measured by benchmarks/run.py over the bundled tests/compare/*.pdf invoices and their golden *.json outputs. Accuracy is the fraction of golden fields each backend reproduces through the matching template — so it reflects real template compatibility, not raw text quality. The bundled templates are tuned for poppler’s pdftotext -layout.

Per-backend (each backend in isolation)

backend

accuracy

matched

speed (raw to_text)

pdftotext

85.9 %

146/170

20.6 ms/file

pdfium

42.9 %

73/170

4.2 ms/file

pdfoxide

34.7 %

59/170

5.4 ms/file

pdfminer

24.7 %

42/170

77.9 ms/file

pdfplumber

7.1 %

12/170

120.9 ms/file

hotpdf

4.1 %

7/170

368.1 ms/file

pdftotext is the accuracy anchor (its -layout mode is what the templates expect). Among the fast, layout-less backends, pypdfium2 (pdfium) is both the fastest and the most accurate, which is why it is the fast-path backend in the cascade.

Cascade (end-to-end extract_data, incl. fallback)

The cascade tries a fast backend first and falls back to pdftotext when a template fails to match or a required field is missing. Layout/area-sensitive templates pin input_module: pdftotext so they always use poppler.

cascade order

accuracy

matched

speed (full extract)

pdftotext-first

85.9 %

146/170

41.7 ms/file

pdfium-first (no pins)

82.4 %

140/170

19.9 ms/file

pdfium-first + pins

85.9 %

146/170

28.3 ms/file

Pinning the four layout-sensitive bundled templates (com.amazon.aws, nl.be.coolblue, fr.free.adsl-fiber, fr.publicationannoncelegale) recovers the full accuracy of the pdftotext default while keeping pypdfium2’s speed for everything else. The 6 fields lost without pins are silent soft-failures (pypdfium2 matches the template and fills the required fields, but a value — usually a lines/line-item table, or an area-based field — is wrong because it has no layout mode).

Takeaway

  • pypdfium2 is now the default (DEFAULT_INPUT_READERS = [pdfium, pdftotext]): fast, dependency-light, permissive Apache/BSD licence (unlike AGPL PyMuPDF), and the best layout-less accuracy.

  • pdftotext stays the accuracy anchor and the cascade’s safety net. The cascade falls back to it automatically when pypdfium2 fails to match, misses a required field, or returns an empty lines/tables table — so most layout breakage is recovered without any per-template change.

  • Only templates where pypdfium2 returns a populated-but-wrong non-required value (an area: field, or a column-aligned table) need an explicit input_module: pdftotext pin; the bundled ones are pinned. Reproduce with python benchmarks/run.py.