PDF Workflow
Dongler includes a Rust-native extraction path for born-digital PDFs. It extracts text, page geometry, source anchors, table structure, image positions, and metadata, and renders them to Markdown, JSON, and LaTeX.
The intended workflow is:
import dongler
doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
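A minimal sketch of saving both renderings side by side. It assumes `to_markdown()` and `to_latex()` return strings, which the workflow above implies but does not state. `write_renderings` and the output file naming are hypothetical and not part of Dongler.

```python
from pathlib import Path


def write_renderings(doc, source_name, out_dir):
    """Write Markdown and LaTeX renderings of `doc` into `out_dir`.

    Hypothetical helper: `doc` is any object exposing to_markdown() and
    to_latex(), both assumed to return str.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(source_name).stem  # "invoice.pdf" -> "invoice"
    paths = {
        "markdown": out / f"{stem}.md",
        "latex": out / f"{stem}.tex",
    }
    paths["markdown"].write_text(doc.to_markdown(), encoding="utf-8")
    paths["latex"].write_text(doc.to_latex(), encoding="utf-8")
    return paths
```

With a loaded document this could be called as `write_renderings(doc, "invoice.pdf", "out")`. Both files come from the same document object, so they can be compared directly.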
PDF Output Goals
- Preserve readable page order.
- Extract text into paragraphs and sections.
- Convert positioned and ruled tables into structured table blocks where the detected grid is reliable.
- Carry useful metadata.
- Preserve block/page bounding boxes for citations.
- Record image object positions and source anchors.
- Render clean Markdown and LaTeX from the same document object.
Practical Checks
When you integrate Dongler into a PDF pipeline, start with these checks:
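import dongler

doc = dongler.load("report.pdf")
data = doc.to_dict()

# Both counts at zero suggests that no text layer was extracted.
print(data["metadata"]["word_count"])
print(data["metadata"]["block_count"])

# Warnings flag low-confidence or unsupported structures.
for warning in data.get("warnings", []):
    print(warning)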
If word_count and block_count are both zero for a visually simple PDF, the
file may be scanned, image-only, encrypted in an unusual way, or using a PDF
encoding path Dongler does not yet model. Keep a small fixture for that document
and file an issue with the PDF if it can be shared.
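The zero-count check above can be wrapped in a small triage step for batch pipelines. This is a sketch, not part of Dongler: `looks_unextracted` and `triage` are hypothetical names, the `"needs_review"` and `"ok"` labels are illustrative, and treating a missing metadata key as zero is an assumption. Only the `metadata`, `word_count`, `block_count`, and `warnings` keys shown above are read.

```python
def looks_unextracted(data):
    """Return True when extraction produced no words and no blocks.

    `data` is the dict from doc.to_dict(). Missing keys are treated as
    zero, which is an assumption of this sketch.
    """
    metadata = data.get("metadata", {})
    return (
        metadata.get("word_count", 0) == 0
        and metadata.get("block_count", 0) == 0
    )


def triage(data):
    """Hypothetical triage: label a document and keep its warnings."""
    status = "needs_review" if looks_unextracted(data) else "ok"
    return {"status": status, "warnings": list(data.get("warnings", []))}
```

Documents labeled `"needs_review"` are the ones worth keeping as fixtures. The label only says that the native path extracted nothing, not why.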
For table-heavy PDFs, inspect JSON in addition to Markdown:
dongler extract report.pdf --format json
The JSON output includes page and block geometry, table rows and cells, images, source anchors, and warnings. By design, not all of this detail appears in the rendered Markdown.
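Since the JSON schema is not spelled out here, a first inspection can summarize the structure without assuming any key names. The sketch below is a generic helper, not a Dongler API: `summarize_shape` and `summarize_json_text` are hypothetical names, and the helper works on any parsed JSON.

```python
import json


def summarize_shape(value, depth=2):
    """Describe the shape of parsed JSON down to `depth` levels.

    Makes no assumption about key names: dicts are summarized by their
    keys, lists by their length and the shape of their first element.
    """
    if isinstance(value, dict):
        if depth == 0:
            return "dict"
        return {key: summarize_shape(item, depth - 1) for key, item in value.items()}
    if isinstance(value, list):
        if not value or depth == 0:
            return f"list[{len(value)}]"
        return {f"list[{len(value)}]": summarize_shape(value[0], depth - 1)}
    return type(value).__name__


def summarize_json_text(text, depth=2):
    """Parse JSON text (for example, saved extract output) and summarize it."""
    return summarize_shape(json.loads(text), depth)
```

If the `extract` command writes its JSON to standard output (an assumption, not confirmed here), the output can be redirected to a file and the file contents passed to `summarize_json_text`. Raising `depth` shows more of the nested table and geometry structure.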
Native-First Scope
The v1 engine is deterministic and native-first. OCR and VLM/LLM repair are not default dependencies; low-confidence or unsupported PDF structures are surfaced as warnings and can be evaluated separately.