PDF Workflow

Dongler includes a Rust-native extraction path for born-digital PDFs. It extracts text, page geometry, source anchors, table structure, image positions, and metadata, and renders them to Markdown, JSON, and LaTeX.

The intended workflow is:

import dongler

doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()

PDF Output Goals

  • Preserve readable page order.
  • Extract text into paragraphs and sections.
  • Convert positioned and ruled tables into structured table blocks where the detected grid is reliable.
  • Carry useful metadata.
  • Preserve block/page bounding boxes for citations.
  • Record image object positions and source anchors.
  • Render clean Markdown and LaTeX from the same document object.

Practical Checks

When you integrate Dongler into a PDF pipeline, start with these checks:

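import dongler

doc = dongler.load("report.pdf")
data = doc.to_dict()

# Zero counts on a visibly non-empty PDF suggest the text was not extracted.
print(data["metadata"]["word_count"])
print(data["metadata"]["block_count"])

# The "warnings" key may be absent, so fall back to an empty list.
for warning in data.get("warnings", []):
    print(warning)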

If word_count and block_count are both zero for a visually simple PDF, the file may be scanned, image-only, encrypted in an unusual way, or using a PDF encoding path Dongler does not yet model. Keep a small fixture for that document and file an issue with the PDF if it can be shared.
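The zero-count check above can be folded into a small triage helper for batch runs. This is a minimal sketch, not part of Dongler: it assumes only the dictionary shape shown in the snippet above (`metadata.word_count`, `metadata.block_count`, and an optional `warnings` list), and the names `looks_empty` and `triage` and the three labels it returns are hypothetical.

```python
def looks_empty(data: dict) -> bool:
    """Return True when the extraction reports no words and no blocks."""
    metadata = data.get("metadata", {})
    return (
        metadata.get("word_count", 0) == 0
        and metadata.get("block_count", 0) == 0
    )


def triage(data: dict) -> str:
    """Classify one extraction result (labels are hypothetical, not Dongler's).

    "empty"    -> likely scanned, image-only, or otherwise unsupported
    "warnings" -> text was extracted, but warnings were reported
    "ok"       -> text was extracted and no warnings were reported
    """
    if looks_empty(data):
        return "empty"
    if data.get("warnings"):
        return "warnings"
    return "ok"


# Hand-built dictionaries standing in for doc.to_dict() output.
samples = {
    "scan.pdf": {"metadata": {"word_count": 0, "block_count": 0}},
    "report.pdf": {
        "metadata": {"word_count": 812, "block_count": 40},
        "warnings": ["example warning text"],
    },
    "invoice.pdf": {"metadata": {"word_count": 120, "block_count": 9}},
}

for name, data in samples.items():
    print(name, triage(data))
```

Files labeled "empty" are the ones worth keeping as fixtures; the other two labels only decide whether a human needs to read the warnings.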

For table-heavy PDFs, inspect JSON in addition to Markdown:

dongler extract report.pdf --format json

The JSON output includes page/block geometry, table rows/cells, images, source anchors, and warnings. By design, not all of these appear in the rendered Markdown.
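Because the exact JSON schema is not described here, one cautious way to get oriented is to count how often each key appears before writing code against specific fields. The sketch below makes no assumption about Dongler's key names: `count_keys` is a hypothetical helper, and the sample document is an invented stand-in whose keys are illustrative only.

```python
import json
from collections import Counter


def count_keys(node, counts=None):
    """Recursively count how many times each dictionary key occurs."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            counts[key] += 1
            count_keys(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_keys(item, counts)
    return counts


# Invented stand-in for extracted JSON; the key names are NOT Dongler's schema.
sample = json.loads(
    '{"pages": [{"blocks": [{"kind": "table"}, {"kind": "text"}]},'
    ' {"blocks": [{"kind": "text"}]}], "warnings": []}'
)

for key, count in sorted(count_keys(sample).items()):
    print(f"{key}: {count}")
```

To run this against real output, replace `sample` with `json.load(...)` on a file holding the result of the extract command; how that output gets saved to a file is left to your shell or pipeline.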

Native-First Scope

The v1 engine is deterministic and native-first. OCR and VLM/LLM repair are not default dependencies; low-confidence or unsupported PDF structures are surfaced as warnings and can be evaluated separately.