Developer Guide
Dongler is built for a simple integration path: install one package, pass it a document path, inspect the document object when needed, and choose the output your application needs.
Pick a Runtime
Use the runtime that matches the rest of your pipeline:
- Python: best for notebooks, ingestion jobs, data pipelines, and quick PDF to Markdown conversion.
- TypeScript: best for Node.js services, queues, and application backends.
- Rust: best when you want direct control over the core document IR.
- CLI: best for smoke tests, shell scripts, and inspecting what Dongler sees.
All of these call the same Rust extraction engine.
Parse a PDF
import dongler
doc = dongler.load("report.pdf")
markdown = doc.to_markdown()
json_payload = doc.to_json()
data = doc.to_dict()
The Markdown renderer is the fastest path when you need readable text for review, indexing, RAG ingestion, or downstream text processing. The JSON output is better when you need metadata, page numbers, block types, bounding boxes, warnings, images, or table cells.
Handle Failures
Treat document extraction as I/O. Files can be encrypted, scanned, malformed, or unsupported.
import dongler
try:
doc = dongler.load("report.pdf")
except Exception as exc:
print(f"Could not extract report.pdf: {exc}")
else:
print(doc.to_markdown())
For ingestion jobs, prefer load_many() so one bad file does not stop the
batch:
for result in dongler.load_many(["report.pdf", "invoice.pdf", "notes.txt"]):
if result["ok"]:
process(result["document"].to_markdown())
else:
log_error(result["path"], result["error"])
Inspect Output Quality
Start with the document metadata:
doc = dongler.load("report.pdf")
data = doc.to_dict()
print(data["metadata"]["character_count"])
print(data["metadata"]["word_count"])
print(data["metadata"]["block_count"])
print(data.get("warnings", []))
For a normal digitally born PDF, zero words and zero blocks usually means the file is scanned, image-only, encrypted in an unusual way, or using PDF features Dongler does not yet model. Keep a minimal fixture for documents like that so you can verify future upgrades.
Choose an Output Format
Use Markdown when the next step is search, display, review, RAG ingestion, or plain text processing.
Use LaTeX when the document contains technical text and you want a renderer that keeps more document-like structure.
Use JSON when your application needs the document IR: pages, blocks, tables, images, warnings, metadata, and anchors.
Keep It Local
Dongler's default PDF path is native and local. It does not send documents to a hosted API and does not require an OCR or LLM dependency for digitally born PDFs. If your workload includes scanned PDFs, add OCR as a separate step and feed the OCR output into your own pipeline until Dongler's optional OCR path is ready.
Prepare a Release
The repository keeps versions in package manifests and uses CI to publish after tests and package builds pass. Use the release helper to bump all package surfaces together:
python3 scripts/prepare-release.py <version>
The script updates manifests, refreshes lock files, runs the version consistency check, and prints the exact verification and push commands. Avoid hardcoded version text in docs unless it is necessary for a historical note.