Rust-native document engine v0.3.17

Documents in.
Structure out.

Dongler is a from-scratch Rust engine that turns PDFs and 15+ formats into clean Markdown, LaTeX, and typed JSON — locally, in milliseconds, from Python, TypeScript, Rust, or the CLI.

import dongler

doc = dongler.load("report.pdf")
print(doc.to_markdown())
pip install dongler

One API, identical output across every binding.

PDF · DOCX · XLSX · PPTX · ODT · ODS · ODP · HTML · XML · EML · Markdown · TeX · CSV · TSV · JSON · JSONL
90+ pages / second (release build, single core)
15+ input formats (PDF, Office, web, email)
4 languages (Python · TS · Rust · CLI)
0 cloud calls (fully local & deterministic)
How it works

One path, document to structure.

Load a path once. The engine parses, lays out, and structures the document, then renders it in the format your pipeline needs.

Any document

PDF and 15+ formats — born-digital or messy.

Dongler engine

Parse · font metrics · reading order · tables.

Structured output

Markdown, LaTeX, or a typed JSON document.

The engine

Sophisticated extraction, built from scratch.

No cloud, no OCR fallback by default, no third-party PDF runtime — just a purpose-built Rust core measured against real benchmarks.

A custom parser, not a wrapper

A from-scratch Rust PDF engine with its own tokenizer, font decoding, and CMap/ToUnicode handling — no pdfium, no poppler, no native bindings.

Font-metric bounding boxes

Glyph boxes derived from real font ascent/descent and the text matrix, rotation-aware, so geometry stays tight under scaling and /Rotate.
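
To make the geometry concrete, here is a minimal Python sketch of how a glyph box can be derived from font metrics and a text matrix. It is an illustration of the general technique, not Dongler's internal code: the names `Matrix` and `glyph_bbox` are hypothetical, metrics are assumed to be in the usual 1/1000 em glyph units, and the matrix is assumed to follow the PDF `[a b c d e f]` convention.

```python
from typing import NamedTuple, Tuple


class Matrix(NamedTuple):
    """Affine matrix in PDF order [a b c d e f] (hypothetical helper)."""
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    def apply(self, x: float, y: float) -> Tuple[float, float]:
        # PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f
        return (self.a * x + self.c * y + self.e,
                self.b * x + self.d * y + self.f)


def glyph_bbox(advance: float, ascent: float, descent: float,
               font_size: float, m: Matrix) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x0, y0, x1, y1) for one glyph.

    advance, ascent, and descent are in 1/1000 em units; descent is
    normally negative.
    """
    # Scale glyph-space metrics to text space.
    x0, x1 = 0.0, advance * font_size / 1000.0
    y0, y1 = descent * font_size / 1000.0, ascent * font_size / 1000.0
    # Transform all four corners so rotation and skew are handled.
    corners = [m.apply(x, y) for x in (x0, x1) for y in (y0, y1)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Upright glyph placed at (100, 200).
upright = glyph_bbox(500, 800, -200, 10.0, Matrix(1, 0, 0, 1, 100, 200))
# Same glyph rotated 90 degrees counter-clockwise around the same origin.
rotated = glyph_bbox(500, 800, -200, 10.0, Matrix(0, 1, -1, 0, 100, 200))
```

Transforming all four corners, rather than only two opposite ones, is what keeps the box tight when the text is rotated.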

Reading-order reconstruction

Multi-column layouts are detected and re-sequenced into natural reading order instead of raw stream order.
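
As a rough illustration of the idea, the sketch below re-sequences text blocks from stream order into column order. It is a deliberately simple heuristic and not Dongler's algorithm: `Block`, `reading_order`, and the `min_gap` threshold are hypothetical, coordinates are assumed to have a top-left origin (y grows downward), and full-width elements such as page titles are not handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    """A text block with its bounding box (hypothetical model)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks: List[Block], min_gap: float = 10.0) -> List[Block]:
    """Group blocks into columns by horizontal gaps, then read top to bottom."""
    columns: List[List[Block]] = []
    right_edge: Optional[float] = None
    # Sweep left to right; a gap wider than min_gap starts a new column.
    for block in sorted(blocks, key=lambda b: b.x0):
        if right_edge is None or block.x0 - right_edge >= min_gap:
            columns.append([block])
            right_edge = block.x1
        else:
            columns[-1].append(block)
            right_edge = max(right_edge, block.x1)
    # Columns are already left to right; sort each one top to bottom.
    return [b for col in columns for b in sorted(col, key=lambda b: b.y0)]


# Stream order interleaves the two columns row by row.
stream = [
    Block("left 1", 50, 100, 280, 200),
    Block("right 1", 320, 100, 550, 200),
    Block("left 2", 50, 220, 280, 300),
    Block("right 2", 320, 220, 550, 300),
]
ordered = [b.text for b in reading_order(stream)]
```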

Table structure with spans

Ruled, aligned, and implied tables are recovered into a real cell grid — including merged column headers.
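
One way to picture a cell grid with spans is the small Python model below. It is only a sketch: `Cell` and `to_grid` are hypothetical names, and Dongler's actual typed JSON schema may look different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    """A table cell anchored at (row, col) that may span several rows or columns."""
    row: int
    col: int
    text: str
    row_span: int = 1
    col_span: int = 1


def to_grid(cells: List[Cell], n_rows: int, n_cols: int) -> List[List[Optional[str]]]:
    """Expand spanning cells into a dense grid, repeating text across the span."""
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                if grid[r][c] is not None:
                    raise ValueError(f"cells overlap at row {r}, column {c}")
                grid[r][c] = cell.text
    return grid


# "Revenue" is a merged column header covering the two year columns.
cells = [
    Cell(0, 0, "Region", row_span=2),
    Cell(0, 1, "Revenue", col_span=2),
    Cell(1, 1, "2023"),
    Cell(1, 2, "2024"),
    Cell(2, 0, "EMEA"),
    Cell(2, 1, "10"),
    Cell(2, 2, "12"),
]
grid = to_grid(cells, n_rows=3, n_cols=3)
```

Storing spans on the cell and expanding on demand keeps the merged header as a single object while still allowing row-and-column lookups.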

Benchmarked, not hand-waved

Accuracy is measured with TEDS, GriTS, CER/WER, edit-similarity, and bbox IoU across a 1,400-PDF benchmark suite.
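
For readers unfamiliar with these metrics, here is a short Python sketch of the three simplest ones: CER, WER, and bbox IoU. These are the textbook definitions, written for illustration; they are not taken from Dongler's benchmark harness, and the function names are hypothetical. TEDS and GriTS compare table structure and are not shown.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def bbox_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

Lower is better for CER and WER; higher is better for IoU, where 1.0 means the predicted and reference boxes coincide.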

Local & deterministic

No hosted service, API key, or model dependency in the default path. The same input always yields the same structured output.
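
A determinism claim like this is easy to spot-check in your own pipeline. The sketch below hashes the output of repeated runs and compares digests; `is_deterministic` is a hypothetical helper, and the extractor is passed in as a plain callable so the check does not assume anything about Dongler's API beyond what you supply.

```python
import hashlib
from typing import Callable


def is_deterministic(extract: Callable[[str], str], path: str, runs: int = 3) -> bool:
    """Return True if every run of extract(path) produces byte-identical text."""
    digests = {
        hashlib.sha256(extract(path).encode("utf-8")).hexdigest()
        for _ in range(runs)
    }
    return len(digests) == 1


# With the Python package, the callable could wrap the calls shown in the
# quick-start snippet (left as a comment so this block runs without
# dongler installed):
#
#   import dongler
#   is_deterministic(lambda p: dongler.load(p).to_markdown(), "report.pdf")
```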

Everywhere you build

Same engine. Four ways to call it.

The Python and TypeScript packages are thin wrappers over the Rust core, so every binding returns the identical document model.

Python · Ingestion jobs & notebooks
pip install dongler
TypeScript · Services & queues
npm install @cristianexer/dongler
Rust · The core API, directly
cargo add dongler-core
CLI · Inspect & pipe anywhere
cargo install dongler

Start extracting in one line.

Install, point it at a document, and read structured output back.