API Reference
Dongler exposes an object API for path-based extraction and keeps the original text helper functions for compatibility.
Object API
Python:
import dongler
doc = dongler.load("report.pdf")
doc.to_markdown()
doc.to_latex()
doc.to_json()
doc.to_dict()
TypeScript:
import { load } from "@cristianexer/dongler";
const doc = load("report.pdf");
doc.toMarkdown();
doc.toLatex();
doc.toJson();
doc.toObject();
Rust:
let doc = dongler_core::load_path("report.pdf")?;
doc.to_markdown()?;
doc.to_latex()?;
doc.to_json()?;
Batch API
Batch processing returns one result per path. A failed or unsupported file does not stop the batch.
Python:
results = dongler.load_many(["notes.txt", "invoice.pdf"])
TypeScript:
import { loadMany } from "@cristianexer/dongler";
const results = loadMany(["notes.txt", "invoice.pdf"]);
Rust:
let results = dongler_core::load_many(["notes.txt", "invoice.pdf"]);
Each result has:
- path
- ok
- document
- error
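A minimal sketch of consuming a batch, assuming each result exposes the four fields above as attributes named path, ok, document, and error. BatchResult is a hypothetical stand-in so the snippet runs without Dongler installed; the real result type may be shaped differently.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BatchResult:
    """Hypothetical stand-in mirroring the four documented result fields."""
    path: str
    ok: bool
    document: Optional[Any] = None
    error: Optional[str] = None


def split_results(results):
    """Separate successful results from failed ones; every result is visited."""
    succeeded = [r for r in results if r.ok]
    failed = [r for r in results if not r.ok]
    return succeeded, failed


# Stand-in data; in real use this list would come from dongler.load_many(...).
results = [
    BatchResult(path="notes.txt", ok=True, document={"pages": []}),
    BatchResult(path="invoice.pdf", ok=False, error="unsupported file"),
]

succeeded, failed = split_results(results)
for r in failed:
    # Report each failure without discarding the files that did load.
    print(f"{r.path}: {r.error}")
```

Because a failed file does not stop the batch, checking ok on every result is the only way to notice that a path was skipped.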
Choosing an Output
- Use Markdown when indexing, displaying, reviewing, or passing document text to downstream text systems.
- Use LaTeX when preserving technical text, formulas, or document-oriented rendering is the priority.
- Use JSON when you need page numbers, block types, bounding boxes, warnings, image references, or table cells.
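The choice above can be wrapped in a small dispatcher. This is a sketch, not part of Dongler: render and FakeDoc are hypothetical names, and the only assumption about the document object is that it has the to_markdown, to_latex, and to_json methods shown in the Object API section.

```python
def render(doc, fmt):
    """Call the output method matching fmt: "markdown", "latex", or "json"."""
    methods = {
        "markdown": "to_markdown",
        "latex": "to_latex",
        "json": "to_json",
    }
    if fmt not in methods:
        raise ValueError(f"unknown output format: {fmt!r}")
    return getattr(doc, methods[fmt])()


class FakeDoc:
    """Hypothetical stand-in for a loaded document, for demonstration only."""

    def to_markdown(self):
        return "# Title"

    def to_latex(self):
        return "\\section{Title}"

    def to_json(self):
        return '{"pages": []}'


# In real use, doc would come from dongler.load("report.pdf").
markdown = render(FakeDoc(), "markdown")
```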
Compatibility Helpers
These functions still operate on in-memory text:
- parse_text
- to_markdown
- to_latex
- to_json
- detect_format
Document IR
The document object wraps Dongler's serializable IR:
Document
  schema_version
  metadata
  pages[]
  assets[]
  warnings[]
Page
  number
  width
  height
  rotation
  bbox
  blocks[]
  images[]
  assets[]
  warnings[]
Block
  text | table | figure
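As an illustration of walking the IR outlined above, the sketch below counts blocks by kind. It assumes that doc.to_dict() returns plain dicts and lists whose keys match the field names shown, and that each block stores its kind under a "type" key. The "type" key and the sample data are assumptions, not confirmed Dongler behaviour.

```python
from collections import Counter


def count_block_types(document):
    """Count blocks by kind across all pages of a dict-shaped document."""
    counts = Counter()
    for page in document.get("pages", []):
        for block in page.get("blocks", []):
            # "type" is an assumed discriminator key; adjust to the real schema.
            counts[block.get("type", "unknown")] += 1
    return counts


# Hand-written sample in the assumed shape; real data would come from doc.to_dict().
sample = {
    "pages": [
        {"number": 1, "blocks": [{"type": "text"}, {"type": "table"}]},
        {"number": 2, "blocks": [{"type": "text"}]},
    ],
}

counts = count_block_types(sample)
```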
PDF blocks include source anchors and optional bounding boxes so rendered
content can point back to page regions. TableBlock renders to Markdown and
LaTeX when its extracted grid is rectangular.
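One way to read "rectangular" is that every row has the same number of cells. The check below is a sketch under that reading, with the grid represented as a list of rows; is_rectangular is a hypothetical helper, and Dongler's own test may differ.

```python
def is_rectangular(grid):
    """Return True when the grid is non-empty and all rows have equal, non-zero length."""
    if not grid:
        return False
    width = len(grid[0])
    return width > 0 and all(len(row) == width for row in grid)


# A 2x2 grid qualifies; a grid with a short second row does not.
full_grid = [["Item", "Price"], ["Pen", "1.20"]]
ragged_grid = [["Item", "Price"], ["Pen"]]
full_ok = is_rectangular(full_grid)
ragged_ok = is_rectangular(ragged_grid)
```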