API Reference

Dongler exposes an object API for path-based extraction and keeps the original text helper functions for compatibility.

Object API

Python:

import dongler

doc = dongler.load("report.pdf")
doc.to_markdown()
doc.to_latex()
doc.to_json()
doc.to_dict()

TypeScript:

import { load } from "@cristianexer/dongler";

const doc = load("report.pdf");
doc.toMarkdown();
doc.toLatex();
doc.toJson();
doc.toObject();

Rust:

let doc = dongler_core::load_path("report.pdf")?;
doc.to_markdown()?;
doc.to_latex()?;
doc.to_json()?;

Batch API

Batch processing returns one result per path. A failed or unsupported file does not stop the batch.

Python:

results = dongler.load_many(["notes.txt", "invoice.pdf"])

TypeScript:

import { loadMany } from "@cristianexer/dongler";
const results = loadMany(["notes.txt", "invoice.pdf"]);

Rust:

let results = dongler_core::load_many(["notes.txt", "invoice.pdf"]);

Each result has:

path
ok
document
error
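
Because a failed file does not stop the batch, callers will usually want to split results on `ok` before using `document`. The following is a minimal sketch, assuming the four fields above are exposed as attributes (that access style is an assumption, not confirmed here). `split_results` is a hypothetical helper, not part of Dongler, and `SimpleNamespace` objects stand in for real results so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def split_results(results):
    """Partition batch results into (succeeded, failed) by their `ok` flag."""
    succeeded = [r for r in results if r.ok]
    failed = [r for r in results if not r.ok]
    return succeeded, failed


# Stand-in results shaped like the fields listed above (path, ok, document, error).
# In real use these would come from dongler.load_many([...]).
results = [
    SimpleNamespace(path="notes.txt", ok=True, document="<document>", error=None),
    SimpleNamespace(path="invoice.pdf", ok=False, document=None, error="unsupported file"),
]

succeeded, failed = split_results(results)
for r in failed:
    # Report the failures instead of aborting the whole run.
    print(f"skipped {r.path}: {r.error}")
```

Checking `ok` first keeps the success path from touching a `document` that may be empty on failure.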

Choosing an Output

Use Markdown when indexing, displaying, reviewing, or passing document text to downstream text systems.
Use LaTeX when preserving technical text, formulas, or document-oriented rendering matters most.
Use JSON when you need page numbers, block types, bounding boxes, warnings, image references, or table cells.
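
To illustrate the JSON case, the structured form can be walked like any nested dictionary. This is a rough sketch, assuming the dictionary returned by `doc.to_dict()` mirrors the Document IR (`pages`, each with `number` and `blocks`). The `type` and `bbox` keys on a block, the hypothetical helper `summarize_blocks`, and the sample data are all assumptions for illustration.

```python
def summarize_blocks(doc_dict):
    """Return (page number, block type, bbox) for every block in the document dict."""
    rows = []
    for page in doc_dict.get("pages", []):
        for block in page.get("blocks", []):
            # `type` and `bbox` are assumed key names; a missing bbox becomes None.
            rows.append((page.get("number"), block.get("type"), block.get("bbox")))
    return rows


# Hand-written sample shaped like the assumed output of doc.to_dict().
sample = {
    "pages": [
        {"number": 1, "blocks": [{"type": "text", "bbox": [72, 90, 540, 120]}]},
        {"number": 2, "blocks": [{"type": "table"}]},
    ]
}

for row in summarize_blocks(sample):
    print(row)
```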

Compatibility Helpers

These functions still operate on in-memory text:

parse_text
to_markdown
to_latex
to_json
detect_format

Document IR

The document object wraps Dongler's serializable IR:

Document
  schema_version
  metadata
  pages[]
  assets[]
  warnings[]

Page
  number
  width
  height
  rotation
  bbox
  blocks[]
  images[]
  assets[]
  warnings[]

Block
  text | table | figure
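
For readers who prefer to see the outline as types, here is one way the top two levels could be modelled in Python. It is a sketch only: the field names come from the outline above, but every type, default, and sample value (including the `"1"` schema version and the page size) is an assumption, and these classes are not Dongler's own.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Page:
    number: int
    width: float
    height: float
    rotation: int = 0
    bbox: Optional[tuple] = None
    blocks: list = field(default_factory=list)    # text | table | figure blocks
    images: list = field(default_factory=list)
    assets: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


@dataclass
class Document:
    schema_version: str
    metadata: dict = field(default_factory=dict)
    pages: list = field(default_factory=list)
    assets: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


# Placeholder values; a real document would be produced by the loader.
doc = Document(schema_version="1", pages=[Page(number=1, width=612.0, height=792.0)])

# asdict() recurses into nested dataclasses, giving plain dicts and lists.
as_dict = asdict(doc)
```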

PDF blocks include source anchors and optional bounding boxes so rendered content can point back to page regions. TableBlock renders to Markdown and LaTeX when its extracted grid is rectangular.
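
The rectangular-grid condition can be made concrete with a short sketch. It assumes a grid is a list of rows of string cells and that the first row is the header. `is_rectangular` and `grid_to_markdown` are hypothetical helpers written for this example and do not reflect how Dongler implements TableBlock rendering. How Dongler handles non-rectangular grids is not stated here, so the sketch simply raises an error.

```python
def is_rectangular(grid):
    """True when the grid has at least one row and all rows share one non-zero width."""
    return bool(grid) and len(grid[0]) > 0 and all(len(row) == len(grid[0]) for row in grid)


def grid_to_markdown(grid):
    """Render a rectangular grid as a Markdown table, using the first row as the header."""
    if not is_rectangular(grid):
        raise ValueError("grid is not rectangular")
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


table = [["Item", "Qty"], ["Widget", "2"]]
print(grid_to_markdown(table))
```

A ragged grid such as `[["a", "b"], ["c"]]` fails the check, since a Markdown table needs the same number of cells in every row. The sketch does not escape `|` characters inside cells.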
