Architecture
One-page orientation for contributors. Details live in the source; this page is the map.
Pipeline
HTML string ─▶ html5ever parse (scraper) ─▶ DOM ─▶ CSS cascade (src/css.rs) ─▶ box tree (src/box_tree.rs) ─▶ layout (src/layout.rs) ─▶ paint ops (src/lib.rs) ─▶ PDF writer (pdf-writer crate) ─▶ bytes
Each arrow is a function call, not a process boundary. The Python side is a thin wrapper (src/pdfun/*.py) that mostly calls into the Rust extension pdfun._core.
Crates we depend on
- `pdf-writer`: low-level PDF writer.
- `scraper`: HTML parsing (wraps html5ever).
- `cssparser`: CSS tokenizer.
- `image`: PNG / JPEG decode for `<img>` and `background-image`.
- `ttf-parser`: font metrics for text measurement.
- `pyo3`: Python bindings.
All of these are pure Rust. No system libraries are linked.
Where things live
| Area | File | What it does |
|---|---|---|
| Python entry | src/pdfun/__init__.py | Re-exports the public API |
| HTML wrapper | src/pdfun/html.py | HtmlDocument class, ToC prepending |
| ToC builder | src/pdfun/toc.py | Heading scrape → <ul> markup |
| CLI | src/pdfun/cli.py | click-based pdfun render |
| Python ↔ Rust | src/lib.rs | PyO3 #[pymodule], page/document/font types |
| HTML → box tree | src/html_render.rs | DOM walk, pseudo-element insertion, style dispatch |
| CSS parser | src/css.rs | Property parsing, inheritance, calc(), @page |
| Box tree | src/box_tree.rs | Intermediate tree shape |
| Layout | src/layout.rs | Line breaking, page breaking, float placement, background/border paint |
| DOM helpers | src/dom.rs | Shared node-walking utilities |
| Fonts | src/font_metrics.rs | Per-font width tables for the 14 built-ins |
| Images | src/image.rs | PNG/JPEG decode, XObject registration |
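
The ToC builder row describes a heading scrape that produces `<ul>` markup. As a rough illustration of that idea only, here is a minimal standard-library sketch. The names `build_toc` and `_HeadingCollector`, the flat (non-nested) list, and the `toc-hN` class names are all assumptions made for this example; the real logic lives in src/pdfun/toc.py and may differ.

```python
from html import escape
from html.parser import HTMLParser


class _HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1..h6 elements. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # heading level currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "h1".."h6" is enough.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buf).strip()))
            self._level = None


def build_toc(html: str) -> str:
    """Return a flat <ul> with one <li> per heading (assumed output shape)."""
    collector = _HeadingCollector()
    collector.feed(html)
    collector.close()
    items = "".join(
        f'<li class="toc-h{level}">{escape(text)}</li>'
        for level, text in collector.headings
    )
    return f"<ul>{items}</ul>"
```

The heading level is kept in a class name here so that indentation could be handled in CSS rather than by nesting lists.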
Data flow per page
- Parse HTML into a `scraper` tree.
- Cascade: walk each element, collect matching rules from inline `<style>`, `<link>`, and the author stylesheet; compute an inherited `ComputedStyle`.
- Build a box tree: one box per generator node, with pseudo-elements (`::before`/`::after`) spliced in.
- Layout: flow children through lines, break into pages, compute final `(x, y, w, h)` for every box.
- Paint: walk laid-out boxes emitting PDF operators (`q`/`Q` save/restore, `cm` transforms, `W n` clips, `Do` for XObjects, `Tj` for text).
- Emit: `pdf-writer` serializes the object graph into the final byte stream; content streams are FlateDecode-compressed.
Font story
Only the 14 built-in PDF fonts are understood today: Helvetica (×4), Times (×4), Courier (×4), Symbol, ZapfDingbats. Each has a hardcoded AFM metrics table in src/font_metrics.rs. @font-face and system-font discovery are not implemented — they're on the roadmap but gated on picking a pure-Rust font shaping story (likely rustybuzz).
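
Measuring text against an AFM-style table is simple arithmetic: AFM widths are expressed in thousandths of the font size, so a run's width is the sum of its glyph widths times the font size divided by 1000. The sketch below illustrates this with a handful of Helvetica widths. The `text_width` function, the tiny table, and the fallback width are assumptions for illustration, not the contents of src/font_metrics.rs.

```python
# Illustrative subset of Helvetica advance widths, in 1/1000 of the font size.
HELVETICA_WIDTHS = {" ": 278, "H": 722, "e": 556, "l": 222, "o": 556}


def text_width(text: str, font_size: float, widths: dict, default: int = 556) -> float:
    """Return the width of `text` in points.

    `default` is an assumed fallback for characters missing from the table;
    a real implementation would more likely map them to a known glyph.
    """
    units = sum(widths.get(ch, default) for ch in text)
    return units * font_size / 1000.0
```

For example, "Hello" sums to 722 + 556 + 222 + 222 + 556 = 2278 units, which is 22.78 pt at a 10 pt font size.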
Tests
- `tests/test_pdfun.py`: the low-level API (PdfDocument, Layout, Page).
- `tests/test_html.py`: HTML/CSS end-to-end, assertions against decompressed content streams (see `tests/_pdf_helpers.py`).
- `tests/test_text_runs.py`: multi-run paragraphs.
- `tests/test_visual.py`: visual snapshot-style comparisons.
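
Because content streams are FlateDecode-compressed, end-to-end tests have to inflate them before asserting on operators. The following is a minimal sketch of how such a helper could work; `decompressed_streams` is a hypothetical name and this is not a copy of `tests/_pdf_helpers.py`. It scans for `stream` / `endstream` pairs with a regular expression instead of honoring `/Length`, which is good enough for small test documents but is not a general PDF parser.

```python
import re
import zlib

# Everything between a "stream" keyword (plus its end-of-line) and "endstream".
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def decompressed_streams(pdf_bytes: bytes) -> list:
    """Return the payload of every stream in `pdf_bytes`, inflated where possible."""
    payloads = []
    for match in _STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that precede "endstream".
            payloads.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            payloads.append(raw)  # not zlib data, e.g. an uncompressed stream
    return payloads
```

A test could then assert something like `any(b"(Hello) Tj" in s for s in decompressed_streams(pdf))`, where `pdf` is the rendered output.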
Parity tracking lives in tools/parity/catalog.toml plus inline # spec: markers in tests. tools/parity/generate.py --check regenerates docs/PARITY.md and fails CI if it drifts.