Architecture

One-page orientation for contributors. Details live in the source; this page is the map.

Pipeline

HTML string ─▶ html5ever parse ─▶ DOM ─▶ CSS cascade ─▶ box tree ─▶ layout ─▶ paint ops ─▶ PDF writer ─▶ bytes
               (scraper)                  (src/css.rs)   (src/box_  (src/     (src/          (pdf-writer
                                                          tree.rs)   layout.   lib.rs)        crate)
                                                                     rs)

Each arrow is a function call, not a process boundary. The Python side is a thin wrapper (src/pdfun/*.py) that mostly calls into the Rust extension pdfun._core.

Crates we depend on

pdf-writer — low-level PDF writer.
scraper — HTML parsing (wraps html5ever).
cssparser — CSS tokenizer.
image — PNG / JPEG decode for <img> and background-image.
ttf-parser — font metrics for text measurement.
pyo3 — Python bindings.

All of these are pure Rust. No system libraries are linked.

Where things live

Area	File	What it does
Python entry	`src/pdfun/__init__.py`	Re-exports the public API
HTML wrapper	`src/pdfun/html.py`	`HtmlDocument` class, ToC prepending
ToC builder	`src/pdfun/toc.py`	Heading scrape → `<ul>` markup
CLI	`src/pdfun/cli.py`	`click`-based `pdfun render`
Python ↔ Rust	`src/lib.rs`	PyO3 `#[pymodule]`, page/document/font types
HTML → box tree	`src/html_render.rs`	DOM walk, pseudo-element insertion, style dispatch
CSS parser	`src/css.rs`	Property parsing, inheritance, `calc()`, @page
Box tree	`src/box_tree.rs`	Intermediate tree shape
Layout	`src/layout.rs`	Line breaking, page breaking, float placement, background/border paint
DOM helpers	`src/dom.rs`	Shared node-walking utilities
Fonts	`src/font_metrics.rs`	Per-font width tables for the 14 built-ins
Images	`src/image.rs`	PNG/JPEG decode, XObject registration

Data flow per page

Parse HTML into a scraper tree.
Cascade — walk each element, collect matching rules from inline <style>, <link>, and the author stylesheet; compute an inherited ComputedStyle.
Build a box tree: one box per generator node, with pseudo-elements (::before/::after) spliced in.
Layout — flow children through lines, break into pages, compute final (x, y, w, h) for every box.
Paint — walk laid-out boxes emitting PDF operators (q/Q save/restore, cm transforms, W n clips, Do for XObjects, Tj for text).
Emit — pdf-writer serializes the object graph into the final byte stream; content streams are FlateDecode-compressed.

Font story

Only the 14 built-in PDF fonts are understood today: Helvetica (×4), Times (×4), Courier (×4), Symbol, ZapfDingbats. Each has a hardcoded AFM metrics table in src/font_metrics.rs. @font-face and system-font discovery are not implemented — they're on the roadmap but gated on picking a pure-Rust font shaping story (likely rustybuzz).

Tests

tests/test_pdfun.py — the low-level API (PdfDocument, Layout, Page).
tests/test_html.py — HTML/CSS end-to-end, assertions against decompressed content streams (see tests/_pdf_helpers.py).
tests/test_text_runs.py — multi-run paragraphs.
tests/test_visual.py — visual snapshot style comparisons.

Parity tracking lives in tools/parity/catalog.toml plus inline # spec: markers in tests. tools/parity/generate.py --check regenerates docs/PARITY.md and fails CI if it drifts.