PDF export / extraction — polish + test plan
What this is: research into the “PDF extraction”/export feature of Galaxy Markdown (Notebooks, Pages, workflow reports), prioritized improvement ideas, and a detailed plan to programmatically test it. Produced by a research subagent reading the history_pages worktree. File:line refs are to that branch.
(A) Current PDF/export pipeline map (end to end)
Three entry points, all converging on markdown_util.py.
- Page → synchronous PDF: GET /api/pages/{id}.pdf (api/pages.py:196-217) → PagesService.show_pdf (services/pages.py:156-169) → internal_galaxy_markdown_to_pdf(..., PdfDocumentType.page), streams application/pdf (blocks the web thread).
- Page → async PDF: services/pages.py:171-187 prepare_pdf → allocates short_term_storage target, runs to_basic_markdown in the web process, ships GeneratePdfDownload(...) to Celery prepare_pdf_download (celery/tasks.py:568-576) → generate_branded_pdf (markdown_util.py:1208). Asymmetry: directive walk runs sync in the web worker; only weasyprint render is offloaded.
- Workflow invocation report → PDF: GET /api/invocations/{id}/report.pdf (+2 aliases) api/workflows.py:1610-1647 → show_invocation_report(format=pdf) → get_invocation_report → WorkflowMarkdownGeneratorPlugin.generate_report_pdf (workflow/reports/generators/__init__.py:63-67) → internal_galaxy_markdown_to_pdf(..., invocation_report). Fully synchronous.
Core conversion (markdown_util.py):
internal_galaxy_markdown_to_pdf(trans, md, document_type)   # :1201
├─ _check_can_convert_to_pdf_or_raise()   # :1195 raises ServerNotConfiguredForRequest if no weasyprint
├─ basic_markdown = to_basic_markdown(trans, md)   # :1155
│  ├─ resolve_invocation_markdown(...)   # :1311 output=/input=/step= → real ids
│  └─ ToBasicMarkdownDirectiveHandler.walk(...)   # :876 directive → inline markdown/base64 images
└─ to_branded_pdf(basic_markdown, document_type, config)   # :1221 prologue/epilogue + per-doc-type CSS
   └─ to_pdf_raw(branded_markdown, css_paths)   # :1169
      ├─ to_html(basic_markdown)   # :1163 markdown(... ["tables"]) + sanitize_html(allow_data_urls=True)
      └─ weasyprint.HTML(...).write_pdf(stylesheets=[markdown_export_base.css, *css_paths])   # :1179
- PDF engine = WeasyPrint (guarded import markdown_util.py:34-37; weasyprint_available() :1191). Renders in a temp dir, shutil.rmtree in finally.
- Branding (to_branded_pdf :1221): config.markdown_export_prologue[_pages|_invocation_reports], ..._epilogue, ..._css. Base stylesheet packaged lib/galaxy/managers/markdown_export_base.css (resource_string).
- Async contract: GeneratePdfDownload (schema/tasks.py:37-41); PdfDocumentType = {invocation_report, page} (schema/__init__.py:104).
- Client capability flag: markdown_to_pdf_available = weasyprint_available() (managers/configuration.py:168).
Rasterization (the “extraction”) — ToBasicMarkdownDirectiveHandler (markdown_util.py:876-1152):
- handle_dataset_as_image (:911-932): path= → embed raw bytes as PNG; else delegate to datatype.handle_dataset_as_image(hda), try/except falling back to embedding raw file bytes as png.
- handle_dataset_as_pdf (:934-949): parse optional page=N, call getattr(datatype, "render_pdf_page_as_image_markdown", None)(hda, page); on missing/exception → *cannot display PDF page N for {name}*.
- _embed_image (:951-953): base64 image-embed helper (one of the three embed sites listed under P3).
- handle_workflow_image (:1004-1009): workflow SVG via _embed_image(..., "svg+xml", ...).
Datatype rasterizer (lib/galaxy/datatypes/images.py):
Pdf(Image) :536; handle_dataset_as_image (:546) → render_pdf_page_as_image_markdown(hda, page=1) (:552); _page_as_png(file_name, page_number=1, dpi=150) (:569, @staticmethod): pymupdf optional (:31-34), returns None if absent/exception, clamps page to [0, page_count-1], clamps DPI so longest side ≤ MAX_RENDER_PX=2000, get_pixmap(dpi).tobytes("png").
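The two clamps described above are plain arithmetic and can be illustrated without pymupdf. This is a minimal sketch, not the code in images.py: the helper names clamp_page_index and effective_dpi are hypothetical, and it assumes page dimensions are given in PDF points (72 per inch).

```python
MAX_RENDER_PX = 2000  # cap quoted in these notes
DEFAULT_DPI = 150     # default quoted in these notes
POINTS_PER_INCH = 72  # PDF user-space units per inch


def clamp_page_index(page_number: int, page_count: int) -> int:
    """Map a 1-based page number onto a valid 0-based index (hypothetical helper)."""
    return max(0, min(page_number - 1, page_count - 1))


def effective_dpi(width_pt: float, height_pt: float,
                  dpi: int = DEFAULT_DPI, max_px: int = MAX_RENDER_PX) -> int:
    """Lower the DPI just enough that the longest rendered side fits in max_px."""
    longest_pt = max(width_pt, height_pt)
    if longest_pt * dpi / POINTS_PER_INCH <= max_px:
        return dpi
    # Round down so the rendered side never exceeds the cap.
    return max(1, int(max_px * POINTS_PER_INCH / longest_pt))
```

For a US Letter page (612 x 792 pt) the default 150 DPI already fits under the cap; an A0-sized poster (about 2384 x 3370 pt) would be rendered at a reduced DPI instead.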
Directive registration: markdown_parse.py:30 VALID_ARGUMENTS (history_dataset_as_pdf: hid, history_dataset_id, input, invocation_id, output, page); dispatch markdown_util.py:290-293; abstract :439; implemented in all 3 handlers (ReadyForExport no-op :609, collector :753, ToBasic :934).
Dependency status:
- WeasyPrint NOT installed by default: conditional (dependencies/conditional-requirements.txt:65-73, weasyprint>=61.2), gated on GALAXY_DEPENDENCIES_INSTALL_WEASYPRINT=1 (dependencies/__init__.py:324). Needs system cairo/Pango.
- PyMuPDF now a hard dep: packages/data/pyproject.toml:44 + pinned-requirements.txt:224 (pymupdf==1.27.2.3). The optional-import guard in images.py is now mostly defensive.
(B) Rough edges / gaps
- Inconsistent “cannot display” fallbacks. markdown_util.handle_dataset_as_pdf returns the string with \n\n; images.render_pdf_page_as_image_markdown returns it without. Two layers of fallback, divergent formatting, duplicate page-clamp/format logic.
- history_dataset_as_pdf silent in live client / not embed-capable. Not in EMBED_CAPABLE_DIRECTIVES (markdown_parse.py:71-87); pure no-op in ReadyForExportMarkdownDirectiveHandler (:609). Live-view vs baked-report discrepancy: confirm whether the Vue client renders it live; if not, it’s export-only.
- Multi-page PDFs: only one page ever rendered (default 1). No page-range / all-pages. A multi-page scientific PDF silently shows only page 1 via history_dataset_as_image.
- handle_dataset_as_image raw-bytes fallback is wrong for PDFs (markdown_util.py:927-931): on exception it embeds the file as data:image/png; for a PDF that’s raw %PDF bytes mislabeled PNG → broken <img>.
- No whole-document size/time guard. MAX_RENDER_PX caps one page bitmap, but a report base64-inlines every referenced dataset into one in-memory HTML string then hands it to WeasyPrint. No cap on image count, total HTML size, or per-dataset file size before read(). Memory blow-up risk in the web worker (esp. prepare_pdf).
- Synchronous render paths block workers: page show_pdf and the entire invocation-report path render WeasyPrint inline in the request; only prepare_pdf offloads (and even it walks directives in-process).
- DPI/sizing hardcoded & not directive-driven: dpi=150, MAX_RENDER_PX=2000 hardcoded; no width/size arg on history_dataset_as_pdf (unlike workflow_image’s size).
- handle_dataset_as_table ignores compact/title/footer/headers (explicit TODO :956).
- Security / SSRF (most important). to_html uses sanitize_html(..., allow_data_urls=True) (markdown_util.py:1163; whitelist sanitize_html.py:255). WeasyPrint’s default url_fetcher will fetch any file:// or http(s):// resource left in the HTML → SSRF / local-file disclosure during server-side render of user-authored markdown. No custom restrictive url_fetcher.
- Secret-leak footgun. to_branded_pdf uses a getattr(config, f"markdown_export_..._{document_type}s") pattern (:1222-1231); safe today (enum-driven), but any future directive interpolating config.<attr> by a user-controlled name could echo secrets (e.g. the OpenAI key in galaxy.yml). Worth a guard/test.
- Optional-dep UX: missing pymupdf only triggers log.warning → report shows “cannot display” with no admin signal. Missing weasyprint → 501 only at PDF step.
- Weak test assertions: the one integration PDF test asserts only headers, never that a figure rendered. No test for history_dataset_as_pdf, page clamping, DPI clamp, missing-pymupdf handler fallback, or SSRF.
- handle_invocation_inputs/outputs/visualization are stubs in ToBasic (:1128-1135) → *... not implemented* (normally pre-expanded by resolve_invocation_markdown; a bare directive degrades to a stub).
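One possible shape for the missing whole-document guard is a per-export budget that every embed site charges before it reads a file. This is only a sketch under assumptions: EmbedBudget, EmbedBudgetExceeded, and all three limits are hypothetical and do not exist in the branch.

```python
from dataclasses import dataclass


class EmbedBudgetExceeded(Exception):
    """Raised when an export would inline more image data than allowed."""


@dataclass
class EmbedBudget:
    # Hypothetical limits; real values would need to come from config.
    max_images: int = 50
    max_bytes_per_image: int = 10 * 1024 * 1024
    max_total_bytes: int = 50 * 1024 * 1024
    images: int = 0
    total_bytes: int = 0

    def charge(self, size: int) -> None:
        """Account for one embed of `size` bytes, or refuse before any read()."""
        if size > self.max_bytes_per_image:
            raise EmbedBudgetExceeded(f"image of {size} bytes exceeds per-image cap")
        if self.images + 1 > self.max_images:
            raise EmbedBudgetExceeded("too many embedded images")
        if self.total_bytes + size > self.max_total_bytes:
            raise EmbedBudgetExceeded("total embedded bytes exceed document cap")
        self.images += 1
        self.total_bytes += size
```

A handler would call budget.charge(os.path.getsize(path)) before opening the dataset file and turn EmbedBudgetExceeded into a *cannot display* note, so an oversized report degrades instead of exhausting worker memory.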
(C) Prioritized improvement ideas (with files)
P1 — Harden WeasyPrint against SSRF / local-file reads. In to_pdf_raw (markdown_util.py:1169) pass a restrictive url_fetcher that resolves only data: URIs and rejects file:/http(s):. Keep it a module-level (importable/testable) function. Highest value, small. (Test: feed <img src="file:///etc/passwd"> / external URL, assert refusal.)
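A minimal sketch of what such a fetcher could look like. It assumes WeasyPrint's documented url_fetcher contract (a callable taking the URL and returning a dict with a "string" bytes payload and optional "mime_type"); the names data_only_url_fetcher and DisallowedUrlError are hypothetical. It decodes data: URIs itself so it can be unit-tested without weasyprint installed.

```python
import base64
from urllib.parse import unquote_to_bytes, urlsplit


class DisallowedUrlError(ValueError):
    """Raised for any URL the PDF renderer must not dereference."""


def data_only_url_fetcher(url, timeout=10, ssl_context=None):
    """Resolve data: URIs in-process; refuse every other scheme."""
    url = url.strip()
    scheme = urlsplit(url).scheme.lower()
    if scheme != "data":
        raise DisallowedUrlError(
            f"refusing to fetch {scheme or 'relative'} URL during PDF render"
        )
    # data:[<mediatype>][;base64],<payload>
    header, sep, payload = url[len("data:"):].partition(",")
    if not sep:
        raise DisallowedUrlError("malformed data: URI (no comma)")
    params = header.split(";")
    mime_type = params[0] or "text/plain"
    raw = unquote_to_bytes(payload)
    if any(p.strip().lower() == "base64" for p in params[1:]):
        raw = base64.b64decode(raw)
    return {"string": raw, "mime_type": mime_type}
```

In to_pdf_raw this would presumably be passed as weasyprint.HTML(..., url_fetcher=data_only_url_fetcher). Failing closed on relative URLs is a deliberate choice here: the exported HTML is expected to contain only inlined images.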
P2 — Unify PDF-page rasterization + fallback. Collapse the duplicated “cannot display” logic: have ToBasic.handle_dataset_as_pdf delegate page-parse + render entirely to the datatype; datatype owns the single fallback string (consistent newlines). Also fix handle_dataset_as_image’s raw-bytes fallback (:927) to emit a *cannot display* note for non-raster bytes instead of a mislabeled PNG.
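A sketch of the fallback fix, assuming magic-byte sniffing is an acceptable way to decide whether raw bytes are a raster image. All three helper names are hypothetical, and the non-PDF wording of the note ("cannot display image for ...") is an assumption modeled on the existing PDF-page message.

```python
import base64
from typing import Optional

# Leading bytes of the raster formats a browser/WeasyPrint <img> can show.
_RASTER_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_raster_subtype(data: bytes) -> Optional[str]:
    """Return the image/* subtype for known raster bytes, else None."""
    for magic, subtype in _RASTER_SIGNATURES:
        if data.startswith(magic):
            return subtype
    return None


def cannot_display(name: str, page: Optional[int] = None) -> str:
    """Single owner of the fallback string, always with trailing newlines."""
    what = f"PDF page {page}" if page is not None else "image"
    return f"*cannot display {what} for {name}*\n\n"


def embed_or_note(name: str, data: bytes) -> str:
    """Embed real raster bytes; emit a note instead of a mislabeled PNG."""
    subtype = sniff_raster_subtype(data)
    if subtype is None:
        return cannot_display(name)
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{name}](data:image/{subtype};base64,{encoded})\n\n"
```

With this shape, raw %PDF bytes reaching the fallback produce a readable note rather than a broken <img>, and both layers share one message format.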
P3 — Centralize image-embed + one size/clamp policy. Three base64-embed sites (_embed_image :951, Image.handle_dataset_as_image :156, Pdf.render_pdf_page_as_image_markdown :562). Consolidate data-URI construction into one helper (extend galaxy.util.image_util or a small images.py helper). Make MAX_RENDER_PX/dpi config-overridable via the existing config.<...> pattern.
P4 — Page-range / size arg + multi-page. history_dataset_as_pdf already takes page; consider pages="1-3" / size= (mirror workflow_image’s size, VALID_ARGUMENTS :70), wired via the existing PAGE_PATTERN (:81). Lower priority; product-driven.
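If a pages= argument is pursued, the parsing side is small. The following is a hypothetical sketch (parse_pages and its max_pages cap are not in the branch, and it does not use the existing PAGE_PATTERN); it turns a spec such as "1-3,5" into 0-based page indices.

```python
from typing import List


def parse_pages(spec: str, page_count: int, max_pages: int = 20) -> List[int]:
    """Turn '1-3,5' into sorted, de-duplicated 0-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)  # raises ValueError on junk such as 'a' or '-3'
        end = int(last) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp the upper bound to the document so huge ranges stay cheap.
        for page in range(start, min(end, page_count) + 1):
            indices.add(page - 1)
    result = sorted(indices)
    if len(result) > max_pages:
        raise ValueError(f"too many pages requested ({len(result)} > {max_pages})")
    return result
```

Unlike the current single-page behavior, which clamps an out-of-range page onto a valid one, this sketch drops pages past the end of the document; which of the two is preferable is a product decision.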
P5 — Move sync render paths to short-term-storage/async. Reuse the existing GeneratePdfDownload + prepare_pdf_download Celery infra for the invocation report too (api/workflows.py:1610) instead of inline render. Reuses an existing abstraction.
P6 — Make history_dataset_as_pdf consistent live vs baked. If the Vue client doesn’t render it live, add it to the client renderer or document export-only; at minimum consider EMBED_CAPABLE_DIRECTIVES if inline ${galaxy history_dataset_as_pdf(...)} is intended. (Confirm client first.)
P7 — Admin visibility for missing optional deps. Add a pdf_rasterization_available capability flag alongside markdown_to_pdf_available (managers/configuration.py:168), backed by a new images.pymupdf_available() helper mirroring weasyprint_available(). Reuses the capability-flag mechanism.
P8 — Finish handle_dataset_as_table advanced options (markdown_util.py:955, standing TODO) so PDF and web converge — only if in scope.
(All new helpers reuse existing modules — markdown_util, images, image_util, short_term_storage, capability flags; imports at top of file.)
(D) Detailed test plan
D.1 Unit — rasterization (test/unit/data/datatypes/test_images.py)
Extend existing file (has the Pdf/pymupdf skipif pattern). All @pytest.mark.skipif(images_module.pymupdf is None), reuse get_dataset("454Score.pdf") + MockDatasetDataset.
- test_render_pdf_page_as_image_markdown_page_clamped_high: page=9999 → still valid PNG data URI (clamped); decode base64, assert PNG magic.
- ..._page_clamped_low: page=0/negative → page 1.
- test_page_as_png_dpi_clamp: open produced PNG, assert longest side ≤ MAX_RENDER_PX.
- test_page_as_png_missing_pymupdf: monkeypatch.setattr(images_module, "pymupdf", None); assert _page_as_png returns None and render_pdf_page_as_image_markdown returns the unified *cannot display* (red-to-green for P2).
- test_page_as_png_corrupt_pdf: non-PDF temp file → None.
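The "decode base64, assert PNG magic" and "longest side" checks above need no imaging library, since a PNG stores its dimensions at a fixed offset in the IHDR chunk. A minimal sketch of two test helpers (png_from_markdown and png_size are hypothetical names, not existing fixtures):

```python
import base64
import re
import struct
from typing import Tuple

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
_DATA_URI = re.compile(r"data:image/png;base64,([A-Za-z0-9+/=]+)")


def png_from_markdown(markdown: str) -> bytes:
    """Extract and decode the first PNG data URI found in rendered markdown."""
    match = _DATA_URI.search(markdown)
    assert match, "no PNG data URI in markdown"
    data = base64.b64decode(match.group(1))
    assert data.startswith(PNG_MAGIC), "payload is not a PNG"
    return data


def png_size(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from IHDR: 8-byte signature, 4-byte length,
    4-byte chunk type, then two big-endian 32-bit integers."""
    assert data[12:16] == b"IHDR", "first chunk is not IHDR"
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

A DPI-clamp test could then assert max(png_size(png_from_markdown(result))) <= MAX_RENDER_PX on the handler's output.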
D.2 Unit — directive parsing & to-basic handlers (test/unit/app/managers/test_markdown_export.py)
Reuse BaseExportTestCase/TestToBasicMarkdown (mocked managers, _new_hda, _expect_get_hda). hda.datatype derives from extension — set hda.extension="pdf", point hda.dataset.get_file_name at get_test_fname("454Score.pdf").
- test_history_dataset_as_pdf_default_page: assert result contains data:image/png;base64, (skipif no pymupdf).
- test_history_dataset_as_pdf_explicit_page: page=2; assert PNG embed or *cannot display* (pick existing page / assert non-empty no crash).
- test_history_dataset_as_pdf_no_pymupdf_fallback: monkeypatch pymupdf None; assert unified *cannot display PDF page (red-to-green P2).
- test_history_dataset_as_image_pdf_uses_rasterizer: extension="pdf"; assert PNG data URI (covers delegation :921).
- Parse-level: extend test/unit/app/test_markdown_validate.py so that history_dataset_as_pdf(page=2) validates and a bogus foo=1 raises.
D.3 Unit — to_html / to_pdf_raw + SSRF
- test_to_html_allows_data_urls: data:image/png <img> survives sanitization.
- test_to_pdf_raw_url_fetcher_blocks_file_and_http: red-to-green P1: markdown referencing file:///etc/passwd + http://169.254.169.254/...; call the custom url_fetcher directly, assert refusal (no live render needed).
- test_to_pdf_raw_smoke: skipif not weasyprint_available(); render # Hi, assert bytes[:4] == b"%PDF".
D.4 Integration / API — assert on produced PDFs
Reuse existing fixtures/helpers (don’t invent):
- lib/galaxy_test/api/test_pages.py (extend test_pdf_when_service_available), gated on configuration["markdown_to_pdf_available"] (pattern at :507-534).
- test/integration/test_workflow_tasks.py::test_workflow_invocation_pdf_report + WorkflowPopulator.workflow_report_pdf (populators.py:2872).
- DatasetPopulator to upload 454Score.pdf → HDA, build a page referencing it via history_dataset_as_pdf(history_dataset_id=<id>).
New/strengthened:
- test_pdf_export_embeds_referenced_pdf_figure (api/test_pages.py): page with history_dataset_as_pdf → GET pages/{id}.pdf → assert on content (D.5). Gate on markdown_to_pdf_available + new pdf_rasterization_available (P7).
- Strengthen test_workflow_invocation_pdf_report to assert the PDF parses, ≥1 page, expected text present (currently headers-only).
D.5 How to assert on a generated PDF (no pixel diffs)
Use PyMuPDF (now hard dep) on returned bytes:
import pymupdf
doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
assert doc.page_count >= 1
text = "\n".join(p.get_text() for p in doc)
assert "Expected Heading" in text # text made it in
images = [img for p in range(doc.page_count) for img in doc.load_page(p).get_images()]
assert len(images) >= 1 # a figure was embedded
Assert on page count, expected text (headings, dataset name, prologue), and image count > 0. Robust, not brittle. No pixel comparison.
D.6 Optional-dependency handling
- Unit rasterization → skipif pymupdf is None (now effectively always runs in CI, safe in minimal envs).
- WeasyPrint render → skipif not weasyprint_available() (skips locally unless the env flag is set).
- API/integration PDF → keep the capability-flag gate (configuration["markdown_to_pdf_available"]) → no-ops where weasyprint absent.
- CI: add a dedicated PDF integration target/marker (reuse Galaxy’s test-selection markers) on a runner that installs weasyprint (GALAXY_DEPENDENCIES_INSTALL_WEASYPRINT=1 + system cairo/Pango). PyMuPDF-only unit tests run everywhere.
D.7 Fixtures to reuse (present — don’t create new)
test-data/454Score.pdf + lib/galaxy/datatypes/test/454Score.pdf (get_test_fname); MockDatasetDataset, get_dataset (test/unit/data/datatypes/util.py); BaseExportTestCase, MockTrans (galaxy_mock, version_major 19.09), _new_hda, _expect_get_hda; DatasetPopulator, WorkflowPopulator.workflow_report_pdf, run_workflow; capability-flag pattern in api/test_pages.py + managers/configuration.py.
(E) Open questions
- Does the Vue client render history_dataset_as_pdf (and as_image for PDFs) live, or is rasterization export-only? Drives P2/P6 severity. (Needs client check.)
- Move page show_pdf + invocation report PDF to async short-term-storage (P5), or keep sync? Async changes the /report.pdf contract.
- Is WeasyPrint outbound fetching an accepted risk or in scope to lock down now (P1)? Are pages/reports ever rendered server-side for other users’ content (sharing)? If so, that raises SSRF severity.
- Multi-page / page-range for history_dataset_as_pdf: in scope or defer (P4)?
- Should MAX_RENDER_PX/dpi become admin config + a hard per-document image-byte cap (P3/P5)?
- Public HTML export endpoint for pages, or PDF-only? (to_html/to_basic_markdown exist independently of weasyprint; this ties into the embedding plan’s Option 2.)
- Naming/overlap: history_dataset_as_pdf overlaps semantically with history_dataset_as_image (which also rasterizes PDF page 1). Two directives doing nearly the same thing: confirm both are wanted, or fold one into the other.