PDF_IMAGES

Displaying PDF figures in notebooks / reports

Status: design open. Two prototypes landed on the history_pages branch; neither is the answer we want to ship.

Context: notebook/report pages reference on-graph outputs via galaxy directives. Many real figures are PDF, not raster, so they could not be displayed inline.


Problem

A large fraction of Galaxy’s plotting tools emit PDF, often multi-page (R-based tools especially):

The notebook/report image directive (history_dataset_as_image) only knew how to embed rasters, so PDF figures forced a re-upload-as-PNG workaround: screenshot or convert the figure outside Galaxy, upload it as a new dataset, and reference that instead. This breaks provenance (the PNG is not the tool’s real output) and breaks workflow extraction (the uploaded PNG is an orphan input, not a graph output).

We want: reference the real PDF output of a real tool step, show a chosen page as a flush figure, and have that reference survive extraction into a workflow.


What a good solution needs (from the use cases)

  1. Page selection. Multi-page PDFs are the common case; the directive must be able to show a specific 1-based page (e.g. page 1 of a DESeq2 output).
  2. Figure framing. A single page should read as a flush figure — no PDF-viewer chrome (toolbar, page nav, scrollbars).
  3. Live == baked. The live editor preview and the server-baked/exported report must render the same pixels. Divergence here is the main thing that made the prototypes feel wrong.
  4. Seeds extraction. Referencing a PDF output must record the HDA (history dataset association) so page→workflow extraction still seeds the producing step and exposes the output.
  5. Graceful absence. If the renderer/dependency is missing, degrade to a clear message, not a crash.
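
Requirements 1 and 5 can be sketched together. The following is a minimal sketch, not the code from either prototype. It assumes PyMuPDF (imported as fitz) is the rasterizer; render_pdf_page and PageRender are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

try:
    import fitz  # PyMuPDF; assumed to be an optional dependency
except ImportError:
    fitz = None


@dataclass
class PageRender:
    """Either PNG bytes or a human-readable message, never both."""
    png: Optional[bytes]
    message: Optional[str]


def render_pdf_page(pdf_bytes: bytes, page: int = 1, dpi: int = 150) -> PageRender:
    """Rasterize one 1-based page of a PDF, degrading to a message on any failure."""
    if page < 1:
        return PageRender(None, f"Page numbers are 1-based; got {page}.")
    if fitz is None:
        return PageRender(None, "PDF rendering is unavailable (PyMuPDF is not installed).")
    try:
        with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
            if page > doc.page_count:
                return PageRender(
                    None,
                    f"Page {page} requested, but the PDF has {doc.page_count} page(s).",
                )
            # PyMuPDF page indices are 0-based, the directive's are 1-based.
            pixmap = doc.load_page(page - 1).get_pixmap(dpi=dpi)
            return PageRender(pixmap.tobytes("png"), None)
    except Exception as exc:  # broken or empty PDF: report, do not crash
        return PageRender(None, f"Could not render PDF: {exc}")
```

A caller would embed result.png when it is set and otherwise show result.message in place of the figure. Requirement 3 would additionally require the live preview and the baked export to both go through this one function.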

What we prototyped (two approaches, both on history_pages)

APPROACH_A — overload history_dataset_as_image for PDFs

Commit 58a7351 “Render PDF datasets as images (rasterize first page)”.

Why unsatisfying: no page control (always page 1). Live ≠ baked: live shows the whole PDF in the browser’s PDF viewer with chrome; baked shows only page 1 as a PNG. Conflates “this image is a PDF” with “show one page as a figure.”

APPROACH_B — dedicated history_dataset_as_pdf directive with page

Commit b982369 “Add history_dataset_as_pdf notebook directive with page control”.
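
To make the page argument and the HDA bookkeeping concrete, here is a rough sketch of parsing such a directive. The argument syntax (history_dataset_id=..., page=...) is an assumption modeled on existing Galaxy directives, not taken from the commit, and parse_directives and referenced_dataset_ids are hypothetical helpers.

```python
import re
from typing import Dict, List

# Matches the two directives discussed in this note; argument syntax is assumed.
DIRECTIVE_RE = re.compile(
    r"(?P<name>history_dataset_as_(?:pdf|image))\((?P<args>[^)]*)\)"
)


def parse_directives(markdown: str) -> List[Dict[str, object]]:
    """Return one record per directive, with a validated 1-based page."""
    found = []
    for match in DIRECTIVE_RE.finditer(markdown):
        args = {}
        for part in match.group("args").split(","):
            if "=" in part:
                key, value = part.split("=", 1)
                args[key.strip()] = value.strip().strip("\"'")
        page = int(args.get("page", 1))  # default to the first page
        if page < 1:
            raise ValueError(f"page must be >= 1, got {page}")
        found.append(
            {
                "directive": match.group("name"),
                "history_dataset_id": args.get("history_dataset_id"),
                "page": page,
            }
        )
    return found


def referenced_dataset_ids(markdown: str) -> List[str]:
    """Dataset ids a page refers to; these are what extraction would seed from."""
    return [
        d["history_dataset_id"]
        for d in parse_directives(markdown)
        if d["history_dataset_id"]
    ]
```

Collecting the ids in the same pass as the page argument is one way to keep "which page is shown" and "which HDA is recorded" from drifting apart.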

Why unsatisfying:


Cross-cutting tension

Is a PDF output a document (browse all pages) or a figure (show one page)? The prototypes try to be both and end up with two code paths whose only hard requirement — live preview matches the exported artifact — is the one they don’t satisfy, because one path is a browser embed and the other is a server raster.

Two ways to collapse that:


Alternative: tool-based PDF→image extraction (no core renderer)

Instead of rasterizing at render time, convert the PDF to an image as a workflow step. The figure becomes a real on-graph dataset, referenced with the existing history_dataset_as_image — no new directive, no pymupdf in core, live==baked trivially (it’s just a PNG), and extraction is automatic because it’s a real step.
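
For orientation, the conversion such a step performs can be sketched at the command-line level. This assumes the GraphicsMagick gm binary is on the PATH with Ghostscript available for PDF decoding. The parameters the Galaxy tool wrapper actually exposes are not covered here, and gm_pdf_page_argv is a hypothetical helper.

```python
from typing import List


def gm_pdf_page_argv(
    pdf_path: str, png_path: str, page: int = 1, density: int = 150
) -> List[str]:
    """Build a `gm convert` command that rasterizes one 1-based PDF page to PNG."""
    if page < 1:
        raise ValueError(f"page must be >= 1, got {page}")
    return [
        "gm",
        "convert",
        "-density", str(density),      # render resolution in DPI, set before reading
        f"{pdf_path}[{page - 1}]",     # GraphicsMagick frame indices are 0-based
        png_path,
    ]
```

The resulting list can be passed to subprocess.run(argv, check=True). Passing an argument list rather than a shell string avoids having to quote the bracketed page selector.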

Existing tool — yes, one fits directly:

graphicsmagick_image_convert: owner bgruening, main ToolShed (toolshed.g2.bx.psu.edu/repos/bgruening/graphicsmagick_image_convert), id graphicsmagick_image_convert, GraphicsMagick 1.3.46. (Source: bgruening/galaxytools, tools/image_processing/graphicsmagick/convert.xml.)

Other tools seen, not a fit: graphicsmagick_image_montage (combines images), xy_plot_multiformat (generates plots, doesn’t convert), bio-image imgteam tools (Bio-Formats — microscopy, not PDF). No standalone ImageMagick convert on the ToolShed; GraphicsMagick is the maintained equivalent.

Trade-offs of the tool approach:


Open questions