Notebook Extraction — Carrying the Markdown into the Workflow Report
STATUS: IMPLEMENTED (branch
extract_next). All of §1–§5 landed; tests green. Debrief:EXTRACT_NOTEBOOK_REPORT_DEBRIEF.md. Where the implementation diverged from the plan, the plan sections below are annotated inline with DONE / NOTE.
- §1 label index →
workflow/extract.py:ExtractionLabelIndex,extract_steps_by_ids_with_index,extract_workflow_by_ids_with_index;_WorkItem.icj_id.- §1 reconcile + §2 rewrite orchestration → new
managers/workflow_extraction_report.py(build_report_for_page,reconcile_report_labels).- §2 rewriter →
managers/markdown_util.py:_ReportLabelRewriter+rewrite_page_markdown_to_workflow_report.- §3 wiring →
schema/workflows.py(from_page_id,report_title,report_warnings),services/workflows.py::extract_by_ids(+_load_report_pagegating).- §4 frontend →
WorkflowExtractionForm.vue(+schema.tshand-edited, +15 lines).- §5 tests →
test/unit/workflows/test_extract_report.py(17),WorkflowExtractionForm.test.ts(+2),test_workflow_extraction.py::TestNotebookWorkflowExtractionReport(3 live API).- NOTE (atomicity, decided leave-as-is): report is built after the workflow is finalized/committed (reuses
extract_workflow_by_ids_with_index). A rewrite failure would leave a report-less workflow — accepted: reconcile guarantees resolution andvalidate_galaxy_markdownis a defensive backstop.- Selenium (§5 last row) NOT yet done.
Where we are
The prior PR (EXTRACT_NOTEBOOK_PR.md) made Extract Workflow from a notebook
seed the form: it walks the page’s referenced outputs/jobs back through provenance
and pre-checks the producing subgraph (seeded), pre-stars the displayed outputs
(exposed), and flags job-referenced uploads (seed_warning). It extracts the right
inputs / outputs / steps — but throws the notebook’s prose and directives away.
Commit 30359cc (step labels) added the missing piece: extracted tool steps can now
carry stable labels (step_labels on WorkflowExtractionByIdsPayload), alongside the
input labels (dataset_names) and output labels (output_labels) that already existed.
This plan: when extracting from a notebook, also carry the page’s markdown into the
new workflow as its report (workflow.reports_config["markdown"]), rewriting every
internal-ID directive into a portable workflow-relative label directive.
page markdown (internal ids) workflow report markdown (labels)
------------------------------ ----------------------------------
history_dataset_display( history_dataset_display(
history_dataset_id=1234) --> output="aligned_reads")
job_metrics(job_id=987) --> job_metrics(step="bwa_mem")
history_dataset_collection_display( history_dataset_collection_display(
history_dataset_collection_id=55)--> input="samples")
The labels on the right are exactly the ones extraction already assigns to inputs
(step.label), workflow outputs (workflow_output.label), and tool steps
(step.label). The rewrite is the inverse of resolve_invocation_markdown()
(markdown_util.py:1271), which turns output=/input=/step= back into instance ids
at invocation time. We’re moving the page from the internal representation to the
workflow-relative representation — the same direction the workflow-report authoring UI
produces by hand.
The shape of the change
-
from_page_idon the extract-by-ids payload. When present, the service fetches the page’s internal markdown, rewrites it against the labels the extraction just assigned, validates it, and stores it as the workflow’s report. -
A label index out of
extract_steps_by_ids. Extraction already builds the content-key →(step, output_name)map it needs to wire connections. Return it (plus job-id/icj-id → step) so the rewriter can resolve a referenced id to its label without re-deriving anything. -
A directive-rewrite handler in
markdown_util.py, a sibling of_ReferencedContentCollector, that reuses the sameGalaxyInternalMarkdownDirectiveHandlertaxonomy: eachhandle_*resolves the directive’s object → normalizes → looks up its label → returns the rewritten directive line. Unresolvable references are dropped with a warning rather than leaking an internal id into a portable document.
Nothing new is invented for resolution, normalization, or the directive walk — each already exists and is reused.
1. Backend: the label index
extract_steps_by_ids (workflow/extract.py:641) already maintains
id_to_output_pair: dict[("dataset"|"collection", original_id) -> (WorkflowStep, output_name)]
— the connection-wiring map. It also applies labels: input step.label (from
dataset_names), step.label for tool steps (from step_labels), and
create_or_update_workflow_output(label=...) for exposed outputs (from output_labels).
Add a typed return so the constructed WorkflowStep objects are the single source of
truth for labels (avoid rebuilding a parallel label map from raw payload hints, which
would re-derive the input-name dedup and drift):
@dataclass(frozen=True)
class ExtractionLabelIndex:
content_to_step: dict[OutputLabelKey, tuple[WorkflowStep, str]] # reuse id_to_output_pair
job_to_step: dict[int, WorkflowStep]
icj_to_step: dict[int, WorkflowStep]
content_to_stepisid_to_output_pair(already keyed by original id) — just returned.job_to_step/icj_to_step: populated in the work-item loop. Today_WorkItemdrops the ICJ id after the sort; thread it through (addicj_id: Optional[int]) so animplicit_collection_jobs_iddirective resolves to the same step a constituentjob_iddirective would.
Keep extract_steps_by_ids returning just steps for existing callers; add
extract_steps_by_ids_with_labels(...) -> (steps, ExtractionLabelIndex) (or a keyword
flag) used by the report path. extract_workflow_by_ids gains an internal hand-off of the
index when a report is requested.
Resolving a referenced object to a label
Given a resolved object from the directive walk:
| Directive object | Normalize | Look up | Rewrite to |
|---|---|---|---|
| HDA | ("hda", _original_hda(hda).id) | content_to_step | input step → input="{step.label}"; tool output → workflow-output label for output_name → output="{label}" |
| HDCA | ("hdca", _original_hdca(hdca).id) | content_to_step | same input/output split |
| Job | fold to ICJ via implicit_collection_jobs_association, else plain | icj_to_step / job_to_step | step="{step.label}" |
The HDA/HDCA normalization mirrors normalize_output_label_key (extract.py:586); the
job→ICJ fold mirrors _ReferencedContentCollector._record_job (markdown_util.py:782).
Whether content is an input or an output ref is read off step.type
(data_input/data_collection_input → input; tool → output).
Auto-label reconcile (decided: Q1 = auto-label)
The notebook’s seeding closure guarantees every referenced item is in the extracted subgraph (a referenced output is produced by a seeded step; a referenced input is a seeded input row). So an item can be in the subgraph but unlabeled — a displayed tool output the user never starred, or a step the user never named. Rather than drop the notebook’s own content, the report path auto-labels so the rewrite always resolves.
After steps are finalized, run a reconcile pass driven by the page’s referenced ids
(referenced_content_ids, the same call summary_from_page uses):
- referenced tool output (HDA/HDCA) with no
workflow_outputlabel → callcreate_or_update_workflow_output(output_name, label=<generated>)on its step, exposing it as a workflow output. (This does add a workflow output the user didn’t explicitly star — that is the accepted cost of a complete report; it only ever fires for content the notebook itself displays.) - referenced input with
step.label is None(lost to the input-name dedup) → assign a generated label. - referenced step (
job_id/icj) withstep.label is None→ assign a generated label.
Label generation reuses the existing suggested_name/suggested_name_source the
summary already computes (renamed → rendered-label → bare-label → port-name), falling back
to tool name; deduped against the shared step-label/output-label namespace with a numeric
suffix, and run through the same raw-reject validation as user labels
(_validate_extraction_labels). A truly unresolvable reference (outside the subgraph —
should not occur given seeding) is the only remaining drop-and-warn case, kept as a
defensive guard so a bare history_dataset_id= can never reach a portable report.
2. Backend: the rewrite handler (markdown_util.py)
A new _ReportLabelRewriter(GalaxyInternalMarkdownDirectiveHandler), built exactly like
_ReferencedContentCollector (the directive-walk dispatch already resolves and
access-checks the HDA/HDCA/job before calling handle_*, and uses each handler’s return
value as the replacement line — see _walk_directives / _remap, lines 262-409):
class _ReportLabelRewriter(GalaxyInternalMarkdownDirectiveHandler):
def __init__(self, label_index, security):
self.index = label_index
self.warnings: list[str] = []
def _dataset_ref(self, container, hda):
# After the auto-label reconcile (§1) every referenced item is labeled,
# so ref is None only for a defensive out-of-subgraph case -> drop + warn.
ref = self.index.content_ref(("hda", _original_hda(hda).id)) # -> ("input"|"output", label) | None
if ref is None:
self.warnings.append(...); return ("", False) # defensive drop
kind, label = ref
return (f'{container}({kind}="{label}")\n', False)
handle_dataset_display = handle_dataset_as_table = ... = _dataset_ref-bound
def handle_dataset_collection_display(self, line, hdca): ...
def _job_ref(self, container, job): ... # -> step="label"
handle_tool_stdout = handle_tool_stderr = handle_job_metrics = handle_job_parameters = _job_ref-bound
# every other handler: passthrough (return the original line)
Public entry:
def rewrite_page_markdown_to_workflow_report(trans, internal_markdown, label_index)
-> tuple[str, list[str]]:
rewriter = _ReportLabelRewriter(label_index, trans.security)
markdown = rewriter._walk_directives(trans, internal_markdown) # block-directive pass
return markdown, rewriter.warnings
Mirror the collector’s scope: block (fenced) directives only for V1 — _walk_directives
is the block pass; inline ${galaxy ...} directives are out of scope here exactly as they
are for the seeding collector (open question Q3).
Non-content prose (headings, text, vega, etc.) passes through untouched: the rewrite only touches the four ID-bearing directive families.
Validate the result with validate_galaxy_markdown(markdown, internal=False) before
storing, so a malformed rewrite fails loudly at extraction rather than at report render.
3. Backend: wiring it into extraction
schema/workflows.py — WorkflowExtractionByIdsPayload:
from_page_id: Optional[DecodedDatabaseIdField] = None # build a report from this notebook page
report_title: Optional[str] = None # defaults to workflow_name
services/workflows.py::extract_by_ids:
- existing
_validate_extract_by_ids_payload+extract_workflow_by_ids(now returning the label index whenfrom_page_idis set). - if
from_page_id:- load the page; gate like the summary endpoint —
400if the page has no history,403if the user can’t access it (reusesummary_from_page’s access checks /PagesService). content = page.latest_revision.content(same fetch assummary_from_page,workflow_extraction_summary.py:486).markdown, warnings = rewrite_page_markdown_to_workflow_report(trans, content, index).stored.latest_workflow.reports_config = {"markdown": markdown, "title": report_title or workflow_name}(same dict shapegenerators/markdown.py:34-38reads; same column set on import atmanagers/workflows.py:863).- return warnings in the result (extend
WorkflowExtractionResultwithreport_warnings).
- load the page; gate like the summary endpoint —
The report is generated after steps are finalized so it reads the labels actually assigned (post-dedup, post-collision-raise).
4. Frontend
Minimal — the form already gathers dataset_names / output_labels / step_labels and
knows fromPageId (WorkflowExtractionForm.vue:49):
- Pass
from_page_id: fromPageId(and optionalreport_title) in the extract-by-ids payload when extracting from a page. - Surface
report_warningsfrom the result (defensive/out-of-subgraph drops only, now rare given auto-label) as a toast / inline list. - Auto-labeling lives in the backend, so the form needs no label-completeness guarantee.
Still nice-to-have: pre-fill output/step labels for
exposed/seededrows fromsuggested_nameso the user-visible workflow output labels match what the report will generate (otherwise the backend silently names auto-exposed outputs the user never saw).
schema.ts regen picks up the two new payload fields and report_warnings.
5. Testing strategy
Same layered discipline as the prior PR — test each concern at the cheapest faithful layer.
| Layer | Concern | Tests |
|---|---|---|
| Rewriter (unit) | directive → label substitution, normalization, job→ICJ fold, defensive drop | TestReportLabelRewriter: HDA→output=, HDCA→input=, job_metrics→step=, implicit_collection_jobs_id→step=, copied HDA normalized to original’s label, out-of-subgraph ref → dropped + warning, prose/vega passthrough, multi-directive page |
| Auto-label reconcile (unit) | referenced-but-unlabeled item gets exposed/labeled; generated labels unique vs namespace | displayed-but-unstarred output → auto-exposed with suggested_name; unnamed referenced step → auto-labeled; collision with an existing label → suffixed |
| Label index (unit) | extraction returns correct content/job/icj → step+label mapping | extend extract unit tests: input label, exposed-output label, step label, ICJ step label all reachable by original id |
| End-to-end report (API) | real server: extract-from-page produces a report whose directives resolve | extend TestNotebookWorkflowExtractionSummary → TestNotebookWorkflowExtractionReport: extract, read back reports_config.markdown, assert output=/input=/step= present and no history_dataset_id=; then run the workflow and render the report (or resolve_invocation_markdown) to prove the labels resolve round-trip; map-over page → step= on the ICJ label; gating 400/403 |
| Payload wiring (vitest) | form sends from_page_id; surfaces report_warnings | WorkflowExtractionForm from_page block |
| Wiring (Selenium) | notebook → extract → workflow has a report tab/markdown | extend TestNotebookWorkflowExtraction: after extract, open the workflow report and assert the page’s heading + a resolved embed render |
Red-to-green: write the rewriter unit tests against a hand-built label index first; then the API round-trip (extract → render report) which is the real proof the labels resolve.
6. Reuse audit (what we are not building)
- Directive walk / access checks / object resolution — reuse
GalaxyInternalMarkdownDirectiveHandler._walk_directives; the rewriter is a handler subclass exactly like_ReferencedContentCollector. - Label assignment — already done by
extract_steps_by_ids; we only return the map. - Normalization (
_original_hda/_hdca) and job→ICJ fold — reuse from extract.py and the collector. - Report storage + render —
reports_configdict +MarkdownWorkflowMarkdownReportGeneratorPluginalready exist; we only populate the dict. - Label→id resolution at run time —
resolve_invocation_markdownalready does the inverse; our output is precisely its input. (Good regression target: rewrite then resolve should land back on equivalent instance directives.)
Unresolved questions
- Q1 — Unlabeled referenced item — DECIDED: auto-label. Backend reconcile pass (§1)
auto-exposes/auto-labels any referenced-but-unlabeled item from
suggested_nameso the report is always complete. Accepted cost: extraction may add workflow outputs the user didn’t explicitly star (only ever for content the notebook displays). Remaining sub-question: should auto-exposed outputs be visibly reflected back in the form before submit (§4), or is silent backend naming fine for V1? - Q2 — Field name — DECIDED:
from_page_id. Implemented as the payload field; the client query param staysfrom_page(the form maps one to the other). - Q3 — Inline
${galaxy ...}directives — V1: out of scope (block-only). Matches the seeding collector. Still open whether to handle inline later via the embedded-container pass. - Q4 — Title source — V1:
report_titlepayload →workflow_namefallback. Page-title fallback not wired; still open. - Q5 — Separate endpoint vs. payload flag — DONE: payload flag.
extract-by-idsgainedfrom_page_id;_load_report_pageowns the gating (400 no-history / 403 no-history-access). - Q6 — Non-rewritable directives — V1: drop + warn. Id-bearing non-portable directives
(
history_link,workflow_display/image/license,invocation_*) are dropped with areport_warningsentry; id-less directives pass through. Open whether any deserve richer handling. - Sub-question (open): reflect auto-exposed outputs back in the form before submit, or leave silent backend naming for V1 (currently silent).