NOTEBOOK_EXTRACTION_MVP_PLAN

Notebook → List-Extraction MVP — Implementation Plan

Date: 2026-05-23
Branch: off workflow_state_backfill (after HISTORY_GRAPH_UI_INTEGRATION_PLAN and WORKFLOW_EXTRACTION_OUTPUT_LABELING_PLAN land)
Tracking issue: #22709, bullet “Add extraction from notebook API and UI that builds an initial graph…”. This plan is the list-UI MVP slice (graph-UI variant deferred).
Related:

  • WORKFLOW_EXTRACTION_OUTPUT_LABELING_PLAN: DIRECT DEPENDENCY. Ships the naming-chain helper, output_labels payload field, extractor emitting WorkflowOutput rows, and suggested_name on the summary surface. This plan presumes those primitives exist.
  • HISTORY_GRAPH_UI_INTEGRATION_PLAN: backend prep shipping the dataset_element edge walker and /api/tool_executions/{id} surface this plan leans on.
  • MAP_OVER_EMPTY_EXTRACTION_TOOL_REQUEST_PLAN: tool_request_ids selection primitive.
  • QUEUED_EXECUTION_EXTRACTION_TOOL_REQUEST_PLAN: same primitive, queued executions.
  • CAPTURE_WORKFLOW_EXECUTION_STATE_PLAN: tool_execution_state capture path.
  • HISTORY_MARKDOWN_ARCHITECTURE: Page model + notebook directive surface.
  • GRAPH_WORKFLOW_EXTRACTION_PLAN: graph-UI variant the seeding endpoint is forward-compatible with.

At a glance

Problem: The workflow extraction surface is a flat job list. The narrative structure of an analysis (which outputs matter, what they mean) is not captured. Notebooks already contain that narrative.
Key insight: A notebook’s history_dataset_display / history_dataset_collection_display directives are the workflow-output spec. Walk backward from each through the history graph and you have everything the by-ids extraction endpoint (plus WORKFLOW_EXTRACTION_OUTPUT_LABELING_PLAN’s output_labels) needs.
This plan delivers: GET /api/pages/{id}/workflow_extraction_summary, a notebook-seeded WorkflowExtractionResult (same wire shape as the history surface, with per-row seeded: bool added). The existing WorkflowExtractionForm.vue consumes the seeded summary and pre-checks rows. suggested_name is already populated by the upstream plan’s helper; this plan just routes the seed. Submit path unchanged. One toolbar action in the notebook editor wires the entry point.
Reusable: The seeding endpoint is graph-UI-ready (same payload shape both consumers want). The history-graph backward walker is reused, not forked.
This plan does NOT: Build the graph-mode extraction UI (guerler’s territory after rebase). Chase backward across histories (cross-history items become inputs, per #22709 first-pass policy). Touch the legacy HID-based extract_workflow_from_history flow when no pageId is in scope. Re-introduce any primitives owned by WORKFLOW_EXTRACTION_OUTPUT_LABELING_PLAN.
Risk: Low. Most of the load-bearing work is in the upstream plan. The new code here is the notebook scan + backward-closure walk + thin form branch.
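
As a rough illustration of the key insight, the sketch below pulls the referenced dataset and collection ids out of notebook markdown. It is not Galaxy's parser: the plan intends to reuse the existing directive handling in markdown_util.py, whereas this standalone version uses a hypothetical regex (DIRECTIVE_RE) and hypothetical names (OutputRef, referenced_outputs), and it assumes each directive carries its id as a simple keyword argument.

```python
import re
from typing import NamedTuple

# Hypothetical pattern covering only the two directives named in this plan.
# Galaxy's real directive parsing lives in markdown_util.py and is not reproduced here.
DIRECTIVE_RE = re.compile(
    r"\b(history_dataset_display|history_dataset_collection_display)"
    r"\(\s*(history_dataset_id|history_dataset_collection_id)\s*=\s*([^\s,)]+)"
)


class OutputRef(NamedTuple):
    kind: str  # "hda" for datasets, "hdca" for collections
    object_id: str  # kept as a string; ids may be encoded


def referenced_outputs(markdown: str) -> list[OutputRef]:
    """Return directive targets in document order, without duplicates."""
    seen = set()
    refs = []
    for directive, _arg_name, object_id in DIRECTIVE_RE.findall(markdown):
        kind = "hdca" if directive == "history_dataset_collection_display" else "hda"
        ref = OutputRef(kind, object_id)
        if ref not in seen:
            seen.add(ref)
            refs.append(ref)
    return refs
```

Each returned reference is one candidate workflow output, and so one starting point for the backward walk.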

Why this exists

#22709’s vision is “user writes a notebook documenting their analysis → workflow extraction works from that narrative, not from a flat history list.” The graph view + notebook are the new abstractions that make this possible.

The full vision has two UIs: list (this plan) and graph (deferred). The shared piece is the backend seeding: given a notebook, produce a structured-summary payload that says “these are the producer steps, these are the boundary inputs.” Both UIs consume the same payload; they differ only in how they present it for confirmation.
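
The split between producer steps and boundary inputs can be pictured with a small backward-closure walk. This is a sketch over plain dictionaries, not the existing history-graph walker the plan reuses; the names Closure, backward_closure, produced_by, job_inputs, and in_history are all hypothetical. It assumes an item becomes a boundary input when it has no producing job or lives outside the notebook's history, matching the first-pass policy of not chasing across histories.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Closure:
    producer_steps: set = field(default_factory=set)
    boundary_inputs: set = field(default_factory=set)


def backward_closure(outputs, produced_by, job_inputs, in_history):
    """Walk from notebook outputs back to the inputs that bound the workflow.

    produced_by: item id -> job id (absent when the item was uploaded)
    job_inputs:  job id -> list of input item ids
    in_history:  item ids that live in the notebook's history
    """
    closure = Closure()
    queue = deque(outputs)
    visited = set()
    while queue:
        item = queue.popleft()
        if item in visited:
            continue
        visited.add(item)
        job = produced_by.get(item)
        # No producer, or an item from another history: stop here and
        # treat the item as a workflow input.
        if job is None or item not in in_history:
            closure.boundary_inputs.add(item)
            continue
        if job not in closure.producer_steps:
            closure.producer_steps.add(job)
            queue.extend(job_inputs.get(job, []))
    return closure
```

Both the list UI and the graph UI would read the same two sets; only the presentation differs.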

The “outputs are labeled” half of the value lives in WORKFLOW_EXTRACTION_OUTPUT_LABELING_PLAN — a deliberately upstream, independently valuable enhancement to the existing extraction flow. This plan rides on top: a notebook is just one source of which artifacts deserve to be marked workflow outputs.

Shipping the list-MVP after the upstream output-labeling plan does three things for the presentation timeline: (1) the upstream PR lands as a real, demoable ergonomic win on its own; (2) this MVP gets a clean dependency story rather than a sprawling single PR; (3) the graph UI can later reuse both layers unchanged.

Settled decisions

Architecture / seam

Notebook (Page with history_id) — content_editor markdown
   └─ history_dataset_display(history_dataset_id=N), history_dataset_collection_display(history_dataset_collection_id=M), …


WorkflowExtractionSummaryManager.summary_from_page(trans, page)   ← NEW (managers/workflow_extraction_summary.py)
   1. Scan markdown directives via _remap_galaxy_markdown_calls (markdown_util.py:1403) → set of (HDA|HDCA) ids referenced.
   2. For each id, backward-closure via history_graph.HistoryGraphBuilder.build() (existing).
   3. Bucket reachable producers by preference (tool_request → ICJ → job) → seeded extraction payload bundle.
   4. Build WorkflowExtractionResult by *delegating* to create_workflow_extraction_summary (services/histories.py:824),
      then flagging rows where the producer is in the seeded bucket with seeded=True.
   5. suggested_name on each tool row is already populated by WORKFLOW_EXTRACTION_OUTPUT_LABELING_PLAN step 4.


GET /api/pages/{id}/workflow_extraction_summary  → WorkflowExtractionResult


WorkflowExtractionForm.vue (pageId query param branch)
   - Pre-checks seeded rows (using the new seeded flag).
   - suggested_name on tool rows already drives the output-rename pre-fill (from upstream plan).
   - Input rows pre-named via the existing dataset_names path with sensible defaults from the source HDA/HDCA name.
   - Submits via existing extractWorkflowByIds — output_labels plumbing already exists from upstream plan.


POST /api/workflows/extract  (existing, extended by upstream plan)
   - WorkflowExtractionByIdsPayload.output_labels honored, WorkflowOutput rows emitted.

The seeding endpoint is consumed by list-UI today; graph-UI hydrates from the same payload tomorrow.
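
Steps 3 and 4 of summary_from_page can be sketched as follows, under stated assumptions. The Producer record, the bucket keys icj_ids and job_ids, and the dict-shaped summary rows are hypothetical stand-ins; only tool_request_ids and the seeded flag are names taken from this plan and its related plans. The sketch puts each producer in the most preferred bucket it qualifies for (tool_request, then ICJ, then job) and then marks summary rows whose producer landed in any bucket.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Producer:
    job_id: str
    icj_id: Optional[str] = None  # implicit collection jobs group, if any
    tool_request_id: Optional[str] = None


def bucket_producers(producers):
    """Place each producer in the most preferred bucket it qualifies for."""
    buckets = {"tool_request_ids": set(), "icj_ids": set(), "job_ids": set()}
    for p in producers:
        if p.tool_request_id is not None:
            buckets["tool_request_ids"].add(p.tool_request_id)
        elif p.icj_id is not None:
            buckets["icj_ids"].add(p.icj_id)
        else:
            buckets["job_ids"].add(p.job_id)
    return buckets


def flag_seeded(rows, buckets):
    """Return copies of summary rows with seeded=True where the producer was reached."""
    flagged = []
    for row in rows:
        seeded = (
            row.get("tool_request_id") in buckets["tool_request_ids"]
            or row.get("icj_id") in buckets["icj_ids"]
            or row.get("job_id") in buckets["job_ids"]
        )
        # Copy the row so the delegated summary stays untouched.
        flagged.append({**row, "seeded": seeded})
    return flagged
```

Because rows outside the closure keep seeded=False, the form's pageId branch can pre-check exactly the seeded rows and leave the rest of the summary as the history surface would show it.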

Steps

Precondition: WORKFLOW_EXTRACTION_OUTPUT_LABELING_PLAN must be merged. This plan presumes suggested_name is already on WorkflowExtractionJob, output_labels is on the payload, the extractor emits WorkflowOutput rows, and suggested_output_name is callable.

1. Notebook scan

2. Backward walk + producer bucketing

3. summary_from_page + endpoint

4. Form pageId branch

5. Notebook entry point

6. Verify

Files to touch

File | Step | Scope
--- | --- | ---
lib/galaxy/managers/workflow_extraction_summary.py | 1, 2, 3 | new: notebook scan + backward walk + summary build
lib/galaxy/managers/markdown_util.py | 1 | possible new visitor helper exposing referenced-ids extraction (if a clean entry point doesn’t already exist)
lib/galaxy/schema/workflows.py | 3 | WorkflowExtractionJob += seeded: bool = False
lib/galaxy/webapps/galaxy/api/pages.py | 3 | GET /api/pages/{id}/workflow_extraction_summary
lib/galaxy/webapps/galaxy/services/pages.py | 3 | workflow_extraction_summary service method
client/src/api/pages.ts | 4 | fetchWorkflowExtractionSummary(pageId)
client/src/components/History/WorkflowExtractionForm.vue | 4 | pageId branch + seeded pre-fill
client/src/components/History/WorkflowExtraction/types.ts | 4 | seeded type addition
client/src/components/PageEditor/PageEditorView.vue | 5 | “Extract Workflow” toolbar action (notebook mode only)
client/src/api/schema/schema.ts | 3 | regenerated
test/unit/app/managers/test_workflow_extraction_summary.py | 1, 2, 3 | new
lib/galaxy_test/api/test_pages_history_attached.py (or new) | 3 | endpoint test
client/src/components/History/WorkflowExtractionForm.test.ts | 4 | seeded flow
client/src/components/PageEditor/PageEditorView.test.ts | 5 | toolbar button gating

What this sets up for the graph-UI follow-up (documented, not in this PR)

Out of scope

Unresolved questions