EXTRACT_BY_ID_PLAN

Plan: ID-Based Workflow Extraction (#21722)

Migrate workflow extraction from HID-based to ID-based dataset/collection identification. The change is additive: new params live alongside HID params, and the HID path stays for back-compat. Cross-history extraction is enabled from day one (permission-checked).

Current State on dev (verified 2026-04-28)

Goals

Endpoint Decision (resolved)

New endpoint: POST /api/workflows/extract (plural, matches rest of workflow API).

History-optional, ID-based payload only. Existing endpoints stay untouched for back-compat:

Vue UI: introduce a new client helper (e.g. extractWorkflowByIds) calling the new endpoint; replace the existing extractWorkflowFromHistory submit path.

Files Touched

File | Change
lib/galaxy/workflow/extract.py | Add BaseWorkflowSummary (pulls __original_hda / __original_hdca / __check_state / warnings out of existing WorkflowSummary); add WorkflowSummaryByIds, extract_workflow_by_ids, extract_steps_by_ids, step_inputs_by_id. Existing WorkflowSummary becomes a BaseWorkflowSummary subclass with no behavior change
lib/galaxy/schema/workflows.py | New WorkflowExtractionByIdsPayload model (separate from existing WorkflowExtractionPayload)
lib/galaxy/webapps/galaxy/api/workflows.py | New endpoint POST /api/workflows/extract. Body has optional from_history_id, hda_ids, hdca_ids, job_ids. Returns WorkflowExtractionResult. (No HID fields accepted, so mixing is impossible by construction.)
lib/galaxy/webapps/galaxy/services/workflows.py | Add WorkflowsService.extract_by_ids(...). Controller stays thin. (Note: legacy HID controller currently inlines extract_workflow(...) directly; consider following up by moving that into the service too. Out of scope for this PR.)
client/src/api/histories.ts (or new client/src/api/workflows.ts) | Add extractWorkflowByIds(payload)
client/src/components/History/WorkflowExtractionForm.vue | Track encoded IDs in selection; submit hda_ids / hdca_ids; HIDs only for display
client/src/components/History/WorkflowExtractionForm.test.ts | Update assertions (encoded IDs in payload)
lib/galaxy_test/api/test_workflow_extraction.py | New TestWorkflowExtractionByIds class
lib/galaxy_test/unit/workflow/test_extract.py (new) | Unit tests for extract_steps_by_ids
lib/galaxy_test/selenium/workflow_extraction*.py (if present) | Verify Vue UI submits encoded IDs
OpenAPI / generated TS schema | Regenerated artifacts (schema.ts); automatic, but call out for review

API Surface

Schema additions in lib/galaxy/schema/workflows.py

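# Import paths below are assumed for lib/galaxy/schema/workflows.py.
from typing import Optional

from pydantic import model_validator

from galaxy.schema.fields import DecodedDatabaseIdField
from galaxy.schema.schema import Model


class WorkflowExtractionByIdsPayload(Model):
    workflow_name: str
    from_history_id: Optional[DecodedDatabaseIdField] = None  # optional context only
    job_ids: list[DecodedDatabaseIdField] = []                # already DB IDs
    hda_ids: list[DecodedDatabaseIdField] = []
    hdca_ids: list[DecodedDatabaseIdField] = []
    dataset_names: list[str] = []
    dataset_collection_names: list[str] = []

    @model_validator(mode="after")
    def _at_least_one(self):
        # Reject payloads that select nothing to extract from.
        if not (self.hda_ids or self.hdca_ids or self.job_ids):
            raise ValueError("At least one of hda_ids, hdca_ids, job_ids required")
        return self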

fake_<id> job IDs are not needed in this payload. Verified in the WorkflowExtractionJob schema: fake-job entries serialize with id=None, and the Vue form already treats them as input selections, not job selections. So in the new payload, fake-job outputs flow through hda_ids/hdca_ids only, and job_ids stays a clean list[DecodedDatabaseIdField].

No HID fields on this model — mixing is impossible by construction (per design decision). Clients wanting HID semantics use the legacy endpoints.

New endpoint

POST /api/workflows/extract
Body: WorkflowExtractionByIdsPayload
Returns: WorkflowExtractionResult

History scoping: payload-level optional from_history_id (UI context, not validation). All access decisions via per-item permission checks.
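
A minimal sketch of what a request body for the new endpoint could look like, assuming the field names from WorkflowExtractionByIdsPayload. build_extract_payload is a hypothetical helper, not part of Galaxy, and the encoded IDs are placeholders. It mirrors the at-least-one rule on the client side so that an empty selection fails before any request is sent.

```python
import json
from typing import Optional


def build_extract_payload(
    workflow_name: str,
    hda_ids: Optional[list[str]] = None,
    hdca_ids: Optional[list[str]] = None,
    job_ids: Optional[list[str]] = None,
    from_history_id: Optional[str] = None,
) -> dict:
    """Hypothetical helper: assemble a body for POST /api/workflows/extract."""
    payload = {
        "workflow_name": workflow_name,
        "hda_ids": list(hda_ids or []),
        "hdca_ids": list(hdca_ids or []),
        "job_ids": list(job_ids or []),
    }
    # Mirror the server-side rule: at least one ID list must be non-empty.
    if not (payload["hda_ids"] or payload["hdca_ids"] or payload["job_ids"]):
        raise ValueError("At least one of hda_ids, hdca_ids, job_ids required")
    # from_history_id is optional UI context, so only send it when known.
    if from_history_id is not None:
        payload["from_history_id"] = from_history_id
    return payload


# Placeholder encoded IDs, for illustration only.
body = build_extract_payload("Extracted workflow", hda_ids=["abc123"], job_ids=["def456"])
print(json.dumps(body, indent=2))
```

Note that the body carries no HID fields at all, which is what makes mixing HIDs and IDs impossible by construction.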

Legacy paths untouched:

Backend Implementation

1. extract_workflow_by_ids / extract_steps_by_ids

New top-level functions in lib/galaxy/workflow/extract.py. Mirror the structure of extract_workflow / extract_steps but:

2. step_inputs_by_id

Sibling of step_inputs. Same tool.get_param_values + __cleanup_param_values walk, but emit (("dataset", hda.id), prefix+key) for HDAs and (("collection", hdca.id), prefix+key) for HDCAs.

Split the cleanup walk by parameter type rather than reconciling against JobToInputDatasetCollectionAssociation after the fact:

Refactor: pull the recursion in __cleanup_param_values into a helper that yields (item, prefix+key) events; HID and ID variants format keys differently.
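
The refactor above can be sketched roughly as follows. This is an illustration under stated assumptions, not Galaxy's actual __cleanup_param_values: FakeHDA and FakeHDCA are stand-ins for the real model classes, walk_data_params is a hypothetical name, the prefix scheme is a simplified version of pipe-delimited parameter naming, and the (hid, name) shape used for the HID variant is assumed.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass
class FakeHDA:  # stand-in for a HistoryDatasetAssociation
    id: int
    hid: int


@dataclass
class FakeHDCA:  # stand-in for a HistoryDatasetCollectionAssociation
    id: int
    hid: int


Item = Union[FakeHDA, FakeHDCA]


def walk_data_params(values: dict, prefix: str = "") -> Iterator[tuple[Item, str]]:
    """Yield (item, prefix+key) for every dataset or collection in a nested param dict."""
    for key, value in values.items():
        if isinstance(value, (FakeHDA, FakeHDCA)):
            yield value, prefix + key
        elif isinstance(value, dict):
            # Nested groups contribute "<name>|" to the prefix.
            yield from walk_data_params(value, prefix + key + "|")
        elif isinstance(value, list):
            # Repeated groups contribute "<name>_<index>|" to the prefix.
            for index, entry in enumerate(value):
                if isinstance(entry, dict):
                    yield from walk_data_params(entry, f"{prefix}{key}_{index}|")


def by_hid(item: Item, name: str):
    # Assumed shape of the legacy HID variant.
    return (item.hid, name)


def by_id(item: Item, name: str):
    # ID variant: tag the id with its kind so HDA and HDCA ids cannot collide.
    kind = "dataset" if isinstance(item, FakeHDA) else "collection"
    return ((kind, item.id), name)


params = {"input1": FakeHDA(id=10, hid=1), "cond": {"coll": FakeHDCA(id=20, hid=2)}}
id_keys = [by_id(item, name) for item, name in walk_data_params(params)]
hid_keys = [by_hid(item, name) for item, name in walk_data_params(params)]
```

Keeping the traversal in one generator means the two variants differ only in the small formatter applied to each event.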

3. BaseWorkflowSummary + WorkflowSummaryByIds

Step 3a: refactor existing WorkflowSummary into BaseWorkflowSummary + WorkflowSummary(BaseWorkflowSummary) first.

Scope of the lift is narrow — only pure helpers move to base:

What does not move: the __summarize traversal of history.visible_contents, __summarize_dataset / __summarize_dataset_collection (currently extract.py:322-401). Those entangle HID bookkeeping (hda_hid_in_history, hdca_hid_in_history) with self.jobs / implicit_map_jobs building; cleanly separating them is more work than this PR should bite off. WorkflowSummaryByIds builds its self.jobs differently (walking from supplied job_ids outward, not via visible_contents), so it doesn’t need that code path.

Run full extraction test suite (unit + test_workflow_extraction.py) to confirm no regression before continuing.
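
A rough sketch of the intended class split, with only a pure helper living on the base class. Everything here is illustrative: FakeHDA and its copied_from link are stand-ins for the real model, original_hda stands in for the existing private helper, and indexing by original id is one possible way to make either side of a copy resolve, not a description of the final implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeHDA:
    id: int
    history_id: int
    copied_from: Optional["FakeHDA"] = None  # stand-in for the copy link


class BaseWorkflowSummary:
    """Pure helpers only: no history traversal, no HID bookkeeping."""

    def __init__(self) -> None:
        self.warnings: set[str] = set()

    def original_hda(self, hda: FakeHDA) -> FakeHDA:
        # Follow the copy chain back to the dataset that was first created.
        while hda.copied_from is not None:
            hda = hda.copied_from
        return hda


class WorkflowSummaryByIds(BaseWorkflowSummary):
    def __init__(self, hdas: list[FakeHDA]) -> None:
        super().__init__()
        # Index each supplied dataset under the id of its original.
        self.by_original_id: dict[int, FakeHDA] = {}
        for hda in hdas:
            self.by_original_id.setdefault(self.original_hda(hda).id, hda)

    def resolve(self, job_input: FakeHDA) -> Optional[FakeHDA]:
        # Compare originals on both sides, so it does not matter whether the
        # job recorded the pre-copy or the post-copy dataset.
        return self.by_original_id.get(self.original_hda(job_input).id)


a = FakeHDA(id=1, history_id=100)
b = FakeHDA(id=2, history_id=200, copied_from=a)
resolved_from_copy = WorkflowSummaryByIds([b]).resolve(a)
resolved_from_original = WorkflowSummaryByIds([a]).resolve(b)
```

In this sketch the user can supply either A or B and the job input still maps back to the supplied dataset, which is the behavior the copy-related unit test is meant to pin down.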

Step 3b: add WorkflowSummaryByIds(BaseWorkflowSummary):

4. Implicit-map jobs

The HID path has special-case handling at extract.py:172-176 (input rewrite via find_implicit_input_collection) and :195-209 (output HDCA selection). Port carefully:

5. Permission model

Source | Check
hda_ids[i] | hda_manager.error_unless_accessible
hdca_ids[i] | hdca_manager.error_unless_accessible (HDCA-level only)
job_ids[i] | job_manager.get_accessible_job
from_history_id (if given) | history_manager.get_accessible (context only)

Anonymous: blocked at API guard (extraction needs a real user).
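
The per-item model can be sketched as below. This is a simplified illustration, not the real manager code: FakeItem, the owner/public rule, and check_all are assumptions made for the example, and the exception class is a local stand-in for Galaxy's ItemAccessibilityException.

```python
from dataclasses import dataclass


class ItemAccessibilityException(Exception):
    """Local stand-in for Galaxy's exception of the same name."""


@dataclass
class FakeItem:
    id: int
    owner: str
    history_id: int
    public: bool = False


def error_unless_accessible(item: FakeItem, user: str) -> FakeItem:
    # Simplified rule for the sketch: owners and public items pass.
    if item.owner != user and not item.public:
        raise ItemAccessibilityException(f"item {item.id} is not accessible")
    return item


def check_all(items: list[FakeItem], user: str) -> list[FakeItem]:
    """Check every supplied item on its own; the history it lives in is irrelevant."""
    return [error_unless_accessible(item, user) for item in items]


# Two datasets in different histories, both owned by the requesting user.
mine = [FakeItem(id=1, owner="alice", history_id=10), FakeItem(id=2, owner="alice", history_id=20)]
accessible = check_all(mine, "alice")

# Another user's private dataset is rejected regardless of history context.
try:
    check_all(mine + [FakeItem(id=3, owner="bob", history_id=30)], "alice")
    rejected = False
except ItemAccessibilityException:
    rejected = True
```

Because each item is checked individually, cross-history extraction needs no extra rule: a payload is accepted exactly when every item in it passes its own check.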

Frontend (WorkflowExtractionForm.vue)

Currently the form tracks selections by HID:

// dev
const selectedDatasets = getSelectedInputs("dataset");          // { hids, names }
// payload: { dataset_hids, dataset_collection_hids, job_ids }

Change:

API summary endpoint (extraction_summary) already returns full output objects — verify it includes the encoded id field; if not, extend WorkflowExtractionSummary to include encoded ids alongside HIDs.

Test Plan (red → green)

Write tests in this order; each must fail before implementation lands.

Unit (lib/galaxy_test/unit/workflow/test_extract.py, new)

  1. test_extract_steps_by_ids_basic: 1 input HDA, 1 tool job; verify data_input step + tool step + 1 connection.
  2. test_extract_steps_by_ids_collection: 1 input HDCA mapped through tool.
  3. test_invalid_dataset_id_raises_object_not_found.
  4. test_inaccessible_dataset_rejected: second user’s private HDA → ItemAccessibilityException.
  5. test_implicit_map_job_resolves_collection_input.
  6. test_id_pair_uses_original_after_copy: HDA copied A→B; passing copy’s id resolves connection. Important: cover both directions, i.e. user passes pre-copy HDA id (A) AND user passes post-copy HDA id (B), since JobToInputDatasetAssociation.dataset may point to either.
  6b. test_paired_collection_input: paired collection mapped as input.
  6c. test_dce_as_data_param: single DCE supplied to a DataToolParameter (the tricky case from §2 collection-element handling); ensure parent HDCA is recovered correctly.

API (lib/galaxy_test/api/test_workflow_extraction.py, extend)

  7. test_extract_with_hda_ids: happy path through HTTP.
  8. test_extract_without_from_history_id: only hda_ids/hdca_ids set; succeeds.
  9. test_cross_history_extraction: datasets from two histories owned by same user; succeeds.
  10. test_inaccessible_dataset_rejected: another user’s private HDA in payload → 403.
  11. test_copied_dataset_extraction_no_foreign_jobs: regression for #9161. Dataset copied A→B, tool run in B, extract from B; resulting workflow contains only the B job, B input.
  12. test_legacy_endpoint_still_works: back-compat. POST /api/histories/{id}/extract_workflow with dataset_hids unchanged.
  13. test_request_param_missing_when_empty_payload: 400 when no inputs/jobs.
  14. test_job_cache_cross_history_output: run a tool in A, run again in B with cache hit, extract from B. Pre-step: write this test against the current HID path on dev first to confirm it actually fails. A reviewer flagged that __summarize_dataset already walks __original_hda and may map original-id → B-hid correctly, so the HID path may already handle this. If green on dev, drop or recharacterize the test. Pattern it after test_run_cat1_use_cached_job_from_public_history (test_tools.py:1414).

Roundtrip

  15. test_roundtrip_basic_by_ids: port one existing roundtrip helper to the new endpoint; confirm the extracted workflow runs successfully on a fresh history.

Subworkflow roundtrip (#15b) dropped from this PR: subworkflow-ness lives on the invocation, not on resulting jobs/datasets — extraction sees a flat post-run history regardless. No existing subworkflow extraction test to port. Punted to follow-up if needed.

Vue / Selenium

  16. WorkflowExtractionForm.test.ts: payload assertions. hda_ids / hdca_ids are encoded IDs, not HIDs.
  17. Selenium roundtrip (if an extraction Selenium suite exists): UI extracts → run → result identical to old behavior in the single-history scenario.

Implementation Order

  0. ✅ Refactor WorkflowSummary → BaseWorkflowSummary + WorkflowSummary subclass (helpers only, see §3); full existing extraction test suite green; commit. (df4339e966, a0ecf5312e)
  0b. ✅ Land empty stubs + endpoint scaffold. (f0ded11559)
  1. ✅ Tests #1, #3, #4 → extract_steps_by_ids skeleton + permission checks. (fd9f63fd71)
  2. ✅ Tests #2, #5 → collection + implicit-map handling. (1787213de2)
  3. ✅ Test #6 → copied-dataset / original-resolution path. (c08be7007b)
  4. ✅ Tests #7, #8, #13 → schema + new endpoint wired. (e4100cc83a, 0e62499b2d, 4c7a90b17b)
  5. ✅ Tests #9, #10, #11 → cross-history + perm rejection + #9161 regression. (b64cc45819, b04351a861)
  6. ✅ Test #14 → job cache cross-history. (e5dbb3b9d8)
  7. ✅ Test #15 → roundtrip helper port. (e5dbb3b9d8)
  8. ✅ Vue: switch payload to hda_ids/hdca_ids, swap to extractWorkflowByIds, update WorkflowExtractionForm.test.ts (#16). Schema regen via make update-client-api-schema. Removed unused legacy submitWorkflowExtraction helper. (b7aec49113, e045b0a979)
  9. Deferred to CI: Selenium suite (#17). Local env lacks chromedriver/geckodriver, and the playwright backend hits driver_factory auto-detect. The existing Selenium suite was green on this branch under the HID payload (per prior agent run); UI-level interactions are unchanged, so a pass on CI is expected. Back-compat (#12) is covered by the API test test_legacy_endpoint_still_works.

11/11 by-ids API tests + 16/16 Vue unit tests passing as of 2026-04-29. Type-check clean.

Each step: small commit, full unit suite + targeted API subset before next step. Full test_workflow_extraction.py run before PR open.

Risks / Edge Cases

Resolved By Research (2026-04-28)

Design Decisions (resolved 2026-04-28)

Unresolved Questions