ICJ_NATIVE_PLAN

ICJ-Native Workflow Extraction — Implementation Plan

Date: 2026-05-12
Branch: history_notebook_extract (jmchilton/galaxy, currently at b3c55e2d03)
Predecessor: Closed PR galaxyproject/galaxy#22675 (“Allow extracting workflows by ID instead HID.”)
Tracking issue: #21722 (ID-based extraction for cross-history)
Related research:

  • vault/research/Workflow Extraction Issues.md
  • vault/research/Component - Workflow Extraction Models.md
  • vault/research/Problem - YAML Tool Post-Hoc State Divergence.md

Why this exists

Closed PR 22675 added a parallel ID-based extraction endpoint (POST /api/workflows/extract, extract_workflow_by_ids, extract_steps_by_ids) that mirrors the HID path’s “infer map-over from individual jobs” model:

  1. Caller passes job_ids (any constituent job).
  2. Server checks each job.implicit_collection_jobs_association, dedups via seen_icj_ids + SELECT Job ORDER BY id LIMIT 1 to pick a representative.
  3. Per-job connection keys later get rewritten via find_implicit_input_collection on output HDCAs.

That is the server reverse-engineering “this is part of a map” from data the caller already knew. The next iteration makes the implicit collection itself the unit of selection: callers pass implicit_collection_jobs_ids for mapped steps and job_ids only for non-mapped jobs. This:

The branch is not abandoned — the refactor scaffolding (helpers, schema model, route, service method, test mixin, ~all non-mapping tests) is reused. Only the mapped-job semantics change.


Current branch state to build on

Reuse as-is (do not redo):

File | Reuse
lib/galaxy/workflow/extract.py | _walk_data_param_tree, _finalize_workflow, _original_hda, _original_hdca, _skip_output_assoc_name, BaseWorkflowSummary._check_state, __cleanup_param_values_by_id, IdKey, IdAssociations typedefs
lib/galaxy/schema/workflows.py | WorkflowExtractionByIdsPayload (extend, don’t rewrite)
lib/galaxy/webapps/galaxy/api/workflows.py | POST /api/workflows/extract route
lib/galaxy/webapps/galaxy/services/workflows.py | WorkflowsService.extract_by_ids + _to_extraction_result
lib/galaxy_test/api/test_workflow_extraction.py | _ExtractionHelpersMixin, TestWorkflowExtractionByIdsApi non-mapped cases, helpers

Rewrite:

File | Rewrite scope
lib/galaxy/workflow/extract.py | extract_steps_by_ids body: split into ICJ branch + job branch
lib/galaxy/schema/workflows.py | WorkflowExtractionByIdsPayload: add implicit_collection_jobs_ids, tighten validator
lib/galaxy_test/api/test_workflow_extraction.py | test_extract_mapping_workflow_by_ids, test_extract_reduction_by_ids, test_subcollection_mapping_by_ids: all switch from job_ids=[mapped_job] to implicit_collection_jobs_ids=[icj_id]

Delete:


Target API shape

# lib/galaxy/schema/workflows.py
class WorkflowExtractionByIdsPayload(Model):
    workflow_name: str
    from_history_id: Optional[DecodedDatabaseIdField] = None  # UI hint only
    hda_ids: list[DecodedDatabaseIdField] = []
    hdca_ids: list[DecodedDatabaseIdField] = []
    dataset_names: list[str] = []
    dataset_collection_names: list[str] = []
    job_ids: list[DecodedDatabaseIdField] = []                       # NEW semantics
    implicit_collection_jobs_ids: list[DecodedDatabaseIdField] = []  # NEW

    @model_validator(mode="after")
    def _at_least_one_input(self):
        if not (self.hda_ids or self.hdca_ids or self.job_ids or self.implicit_collection_jobs_ids):
            raise ValueError("At least one of hda_ids, hdca_ids, job_ids, implicit_collection_jobs_ids required")
        return self
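
As a rough illustration of how callers would use the two lists, the sketch below builds one request body for a mapped step and one for a plain job. The encoded IDs are made-up placeholders, and has_selection is a hypothetical stand-in that mirrors the _at_least_one_input rule with plain dicts rather than the pydantic model.

```python
# Hypothetical payloads; every ID below is a made-up placeholder.
mapped_payload = {
    "workflow_name": "extracted from mapped step",
    "hdca_ids": ["hdca_enc_1"],
    "implicit_collection_jobs_ids": ["icj_enc_1"],  # one entry per mapped step
}
plain_payload = {
    "workflow_name": "extracted from plain job",
    "hda_ids": ["hda_enc_1"],
    "job_ids": ["job_enc_1"],  # only jobs with no ICJ association
}

SELECTION_FIELDS = ("hda_ids", "hdca_ids", "job_ids", "implicit_collection_jobs_ids")


def has_selection(payload: dict) -> bool:
    """Mirror of _at_least_one_input: at least one selection list is non-empty."""
    return any(payload.get(field) for field in SELECTION_FIELDS)
```

A body that names only workflow_name, or supplies only empty lists, fails this check, which is the case the validator rejects.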

Service-layer validation (runs after fetch, not in pydantic since it needs DB access):

  1. For every job_id: if job.implicit_collection_jobs_association is not None → 400 with message “job %s is part of implicit collection jobs %s — pass via implicit_collection_jobs_ids instead”.
  2. Each icj_id and job_id must be unique within its list; across lists, no job_id may belong to any supplied icj_id.
  3. Each icj_id must reference an ICJ whose populated_state == "ok" (reject "new" and "failed" with informative error).
  4. Each icj_id must have at least one output HDCA (no-output ICJs have nothing to wire).
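
The four rules above could be sketched roughly as follows. This is a self-contained illustration, not the service code: FakeJob, FakeIcj, RequestError and validate_selection are hypothetical names, and the dataclasses stand in for the fetched model objects.

```python
from dataclasses import dataclass, field
from typing import Optional


class RequestError(Exception):
    """Hypothetical stand-in for whatever exception the service maps to HTTP 400."""


@dataclass
class FakeIcj:
    id: int
    populated_state: str = "ok"
    output_hdca_ids: list = field(default_factory=list)


@dataclass
class FakeJob:
    id: int
    icj_id: Optional[int] = None  # stand-in for job.implicit_collection_jobs_association


def validate_selection(jobs, icjs):
    job_ids = [job.id for job in jobs]
    icj_ids = [icj.id for icj in icjs]
    # Rule 2, first half: no duplicates within either list.
    if len(set(job_ids)) != len(job_ids) or len(set(icj_ids)) != len(icj_ids):
        raise RequestError("duplicate ids in job_ids or implicit_collection_jobs_ids")
    for job in jobs:
        # Rule 1. Rejecting every mapped job also covers the cross-list half of rule 2.
        if job.icj_id is not None:
            raise RequestError(
                f"job {job.id} is part of implicit collection jobs {job.icj_id}; "
                "pass via implicit_collection_jobs_ids instead"
            )
    for icj in icjs:
        # Rule 3: only fully populated ICJs can be extracted.
        if icj.populated_state != "ok":
            raise RequestError(f"implicit collection jobs {icj.id} is in state {icj.populated_state!r}")
        # Rule 4: an ICJ with no output HDCA has nothing to wire.
        if not icj.output_hdca_ids:
            raise RequestError(f"implicit collection jobs {icj.id} has no output collections")
```

Because rule 1 rejects any job that has an ICJ association, a separate cross-list membership check is only needed if rule 1 is ever relaxed.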

Implementation in extract_steps_by_ids

Signature change:

def extract_steps_by_ids(
    trans: ProvidesHistoryContext,
    job_manager: Optional[JobManager] = None,
    job_ids: Optional[list[int]] = None,
    implicit_collection_jobs_ids: Optional[list[int]] = None,  # NEW
    hda_ids: Optional[list[int]] = None,
    hdca_ids: Optional[list[int]] = None,
    dataset_names: Optional[list[str]] = None,
    dataset_collection_names: Optional[list[str]] = None,
) -> list[WorkflowStep]:

Body shape (replaces current ~110 lines from # Resolve and dedup jobs through return steps):

hda input steps    -> id_to_output_pair[("dataset",    original_hda.id)]
hdca input steps   -> id_to_output_pair[("collection", original_hdca.id)]

resolve work items (kept in submission order):
    for each icj_id:  fetch ICJ + its output_hdcas; access-check via output_hdca.history
    for each job_id:  fetch + access-check; reject if job has ICJ association
    sort all by representative-job.id ascending  (monotonic submission order)

for each work item in order:
    if ICJ:
        representative = icj.job_list sorted by ImplicitCollectionJobsJobAssociation.order_index, first
        tool_inputs, associations = step_inputs_by_id(trans, representative)
        # Rewrite collection-element associations to the parent input HDCA up-front.
        # The mapping is known directly: icj output_hdcas -> implicit_input_collections.
        for each row in output_hdcas[0].implicit_input_collections:
            # ImplicitlyCreatedDatasetCollectionInput rows give us {name -> input_hdca}
            name, input_hdca = the row's input name and input HDCA
            mapped_inputs[name] = _original_hdca(input_hdca)
        for key, input_name in associations:
            if input_name in mapped_inputs:
                key = ("collection", mapped_inputs[input_name].id)
            wire (step_input, other_step, other_name) if key in id_to_output_pair

        outputs: dedup output_hdcas by implicit_output_name, register each via original HDCA id.
    else:  # plain job
        tool_inputs, associations = step_inputs_by_id(trans, job)
        for key, input_name in associations:
            wire if key in id_to_output_pair
        outputs: job.output_datasets + job.output_dataset_collection_instances (current logic)

The wiring logic at the call site collapses to one loop because the key-rewrite now happens before wiring, not interleaved with it.
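
A minimal sketch of that "rewrite first, then wire" split, using plain tuples and dicts in place of the real IdKey / IdAssociations types. The helper names rewrite_keys and wire are hypothetical, and mapped_inputs maps input names to HDCA ids here rather than to HDCA objects.

```python
def rewrite_keys(associations, mapped_inputs):
    """Swap element-level keys for the parent input HDCA key where the input was mapped over."""
    rewritten = []
    for key, input_name in associations:
        if input_name in mapped_inputs:
            key = ("collection", mapped_inputs[input_name])
        rewritten.append((key, input_name))
    return rewritten


def wire(associations, id_to_output_pair):
    """Return {input_name: (other_step, other_name)} for keys produced by earlier steps."""
    return {
        input_name: id_to_output_pair[key]
        for key, input_name in associations
        if key in id_to_output_pair
    }


# Example: "input1" was mapped over HDCA 7; "reference" has no upstream step.
id_to_output_pair = {("collection", 7): ("input_step_0", "output")}
associations = [(("dataset", 41), "input1"), (("dataset", 99), "reference")]
mapped_inputs = {"input1": 7}
connections = wire(rewrite_keys(associations, mapped_inputs), id_to_output_pair)
```

For a plain job, mapped_inputs is empty, so rewrite_keys is a no-op and the same wire call serves both branches.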

step_inputs_by_id itself does not change shape — it still returns (tool_inputs, IdAssociations). The ICJ branch just calls it with the representative job.
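
The "resolve work items" step in the body shape above could be sketched like this. WorkItem and order_work_items are hypothetical names; member_job_ids is assumed to already be ordered by order_index for an ICJ, and to hold the single job id for a plain job.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    kind: str             # "icj" or "job"
    item_id: int          # ICJ id or job id
    member_job_ids: list  # ICJ: constituent job ids in order_index order; job: [job id]

    @property
    def representative_job_id(self) -> int:
        # First constituent job stands in for the whole ICJ.
        return self.member_job_ids[0]


def order_work_items(items):
    """Sort ICJs and plain jobs together by representative job id, ascending."""
    return sorted(items, key=lambda item: item.representative_job_id)


items = [
    WorkItem("job", 30, [30]),
    WorkItem("icj", 5, [12, 13, 14]),
    WorkItem("job", 3, [3]),
]
ordered = order_work_items(items)
```

Sorting on the representative job id rather than the ICJ id keeps mapped and non-mapped steps in one comparable sequence, since ICJ ids and job ids come from different tables.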


Files to touch (concrete checklist)

lib/galaxy/schema/workflows.py

lib/galaxy/workflow/extract.py

lib/galaxy/webapps/galaxy/services/workflows.py

lib/galaxy/webapps/galaxy/api/workflows.py

lib/galaxy_test/api/test_workflow_extraction.py


Red-to-green test order

Per project convention, write tests first, then make them pass. Suggested commit cadence:

  1. Commit 1 — payload field + validator. Add implicit_collection_jobs_ids to the model + the _at_least_one_input extension. Write test_empty_payload_rejected variant covering the new field path; existing tests pass unchanged. Green.
  2. Commit 2 — validator: reject mapped-job in job_ids. Write test_job_with_icj_via_job_ids_rejected (red). Add service-layer check (green). At this point existing mapped tests (test_extract_mapping_workflow_by_ids etc.) break — that’s expected, fix in next commit.
  3. Commit 3 — ICJ branch implementation. Write/extend test_extract_mapping_workflow_by_ids for the new payload shape (red against commit 2 state). Implement ICJ branch in extract_steps_by_ids. Green.
  4. Commit 4 — port remaining mapped tests. test_extract_reduction_by_ids + test_subcollection_mapping_by_ids switch to implicit_collection_jobs_ids. Green.
  5. Commit 5 — strip legacy inference. Delete seen_icj_ids block, find_implicit_input_collection post-hoc rewrite, the if output_hdcas: branch in the output-registration section, and the speculative comment. All tests still green.
  6. Commit 6 — cross-list dedup + edge-case validators. test_mixed_icj_and_member_job_rejected, test_icj_not_populated_rejected (if feasible).

Run after each commit: ./run_tests.sh -api lib/galaxy_test/api/test_workflow_extraction.py — the full file should stay green.


What still requires walking a representative job

Until Problem - YAML Tool Post-Hoc State Divergence is addressed:

When Job.tool_state / ToolRequest.request_state becomes the canonical post-hoc source (work in PRs 20935, 21828, 21842):

Document this in extract.py next to the representative-job pick with a # FIXME: source of truth for params; swap to Job.tool_state when post-hoc reader exists comment referencing the issue.


Out of scope (do not pull into this PR)


Bugs this PR directly fixes (cross-ref to research notes)

Issue | How this PR helps
#21789 (dynamic nested collection misconnected) | ID-based wiring + up-front ICJ input lookup removes HID-based inference for mapped steps.
#13823 (multi-output copied collection downstream fail) | Outputs keyed by HDCA id, not HID; copy-history-membership stops mattering for mapped outputs.
#9161 (copied datasets break) | Already largely addressed by ID-based path; ICJ restructure removes one more HID source in the mapped case.

Not fully fixed by this PR (still need ToolRequest):


Unresolved questions


References (in-repo)