EXTRACT_BY_ID_RESEARCH_2026-04-28

Research: open questions before resuming Step 1 (tool-job branch) and Step 2

Branch: history_notebook_extract. Last commit 1787213de2. Stub for extract_steps_by_ids raises NotImplementedError for job_ids. This note resolves the three open questions blocking forward motion.

Q1 — DCE → parent HDCA navigation: not needed

Design call (2026-04-28)

We do not recover the parent HDCA from a DCE input. The workflow format has no clean way to express “pick element N of a collection” anyway — if a user wants that, they should use the extract_dataset tool (which produces its own HDA with its own creating job). A DCE manually fed to a DataToolParameter is, for extraction purposes, just an HDA.

Resulting per-input-type rule

Job input source | Resolution
Job.input_datasets (HDA) | param walk over tool.get_param_values; emit ("dataset", hda.id, name)
Job.input_dataset_collections (HDCA) | iterate DB rows directly; emit ("collection", hdca.id, assoc.name)
Job.input_dataset_collection_elements (DCE) | resolve to leaf HDA via dce.hda (or dce.first_dataset_instance() for nested); emit ("dataset", hda.id, assoc.name); no HDCA recovery
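The per-input-type rule can be sketched as below. This is a minimal illustration, not the planned step_inputs_by_id: Ref, Element, Assoc and collect_input_refs are hypothetical stand-ins for the Galaxy model rows, and the HDA param walk over tool.get_param_values is assumed to have already produced (name, hda) pairs.

```python
from dataclasses import dataclass
from typing import Any, Literal, Optional

InputRef = tuple[Literal["dataset", "collection"], int, str]


@dataclass
class Ref:
    """Hypothetical stand-in for an HDA or HDCA row; only .id matters here."""
    id: int


@dataclass
class Element:
    """Hypothetical stand-in for a DCE."""
    hda: Optional[Ref] = None
    nested_leaf: Optional[Ref] = None  # first leaf of a nested child collection

    def first_dataset_instance(self) -> Optional[Ref]:
        # Mimics the model method named in the table for nested elements.
        return self.nested_leaf


@dataclass
class Assoc:
    """Hypothetical stand-in for a JobToInput* association row."""
    name: str
    target: Any


def collect_input_refs(
    hda_params: list[tuple[str, Ref]],
    hdca_assocs: list[Assoc],
    dce_assocs: list[Assoc],
) -> list[InputRef]:
    refs: list[InputRef] = []
    # HDA inputs: names come from the (assumed, already completed) param walk.
    for name, hda in hda_params:
        refs.append(("dataset", hda.id, name))
    # HDCA inputs: taken straight from the association rows.
    for assoc in hdca_assocs:
        refs.append(("collection", assoc.target.id, assoc.name))
    # DCE inputs: resolve to the leaf HDA; the parent HDCA is never recovered.
    for assoc in dce_assocs:
        dce = assoc.target
        hda = dce.hda if dce.hda is not None else dce.first_dataset_instance()
        refs.append(("dataset", hda.id, assoc.name))
    return refs
```

Note that a DCE input ends up indistinguishable from a plain HDA input in the emitted tuples, which is the point of the design call.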

Plus implicit-map override (Q2): when the job is implicit-map, output_hdca.find_implicit_input_collection(name) overrides per-job input rows with the actual input HDCA the mapping consumed.
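A rough sketch of the override, assuming the per-job inputs are already available as (kind, id, name) tuples. FakeHDCA and apply_implicit_map_override are hypothetical names; the stand-in's find_implicit_input_collection is assumed to return None when no implicit input collection matches the name.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeHDCA:
    """Hypothetical stand-in for an output HDCA of an implicit mapping."""
    id: int
    implicit_inputs: dict[str, "FakeHDCA"] = field(default_factory=dict)

    def find_implicit_input_collection(self, name: str) -> Optional["FakeHDCA"]:
        # Assumed behavior: the input HDCA mapped over for this parameter, or None.
        return self.implicit_inputs.get(name)


def apply_implicit_map_override(refs, output_hdca):
    """Replace per-job input rows with the input HDCA the mapping consumed."""
    overridden = []
    for kind, obj_id, name in refs:
        source = output_hdca.find_implicit_input_collection(name)
        if source is not None:
            # The mapping consumed a whole collection for this parameter.
            overridden.append(("collection", source.id, name))
        else:
            # Not mapped over: keep the per-job row as resolved.
            overridden.append((kind, obj_id, name))
    return overridden
```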

Implementation note for __cleanup_param_values

The existing HID-path __cleanup_param_values flattens DataCollectionToolParameter values via first_dataset_instance() (extract.py:464), losing the HDCA. For the ID path:

The HID path stays unchanged; we add a sibling helper rather than refactoring the original.

Model facts (for reference)

Q2 — Implicit-map representative-job logic in ID path

Model

So given any participating job_id:

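from sqlalchemy import select

from galaxy.model import (
    HistoryDatasetCollectionAssociation,
    ImplicitCollectionJobsJobAssociation,
    Job,
)

job = sa_session.get(Job, job_id)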
icj_assoc = job.implicit_collection_jobs_association
if icj_assoc is None:
    # not an implicit-map participant; treat as singleton tool job
    ...
else:
    icj_id = icj_assoc.implicit_collection_jobs_id
    # representative job: pick a deterministic one (first-by-id), matches HID path's cja[0] behavior
    representative_job = (
        sa_session.execute(
            select(Job)
            .join(ImplicitCollectionJobsJobAssociation)
            .where(ImplicitCollectionJobsJobAssociation.implicit_collection_jobs_id == icj_id)
            .order_by(Job.id)
            .limit(1)
        ).scalar_one()
    )
    # output HDCAs of the mapping
    output_hdcas = sa_session.scalars(
        select(HistoryDatasetCollectionAssociation)
        .where(HistoryDatasetCollectionAssociation.implicit_collection_jobs_id == icj_id)
    ).all()

How to wire connections

Mirrors HID path exactly, just keyed differently:

Deduplication

If user passes multiple jobs from the same ICJ, collapse to one representative. Mirror HID path’s job_id2representative_job dict — but keyed by ICJ id rather than building it from a history.visible_contents walk:

icj_to_representative: dict[int, Job] = {}
for job_id in job_ids:
    job = ...accessible...
    icj_assoc = job.implicit_collection_jobs_association
    icj_id = icj_assoc.implicit_collection_jobs_id if icj_assoc else None
    if icj_id is not None:
        if icj_id not in icj_to_representative:
            icj_to_representative[icj_id] = pick_representative(icj_id)
        # silently skip non-representative participating jobs
        continue
    # singleton job — process directly

This is cleaner than HID path because we don’t iterate the whole history.
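A self-contained variant of the dedup loop, for illustration only. FakeJob and dedupe_jobs are hypothetical; pick_representative is passed in as a callable so the sketch stays independent of the DB query, and the access check is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FakeJob:
    """Hypothetical stand-in for Job; icj_id is None for singleton tool jobs."""
    id: int
    icj_id: Optional[int] = None


def dedupe_jobs(
    jobs: list[FakeJob],
    pick_representative: Callable[[int], FakeJob],
) -> list[FakeJob]:
    """Collapse jobs sharing an ICJ to one representative, keeping first-seen order."""
    icj_to_representative: dict[int, FakeJob] = {}
    selected: list[FakeJob] = []
    for job in jobs:
        if job.icj_id is None:
            # Singleton job: process directly.
            selected.append(job)
            continue
        if job.icj_id not in icj_to_representative:
            representative = pick_representative(job.icj_id)
            icj_to_representative[job.icj_id] = representative
            selected.append(representative)
        # Further jobs from an already-seen ICJ are silently skipped.
    return selected
```

Because the representative is chosen by the callable rather than by request order, passing job 2 before job 1 of the same ICJ still yields job 1 when the callable picks first-by-id.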

Q3 — Test-strategy reconciliation

User decision: keep unit tests small and targeted; move the DB-heavy cases to API/integration.

Unit (test/unit/workflows/test_extract_by_ids.py) — keep tight, focus on input-step construction + permission rejection. Already implemented:

Add only one or two more unit tests, both narrow:

Drop from unit, move to API:

API (lib/galaxy_test/api/test_workflow_extraction.py): add a sibling class TestWorkflowExtractionByIdsApi with a parallel helper _extract_and_download_workflow_by_ids that calls POST /api/workflow/extract. Port the existing HID test bodies one-for-one: same setups, but swap dataset_ids=hids for hda_ids=encoded_ids and swap the endpoint. The existing HID class becomes a regression baseline.

Tests to write at API level (replacing former unit cases #2/#5/#6/#6b/#6c plus the pre-planned API set):

Job-cache cross-history (#14) and roundtrip (#15/#15b) stay as planned.

Why this split is safer than the original plan

The original plan called #2/#5/#6 “unit” but they need a real tool.get_param_values cycle, real JobToInput* rows, real implicit-map structures, real copy chains. Mocking those would be brittle and would catch the wrong things (mock contract drift, not real behavior). API tests already have all the populator helpers (run_tool, run_random_lines_mapped_over_pair, __copy_content_to_history) that build these structures correctly.

Concrete next-step sequence

  1. Step 1 finishing — tool-job branch.
    • Implement step_inputs_by_id as a thin variant of step_inputs: same param walk for HDA params; for HDCA, append from Job.input_dataset_collections directly; for DCE-as-data-param, append from Job.input_dataset_collection_elements, resolved to the leaf HDA (no parent-HDCA recovery, per Q1).
    • Implement implicit-map dedup + representative-job selection (pure DB queries, no visible_contents).
    • Build id_to_output_pair: dict[tuple[Literal["dataset", "collection"], int], tuple[WorkflowStep, str]].
    • Wire connections.
    • Replace the if job_ids: raise NotImplementedError block.
  2. Add API test class TestWorkflowExtractionByIdsApi with the helper. First red test: happy-path cat1 (#7).
  3. Land tests one at a time in the order listed above; each goes red → green per existing convention.
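The id_to_output_pair lookup and the wiring step from Step 1 might look roughly like the following. FakeStep and wire_connections are hypothetical stand-ins (the real WorkflowStep connection model is richer), and returning the unresolved inputs is an assumption made for illustration, not a decision recorded in this note.

```python
from dataclasses import dataclass, field
from typing import Literal

OutputKey = tuple[Literal["dataset", "collection"], int]


@dataclass
class FakeStep:
    """Hypothetical stand-in for WorkflowStep."""
    label: str
    # input name -> (upstream step label, upstream output name)
    connections: dict[str, tuple[str, str]] = field(default_factory=dict)


def wire_connections(
    step: FakeStep,
    input_refs: list[tuple[str, int, str]],
    id_to_output_pair: dict[OutputKey, tuple[FakeStep, str]],
) -> list[tuple[str, int, str]]:
    """Connect each input whose source object was produced by an earlier step."""
    unresolved = []
    for kind, obj_id, input_name in input_refs:
        pair = id_to_output_pair.get((kind, obj_id))
        if pair is None:
            # No known producer for this object; leave it to the caller.
            unresolved.append((kind, obj_id, input_name))
            continue
        upstream_step, output_name = pair
        step.connections[input_name] = (upstream_step.label, output_name)
    return unresolved
```

Keying the dict on (kind, id) rather than on id alone keeps an HDA and an HDCA with the same numeric id from colliding.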

Open ambiguities still unresolved

No blockers identified. Recommend resuming with Step 1’s tool-job branch.