CAPTURE_WORKFLOW_EXECUTION_STATE_PLAN

Capture Workflow Execution State — Implementation Plan

Date: 2026-05-17
Branch: start from graph_workflow_extract (or its successor once EXTRACT_TOOL_REQUEST_STATE_PLAN merges). The converter is a convert.py sibling of to_workflow_step_state, which already lives on this branch.
Tracking issue: TBD (create from this doc).
Decision context: USING_TOOL_STATE_DESIGN_OPTIONS; read that first for why this shape. This plan implements the recommendation there: execution-time capture → STEP_STATE. EXEC_STATE is the documented north star and is out of scope here.
Labels: MINT = workflows mint a ToolRequest; READ_TIME = synthesize on read (rejected); STEP_STATE = capture onto the workflow step ★; EXEC_STATE = extract a shared value object.
Related research:

  • vault/research/Component - Tool State Specification.md
  • vault/research/PR 21932 - History Graph API.md
  • vault/projects/history_markdown/EXTRACT_TOOL_REQUEST_STATE_PLAN.md

Why this exists

History Graph (PR 21932 - History Graph API) and structured workflow extraction (EXTRACT_TOOL_REQUEST_STATE_PLAN) both read the validated structured request_internal payload off ToolRequest.request. Workflow invocations never create a ToolRequest (ToolRequest is minted only at lib/galaxy/webapps/galaxy/services/jobs.py:265, the async tool-request API path). So both consumers dead-end or fall back to lossy legacy state for anything produced by a workflow.

This plan gives a workflow tool-step execution the same structured, validated state a direct tool request has — captured at execution time, where the resolved state is faithful — without minting a ToolRequest for it.

Settled decisions (see USING_TOOL_STATE_DESIGN_OPTIONS for rationale)

The half that already exists (do not rebuild)

Job.tool_state (JSONB) already exists (lib/galaxy/model/__init__.py:1641, migration 566b691307a5). The async tool-request path already persists validated per-job job_internal there at lib/galaxy/tools/execute.py:258-260:

if execution_slice.validated_param_combination:
    tool_state = execution_slice.validated_param_combination.input_state
    job.tool_state = tool_state

The only reason it never fires for workflow jobs is that ToolModule.execute builds MappingParameters with 2 of its 4 fields (lib/galaxy/workflow/modules.py:2877): MappingParameters(tool_state.inputs, param_combinations), which leaves validated_param_template=None and validated_param_combinations=None (execute.py:81-90). The persistence column and machinery are built; they are simply not fed on the workflow path.
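A minimal sketch of why the existing guard stays silent on the workflow path. MappingParametersSketch and ValidatedCombination are hypothetical stand-ins, not the real classes; the names of the two unvalidated fields are assumptions, and only the two validated_* names and input_state come from the text above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ValidatedCombination:
    """Hypothetical stand-in for a validated per-job parameter combination."""

    input_state: dict[str, Any]


@dataclass
class MappingParametersSketch:
    """Hypothetical stand-in for MappingParameters: four fields, two optional."""

    param_template: dict[str, Any]  # assumed field name
    param_combinations: list[dict[str, Any]]  # assumed field name
    validated_param_template: Optional[Any] = None
    validated_param_combinations: Optional[list[ValidatedCombination]] = None


def tool_state_to_persist(mapping: MappingParametersSketch, index: int) -> Optional[dict[str, Any]]:
    """Mirror the shape of the guard: no validated combination, nothing persisted."""
    combos = mapping.validated_param_combinations
    if combos:
        return combos[index].input_state
    return None


# Workflow path today: only the first two fields are supplied.
workflow_style = MappingParametersSketch({"input1": 1}, [{"input1": 1}])

# Tool-request path: the validated fields are populated as well.
request_style = MappingParametersSketch(
    {"input1": 1},
    [{"input1": 1}],
    validated_param_template={"input1": {"src": "hda", "id": 1}},
    validated_param_combinations=[ValidatedCombination({"input1": {"src": "hda", "id": 1}})],
)
```

With the two-field construction the helper returns None, so nothing reaches Job.tool_state; feeding the validated fields is enough to make the same guard fire.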

Architecture / seam

ToolModule.execute  (workflow/modules.py ~2877)
   has: resolved param_combinations + collection_info + tool + step


 ┌──────────────────────────────────────────────┐  PHASE 1 — atomic core
 │ CONVERTER  (tool_util/parameters/convert.py)  │  decision-independent
 │ resolved workflow exec state → request_       │  (MINT/STEP_STATE/EXEC_
 │ internal  (incl. Batch / linked encoding)     │  STATE all need it); no
 │                                               │  schema; lands first
 └──────────────────────────────────────────────┘


 ┌──────────────────────────────────────────────┐
 │ VALIDATE request_internal → request_state      │  enum: not_validated /
 │ enum;  derive per-job job_internal →           │  validated /
 │ populate MappingParameters.validated_*         │  validation_failed
 └──────────────────────────────────────────────┘
        │ existing execute.py:258-260 now fires for workflow jobs
        │ → Job.tool_state gets validated job_internal (per-job leg, free)
        │ INSTRUMENT: log + statsd counter of request_state outcomes
   ═════╪═══════════════════════════════════════════  ◄ PHASE 1 HARD STOP

 ┌──────────────────────────────────────────────┐  PHASE 2 — STEP_STATE
 │ persist request_internal on                   │  begins the where-axis
 │ workflow_invocation_step.request               │  decision; STEP_STATE
 │ + workflow_invocation_step.request_state       │  chosen
 └──────────────────────────────────────────────┘


 ┌──────────────────────────────────────────────┐
 │ RESOLVER  (manager/helper):                    │  the stable seam.
 │ (Job | ICJ | WIS) → (request_internal,         │  EXEC_STATE later swaps
 │  request_state)  — sourced from ToolRequest    │  backing without touching
 │  OR workflow_invocation_step                   │  consumers.
 └──────────────────────────────────────────────┘

        ├── History Graph (_hda/_hdca_producers, _fetch_payloads)
        └── Extraction    (step_inputs_by_id structured branch)

Phase 1 is the uncontroversial atomic core: it produces + validates + instruments the payload and stops before anything a consumer can see. It is identical across MINT / STEP_STATE / EXEC_STATE and lands ahead of the where-axis decision. Phase 2 commits the STEP_STATE shape.
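The resolver seam in the diagram can be sketched roughly as follows. Every class here is a hypothetical stub; the real resolver accepts Job, ICJ, or WIS, whereas this sketch handles only a job-like object. Treating a ToolRequest payload as "validated" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolRequestStub:
    """Hypothetical stand-in: direct tool requests carry request_internal."""

    request: dict[str, Any]


@dataclass
class InvocationStepStub:
    """Hypothetical stand-in for the two Phase 2 columns on the invocation step."""

    request: Optional[dict[str, Any]] = None
    request_state: str = "not_validated"


@dataclass
class JobStub:
    tool_request: Optional[ToolRequestStub] = None
    invocation_step: Optional[InvocationStepStub] = None


def resolve_request(job: JobStub) -> Optional[tuple[dict[str, Any], str]]:
    """Return (request_internal, request_state) from whichever backing exists."""
    if job.tool_request is not None:
        # Assumption: a ToolRequest payload was validated when it was minted.
        return job.tool_request.request, "validated"
    step = job.invocation_step
    if step is not None and step.request is not None:
        return step.request, step.request_state
    # Neither backing present: the caller falls back to legacy state.
    return None


direct_job = JobStub(tool_request=ToolRequestStub({"input1": {"src": "hda", "id": 1}}))
workflow_job = JobStub(
    invocation_step=InvocationStepStub({"input1": {"src": "hda", "id": 2}}, "validated")
)
legacy_job = JobStub()
```

Consumers call only resolve_request, so a later EXEC_STATE change would replace the two branches without touching History Graph or extraction.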

The headline risk (retire it in Phase 1, before anything else)

Is the Batch / linked synthesis total? The workflow path encodes map-over as collection_info (MatchingCollections) + per-iteration param_combinations. ToolRequest.request encodes it as {"__class__": "Batch", "values": [...], "linked": bool}. The converter must reconstruct the Batch form and derive linked from MATCHED vs MULTIPLIED semantics (lib/galaxy/tools/parameters/meta.py:348-372).

Unverified totality cases — these are the gate:

If any of these cannot be synthesized faithfully, Phase 1 is not done. This is answerable purely at the unit level via the parameter-spec harness — no history fixtures, no schema. Do it first.

Do not treat this risk as pre-cleared by the gate commit (5699c2c324). The gate proved only the forward direction — converting an already-persisted request_internal Batch shape into linked workflow state (to_workflow_step_state), with API coverage for single-level matched Batch and list:paired/subcollection map-over. Phase 1’s risk is the inverse: synthesizing the Batch/linked form from collection_info + param_combinations, which nothing in the gate touches. Carried-forward, still open: true nested list:paired map-over (2-level, not single-level subcollection), and the synthesis itself. Additional baked-in constraint discovered: to_workflow_step_state (convert.py) hard-rejects a Batch whose values length ≠ 1 (“Batch map-over inputs must contain exactly one value”). The Phase-1 synthesizer must emit length-1 values Batches (one Batch wrapper per batched input), or the forward converter consuming the same shape later will raise. Add this as an explicit synthesizer post-condition in 1.1.

RETIRED 2026-05-17 — synthesis is total for the workflow execution path. Answer: YES. Decisive structural invariant, verified in code: the workflow ToolModule path never produces cross-product (linked:false) map-over. Every collections_to_match.add(...) in _find_collections_to_match (lib/galaxy/workflow/modules.py:606-702, sites 627/679/681/685/700) uses the default linked=True. Consequences:

  • Multi-input matched Batch is the only multi-input case workflows generate, and it is trivially total: MatchingCollections.collections (lib/galaxy/model/dataset_collections/matching.py:62) is an input-name-keyed dict, so each mapped input synthesizes its own length-1 Batch independently.
  • Nested list:paired / subcollection map-over is recoverable: parent collection from collection_info.collections[name], subcollection type from collection_info.subcollection_types[name] → emitted as the Batch value’s map_over_type. BatchCollectionInstanceInternal / BatchDataHdcaInstanceInternal carry map_over_type (tool_util_models/parameters.py:1086,1282); __expand_collection_parameter_async reads it back forward — the round-trip closes.
  • linked:false is the one genuinely lossy case in MatchingCollections (unlinked collections drop input-name+hdca into unlinked_structures) — but unreachable from the workflow path. The plan’s hard-fail for it is correct defensive symmetry, not a totality gap.

Wording correction (was wrong in this plan): the converter is not “the inverse of to_workflow_step_state” — that function’s domain is request_internal; the new converter’s domain is post-expansion workflow state (param_combinations + collection_info). It is the inverse of expand_meta_parameters_async (meta.py); they share only the Batch vocabulary.

Source-neutral seam (makes “purely unit-level” true): collection_info (MatchingCollections) holds live SQLAlchemy objects, not unit-constructible. The pure converter takes a normalized MappedCollectionInput{src,id,map_over_type,linked} per mapped input; a thin DB-bound adapter at the workflow execute site extracts it from collection_info (integration-tested in 1.2/Phase 2, not here). Landed: from_workflow_execution_state + MappedCollectionInput in convert.py, exported, with 7 red→green unit cases in test/unit/tool_util/test_parameter_convert.py (incl. an explicit forward to_workflow_step_state round-trip proving the length-1 post-condition).
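A rough illustration of the source-neutral synthesis described above, not the landed from_workflow_execution_state. MappedInput is a hypothetical analogue of MappedCollectionInput, the function names are invented for this sketch, and the inner value shape (src, id, map_over_type) is an assumption; only the Batch envelope and the length-1 post-condition come from the plan.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MappedInput:
    """Hypothetical analogue of MappedCollectionInput{src,id,map_over_type,linked}."""

    src: str
    id: int
    map_over_type: Optional[str] = None
    linked: bool = True


def synthesize_batch(mapped: MappedInput) -> dict[str, Any]:
    """Build one length-1 Batch wrapper for a single mapped input."""
    if not mapped.linked:
        # Defensive hard-fail: cross-product map-over cannot be recovered faithfully.
        raise ValueError("unlinked map-over cannot be synthesized")
    value: dict[str, Any] = {"src": mapped.src, "id": mapped.id}
    if mapped.map_over_type is not None:
        value["map_over_type"] = mapped.map_over_type
    batch = {"__class__": "Batch", "values": [value], "linked": True}
    # Post-condition the forward converter relies on: exactly one value.
    assert len(batch["values"]) == 1
    return batch


def synthesize_request(
    scalar_state: dict[str, Any], mapped_inputs: dict[str, MappedInput]
) -> dict[str, Any]:
    """Overlay one Batch per mapped input onto the non-mapped parameter state."""
    request = dict(scalar_state)
    for name, mapped in mapped_inputs.items():
        request[name] = synthesize_batch(mapped)
    return request


example = synthesize_request(
    {"threshold": 5},
    {"input1": MappedInput("hdca", 7), "input2": MappedInput("hdca", 8, "paired")},
)
```

Because each mapped input is handled independently, a multi-input matched map-over is simply several length-1 Batches side by side, which is why the DB-bound adapter can stay thin.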

Phase 1 — atomic core (decision-independent, lands first)

1.1 Converter — lib/galaxy/tool_util/parameters/convert.py / __init__.py

1.2 Validate + derive per-job state

Decisions (2026-05-17, this session):

Bookkeeping correction (2026-05-17, resume session). The prior session marked 1.1/1.3/1.4 done but never flipped these 1.2 boxes though the code had landed — and the test_modules.py “62/62 green” claim was not real: the new helpers used capitalized Dict/List/Tuple/Callable with no matching typing import → NameError at collection. Since modules.py is imported by the running server, the integration test’s prior “1 passed” could not have been real either. Fixed by switching to PEP 585 lowercase generics + Callable from collections.abc (the file’s own convention). Re-verified this session: test_parameter_convert.py 27/27, test_modules.py 62 passed, integration test 1 passed (25.75s). Phase 1 then committed (d27cb7ac3a). Post-commit subagent review → ship-with-nits: refactor byte-diff-verified behavior-preserving, no runtime defects. Findings #1 (coarse except Exception/log.debug hides capture-code bugs) + #2 (SkipWorkflowStepEvaluation mis-recorded as validation_failed) applied — see 1.2 enum item; +4 taxonomy unit tests (test_modules.py now 66 passed, 93 with convert suite). Open review item still deferred (per chosen scope): broader mapped-step integration coverage + parameter_specification.yml round-trip rows.
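The import failure and its fix can be reproduced in isolation. This is a sketch with invented helper names; the broken line is written as a module-level alias because that is evaluated eagerly on every Python version, and whether the original helpers used aliases or annotations is not stated here.

```python
from collections.abc import Callable


def count_outcomes(outcomes: list[str], keep: Callable[[str], bool]) -> dict[str, int]:
    """PEP 585 style: builtin generics plus Callable from collections.abc."""
    counts: dict[str, int] = {}
    for outcome in outcomes:
        if keep(outcome):
            counts[outcome] = counts.get(outcome, 0) + 1
    return counts


def capitalized_generics_fail() -> bool:
    """Show that Dict/List without a typing import raise NameError when the module loads."""
    source = "StepOutcomes = Dict[str, List[str]]"  # no 'from typing import ...'
    try:
        exec(source, {})
    except NameError:
        return True
    return False


tally = count_outcomes(
    ["validated", "validation_failed", "validated", "not_validated"],
    lambda outcome: outcome != "not_validated",
)
```

A NameError of this kind surfaces as soon as the module is imported, which is why a green test run against that module could not have been genuine.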

1.3 Instrumentation (earns the core its keep pre-decision)

1.4 Tests (red-to-green)

Phase 2 — STEP_STATE (commits the chosen shape)

Bookkeeping (2026-05-19, resume session). Phase 2 landed: 2.1 (columns+migration, 0c5e7c9f9d/8a36efb343), 2.2 (persist at execute, 8a36efb343), 2.3 (resolver lifted to a shared manager + ICJ rename, 3635680b8c; source-identity sibling added 156b9839e1), 2.4 (history_graph routed through the seam additively 156b9839e1; extract.py earlier in branch), 2.5 unit parity (156b9839e1). Polish caught on review: the 2.1 migration shipped request as JSON().with_variant(JSONB) while the ORM model + the mirrored tool_request migration use JSONType (BLOB-backed) — a real-DB corruption bug invisible to ORM-create_all test schemas; corrected in 1456e14b6b along with two stale Phase-N docstrings in modules.py. The resolver module docstring’s “extraction and the History Graph share one resolution path” was aspirational until 2.4 — now literally true. Remaining: nothing in this plan’s scope. 2.5’s API-suite parity case was added then dropped on review (see 2.5 first item) — the asserted shape didn’t discriminate the structured path from the legacy fallback; structured-vs-legacy discrimination is left to a future mechanism-level proof (config-disabled fallback at the integration tier), out of this plan’s scope. 151 unit assertions green; zero regression on the 44 pre-existing History Graph cases (additive design, byte-identical tool_request path).

2.1 Schema — lib/galaxy/model/__init__.py + one alembic migration

2.2 Persist at execute time

2.3 Resolver — the stable seam

2.4 Repoint the two consumers through the resolver (~4 files total surface)

2.5 Tests

Files to touch (checklist)

File | Phase | Scope
--- | --- | ---
lib/galaxy/tool_util/parameters/convert.py / __init__.py | 1 | the converter (inverse of expand_meta_parameters_async; sibling of to_workflow_step_state)
lib/galaxy/tool_util/parameters/request.py | 1 | reuse shared ref-walk; extend only if necessary
lib/galaxy/workflow/modules.py (~2877, ~2888) | 1 | feed validated MappingParameters fields; instrument
test/unit/tool_util/test_parameter_convert.py, parameter_specification.yml | 1 | converter red-to-green
lib/galaxy/model/__init__.py (WorkflowInvocationStep :10466) + alembic | 2 | request + request_state columns
resolver helper (manager) | 2 | the stable seam
lib/galaxy/managers/history_graph.py (:301-389) | 2 | resolve via seam; wire shape unchanged
lib/galaxy/workflow/extract.py (step_inputs_by_id) | 2 | resolve via seam
test/unit/app/managers/test_HistoryGraphBuilder.py | 2 | workflow-produced parity (API parity case dropped on review; see 2.5)

Out of scope (do not pull in)

Unresolved questions