CURRENT_PROBLEMS

Current Problems — workflow_state_backfill

Running ledger of problems uncovered while reviewing the branch. Each entry: what’s broken, why, where, what to do.


1. UNIQUE(tool_execution_state_id) on job rejects batch tool requests

Status: regression, not yet released. Introduced by 1f87eebce3 Replace TRICA with TES-keyed TEICA; tighten TES back-pops to scalar (migration 10c4cd393d5a).

Symptom

A tool request whose request input expands to N>1 param combinations without a collection map-over (i.e., a Batch over plain HDAs) fails on the second Job insert with a UNIQUE constraint violation on job.tool_execution_state_id. The first Job stamps the TR’s TES; the second Job tries to stamp the same TES and is rejected.

The shape that triggers it (example request body):

{
  "inputs": {
    "input_file": {
      "__class__": "Batch",
      "values": [{"src": "hda", "id": 1}, {"src": "hda", "id": 2}]
    }
  }
}

Two Jobs, one TR, one TES → second insert violates the UNIQUE.

Root cause

The commit tightened all four TES back-pops from list to scalar:

…and added partial-UNIQUE(tool_execution_state_id) on each peer table (lib/galaxy/model/migrations/alembic/versions_gxy/10c4cd393d5a_drop_trica_and_tighten_tes_backpops.py:46-51, :89).

Three of the four are sound — ToolRequest, WorkflowInvocationStep, and ImplicitCollectionJobs are 1:1 with TES by writer construction:

Job is loose. The stamp at tools/execute.py:261-262 runs once per param combination inside _execute:

if tool_execution_state is not None and execution_tracker.collection_info is None:
    job.tool_execution_state = tool_execution_state

The same tool_execution_state instance is reused for every iteration. For a batch TR with collection_info is None and len(execution_tracker.param_combinations) > 1, all N Jobs stamp the same TES row, and the UNIQUE rejects the 2nd…Nth.

Why the framing slipped

The commit message (“tighten TES back-pops to scalar”) encoded the WIS-side invariant — one WIS = one step execution event = at most one materialized Job XOR one ICJ. That framing is correct for the constrained scheduling surface (WIS), but TR is the looser scheduling surface and inherits Job’s 1:N cardinality through TES. The constraint walked an asymmetric invariant onto a symmetric column.

See EFFECTIVE_STEP_STATE.md for the deeper framing: TR and WIS sit at different granularities of “an execution,” and the schema needs to honor both.

Reachability of the broken case

Confirmed reachable on the async TR path:

This is a long-standing tool-input pattern (multi-file submission without HDCA wrapping), not a hypothetical.

Affected readers — quick survey

  1. Revert the Job UNIQUE. Drop uq_job_tool_execution_state_id from migration 10c4cd393d5a. Keep the other three UNIQUEs — TR/WIS/ICJ remain genuinely 1:1.
  2. Flip TES.job back to list["Job"] (model/__init__.py:1485) with uselist defaulted and back_populates="tool_execution_state". Job.tool_execution_state stays scalar.
  3. Update docs to reflect “TES↔Job is 1:N for batch TRs.” The BOOKKEEPING_MODELS and MODELS_VISUAL_TOUR notice lists currently say the four back-pops are scalar; that’s now true for three of four.
  4. Fold the job_ids extract dedup into the same PR or as an immediate follow-up: collapse adjacent work_items sharing (tier, source_id) after the sort at workflow/extract.py:850. Tests should cover the batch-of-N extraction shape.

Open question

Does any production install already carry data that depended on the UNIQUE? Since 10c4cd393d5a is on this unreleased branch, no — but verify before merging the revert that no fixture or test relies on the constraint being there.