Plan: Structured YAML Modeling for then Expressions — DONE
Overview
Convert the 32 free-form then string expressions in collection_semantics.yml into structured, typed YAML to enable richer rendering, validation, and programmatic use.
Catalog of All then Expressions (32 total)
Category A: Map-Over Producing Implicit Collection (8)
| Label | Expression |
|---|---|
| BASIC_MAPPING_PAIRED | tool(i=mapOver(C)) ~> {o: collection<paired, ...>} |
| BASIC_MAPPING_PAIRED_OR_UNPAIRED_PAIRED | tool(i=mapOver(C)) ~> {o: collection<paired_or_unpaired,...>} |
| BASIC_MAPPING_PAIRED_OR_UNPAIRED_UNPAIRED | tool(i=mapOver(C)) ~> {o: collection<paired_or_unpaired,...>} |
| BASIC_MAPPING_LIST | tool(i=mapOver(C)) ~> {o: collection<list,...>} |
| NESTED_LIST_MAPPING | tool(i=mapOver(C)) ~> {o: collection<list:list,...>} |
| BASIC_MAPPING_LIST_PAIRED_OR_UNPAIRED | tool(i=mapOver(C)) ~> {o: collection<list:paired_or_unpaired,...>} |
| BASIC_MAPPING_INCLUDING_SINGLE_DATASET | tool(i=mapOver(C),i2=d_o) ~> ... |
| BASIC_MAPPING_TWO_INPUTS_WITH_IDENTICAL_STRUCTURE | tool(i=mapOver(C1), i2=mapOver(C2)) ~> ... |
Category B: Direct Reduction (4)
| Label | Expression |
|---|---|
| COLLECTION_INPUT_PAIRED | tool(i=C) -> {o: dataset} |
| COLLECTION_INPUT_LIST | tool(i=C) -> {o: dataset} |
| COLLECTION_INPUT_PAIRED_OR_UNPAIRED | tool(i=C) -> {o: dataset} |
| COLLECTION_INPUT_LIST_PAIRED_OR_UNPAIRED | tool(i=C) -> {o: dataset} |
Category C: Invalid Operations (10)
Bare invocations with is_valid: false. Includes COLLECTION_INPUT_LIST_PAIRED_NOT_CONSUMES_PAIRED_PAIRED and COLLECTION_INPUT_LIST_PAIRED_OR_NOT_PAIRED_NOT_CONSUMES_PAIRED_PAIRED added in ebdc057528.
Category D: Equivalence Assertions (4)
Using == operator: LIST_REDUCTION, PAIRED_OR_UNPAIRED_CONSUMES_PAIRED, MAPPING_LIST_PAIRED_OVER_PAIRED_OR_UNPAIRED, MAPPING_LIST_LIST_PAIRED_OVER_PAIRED_OR_UNPAIRED.
Category E: Explicit Invalidity (1)
PAIRED_OR_UNPAIRED_NOT_CONSUMED_BY_PAIRED uses “is invalid” text.
Category F: Sub-collection Map-Over (3)
MAPPING_LIST_PAIRED_OVER_PAIRED, NESTED_LIST_REDUCTION, MAPPING_LIST_LIST_OVER_LIST_PAIRED_OR_UNPAIRED. Note: the last uses compound sub_collection_type list:paired_or_unpaired.
Category G: Single-dataset Sub-collection (2)
MAPPING_LIST_OVER_PAIRED_OR_UNPAIRED, MAPPING_LIST_LIST_OVER_PAIRED_OR_UNPAIRED. The latter produces list:list output (vs list for the former).
Proposed Structured YAML Schema
Top-Level then Discriminated Union
# map_over: tool with mapOver producing implicit collection
then:
  type: map_over
  invocation: <ToolInvocation>
  produces: <OutputMap>

# reduction: tool consumes collection directly
then:
  type: reduction
  invocation: <ToolInvocation>
  produces: <OutputMap>

# equivalence: two invocations are semantically equal
then:
  type: equivalence
  left: <ToolInvocation>
  right: <ToolInvocation>

# invalid: operation is not valid
then:
  type: invalid
  invocation: <ToolInvocation>
ToolInvocation
invocation:
  inputs:
    <input_name>:
      type: dataset                  # direct dataset ref
      ref: d_f

      type: map_over                 # mapOver(C) or mapOver(C, 'paired')
      collection: C
      sub_collection_type: paired    # optional

      type: collection               # direct collection input
      ref: C

      type: dataset_list             # inline list [d_1,...,d_n]
      refs: [d_1, "...", d_n]
OutputMap
produces:
  <output_name>:
    type: dataset                    # simple dataset

    type: collection                 # implicit collection
    collection_type: paired
    elements:
      <id>:
        type: tool_output_ref        # tool(i=d_f)[o]
        invocation: <ToolInvocation>
        output: o

        type: nested_elements        # for nested collections
        elements: { ... }
Worked Example: BASIC_MAPPING_PAIRED
Current:
then: "tool(i=mapOver(C)) ~> {o: collection<paired,{forward=tool(i=d_f)[o], reverse=tool(i=d_r)[o]}>}"
Proposed:
then:
  type: map_over
  invocation:
    inputs:
      i:
        type: map_over
        collection: C
  produces:
    o:
      type: collection
      collection_type: paired
      elements:
        forward:
          type: tool_output_ref
          invocation:
            inputs:
              i: { type: dataset, ref: d_f }
          output: o
        reverse:
          type: tool_output_ref
          invocation:
            inputs:
              i: { type: dataset, ref: d_r }
          output: o
Worked Example: COLLECTION_INPUT_PAIRED (Reduction)
then:
  type: reduction
  invocation:
    inputs:
      i: { type: collection, ref: C }
  produces:
    o: { type: dataset }
Pydantic Model Changes
New models in semantics.py:
# Input binding types (discriminated union on "type")
DatasetInput, MapOverInput, CollectionInput, DatasetListInput
# Tool invocation
ToolInvocation(inputs: dict[str, InputBinding])
# Output types
DatasetOutput, ToolOutputRef, NestedElements, CollectionOutput
# Top-level then types (discriminated union on "type")
MapOverThen, ReductionThen, EquivalenceThen, InvalidThen
Example.then changes from Optional[str] to Optional[Union[str, ThenExpression]] during migration, then to Optional[ThenExpression] after full migration.
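The discriminated-union idea can be illustrated with a minimal sketch. It uses stdlib dataclasses as a stand-in for the Pydantic models named above, covers only two of the four then types, and leaves the output models untyped. The helpers parse_input, parse_invocation, and parse_then are hypothetical names for illustration; in the real models, Pydantic's discriminated-union support would presumably do this dispatch.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class DatasetInput:
    ref: str


@dataclass
class MapOverInput:
    collection: str
    sub_collection_type: Optional[str] = None


@dataclass
class CollectionInput:
    ref: str


InputBinding = Union[DatasetInput, MapOverInput, CollectionInput]


@dataclass
class ToolInvocation:
    inputs: Dict[str, InputBinding]


@dataclass
class InvalidThen:
    invocation: ToolInvocation


@dataclass
class ReductionThen:
    invocation: ToolInvocation
    produces: Dict[str, Any]  # output models are omitted in this sketch


# Maps the "type" discriminator to the class that handles it.
INPUT_KINDS = {
    "dataset": DatasetInput,
    "map_over": MapOverInput,
    "collection": CollectionInput,
}


def parse_input(raw: Dict[str, Any]) -> InputBinding:
    """Build an input binding, choosing the class from raw["type"]."""
    kind = raw.get("type")
    if kind not in INPUT_KINDS:
        raise ValueError(f"unknown input type: {kind!r}")
    fields = {key: value for key, value in raw.items() if key != "type"}
    return INPUT_KINDS[kind](**fields)


def parse_invocation(raw: Dict[str, Any]) -> ToolInvocation:
    return ToolInvocation({name: parse_input(b) for name, b in raw["inputs"].items()})


def parse_then(raw: Dict[str, Any]) -> Union[InvalidThen, ReductionThen]:
    """Dispatch on the top-level "type" discriminator."""
    kind = raw.get("type")
    if kind == "invalid":
        return InvalidThen(parse_invocation(raw["invocation"]))
    if kind == "reduction":
        return ReductionThen(parse_invocation(raw["invocation"]), raw["produces"])
    raise ValueError(f"unsupported then type in this sketch: {kind!r}")


# The COLLECTION_INPUT_PAIRED worked example, written as a plain dict.
reduction = parse_then(
    {
        "type": "reduction",
        "invocation": {"inputs": {"i": {"type": "collection", "ref": "C"}}},
        "produces": {"o": {"type": "dataset"}},
    }
)
```

An unknown type value fails loudly here, which is the main practical gain over free-form strings.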
LaTeX Rendering
Replace expression_to_latex() string substitution with as_latex() methods on each Pydantic model. Each model knows how to render itself to LaTeX. The generate_docs() dispatch changes from:
expression_to_latex(example.then)
to:
example.then.as_latex()
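A rough sketch of what per-model rendering might look like follows. Dataclasses stand in for the Pydantic models, and the LaTeX produced here (\text{tool}, \text{mapOver}, and the escape_latex helper) is an assumed format chosen for illustration; it is not claimed to match what expression_to_latex() emits today.

```python
from dataclasses import dataclass
from typing import Dict, Union


def escape_latex(name: str) -> str:
    """Escape underscores so raw identifiers are safe in LaTeX (assumed rule)."""
    return name.replace("_", r"\_")


@dataclass
class MapOverInput:
    collection: str

    def as_latex(self) -> str:
        return rf"\text{{mapOver}}({escape_latex(self.collection)})"


@dataclass
class CollectionInput:
    ref: str

    def as_latex(self) -> str:
        return escape_latex(self.ref)


@dataclass
class ToolInvocation:
    inputs: Dict[str, Union[MapOverInput, CollectionInput]]

    def as_latex(self) -> str:
        # Each binding renders itself; the invocation only joins the pieces.
        args = ", ".join(f"{name}={binding.as_latex()}" for name, binding in self.inputs.items())
        return rf"\text{{tool}}({args})"


@dataclass
class InvalidThen:
    invocation: ToolInvocation

    def as_latex(self) -> str:
        return self.invocation.as_latex()


rendered = InvalidThen(ToolInvocation({"i": MapOverInput("C_PAIRED")})).as_latex()
```

Because each class renders only its own fragment, adding a new input or then type does not require touching a central substitution function.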
Migration Approach (3 Phases)
Phase 1: Add new models, keep backward compat
- Add all new Pydantic models + as_latex() methods
- Change Example.then to Union[str, ThenExpression]
- Keep expression_to_latex() for string case
- Dispatch on isinstance in generate_docs()
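The Phase 1 dispatch could look roughly like the sketch below. The function render_then and the StubThen class are hypothetical, and the body of expression_to_latex() is a placeholder, since the real string-substitution logic is not reproduced in this plan.

```python
from typing import Union


class StubThen:
    """Stand-in for any structured ThenExpression model (hypothetical)."""

    def as_latex(self) -> str:
        return r"\text{structured}"


def expression_to_latex(expression: str) -> str:
    """Placeholder for the existing string renderer; real logic not shown."""
    return expression.replace("_", r"\_")


def render_then(then: Union[str, StubThen]) -> str:
    """Render either form while both coexist during migration."""
    if isinstance(then, str):
        # Not yet migrated: fall back to the legacy string path.
        return expression_to_latex(then)
    return then.as_latex()


legacy = render_then("tool(i=C_PAIRED)")
structured = render_then(StubThen())
```

Once Phase 3 removes str from the Union, the isinstance branch and the legacy function can be deleted together.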
Phase 2: Convert YAML expressions by category (simplest first)
- Category C (10 invalid) - simplest structure
- Category B (4 reductions)
- Category D (4 equivalences)
- Category E (1 explicit invalid)
- Category A (8 map-overs) - most complex
- Categories F+G (5 sub-collection)
For each: convert YAML, run doc gen, diff markdown to verify identical LaTeX output.
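The per-category check could be automated along these lines. This is a sketch using the stdlib difflib module; the function name markdown_diff is hypothetical, and how the baseline and regenerated markdown are obtained (for example, reading doc/source/dev/collection_semantics.md before and after running doc generation) is left to the caller.

```python
import difflib
from typing import List


def markdown_diff(baseline: str, regenerated: str) -> List[str]:
    """Return unified-diff lines; an empty list means the output is identical."""
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            regenerated.splitlines(),
            fromfile="baseline",
            tofile="regenerated",
            lineterm="",
        )
    )


same = markdown_diff("# Title\n$tool(i=C)$", "# Title\n$tool(i=C)$")
changed = markdown_diff("# Title\n$tool(i=C)$", "# Title\n$tool(i=D)$")
```

Running this after each converted category keeps any regression attributable to that category alone.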
Phase 3: Remove backward compat
- Remove str from Union
- Remove expression_to_latex()
- Clean up dispatch
Testing Strategy (Red-to-Green)
Test file: test/unit/data/model/test_collection_semantics.py (existing — 40 tests for models, LaTeX helpers, validators)
- as_latex() round-trip tests: For each of 32 expressions, assert structured.as_latex() matches expression_to_latex(original_string). Write test first, implement to make green.
- Pydantic validation tests: Assert model_validate produces correct types/fields.
- Full doc generation regression test: Diff generated markdown against snapshot.
Test ordering:
- InvalidThen tests -> implement -> ReductionThen tests -> implement -> … up to MapOverThen with nested outputs.
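One way to drive the red-to-green loop is a helper that reports which labels still render differently. This is a sketch only: find_mismatches is a hypothetical name, and the two renderers are passed in as callables so the helper stays independent of the real models.

```python
from typing import Any, Callable, Dict, List, Tuple


def find_mismatches(
    cases: Dict[str, Tuple[str, Any]],
    render_string: Callable[[str], str],
    render_structured: Callable[[Any], str],
) -> List[str]:
    """Return labels whose structured rendering differs from the legacy one."""
    mismatched = []
    for label, (original, structured) in cases.items():
        if render_string(original) != render_structured(structured):
            mismatched.append(label)
    return mismatched


# Toy renderers and cases, purely for illustration.
cases = {
    "GOOD": ("a", {"value": "a"}),
    "BAD": ("b", {"value": "x"}),
}
remaining = find_mismatches(cases, lambda s: s, lambda model: model["value"])
```

The returned list starts out containing every unimplemented label and should shrink to empty as each then type gains its as_latex() method.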
Critical Files
| File | Role |
|---|---|
lib/galaxy/model/dataset_collections/types/semantics.py | All new Pydantic models + as_latex() methods |
lib/galaxy/model/dataset_collections/types/collection_semantics.yml | Migrate all then strings to structured YAML |
doc/source/dev/collection_semantics.md | Regression baseline for LaTeX output |
test/unit/data/model/test_collection_semantics.py | Existing test file (40 tests) |
Potential Challenges
- Ellipsis representation: ... pattern in elements needs _ellipsis: true sentinel or list-of-pairs approach
- Escaped backslashes: Current C\\_PAIRED becomes raw C_PAIRED with LaTeX escaping in renderer - cleaner
- Verbosity: Structured YAML is 5-15x more lines per expression - trade-off for machine-readability
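The two ellipsis options can be compared side by side. In this sketch the parsed YAML is shown as plain Python data; the exact shape of each encoding (a true-valued _ellipsis key whose position marks the elided run, and a list entry with type: ellipsis) is an assumption, as are the helper names.

```python
from typing import Any, Dict, List

ELLIPSIS_KEY = "_ellipsis"  # sentinel name from the plan; its semantics are assumed


def ids_from_sentinel_dict(elements: Dict[str, Any]) -> List[str]:
    """Option 1: a mapping in which the sentinel key marks the elided run."""
    return ["..." if key == ELLIPSIS_KEY else key for key in elements]


def ids_from_pairs(elements: List[Dict[str, Any]]) -> List[str]:
    """Option 2: an ordered list whose entries are elements or ellipsis markers."""
    return ["..." if entry.get("type") == "ellipsis" else entry["id"] for entry in elements]


sentinel_form = {
    "d_1": {"type": "dataset"},
    "_ellipsis": True,
    "d_n": {"type": "dataset"},
}
pairs_form = [
    {"id": "d_1", "type": "dataset"},
    {"type": "ellipsis"},
    {"id": "d_n", "type": "dataset"},
]

from_sentinel = ids_from_sentinel_dict(sentinel_form)
from_pairs = ids_from_pairs(pairs_form)
```

Since mapping keys are unique, the sentinel form can hold at most one ellipsis and depends on the loader preserving key order; the list form has neither limitation but is more verbose.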
Unresolved Questions
- _ellipsis sentinel key in dict vs list-of-pairs with explicit ellipsis entry type?
- Should is_valid: false be removed in favor of type: invalid, or kept as redundant validation?
- Should schema support future expression types beyond current 4 (e.g., collection_creation)?
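If is_valid is kept as redundant validation, the cross-check might be as small as the sketch below. It assumes is_valid sits on the example next to then and defaults to true when absent; both points, and the function name check_validity_consistency, are assumptions rather than details taken from the schema.

```python
from typing import Any, Dict


def check_validity_consistency(example: Dict[str, Any]) -> None:
    """Raise if is_valid and then.type disagree about validity."""
    then = example.get("then")
    if not isinstance(then, dict):
        return  # legacy string or missing expression: nothing to cross-check
    declared_invalid = example.get("is_valid", True) is False
    typed_invalid = then.get("type") == "invalid"
    if declared_invalid != typed_invalid:
        raise ValueError(
            f"is_valid={example.get('is_valid', True)!r} conflicts with "
            f"then.type={then.get('type')!r}"
        )


# Consistent example: both fields say the operation is invalid.
check_validity_consistency({"is_valid": False, "then": {"type": "invalid"}})
```

Dropping is_valid entirely would make this check unnecessary, at the cost of losing the redundancy during migration.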