Plan: Fix / in Step Labels Breaking Source Reference Parsing
GitHub issue: https://github.com/galaxyproject/gxformat2/issues/145
Problem
BaseConversionContext.step_output() splits source references on / to separate step_label from output_name. When a step label contains / (e.g., Host/Contaminant Reference Genome), the split produces garbage. Since the label fragment isn’t in self.labels, it falls through to int() and crashes.
3 of 112 IWC workflows hit this: host-or-contamination-removal-on-short-reads.ga, fragment-based-docking-scoring.ga, iwc-clinicalmp-database-generation.ga.
The Format
Source references in format2 use / as the separator between step label and output name:
in:
input1: some_step/output_name # step "some_step", output "output_name"
input2: some_step # step "some_step", output "output" (implicit)
outputs:
result:
outputSource: some_step/out # same convention
When a label IS Host/Contaminant Reference Genome and the output IS output (implicit), the export side produces:
reference_genome|index:
source: Host/Contaminant Reference Genome # no /output suffix — elided
The reimport sees Host/Contaminant Reference Genome, splits on first /, gets ("Host", "Contaminant Reference Genome"), tries step_id("Host") → int("Host") → crash.
Affected Code Paths
1. BaseConversionContext.step_output() — converter.py:586
def step_output(self, value):
value_parts = str(value).split("/")
if len(value_parts) == 1:
value_parts.append("output")
id = self.step_id(value_parts[0])
return id, value_parts[1]
Called from:
- Output source resolution (converter.py:225):
id, output_name = conversion_context.step_output(source) - Input connection population (converter.py:702):
step_id, output_name = context.step_output(source) - When condition source (converter.py:473):
step_id, output_name = context.step_output(step["when"]["source"])
Fix strategy: self.labels contains all registered step labels at call time. Try matching known labels before falling back to split.
2. normalize.py — lines 152, 176, 183
output_source.split("/", 1)
No ConversionContext available in normalize — needs its own label set built from the workflow dict.
Fix strategy: Build a label set from inputs + steps keys at the top of the normalize function, then use a shared _parse_source_reference(value, known_labels) helper.
3. _to_source() — export.py:370 (NOT affected)
Builds source strings by concatenation — f"{output_label}/{output_name}". If label contains /, the produced string is ambiguous, but the export direction doesn’t parse it. The reimport is what breaks.
Note: The export producing Host/Contaminant Reference Genome/mapping_stats is inherently ambiguous. This is fine as long as the parser can resolve it using known labels, which it can.
Implementation
Step 1: Add _resolve_source_reference() helper
def _resolve_source_reference(value: str, known_labels) -> tuple[str, str]:
"""Parse a source reference into (step_label_or_id, output_name).
Tries matching known labels first to handle labels containing '/'.
Falls back to split on '/' for numeric step IDs.
"""
# Try known labels (longest match first to handle nested ambiguity)
for label in sorted(known_labels, key=len, reverse=True):
if value == label:
return label, "output"
if value.startswith(label + "/"):
return label, value[len(label) + 1:]
# Fallback: split on first '/'
if "/" in value:
parts = value.split("/", 1)
return parts[0], parts[1]
return value, "output"
Sorting by length descending handles edge cases where one label is a prefix of another (unlikely but defensive).
Step 2: Update BaseConversionContext.step_output()
def step_output(self, value):
label_or_id, output_name = _resolve_source_reference(str(value), self.labels)
return self.step_id(label_or_id), output_name
Step 3: Update normalize.py
Build a label set at the start of normalize functions, pass to _resolve_source_reference:
known_labels = set()
for step_label, _ in walk_id_list_or_dict(workflow_dict.get("steps", {})):
known_labels.add(str(step_label))
for input_label, _ in walk_id_list_or_dict(workflow_dict.get("inputs", {})):
known_labels.add(str(input_label))
Replace output_source.split("/", 1) calls with _resolve_source_reference(output_source, known_labels).
Step 4: Tests
-
Round-trip test with
/in input label: Create a minimal workflow with an input labeledHost/Contaminant Genome, a tool step connected to it, verify round-trip preserves connections. -
Round-trip test with
/in label AND non-default output: Ensuresome/label/custom_outputcorrectly parses as("some/label", "custom_output")whensome/labelis a known label. -
Round-trip test with
/in label referencing implicit output:some/label(no output suffix) →("some/label", "output"). -
Normalize test: Verify
outputSource: "Host/Contaminant Genome/mapping_stats"resolves correctly. -
Regression test: Existing tests still pass — labels without
/should be unaffected.
Step 5: Verify IWC workflows
Run the three previously-ERROR workflows through round-trip and confirm they now produce OK or MISMATCH (not ERROR).
Where to Put the Helper
_resolve_source_reference should live in model.py (alongside clean_connection and other source-handling utilities) and be importable by both converter.py and normalize.py.
Edge Cases
- Label is a numeric string (e.g.,
"0"): Already handled —step_iddoesself.labels[label_or_id]first, falls through toint()only if not in labels. This is existing behavior. - Label contains multiple
/(e.g.,A/B/C): Longest-match against known labels handles this. IfA/B/Cis a label, it matches beforeA/BorA. - Label is a prefix of another label (e.g.,
A/BandA/B/Cboth exist): Longest-match-first ordering handles this —A/B/Cis tried beforeA/B. - Two labels where one +
/outputlooks like the other (e.g., labelfooand labelfoo/output): Ambiguous. Longest match resolves it tofoo/outputwith implicitoutput. This is a pathological case unlikely in practice.
Unresolved Questions
- Should
_to_source()on the export side escape or quote labels containing/? This would eliminate ambiguity at the source, but would change the format2 representation and require a corresponding unescape on reimport. Probably not worth it — the label-aware parser handles it without format changes.