Galaxy paired_or_unpaired collections
Audience: a Mold author shaping a Galaxy workflow interface from an upstream (CWL / Nextflow / paper) source whose reads can be paired-end or single-end or a mixed batch of both.
The shape
paired_or_unpaired is a Galaxy collection type modeling a discriminated union of 1 or 2 elements:
- Unpaired variant: one element with identifier unpaired.
- Paired variant: two elements with identifiers forward and reverse.
list:paired_or_unpaired lifts the same shape to a heterogeneous batch where some samples are paired and some are single-end — a representation that did not exist before this type. A list:paired forces every sample to be paired; a plain list of flat datasets loses pairing structure.
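As a rough illustration of the union described above, the sketch below models a collection as a plain mapping from element identifier to dataset path and classifies it by its identifiers. The Collection alias, the classify_variant helper, and the sample names are hypothetical stand-ins for this note, not Galaxy's data model.

```python
from typing import Dict, Literal

# Hypothetical stand-in: a collection is modeled as {element identifier: dataset path}.
Collection = Dict[str, str]


def classify_variant(collection: Collection) -> Literal["paired", "unpaired"]:
    """Return which variant of the paired_or_unpaired union a collection represents."""
    identifiers = set(collection)
    if identifiers == {"unpaired"}:
        return "unpaired"
    if identifiers == {"forward", "reverse"}:
        return "paired"
    raise ValueError(f"not a paired_or_unpaired shape: {sorted(identifiers)}")


# A mixed batch, shaped the way list:paired_or_unpaired would hold it
# (sample names and file names are made up for illustration).
batch = {
    "sample_a": {"forward": "a_1.fastq", "reverse": "a_2.fastq"},
    "sample_b": {"unpaired": "b.fastq"},
}
variants = {name: classify_variant(elements) for name, elements in batch.items()}
print(variants)
```

A list:paired could not hold sample_b, and a flat list would lose the forward/reverse grouping of sample_a; the mixed batch needs both variants side by side.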
Rank constraint: paired_or_unpaired may be nested to any depth (list:paired_or_unpaired, list:list:paired_or_unpaired), but it must sit at the deepest (innermost) rank, because the subtyping logic is implemented at the suffix level. See “Limitation: only deepest rank” below.
When to reach for it (decision rule for translators)
Reach for paired_or_unpaired when the upstream workflow declares either of:
- Two or more optional read-like inputs (e.g., CWL forward_reads: File?, reverse_reads: File?, single_reads: File?) gated by mutually-exclusive when: predicates that branch on which inputs are present.
- A single workflow input that already carries “could be paired, could be single” semantics (Nextflow meta.single_end, paper “we accept paired-end or single-end reads”).
Don’t reach for it when:
- The upstream workflow has two unrelated file inputs that aren’t a paired/single pair (then keep them as separate inputs).
- The upstream produces an explicit mode switch that downstream tooling depends on for non-mode reasons (rare).
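The first trigger in the decision rule can be approximated mechanically. The sketch below assumes the CWL inputs have already been reduced to a name-to-type mapping and that read-like inputs can be recognised by name; the READ_LIKE pattern and both helper names are hypothetical. It checks only the input declarations, so the when: gating still needs a separate look at the steps.

```python
import re
from typing import Dict, List

# Assumption: read-like inputs are recognised by name. This pattern is illustrative only.
READ_LIKE = re.compile(r"(reads?|fastq|forward|reverse|single)", re.IGNORECASE)


def optional_read_inputs(cwl_inputs: Dict[str, str]) -> List[str]:
    """Names of optional File inputs (CWL shorthand 'File?') that look like reads."""
    return sorted(
        name
        for name, cwl_type in cwl_inputs.items()
        if cwl_type.strip() == "File?" and READ_LIKE.search(name)
    )


def suggests_paired_or_unpaired(cwl_inputs: Dict[str, str]) -> bool:
    """True when two or more optional read-like inputs are declared."""
    return len(optional_read_inputs(cwl_inputs)) >= 2


# Inputs shaped like the CWL example in the rule above.
candidate = {
    "forward_reads": "File?",
    "reverse_reads": "File?",
    "single_reads": "File?",
    "reference": "File",
}
# Two unrelated inputs: a required reads file and an optional adapters file.
unrelated = {"reads": "File", "adapters": "File?"}

print(suggests_paired_or_unpaired(candidate), suggests_paired_or_unpaired(unrelated))
```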
Subtyping (concrete matching table)
paired IS-A paired_or_unpaired; the reverse is not true. The table below is derived from can_match_type in lib/galaxy/model/dataset_collections/type_description.py:
| Input expects | Data provided | Match? |
|---|---|---|
| paired_or_unpaired | paired | ✅ (paired is a subtype) |
| paired_or_unpaired | paired_or_unpaired | ✅ exact match |
| paired | paired_or_unpaired | ❌ may lack forward/reverse |
| list:paired_or_unpaired | list:paired | ✅ each element treated as paired variant |
| list:paired_or_unpaired | list | ✅ each element treated as unpaired variant |
| list:paired | list:paired_or_unpaired | ❌ some elements may be unpaired |
The asymmetry has a consequence for downstream wiring: if a workflow uses a paired_or_unpaired upstream of a step that strictly requires paired, the Galaxy editor rejects the connection with:
“Cannot attach optionally paired outputs to inputs requiring pairing, consider using the Split Paired and Unpaired tool to extract just the pairs out from this output.”
The escape hatch is the built-in tool __SPLIT_PAIRED_AND_UNPAIRED__.
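The matching table above can be captured in a few lines. This is a simplified model written for this note, covering only the rows shown; it is not a copy of Galaxy's can_match_type, and the can_match name is hypothetical.

```python
SUFFIX = "paired_or_unpaired"


def can_match(expected: str, provided: str) -> bool:
    """Simplified model of the matching table; not Galaxy's actual implementation."""
    if expected == provided:
        return True
    if expected == SUFFIX:
        # paired is a subtype of paired_or_unpaired.
        return provided == "paired"
    if expected.endswith(":" + SUFFIX):
        outer = expected[: -len(":" + SUFFIX)]
        # "<outer>:paired" -> every element is the paired variant;
        # "<outer>"        -> every element is the unpaired variant.
        return provided in (f"{outer}:paired", outer)
    return False


rows = [
    ("paired_or_unpaired", "paired"),
    ("paired_or_unpaired", "paired_or_unpaired"),
    ("paired", "paired_or_unpaired"),
    ("list:paired_or_unpaired", "list:paired"),
    ("list:paired_or_unpaired", "list"),
    ("list:paired", "list:paired_or_unpaired"),
]
for expected, provided in rows:
    print(f"{expected:<26} <- {provided:<26} {can_match(expected, provided)}")
```

The two rows that fail are the ones where the editor would reject the connection and point to the split tool.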
Inside a tool wrapper
A tool declares a paired_or_unpaired input as:
<param name="reads" type="data_collection"
collection_type="paired_or_unpaired" label="Input reads" />
At command-build time the runtime exposes:
- $reads.has_single_item: True for the unpaired variant.
- $reads.single_item: the single element wrapper.
- $reads.forward / $reads['reverse']: for the paired variant.
Idiomatic Cheetah:
#if $reads.has_single_item:
cat $reads.single_item >> $out;
#else:
cat $reads.forward $reads['reverse'] >> $out;
#end if
Branching happens inside the wrapper, not at the workflow level. This is the key reason paired_or_unpaired collapses a CWL “paired-or-single subworkflow with when: on every step” into a single Galaxy workflow with no mode-switch parameter.
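To make the wrapper-internal branching concrete outside of Cheetah, the sketch below mirrors the same decision in Python. ReadsStandIn is a hypothetical mock that only imitates the property names listed above (has_single_item, single_item, forward, reverse); it is not Galaxy's wrapper class, and build_command is an illustrative helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadsStandIn:
    """Hypothetical mock of the runtime properties; not Galaxy's wrapper class."""

    forward: Optional[str] = None
    reverse: Optional[str] = None
    single_item: Optional[str] = None

    @property
    def has_single_item(self) -> bool:
        # True for the unpaired variant, mirroring the property described above.
        return self.single_item is not None


def build_command(reads: ReadsStandIn, out: str) -> str:
    """Mirror the Cheetah branch: one command line per variant."""
    if reads.has_single_item:
        return f"cat {reads.single_item} >> {out}"
    return f"cat {reads.forward} {reads.reverse} >> {out}"


print(build_command(ReadsStandIn(single_item="s.fastq"), "out.fastq"))
print(build_command(ReadsStandIn(forward="r1.fastq", reverse="r2.fastq"), "out.fastq"))
```

The workflow hands the same input to the tool in both cases; only the generated command line differs.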
Translation playbook
For a Mold translating from CWL / Nextflow:
- Detect the mode-discrimination pattern in the source (multiple optional reads inputs with when:-gated steps, or meta.single_end-style metadata).
- Recommend the paired_or_unpaired shape in the interface brief as the primary option, ahead of “three optional File? inputs” or a workflow-level reads_mode switch.
- In the data-flow brief, model the per-step branching as a wrapper-internal concern (has_single_item / forward / reverse) rather than a workflow-level when:. The branching disappears from the gxformat2 topology.
- For batched / list inputs (mixed paired+single batches), use list:paired_or_unpaired. This is the canonical shape for a Galaxy port of an MGnify-style amplicon QC pipeline driven by a sample sheet.
- Watch for downstream paired-only tools. If any step strictly needs paired, insert __SPLIT_PAIRED_AND_UNPAIRED__ upstream of that step (and document that the unpaired samples are dropped at that fork).
- Don’t synthesize a workflow-level reads_mode select parameter for what is structurally a collection-shape concern.
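The fork introduced by the split step in the playbook can be pictured as a partition of the mixed batch. The sketch below models only that effect on plain dictionaries; the function name, the dictionary model, and the two-output shape are assumptions made for illustration and do not describe the internals of __SPLIT_PAIRED_AND_UNPAIRED__.

```python
from typing import Dict, Tuple

# Hypothetical stand-in: each sample maps element identifiers to dataset paths.
Elements = Dict[str, str]


def split_paired_and_unpaired(
    batch: Dict[str, Elements],
) -> Tuple[Dict[str, Elements], Dict[str, str]]:
    """Partition a mixed batch into a paired part and an unpaired part."""
    paired: Dict[str, Elements] = {}
    unpaired: Dict[str, str] = {}
    for sample, elements in batch.items():
        identifiers = set(elements)
        if identifiers == {"forward", "reverse"}:
            paired[sample] = elements
        elif identifiers == {"unpaired"}:
            unpaired[sample] = elements["unpaired"]
        else:
            raise ValueError(f"{sample}: unexpected identifiers {sorted(identifiers)}")
    return paired, unpaired


mixed = {
    "sample_a": {"forward": "a_1.fastq", "reverse": "a_2.fastq"},
    "sample_b": {"unpaired": "b.fastq"},
    "sample_c": {"forward": "c_1.fastq", "reverse": "c_2.fastq"},
}
paired_part, unpaired_part = split_paired_and_unpaired(mixed)
print(sorted(paired_part), sorted(unpaired_part))
```

A paired-only step wired to the paired part never sees sample_b, which is why the playbook asks for the drop to be documented.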
Limitation: only deepest rank
The subtyping logic is implemented at the suffix level. paired_or_unpaired works correctly when it’s the innermost type in a nested collection (list:paired_or_unpaired, list:list:paired_or_unpaired). Putting it at a non-deepest rank (e.g., paired_or_unpaired:list) does not match the regex / subtyping rules. For CWL workflows with scatter over read inputs that already produce list:list:File, careful rank ordering is required.
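A translator can guard against the non-deepest placement before emitting a collection type. The check below is a minimal sketch of the rule as stated here (paired_or_unpaired only as the last colon-separated rank); has_valid_rank is a hypothetical helper and does not reproduce Galaxy's validation regex.

```python
def has_valid_rank(collection_type: str) -> bool:
    """True when paired_or_unpaired, if present, is the innermost (last) rank."""
    ranks = collection_type.split(":")
    # Any occurrence before the final rank violates the deepest-rank rule.
    return "paired_or_unpaired" not in ranks[:-1]


for candidate_type in (
    "paired_or_unpaired",
    "list:paired_or_unpaired",
    "list:list:paired_or_unpaired",
    "paired_or_unpaired:list",
):
    print(candidate_type, has_valid_rank(candidate_type))
```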
Relationship to the IWC sibling-workflows convention
IWC’s MGnify amplicon ports (amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-quality-control-paired-end and …-single-end) publish two separate .gxwf.yml files, one per mode. That convention predates paired_or_unpaired, which landed with PR #19377 (late 2024 onward). With paired_or_unpaired available, a CWL→Galaxy translator may legitimately recommend either consolidating the two IWC ports into one workflow or continuing the two-sibling convention for backwards compatibility. Surface both options in the IWC comparison brief and let the user / template Mold decide.
Citations
- Introduced by galaxyproject/galaxy PR #19377 (“Empower Users to Build More Kinds of Collections, More Intelligently”), authored by John Chilton. Merge commit c212434dc8.
- Type plugin: lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py.
- Type validation regex: lib/galaxy/model/dataset_collections/type_description.py:15-17.
- Subtype matching: CollectionTypeDescription.can_match_type and has_subcollections_of_type in the same file, ~lines 76–124.
- Workflow editor rejection message: client/src/components/Workflow/Editor/modules/terminals.ts:658-663.
- Runtime wrapper properties: lib/galaxy/tools/wrappers.py:801, 805 (has_single_item, single_item).
- Formal semantics: lib/galaxy/model/dataset_collections/types/collection_semantics.yml (BASIC_MAPPING_PAIRED_OR_UNPAIRED_*, MAPPING_LIST_PAIRED_OVER_PAIRED_OR_UNPAIRED, etc.).
- Built-in tool __SPLIT_PAIRED_AND_UNPAIRED__: lib/galaxy/tools/__init__.py:4027-4032.
Evidence quality
- Galaxy-source observed (concrete): type definition, identifiers, subtype matching table, editor rejection message, has_single_item runtime API, __SPLIT_PAIRED_AND_UNPAIRED__ tool, limitation on deepest rank.
- Foundry inference (marked): the translation playbook (when to reach for paired_or_unpaired from CWL evidence) is an inference from the type’s stated purpose; corpus validation comes from re-running CWL → Galaxy test-drives.
- Speculation (marked): the IWC sibling-workflows convention may legitimately consolidate to a single paired_or_unpaired workflow under future updates; not validated against a re-ported IWC workflow yet.