Collection Tool Execution Semantics in Galaxy
A Technical White Paper for Galaxy Developers
Table of Contents
- Introduction
- Collection Type Hierarchy
- Input Mapping Semantics
- The Consume Value Model
- Output Collection Creation
- Runtime Representation
- Practical Examples
- Code References
1. Introduction
What Are Dataset Collections?
Dataset collections are Galaxy’s mechanism for grouping related datasets into structured containers that can be processed together. They enable:
- Batch processing: Apply a tool to multiple datasets simultaneously
- Parallelization: Galaxy automatically creates separate jobs for each element
- Structure preservation: Output collections maintain the same structure as inputs
- Workflow composability: Collections enable complex data flows without explicit loops
Why Collections Matter for Tool Execution
When a tool accepts a single dataset input but receives a collection:
- Galaxy “maps over” the collection, creating one job per element
- Outputs are gathered into an “implicit collection” matching the input structure
- Element identifiers flow through, enabling downstream matching
This happens transparently - tool authors need not be aware of collections for basic mapping scenarios.
Core Principle
Galaxy’s collection semantics follow a design principle: “what should just intuitively work.” Users connect steps and Galaxy determines the natural execution pattern. This document formalizes that intuition for implementers.
2. Collection Type Hierarchy
Fundamental Collection Types
Galaxy supports several base collection types, which can be nested:
| Type | Description | Element Identifiers |
|---|---|---|
list | Ordered sequence of datasets | User-defined (e.g., sample1, sample2) |
paired | Forward/reverse read pairs | Fixed: forward, reverse |
paired_or_unpaired | Single or paired reads | unpaired OR forward/reverse |
record | Heterogeneous named fields | Schema-defined field names |
sample_sheet | Tabular metadata with datasets | Row identifiers with column metadata |
Nested Collection Types
Collection types compose via the : separator:
list:paired- A list of paired datasets (e.g., multiple samples, each with forward/reverse)list:list- Nested lists (e.g., samples grouped by condition)list:paired_or_unpaired- List where each element may be single or paired
Type Validation Regex
Collection types must match this pattern (from type_description.py:15-17):
COLLECTION_TYPE_REGEX = re.compile(
r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)
Collection Type Description
The CollectionTypeDescription class (lib/galaxy/model/dataset_collections/type_description.py:31-175) provides methods for working with collection types:
class CollectionTypeDescription:
def child_collection_type(self):
"""Returns the collection type after removing the outermost rank."""
# e.g., "list:paired" -> "paired"
def effective_collection_type(self, subcollection_type):
"""Returns the type after consuming a subcollection type."""
# e.g., "list:list:paired".effective_collection_type("paired") -> "list:list"
def has_subcollections_of_type(self, other_collection_type):
"""Check if this type contains subcollections of another type."""
def rank_collection_type(self):
"""Returns the outermost collection type."""
# e.g., "list:paired" -> "list"
Dimension
The dimension of a collection type equals the number of : separators plus 2:
@property
def dimension(self):
return len(self.collection_type.split(":")) + 1
listhas dimension 2list:pairedhas dimension 3list:list:pairedhas dimension 4
3. Input Mapping Semantics
Single Collection Input Iteration (Map Over)
When a tool declares a simple dataset parameter (type="data") but receives a collection, Galaxy maps over the collection:
Example: Tool (i: dataset) => {o: dataset} with input list[d1, d2, d3]
Result: list[tool(i=d1)[o], tool(i=d2)[o], tool(i=d3)[o]]
The output collection preserves:
- The same collection type (
list) - The same element identifiers
- The same order
This extends naturally to nested collections:
Example: Tool (i: dataset) => {o: dataset} with input list:paired[{forward=f, reverse=r}]
Result: list:paired[{forward=tool(i=f)[o], reverse=tool(i=r)[o]}]
Multiple Collection Inputs
When a tool has multiple dataset inputs, Galaxy supports two modes:
Linked (Matched) Semantics (Default)
Collections are iterated in lockstep - element N of input A is paired with element N of input B:
# From meta.py:217
is_linked = value.get("linked", True) # Default is linked=True
Example: Tool (i1: dataset, i2: dataset) => {o: dataset}
- Input 1:
list[a1, a2, a3] - Input 2:
list[b1, b2, b3]
Result: list[tool(i1=a1, i2=b1)[o], tool(i1=a2, i2=b2)[o], tool(i1=a3, i2=b3)[o]]
Requirement: Linked collections must have identical structure (same element identifiers in same order).
Cross-Product (Unlinked) Semantics
When linked=False, Galaxy computes the Cartesian product:
# From meta.py:220-221
elif is_batch:
classification = input_classification.MULTIPLIED
Example: Tool (i1: dataset, i2: dataset) => {o: dataset}
- Input 1:
list[a1, a2](withlinked=False) - Input 2:
list[b1, b2]
Result: Every combination is executed:
tool(i1=a1, i2=b1)tool(i1=a1, i2=b2)tool(i1=a2, i2=b1)tool(i1=a2, i2=b2)
Galaxy provides built-in tools (__CROSS_PRODUCT_FLAT__, __CROSS_PRODUCT_NESTED__) to pre-compute cross-product structures for downstream linked mapping.
Element Identifier Matching
For linked collections, matching occurs by structure position, not by identifier name. The matching algorithm (lib/galaxy/model/dataset_collections/matching.py:40-120):
class MatchingCollections:
def __attempt_add_to_linked_match(self, input_name, hdca, collection_type_description, subcollection_type):
structure = get_structure(hdca, collection_type_description, leaf_subcollection_type=subcollection_type)
if not self.linked_structure:
self.linked_structure = structure
else:
if not self.linked_structure.can_match(structure):
raise exceptions.MessageException(CANNOT_MATCH_ERROR_MESSAGE)
The can_match method (lib/galaxy/model/dataset_collections/structure.py:130-145) verifies:
- Collection types are compatible
- Same number of children at each level
- Nested structure matches recursively
Subcollection Mapping
A nested collection can be mapped over a tool that accepts a subcollection type:
Example: Tool (i: collection<paired>) => {o: dataset} with list:paired input
The tool receives each paired subcollection, producing a flat list output:
list:paired[{el1: {forward=f1, reverse=r1}}, {el2: {forward=f2, reverse=r2}}]
=> list[tool(i={forward=f1, reverse=r1})[o], tool(i={forward=f2, reverse=r2})[o]]
The mapping logic (lib/galaxy/model/dataset_collections/subcollections.py:36-58):
def _split_dataset_collection(dataset_collection, collection_type):
for element in dataset_collection.elements:
if child_collection.collection_type == collection_type:
split_elements.append(element)
else:
split_elements.extend(_split_dataset_collection(child_collection, collection_type))
4. The Consume Value Model
Conceptual Framework
Tool inputs “consume” collection structure based on their declared type:
| Input Declaration | What It Consumes | Remaining Structure |
|---|---|---|
type="data" | Single dataset | Maps over entire collection |
type="data" multiple="true" | List (reduces it) | Outer collection type |
type="data_collection" collection_type="paired" | Paired subcollection | Outer list(s) |
type="data_collection" collection_type="list" | List subcollection | Outer collection type |
Reduction vs Mapping
Reduction occurs when an input consumes an entire collection rank:
multiple="true"data inputs reducelistto nothing (or outer list for nested)collection_type="list"inputs reducelistto nothing
Mapping occurs when collection structure remains after consumption:
list:pairedoverpairedinput => maps, produceslistlist:listoverlistinput => maps, produceslist
What Cannot Be Reduced
Paired collections cannot be reduced by multiple="true" inputs:
# From collection_semantics.md documentation
# PAIRED_REDUCTION_INVALID - paired cannot be reduced by multiple data input
This is intentional: paired represents a semantic tuple (forward/reverse), not an arbitrary list.
paired_or_unpaired Special Handling
The paired_or_unpaired type has special consumption rules:
- It can consume
pairedcollections (forward/reverse case) - It can consume single datasets via
single_datasetssubcollection mapping pairedinputs CANNOT consumepaired_or_unpaired(may lack forward/reverse)
From type_description.py:92-98:
if other_collection_type == "paired_or_unpaired":
# this can be thought of as a subcollection of anything except a pair
return collection_type != "paired"
if other_collection_type == "single_datasets":
# effectively any collection has unpaired subcollections
return True
5. Output Collection Creation
Structure Inheritance
Output collections inherit structure from input collections being mapped over. The structure determination (lib/galaxy/model/dataset_collections/structure.py:180-211):
def tool_output_to_structure(get_sliced_input_collection_structure, tool_output, collections_manager):
if not tool_output.collection:
tree = leaf
else:
structured_like = tool_output.structure.structured_like
collection_type = tool_output.structure.collection_type
if structured_like:
tree = get_sliced_input_collection_structure(structured_like)
Implicit Collection Creation
When mapping over collections, Galaxy pre-creates output collections (lib/galaxy/tools/execute.py:583-638):
def precreate_output_collections(self, history, params, tool_request):
for output_name, output in self.tool.outputs.items():
effective_structure = self._mapped_output_structure(trans, output)
collection_instance = trans.app.dataset_collection_manager.precreate_dataset_collection_instance(
trans=trans,
parent=history,
name=output_collection_name,
structure=effective_structure,
implicit_inputs=implicit_inputs,
)
The effective_structure is computed by multiplying the mapping structure by the output structure:
def _mapped_output_structure(self, trans, tool_output):
output_structure = tool_output_to_structure(...)
mapping_structure = self._structure_for_output(trans, tool_output)
return mapping_structure.multiply(output_structure)
Structure Multiplication
When a tool that produces collections is mapped over a collection, structures multiply (lib/galaxy/model/dataset_collections/structure.py:150-165):
def multiply(self, other_structure):
if other_structure.is_leaf:
return self.clone()
new_collection_type = self.collection_type_description.multiply(other_structure.collection_type_description)
new_children = []
for identifier, structure in self.children:
new_children.append((identifier, structure.multiply(other_structure)))
return Tree(new_children, new_collection_type)
Example: Mapping list[a, b] over a tool that outputs paired:
- Structure:
listxpaired=list:paired - Result:
list:paired[{a: {forward, reverse}}, {b: {forward, reverse}}]
Element Identifier Propagation
Output collection elements inherit identifiers from input collections. The default_identifier_source output attribute can specify which input provides identifiers when multiple collections are involved.
6. Runtime Representation
Wrapper Classes
At tool execution time, collection data is wrapped for template access (lib/galaxy/tools/wrappers.py):
DatasetFilenameWrapper (lines 288-540)
Wraps individual datasets within collections:
class DatasetFilenameWrapper(ToolParameterValueWrapper):
@property
def element_identifier(self) -> str:
identifier = self._element_identifier
if identifier is None:
identifier = self.name
return identifier
Key properties:
__str__()returns the file pathelement_identifierreturns the dataset’s identifier within its collectionfile_extreturns the file extension
DatasetCollectionWrapper (lines 639-800)
Wraps collection parameters:
class DatasetCollectionWrapper(ToolParameterValueWrapper, HasDatasets):
def __init__(self, job_working_directory, has_collection, ...):
for dataset_collection_element in elements:
element_identifier = dataset_collection_element.element_identifier
if isinstance(element_object, DatasetCollection):
element_wrapper = DatasetCollectionWrapper(...) # Recursive
else:
element_wrapper = self._dataset_wrapper(element_object, identifier=element_identifier)
Accessing Collection Elements
In Cheetah templates:
## Iterate over collection elements
#for $element in $input_collection
$element ## File path
$element.element_identifier ## Element name
$element.ext ## File extension
#end for
## Access by identifier
$input_collection['forward']
$input_collection['reverse']
## Get all paths
#for $path in $input_collection.all_paths
$path
#end for
Element Identifier Flow
When mapping over collections, each job receives the element identifier via a special parameter:
# From actions/__init__.py:501-503
identifier = getattr(data, "element_identifier", None)
if identifier is not None:
incoming[f"{name}|__identifier__"] = identifier
This enables tools to access the identifier of the dataset they’re processing.
7. Practical Examples
Example 1: Simple List Mapping
Tool: cat_file.xml - concatenates file contents
Input: list[file1.txt, file2.txt, file3.txt]
Galaxy creates 3 jobs:
cat_file(input=file1.txt)->out1.txtcat_file(input=file2.txt)->out2.txtcat_file(input=file3.txt)->out3.txt
Output: list[out1.txt, out2.txt, out3.txt] with same identifiers
Example 2: Paired Collection Processing
Tool: bwa_mem.xml - accepts collection_type="paired"
Input: list:paired with 2 samples
Galaxy creates 2 jobs (one per paired element):
bwa_mem(reads={forward=s1_R1.fq, reverse=s1_R2.fq})bwa_mem(reads={forward=s2_R1.fq, reverse=s2_R2.fq})
Output: list[aligned_s1.bam, aligned_s2.bam]
Example 3: Multiple Linked Collections
Tool: compare.xml - compares two files
Input A: list[a1, a2, a3]
Input B: list[b1, b2, b3]
With default linked semantics:
compare(file1=a1, file2=b1)compare(file1=a2, file2=b2)compare(file1=a3, file2=b3)
Output: list[cmp1, cmp2, cmp3]
Example 4: Reduction with Multiple Input
Tool: merge.xml - type="data" multiple="true"
Input: list[f1, f2, f3]
Single job consumes entire list:
merge(inputs=[f1, f2, f3])-> single output dataset
Example 5: Nested Collection Mapping
Tool: single_file_tool.xml - accepts single dataset
Input: list:list[[a1, a2], [b1, b2, b3]]
Galaxy creates 5 jobs:
tool(input=a1)tool(input=a2)tool(input=b1)tool(input=b2)tool(input=b3)
Output: list:list[[out_a1, out_a2], [out_b1, out_b2, out_b3]]
Example 6: Cross-Product Execution
Setup: Use __CROSS_PRODUCT_FLAT__ tool first
Input A: list[a1, a2]
Input B: list[b1, b2]
Cross-product tool outputs:
output_a:list[a1, a1, a2, a2]with identifiers[a1_b1, a1_b2, a2_b1, a2_b2]output_b:list[b1, b2, b1, b2]with identifiers[a1_b1, a1_b2, a2_b1, a2_b2]
Downstream tool with linked mapping produces all 4 combinations.
8. Code References
Core Model Classes
| Class | File | Lines | Purpose |
|---|---|---|---|
DatasetCollection | lib/galaxy/model/__init__.py | 6982-7090 | Collection storage model |
DatasetCollectionElement | lib/galaxy/model/__init__.py | 8006-8100 | Element within collection |
DatasetCollectionInstance | lib/galaxy/model/__init__.py | 7494-7544 | Base for HDCA/LDCA |
HistoryDatasetCollectionAssociation | lib/galaxy/model/__init__.py | varies | History-bound collection |
Collection Type System
| File | Key Components |
|---|---|
lib/galaxy/model/dataset_collections/type_description.py | CollectionTypeDescription, type matching logic |
lib/galaxy/model/dataset_collections/types/__init__.py | BaseDatasetCollectionType abstract base |
lib/galaxy/model/dataset_collections/types/list.py | ListDatasetCollectionType |
lib/galaxy/model/dataset_collections/types/paired.py | PairedDatasetCollectionType |
lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py | PairedOrUnpairedDatasetCollectionType |
Matching and Structure
| File | Key Components |
|---|---|
lib/galaxy/model/dataset_collections/matching.py | CollectionsToMatch, MatchingCollections |
lib/galaxy/model/dataset_collections/structure.py | Tree, Leaf, get_structure, tool_output_to_structure |
lib/galaxy/model/dataset_collections/subcollections.py | split_dataset_collection_instance |
lib/galaxy/model/dataset_collections/query.py | HistoryQuery, collection type matching for parameters |
Tool Parameter Handling
| File | Key Components |
|---|---|
lib/galaxy/tools/parameters/basic.py:2489-2700 | DataCollectionToolParameter |
lib/galaxy/tools/parameters/meta.py:189-260 | expand_meta_parameters - batch expansion |
lib/galaxy/tools/parameters/meta.py:345-391 | expand_meta_parameters_async - async expansion |
lib/galaxy/tools/parameters/meta.py:419-462 | __expand_collection_parameter |
Execution Pipeline
| File | Key Components |
|---|---|
lib/galaxy/tools/execute.py:93-361 | execute, ExecutionTracker |
lib/galaxy/tools/execute.py:499-538 | sliced_input_collection_structure |
lib/galaxy/tools/execute.py:561-573 | _mapped_output_structure |
lib/galaxy/tools/execute.py:583-638 | precreate_output_collections |
lib/galaxy/tools/actions/__init__.py:134-795 | DefaultToolAction.execute |
Collection Manager
| File | Key Components |
|---|---|
lib/galaxy/managers/collections.py:70-350 | DatasetCollectionManager |
lib/galaxy/managers/collections.py:96-178 | precreate_dataset_collection_instance, precreate_dataset_collection |
Runtime Wrappers
| File | Key Components |
|---|---|
lib/galaxy/tools/wrappers.py:288-540 | DatasetFilenameWrapper |
lib/galaxy/tools/wrappers.py:552-634 | DatasetListWrapper |
lib/galaxy/tools/wrappers.py:639-800 | DatasetCollectionWrapper |
Builder Pattern
| File | Key Components |
|---|---|
lib/galaxy/model/dataset_collections/builder.py:27-46 | build_collection |
lib/galaxy/model/dataset_collections/builder.py:95-199 | CollectionBuilder |
lib/galaxy/model/dataset_collections/builder.py:201-220 | BoundCollectionBuilder |
Appendix: Key Algorithms
Collection Matching Algorithm
function match_collections(collections_to_match):
matching = MatchingCollections()
for each (input_key, to_match) in collections_to_match:
collection = to_match.hdca
type_description = for_collection_type(collection.collection_type)
subcollection_type = to_match.subcollection_type
if to_match.linked:
structure = get_structure(collection, type_description, subcollection_type)
if not matching.linked_structure:
matching.linked_structure = structure
else if not matching.linked_structure.can_match(structure):
raise "Cannot match collection types"
matching.add(input_key, collection, subcollection_type)
else:
structure = get_structure(...)
matching.unlinked_structures.append(structure)
return matching
Structure Walking for Execution
function walk_collections(self, hdca_dict):
for each (index, (identifier, substructure)) in self.children:
elements = {k: collection[index] for k, collection in hdca_dict}
if substructure.is_leaf:
yield elements # These go to individual jobs
else:
for child_elements in substructure.walk_collections(child_collections):
yield child_elements
Output Structure Computation
function compute_output_structure(tool_output, collection_info):
# Get the structure from tool output definition
if tool_output.structured_like:
output_structure = get_input_structure(tool_output.structured_like)
else:
output_structure = UninitializedTree(tool_output.collection_type)
# Multiply by mapping structure
mapping_structure = collection_info.structure
return mapping_structure.multiply(output_structure)
References
- Galaxy Collection Semantics Documentation:
doc/source/dev/collection_semantics.md - Planemo Documentation: https://planemo.readthedocs.io/
- Galaxy Tool Development: https://docs.galaxyproject.org/en/latest/dev/schema.html