Collection Tool Execution Semantics in Galaxy

A Technical White Paper for Galaxy Developers

Introduction
Collection Type Hierarchy
Input Mapping Semantics
The Consume Value Model
Output Collection Creation
Runtime Representation
Practical Examples
Code References

1. Introduction

What Are Dataset Collections?

Dataset collections are Galaxy’s mechanism for grouping related datasets into structured containers that can be processed together. They enable:

Batch processing: Apply a tool to multiple datasets simultaneously
Parallelization: Galaxy automatically creates separate jobs for each element
Structure preservation: Output collections maintain the same structure as inputs
Workflow composability: Collections enable complex data flows without explicit loops

Why Collections Matter for Tool Execution

When a tool accepts a single dataset input but receives a collection:

Galaxy “maps over” the collection, creating one job per element
Outputs are gathered into an “implicit collection” matching the input structure
Element identifiers flow through, enabling downstream matching

This happens transparently - tool authors need not be aware of collections for basic mapping scenarios.

Core Principle

Galaxy’s collection semantics follow a design principle: “what should just intuitively work.” Users connect steps and Galaxy determines the natural execution pattern. This document formalizes that intuition for implementers.

2. Collection Type Hierarchy

Fundamental Collection Types

Galaxy supports several base collection types, which can be nested:

Type	Description	Element Identifiers
`list`	Ordered sequence of datasets	User-defined (e.g., `sample1`, `sample2`)
`paired`	Forward/reverse read pairs	Fixed: `forward`, `reverse`
`paired_or_unpaired`	Single or paired reads	`unpaired` OR `forward`/`reverse`
`record`	Heterogeneous named fields	Schema-defined field names
`sample_sheet`	Tabular metadata with datasets	Row identifiers with column metadata

Nested Collection Types

Collection types compose via the : separator:

list:paired - A list of paired datasets (e.g., multiple samples, each with forward/reverse)
list:list - Nested lists (e.g., samples grouped by condition)
list:paired_or_unpaired - List where each element may be single or paired

Type Validation Regex

Collection types must match this pattern (from type_description.py:15-17):

COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)

Collection Type Description

The CollectionTypeDescription class (lib/galaxy/model/dataset_collections/type_description.py:31-175) provides methods for working with collection types:

class CollectionTypeDescription:
    def child_collection_type(self):
        """Returns the collection type after removing the outermost rank."""
        # e.g., "list:paired" -> "paired"

    def effective_collection_type(self, subcollection_type):
        """Returns the type after consuming a subcollection type."""
        # e.g., "list:list:paired".effective_collection_type("paired") -> "list:list"

    def has_subcollections_of_type(self, other_collection_type):
        """Check if this type contains subcollections of another type."""

    def rank_collection_type(self):
        """Returns the outermost collection type."""
        # e.g., "list:paired" -> "list"

Dimension

The dimension of a collection type equals the number of : separators plus 2:

@property
def dimension(self):
    return len(self.collection_type.split(":")) + 1

list has dimension 2
list:paired has dimension 3
list:list:paired has dimension 4

3. Input Mapping Semantics

Single Collection Input Iteration (Map Over)

When a tool declares a simple dataset parameter (type="data") but receives a collection, Galaxy maps over the collection:

Example: Tool (i: dataset) => {o: dataset} with input list[d1, d2, d3]

Result: list[tool(i=d1)[o], tool(i=d2)[o], tool(i=d3)[o]]

The output collection preserves:

The same collection type (list)
The same element identifiers
The same order

This extends naturally to nested collections:

Example: Tool (i: dataset) => {o: dataset} with input list:paired[{forward=f, reverse=r}]

Result: list:paired[{forward=tool(i=f)[o], reverse=tool(i=r)[o]}]

Multiple Collection Inputs

When a tool has multiple dataset inputs, Galaxy supports two modes:

Linked (Matched) Semantics (Default)

Collections are iterated in lockstep - element N of input A is paired with element N of input B:

# From meta.py:217
is_linked = value.get("linked", True)  # Default is linked=True

Example: Tool (i1: dataset, i2: dataset) => {o: dataset}

Input 1: list[a1, a2, a3]
Input 2: list[b1, b2, b3]

Result: list[tool(i1=a1, i2=b1)[o], tool(i1=a2, i2=b2)[o], tool(i1=a3, i2=b3)[o]]

Requirement: Linked collections must have identical structure (same element identifiers in same order).

Cross-Product (Unlinked) Semantics

When linked=False, Galaxy computes the Cartesian product:

# From meta.py:220-221
elif is_batch:
    classification = input_classification.MULTIPLIED

Example: Tool (i1: dataset, i2: dataset) => {o: dataset}

Input 1: list[a1, a2] (with linked=False)
Input 2: list[b1, b2]

Result: Every combination is executed:

tool(i1=a1, i2=b1)
tool(i1=a1, i2=b2)
tool(i1=a2, i2=b1)
tool(i1=a2, i2=b2)

Galaxy provides built-in tools (__CROSS_PRODUCT_FLAT__, __CROSS_PRODUCT_NESTED__) to pre-compute cross-product structures for downstream linked mapping.

Element Identifier Matching

For linked collections, matching occurs by structure position, not by identifier name. The matching algorithm (lib/galaxy/model/dataset_collections/matching.py:40-120):

class MatchingCollections:
    def __attempt_add_to_linked_match(self, input_name, hdca, collection_type_description, subcollection_type):
        structure = get_structure(hdca, collection_type_description, leaf_subcollection_type=subcollection_type)
        if not self.linked_structure:
            self.linked_structure = structure
        else:
            if not self.linked_structure.can_match(structure):
                raise exceptions.MessageException(CANNOT_MATCH_ERROR_MESSAGE)

The can_match method (lib/galaxy/model/dataset_collections/structure.py:130-145) verifies:

Collection types are compatible
Same number of children at each level
Nested structure matches recursively

Subcollection Mapping

A nested collection can be mapped over a tool that accepts a subcollection type:

Example: Tool (i: collection<paired>) => {o: dataset} with list:paired input

The tool receives each paired subcollection, producing a flat list output:

list:paired[{el1: {forward=f1, reverse=r1}}, {el2: {forward=f2, reverse=r2}}]
    => list[tool(i={forward=f1, reverse=r1})[o], tool(i={forward=f2, reverse=r2})[o]]

The mapping logic (lib/galaxy/model/dataset_collections/subcollections.py:36-58):

def _split_dataset_collection(dataset_collection, collection_type):
    for element in dataset_collection.elements:
        if child_collection.collection_type == collection_type:
            split_elements.append(element)
        else:
            split_elements.extend(_split_dataset_collection(child_collection, collection_type))

4. The Consume Value Model

Conceptual Framework

Tool inputs “consume” collection structure based on their declared type:

Input Declaration	What It Consumes	Remaining Structure
`type="data"`	Single dataset	Maps over entire collection
`type="data" multiple="true"`	List (reduces it)	Outer collection type
`type="data_collection" collection_type="paired"`	Paired subcollection	Outer list(s)
`type="data_collection" collection_type="list"`	List subcollection	Outer collection type

Reduction vs Mapping

Reduction occurs when an input consumes an entire collection rank:

multiple="true" data inputs reduce list to nothing (or outer list for nested)
collection_type="list" inputs reduce list to nothing

Mapping occurs when collection structure remains after consumption:

list:paired over paired input => maps, produces list
list:list over list input => maps, produces list

What Cannot Be Reduced

Paired collections cannot be reduced by multiple="true" inputs:

# From collection_semantics.md documentation
# PAIRED_REDUCTION_INVALID - paired cannot be reduced by multiple data input

This is intentional: paired represents a semantic tuple (forward/reverse), not an arbitrary list.

paired_or_unpaired Special Handling

The paired_or_unpaired type has special consumption rules:

It can consume paired collections (forward/reverse case)
It can consume single datasets via single_datasets subcollection mapping
paired inputs CANNOT consume paired_or_unpaired (may lack forward/reverse)

From type_description.py:92-98:

if other_collection_type == "paired_or_unpaired":
    # this can be thought of as a subcollection of anything except a pair
    return collection_type != "paired"
if other_collection_type == "single_datasets":
    # effectively any collection has unpaired subcollections
    return True

5. Output Collection Creation

Structure Inheritance

Output collections inherit structure from input collections being mapped over. The structure determination (lib/galaxy/model/dataset_collections/structure.py:180-211):

def tool_output_to_structure(get_sliced_input_collection_structure, tool_output, collections_manager):
    if not tool_output.collection:
        tree = leaf
    else:
        structured_like = tool_output.structure.structured_like
        collection_type = tool_output.structure.collection_type
        if structured_like:
            tree = get_sliced_input_collection_structure(structured_like)

Implicit Collection Creation

When mapping over collections, Galaxy pre-creates output collections (lib/galaxy/tools/execute.py:583-638):

def precreate_output_collections(self, history, params, tool_request):
    for output_name, output in self.tool.outputs.items():
        effective_structure = self._mapped_output_structure(trans, output)
        collection_instance = trans.app.dataset_collection_manager.precreate_dataset_collection_instance(
            trans=trans,
            parent=history,
            name=output_collection_name,
            structure=effective_structure,
            implicit_inputs=implicit_inputs,
        )

The effective_structure is computed by multiplying the mapping structure by the output structure:

def _mapped_output_structure(self, trans, tool_output):
    output_structure = tool_output_to_structure(...)
    mapping_structure = self._structure_for_output(trans, tool_output)
    return mapping_structure.multiply(output_structure)

Structure Multiplication

When a tool that produces collections is mapped over a collection, structures multiply (lib/galaxy/model/dataset_collections/structure.py:150-165):

def multiply(self, other_structure):
    if other_structure.is_leaf:
        return self.clone()

    new_collection_type = self.collection_type_description.multiply(other_structure.collection_type_description)
    new_children = []
    for identifier, structure in self.children:
        new_children.append((identifier, structure.multiply(other_structure)))
    return Tree(new_children, new_collection_type)

Example: Mapping list[a, b] over a tool that outputs paired:

Structure: list x paired = list:paired
Result: list:paired[{a: {forward, reverse}}, {b: {forward, reverse}}]

Element Identifier Propagation

Output collection elements inherit identifiers from input collections. The default_identifier_source output attribute can specify which input provides identifiers when multiple collections are involved.

6. Runtime Representation

Wrapper Classes

At tool execution time, collection data is wrapped for template access (lib/galaxy/tools/wrappers.py):

DatasetFilenameWrapper (lines 288-540)

Wraps individual datasets within collections:

class DatasetFilenameWrapper(ToolParameterValueWrapper):
    @property
    def element_identifier(self) -> str:
        identifier = self._element_identifier
        if identifier is None:
            identifier = self.name
        return identifier

Key properties:

__str__() returns the file path
element_identifier returns the dataset’s identifier within its collection
file_ext returns the file extension

DatasetCollectionWrapper (lines 639-800)

Wraps collection parameters:

class DatasetCollectionWrapper(ToolParameterValueWrapper, HasDatasets):
    def __init__(self, job_working_directory, has_collection, ...):
        for dataset_collection_element in elements:
            element_identifier = dataset_collection_element.element_identifier
            if isinstance(element_object, DatasetCollection):
                element_wrapper = DatasetCollectionWrapper(...)  # Recursive
            else:
                element_wrapper = self._dataset_wrapper(element_object, identifier=element_identifier)

Accessing Collection Elements

In Cheetah templates:

## Iterate over collection elements
#for $element in $input_collection
    $element              ## File path
    $element.element_identifier  ## Element name
    $element.ext          ## File extension
#end for

## Access by identifier
$input_collection['forward']
$input_collection['reverse']

## Get all paths
#for $path in $input_collection.all_paths
    $path
#end for

Element Identifier Flow

When mapping over collections, each job receives the element identifier via a special parameter:

# From actions/__init__.py:501-503
identifier = getattr(data, "element_identifier", None)
if identifier is not None:
    incoming[f"{name}|__identifier__"] = identifier

This enables tools to access the identifier of the dataset they’re processing.

7. Practical Examples

Example 1: Simple List Mapping

Tool: cat_file.xml - concatenates file contents Input: list[file1.txt, file2.txt, file3.txt]

Galaxy creates 3 jobs:

cat_file(input=file1.txt) -> out1.txt
cat_file(input=file2.txt) -> out2.txt
cat_file(input=file3.txt) -> out3.txt

Output: list[out1.txt, out2.txt, out3.txt] with same identifiers

Example 2: Paired Collection Processing

Tool: bwa_mem.xml - accepts collection_type="paired" Input: list:paired with 2 samples

Galaxy creates 2 jobs (one per paired element):

bwa_mem(reads={forward=s1_R1.fq, reverse=s1_R2.fq})
bwa_mem(reads={forward=s2_R1.fq, reverse=s2_R2.fq})

Output: list[aligned_s1.bam, aligned_s2.bam]

Example 3: Multiple Linked Collections

Tool: compare.xml - compares two files Input A: list[a1, a2, a3] Input B: list[b1, b2, b3]

With default linked semantics:

compare(file1=a1, file2=b1)
compare(file1=a2, file2=b2)
compare(file1=a3, file2=b3)

Output: list[cmp1, cmp2, cmp3]

Example 4: Reduction with Multiple Input

Tool: merge.xml - type="data" multiple="true" Input: list[f1, f2, f3]

Single job consumes entire list:

merge(inputs=[f1, f2, f3]) -> single output dataset

Example 5: Nested Collection Mapping

Tool: single_file_tool.xml - accepts single dataset Input: list:list[[a1, a2], [b1, b2, b3]]

Galaxy creates 5 jobs:

tool(input=a1)
tool(input=a2)
tool(input=b1)
tool(input=b2)
tool(input=b3)

Output: list:list[[out_a1, out_a2], [out_b1, out_b2, out_b3]]

Example 6: Cross-Product Execution

Setup: Use __CROSS_PRODUCT_FLAT__ tool first Input A: list[a1, a2] Input B: list[b1, b2]

Cross-product tool outputs:

output_a: list[a1, a1, a2, a2] with identifiers [a1_b1, a1_b2, a2_b1, a2_b2]
output_b: list[b1, b2, b1, b2] with identifiers [a1_b1, a1_b2, a2_b1, a2_b2]

Downstream tool with linked mapping produces all 4 combinations.

8. Code References

Core Model Classes

Class	File	Lines	Purpose
`DatasetCollection`	`lib/galaxy/model/__init__.py`	6982-7090	Collection storage model
`DatasetCollectionElement`	`lib/galaxy/model/__init__.py`	8006-8100	Element within collection
`DatasetCollectionInstance`	`lib/galaxy/model/__init__.py`	7494-7544	Base for HDCA/LDCA
`HistoryDatasetCollectionAssociation`	`lib/galaxy/model/__init__.py`	varies	History-bound collection

Collection Type System

File	Key Components
`lib/galaxy/model/dataset_collections/type_description.py`	`CollectionTypeDescription`, type matching logic
`lib/galaxy/model/dataset_collections/types/__init__.py`	`BaseDatasetCollectionType` abstract base
`lib/galaxy/model/dataset_collections/types/list.py`	`ListDatasetCollectionType`
`lib/galaxy/model/dataset_collections/types/paired.py`	`PairedDatasetCollectionType`
`lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py`	`PairedOrUnpairedDatasetCollectionType`

Matching and Structure

File	Key Components
`lib/galaxy/model/dataset_collections/matching.py`	`CollectionsToMatch`, `MatchingCollections`
`lib/galaxy/model/dataset_collections/structure.py`	`Tree`, `Leaf`, `get_structure`, `tool_output_to_structure`
`lib/galaxy/model/dataset_collections/subcollections.py`	`split_dataset_collection_instance`
`lib/galaxy/model/dataset_collections/query.py`	`HistoryQuery`, collection type matching for parameters

Tool Parameter Handling

File	Key Components
`lib/galaxy/tools/parameters/basic.py:2489-2700`	`DataCollectionToolParameter`
`lib/galaxy/tools/parameters/meta.py:189-260`	`expand_meta_parameters` - batch expansion
`lib/galaxy/tools/parameters/meta.py:345-391`	`expand_meta_parameters_async` - async expansion
`lib/galaxy/tools/parameters/meta.py:419-462`	`__expand_collection_parameter`

Execution Pipeline

File	Key Components
`lib/galaxy/tools/execute.py:93-361`	`execute`, `ExecutionTracker`
`lib/galaxy/tools/execute.py:499-538`	`sliced_input_collection_structure`
`lib/galaxy/tools/execute.py:561-573`	`_mapped_output_structure`
`lib/galaxy/tools/execute.py:583-638`	`precreate_output_collections`
`lib/galaxy/tools/actions/__init__.py:134-795`	`DefaultToolAction.execute`

Collection Manager

File	Key Components
`lib/galaxy/managers/collections.py:70-350`	`DatasetCollectionManager`
`lib/galaxy/managers/collections.py:96-178`	`precreate_dataset_collection_instance`, `precreate_dataset_collection`

Runtime Wrappers

File	Key Components
`lib/galaxy/tools/wrappers.py:288-540`	`DatasetFilenameWrapper`
`lib/galaxy/tools/wrappers.py:552-634`	`DatasetListWrapper`
`lib/galaxy/tools/wrappers.py:639-800`	`DatasetCollectionWrapper`

Builder Pattern

File	Key Components
`lib/galaxy/model/dataset_collections/builder.py:27-46`	`build_collection`
`lib/galaxy/model/dataset_collections/builder.py:95-199`	`CollectionBuilder`
`lib/galaxy/model/dataset_collections/builder.py:201-220`	`BoundCollectionBuilder`

Appendix: Key Algorithms

Collection Matching Algorithm

function match_collections(collections_to_match):
    matching = MatchingCollections()
    for each (input_key, to_match) in collections_to_match:
        collection = to_match.hdca
        type_description = for_collection_type(collection.collection_type)
        subcollection_type = to_match.subcollection_type

        if to_match.linked:
            structure = get_structure(collection, type_description, subcollection_type)
            if not matching.linked_structure:
                matching.linked_structure = structure
            else if not matching.linked_structure.can_match(structure):
                raise "Cannot match collection types"
            matching.add(input_key, collection, subcollection_type)
        else:
            structure = get_structure(...)
            matching.unlinked_structures.append(structure)

    return matching

Structure Walking for Execution

function walk_collections(self, hdca_dict):
    for each (index, (identifier, substructure)) in self.children:
        elements = {k: collection[index] for k, collection in hdca_dict}

        if substructure.is_leaf:
            yield elements  # These go to individual jobs
        else:
            for child_elements in substructure.walk_collections(child_collections):
                yield child_elements

Output Structure Computation

function compute_output_structure(tool_output, collection_info):
    # Get the structure from tool output definition
    if tool_output.structured_like:
        output_structure = get_input_structure(tool_output.structured_like)
    else:
        output_structure = UninitializedTree(tool_output.collection_type)

    # Multiply by mapping structure
    mapping_structure = collection_info.structure
    return mapping_structure.multiply(output_structure)

References

Galaxy Collection Semantics Documentation: doc/source/dev/collection_semantics.md
Planemo Documentation: https://planemo.readthedocs.io/
Galaxy Tool Development: https://docs.galaxyproject.org/en/latest/dev/schema.html