Dashboard

Component Collection Tool Execution Semantics

Collection types (list, paired, record), mapping semantics, linked vs cross-product multiple inputs, element identifier flow

Raw
Revised:
2026-05-16
Revision:
6
Related Notes:
Component - Collection Adapters, Component - Collection Models, Component - Collections - Paired or Unpaired, Component - CWL Ephemeral Collections, Component - Tool XML Collection Commands, Component - Workflow Editor Terminals, PR 19377 - Collection Types and Wizard UI, PR 21828 - YAML Tool Hardening and Tool State, PR 21932 - History Graph API, PR 21935 - Workflow Extraction Vue Conversion, PR 22706 - Workflow Extraction by IDs

Collection Tool Execution Semantics in Galaxy

A Technical White Paper for Galaxy Developers


Table of Contents

  1. Introduction
  2. Collection Type Hierarchy
  3. Input Mapping Semantics
  4. The Consume Value Model
  5. Output Collection Creation
  6. Runtime Representation
  7. Practical Examples
  8. Code References

1. Introduction

What Are Dataset Collections?

Dataset collections are Galaxy’s mechanism for grouping related datasets into structured containers that can be processed together. They enable:

  • Batch processing: Apply a tool to multiple datasets simultaneously
  • Parallelization: Galaxy automatically creates separate jobs for each element
  • Structure preservation: Output collections maintain the same structure as inputs
  • Workflow composability: Collections enable complex data flows without explicit loops

Why Collections Matter for Tool Execution

When a tool accepts a single dataset input but receives a collection:

  1. Galaxy “maps over” the collection, creating one job per element
  2. Outputs are gathered into an “implicit collection” matching the input structure
  3. Element identifiers flow through, enabling downstream matching

This happens transparently - tool authors need not be aware of collections for basic mapping scenarios.

Core Principle

Galaxy’s collection semantics follow a design principle: “what should just intuitively work.” Users connect steps and Galaxy determines the natural execution pattern. This document formalizes that intuition for implementers.


2. Collection Type Hierarchy

Fundamental Collection Types

Galaxy supports several base collection types, which can be nested:

TypeDescriptionElement Identifiers
listOrdered sequence of datasetsUser-defined (e.g., sample1, sample2)
pairedForward/reverse read pairsFixed: forward, reverse
paired_or_unpairedSingle or paired readsunpaired OR forward/reverse
recordHeterogeneous named fieldsSchema-defined field names
sample_sheetTabular metadata with datasetsRow identifiers with column metadata

Nested Collection Types

Collection types compose via the : separator:

  • list:paired - A list of paired datasets (e.g., multiple samples, each with forward/reverse)
  • list:list - Nested lists (e.g., samples grouped by condition)
  • list:paired_or_unpaired - List where each element may be single or paired

Type Validation Regex

Collection types must match this pattern (from type_description.py:15-17):

COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)

Collection Type Description

The CollectionTypeDescription class (lib/galaxy/model/dataset_collections/type_description.py:31-175) provides methods for working with collection types:

class CollectionTypeDescription:
    def child_collection_type(self):
        """Returns the collection type after removing the outermost rank."""
        # e.g., "list:paired" -> "paired"

    def effective_collection_type(self, subcollection_type):
        """Returns the type after consuming a subcollection type."""
        # e.g., "list:list:paired".effective_collection_type("paired") -> "list:list"

    def has_subcollections_of_type(self, other_collection_type):
        """Check if this type contains subcollections of another type."""

    def rank_collection_type(self):
        """Returns the outermost collection type."""
        # e.g., "list:paired" -> "list"

Dimension

The dimension of a collection type equals the number of : separators plus 2:

@property
def dimension(self):
    return len(self.collection_type.split(":")) + 1
  • list has dimension 2
  • list:paired has dimension 3
  • list:list:paired has dimension 4

3. Input Mapping Semantics

Single Collection Input Iteration (Map Over)

When a tool declares a simple dataset parameter (type="data") but receives a collection, Galaxy maps over the collection:

Example: Tool (i: dataset) => {o: dataset} with input list[d1, d2, d3]

Result: list[tool(i=d1)[o], tool(i=d2)[o], tool(i=d3)[o]]

The output collection preserves:

  • The same collection type (list)
  • The same element identifiers
  • The same order

This extends naturally to nested collections:

Example: Tool (i: dataset) => {o: dataset} with input list:paired[{forward=f, reverse=r}]

Result: list:paired[{forward=tool(i=f)[o], reverse=tool(i=r)[o]}]

Multiple Collection Inputs

When a tool has multiple dataset inputs, Galaxy supports two modes:

Linked (Matched) Semantics (Default)

Collections are iterated in lockstep - element N of input A is paired with element N of input B:

# From meta.py:217
is_linked = value.get("linked", True)  # Default is linked=True

Example: Tool (i1: dataset, i2: dataset) => {o: dataset}

  • Input 1: list[a1, a2, a3]
  • Input 2: list[b1, b2, b3]

Result: list[tool(i1=a1, i2=b1)[o], tool(i1=a2, i2=b2)[o], tool(i1=a3, i2=b3)[o]]

Requirement: Linked collections must have identical structure (same element identifiers in same order).

Cross-Product (Unlinked) Semantics

When linked=False, Galaxy computes the Cartesian product:

# From meta.py:220-221
elif is_batch:
    classification = input_classification.MULTIPLIED

Example: Tool (i1: dataset, i2: dataset) => {o: dataset}

  • Input 1: list[a1, a2] (with linked=False)
  • Input 2: list[b1, b2]

Result: Every combination is executed:

  • tool(i1=a1, i2=b1)
  • tool(i1=a1, i2=b2)
  • tool(i1=a2, i2=b1)
  • tool(i1=a2, i2=b2)

Galaxy provides built-in tools (__CROSS_PRODUCT_FLAT__, __CROSS_PRODUCT_NESTED__) to pre-compute cross-product structures for downstream linked mapping.

Element Identifier Matching

For linked collections, matching occurs by structure position, not by identifier name. The matching algorithm (lib/galaxy/model/dataset_collections/matching.py:40-120):

class MatchingCollections:
    def __attempt_add_to_linked_match(self, input_name, hdca, collection_type_description, subcollection_type):
        structure = get_structure(hdca, collection_type_description, leaf_subcollection_type=subcollection_type)
        if not self.linked_structure:
            self.linked_structure = structure
        else:
            if not self.linked_structure.can_match(structure):
                raise exceptions.MessageException(CANNOT_MATCH_ERROR_MESSAGE)

The can_match method (lib/galaxy/model/dataset_collections/structure.py:130-145) verifies:

  1. Collection types are compatible
  2. Same number of children at each level
  3. Nested structure matches recursively

Subcollection Mapping

A nested collection can be mapped over a tool that accepts a subcollection type:

Example: Tool (i: collection<paired>) => {o: dataset} with list:paired input

The tool receives each paired subcollection, producing a flat list output:

list:paired[{el1: {forward=f1, reverse=r1}}, {el2: {forward=f2, reverse=r2}}]
    => list[tool(i={forward=f1, reverse=r1})[o], tool(i={forward=f2, reverse=r2})[o]]

The mapping logic (lib/galaxy/model/dataset_collections/subcollections.py:36-58):

def _split_dataset_collection(dataset_collection, collection_type):
    for element in dataset_collection.elements:
        if child_collection.collection_type == collection_type:
            split_elements.append(element)
        else:
            split_elements.extend(_split_dataset_collection(child_collection, collection_type))

4. The Consume Value Model

Conceptual Framework

Tool inputs “consume” collection structure based on their declared type:

Input DeclarationWhat It ConsumesRemaining Structure
type="data"Single datasetMaps over entire collection
type="data" multiple="true"List (reduces it)Outer collection type
type="data_collection" collection_type="paired"Paired subcollectionOuter list(s)
type="data_collection" collection_type="list"List subcollectionOuter collection type

Reduction vs Mapping

Reduction occurs when an input consumes an entire collection rank:

  • multiple="true" data inputs reduce list to nothing (or outer list for nested)
  • collection_type="list" inputs reduce list to nothing

Mapping occurs when collection structure remains after consumption:

  • list:paired over paired input => maps, produces list
  • list:list over list input => maps, produces list

What Cannot Be Reduced

Paired collections cannot be reduced by multiple="true" inputs:

# From collection_semantics.md documentation
# PAIRED_REDUCTION_INVALID - paired cannot be reduced by multiple data input

This is intentional: paired represents a semantic tuple (forward/reverse), not an arbitrary list.

paired_or_unpaired Special Handling

The paired_or_unpaired type has special consumption rules:

  1. It can consume paired collections (forward/reverse case)
  2. It can consume single datasets via single_datasets subcollection mapping
  3. paired inputs CANNOT consume paired_or_unpaired (may lack forward/reverse)

From type_description.py:92-98:

if other_collection_type == "paired_or_unpaired":
    # this can be thought of as a subcollection of anything except a pair
    return collection_type != "paired"
if other_collection_type == "single_datasets":
    # effectively any collection has unpaired subcollections
    return True

5. Output Collection Creation

Structure Inheritance

Output collections inherit structure from input collections being mapped over. The structure determination (lib/galaxy/model/dataset_collections/structure.py:180-211):

def tool_output_to_structure(get_sliced_input_collection_structure, tool_output, collections_manager):
    if not tool_output.collection:
        tree = leaf
    else:
        structured_like = tool_output.structure.structured_like
        collection_type = tool_output.structure.collection_type
        if structured_like:
            tree = get_sliced_input_collection_structure(structured_like)

Implicit Collection Creation

When mapping over collections, Galaxy pre-creates output collections (lib/galaxy/tools/execute.py:583-638):

def precreate_output_collections(self, history, params, tool_request):
    for output_name, output in self.tool.outputs.items():
        effective_structure = self._mapped_output_structure(trans, output)
        collection_instance = trans.app.dataset_collection_manager.precreate_dataset_collection_instance(
            trans=trans,
            parent=history,
            name=output_collection_name,
            structure=effective_structure,
            implicit_inputs=implicit_inputs,
        )

The effective_structure is computed by multiplying the mapping structure by the output structure:

def _mapped_output_structure(self, trans, tool_output):
    output_structure = tool_output_to_structure(...)
    mapping_structure = self._structure_for_output(trans, tool_output)
    return mapping_structure.multiply(output_structure)

Structure Multiplication

When a tool that produces collections is mapped over a collection, structures multiply (lib/galaxy/model/dataset_collections/structure.py:150-165):

def multiply(self, other_structure):
    if other_structure.is_leaf:
        return self.clone()

    new_collection_type = self.collection_type_description.multiply(other_structure.collection_type_description)
    new_children = []
    for identifier, structure in self.children:
        new_children.append((identifier, structure.multiply(other_structure)))
    return Tree(new_children, new_collection_type)

Example: Mapping list[a, b] over a tool that outputs paired:

  • Structure: list x paired = list:paired
  • Result: list:paired[{a: {forward, reverse}}, {b: {forward, reverse}}]

Element Identifier Propagation

Output collection elements inherit identifiers from input collections. The default_identifier_source output attribute can specify which input provides identifiers when multiple collections are involved.


6. Runtime Representation

Wrapper Classes

At tool execution time, collection data is wrapped for template access (lib/galaxy/tools/wrappers.py):

DatasetFilenameWrapper (lines 288-540)

Wraps individual datasets within collections:

class DatasetFilenameWrapper(ToolParameterValueWrapper):
    @property
    def element_identifier(self) -> str:
        identifier = self._element_identifier
        if identifier is None:
            identifier = self.name
        return identifier

Key properties:

  • __str__() returns the file path
  • element_identifier returns the dataset’s identifier within its collection
  • file_ext returns the file extension

DatasetCollectionWrapper (lines 639-800)

Wraps collection parameters:

class DatasetCollectionWrapper(ToolParameterValueWrapper, HasDatasets):
    def __init__(self, job_working_directory, has_collection, ...):
        for dataset_collection_element in elements:
            element_identifier = dataset_collection_element.element_identifier
            if isinstance(element_object, DatasetCollection):
                element_wrapper = DatasetCollectionWrapper(...)  # Recursive
            else:
                element_wrapper = self._dataset_wrapper(element_object, identifier=element_identifier)

Accessing Collection Elements

In Cheetah templates:

## Iterate over collection elements
#for $element in $input_collection
    $element              ## File path
    $element.element_identifier  ## Element name
    $element.ext          ## File extension
#end for

## Access by identifier
$input_collection['forward']
$input_collection['reverse']

## Get all paths
#for $path in $input_collection.all_paths
    $path
#end for

Element Identifier Flow

When mapping over collections, each job receives the element identifier via a special parameter:

# From actions/__init__.py:501-503
identifier = getattr(data, "element_identifier", None)
if identifier is not None:
    incoming[f"{name}|__identifier__"] = identifier

This enables tools to access the identifier of the dataset they’re processing.


7. Practical Examples

Example 1: Simple List Mapping

Tool: cat_file.xml - concatenates file contents Input: list[file1.txt, file2.txt, file3.txt]

Galaxy creates 3 jobs:

  1. cat_file(input=file1.txt) -> out1.txt
  2. cat_file(input=file2.txt) -> out2.txt
  3. cat_file(input=file3.txt) -> out3.txt

Output: list[out1.txt, out2.txt, out3.txt] with same identifiers

Example 2: Paired Collection Processing

Tool: bwa_mem.xml - accepts collection_type="paired" Input: list:paired with 2 samples

Galaxy creates 2 jobs (one per paired element):

  1. bwa_mem(reads={forward=s1_R1.fq, reverse=s1_R2.fq})
  2. bwa_mem(reads={forward=s2_R1.fq, reverse=s2_R2.fq})

Output: list[aligned_s1.bam, aligned_s2.bam]

Example 3: Multiple Linked Collections

Tool: compare.xml - compares two files Input A: list[a1, a2, a3] Input B: list[b1, b2, b3]

With default linked semantics:

  1. compare(file1=a1, file2=b1)
  2. compare(file1=a2, file2=b2)
  3. compare(file1=a3, file2=b3)

Output: list[cmp1, cmp2, cmp3]

Example 4: Reduction with Multiple Input

Tool: merge.xml - type="data" multiple="true" Input: list[f1, f2, f3]

Single job consumes entire list:

  • merge(inputs=[f1, f2, f3]) -> single output dataset

Example 5: Nested Collection Mapping

Tool: single_file_tool.xml - accepts single dataset Input: list:list[[a1, a2], [b1, b2, b3]]

Galaxy creates 5 jobs:

  1. tool(input=a1)
  2. tool(input=a2)
  3. tool(input=b1)
  4. tool(input=b2)
  5. tool(input=b3)

Output: list:list[[out_a1, out_a2], [out_b1, out_b2, out_b3]]

Example 6: Cross-Product Execution

Setup: Use __CROSS_PRODUCT_FLAT__ tool first Input A: list[a1, a2] Input B: list[b1, b2]

Cross-product tool outputs:

  • output_a: list[a1, a1, a2, a2] with identifiers [a1_b1, a1_b2, a2_b1, a2_b2]
  • output_b: list[b1, b2, b1, b2] with identifiers [a1_b1, a1_b2, a2_b1, a2_b2]

Downstream tool with linked mapping produces all 4 combinations.


8. Code References

Core Model Classes

ClassFileLinesPurpose
DatasetCollectionlib/galaxy/model/__init__.py6982-7090Collection storage model
DatasetCollectionElementlib/galaxy/model/__init__.py8006-8100Element within collection
DatasetCollectionInstancelib/galaxy/model/__init__.py7494-7544Base for HDCA/LDCA
HistoryDatasetCollectionAssociationlib/galaxy/model/__init__.pyvariesHistory-bound collection

Collection Type System

FileKey Components
lib/galaxy/model/dataset_collections/type_description.pyCollectionTypeDescription, type matching logic
lib/galaxy/model/dataset_collections/types/__init__.pyBaseDatasetCollectionType abstract base
lib/galaxy/model/dataset_collections/types/list.pyListDatasetCollectionType
lib/galaxy/model/dataset_collections/types/paired.pyPairedDatasetCollectionType
lib/galaxy/model/dataset_collections/types/paired_or_unpaired.pyPairedOrUnpairedDatasetCollectionType

Matching and Structure

FileKey Components
lib/galaxy/model/dataset_collections/matching.pyCollectionsToMatch, MatchingCollections
lib/galaxy/model/dataset_collections/structure.pyTree, Leaf, get_structure, tool_output_to_structure
lib/galaxy/model/dataset_collections/subcollections.pysplit_dataset_collection_instance
lib/galaxy/model/dataset_collections/query.pyHistoryQuery, collection type matching for parameters

Tool Parameter Handling

FileKey Components
lib/galaxy/tools/parameters/basic.py:2489-2700DataCollectionToolParameter
lib/galaxy/tools/parameters/meta.py:189-260expand_meta_parameters - batch expansion
lib/galaxy/tools/parameters/meta.py:345-391expand_meta_parameters_async - async expansion
lib/galaxy/tools/parameters/meta.py:419-462__expand_collection_parameter

Execution Pipeline

FileKey Components
lib/galaxy/tools/execute.py:93-361execute, ExecutionTracker
lib/galaxy/tools/execute.py:499-538sliced_input_collection_structure
lib/galaxy/tools/execute.py:561-573_mapped_output_structure
lib/galaxy/tools/execute.py:583-638precreate_output_collections
lib/galaxy/tools/actions/__init__.py:134-795DefaultToolAction.execute

Collection Manager

FileKey Components
lib/galaxy/managers/collections.py:70-350DatasetCollectionManager
lib/galaxy/managers/collections.py:96-178precreate_dataset_collection_instance, precreate_dataset_collection

Runtime Wrappers

FileKey Components
lib/galaxy/tools/wrappers.py:288-540DatasetFilenameWrapper
lib/galaxy/tools/wrappers.py:552-634DatasetListWrapper
lib/galaxy/tools/wrappers.py:639-800DatasetCollectionWrapper

Builder Pattern

FileKey Components
lib/galaxy/model/dataset_collections/builder.py:27-46build_collection
lib/galaxy/model/dataset_collections/builder.py:95-199CollectionBuilder
lib/galaxy/model/dataset_collections/builder.py:201-220BoundCollectionBuilder

Appendix: Key Algorithms

Collection Matching Algorithm

function match_collections(collections_to_match):
    matching = MatchingCollections()
    for each (input_key, to_match) in collections_to_match:
        collection = to_match.hdca
        type_description = for_collection_type(collection.collection_type)
        subcollection_type = to_match.subcollection_type

        if to_match.linked:
            structure = get_structure(collection, type_description, subcollection_type)
            if not matching.linked_structure:
                matching.linked_structure = structure
            else if not matching.linked_structure.can_match(structure):
                raise "Cannot match collection types"
            matching.add(input_key, collection, subcollection_type)
        else:
            structure = get_structure(...)
            matching.unlinked_structures.append(structure)

    return matching

Structure Walking for Execution

function walk_collections(self, hdca_dict):
    for each (index, (identifier, substructure)) in self.children:
        elements = {k: collection[index] for k, collection in hdca_dict}

        if substructure.is_leaf:
            yield elements  # These go to individual jobs
        else:
            for child_elements in substructure.walk_collections(child_collections):
                yield child_elements

Output Structure Computation

function compute_output_structure(tool_output, collection_info):
    # Get the structure from tool output definition
    if tool_output.structured_like:
        output_structure = get_input_structure(tool_output.structured_like)
    else:
        output_structure = UninitializedTree(tool_output.collection_type)

    # Multiply by mapping structure
    mapping_structure = collection_info.structure
    return mapping_structure.multiply(output_structure)

References

Incoming References (11)