Vendored from upstream, pinned at SHA 7765fae. Two files live next to this note:

- `galaxy-collection-semantics.yml`: the structured source. Agents and casting should consume this. It carries the `tests:` blocks that pin concrete Galaxy test names; the rendered upstream view drops them.
- `galaxy-collection-semantics.upstream.myst`: Galaxy's auto-generated MyST/LaTeX rendering of the YAML, vendored only so the human view below has something to render. Sync is manual.

When to consult: authoring or reasoning about Molds and patterns that touch `data_collection` inputs, map-over / reduction shape changes, sub-collection mapping, `paired_or_unpaired`, or `sample_sheet`.

# Collection Semantics
This document describes the semantics around working with Galaxy dataset collections. In particular it describes how they operate within Galaxy tools and workflows.
## Mapping

If a tool consumes a simple dataset parameter and produces a simple dataset parameter, then any collection type may be "mapped over" the data input to that tool. The result is that the tool is applied to each element of the collection and "implicit collections" are created from the outputs of those operations. The implicit collections have the same element identifiers, in the same order, as the input collection that is mapped over. Each element of the implicit collections corresponds to its own job, so Galaxy parallelizes the jobs without extra work from the user and without any knowledge on the part of the tool.
Examples
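A minimal sketch of the mapping behaviour described above, in plain Python rather than Galaxy code. It assumes a flat collection can be modelled as an ordered dict from element identifier to dataset contents; `map_over` and the sample data are hypothetical names used only for illustration.

```python
def map_over(tool, collection):
    """Apply `tool` to every element of a flat collection.

    Each call to `tool` stands in for one job. The returned dict plays the
    role of an implicit collection: same identifiers, same order.
    """
    return {identifier: tool(dataset) for identifier, dataset in collection.items()}


# Hypothetical flat list collection: identifier -> dataset contents.
reads = {"sample_a": "ACGT", "sample_b": "GGCC"}

# The "tool" here is just str.lower; it knows nothing about collections.
lowercased = map_over(str.lower, reads)
print(list(lowercased))  # identifiers come back in input order
```

The tool function never sees the collection, which mirrors the point that mapping needs no knowledge on the tool's side.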
The above description of mapping over inputs works naturally and as expected for nested collections.
Examples
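For nested collections, the same idea can be sketched recursively. This is an illustrative model only: nested dicts stand in for nested collections, any non-dict value is treated as a dataset, and `map_over_nested` is a hypothetical helper, not a Galaxy function.

```python
def map_over_nested(tool, collection):
    """Apply `tool` to every leaf dataset, preserving the nesting."""
    if isinstance(collection, dict):
        # A collection rank: recurse into each element, keeping its identifier.
        return {
            identifier: map_over_nested(tool, element)
            for identifier, element in collection.items()
        }
    # A leaf dataset: one tool execution.
    return tool(collection)


# Hypothetical list:paired collection.
nested = {
    "sample_a": {"forward": "ACGT", "reverse": "TTAA"},
    "sample_b": {"forward": "GGCC", "reverse": "CCGGA"},
}

lengths = map_over_nested(len, nested)
print(lengths["sample_b"]["reverse"])
```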
For tools with multiple data inputs, one input can be mapped over while the remaining inputs are supplied as individual datasets. Each dataset that is not mapped over is passed unchanged to every execution of the tool.
Examples
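The case of one mapped input and one plain dataset might be sketched as follows. The helper `map_over_one_input` and the toy tool `count_occurrences` are hypothetical; the sketch assumes a dict-based model of a flat collection.

```python
def count_occurrences(sequence, motif):
    """Toy two-input tool: count non-overlapping occurrences of `motif`."""
    return sequence.count(motif)


def map_over_one_input(tool, collection, fixed_dataset):
    """Map `collection` over the first input; reuse `fixed_dataset` in every job."""
    return {
        identifier: tool(dataset, fixed_dataset)
        for identifier, dataset in collection.items()
    }


reads = {"sample_a": "ACGTAC", "sample_b": "GGCC"}
motif = "AC"  # the single dataset that is not mapped over

counts = map_over_one_input(count_occurrences, reads, motif)
print(counts)
```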
If a tool consumes two input datasets and produces one output dataset, you can map two collections with identical structure (same element identifiers in the same order) over the respective inputs and the result is an implicit collection with the same structure as the inputs and where each output in the implicit collection corresponds to the tool being executed with the two inputs corresponding to that position in the input collections.
The default behavior here is that the collections are linked: mapping over the inputs behaves like a flat map or a dot product, with elements paired position by position, so the resulting collections gain no extra dimensionality.
From a user perspective this means that if you start with a collection and apply a series of map-over operations, the results all continue to match and work together, again without extra work by the user and without extra knowledge on the part of the tool author.
Examples
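Linked mapping over two collections can be sketched as an element-wise pairing. This is a toy model under the assumption that collections are ordered dicts; `map_over_linked` is a hypothetical name, and raising `ValueError` on mismatched structure is a choice made for this sketch only.

```python
def map_over_linked(tool, left, right):
    """Run `tool` once per position on two collections with identical structure."""
    if list(left) != list(right):
        raise ValueError(
            "linked collections need identical element identifiers in the same order"
        )
    # Dot product: element i of `left` is paired only with element i of `right`.
    return {identifier: tool(left[identifier], right[identifier]) for identifier in left}


forward = {"sample_a": "ACGT", "sample_b": "GGCC"}
reverse = {"sample_a": "TTAA", "sample_b": "CCGG"}

merged = map_over_linked(lambda a, b: a + b, forward, reverse)
print(len(merged))  # one output per position, no extra dimension
```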
## Reduction
Not all tool executions result in implicit collections and mapping
over inputs. Tool inputs of type data_collection can consume
collections directly and do not necessarily result in mapping over.
Tools that consume collections and output datasets effectively reduce the dimension of the Galaxy data structure. When used at runtime this is often referred to as a "reduction" in the code.
Examples
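By contrast with mapping, a reduction hands the whole collection to a single execution. The following is a hedged sketch with hypothetical names (`run_with_collection_input`, `concatenate`), again modelling a collection as a dict.

```python
def concatenate(collection):
    """Toy tool with a collection input: joins all element contents."""
    return "".join(collection.values())


def run_with_collection_input(tool, collection):
    """One execution consumes the entire collection; no implicit collection results."""
    return tool(collection)


reads = {"sample_a": "ACGT", "sample_b": "GGCC"}

combined = run_with_collection_input(concatenate, reads)
print(type(combined).__name__)  # a single dataset, not a collection
```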
For nested collections where each rank is a list or a paired collection,
the collection supplied must match every part of the input's collection type definition.
Examples
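One way to picture the matching rule is a rank-by-rank comparison of colon-separated collection types. This sketch covers only `list` and `paired` ranks, as in the sentence above; `consumes_directly` is a hypothetical helper.

```python
def consumes_directly(collection_type, input_type):
    """True when every rank of the collection matches the input definition."""
    return collection_type.split(":") == input_type.split(":")


print(consumes_directly("list:paired", "list:paired"))
print(consumes_directly("list:list", "list:paired"))
```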
In addition to explicit collection inputs, tool inputs of type data
where multiple="true" can consume lists directly. This is likewise a
"reduction" and does not result in implicit collection creation.
Examples
Paired collections cannot be reduced this way. paired is not meant
to represent a list/array/vector data structure - it is more like a tuple.
Examples
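The two rules above for `multiple="true"` data inputs can be sketched together. The function name and the use of `TypeError` are assumptions of this toy model, not Galaxy behaviour.

```python
def reduce_into_multiple_data_input(tool, collection_type, collection):
    """Feed a flat collection to a multiple data input as one execution."""
    if collection_type != "list":
        # paired is tuple-like, not list-like, so it is rejected here.
        raise TypeError(
            f"a {collection_type} collection cannot be reduced by a multiple data input"
        )
    return tool(list(collection.values()))


lanes = {"lane_1": "AC", "lane_2": "GT"}
joined = reduce_into_multiple_data_input("".join, "list", lanes)
print(joined)

pair = {"forward": "AC", "reverse": "GT"}
try:
    reduce_into_multiple_data_input("".join, "paired", pair)
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```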
## Sub-collection Mapping
Examples
Multiple data input parameters can consume list collections, as described above in the
discussion of reductions. The natural extension is that nested lists of lists (list:list)
can be mapped over a multiple data input parameter. Each inner list is reduced by this
operation, and the results are mapped over the outer list. The result is a list with the
same structure as the outer list of the input collection.
Examples
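A sketch of mapping a list:list over a multiple data input: each inner list is reduced in one execution, and the outer identifiers are preserved. The helper name and sample data are hypothetical.

```python
def map_sublists_over_multiple_data_input(tool, list_of_lists):
    """Reduce each inner list with `tool`; keep the outer list's structure."""
    return {
        outer_id: tool(list(inner.values()))
        for outer_id, inner in list_of_lists.items()
    }


# Hypothetical list:list collection: patient -> lane -> contents.
runs = {
    "patient_1": {"lane_1": "AC", "lane_2": "GT"},
    "patient_2": {"lane_1": "GG"},
}

merged = map_sublists_over_multiple_data_input("".join, runs)
print(merged)
```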
Just as a paired collection won't be reduced by a multiple data input, any sort of nested
collection ending in a paired collection cannot be mapped over such an input. So a multiple
data input parameter cannot be mapped over by a list of pairs (list:paired) for instance.
Examples
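At the level of collection types, the restriction can be expressed as a small predicate. This is a simplified model: it assumes that any collection whose innermost rank is `list`, with at least one rank above it, can be mapped over a multiple data input, which generalises slightly beyond the list:list case stated in the text. The function name is hypothetical.

```python
def can_map_over_multiple_data_input(collection_type):
    """True when the innermost rank is a list and an outer rank remains to map over."""
    ranks = collection_type.split(":")
    # A flat "list" is reduced directly rather than mapped, so it returns False here.
    return len(ranks) >= 2 and ranks[-1] == "list"


print(can_map_over_multiple_data_input("list:list"))
print(can_map_over_multiple_data_input("list:paired"))
```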
## paired_or_unpaired Collections
The collection type paired_or_unpaired is meant to serve as a stand-in for
an entity that can be either a single dataset or what is effectively a paired
dataset collection. These collections either have one element with identifier
unpaired or two elements with identifiers forward and reverse.
Tools can declare a data_collection input with collection type paired_or_unpaired.
That input will consume either an explicit paired_or_unpaired collection
or a paired collection.
Examples
The inverse intentionally does not work. In some ways a paired collection
acts as a paired_or_unpaired collection, but a paired_or_unpaired collection is not a
paired collection. This makes sense in terms of tools: a tool consuming a paired
collection expects to find both a forward and a reverse element, but these may not exist
in a paired_or_unpaired collection.
Examples
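The one-directional acceptance rule can be sketched as follows. Both helpers are hypothetical; the identifier check encodes only the two shapes named in the text (one `unpaired` element, or `forward` and `reverse`).

```python
VALID_IDENTIFIER_SETS = (["unpaired"], ["forward", "reverse"])


def is_valid_paired_or_unpaired(identifiers):
    """Check the element identifiers of a paired_or_unpaired collection."""
    return list(identifiers) in VALID_IDENTIFIER_SETS


def input_accepts(input_type, collection_type):
    """Direct consumption for the flat paired / paired_or_unpaired types."""
    if input_type == collection_type:
        return True
    # A paired collection can stand in for paired_or_unpaired, never the reverse.
    return input_type == "paired_or_unpaired" and collection_type == "paired"


print(input_accepts("paired_or_unpaired", "paired"))
print(input_accepts("paired", "paired_or_unpaired"))
```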
The same logic holds for mapping: lists of paired datasets (list:paired) can be mapped over
paired_or_unpaired inputs, but mixed lists (list:paired_or_unpaired) cannot
be mapped over a paired input. Following the same logic, list:paired_or_unpaired cannot
be mapped over a list input or a multiple data input.
Examples
This logic extends naturally into higher dimensional collections. A list:list:paired
can be mapped over either a paired_or_unpaired input to produce a nested list (list:list)
or a list:paired_or_unpaired input to produce a flat list (list).
Examples
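The mapping rules in this section can be modelled as a suffix match on colon-separated collection types: the input type must match the innermost ranks of the collection, and whatever ranks remain form the implicit output. This is a simplified sketch, not Galaxy's matching code. `implicit_output_type` is a hypothetical function, and it deliberately lets `paired` stand in for `paired_or_unpaired` only at the deepest rank.

```python
def implicit_output_type(collection_type, input_type):
    """Return the collection type of the implicit output, or None if unmappable."""
    coll = collection_type.split(":")
    inp = input_type.split(":")
    split = len(coll) - len(inp)
    if split <= 0:
        return None  # no outer ranks left to map across
    suffix = coll[split:]
    for depth, (have, want) in enumerate(zip(suffix, inp)):
        deepest = depth == len(inp) - 1
        relaxed = deepest and want == "paired_or_unpaired" and have == "paired"
        if have != want and not relaxed:
            return None
    return ":".join(coll[:split])


print(implicit_output_type("list:list:paired", "paired_or_unpaired"))
print(implicit_output_type("list:list:paired", "list:paired_or_unpaired"))
print(implicit_output_type("list:paired_or_unpaired", "paired"))
```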
In order for paired_or_unpaired collections to also act as a single dataset,
a flat list can be mapped over such an input with a special sub-collection mapping
type of 'single_datasets'.
Examples
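A data-level sketch of the 'single_datasets' idea. It assumes, purely for illustration, that each dataset of the flat list reaches the tool shaped like a one-element collection with the identifier `unpaired`; how Galaxy represents this internally is not stated here. `describe` and `map_single_datasets` are hypothetical names.

```python
def describe(reads):
    """Toy tool with a paired_or_unpaired input."""
    if list(reads) == ["unpaired"]:
        return "single-end: " + reads["unpaired"]
    return "paired-end: " + reads["forward"] + "/" + reads["reverse"]


def map_single_datasets(tool, flat_list):
    """Map a flat list over a paired_or_unpaired input, one dataset per execution."""
    # Assumption of this sketch: wrap each dataset as an "unpaired" element.
    return {
        identifier: tool({"unpaired": dataset})
        for identifier, dataset in flat_list.items()
    }


reads = {"sample_a": "ACGT", "sample_b": "GGCC"}
described = map_single_datasets(describe, reads)
print(described["sample_a"])
```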
This treatment of lists without pairing extends to nested structures naturally.
For instance, a list of list of datasets (list:list) can be mapped over a
paired_or_unpaired input to produce a nested list of lists (list:list)
with a structure matching the input. Likewise, the nested list can be mapped over
a list:paired_or_unpaired input to produce a flat list with the same structure
as the outer list of the input.
Examples
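The nested cases above can be sketched at the type level. This toy function handles only collections made entirely of `list` ranks and inputs whose deepest rank is `paired_or_unpaired`; anything else returns `None`. The name `map_plain_lists` is hypothetical, and the sketch declines the case where no outer rank would remain, since the text does not cover it.

```python
def map_plain_lists(collection_type, input_type):
    """Implicit output type when plain nested lists act as single datasets."""
    coll = collection_type.split(":")
    inp = input_type.split(":")
    if inp[-1] != "paired_or_unpaired" or any(rank != "list" for rank in coll):
        return None
    outer = inp[:-1]  # ranks the input expects above the single dataset
    split = len(coll) - len(outer)
    if split <= 0 or coll[split:] != outer:
        return None
    return ":".join(coll[:split])


print(map_plain_lists("list:list", "paired_or_unpaired"))
print(map_plain_lists("list:list", "list:paired_or_unpaired"))
```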
Due only to limited implementation time, the special casing that lets paired_or_unpaired act as both a dataset and a paired collection works only when it is the deepest collection type. So while list:paired can be consumed by a list:paired_or_unpaired input, a paired:list cannot be consumed by a paired_or_unpaired:list input, though it should be able to for consistency. We have focused our time on data structures more likely to be used in actual Galaxy analyses given current and anticipated usage.
## sample_sheet Collections
The collection type sample_sheet attaches typed, columnar metadata to each
element of a dataset collection. For mapping and type matching purposes,
sample_sheet behaves identically to list - it can be mapped over tool
inputs, matched against list collection inputs, and composed with inner types
(sample_sheet:paired, sample_sheet:paired_or_unpaired). The key asymmetry
is that while a sample_sheet output can satisfy a list input, a list
output cannot satisfy a sample_sheet input - sample sheets carry metadata
that plain lists do not.
Examples
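The asymmetry between `sample_sheet` and `list` can be sketched as a one-way substitution. `output_satisfies_input` is a hypothetical helper; applying the rule rank by rank to composite types such as sample_sheet:paired is an assumption of this sketch, based on the statement that sample sheets compose with inner types.

```python
def output_satisfies_input(output_type, input_type):
    """True when a collection of `output_type` can feed an input of `input_type`."""
    out_ranks = output_type.split(":")
    in_ranks = input_type.split(":")
    if len(out_ranks) != len(in_ranks):
        return False
    return all(
        # A sample_sheet may stand in for a list, but never the other way round.
        have == want or (have == "sample_sheet" and want == "list")
        for have, want in zip(out_ranks, in_ranks)
    )


print(output_satisfies_input("sample_sheet", "list"))
print(output_satisfies_input("list", "sample_sheet"))
```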
Sub-collection mapping works the same as for list composites. A
sample_sheet:paired can be mapped over a paired collection input,
extracting each inner pair and producing a sample_sheet implicit output.
Examples
The paired_or_unpaired integration rules carry over from list to
sample_sheet. A flat sample_sheet can be mapped over a
paired_or_unpaired input via single_datasets sub-collection mapping,
and sample_sheet:paired can be mapped over paired_or_unpaired just as
list:paired can.
Examples
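The sample_sheet mapping cases in this section can be summarised in one toy function. It is a sketch under stated assumptions, not Galaxy's implementation: `mapped_output_type` is a hypothetical name, types are colon-separated strings, and the 'single_datasets' branch is limited to a bare `paired_or_unpaired` input.

```python
def mapped_output_type(collection_type, input_type):
    """Implicit output type for list and sample_sheet collections, or None."""
    coll = collection_type.split(":")
    inp = input_type.split(":")
    # single_datasets: plain datasets mapped over a paired_or_unpaired input.
    if inp == ["paired_or_unpaired"] and coll[-1] in ("list", "sample_sheet"):
        return ":".join(coll)
    split = len(coll) - len(inp)
    if split <= 0:
        return None
    for depth, (have, want) in enumerate(zip(coll[split:], inp)):
        deepest = depth == len(inp) - 1
        relaxed = deepest and want == "paired_or_unpaired" and have == "paired"
        if have != want and not relaxed:
            return None
    # The outer ranks, including sample_sheet, are kept for the implicit output.
    return ":".join(coll[:split])


print(mapped_output_type("sample_sheet:paired", "paired"))
print(mapped_output_type("sample_sheet", "paired_or_unpaired"))
```

The same calls with `list` in place of `sample_sheet` return `list`, which reflects the statement that the rules carry over unchanged.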
The type matching is asymmetric: a sample_sheet output can satisfy a list
input because sample sheets carry all the structural information lists have (plus
metadata). However, a list output cannot satisfy a sample_sheet input because
lists lack the column_definitions and per-element columns metadata that
sample sheet consumers expect.