Vendored from upstream, pinned at SHA 7765fae. Two files live next to this note:

galaxy-collection-semantics.yml — the structured source. Agents and casting should consume this. It carries the tests: blocks that pin concrete Galaxy test names; the rendered upstream view drops them.

galaxy-collection-semantics.upstream.myst — Galaxy’s auto-generated MyST/LaTeX rendering of the YAML, vendored only so the human view below has something to render. Sync is manual.

When to consult: authoring or reasoning about Molds and patterns that touch data_collection inputs, map-over / reduction shape changes, sub-collection mapping, paired_or_unpaired, or sample_sheet.


# Collection Semantics

This document describes the semantics around working with Galaxy dataset collections. In particular it describes how they operate within Galaxy tools and workflows.

## Mapping

If a tool consumes a simple dataset parameter and produces a simple dataset parameter, then a collection of any type may be "mapped over" the data input to that tool. The tool is applied to each element of the collection, and "implicit collections" are created from the outputs of those operations. The implicit collections have the same element identifiers, in the same order, as the input collection that is mapped over. Each element of an implicit collection corresponds to its own job, so Galaxy parallelizes the jobs without extra work from the user and without any special knowledge on the part of the tool.

Examples
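
A minimal sketch of the mapping behavior described above, in plain Python. The `map_over` helper, the `to_upper` toy tool, and the representation of a collection as a list of (identifier, dataset) pairs are assumptions made for illustration; they are not Galaxy's API or internal data model.

```python
from typing import Callable

# Hypothetical model: a flat list collection is an ordered list of
# (element_identifier, dataset) pairs.
Collection = list[tuple[str, str]]


def map_over(tool: Callable[[str], str], collection: Collection) -> Collection:
    """Apply `tool` once per element; each call stands in for one job."""
    return [(identifier, tool(dataset)) for identifier, dataset in collection]


def to_upper(dataset: str) -> str:
    """Toy tool: one dataset in, one dataset out."""
    return dataset.upper()


samples: Collection = [("sample1", "acgt"), ("sample2", "ttga")]
implicit = map_over(to_upper, samples)
```

The implicit result keeps the identifiers `sample1` and `sample2` in their original order, which is the property the paragraph above relies on.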

The above description of mapping over inputs works naturally and as expected for nested collections.

Examples
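
The same idea can be sketched for nested collections. Here a collection is modeled, purely as an assumption for illustration, as a dict from element identifier to either a dataset (a string) or another dict; `map_over_nested` is a hypothetical helper, not a Galaxy function.

```python
from typing import Callable, Union

# Hypothetical model: dict keys are element identifiers; Python dicts
# preserve insertion order, which stands in for element order.
Node = Union[str, dict]


def map_over_nested(tool: Callable[[str], str], node: Node) -> Node:
    """Apply `tool` to every leaf dataset, preserving the nesting."""
    if isinstance(node, dict):
        return {
            identifier: map_over_nested(tool, child)
            for identifier, child in node.items()
        }
    return tool(node)


nested = {
    "run1": {"forward": "r1_f", "reverse": "r1_r"},
    "run2": {"forward": "r2_f", "reverse": "r2_r"},
}
trimmed = map_over_nested(lambda dataset: dataset + ".trimmed", nested)
```

The output has the same shape and identifiers as the input at every level.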

For tools with multiple data inputs, a collection can be mapped over one input while the other inputs receive individual datasets. The dataset that is not mapped over serves as the same input for every execution.

Examples
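
As a rough sketch of a tool with two inputs, where only one is mapped over: the `align` toy tool, the `map_with_shared_input` helper, and the file names are all hypothetical and only illustrate the shape of the operation.

```python
from typing import Callable


def align(reads: str, reference: str) -> str:
    """Toy two-input tool."""
    return f"{reads} aligned to {reference}"


def map_with_shared_input(
    tool: Callable[[str, str], str],
    collection: list[tuple[str, str]],
    shared: str,
) -> list[tuple[str, str]]:
    """Map `collection` over the first input; reuse `shared` for every call."""
    return [(identifier, tool(dataset, shared)) for identifier, dataset in collection]


reads = [("sample1", "reads1.fastq"), ("sample2", "reads2.fastq")]
alignments = map_with_shared_input(align, reads, "genome.fasta")
```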

If a tool consumes two input datasets and produces one output dataset, you can map two collections with identical structure (same element identifiers in the same order) over the respective inputs and the result is an implicit collection with the same structure as the inputs and where each output in the implicit collection corresponds to the tool being executed with the two inputs corresponding to that position in the input collections.

By default the collections are linked, and mapping them over the tool's inputs acts like a flat map or a dot product: the resulting collections gain no extra dimensionality.

From a user perspective this means if you start with a collection and apply a bunch of map over operations on tools - the results will all continue to match and work together very naturally - again without extra work by the user and without extra knowledge by the tool author.

Examples
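
The linked ("dot product") behavior can be sketched as follows. The `map_linked` helper and the choice to raise `ValueError` on mismatched identifiers are assumptions of this sketch; how Galaxy itself reports a structure mismatch is not stated in this document.

```python
from typing import Callable

Collection = list[tuple[str, str]]  # hypothetical (identifier, dataset) pairs


def map_linked(
    tool: Callable[[str, str], str], left: Collection, right: Collection
) -> Collection:
    """Pair elements by position; requires identical identifiers in order."""
    left_ids = [identifier for identifier, _ in left]
    right_ids = [identifier for identifier, _ in right]
    if left_ids != right_ids:
        raise ValueError("collections do not have identical structure")
    return [
        (identifier, tool(a, b))
        for (identifier, a), (_, b) in zip(left, right)
    ]


def join(a: str, b: str) -> str:
    """Toy two-input tool."""
    return f"{a}+{b}"


left = [("s1", "a1"), ("s2", "a2")]
right = [("s1", "b1"), ("s2", "b2")]
merged = map_linked(join, left, right)
```

Two inputs of length two give two outputs, not four: the result has the same structure as each input rather than a cross product.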

## Reduction

Not all tool executions result in implicit collections and mapping over inputs. Tool inputs of type data_collection can consume collections directly and do not necessarily result in mapping over.

Tools that consume collections and output datasets effectively reduce the dimension of the Galaxy data structure. When used at runtime this is often referred to as a "reduction" in the code.

Examples

For nested collections where each rank is a list or a paired collection, collection inputs must match every part of the collection type in the input definition.

Examples

In addition to explicit collection inputs, tool inputs of type data where multiple="true" can consume lists directly. This is likewise a "reduction" and does not result in implicit collection creation.

Examples

Paired collections cannot be reduced this way. paired is not meant to represent a list/array/vector data structure - it is more like a tuple.

Examples
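
A small sketch of the two paragraphs above: a `multiple="true"` data input consumes a whole list in one execution, while a paired collection is rejected. The `reduce_with_multiple_data` helper, the `concatenate` toy tool, and the use of `TypeError` are hypothetical. The sketch only covers the flat `list` and `paired` cases discussed here.

```python
from typing import Callable


def concatenate(datasets: list[str]) -> str:
    """Toy tool with a multiple-data input: many datasets in, one out."""
    return "".join(datasets)


def reduce_with_multiple_data(
    tool: Callable[[list[str]], str], collection_type: str, datasets: list[str]
) -> str:
    """Run `tool` once over all elements; no implicit collection is created."""
    if collection_type != "list":
        # Simplification: this sketch accepts only a flat list.
        raise TypeError(f"{collection_type} cannot be reduced by a multiple data input")
    return tool(datasets)


combined = reduce_with_multiple_data(concatenate, "list", ["AA", "CC", "GG"])
```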

## Sub-collection Mapping

Examples

The natural extension of multiple data input parameters consuming list collections as described above when discussing reductions is that nested lists of lists (list:list) can be mapped over a multiple data input parameter. Each nested list will be reduced by this operation but the results will be mapped over. The result will be a list with the same structure as the outer list of the input collection.

Examples
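
Mapping a `list:list` over a multiple data input can be sketched as "reduce each inner list, map over the outer list". The dict-based model and the `map_reduce_inner` helper below are assumptions made for illustration only.

```python
from typing import Callable


def concatenate(datasets: list[str]) -> str:
    """Toy tool with a multiple-data input."""
    return "".join(datasets)


def map_reduce_inner(
    tool: Callable[[list[str]], str], nested: dict[str, dict[str, str]]
) -> dict[str, str]:
    """One tool execution per outer element; each inner list is reduced."""
    return {
        outer_id: tool(list(inner.values()))
        for outer_id, inner in nested.items()
    }


nested = {
    "groupA": {"a1": "x", "a2": "y"},
    "groupB": {"b1": "z"},
}
per_group = map_reduce_inner(concatenate, nested)
```

The result is a flat list keyed by the outer identifiers (`groupA`, `groupB`), matching the structure of the outer list of the input.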

Just as a paired collection won't be reduced by a multiple data input, any sort of nested collection ending in a paired collection cannot be mapped over such an input. So a multiple data input parameter cannot be mapped over by a list of pairs (list:paired) for instance.

Examples

## paired_or_unpaired Collections

The collection type paired_or_unpaired is meant to serve as a stand-in for an entity that can be either a single dataset or what is effectively a paired dataset collection. These collections either have one element with identifier unpaired or two elements with identifiers forward and reverse.

Tools can declare a data_collection input with collection type paired_or_unpaired and that input will consume either an explicit paired_or_unpaired collection normally or can consume a paired input.

Examples

The inverse intentionally does not work. In some ways a paired collection acts as a paired_or_unpaired collection, but a paired_or_unpaired collection is not a paired collection. This makes sense in terms of tools: a tool consuming a paired dataset expects to find both a forward and a reverse element, but these may not exist in a paired_or_unpaired collection.

Examples
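
The asymmetry can be illustrated by looking only at element identifiers. The two predicate functions below are hypothetical helpers written for this note; they check identifier sets and nothing else.

```python
def is_valid_paired(identifiers: list[str]) -> bool:
    """A paired collection has exactly a forward and a reverse element."""
    return sorted(identifiers) == ["forward", "reverse"]


def is_valid_paired_or_unpaired(identifiers: list[str]) -> bool:
    """Either one `unpaired` element, or a forward and a reverse element."""
    return sorted(identifiers) in (["unpaired"], ["forward", "reverse"])


pair = ["forward", "reverse"]
single = ["unpaired"]
```

Every identifier set that passes `is_valid_paired` also passes `is_valid_paired_or_unpaired`, but `single` passes only the latter, which is why a paired_or_unpaired collection cannot stand in for a paired one.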

The same logic holds for mapping: lists of paired datasets (list:paired) can be mapped over paired_or_unpaired inputs, but mixed lists of pairs (list:paired_or_unpaired) cannot be mapped over a paired input. Following the same logic, list:paired_or_unpaired cannot be mapped over a list input or a multiple data input.

Examples

This logic extends naturally into higher dimensional collections. A list:list:paired can be mapped over either a paired_or_unpaired input to produce a nested list (list:list) or a list:paired_or_unpaired input to produce a flat list (list).

Examples
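
One way to reason about these cases is to match the input's collection type against the innermost ranks of the collection and keep the remaining outer ranks as the output type. The sketch below is a simplified model written for this note, not Galaxy's type-matching code. The function names are hypothetical, and the `single_datasets` mapping of plain lists is deliberately not modeled.

```python
from typing import Optional


def rank_matches(input_rank: str, collection_rank: str, is_deepest: bool) -> bool:
    """A paired rank may stand in for paired_or_unpaired at the deepest rank only."""
    if input_rank == collection_rank:
        return True
    return (
        is_deepest
        and input_rank == "paired_or_unpaired"
        and collection_rank == "paired"
    )


def mapped_output_type(collection_type: str, input_type: str) -> Optional[str]:
    """Return the implicit output type, or None if this sketch finds no match."""
    collection_ranks = collection_type.split(":")
    input_ranks = input_type.split(":")
    outer_count = len(collection_ranks) - len(input_ranks)
    if outer_count <= 0:
        return None  # no outer ranks left to map over
    inner_ranks = collection_ranks[outer_count:]
    deepest = len(input_ranks) - 1
    for position, (input_rank, collection_rank) in enumerate(
        zip(input_ranks, inner_ranks)
    ):
        if not rank_matches(input_rank, collection_rank, position == deepest):
            return None
    return ":".join(collection_ranks[:outer_count])


nested_result = mapped_output_type("list:list:paired", "paired_or_unpaired")
flat_result = mapped_output_type("list:list:paired", "list:paired_or_unpaired")
rejected = mapped_output_type("list:paired_or_unpaired", "paired")
```

Under this model `nested_result` is `list:list`, `flat_result` is `list`, and `rejected` is `None`, in line with the cases described above.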

In order for paired_or_unpaired collections to also act as a single dataset, a flat list can be mapped over such an input with a special sub-collection mapping type of 'single_datasets'.

Examples
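
A rough sketch of the `single_datasets` idea: each dataset in a flat list is presented to the tool as if it were a one-element paired_or_unpaired collection. Wrapping the dataset under an `unpaired` identifier is an assumption of this sketch, as are the `describe` toy tool and the `map_single_datasets` helper.

```python
from typing import Callable


def describe(collection: dict[str, str]) -> str:
    """Toy tool that accepts a paired_or_unpaired collection."""
    if "unpaired" in collection:
        return f"single:{collection['unpaired']}"
    return f"pair:{collection['forward']}+{collection['reverse']}"


def map_single_datasets(
    tool: Callable[[dict[str, str]], str], flat_list: dict[str, str]
) -> dict[str, str]:
    """Map a flat list over a paired_or_unpaired input, one dataset per call."""
    return {
        identifier: tool({"unpaired": dataset})
        for identifier, dataset in flat_list.items()
    }


flat = {"sample1": "reads1", "sample2": "reads2"}
described = map_single_datasets(describe, flat)
```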

This treatment of lists without pairing extends to nested structures naturally. For instance, a list of list of datasets (list:list) can be mapped over a paired_or_unpaired input to produce a nested list of lists (list:list) with a structure matching the input. Likewise, the nested list can be mapped over a list:paired_or_unpaired input to produce a flat list with the same structure as the outer list of the input.

Examples

Due only to limited implementation time, the special casing that allows paired_or_unpaired to act as both datasets and paired collections only works when it is the deepest collection type. So while list:paired can be consumed by a list:paired_or_unpaired input, a paired:list cannot be consumed by a paired_or_unpaired:list input, though it should be able to for consistency. We have focused our time on data structures more likely to be used in actual Galaxy analyses given current and guessed future usage.

## sample_sheet Collections

The collection type sample_sheet attaches typed, columnar metadata to each element of a dataset collection. For mapping and type matching purposes, sample_sheet behaves identically to list - it can be mapped over tool inputs, matched against list collection inputs, and composed with inner types (sample_sheet:paired, sample_sheet:paired_or_unpaired). The key asymmetry is that while a sample_sheet output can satisfy a list input, a list output cannot satisfy a sample_sheet input - sample sheets carry metadata that plain lists do not.

Examples

Sub-collection mapping works the same as for list composites. A sample_sheet:paired can be mapped over a paired collection input, extracting each inner pair and producing a sample_sheet implicit output.

Examples

The paired_or_unpaired integration rules carry over from list to sample_sheet. A flat sample_sheet can be mapped over a paired_or_unpaired input via single_datasets sub-collection mapping, and sample_sheet:paired can be mapped over paired_or_unpaired just as list:paired can.

Examples

The type matching is asymmetric: a sample_sheet output can satisfy a list input because sample sheets carry all the structural information lists have (plus metadata). However, a list output cannot satisfy a sample_sheet input because lists lack the column_definitions and per-element columns metadata that sample sheet consumers expect.

Examples
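
The asymmetric matching rule for flat collections can be written as a small predicate. `output_satisfies_input` is a hypothetical helper that covers only the flat `list` and `sample_sheet` types; composite types such as sample_sheet:paired are outside this sketch.

```python
def output_satisfies_input(output_type: str, input_type: str) -> bool:
    """True if a collection of `output_type` can feed an `input_type` input."""
    if output_type == input_type:
        return True
    # A sample sheet has everything a list has, plus metadata,
    # so it can be used where a list is expected, but not the reverse.
    return output_type == "sample_sheet" and input_type == "list"


sheet_to_list = output_satisfies_input("sample_sheet", "list")
list_to_sheet = output_satisfies_input("list", "sample_sheet")
```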
