# Galaxy `<discover_datasets>`

Cited Galaxy source pinned to `galaxyproject/galaxy@7765fae9` (XSD: `lib/galaxy/tool_util/xsd/galaxy.xsd`; parser: `lib/galaxy/tool_util/parser/output_collection_def.py`).

`<discover_datasets>` is Galaxy's mechanism for collecting outputs whose names or counts aren't knowable at tool-wrapper authoring time. A tool that emits "one BAM per chromosome", "every `*.report.tsv` in the working dir", or "whatever fell out of split-by-this-column" uses `<discover_datasets>` to tell Galaxy how to find them after the job completes.

Two parents, slightly different behavior:

| Parent | What `discover_datasets` populates | Result |
|---|---|---|
| `<data>` | The primary dataset's siblings (and optionally the primary itself with `assign_primary_output="true"`) | Multiple history items derived from one `<data>` declaration |
| `<collection>` | The elements of the collection | A `list`, `paired`, `list:paired`, or arbitrarily-nested collection |

The convert Mold (`[[convert-nfcore-module-to-galaxy-tool]]`) reaches for `<discover_datasets>` whenever a Nextflow `output:` channel uses a glob (`path('*.bam')`) or names interpolated from runtime values; the corresponding Galaxy idiom needs to discover the matching files after the script runs.

## Two discovery modes

Set with `from_provided_metadata`. Default is `pattern`.

### `pattern` mode (default)

Galaxy scans the working directory (or a named subdirectory) and matches filenames against a regex.

```xml
```

The regex must match the filename (not the full path, unless `match_relative_path="true"`). Named groups inside the pattern feed Galaxy metadata about each discovered file; see *Named regex groups* below.

### `tool_provided_metadata` mode

The tool writes a `galaxy.json` (or equivalent) into the working directory listing each output's path, name, datatype, and metadata. Galaxy reads it verbatim.
```xml
```

Used when the regex idiom doesn't carry enough information, usually because the tool needs to set datatype per file, attach element identifiers from a non-filename source, or surface metadata Galaxy can't infer. `pattern` and `sort_by` are forbidden in this mode (the parser rejects them at load time per `output_collection_def.py:88-90`).

## Attribute reference

All attributes are optional. Pulled from `OutputDiscoverDatasetsCommon` in `galaxy.xsd:6698-6749`.

| Attribute | Type | Notes |
|---|---|---|
| `pattern` | regex | Filename pattern. May be a named pattern (`__name__`, …) or a literal regex with named groups. Forbidden when `from_provided_metadata="true"`. |
| `directory` | string | Working-dir-relative directory to scan. Default is the working dir itself. |
| `recurse` | bool | Walk `directory` recursively. Default `false`. |
| `match_relative_path` | bool | Match the regex against the path relative to `directory` (lets you embed path components in named groups). Default `false`: match filename only. |
| `format` / `ext` | datatype | Datatype for every discovered file. `format` is an alias for `ext`. Override per-file via a named `ext` regex group. |
| `sort_by` | string | `[reverse_][SORT_COMP_]SORTBY`. `SORTBY` ∈ {`filename`, `name`, `designation`, `dbkey`}; `SORT_COMP` ∈ {`lexical`, `numeric`}. Default `lexical_filename`. |
| `visible` | bool | History visibility of discovered datasets. Defaults to `false` per the XSD, but the XSD doc string explicitly warns "probably shouldn't be"; almost every IUC and Galaxy test tool sets `visible="true"`. |
| `from_provided_metadata` | bool | Switch to `tool_provided_metadata` mode. |

Additional on `<discover_datasets>` inside `<data>` (`OutputDiscoverDatasets` only):

| Attribute | Notes |
|---|---|
| `assign_primary_output` | Replace the parent `<data>`'s primary dataset with the first discovered match. Useful for tools where one of N outputs should be the canonical output. |

## Named patterns

Five string aliases that expand to regexes (`output_collection_def.py:31-37`):

| Alias | Expands to | Effect |
|---|---|---|
| `__default__` | `primary_DATASET_ID_(?P<designation>[^_]+)_(?P<visible>[^_]+)_(?P<ext>[^_]+)(_(?P<dbkey>[^_]+))?` | The historical Galaxy convention. Filename literally begins with `primary_DATASET_ID_…`. Avoid for new tools: too much is encoded in the filename. |
| `__name__` | `(?P<name>.*)` | Match anything; the filename becomes the element identifier. Datatype comes from `ext` / `format` on the `<discover_datasets>` element. |
| `__designation__` | `(?P<designation>.*)` | Same as `__name__` semantically, but the matched group is `designation` (the legacy term that still drives the `<discovered_dataset>` test assertion form). |
| `__name_and_ext__` | `(?P<name>.*)\.(?P<ext>[^\.]+)?` | Match `<name>.<ext>`; element identifier is the basename, datatype is the extension. The most common new-tool choice. |
| `__designation_and_ext__` | `(?P<designation>.*)\.(?P<ext>[^\._]+)?` | Same shape, but populates `designation` instead of `name`. Use when test fixtures need `<discovered_dataset>` to match; by convention, anywhere the test side uses `discovered_dataset`. |

Use a named pattern unless you need information from the filename that the aliases can't capture (per-file `dbkey`, multi-level nesting via `identifier_0`/`identifier_1`).

## Named regex groups Galaxy recognizes

Custom regexes feed Galaxy metadata through specific named groups:

| Group | Meaning |
|---|---|
| `name` | Element identifier (history-item name). |
| `designation` | Element identifier under the legacy term; same effect. |
| `ext` | Per-file datatype, overrides the element-level `ext`/`format`. |
| `dbkey` | Per-file dbkey (genome build). Defaults to the input's dbkey (`INPUT_DBKEY_TOKEN = "__input__"`). |
| `visible` | Per-file visibility override (boolean). |
| `identifier_0`, `identifier_1`, …, `identifier_N` | Inside `<collection>`: nested-level identifiers. `list:paired` uses `identifier_0` for the outer list identifier and `identifier_1` for the inner `forward`/`reverse`. The level count must match the collection type's nesting depth. |

Cited example, nested `list:paired` from a single discover sweep (`test/functional/tools/output_filter.xml:75`):

```xml
```

A file `p1.forward` becomes an element under outer identifier `p1`, inner identifier `forward`.

## Inside `<data>`: multiple outputs from one declaration

```xml
```

`assign_primary_output="true"` makes the *first* discovered match (under the configured `sort_by`) become the primary dataset that `name="sample"` resolves to in the test block; the rest become additional, sibling history items. Without `assign_primary_output`, Galaxy expects the `<data>` to be produced as normal (e.g. via `from_work_dir` or stdout redirection) and `discover_datasets` contributes only siblings.

Cited test case: `test/functional/tools/multi_output_assign_primary.xml:15`.

## Inside `<collection>`: discovered collection elements

```xml
```

The collection type determines how many `identifier_N` groups are required:

- `list` → `identifier_0`.
- `paired` → not usable directly; the two element identifiers are fixed (`forward`, `reverse`). Use the `__name__` form and rely on filenames matching `forward` and `reverse`, or declare each arm separately (rare).
- `list:paired` → `identifier_0` (outer list) + `identifier_1` (inner `forward`/`reverse`).
- Deeper nesting → add `identifier_2`, ….

Optional / variable-cardinality collections: combine `<discover_datasets>` with `<filter>` on the `<collection>` to gate the whole emit, or use `count`/`min`/`max` on the test side to assert cardinality.

## Test-side `<discovered_dataset>`

Inside `<test>` blocks, the discovered files are addressable for assertions through `<discovered_dataset>` nested under the matching `<output>` (`galaxy.xsd:2247-2297`):

```xml
```

The `designation` attribute matches the value of the `designation` (or `name`) named group from the pattern. `ftype` checks the inferred datatype. The first discovered match was hoisted to the primary output by `assign_primary_output="true"`, so its assertions live directly under `<output>`.
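The test-side addressing described above can be sketched as follows. This is a minimal illustration, not taken from the cited Galaxy test tools: the parameter name `input`, the fixture `samples.tsv`, and the designations `sampleA`/`sampleB` are all hypothetical, and it assumes a `<data name="sample">` output declared with `assign_primary_output="true"` and a `designation`-style pattern.

```xml
<tests>
    <test>
        <!-- Hypothetical input parameter and fixture file. -->
        <param name="input" value="samples.tsv"/>
        <!-- First discovered match, hoisted to the primary output:
             its assertions sit directly under <output>. -->
        <output name="sample" ftype="tabular">
            <assert_contents>
                <has_text text="sampleA"/>
            </assert_contents>
            <!-- Remaining matches are siblings, addressed by designation. -->
            <discovered_dataset designation="sampleB" ftype="tabular">
                <assert_contents>
                    <has_text text="sampleB"/>
                </assert_contents>
            </discovered_dataset>
        </output>
    </test>
</tests>
```

Which file ends up under `<output>` and which under `<discovered_dataset>` depends on the wrapper's `sort_by`, so a real test should pin the sort before asserting on contents.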
For dynamic-collection outputs, use `<output_collection>` instead (same shape, different addressing surface) and add `count`/`min`/`max` on the `<output_collection>` to assert cardinality (`galaxy.xsd:2085-2099`).

## Galaxy 24+: the `format` propagation change

Pre-24, `format` declared on the parent `<data>`/`<collection>` was **ignored** for discovered datasets; you had to set `ext`/`format` on the `<discover_datasets>` element itself (or else discovered files defaulted to `data`). Galaxy 24+ propagates the parent `format` if `<discover_datasets>` doesn't specify one (`galaxy.xsd:6265`).

Practical impact for the convert Mold: declaring `format` on the parent element is **necessary and sufficient** as long as the wrapper sets `profile="24.0"` or later. For the convert Mold's `profile="23.1"` default (chosen for broad compatibility), keep declaring `ext`/`format` on the `<discover_datasets>` element itself.

## Convert Mold posture: nf-core → Galaxy mapping

The Mold's §4 collapses to a set of rules driven by the Nextflow `output:` channel's **cardinality** first, then its glob shape. The trap to avoid: a leading `*` does not by itself imply a list. `path('*.bam')` (N files, one per upstream element) and `path('*.{bai,csi,crai}')` (exactly one file, alternation across mutually-exclusive extensions) look the same syntactically and map to different Galaxy idioms.

**Cardinality heuristic.** A process with `input: tuple(meta, path)` emits one item per invocation. Its `output:` channels are single unless the `script:` body explicitly fans out (loops writing N files, split tools, etc.). Glob shape (`*`, `*.{a,b}`, `${prefix}.*`) is about *filename uncertainty*, not *file count*. Walk the rules below in order; do not skip to Rule 3 just because you see an asterisk.

### Rule 1: single output, deterministic name → `<data>`, no discovery

Nextflow:

```nextflow
output:
    tuple val(meta), path("${prefix}.json"), emit: json
```

The output is exactly one file with a runtime-known but stable name. Galaxy:

```xml
```

`<discover_datasets>` is unnecessary: there is exactly one output at a known path.
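A minimal sketch of the Rule 1 shape, under stated assumptions: `some_tool` and its `--input`/`--prefix` flags are placeholders rather than a real CLI, and the wrapper is assumed to fix the prefix to `output` so the file lands at a literal path that `from_work_dir` can name.

```xml
<command><![CDATA[
    ## Placeholder invocation: the wrapper pins the prefix, so the
    ## output name is known statically even though upstream interpolates it.
    some_tool --input '$input' --prefix output
]]></command>
<outputs>
    <!-- One declared output at a known path; no discovery needed. -->
    <data name="json" format="json" from_work_dir="output.json"/>
</outputs>
```

Pinning the prefix in the wrapper is the design choice that makes the name static; if the upstream tool offered an explicit output-path flag, writing to `'$json'` directly would work equally well.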
### Rule 2: single output, variable extension (alternation glob) → upstream invocation + `mv` + `<change_format>`, no discovery

Nextflow:

```nextflow
output:
    tuple val(meta), path("*.{bai,csi,crai}"), emit: index
```

The channel emits **one** file; the extension depends on input format (CRAM → CRAI) or an args/param choice (BAM + `-c` → CSI, otherwise BAI). This is *not* a list.

**Canonical Galaxy shape: preserve the upstream invocation, capture with `mv`:**

```xml
```

Three moves:

1. **`ln -s` to a deterministic name preserving the input's extension.** Lets upstream extension-derivation logic fire identically to the nf-core module (CRAM input → CRAI output, etc.).
2. **Call the tool with the same positional / flag shape as the upstream `script:` body.** No extra output-path argument, no fork in the invocation. The reviewer's command-parity dimension passes trivially.
3. **`mv` the produced file to `'$output_name'` to capture it into Galaxy's per-output staging path.** Name the candidate extensions explicitly (`{bai,csi,crai}`) rather than using `.*` so the move can't sweep unrelated cwd files. Note that a bash brace list is expansion, not globbing: it yields all three names whether or not they exist, and `mv` with several sources requires a directory target, so guard the move with an existence check (or use an extglob pattern such as `@(bai|csi|crai)` under `shopt -s extglob`) so that only the file actually produced is moved.

The `<data>`'s declared `format=` is the default datatype; `<change_format>` (XSD: `OutputDataElement → change_format`) flips it based on the responsible input/param. Galaxy doesn't care what extension the on-disk file has: datatype is metadata, not filename inspection.

The canonical cited case is the nf-core `samtools/index` module: `path("*.{bai,csi,crai}")` → one Galaxy `<data>` with `<change_format>`, captured with `mv input.bam.{bai,csi,crai} '$index'`.

**Anti-pattern.** Mapping this shape to `<collection>` + `<discover_datasets>` produces a single-element collection wrapping a degenerate list: wrong cardinality contract for downstream workflow tools, wrong test shape (`<output_collection>` instead of `<output>`), and discovery overhead for a deterministic single file. If you find yourself writing `<discover_datasets>` inside a `<collection>` for a single-file emit, you are in this anti-pattern; rewrite to Rule 2.
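A hedged sketch of the capture step for the `samtools/index` case, covering only the BAM input path. The parameter names `input` and `index` are illustrative, and the sketch deliberately omits `<change_format>` because the exact Galaxy datatype names for each index flavour are not confirmed here. It uses an existence-checked loop instead of a literal brace list, since brace expansion produces all three names regardless of which file exists.

```xml
<command><![CDATA[
    ## Symlink to a deterministic name so the tool derives the index name itself.
    ln -s '$input' input.bam &&
    ## Same invocation shape as upstream: no output-path argument.
    samtools index input.bam &&
    ## Move whichever single index file was produced; shell variables are
    ## escaped so the template engine leaves them for bash.
    for ext in bai csi crai; do
        if [ -e "input.bam.\$ext" ]; then mv "input.bam.\$ext" '$index'; fi;
    done
]]></command>
<outputs>
    <!-- Default datatype only; a change_format child would flip it per input/param. -->
    <data name="index" format="bai"/>
</outputs>
```

The loop is slightly longer than a one-line `mv`, but it moves exactly one file and cannot fail on the names that were never written.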
**Variant: tool accepts an explicit output-path arg.** If the upstream tool takes a destination path as an arg or flag (`--output`, `-o`, second positional, etc.) and the nf-core `script:` body uses it, you can write directly to `'$output_name'` and skip the `mv`:

```xml
```

Treat this as a secondary form, not the default. The `mv` form is preferred because it preserves the upstream command shape byte-for-byte (better for the reviewer's command-parity dimension) and generalizes to tools that don't accept an output-path arg at all. Use the direct-`$output_name` form only when the upstream `script:` body itself uses the output-path arg; then you're mirroring upstream, not editing it.

**When `from_work_dir` is the right move instead.** If the upstream `script:` body always writes to a fixed, literal name in cwd (no `${prefix}` interpolation, no extension alternation), Rule 1 with `from_work_dir="literal.ext"` is simpler than Rule 2. Rule 2 exists for the case where the filename, or its extension, is not knowable from the wrapper's static configuration.

### Rule 3: multi-output, list cardinality (true glob) → `<collection>` + `<discover_datasets>`

Nextflow:

```nextflow
output:
    tuple val(meta), path('*.bam'), emit: bams
```

…and the `script:` body emits N BAMs keyed by element identifier (e.g. per-sample, per-chromosome). Galaxy:

```xml
```

`__name_and_ext__` captures the basename as the element identifier and the extension as datatype.

### Rule 4: multi-output, paired cardinality → `<collection>` + paired discovery

Nextflow:

```nextflow
output:
    tuple val(meta), path("*_R{1,2}.fastp.fastq.gz"), emit: reads
```

Galaxy:

```xml
```

This needs a custom regex; no named pattern captures the `R1`/`R2` split into `identifier_1`. The Mold's LLM step generates the regex from the file glob.

### Rule 5: `versions.yml` (the versions emit) → drop

The `versions:` channel has no Galaxy output analog; `<version_command>` covers it. Don't emit `<data>` or `<collection>` for it. See `[[nfcore-versions-emit-to-galaxy-version-command]]`.
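To round off the list-cardinality case from Rule 3, here is a minimal sketch of the output declaration together with its test-side counterpart. Everything specific is an assumption for illustration: the output name `bams`, the `split` subdirectory the hypothetical command writes into, the fixture `input.bam`, and the two element names `chr1`/`chr2`.

```xml
<outputs>
    <!-- One list element per file found in ./split after the job finishes;
         basename becomes the element identifier, extension the datatype. -->
    <collection name="bams" type="list" label="${tool.name} on ${on_string}: BAMs">
        <discover_datasets pattern="__name_and_ext__" directory="split"/>
    </collection>
</outputs>
<tests>
    <test>
        <param name="input" value="input.bam"/>
        <!-- count asserts cardinality; elements are addressed by identifier. -->
        <output_collection name="bams" type="list" count="2">
            <element name="chr1" ftype="bam"/>
            <element name="chr2" ftype="bam"/>
        </output_collection>
    </test>
</tests>
```

Scoping discovery to a dedicated subdirectory keeps the pattern from picking up staged inputs or scratch files in the working directory.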
## Pitfalls

- **`visible` defaults to false.** Always declare `visible="true"` unless you specifically want the discovered files hidden from history. Forgetting it is the most common authoring slip.
- **`assign_primary_output` only on `<data>`, not `<collection>`.** The collection version doesn't accept it; the XSD enforces this (`OutputCollectionDiscoverDatasets` doesn't include the attribute).
- **`from_provided_metadata="true"` is exclusive with `pattern` and `sort_by`.** Specify one or the other, never both.
- **Glob in `from_work_dir` ≠ discovery.** `from_work_dir="*.bam"` is invalid; that attribute takes a literal path. For globs you must use `<discover_datasets>`.
- **`__name__` vs `__designation__`.** The only difference is which named group the regex captures. Use `__designation__` (and `<discovered_dataset>` on the test side) when matching IUC convention for tests; use `__name__` when the surrounding wrapper already uses `name`-style identifiers.
- **Sort matters for `assign_primary_output`.** Default sort is lexical filename. If your wrapper expects a specific file as primary, pin the sort (`sort_by="numeric_name"`, `sort_by="reverse_lexical_designation"`, etc.); don't rely on luck.
- **Nested collections need every `identifier_N`.** A `list:paired` discover that names only `identifier_0` fails at runtime; you need both levels covered by the regex.
- **Custom regexes match filename only by default.** To match subdirectory components, set `match_relative_path="true"` and embed the path in named groups; otherwise scope with `directory="subdir"` + `recurse="false"`.

## See also

- `[[convert-nfcore-module-to-galaxy-tool]]`: Mold that consumes this reference when emitting outputs.
- `[[nfcore-channel-input-to-galaxy-collection]]`: companion, covering how to map input channels to data / collection params.
- `[[galaxy-collection-semantics]]`: what map-over / reduction does to a collection at workflow time.
- `[[planemo-asserts-idioms]]`: how to write assertions inside test output bodies.
- Galaxy XSD: `lib/galaxy/tool_util/xsd/galaxy.xsd` (authoritative attribute grammar).
- Galaxy test tools: `test/functional/tools/multi_output*.xml`, `output_filter.xml`, `discover_sort_by.xml`, `collection_creates_dynamic_*.xml` (exhaustive coverage of supported shapes).
- Planemo writing-advanced docs: "Multiple output files" (narrative tutorial).