Galaxy `<discover_datasets>`

Cited Galaxy source pinned to galaxyproject/galaxy@7765fae9 (XSD: lib/galaxy/tool_util/xsd/galaxy.xsd; parser: lib/galaxy/tool_util/parser/output_collection_def.py).

<discover_datasets> is Galaxy’s mechanism for collecting outputs whose names or counts aren’t knowable at tool-wrapper authoring time. A tool that emits “one BAM per chromosome”, “every *.report.tsv in the working dir”, “whatever fell out of split-by-this-column” — uses <discover_datasets> to tell Galaxy how to find them after the job completes.

Two parents, slightly different behavior:

Parent	What discover_datasets populates	Result
`<data>`	The primary dataset’s siblings (and optionally the primary itself with `assign_primary_output="true"`)	Multiple history items derived from one `<data>` declaration
`<collection>`	The elements of the collection	A `list`, `paired`, `list:paired`, or arbitrarily-nested collection

The convert Mold ([[convert-nfcore-module-to-galaxy-tool]]) reaches for <discover_datasets> inside <collection> whenever a Nextflow output: channel uses a glob (path('*.bam')) or names interpolated from runtime values; the corresponding Galaxy idiom needs to discover the matching files after the script runs.

Two discovery modes

Set with from_provided_metadata. Default is pattern.

`pattern` mode (default)

Galaxy scans the working directory (or a named subdirectory) and matches filenames against a regex.

<discover_datasets pattern="(?P<designation>.+)\.report\.tsv" ext="tabular" visible="true"/>

The regex must match the filename (not the full path, unless match_relative_path="true"). Named groups inside the pattern feed Galaxy metadata about each discovered file — see Named groups below.

`tool_provided_metadata` mode

The tool writes a galaxy.json (or equivalent) into the working directory listing each output’s path, name, datatype, and metadata. Galaxy reads it verbatim.

<discover_datasets from_provided_metadata="true" visible="true"/>

Used when the regex idiom doesn’t carry enough information — usually because the tool needs to set datatype per file, attach element identifiers from a non-filename source, or surface metadata Galaxy can’t infer. pattern and sort_by are forbidden in this mode (the parser rejects them at load time per output_collection_def.py:88-90).

Attribute reference

All attributes are optional. Pulled from OutputDiscoverDatasetsCommon in galaxy.xsd:6698-6749.

Attribute	Type	Notes
`pattern`	regex	Filename pattern. May be a named pattern (`__name__`, …) or a literal regex with named groups. Forbidden when `from_provided_metadata="true"`.
`directory`	string	Working-dir-relative directory to scan. Default is the working dir itself.
`recurse`	bool	Walk `directory` recursively. Default `false`.
`match_relative_path`	bool	Match the regex against the path relative to `directory` (lets you embed path components in named groups). Default `false` — match filename only.
`format` / `ext`	datatype	Datatype for every discovered file. `format` is an alias for `ext`. Override per-file via a named `ext` regex group.
`sort_by`	string	`[reverse_][SORT_COMP_]SORTBY`. `SORTBY` ∈ {`filename`, `name`, `designation`, `dbkey`}; `SORT_COMP` ∈ {`lexical`, `numeric`}. Default `lexical_filename`.
`visible`	bool	History visibility of discovered datasets. Defaults to `false` per the XSD, but the XSD doc string explicitly warns “probably shouldn’t be” — almost every IUC and Galaxy test tool sets `visible="true"`.
`from_provided_metadata`	bool	Switch to `tool_provided_metadata` mode.

Additional on <discover_datasets> inside <data> (OutputDiscoverDatasets only):

Attribute	Notes
`assign_primary_output`	Replace the parent `<data>`’s primary dataset with the first discovered match. Useful for tools where one of N outputs should be the canonical output.

Named patterns

Five string aliases that expand to regexes (output_collection_def.py:31-37):

Alias	Expands to	Effect
`__default__`	`primary_DATASET_ID_(?P<designation>[^_]+)_(?P<visible>[^_]+)_(?P<ext>[^_]+)(_(?P<dbkey>[^_]+))?`	The historical Galaxy convention. Filename literally begins with `primary_DATASET_ID_…`. Avoid for new tools — too much encoded in the filename.
`__name__`	`(?P<name>.*)`	Match anything; the filename becomes the element identifier. Datatype comes from `ext` / `format` on the `<discover_datasets>` element.
`__designation__`	`(?P<designation>.*)`	Same as `__name__` semantically, but the matched group is `designation` (the legacy term that still drives the `<discovered_dataset designation="…">` test assertion form).
`__name_and_ext__`	`(?P<name>.*)\.(?P<ext>[^\.]+)?`	Match `<basename>.<extension>`; element identifier is the basename, datatype is the extension. The most common new-tool choice.
`__designation_and_ext__`	`(?P<designation>.*)\.(?P<ext>[^\._]+)?`	Same shape, but populates `designation` instead of `name`. Use when test fixtures need `<discovered_dataset designation="…">` to match — by convention, anywhere the test side uses `discovered_dataset`.

Use a named pattern unless you need information from the filename that the aliases can’t capture (per-file dbkey, multi-level nesting via identifier_0/identifier_1).

Named regex groups Galaxy recognizes

Custom regexes feed Galaxy metadata through specific named groups:

Group	Meaning
`name`	Element identifier (history-item name).
`designation`	Element identifier under the legacy term; same effect.
`ext`	Per-file datatype, overrides the element-level `ext`/`format`.
`dbkey`	Per-file dbkey (genome build). Defaults to the input’s dbkey (`INPUT_DBKEY_TOKEN = "__input__"`).
`visible`	Per-file visibility override (boolean).
`identifier_0`, `identifier_1`, …, `identifier_N`	Inside `<collection>`: nested-level identifiers. `list:paired` uses `identifier_0` for the outer list identifier and `identifier_1` for the inner `forward`/`reverse`. The level count must match the collection type’s nesting depth.

Cited example — nested list:paired from a single discover sweep (test/functional/tools/output_filter.xml:75):

<collection name="paired_list_output" type="list:paired" label="paired list">
  <discover_datasets pattern="(?P<identifier_0>p[12])\.(?P<identifier_1>.*)" ext="txt" visible="true"/>
</collection>

A file p1.forward becomes an element under outer identifier p1, inner identifier forward.

Inside `<data>` — multiple outputs from one declaration

<data format="tabular" name="sample">
  <discover_datasets pattern="(?P<designation>.+)\.report\.tsv" ext="tabular" visible="true" assign_primary_output="true"/>
</data>

assign_primary_output="true" makes the first discovered match (under the configured sort_by) become the primary dataset that name="sample" resolves to in the test block; the rest become additional, sibling history items. Without assign_primary_output, Galaxy expects the <data> to be produced as normal (e.g. via from_work_dir or stdout redirection) and discover_datasets contributes only siblings.

Cited test case: test/functional/tools/multi_output_assign_primary.xml:15.

Inside `<collection>` — discovered collection elements

<collection name="list_output" type="list" label="List">
  <discover_datasets pattern="(?P<identifier_0>[45])" ext="txt" visible="true"/>
</collection>

The collection type determines how many identifier_N groups are required:

list → identifier_0.
paired → not usable directly; the two element identifiers are fixed (forward, reverse). Use the __name__ form and rely on filenames matching forward and reverse, or split with a <discover_datasets> per arm (rare).
list:paired → identifier_0 (outer list) + identifier_1 (inner forward/reverse).
Deeper nesting → add identifier_2, ….

Optional / variable-cardinality collections: combine <discover_datasets> with <filter> on the <collection> to gate the whole emit, or use an output count/min/max on the <test> side to assert cardinality.

Test-side `<discovered_dataset>`

Inside <test> blocks, the discovered files are addressable for assertions through <discovered_dataset designation="…"> nested under the matching <output> (galaxy.xsd:2247-2297):

<test>
  <param name="num_param" value="7"/>
  <param name="input" ftype="txt" value="simple_line.txt"/>
  <output name="sample">
    <assert_contents><has_line line="1"/></assert_contents>
    <discovered_dataset designation="sample2" ftype="tabular">
      <assert_contents><has_line line="2"/></assert_contents>
    </discovered_dataset>
    <discovered_dataset designation="sample3" ftype="tabular">
      <assert_contents><has_line line="3"/></assert_contents>
    </discovered_dataset>
  </output>
</test>

The designation attribute matches the value of the designation (or name) named group from the pattern. ftype checks the inferred datatype. The first discovered match was hoisted to the primary output by assign_primary_output="true", so its <assert_contents> lives directly under <output>.

For dynamic-collection outputs, use <element name="…"> instead — same shape, different addressing surface — and add count/min/max on the <output> to assert cardinality (galaxy.xsd:2085-2099).

Galaxy 24+ — the `format` propagation change

Pre-24, format declared on the parent <data>/<collection> was ignored for discovered datasets; you had to set ext/format on the <discover_datasets> element itself (or else discovered files defaulted to data). Galaxy 24+ propagates the parent format if <discover_datasets> doesn’t specify one (galaxy.xsd:6265).

Practical impact for the convert Mold: declaring format on the parent <collection> is necessary and sufficient as long as the wrapper sets profile="24.0" or later. For the convert Mold’s profile="23.1" default (chosen for broad compatibility), keep declaring ext/format on the <discover_datasets> element itself.

Convert Mold posture — nf-core → Galaxy mapping

The Mold’s §4 (Translate <outputs>) collapses to a set of rules driven by the Nextflow output: channel’s cardinality first, then its glob shape. The trap to avoid: a leading * does not by itself imply a list. path('*.bam') (N files, one per upstream element) and path('*.{bai,csi,crai}') (exactly one file, alternation across mutually-exclusive extensions) look the same syntactically and map to different Galaxy idioms.

Cardinality heuristic. A process with input: tuple(meta, path) emits one item per invocation. Its output: channels are single unless the script: body explicitly fans out (loops writing N files, split tools, etc.). Glob shape (*, *.{a,b}, ${prefix}.*) is about filename uncertainty, not file count. Walk the rules below in order; do not skip to Rule 3 just because you see an asterisk.

Rule 1 — single output, deterministic name → `<data from_work_dir="…">`, no discovery

Nextflow:

output:
tuple val(meta), path("${prefix}.json"), emit: json

The output is exactly one file with a runtime-known but stable name. Galaxy:

<data name="json" format="json" from_work_dir="${prefix}.json"/>

<discover_datasets> is unnecessary — there is exactly one output at a known path.

Rule 2 — single output, variable extension (alternation glob) → upstream invocation + `mv` + `<change_format>`, no discovery

Nextflow:

output:
tuple val(meta), path("*.{bai,csi,crai}"), emit: index

The channel emits one file; the extension depends on input format (CRAM → CRAI) or an args/param choice (BAM + -c → CSI, otherwise BAI). This is not a list.

Canonical Galaxy shape — preserve the upstream invocation, capture with mv:

<command><![CDATA[
    ln -s '$input' 'input.${input.ext}' &&
    samtools index
        -@ \${GALAXY_SLOTS:-1}
        #if $input.ext != 'cram' and $index_format == 'csi':
            -c
        #end if
        $extra_args
        'input.${input.ext}'
    &&
    mv 'input.${input.ext}'.{bai,csi,crai} '$index'
]]></command>
<outputs>
    <data name="index" format="bai">
        <change_format>
            <when input="index_format" value="csi" format="csi"/>
            <when input_dataset="input" attribute="ext" value="cram" format="crai"/>
        </change_format>
    </data>
</outputs>

Three moves:

ln -s to a deterministic name preserving the input’s extension. Lets upstream extension-derivation logic fire identically to the nf-core module (CRAM input → CRAI output, etc.).
Call the tool with the same positional / flag shape as the upstream script: body. No extra output-path argument, no fork in the invocation. The reviewer’s command-parity dimension passes trivially.
mv <tight-glob> '$output_name' to capture into Galaxy’s per-output staging path. Use a bash brace glob ({bai,csi,crai}) rather than .* so the move can’t sweep unrelated cwd files.

The <data>’s declared format= is the default datatype; <change_format> (XSD: OutputDataElement → change_format) flips it based on the responsible input/param. Galaxy doesn’t care what extension the on-disk file has — datatype is metadata, not filename inspection.

The canonical cited case is the nf-core samtools/index module: path("*.{bai,csi,crai}") → one Galaxy <data name="index" format="bai"> with <change_format>, captured with mv input.bam.{bai,csi,crai} '$index'.

Anti-pattern. Mapping this shape to <collection type="list"> + <discover_datasets> produces a single-element collection wrapping a degenerate list — wrong cardinality contract for downstream workflow tools, wrong test shape (<output_collection> instead of <output>), and pulls in discovery overhead for a deterministic single file. If you find yourself writing <discover_datasets pattern=".+\.(bai|csi|crai)"/> inside a <collection> for a single-file emit, you are in this anti-pattern; rewrite to Rule 2.

Variant — tool accepts an explicit output-path arg. If the upstream tool takes a destination path as an arg or flag (--output, -o, second positional, etc.) and the nf-core script: body uses it, you can write directly to '$<output_name>' and skip the mv:

<command><![CDATA[
    seqkit grep $extra_args -o '$filter' '$sequence'
]]></command>

Treat this as a secondary form, not the default. The mv form is preferred because it preserves the upstream command shape byte-for-byte (better for the reviewer’s command-parity dimension) and generalizes to tools that don’t accept an output-path arg at all. Use the direct-$output form only when the upstream script: body itself uses the output-path arg — then you’re mirroring upstream, not editing it.

When from_work_dir is the right move instead. If the upstream script: body always writes to a fixed, literal name in cwd (no ${prefix} interpolation, no extension alternation), Rule 1 with from_work_dir="literal.ext" is simpler than Rule 2. Rule 2 exists for the case where the filename — or its extension — is not knowable from the wrapper’s static configuration.

Rule 3 — multi-output, list cardinality (true glob) → `<collection type="list">` + `<discover_datasets>`

Nextflow:

output:
tuple val(meta), path('*.bam'), emit: bams

…and the script: body emits N BAMs keyed by element identifier (e.g. per-sample, per-chromosome). Galaxy:

<collection name="bams" type="list" format="bam">
  <discover_datasets pattern="__name_and_ext__" directory="." visible="true"/>
</collection>

__name_and_ext__ captures the basename as the element identifier and the extension as datatype.

Rule 4 — multi-output, paired cardinality → `<collection type="paired">` + paired discovery

Nextflow:

output:
tuple val(meta), path("*_R{1,2}.fastp.fastq.gz"), emit: reads

Galaxy:

<collection name="reads" type="paired" format="fastqsanger.gz">
  <discover_datasets pattern="(?P<name>.+)_R(?P<identifier_1>[12])\..*\.fastq\.gz" ext="fastqsanger.gz" visible="true"/>
</collection>

This needs a custom regex; no named pattern captures the R1/R2 split into identifier_1. The Mold’s LLM step generates the regex from the file glob.

Rule 5 — `versions.yml` (the versions emit) → drop

The versions: channel has no Galaxy analog (<version_command> covers it). Don’t emit <data> or <collection> for it. See [[nfcore-versions-emit-to-galaxy-version-command]].

Pitfalls

visible defaults to false. Always declare visible="true" unless you specifically want the discovered files hidden from history. Forgetting it is the most common authoring slip.
assign_primary_output only on <data>, not <collection>. The collection version doesn’t accept it; the XSD enforces this (OutputCollectionDiscoverDatasets doesn’t include the attribute).
from_provided_metadata="true" is exclusive with pattern and sort_by. Specify either-or, never both.
Glob in from_work_dir ≠ discovery. from_work_dir="*.bam" is invalid; that attribute takes a literal path. For globs you must use <discover_datasets>.
__name__ vs __designation__. The only difference is which named group the regex captures. Use __designation__ (and <discovered_dataset designation="…"> on the test side) when matching IUC convention for tests; use __name__ when the surrounding wrapper already uses name-style identifiers.
Sort matters for assign_primary_output. Default sort is lexical filename. If your wrapper expects a specific file as primary, pin the sort (sort_by="numeric_name", sort_by="reverse_lexical_designation", etc.) — don’t rely on luck.
Nested collections need every identifier_N. A list:paired discover that names only identifier_0 fails at runtime; you need both levels covered by the regex.
Custom regexes match filename only by default. To match subdirectory components, set match_relative_path="true" and embed the path in named groups; otherwise scope with directory="subdir" + recurse="false".

Galaxy <discover_datasets>

Galaxy `<discover_datasets>`

Two discovery modes

`pattern` mode (default)

`tool_provided_metadata` mode

Attribute reference

Named patterns

Named regex groups Galaxy recognizes

Inside `<data>` — multiple outputs from one declaration

Inside `<collection>` — discovered collection elements

Test-side `<discovered_dataset>`

Galaxy 24+ — the `format` propagation change

Convert Mold posture — nf-core → Galaxy mapping

Rule 1 — single output, deterministic name → `<data from_work_dir="…">`, no discovery

Rule 2 — single output, variable extension (alternation glob) → upstream invocation + `mv` + `<change_format>`, no discovery

Rule 3 — multi-output, list cardinality (true glob) → `<collection type="list">` + `<discover_datasets>`

Rule 4 — multi-output, paired cardinality → `<collection type="paired">` + paired discovery

Rule 5 — `versions.yml` (the versions emit) → drop

Pitfalls

See also

Incoming References (4)

Galaxy <discover_datasets>

Two discovery modes

pattern mode (default)

tool_provided_metadata mode

Attribute reference

Named patterns

Named regex groups Galaxy recognizes

Inside <data> — multiple outputs from one declaration

Inside <collection> — discovered collection elements

Test-side <discovered_dataset>

Galaxy 24+ — the format propagation change

Convert Mold posture — nf-core → Galaxy mapping

Rule 1 — single output, deterministic name → <data from_work_dir="…">, no discovery

Rule 2 — single output, variable extension (alternation glob) → upstream invocation + mv + <change_format>, no discovery

Rule 3 — multi-output, list cardinality (true glob) → <collection type="list"> + <discover_datasets>

Rule 4 — multi-output, paired cardinality → <collection type="paired"> + paired discovery

Rule 5 — versions.yml (the versions emit) → drop

Pitfalls

See also

Incoming References (4)

Galaxy `<discover_datasets>`

`pattern` mode (default)

`tool_provided_metadata` mode

Inside `<data>` — multiple outputs from one declaration

Inside `<collection>` — discovered collection elements

Test-side `<discovered_dataset>`

Galaxy 24+ — the `format` propagation change

Rule 1 — single output, deterministic name → `<data from_work_dir="…">`, no discovery

Rule 2 — single output, variable extension (alternation glob) → upstream invocation + `mv` + `<change_format>`, no discovery

Rule 3 — multi-output, list cardinality (true glob) → `<collection type="list">` + `<discover_datasets>`

Rule 4 — multi-output, paired cardinality → `<collection type="paired">` + paired discovery

Rule 5 — `versions.yml` (the versions emit) → drop