Nextflow to Galaxy reference-data mapping

Mapping research for nextflow-summary-to-galaxy-reference-data. Once a Nextflow pipeline’s reference-data usage is classified per nextflow-reference-data-classification, this note pins the Galaxy-side translation: idioms available, the v1 posture, datatype defaults, the in-tool rebuild trade-off, and known representation gaps the brief should flag.

Galaxy side

Galaxy has multiple idioms for surfacing reference data. The bullets below are presented as available shapes; the recommendations that follow narrow them to the v1 posture.

dbkey-keyed cached lookups. Workflow inputs carry a dbkey annotation; tools consume an admin-pre-loaded data table indexed by dbkey (BWA indexes, GATK known-sites, dbSNP for common builds). Powerful when builds are common; opaque when not, and not every Galaxy server admin curates the same set.
Generic from_data_table / .loc lookups (no dbkey). Tools can use from_data_table against an admin-loaded .loc file keyed by name rather than dbkey — kraken2 databases, centrifuge databases, custom-keyed index tables. Same admin-install constraint as dbkey tables; the difference is just the indexing key.
Sample sheets (see galaxy-sample-sheet-collections). Files and typed metadata travel together in a sample_sheet-shaped collection with typed column_definitions. Variants: sample_sheet, sample_sheet:paired, sample_sheet:paired_or_unpaired, sample_sheet:record. Reference-side use case: a coordinated reference bundle can ride a sample_sheet:record row whose typed slots hold the FASTA, the GTF, the index tarball.
List of records. Sample sheets force a unique element identifier per row; a list of records can carry column-like metadata without that uniqueness restriction. Useful when reference-data rows naturally repeat or when the column set is not row-keyed. Peer research note tracked at jmchilton/foundry#225.
Explicit data inputs. Reference + index passed individually as workflow data inputs. The simplest shape; the v1 default.

Galaxy folks who care about reference-data UX increasingly dislike the cached/dbkey path. It’s invisible state that breaks when the admin hasn’t loaded the build, varies between Galaxy instances, and forces wrappers to special-case “is this dbkey cached or do I rebuild.” Workflows that ship to public Galaxy servers, small private instances, and cloud deployments can’t assume a shared cache.

Recommendations

If reference data can be cleanly refactored into natural workflow parameters, just do that. If a reference table or sample sheet is only ever keyed on a single input, just make the columns of that table direct inputs to the workflow.

Data tables (both dbkey and generic) are discouraged in v1. Both kinds require admin install, tool_data_table_conf.xml edits, and .loc files, which break the Foundry’s portability-first posture: a translated workflow should run on a stock Galaxy with user-uploaded inputs. A future v2 may layer them in for performance, but not for correctness.

Prefer in-tool rebuild over a workflow-tier build step. Under this posture, when an index input is unset, the corresponding Galaxy tool wrapper is expected to rebuild it from the FASTA inside the wrapper itself, mirroring the source pipeline’s compute-if-missing block. See Why “in-tool rebuild” below for the trade-off.
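
A minimal sketch of the compute-if-missing branch described above, written in Python rather than wrapper XML. The names `resolve_index`, `build_index`, and `fake_faidx` are hypothetical and exist only for illustration; a real wrapper would express the same branch in its own command template.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_index(
    fasta: Path,
    supplied_index: Optional[Path],
    build_index: Callable[[Path], Path],
) -> Path:
    """Return the index to use for `fasta`.

    A user-supplied index is used untouched. When none is supplied, the
    index is rebuilt from the FASTA via `build_index`, a hypothetical
    callable standing in for whatever indexing command the wrapper runs.
    """
    if supplied_index is not None:
        return supplied_index
    return build_index(fasta)


def fake_faidx(fasta: Path) -> Path:
    """Illustrative stand-in builder: derives the index path, runs no tool."""
    return fasta.with_name(fasta.name + ".fai")
```

With this shape the workflow exposes one optional index input per asset and no conditional precursor steps; the rebuild decision lives entirely inside the tool.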

Reference-producing workflows output explicit files. Nextflow patterns where reference data is updated as part of the workflow, or where workflow output is updating an admin-managed reference, need to be restructured so the output is an explicit Galaxy dataset or collection. The downstream consumer workflow takes that as input.

Workflows cannot bundle reference data — but inputs can. A Galaxy workflow file is a definition, not a payload. Reference data that fits in a directory or single file (and that does not contain absolute path references inside it) can be zipped, uploaded as an input dataset, and referenced from the workflow input. The workflow’s input documentation should describe how to obtain or assemble the bundle.
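
As a rough illustration of the bundling step, the sketch below zips a reference directory with relative member names so it can be uploaded as a single input dataset. The function name `bundle_reference_dir` is hypothetical, and the portability check is an assumption about one form of "absolute path reference": it only rejects symlinks with absolute targets and does not inspect file contents.

```python
import os
import zipfile
from pathlib import Path


def bundle_reference_dir(src_dir: Path, zip_path: Path) -> list[str]:
    """Zip `src_dir` into `zip_path` and return the member names written.

    Raises ValueError before writing anything if a symlink with an absolute
    target is found, since such a link would not resolve on another machine.
    Absolute paths embedded inside file contents are NOT detected here; that
    check is format-specific and left to the caller (assumption).
    """
    src_dir = Path(src_dir)
    paths = sorted(src_dir.rglob("*"))

    # First pass: refuse bundles that contain absolute symlinks.
    for path in paths:
        if path.is_symlink() and os.path.isabs(os.readlink(path)):
            raise ValueError(f"absolute symlink in bundle: {path}")

    # Second pass: store every regular file under a path relative to src_dir.
    members: list[str] = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            if path.is_file():
                arcname = path.relative_to(src_dir).as_posix()
                zf.write(path, arcname)
                members.append(arcname)
    return members
```

The returned member list is a convenient starting point for the workflow input documentation that describes what the bundle must contain.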

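Use specific datatypes when available; fall back to data. Workflows can consume reference inputs as the generic data (any) or directory types when nothing more specific applies, but a more specific datatype (fasta, fai, gtf, bwa_index, …) gives users format-shaped upload guidance, lets Galaxy sniff uploads correctly, and lets tools constrain what they accept. Check that the extension exists on the target Galaxy before relying on it: picard_dict, for example, is absent from the vendored sample.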

Common reference-data datatypes

Pulled from the vendored galaxy-datatypes-conf; consult that note for the canonical extension list and sniff order.

Asset	Galaxy `format`	Notes
Reference FASTA	`fasta`	`auto_compressed_types="gz,bz2"` — a `.fasta.gz` upload sniffs as `fasta`.
FASTA index	`fai`	Tabular subclass; emitted by `samtools faidx`.
Sequence dictionary	`data` (fallback)	No built-in `picard_dict` extension in the vendored sample; use `data` or a per-tool subtype and verify per Galaxy version. Emitted by `gatk4 CreateSequenceDictionary`.
BAM	`bam`	Index sibling: `bai`.
CRAM	`cram`	Index sibling: `crai`.
BCF	`bcf`	Binary VCF.
VCF (bgzip)	`vcf_bgzip`	Index sibling: `tbi`.
GTF	`gtf`	`auto_compressed_types="gz"`.
GFF / GFF3	`gff` / `gff3`	gff3 has `auto_compressed_types="gz,bz2"`.
BED	`bed`	Interval.
BWA index	`bwa_index`	Directory subclass — uploaded as a directory or extracted tarball.
BWA-MEM2 index	`bwa_mem2_index`	Directory subclass.
Bowtie color/base index	`bowtie_color_index`, `bowtie_base_index`	Directory subclass; not displayed in upload UI by default.
HMMER profile	`hmm2`, `hmm3`	Per HMMER version.
Kallisto index	`kallisto.idx`	Single binary file.
Generic HDF5	`h5` / `h5ad` / `loom`	Per consumer tool.
2bit	`data` (fallback)	UCSC twoBit format. No `2bit` extension in the vendored sample at this pin; use `data` and document the expected format.
Tabix index	`tbi`	Built-in; sibling index for bgzipped files such as `vcf_bgzip`.
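
The "specific datatype, else `data`" rule can be sketched as a small lookup. The asset labels used as dictionary keys and the function name `pick_format` are assumptions made for illustration; only the format strings come from the table above, and the mapping is a subset rather than a full copy.

```python
# Subset of the table above. Keys are illustrative asset labels (assumption);
# values are the Galaxy `format` strings the table lists.
KNOWN_FORMATS = {
    "reference_fasta": "fasta",
    "fasta_index": "fai",
    "gtf": "gtf",
    "bed": "bed",
    "vcf_bgzip": "vcf_bgzip",
    "bwa_index": "bwa_index",
    "bwa_mem2_index": "bwa_mem2_index",
}


def pick_format(asset_kind: str) -> tuple[str, bool]:
    """Return (galaxy_format, is_fallback) for an asset kind.

    Kinds with no known extension (for example a sequence dictionary or a
    2bit file, which the table marks as absent from the vendored sample)
    fall back to `data`. The boolean tells the caller to describe the
    expected format in the workflow input documentation and to flag the
    looseness on the brief.
    """
    fmt = KNOWN_FORMATS.get(asset_kind)
    if fmt is None:
        return "data", True
    return fmt, False
```

Returning the fallback flag alongside the format keeps the "flag the looseness" obligation attached to the decision instead of relying on a later pass to rediscover it.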

Why “in-tool rebuild” rather than “workflow-tier rebuild step”

Galaxy can also encode “if fasta_fai is absent, run samtools faidx as an explicit workflow step before alignment.” Two reasons to prefer in-tool over workflow-tier:

gxformat2 conditional steps are weak. Step-level when: exists but is awkward at scale. A workflow with 6 optional indexes turns into 6 conditional precursor steps and 6 if/else data routings into the consumer. The DAG explodes.
Wrapper-level rebuild is already common. Galaxy’s BWA, BWA-MEM2, GATK4, and Picard wrappers already handle “no index supplied” by building one. Leaning on this collapses the workflow surface and matches existing IWC convention. The reference-data Mold should not invent new wrapper behavior — it should match what existing wrappers already do.

When to deviate from in-tool rebuild

Heavy indexes, repeated use. A bwa-mem2 index used by 200 alignment steps shouldn’t rebuild 200 times. For these, an explicit workflow-tier “build index” step whose output fans out to every consumer is correct. Rule of thumb: if the same index would be rebuilt more than about three times, use a workflow-tier step; otherwise rebuild in-tool.
Index sharing across siblings. If two sibling workflows (WES + WGS) consume the same indexes and a user runs both, an explicit input lets them share. In-tool rebuild duplicates work but doesn’t break anything.
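
The two deviation rules above can be written down as a small decision function. This is a sketch under stated assumptions: the function name, the strategy labels, and the choice to check sharing before the rebuild count are all illustrative, and the default threshold of 3 is the approximate figure from the text rather than a measured value.

```python
def choose_index_strategy(
    rebuild_count: int,
    shared_across_workflows: bool = False,
    threshold: int = 3,
) -> str:
    """Pick how an index should be supplied to its consumers.

    rebuild_count: how many times an in-tool rebuild would run for this index.
    shared_across_workflows: True when sibling workflows consume the same
        index and the user is expected to run more than one of them.
    threshold: rebuild counts strictly above this move the build to a
        workflow-tier step (approximate rule of thumb).
    """
    if shared_across_workflows:
        # An explicit input lets sibling workflows reuse one built index.
        return "explicit_input"
    if rebuild_count > threshold:
        # One build step whose output fans out to every consumer.
        return "workflow_tier_step"
    return "in_tool_rebuild"
```

For example, `choose_index_strategy(200)` selects the workflow-tier step, while a single consumer stays with in-tool rebuild.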

Gaps in representation

This section is a self-check: does an LLM working from the nextflow-reference-data-classification taxonomy plus the Galaxy idioms and recommendations above have what it needs to make the per-asset shape decision? Open gaps where it does not:

In-tool rebuild capability discovery. The recommendation “wrapper rebuilds the index when absent” assumes the chosen Galaxy tool wrapper actually supports the “no index supplied, so build one” branch. The note asserts this is common for BWA / BWA-MEM2 / GATK4 / Picard but does not give the LLM a way to verify it for any specific tool or version. Until discover-shed-tool surfaces wrapper rebuild capability, the brief has to flag rebuild assumptions as unverified.
Bundle UX guidance for ≥4 coupled inputs. The recommendations cover why coordinated bundles are awkward (smrnaseq four-input case, key-expanded ten-plus-input case) but do not give the LLM a concrete shape to default to. sample_sheet:record is suggested but not exemplified for reference data; no IWC exemplar is cited as ground truth.
Detecting compute-if-missing in summary-nextflow. As of jmchilton/foundry#229, summary-nextflow emits structured reference_rebuilds[] entries (asset_param, guard, guard_params, builder, builder_outputs, fallback_for, evidence.confidence) for the negated-guard idiom — if (!<x>_in) { BUILDER(...); <asset> = BUILDER.out.<chan> } — that nf-core/sarek’s PREPARE_GENOME uses. Detection is name-agnostic (no PREPARE_GENOME literal), walks every workflow body (primary + subworkflows), and pairs with reference_assets[] for asset attribution via used_by. Confidence drops to medium when the guard mixes non-take, non-params.X locals (e.g. step, aligner). Known gap: nf-core/rnaseq-style positive-form rebuilds (if (<asset>) { unpack } else if (fasta_provided) { ch_<asset> = BUILDER(args).<chan> }) are not yet matched — reference_rebuilds[] is empty for those pipelines and the brief should flag affected assets as rebuild-unverified.
Multi-DB pick-list (funcscan-style) translation pattern. No worked Galaxy exemplar for “user picks 0..N optional reference databases, each gating its own analysis branch.” gxformat2 conditional + optional-input territory; the LLM can describe the choice but cannot anchor it to a known-good shape.
Reference-producing pipeline downstream contract. Reference-producing workflows (createtaxdb, references) output a bundle; whether downstream consumer workflows can take that bundle as a workflow input cleanly (single dataset, collection of files, collection of typed records) is unsettled. The brief should flag this as an open question per pipeline rather than picking unilaterally.
Cohort / panel data without rebuild. This covers panel-of-normals (PoN), germline-resource, and gnomAD-style large reference VCFs. The recommendations establish that they are explicit user inputs, but not how to surface “this is required only when somatic mode is enabled”. Galaxy’s required-when-conditional input affordance is weak, which is worth noting on the brief.
Datatype gaps in vendored datatypes_conf.xml.sample. picard_dict, 2bit, and several less-common index types do not appear in the vendored sample at the pinned SHA. Real Galaxy instances may register them via tool-bundled datatypes_conf.xml fragments, but the LLM cannot rely on them at design time. The brief should fall back to data plus a description for absent extensions and flag the looseness.
dbkey propagation through map-overs. Even with the no-cached-tables posture, dbkey may be set by a user upstream. The data-flow brief needs to know not to strip dbkey from element identifiers through map-over steps; the recommendation here pins the posture but does not yet exemplify the data-flow side.
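
To make the compute-if-missing gap above concrete, the sketch below shows one way a brief generator might turn reference_rebuilds[] entries into per-asset flags. The field names follow the list given in this note; the field types, the flattened `confidence` field (standing in for evidence.confidence), the function name, and every status label other than rebuild-unverified are assumptions. Treating any asset with no matching entry as rebuild-unverified is a deliberately conservative simplification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReferenceRebuild:
    # Field names follow this note; the types are assumptions.
    asset_param: str
    guard: str
    builder: str
    confidence: str  # stands in for evidence.confidence, e.g. "high" or "medium"
    guard_params: list[str] = field(default_factory=list)
    builder_outputs: list[str] = field(default_factory=list)
    fallback_for: Optional[str] = None


def rebuild_status(
    asset_params: list[str],
    rebuilds: list[ReferenceRebuild],
) -> dict[str, str]:
    """Map each asset parameter to a status string for the brief."""
    by_asset = {r.asset_param: r for r in rebuilds}
    status: dict[str, str] = {}
    for name in asset_params:
        entry = by_asset.get(name)
        if entry is None:
            # No detected rebuild, e.g. a positive-form pipeline that the
            # detector does not match yet.
            status[name] = "rebuild-unverified"
        elif entry.confidence == "high":
            status[name] = "rebuild-detected"
        else:
            # Hypothetical label for medium-confidence guards.
            status[name] = "rebuild-needs-review"
    return status
```

When reference_rebuilds[] is empty, every asset comes back as rebuild-unverified, which matches the flagging behaviour the note asks for on positive-form pipelines.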