Nextflow path/glob to Galaxy datatype mapping

Rules for mapping Nextflow path, glob, sample-sheet, and output filename evidence to Galaxy datatype extensions.

Revised: 2026-05-06, Rev: 1, component

Use this note when a Nextflow-to-Galaxy Mold needs a gxformat2 format value for a data input, collection element, or workflow output. nextflow-params-to-galaxy-inputs decides whether something is a dataset or collection; this note only decides datatype extension and confidence.

Evidence quality:

  • Corpus-observed claims cite pinned fixtures under $NEXTFLOW_FIXTURES, the shared clone at /Users/jxc755/projects/repositories/workflow-fixtures/pipelines/.
  • Foundry-internal claims cite existing Foundry notes and the summary-nextflow schema.
  • External-doc claims cite Nextflow, nf-schema, and Galaxy registry docs.
  • Design inference states the translation posture this Foundry note recommends.

Registry constraint

Do not invent Galaxy extensions. galaxy-datatypes-conf and its adjacent datatypes_conf.xml.sample are the registry. If an extension is neither registered directly nor generated by auto_compressed_types, omit format.

For gxformat2 inputs, type: data or type: collection is required, but format is optional. Omit format when confidence is low rather than using a weak guess. Use the generic Galaxy data extension only as a last resort when a consumer requires an extension value.
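
A minimal sketch of the registry check, assuming the registry file lists `datatype` elements with `extension` and optional comma-separated `auto_compressed_types` attributes. The XML below is an illustrative excerpt written for this example, not the contents of the real datatypes_conf.xml.sample, and the helper names `registered_extensions` and `format_or_none` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative excerpt only; the real registry is datatypes_conf.xml.sample.
SAMPLE_CONF = """
<datatypes>
  <registration>
    <datatype extension="fastqsanger" auto_compressed_types="gz"/>
    <datatype extension="bam"/>
    <datatype extension="csv"/>
  </registration>
</datatypes>
"""


def registered_extensions(conf_xml):
    """Collect registered extensions plus those generated by auto_compressed_types."""
    extensions = set()
    for datatype in ET.fromstring(conf_xml).iter("datatype"):
        base = datatype.get("extension")
        if not base:
            continue
        extensions.add(base)
        # Each listed compression suffix yields a generated "<base>.<suffix>" extension.
        for suffix in datatype.get("auto_compressed_types", "").split(","):
            suffix = suffix.strip()
            if suffix:
                extensions.add(f"{base}.{suffix}")
    return extensions


def format_or_none(candidate, known):
    """Return the candidate only if the registry backs it; otherwise omit format."""
    return candidate if candidate in known else None


known = registered_extensions(SAMPLE_CONF)
print(format_or_none("fastqsanger.gz", known))  # generated by auto-compression
print(format_or_none("csv.gz", known))          # not registered, so format is omitted
```

Returning `None` rather than a fallback string keeps the "omit format" decision explicit for the caller.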

Evidence precedence

| Rank | Evidence | Use |
|---|---|---|
| 1 | Explicit Galaxy extension from a target tool or prior Galaxy step | Keep unless contradicted. |
| 2 | nf-schema mimetype plus format plus filename pattern | High-confidence sample-sheet path-column or path-param datatype. |
| 3 | Sample-sheet column name plus pattern | High to medium depending on specificity. |
| 4 | Process path("*.ext") output or nf-core meta.yml pattern | High for process output; medium for top-level workflow output unless emitted or published. |
| 5 | publishDir pattern: or workflow output path | High for user-visible output intent; medium for datatype unless pattern is specific. |
| 6 | fromPath / fromFilePairs glob | High for shape, medium for datatype unless extension is specific. |
| 7 | Process variable name only, e.g. path(reads), path(fasta) | Weak hint; do not emit format unless clear. |
| 8 | Generic extension, directory, extensionless output, dynamic string | Low; omit format. |
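
The ranking can be sketched as a lowest-rank-wins selection. The evidence kind names, the `Evidence` class, and `pick_format` are hypothetical labels introduced for this example; the sketch also simplifies rank 7 to "always omit", whereas the table allows a format when the hint is clear.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical kind names; rank numbers follow the precedence table.
EVIDENCE_RANK = {
    "galaxy_extension": 1,
    "nf_schema_pattern": 2,
    "sample_sheet_column_pattern": 3,
    "process_output_pattern": 4,
    "publish_pattern": 5,
    "channel_glob": 6,
    "variable_name": 7,
    "generic": 8,
}


@dataclass
class Evidence:
    kind: str
    galaxy_format: Optional[str]


def pick_format(evidence):
    """Return the format proposed by the best-ranked evidence, or None to omit format."""
    if not evidence:
        return None
    best = min(evidence, key=lambda item: EVIDENCE_RANK[item.kind])
    # Simplification: ranks 7 and 8 are treated as too weak to emit a format.
    if EVIDENCE_RANK[best.kind] >= 7:
        return None
    return best.galaxy_format


candidates = [
    Evidence("variable_name", "fasta"),
    Evidence("nf_schema_pattern", "fasta.gz"),
]
print(pick_format(candidates))
```

A fuller version would also detect contradictions between high-ranked items instead of silently taking the best one.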

Core datatype mapping

| Nextflow / filename evidence | Galaxy format | Confidence | Notes |
|---|---|---|---|
| .fastq, .fq | fastqsanger | High | Prefer over generic fastq unless source explicitly needs non-Sanger FASTQ. |
| .fastq.gz, .fq.gz | fastqsanger.gz | High | fastqsanger has gzip auto-compression. |
| .fa, .fasta, .fna, .fas | fasta | High | Common reference or assembled sequence input. |
| .fa.gz, .fasta.gz, .fna.gz, .fas.gz | fasta.gz | High | fasta has gzip auto-compression. |
| .bam | bam | High | Primary alignment dataset. |
| .bai | bai | High but usually sidecar | Avoid separate top-level input unless source requires explicit index. |
| .sam | sam | High | Registered. |
| .cram | cram | High | Registered. |
| .vcf | vcf | High | Registered. |
| .vcf.gz | vcf_bgzip | Medium to high | High with tabix/bgzip evidence or paired .tbi; otherwise medium. |
| .bed | bed | High | Registered. |
| .bed.gz | bed_tabix.gz only with tabix evidence | Low to medium | Do not invent plain bed.gz. |
| .gff | gff or gff3 | Medium | Content/version matters. |
| .gff3 | gff3 | High | Registered. |
| .gff3.gz | gff3.gz | High | gff3 has gzip auto-compression. |
| .gtf | gtf | High | Registered. |
| .gtf.gz | gtf.gz | High | gtf has gzip auto-compression. |
| .csv, mimetype: text/csv | csv | High | Use for the sheet file itself, not when translating rows into sample_sheet*. |
| .tsv | tsv | High | Registered. |
| .tab or tabular text | tabular | Medium | Use when semantics are tabular but exact TSV evidence is absent. |
| .tsv.gz | tabular.gz | Medium | Do not emit invented tsv.gz. |
| .csv.gz | Omit | Low | No registry-backed csv.gz mapping in the pinned registry. |
| .txt | txt | Low to medium | Many scientific tables hide behind .txt; prefer low confidence. |
| .txt.gz | Omit | Low | Do not invent txt.gz. |
| .html | html | High | Registered. |
| .json | json | High | Registered. |
| .xml | xml | High | Registered. |
| .zip | zip | High | Archive, often report bundle. |
| .tar | tar | High | Archive. |
| .tar.gz | tar.gz | High | tar has gzip auto-compression. |
| Bare .gz | Omit or gz only if compression artifact matters | Low | Compression alone is not scientific datatype. |
| Directory path | Omit | High for non-datatype | Decide collection/reference bundle separately. |
| Extensionless or dynamic closure | Omit | Low | Let Galaxy sniff or ask user. |
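
One way to apply the table is a longest-suffix lookup, so that `.fq.gz` is matched before bare `.gz`. The sketch below covers only a subset of rows; `SUFFIX_TABLE` and `map_filename` are hypothetical names, and the ranged confidences are collapsed to their cautious end (`.vcf.gz` to medium, `.txt` to low) on the assumption that no tabix or content evidence is available.

```python
# Subset of the core table: suffix -> (Galaxy format or None, confidence).
SUFFIX_TABLE = {
    ".fastq": ("fastqsanger", "high"),
    ".fq": ("fastqsanger", "high"),
    ".fastq.gz": ("fastqsanger.gz", "high"),
    ".fq.gz": ("fastqsanger.gz", "high"),
    ".bam": ("bam", "high"),
    ".vcf": ("vcf", "high"),
    ".vcf.gz": ("vcf_bgzip", "medium"),  # assumes no .tbi or bgzip evidence
    ".tsv": ("tsv", "high"),
    ".tsv.gz": ("tabular.gz", "medium"),
    ".csv": ("csv", "high"),
    ".csv.gz": (None, "low"),
    ".txt": ("txt", "low"),
    ".txt.gz": (None, "low"),
    ".gz": (None, "low"),
}


def map_filename(name):
    """Return (format, confidence); format is None when it should be omitted."""
    # Try longer suffixes first so compound extensions win over bare ".gz".
    for suffix in sorted(SUFFIX_TABLE, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_TABLE[suffix]
    # Extensionless or unknown: omit format and let Galaxy sniff or ask the user.
    return (None, "low")


for example in ("sample_R1.fq.gz", "calls.vcf.gz", "blob.gz", "results"):
    print(example, map_filename(example))
```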

nf-schema evidence

| nf-schema evidence | Galaxy extension action | Confidence |
|---|---|---|
| format: file-path, pattern restricts known extension | Map by extension table. | High |
| mimetype: text/csv and sample-sheet param | Use csv only if modeling sheet as dataset; otherwise use sample_sheet* collection. | High |
| mimetype: application/gzip only | Do not choose scientific datatype by mimetype alone. | Low |
| format: directory-path | No dataset format; classify as directory/reference/collection. | High |
| fastq_1 / fastq_2 path columns plus FASTQ pattern | fastqsanger or fastqsanger.gz. | High |
| bam path column plus .bam pattern | bam. | High |
| vcf path column plus .vcf(.gz)? pattern | vcf or vcf_bgzip. | Medium to high |
| Column name only | Weak hint; omit format unless clear. | Low to medium |

For sample sheets, map path columns to element datatypes and metadata columns to column_definitions; do not preserve only the sample-sheet CSV datatype unless the sheet itself is the dataset being processed.
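
A rough sketch of reading a path column's `pattern`: rather than parsing the regular expression, test it against probe filenames with known datatypes. The column definitions below are hypothetical (not copied from any pinned fixture), `PROBES` and `formats_allowed_by` are names invented here, and the approach assumes the schema pattern is simple enough to behave the same under Python's `re` as under the JSON Schema regex dialect.

```python
import re

# Probe filenames paired with the Galaxy format the core table assigns them.
PROBES = {
    "sample.fastq": "fastqsanger",
    "sample.fq": "fastqsanger",
    "sample.fastq.gz": "fastqsanger.gz",
    "sample.fq.gz": "fastqsanger.gz",
    "sample.bam": "bam",
    "sample.vcf": "vcf",
    "sample.vcf.gz": "vcf_bgzip",
}


def formats_allowed_by(column_schema):
    """Return the set of Galaxy formats whose probe filename the pattern accepts."""
    if column_schema.get("format") != "file-path":
        return set()  # metadata column or directory: no dataset format here
    pattern = column_schema.get("pattern")
    if not pattern:
        return set()  # column name alone is only a weak hint
    regex = re.compile(pattern)
    return {fmt for probe, fmt in PROBES.items() if regex.search(probe)}


# Hypothetical column definitions in the style of an nf-schema sample sheet.
fastq_column = {"type": "string", "format": "file-path",
                "pattern": r"^\S+\.f(ast)?q(\.gz)?$"}
bam_column = {"type": "string", "format": "file-path", "pattern": r"^\S+\.bam$"}
strandedness_column = {"type": "string"}

print(formats_allowed_by(fastq_column))
print(formats_allowed_by(bam_column))
print(formats_allowed_by(strandedness_column))
```

A column that yields both a plain and a gzipped format, like the FASTQ one here, still needs per-row filenames or a tool requirement to settle which one to emit.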

Process and publish evidence

| Evidence | Use | Confidence |
|---|---|---|
| path("*.fastq.gz"), path("*_{1,2}.fastq.gz") | fastqsanger.gz | High |
| path("*.bam") | bam | High |
| path("*.vcf") | vcf | High |
| path("*{vcf.gz,vcf.gz.tbi}") | Primary vcf_bgzip plus index sidecar | Medium to high |
| path("*.html") | html | High |
| path("*.zip") | zip | High |
| path("*.{tsv,csv,arrow,parquet,biom}") | Heterogeneous; do not collapse | Low for one extension |
| publishDir pattern: "*.ext" | User-visible output evidence | Medium for datatype; high only if registry-backed and specific. |
| path(reads), path(fasta), path(input) | Semantic name hint | Low to medium |

Treat publishDir pattern as high-confidence evidence that an output is user-visible, but only medium datatype evidence unless the pattern is specific and registry-backed.
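
The "do not collapse heterogeneous globs" rule can be sketched by expanding brace alternatives and checking whether every variant maps to the same format. `expand_braces` and `output_format` are hypothetical helpers, the extension map is a small subset, and the sketch deliberately does not handle index sidecars such as `*{vcf.gz,vcf.gz.tbi}` or multi-dot names such as `*.sorted.bam`; both fall through to "omit".

```python
import re

# Small subset of the core table, keyed by everything after the first dot.
EXT_TO_FORMAT = {
    "fastq.gz": "fastqsanger.gz",
    "bam": "bam",
    "vcf": "vcf",
    "html": "html",
    "zip": "zip",
    "tsv": "tsv",
    "csv": "csv",
}


def expand_braces(glob):
    """Expand {a,b} alternatives in a glob into a list of plain globs."""
    match = re.search(r"\{([^{}]*)\}", glob)
    if not match:
        return [glob]
    head, tail = glob[:match.start()], glob[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def output_format(glob):
    """Return one Galaxy format if all variants agree; otherwise None (omit)."""
    formats = set()
    for variant in expand_braces(glob):
        extension = variant.split(".", 1)[1] if "." in variant else ""
        formats.add(EXT_TO_FORMAT.get(extension))
    # Any disagreement, or any unknown extension, means no single datatype.
    return formats.pop() if len(formats) == 1 else None


print(output_format("*_{1,2}.fastq.gz"))
print(output_format("*.{tsv,csv,arrow,parquet,biom}"))
```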

Compression rules

Galaxy auto-compressed types in datatypes_conf.xml.sample are authoritative.

  • If base type has auto_compressed_types="gz", emit <base>.gz for .gz.
  • If base type lacks auto-compression, do not invent <base>.gz.
  • Specific registered compressed extensions override generic rules: vcf_bgzip, bed_tabix.gz, gff_tabix.gz, interval_tabix.gz, and bgzip.
  • Map .fastq.gz and .fq.gz to fastqsanger.gz.
  • Map .fasta.gz and .fa.gz to fasta.gz.
  • Map .tsv.gz to tabular.gz, not tsv.gz.
  • Omit format for csv.gz, txt.gz, bare .gz, and bed.gz unless registry-backed context is present.
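
The rules above can be sketched as a single decision function. The set of auto-compressed base types below is limited to those this note names (plus `tabular`, implied by the `tabular.gz` mapping) and should be regenerated from the registry in practice; `gz_format` and the `tabix_evidence` flag are hypothetical, and only the `bed` override is modeled.

```python
# Base types this note describes as having gzip auto-compression.
AUTO_GZ_BASES = {"fastqsanger", "fasta", "gff3", "gtf", "tar", "tabular"}

# Base types whose gzipped form is re-pointed at another registered base.
REDIRECT = {"tsv": "tabular"}


def gz_format(base, tabix_evidence=False):
    """Return the Galaxy extension for a gzipped file of `base`, or None to omit."""
    if base == "vcf":
        # Specific registered extension; confidence depends on tabix/bgzip context.
        return "vcf_bgzip"
    if base == "bed":
        # bed_tabix.gz only with tabix evidence; never invent plain bed.gz.
        return "bed_tabix.gz" if tabix_evidence else None
    base = REDIRECT.get(base, base)
    if base in AUTO_GZ_BASES:
        return f"{base}.gz"
    # csv, txt, and anything else without auto-compression: omit format.
    return None


for base in ("fastqsanger", "tsv", "csv", "bed"):
    print(base, gz_format(base))
```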

Pairing and collection shape

Datatype mapping and collection-shape mapping are separate decisions. R1/R2 names, {1,2}, and fromFilePairs affect paired, list:paired, or sample_sheet:paired shape; they do not change datatype beyond FASTQ.

| Evidence | Datatype | Shape implication |
|---|---|---|
| *_R1/_R2, *_1/_2, {1,2} FASTQ | fastqsanger(.gz) | Paired or list:paired candidate. |
| fromFilePairs(pattern) | From inner extension | Strong paired/grouped shape evidence. |
| fastq_1 plus fastq_2 columns | fastqsanger(.gz) | sample_sheet:paired or list:paired. |
| One FASTQ path column | fastqsanger(.gz) | sample_sheet or list. |
| Mixed single/paired rows | fastqsanger(.gz) | sample_sheet:paired_or_unpaired or branch split. |
| tuple val(meta), path(a), path(b) | Map each path separately | Not automatically paired. |
| BAM plus BAI, VCF plus TBI | Primary BAM/VCF datatype | Index sidecar, not paired-end collection. |
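
To illustrate keeping the two decisions apart, the sketch below derives shape from which FASTQ columns are filled and datatype from the filename alone. The row layout (dicts with `fastq_1` and optional `fastq_2`) and both function names are assumptions for this example, and it always picks the sample_sheet variants even though the table also allows list shapes or a branch split.

```python
def collection_shape(rows):
    """Choose a collection shape from which FASTQ columns are filled per row."""
    if not rows:
        raise ValueError("at least one sample-sheet row is required")
    paired = [bool(row.get("fastq_2")) for row in rows]
    if all(paired):
        return "sample_sheet:paired"
    if any(paired):
        return "sample_sheet:paired_or_unpaired"
    return "sample_sheet"


def element_format(path):
    """Datatype depends only on the file extension, never on pairing."""
    return "fastqsanger.gz" if path.endswith(".gz") else "fastqsanger"


rows = [
    {"sample": "A", "fastq_1": "A_R1.fastq.gz", "fastq_2": "A_R2.fastq.gz"},
    {"sample": "B", "fastq_1": "B_R1.fastq.gz", "fastq_2": ""},
]
print(collection_shape(rows))
print(element_format(rows[0]["fastq_1"]))
```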

Uncertainty posture

When datatype is uncertain, downstream Molds should add a confidence note, ask the user if the datatype affects tool choice, or defer to Galaxy sniffing. They should not force a specific format from a directory, extensionless path, bare .gz, or generic *.txt.

Recommended confidence labels:

| Confidence | Use when |
|---|---|
| High | Registry-backed extension from explicit pattern, sample-sheet schema, module meta, or process output. |
| Medium | Specific but indirect evidence, such as publish pattern or semantic variable plus extension. |
| Low | Generic text, directory, dynamic output, extensionless path, bare compression, or name-only inference. |
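
A small sketch of how a downstream Mold might turn a (format, confidence) pair into a gxformat2 input entry: emit `format` for high or medium confidence and drop it for low. The function name is hypothetical, and carrying the confidence note in a `doc` field is an assumption about where such a note would live, not something this note specifies.

```python
def gxformat2_input(input_type, galaxy_format, confidence):
    """Build a gxformat2 input entry, omitting format when confidence is low."""
    if input_type not in ("data", "collection"):
        raise ValueError("gxformat2 data inputs need type 'data' or 'collection'")
    entry = {"type": input_type}
    if galaxy_format and confidence in ("high", "medium"):
        entry["format"] = galaxy_format
    if confidence != "high":
        # Assumed placement: surface the uncertainty to whoever reviews the workflow.
        entry["doc"] = f"Datatype confidence: {confidence}."
    return entry


print(gxformat2_input("data", "bam", "high"))
print(gxformat2_input("data", "vcf_bgzip", "medium"))
print(gxformat2_input("data", "txt", "low"))
```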

Corpus examples

Corpus-observed:

  • $NEXTFLOW_FIXTURES/nf-core__rnaseq/assets/schema_input.json has fastq_1 / fastq_2 path columns whose pattern allows .fq, .fastq, .fq.gz, and .fastq.gz; map to fastqsanger or fastqsanger.gz with high confidence.
  • $NEXTFLOW_FIXTURES/nf-core__rnaseq/assets/schema_input.json has genome_bam and transcriptome_bam path columns with .bam patterns; map to bam with high confidence.
  • $NEXTFLOW_FIXTURES/nf-core__taxprofiler/assets/schema_input.json has FASTQ and FASTA path columns; map FASTQ paths to fastqsanger.gz when gzipped and FASTA paths to fasta or fasta.gz.
  • $NEXTFLOW_FIXTURES/nf-core__sarek/assets/schema_input.json includes FASTQ, BAM, CRAM, and VCF-style path columns; map by the core table and keep .vcf.gz confidence tied to bgzip/tabix context.
  • $NEXTFLOW_FIXTURES/nf-core__taxprofiler/modules/nf-core/samtools/fastq/main.nf emits path("*_{1,2}.fastq.gz"); map datatype to fastqsanger.gz and shape to paired/list:paired separately.
  • $NEXTFLOW_FIXTURES/nf-core__taxprofiler/modules/nf-core/fastqc/main.nf emits path("*.html") and path("*.zip"); map to html and zip.
  • $NEXTFLOW_FIXTURES/nf-core__taxprofiler/conf/modules.config contains publish patterns such as *.fastq.gz, *.bam, *.bai, *.tsv, and *.txt; these are output-intent evidence, but *.txt remains weak datatype evidence.
  • $NEXTFLOW_FIXTURES/nf-core__sarek/conf/modules/ includes patterns like *{vcf.gz,vcf.gz.tbi} and *bed.gz; treat VCF plus TBI as stronger vcf_bgzip evidence and BED gzip as tabix-context-dependent.

Foundry-internal:

Open questions

  • Confirm whether every <base>.gz generated by registry auto-compression is accepted unchanged in gxformat2 format fields.
  • Decide whether .vcf.gz without .tbi should still default to vcf_bgzip or omit format unless bgzip evidence appears.
  • Decide whether summary-nextflow should add explicit datatype_hint and datatype_confidence fields, or leave this as downstream Mold logic.
  • Decide how to model index sidecars (bai, tbi, crai) without treating them as primary datasets.
  • Broaden ad-hoc fixture evidence if non-nf-core pipelines become the immediate translation target.
