Nextflow path/glob to Galaxy datatype mapping
Use this note when a Nextflow-to-Galaxy Mold needs a gxformat2 format value for a data input, collection element, or workflow output. nextflow-params-to-galaxy-inputs decides whether something is a dataset or collection; this note only decides datatype extension and confidence.
Evidence quality:
- Corpus-observed claims cite pinned fixtures under $NEXTFLOW_FIXTURES, the shared clone at /Users/jxc755/projects/repositories/workflow-fixtures/pipelines/.
- Foundry-internal claims cite existing Foundry notes and the summary-nextflow schema.
- External-doc claims cite Nextflow, nf-schema, and Galaxy registry docs.
- Design inference states the translation posture this Foundry note recommends.
Registry constraint
Do not invent Galaxy extensions. galaxy-datatypes-conf and its adjacent datatypes_conf.xml.sample are the registry. If an extension is not registered or generated by auto_compressed_types, omit format.
For gxformat2 inputs, type: data or type: collection is required, but format is optional. Omit format when confidence is low rather than using a weak guess. Use the generic Galaxy data extension only as a last resort when a consumer requires an extension value.
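The omit-when-weak rule can be sketched in Python as follows. This is a minimal illustration, not part of any Foundry tool: the helper name build_input, the confidence strings, and the consumer_requires_format flag are hypothetical, and treating medium confidence as sufficient to emit format is an assumption of this sketch.

```python
from typing import Optional


def build_input(kind: str, extension: Optional[str], confidence: str,
                consumer_requires_format: bool = False) -> dict:
    """Build a gxformat2-style input entry (hypothetical helper).

    kind is "data" or "collection"; extension is a candidate Galaxy
    extension or None; confidence is "high", "medium", or "low".
    """
    if kind not in ("data", "collection"):
        raise ValueError(f"unsupported input type: {kind}")
    entry = {"type": kind}  # type is always required
    # Assumption: medium confidence is still good enough to emit format.
    if extension is not None and confidence in ("high", "medium"):
        entry["format"] = extension
    elif consumer_requires_format:
        # Last resort only: the generic Galaxy "data" extension.
        entry["format"] = "data"
    return entry


print(build_input("data", "fastqsanger.gz", "high"))
print(build_input("data", "txt", "low"))
```

The second call returns an entry with no format key at all, which is the preferred outcome for a weak guess.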
Evidence precedence
| Rank | Evidence | Use |
|---|---|---|
| 1 | Explicit Galaxy extension from a target tool or prior Galaxy step | Keep unless contradicted. |
| 2 | nf-schema mimetype plus format plus filename pattern | High-confidence sample-sheet path-column or path-param datatype. |
| 3 | Sample-sheet column name plus pattern | High to medium depending on specificity. |
| 4 | Process path("*.ext") output or nf-core meta.yml pattern | High for process output; medium for top-level workflow output unless emitted or published. |
| 5 | publishDir pattern: or workflow output path | High for user-visible output intent; medium for datatype unless pattern is specific. |
| 6 | fromPath / fromFilePairs glob | High for shape, medium for datatype unless extension is specific. |
| 7 | Process variable name only, e.g. path(reads), path(fasta) | Weak hint; do not emit format unless clear. |
| 8 | Generic extension, directory, extensionless output, dynamic string | Low; omit format. |
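One way to apply the precedence table is to collect every piece of evidence with its rank and keep the best-ranked candidate. The sketch below assumes a hypothetical Evidence record and a conservative simplification: ranks 7 and 8 never yield a format, and contradiction handling for rank 1 is not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Evidence:
    rank: int                  # 1 (strongest) to 8 (weakest), per the table
    source: str                # free-text description of where it came from
    extension: Optional[str]   # candidate Galaxy extension, if any


def pick_extension(evidence: Iterable[Evidence]) -> Optional[str]:
    """Return the extension from the best-ranked usable evidence, or None."""
    # Assumption: name-only hints (7) and generic evidence (8) are ignored.
    usable = [e for e in evidence if e.extension is not None and e.rank <= 6]
    if not usable:
        return None
    return min(usable, key=lambda e: e.rank).extension


candidates = [
    Evidence(6, "fromPath glob", "fastqsanger.gz"),
    Evidence(2, "nf-schema pattern", "fastqsanger"),
    Evidence(7, "variable name path(fasta)", "fasta"),
]
print(pick_extension(candidates))
```

A fuller implementation would also return the winning rank so that callers can attach a confidence note.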
Core datatype mapping
| Nextflow / filename evidence | Galaxy format | Confidence | Notes |
|---|---|---|---|
| .fastq, .fq | fastqsanger | High | Prefer over generic fastq unless source explicitly needs non-Sanger FASTQ. |
| .fastq.gz, .fq.gz | fastqsanger.gz | High | fastqsanger has gzip auto-compression. |
| .fa, .fasta, .fna, .fas | fasta | High | Common reference or assembled sequence input. |
| .fa.gz, .fasta.gz, .fna.gz, .fas.gz | fasta.gz | High | fasta has gzip auto-compression. |
| .bam | bam | High | Primary alignment dataset. |
| .bai | bai | High but usually sidecar | Avoid separate top-level input unless source requires explicit index. |
| .sam | sam | High | Registered. |
| .cram | cram | High | Registered. |
| .vcf | vcf | High | Registered. |
| .vcf.gz | vcf_bgzip | Medium to high | High with tabix/bgzip evidence or paired .tbi; otherwise medium. |
| .bed | bed | High | Registered. |
| .bed.gz | bed_tabix.gz only with tabix evidence | Low to medium | Do not invent plain bed.gz. |
| .gff | gff or gff3 | Medium | Content/version matters. |
| .gff3 | gff3 | High | Registered. |
| .gff3.gz | gff3.gz | High | gff3 has gzip auto-compression. |
| .gtf | gtf | High | Registered. |
| .gtf.gz | gtf.gz | High | gtf has gzip auto-compression. |
| .csv, mimetype: text/csv | csv | High | Use for the sheet file itself, not when translating rows into sample_sheet*. |
| .tsv | tsv | High | Registered. |
| .tab or tabular text | tabular | Medium | Use when semantics are tabular but exact TSV evidence is absent. |
| .tsv.gz | tabular.gz | Medium | Do not emit invented tsv.gz. |
| .csv.gz | Omit | Low | No registry-backed csv.gz mapping in the pinned registry. |
| .txt | txt | Low to medium | Many scientific tables hide behind .txt; prefer low confidence. |
| .txt.gz | Omit | Low | Do not invent txt.gz. |
| .html | html | High | Registered. |
| .json | json | High | Registered. |
| .xml | xml | High | Registered. |
| .zip | zip | High | Archive, often report bundle. |
| .tar | tar | High | Archive. |
| .tar.gz | tar.gz | High | tar has gzip auto-compression. |
| Bare .gz | Omit or gz only if compression artifact matters | Low | Compression alone is not scientific datatype. |
| Directory path | Omit | High for non-datatype | Decide collection/reference bundle separately. |
| Extensionless or dynamic closure | Omit | Low | Let Galaxy sniff or ask user. |
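A table like this is naturally consumed as a longest-suffix lookup, so that .fastq.gz is matched before bare .gz. The following sketch encodes only a subset of the rows; the names SUFFIX_MAP and lookup are hypothetical, and a format of None stands for "omit format".

```python
from typing import Optional, Tuple

# Subset of the table above: suffix -> (Galaxy format or None, confidence).
SUFFIX_MAP = {
    ".fastq": ("fastqsanger", "high"),
    ".fq": ("fastqsanger", "high"),
    ".fastq.gz": ("fastqsanger.gz", "high"),
    ".fq.gz": ("fastqsanger.gz", "high"),
    ".fasta": ("fasta", "high"),
    ".fasta.gz": ("fasta.gz", "high"),
    ".bam": ("bam", "high"),
    ".vcf": ("vcf", "high"),
    ".csv": ("csv", "high"),
    ".tsv": ("tsv", "high"),
    ".tsv.gz": ("tabular.gz", "medium"),
    ".csv.gz": (None, "low"),
    ".txt.gz": (None, "low"),
    ".gz": (None, "low"),
}


def lookup(filename: str) -> Tuple[Optional[str], str]:
    """Return (format, confidence) for the longest matching suffix."""
    for suffix in sorted(SUFFIX_MAP, key=len, reverse=True):
        if filename.endswith(suffix):
            return SUFFIX_MAP[suffix]
    # Extensionless or unknown: omit format and let Galaxy sniff.
    return (None, "low")


for name in ("sample_1.fastq.gz", "calls.tsv.gz", "archive.gz", "results"):
    print(name, lookup(name))
```

Context-dependent rows such as .vcf.gz and .bed.gz are deliberately left out, because a suffix alone cannot supply the tabix or bgzip evidence they need.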
nf-schema evidence
| nf-schema evidence | Galaxy extension action | Confidence |
|---|---|---|
| format: file-path, pattern restricts known extension | Map by extension table. | High |
| mimetype: text/csv and sample-sheet param | Use csv only if modeling sheet as dataset; otherwise use sample_sheet* collection. | High |
| mimetype: application/gzip only | Do not choose scientific datatype by mimetype alone. | Low |
| format: directory-path | No dataset format; classify as directory/reference/collection. | High |
| fastq_1 / fastq_2 path columns plus FASTQ pattern | fastqsanger or fastqsanger.gz. | High |
| bam path column plus .bam pattern | bam. | High |
| vcf path column plus .vcf(.gz)? pattern | vcf or vcf_bgzip. | Medium to high |
| Column name only | Weak hint; omit format unless clear. | Low to medium |
For sample sheets, map path columns to element datatypes and metadata columns to column_definitions; do not preserve only the sample-sheet CSV datatype unless the sheet itself is the dataset being processed.
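The split between path columns and metadata columns could look roughly like the sketch below. It probes each column's regex pattern with a few representative filenames to see which Galaxy formats the pattern admits. The schema fragment is illustrative and not copied from any fixture, and the names PROBES and split_columns are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Representative filenames and the format each would map to.
PROBES = {
    "x.fastq": "fastqsanger",
    "x.fq": "fastqsanger",
    "x.fastq.gz": "fastqsanger.gz",
    "x.fq.gz": "fastqsanger.gz",
    "x.bam": "bam",
    "x.vcf": "vcf",
}


def split_columns(properties: Dict[str, dict]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return ({path column: admitted formats}, [metadata column names])."""
    elements: Dict[str, List[str]] = {}
    metadata: List[str] = []
    for name, spec in properties.items():
        if spec.get("format") == "file-path":
            pattern = spec.get("pattern")
            formats = set()
            if pattern:
                formats = {fmt for probe, fmt in PROBES.items()
                           if re.search(pattern, probe)}
            # An empty or multi-entry list means the caller must decide.
            elements[name] = sorted(formats)
        else:
            metadata.append(name)
    return elements, metadata


# Illustrative schema properties, not taken from a real pipeline.
props = {
    "sample": {"type": "string"},
    "fastq_1": {"type": "string", "format": "file-path",
                "pattern": r"^\S+\.f(ast)?q\.gz$"},
    "strandedness": {"type": "string"},
}
print(split_columns(props))
```

When a pattern admits both compressed and uncompressed FASTQ, the list holds two formats and the final choice has to wait for the actual file name.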
Process and publish evidence
| Evidence | Use | Confidence |
|---|---|---|
| path("*.fastq.gz"), path("*_{1,2}.fastq.gz") | fastqsanger.gz | High |
| path("*.bam") | bam | High |
| path("*.vcf") | vcf | High |
| path("*{vcf.gz,vcf.gz.tbi}") | Primary vcf_bgzip plus index sidecar | Medium to high |
| path("*.html") | html | High |
| path("*.zip") | zip | High |
| path("*.{tsv,csv,arrow,parquet,biom}") | Heterogeneous; do not collapse | Low for one extension |
| publishDir pattern: "*.ext" | User-visible output evidence | Medium for datatype; high only if registry-backed and specific. |
| path(reads), path(fasta), path(input) | Semantic name hint | Low to medium |
Treat publishDir pattern as high-confidence evidence that an output is user-visible, but only medium datatype evidence unless the pattern is specific and registry-backed.
Compression rules
Galaxy auto-compressed types in datatypes_conf.xml.sample are authoritative.
- If base type has auto_compressed_types="gz", emit <base>.gz for .gz.
- If base type lacks auto-compression, do not invent <base>.gz.
- Specific registered compressed extensions override generic rules: vcf_bgzip, bed_tabix.gz, gff_tabix.gz, interval_tabix.gz, and bgzip.
- Map .fastq.gz and .fq.gz to fastqsanger.gz.
- Map .fasta.gz and .fa.gz to fasta.gz.
- Map .tsv.gz to tabular.gz, not tsv.gz.
- Omit format for csv.gz, txt.gz, bare .gz, and bed.gz unless registry-backed context is present.
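The first two rules amount to a registry check, sketched below. The XML fragment is illustrative only and is not copied from the real datatypes_conf.xml.sample. The element layout and the comma-separated form of auto_compressed_types are assumptions of this sketch, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Set

# Illustrative fragment; the real registry has many more entries.
REGISTRY_XML = """
<datatypes>
  <registration>
    <datatype extension="fastqsanger" auto_compressed_types="gz,bz2"/>
    <datatype extension="tabular" auto_compressed_types="gz"/>
    <datatype extension="csv"/>
    <datatype extension="txt"/>
    <datatype extension="vcf_bgzip"/>
  </registration>
</datatypes>
"""


def load_registry(xml_text: str) -> Dict[str, Set[str]]:
    """Map each registered extension to its auto-compressed suffixes."""
    registry: Dict[str, Set[str]] = {}
    for node in ET.fromstring(xml_text).iter("datatype"):
        compressed = node.get("auto_compressed_types", "")
        registry[node.get("extension")] = {
            part.strip() for part in compressed.split(",") if part.strip()
        }
    return registry


def gz_extension(base: str, registry: Dict[str, Set[str]]) -> Optional[str]:
    """Return <base>.gz only when the registry generates it, else None."""
    if "gz" in registry.get(base, set()):
        return f"{base}.gz"
    return None


registry = load_registry(REGISTRY_XML)
print(gz_extension("fastqsanger", registry))
print(gz_extension("csv", registry))
```

Returning None for csv, and for any base type missing from the registry, keeps the caller on the omit-format path instead of inventing an extension.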
Pairing and collection shape
Datatype mapping and collection-shape mapping are separate decisions. R1/R2 names, {1,2}, and fromFilePairs affect paired, list:paired, or sample_sheet:paired shape; they do not change datatype beyond FASTQ.
| Evidence | Datatype | Shape implication |
|---|---|---|
| *_R1/_R2, *_1/_2, {1,2} FASTQ | fastqsanger(.gz) | Paired or list:paired candidate. |
| fromFilePairs(pattern) | From inner extension | Strong paired/grouped shape evidence. |
| fastq_1 plus fastq_2 columns | fastqsanger(.gz) | sample_sheet:paired or list:paired. |
| One FASTQ path column | fastqsanger(.gz) | sample_sheet or list. |
| Mixed single/paired rows | fastqsanger(.gz) | sample_sheet:paired_or_unpaired or branch split. |
| tuple val(meta), path(a), path(b) | Map each path separately | Not automatically paired. |
| BAM plus BAI, VCF plus TBI | Primary BAM/VCF datatype | Index sidecar, not paired-end collection. |
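To make the separation concrete, the sketch below derives datatype and shape hint from a glob pattern in two independent steps. It handles FASTQ only, the pairing regex is a rough heuristic chosen for this example, and the names PAIR_TOKEN and classify are hypothetical.

```python
import re
from typing import Optional, Tuple

# Heuristic pairing tokens: a literal {1,2}, or _1/_2/_R1/_R2 before "." or "_".
PAIR_TOKEN = re.compile(r"\{1,2\}|_R?[12](?=[._])")


def classify(pattern: str) -> Tuple[Optional[str], str]:
    """Return (datatype, shape_hint); each is decided on its own."""
    # Datatype: read from the extension only.
    if pattern.endswith((".fastq.gz", ".fq.gz")):
        datatype: Optional[str] = "fastqsanger.gz"
    elif pattern.endswith((".fastq", ".fq")):
        datatype = "fastqsanger"
    else:
        datatype = None
    # Shape: read from pairing tokens only; never alters the datatype.
    shape = "paired candidate" if PAIR_TOKEN.search(pattern) else "unknown"
    return datatype, shape


for glob in ("*_{1,2}.fastq.gz", "*.fastq.gz", "sample_R1.fastq"):
    print(glob, classify(glob))
```

A pattern such as *{bam,bai} yields neither a FASTQ datatype nor a pairing hint here, which matches the table's point that index sidecars are not paired-end collections.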
Uncertainty posture
When datatype is uncertain, downstream Molds should add a confidence note, ask the user if the datatype affects tool choice, or defer to Galaxy sniffing. They should not force a specific format from a directory, extensionless path, bare .gz, or generic *.txt.
Recommended confidence labels:
| Confidence | Use when |
|---|---|
| High | Registry-backed extension from explicit pattern, sample-sheet schema, module meta, or process output. |
| Medium | Specific but indirect evidence, such as publish pattern or semantic variable plus extension. |
| Low | Generic text, directory, dynamic output, extensionless path, bare compression, or name-only inference. |
Corpus examples
Corpus-observed:
- $NEXTFLOW_FIXTURES/nf-core__rnaseq/assets/schema_input.json has fastq_1/fastq_2 path columns whose pattern allows .fq, .fastq, .fq.gz, and .fastq.gz; map to fastqsanger or fastqsanger.gz with high confidence.
- $NEXTFLOW_FIXTURES/nf-core__rnaseq/assets/schema_input.json has genome_bam and transcriptome_bam path columns with .bam patterns; map to bam with high confidence.
- $NEXTFLOW_FIXTURES/nf-core__taxprofiler/assets/schema_input.json has FASTQ and FASTA path columns; map FASTQ paths to fastqsanger.gz when gzipped and FASTA paths to fasta or fasta.gz.
- $NEXTFLOW_FIXTURES/nf-core__sarek/assets/schema_input.json includes FASTQ, BAM, CRAM, and VCF-style path columns; map by the core table and keep .vcf.gz confidence tied to bgzip/tabix context.
- $NEXTFLOW_FIXTURES/nf-core__taxprofiler/modules/nf-core/samtools/fastq/main.nf emits path("*_{1,2}.fastq.gz"); map datatype to fastqsanger.gz and shape to paired/list:paired separately.
- $NEXTFLOW_FIXTURES/nf-core__taxprofiler/modules/nf-core/fastqc/main.nf emits path("*.html") and path("*.zip"); map to html and zip.
- $NEXTFLOW_FIXTURES/nf-core__taxprofiler/conf/modules.config contains publish patterns such as *.fastq.gz, *.bam, *.bai, *.tsv, and *.txt; these are output-intent evidence, but *.txt remains weak datatype evidence.
- $NEXTFLOW_FIXTURES/nf-core__sarek/conf/modules/ includes patterns like *{vcf.gz,vcf.gz.tbi} and *bed.gz; treat VCF plus TBI as stronger vcf_bgzip evidence and BED gzip as tabix-context-dependent.
Foundry-internal:
- gxformat2-workflow-inputs says format is optional and should be omitted when datatype confidence is weak.
- galaxy-datatypes-conf identifies datatypes_conf.xml.sample as the extension registry source.
- summary-nextflow records sample-sheet column format, mimetype, and pattern, plus process and module path patterns that downstream Molds can inspect.
- nextflow-to-galaxy-channel-shape-mapping separates dataset versus collection shape.
Open questions
- Confirm whether every <base>.gz generated by registry auto-compression is accepted unchanged in gxformat2 format fields.
- Decide whether .vcf.gz without .tbi should still default to vcf_bgzip or omit format unless bgzip evidence appears.
- Decide whether summary-nextflow should add explicit datatype_hint and datatype_confidence fields, or leave this as downstream Mold logic.
- Model index sidecars (bai, tbi, crai) without treating them as primary datasets.
- Broaden ad-hoc fixture evidence if non-nf-core pipelines become the immediate translation target.