# Nextflow path/glob to Galaxy datatype mapping

Use this note when a Nextflow-to-Galaxy Mold needs a gxformat2 `format` value for a `data` input, collection element, or workflow output. [[nextflow-params-to-galaxy-inputs]] decides whether something is a dataset or collection; this note only decides datatype extension and confidence.

Evidence quality:

- **Corpus-observed** claims cite pinned fixtures under `$NEXTFLOW_FIXTURES`, the shared clone at `/Users/jxc755/projects/repositories/workflow-fixtures/pipelines/`.
- **Foundry-internal** claims cite existing Foundry notes and the `summary-nextflow` schema.
- **External-doc** claims cite Nextflow, nf-schema, and Galaxy registry docs.
- **Design inference** states the translation posture this Foundry note recommends.

## Registry constraint

Do not invent Galaxy extensions. [[galaxy-datatypes-conf]] and its adjacent `datatypes_conf.xml.sample` are the registry. If an extension is not registered or generated by `auto_compressed_types`, omit `format`.

For gxformat2 inputs, `type: data` or `type: collection` is required, but `format` is optional. Omit `format` when confidence is low rather than using a weak guess. Use the generic Galaxy `data` extension only as a last resort when a consumer requires an extension value.
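
A minimal sketch of the registry check, assuming the registry is an XML file shaped like `datatypes_conf.xml.sample`. The inline XML is a hand-written excerpt for illustration, not the pinned registry (real entries also carry a `type` attribute), and `allowed_extensions` and `checked_format` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Set

# Hand-written excerpt for illustration only; a real run would read the
# pinned datatypes_conf.xml.sample instead.
EXCERPT = """
<datatypes>
  <registration>
    <datatype extension="fastqsanger" auto_compressed_types="gz"/>
    <datatype extension="bam"/>
    <datatype extension="csv"/>
  </registration>
</datatypes>
"""


def allowed_extensions(xml_text: str) -> Set[str]:
    """Collect registered extensions plus those generated by auto_compressed_types."""
    allowed: Set[str] = set()
    for node in ET.fromstring(xml_text).iter("datatype"):
        extension = node.get("extension")
        if not extension:
            continue
        allowed.add(extension)
        # auto_compressed_types is treated as a comma-separated suffix list.
        for suffix in node.get("auto_compressed_types", "").split(","):
            if suffix.strip():
                allowed.add(f"{extension}.{suffix.strip()}")
    return allowed


def checked_format(candidate: str, allowed: Set[str]) -> Optional[str]:
    """Return the candidate only if registry-backed; None means omit `format`."""
    return candidate if candidate in allowed else None
```

With this excerpt, `fastqsanger.gz` is accepted because it is generated by auto-compression, while `csv.gz` comes back as `None` and the caller should leave `format` out.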

## Evidence precedence

| Rank | Evidence | Use |
|---:|---|---|
| 1 | Explicit Galaxy extension from a target tool or prior Galaxy step | Keep unless contradicted. |
| 2 | nf-schema `mimetype` plus `format` plus filename `pattern` | High-confidence sample-sheet path-column or path-param datatype. |
| 3 | Sample-sheet column name plus pattern | High to medium depending on specificity. |
| 4 | Process `path("*.ext")` output or nf-core `meta.yml` pattern | High for process output; medium for top-level workflow output unless emitted or published. |
| 5 | `publishDir pattern:` or workflow output path | High for user-visible output intent; medium for datatype unless pattern is specific. |
| 6 | `fromPath` / `fromFilePairs` glob | High for shape, medium for datatype unless extension is specific. |
| 7 | Process variable name only, e.g. `path(reads)`, `path(fasta)` | Weak hint; do not emit `format` unless the datatype is unambiguous. |
| 8 | Generic extension, directory, extensionless output, dynamic string | Low; omit `format`. |
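
The precedence table can be read as a selection rule: take the best-ranked evidence that carries a candidate format, and ignore the weakest ranks. The sketch below is one possible reading, not Foundry code. The `Evidence` class, `pick_format`, and the default cutoff at rank 6 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Evidence:
    rank: int               # 1 is strongest, 8 is weakest, as in the table above
    source: str             # free-text description of where the evidence came from
    format: Optional[str]   # candidate Galaxy extension, or None


def pick_format(evidence: List[Evidence], weakest_rank: int = 6) -> Optional[str]:
    """Return the format from the best-ranked usable evidence, or None to omit `format`."""
    # Assumption: ranks 7 and 8 never justify a format on their own.
    usable = [e for e in evidence if e.format and e.rank <= weakest_rank]
    if not usable:
        return None
    return min(usable, key=lambda e: e.rank).format
```

A lone `path(reads)` hint at rank 7 yields `None`, while an nf-schema pattern at rank 2 wins over a `fromPath` glob at rank 6.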

## Core datatype mapping

| Nextflow / filename evidence | Galaxy `format` | Confidence | Notes |
|---|---|---:|---|
| `.fastq`, `.fq` | `fastqsanger` | High | Prefer over generic `fastq` unless source explicitly needs non-Sanger FASTQ. |
| `.fastq.gz`, `.fq.gz` | `fastqsanger.gz` | High | `fastqsanger` has gzip auto-compression. |
| `.fa`, `.fasta`, `.fna`, `.fas` | `fasta` | High | Common reference or assembled sequence input. |
| `.fa.gz`, `.fasta.gz`, `.fna.gz`, `.fas.gz` | `fasta.gz` | High | `fasta` has gzip auto-compression. |
| `.bam` | `bam` | High | Primary alignment dataset. |
| `.bai` | `bai` | High but usually sidecar | Avoid separate top-level input unless source requires explicit index. |
| `.sam` | `sam` | High | Registered. |
| `.cram` | `cram` | High | Registered. |
| `.vcf` | `vcf` | High | Registered. |
| `.vcf.gz` | `vcf_bgzip` | Medium to high | High with tabix/bgzip evidence or paired `.tbi`; otherwise medium. |
| `.bed` | `bed` | High | Registered. |
| `.bed.gz` | `bed_tabix.gz` only with tabix evidence | Low to medium | Do not invent plain `bed.gz`. |
| `.gff` | `gff` or `gff3` | Medium | Content/version matters. |
| `.gff3` | `gff3` | High | Registered. |
| `.gff3.gz` | `gff3.gz` | High | `gff3` has gzip auto-compression. |
| `.gtf` | `gtf` | High | Registered. |
| `.gtf.gz` | `gtf.gz` | High | `gtf` has gzip auto-compression. |
| `.csv`, `mimetype: text/csv` | `csv` | High | Use for the sheet file itself, not when translating rows into `sample_sheet*`. |
| `.tsv` | `tsv` | High | Registered. |
| `.tab` or tabular text | `tabular` | Medium | Use when semantics are tabular but exact TSV evidence is absent. |
| `.tsv.gz` | `tabular.gz` | Medium | Do not emit invented `tsv.gz`. |
| `.csv.gz` | Omit | Low | No registry-backed `csv.gz` mapping in the pinned registry. |
| `.txt` | `txt` | Low to medium | Many scientific tables hide behind `.txt`; prefer low confidence. |
| `.txt.gz` | Omit | Low | Do not invent `txt.gz`. |
| `.html` | `html` | High | Registered. |
| `.json` | `json` | High | Registered. |
| `.xml` | `xml` | High | Registered. |
| `.zip` | `zip` | High | Archive, often report bundle. |
| `.tar` | `tar` | High | Archive. |
| `.tar.gz` | `tar.gz` | High | `tar` has gzip auto-compression. |
| Bare `.gz` | Omit or `gz` only if compression artifact matters | Low | Compression alone is not a scientific datatype. |
| Directory path | Omit | High for non-datatype | Decide collection/reference bundle separately. |
| Extensionless or dynamic closure | Omit | Low | Let Galaxy sniff or ask user. |
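
One way to apply the table is a longest-suffix lookup, so that `.csv.gz` is resolved before bare `.gz`. The sketch below copies only a few rows; `SUFFIX_TABLE` and `lookup` are hypothetical names, and `.txt` is pinned to low confidence, following the table's advice to prefer low.

```python
from typing import Dict, Optional, Tuple

# Partial copy of the table above: suffix -> (Galaxy format or None, confidence).
SUFFIX_TABLE: Dict[str, Tuple[Optional[str], str]] = {
    ".fastq": ("fastqsanger", "high"),
    ".fq": ("fastqsanger", "high"),
    ".fastq.gz": ("fastqsanger.gz", "high"),
    ".fq.gz": ("fastqsanger.gz", "high"),
    ".bam": ("bam", "high"),
    ".csv": ("csv", "high"),
    ".csv.gz": (None, "low"),
    ".tsv": ("tsv", "high"),
    ".tsv.gz": ("tabular.gz", "medium"),
    ".txt": ("txt", "low"),
    ".txt.gz": (None, "low"),
    ".gz": (None, "low"),
}


def lookup(filename: str) -> Tuple[Optional[str], str]:
    """Match the longest known suffix first; unknown names fall through to omit."""
    for suffix in sorted(SUFFIX_TABLE, key=len, reverse=True):
        if filename.endswith(suffix):
            return SUFFIX_TABLE[suffix]
    return (None, "low")
```

A `None` format means "omit `format`", which covers both the explicit Omit rows and anything the table does not list.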

## nf-schema evidence

| nf-schema evidence | Galaxy extension action | Confidence |
|---|---|---:|
| `format: file-path`, pattern restricts known extension | Map by extension table. | High |
| `mimetype: text/csv` and sample-sheet param | Use `csv` only if modeling sheet as dataset; otherwise use `sample_sheet*` collection. | High |
| `mimetype: application/gzip` only | Do not choose scientific datatype by mimetype alone. | Low |
| `format: directory-path` | No dataset `format`; classify as directory/reference/collection. | High |
| `fastq_1` / `fastq_2` path columns plus FASTQ pattern | `fastqsanger` or `fastqsanger.gz`. | High |
| `bam` path column plus `.bam` pattern | `bam`. | High |
| `vcf` path column plus `.vcf(.gz)?` pattern | `vcf` or `vcf_bgzip`. | Medium to high |
| Column name only | Weak hint; omit `format` unless the datatype is unambiguous. | Low to medium |

For sample sheets, map path columns to element datatypes and metadata columns to `column_definitions`; do not preserve only the sample-sheet CSV datatype unless the sheet itself is the dataset being processed.
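
A column `pattern` can be interpreted without parsing the regex by testing it against probe filenames. The sketch below assumes the pattern is simple enough that Python's `re` agrees with the JSON Schema regex dialect. The column definition is hypothetical and not copied from any fixture, and `PROBES` and `formats_allowed_by` are illustrative names.

```python
import re
from typing import Dict, Set

# Probe filenames -> Galaxy format, following the core mapping table.
PROBES: Dict[str, str] = {
    "x.fastq": "fastqsanger",
    "x.fq": "fastqsanger",
    "x.fastq.gz": "fastqsanger.gz",
    "x.fq.gz": "fastqsanger.gz",
    "x.bam": "bam",
    "x.csv": "csv",
}


def formats_allowed_by(column: Dict[str, str]) -> Set[str]:
    """Return the Galaxy formats whose probe filename satisfies the column pattern."""
    if column.get("format") != "file-path" or "pattern" not in column:
        return set()
    # JSON Schema patterns are unanchored, so search() is the matching call.
    pattern = re.compile(column["pattern"])
    return {fmt for probe, fmt in PROBES.items() if pattern.search(probe)}


# Hypothetical column definition for illustration.
fastq_1 = {
    "type": "string",
    "format": "file-path",
    "pattern": r"^\S+\.f(ast)?q(\.gz)?$",
}
```

A single returned format supports a high-confidence element datatype. Several formats, as with a pattern that allows both plain and gzipped FASTQ, mean the choice has to be made per file rather than per column.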

## Process and publish evidence

| Evidence | Use | Confidence |
|---|---|---:|
| `path("*.fastq.gz")`, `path("*_{1,2}.fastq.gz")` | `fastqsanger.gz` | High |
| `path("*.bam")` | `bam` | High |
| `path("*.vcf")` | `vcf` | High |
| `path("*{vcf.gz,vcf.gz.tbi}")` | Primary `vcf_bgzip` plus index sidecar | Medium to high |
| `path("*.html")` | `html` | High |
| `path("*.zip")` | `zip` | High |
| `path("*.{tsv,csv,arrow,parquet,biom}")` | Heterogeneous; do not collapse | Low for one extension |
| `publishDir pattern: "*.ext"` | User-visible output evidence | Medium for datatype; high only if registry-backed and specific. |
| `path(reads)`, `path(fasta)`, `path(input)` | Semantic name hint | Low to medium |

Treat `publishDir pattern` as high-confidence evidence that an output is user-visible, but only medium datatype evidence unless the pattern is specific and registry-backed.
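
To tell a pairing glob such as `*_{1,2}.fastq.gz` from a heterogeneous one such as `*.{tsv,csv,arrow,parquet,biom}`, the brace group can be expanded and each alternative mapped on its own. This is a rough sketch: `expand_braces` handles a single non-nested group, suffix matching is deliberately loose, and all names and the short `FORMATS` list are assumptions.

```python
import re
from typing import List, Optional, Set

# Loose suffix -> format pairs (no leading dot, so "*vcf.gz" also matches).
FORMATS = [
    ("fastq.gz", "fastqsanger.gz"),
    ("vcf.gz", "vcf_bgzip"),
    ("tsv", "tsv"),
    ("csv", "csv"),
]
INDEX_SUFFIXES = (".tbi", ".bai", ".crai")


def expand_braces(glob: str) -> List[str]:
    """Expand the first {a,b,...} group; nested groups are not supported."""
    match = re.search(r"\{([^{}]*)\}", glob)
    if not match:
        return [glob]
    head, tail = glob[: match.start()], glob[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]


def glob_formats(glob: str) -> Set[Optional[str]]:
    """Return the set of formats across alternatives; None marks an unmapped one."""
    formats: Set[Optional[str]] = set()
    for alternative in expand_braces(glob):
        if alternative.endswith(INDEX_SUFFIXES):
            continue  # index sidecar, not a primary dataset
        formats.add(
            next((fmt for suffix, fmt in FORMATS if alternative.endswith(suffix)), None)
        )
    return formats
```

A result with exactly one non-`None` member can be emitted as `format`. Anything larger is heterogeneous and should not be collapsed to one extension.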

## Compression rules

Galaxy auto-compressed types in `datatypes_conf.xml.sample` are authoritative.

- If base type has `auto_compressed_types="gz"`, emit `<base>.gz` for `.gz`.
- If base type lacks auto-compression, do not invent `<base>.gz`.
- Specific registered compressed extensions override generic rules: `vcf_bgzip`, `bed_tabix.gz`, `gff_tabix.gz`, `interval_tabix.gz`, and `bgzip`.
- Map `.fastq.gz` and `.fq.gz` to `fastqsanger.gz`.
- Map `.fasta.gz`, `.fa.gz`, `.fna.gz`, and `.fas.gz` to `fasta.gz`.
- Map `.tsv.gz` to `tabular.gz`, not `tsv.gz`.
- Omit `format` for `.csv.gz`, `.txt.gz`, bare `.gz`, and `.bed.gz` unless registry-backed context is present.
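
The rules above can be sketched as a small function over the base datatype. The sets below restate what this note says about the registry rather than reading the registry itself: `tabular` is included because the note maps `.tsv.gz` to `tabular.gz`, and plain `.vcf.gz` follows the core table even though the no-`.tbi` case is still an open question. `gz_format` and the constant names are hypothetical.

```python
from typing import Optional

# Base types this note describes as gzip auto-compressed.
AUTO_GZ = {"fastqsanger", "fasta", "gff3", "gtf", "tar", "tabular"}
# base -> (specific registered extension, whether tabix/bgzip evidence is required)
SPECIFIC = {"vcf": ("vcf_bgzip", False), "bed": ("bed_tabix.gz", True)}
# Bases whose gzipped form is expressed through another registered base.
REBASE = {"tsv": "tabular"}


def gz_format(base: str, tabix_evidence: bool = False) -> Optional[str]:
    """Return the Galaxy extension for a gzipped file of `base`, or None to omit."""
    if base in SPECIFIC:
        # Specific registered compressed extensions override the generic rule.
        extension, needs_tabix = SPECIFIC[base]
        return extension if tabix_evidence or not needs_tabix else None
    base = REBASE.get(base, base)
    return f"{base}.gz" if base in AUTO_GZ else None
```

Returning `None` for `csv`, `txt`, and evidence-free `bed` keeps the function from inventing `<base>.gz` extensions.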

## Pairing and collection shape

Datatype mapping and collection-shape mapping are separate decisions. R1/R2 names, `{1,2}`, and `fromFilePairs` affect `paired`, `list:paired`, or `sample_sheet:paired` shape; they do not change the datatype, which remains FASTQ as determined by the extension.

| Evidence | Datatype | Shape implication |
|---|---|---|
| `*_R1/_R2`, `*_1/_2`, `{1,2}` FASTQ | `fastqsanger(.gz)` | Paired or `list:paired` candidate. |
| `fromFilePairs(pattern)` | From inner extension | Strong paired/grouped shape evidence. |
| `fastq_1` plus `fastq_2` columns | `fastqsanger(.gz)` | `sample_sheet:paired` or `list:paired`. |
| One FASTQ path column | `fastqsanger(.gz)` | `sample_sheet` or `list`. |
| Mixed single/paired rows | `fastqsanger(.gz)` | `sample_sheet:paired_or_unpaired` or branch split. |
| `tuple val(meta), path(a), path(b)` | Map each path separately | Not automatically paired. |
| BAM plus BAI, VCF plus TBI | Primary BAM/VCF datatype | Index sidecar, not paired-end collection. |
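
The separation can be kept explicit in code by computing shape and datatype with unrelated functions. The sketch below assumes sample-sheet rows have already been parsed into dictionaries with `fastq_1` and an optional `fastq_2`, and it picks the `sample_sheet*` shapes over the `list*` alternatives purely for illustration. Both function names are hypothetical.

```python
from typing import Dict, List, Optional


def sample_sheet_shape(rows: List[Dict[str, Optional[str]]]) -> str:
    """Choose a collection shape from which rows carry a second FASTQ path."""
    if not rows:
        raise ValueError("cannot infer shape from an empty sample sheet")
    paired = [bool(row.get("fastq_2")) for row in rows]
    if all(paired):
        return "sample_sheet:paired"
    if not any(paired):
        return "sample_sheet"
    return "sample_sheet:paired_or_unpaired"


def fastq_datatype(path: str) -> str:
    """Datatype depends only on the extension; the path is assumed to be FASTQ."""
    return "fastqsanger.gz" if path.endswith(".gz") else "fastqsanger"
```

Changing a row from paired to single-end alters the result of `sample_sheet_shape` but never the result of `fastq_datatype`.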

## Uncertainty posture

When datatype is uncertain, downstream Molds should add a confidence note, ask the user if the datatype affects tool choice, or defer to Galaxy sniffing. They should not force a specific `format` from a directory, extensionless path, bare `.gz`, or generic `*.txt`.

Recommended confidence labels:

| Confidence | Use when |
|---|---|
| High | Registry-backed extension from explicit pattern, sample-sheet schema, module meta, or process output. |
| Medium | Specific but indirect evidence, such as publish pattern or semantic variable plus extension. |
| Low | Generic text, directory, dynamic output, extensionless path, bare compression, or name-only inference. |
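
A downstream Mold could turn these labels into an emission policy along the following lines. This is a sketch under assumptions: `emit_input` is a hypothetical helper, the returned dictionary only mirrors the gxformat2 `type` and `format` keys discussed in this note, and the note strings are placeholders.

```python
from typing import Dict, Optional, Tuple


def emit_input(
    candidate: Optional[str], confidence: str
) -> Tuple[Dict[str, str], Optional[str]]:
    """Return a gxformat2-style input entry and an optional note for the user."""
    entry = {"type": "data"}
    if candidate is None or confidence == "low":
        # Low confidence: leave `format` out and surface the uncertainty instead.
        return entry, "datatype uncertain; rely on Galaxy sniffing or ask the user"
    entry["format"] = candidate
    note = None if confidence == "high" else f"{candidate} inferred with medium confidence"
    return entry, note
```

High confidence emits `format` silently, medium emits it with a note, and low never forces a `format`.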

## Corpus examples

Corpus-observed:

- `$NEXTFLOW_FIXTURES/nf-core__rnaseq/assets/schema_input.json` has `fastq_1` / `fastq_2` path columns whose pattern allows `.fq`, `.fastq`, `.fq.gz`, and `.fastq.gz`; map to `fastqsanger` or `fastqsanger.gz` with high confidence.
- `$NEXTFLOW_FIXTURES/nf-core__rnaseq/assets/schema_input.json` has `genome_bam` and `transcriptome_bam` path columns with `.bam` patterns; map to `bam` with high confidence.
- `$NEXTFLOW_FIXTURES/nf-core__taxprofiler/assets/schema_input.json` has FASTQ and FASTA path columns; map FASTQ paths to `fastqsanger.gz` when gzipped and FASTA paths to `fasta` or `fasta.gz`.
- `$NEXTFLOW_FIXTURES/nf-core__sarek/assets/schema_input.json` includes FASTQ, BAM, CRAM, and VCF-style path columns; map by the core table and keep `.vcf.gz` confidence tied to bgzip/tabix context.
- `$NEXTFLOW_FIXTURES/nf-core__taxprofiler/modules/nf-core/samtools/fastq/main.nf` emits `path("*_{1,2}.fastq.gz")`; map datatype to `fastqsanger.gz` and shape to paired/list:paired separately.
- `$NEXTFLOW_FIXTURES/nf-core__taxprofiler/modules/nf-core/fastqc/main.nf` emits `path("*.html")` and `path("*.zip")`; map to `html` and `zip`.
- `$NEXTFLOW_FIXTURES/nf-core__taxprofiler/conf/modules.config` contains publish patterns such as `*.fastq.gz`, `*.bam`, `*.bai`, `*.tsv`, and `*.txt`; these are output-intent evidence, but `*.txt` remains weak datatype evidence.
- `$NEXTFLOW_FIXTURES/nf-core__sarek/conf/modules/` includes patterns like `*{vcf.gz,vcf.gz.tbi}` and `*bed.gz`; treat VCF plus TBI as stronger `vcf_bgzip` evidence and BED gzip as tabix-context-dependent.

Foundry-internal:

- [[gxformat2-workflow-inputs]] says `format` is optional and should be omitted when datatype confidence is weak.
- [[galaxy-datatypes-conf]] identifies `datatypes_conf.xml.sample` as the extension registry source.
- [[summary-nextflow]] records sample-sheet column `format`, `mimetype`, and `pattern`, plus process and module path patterns that downstream Molds can inspect.
- [[nextflow-to-galaxy-channel-shape-mapping]] separates dataset versus collection shape.

## Open questions

- Confirm whether every `<base>.gz` generated by registry auto-compression is accepted unchanged in gxformat2 `format` fields.
- Decide whether `.vcf.gz` without `.tbi` should still default to `vcf_bgzip` or omit `format` unless bgzip evidence appears.
- Decide whether `summary-nextflow` should add explicit `datatype_hint` and `datatype_confidence` fields, or leave this as downstream Mold logic.
- Decide how to model index sidecars (`bai`, `tbi`, `crai`) without treating them as primary datasets.
- Broaden ad-hoc fixture evidence if non-nf-core pipelines become the immediate translation target.