Nextflow workflow I/O semantics
Use this note when deciding what counts as a workflow input or output in a Nextflow pipeline before translating the interface into Galaxy or CWL. The central problem: Nextflow does not have one historical interface surface. Real pipelines spread interface meaning across params, nextflow_schema.json, sample-sheet schemas, entry workflow channel construction, process input: / output: blocks, named workflow take: / emit: blocks, publishDir, and newer workflow publish: / output {} blocks.
Evidence quality is split below:
- Documentation-observed claims come from Nextflow, nf-core, and nf-schema docs.
- Corpus-observed claims come from the pinned fixture corpus in
workflow-fixtures/pipelines/. - Design inference states how the Foundry should use those facts.
What counts as input
Nextflow workflow input has three layers. The correct abstraction depends on which layer is being translated.
| Layer | Surface | Meaning | Translation value |
|---|---|---|---|
| Parameter surface | params.*, params {}, nextflow_schema.json, params files, config | User-facing values supplied at launch. May be scalars, paths, directories, glob patterns, sample sheets, toggles, tool choices, resource knobs, and output settings. | Strongest source for Galaxy top-level scalar parameters and candidate dataset inputs. |
| Materialization surface | channel.fromPath, channel.fromFilePairs, channel.fromList(samplesheetToList(...)), splitCsv, file(), files() | How launch values become channels or dataflow values. Encodes dataset-vs-collection shape. | Strongest source for Galaxy dataset and collection shapes. |
| Component surface | named workflow take:, process input: declarations | Internal function boundary. Names semantic channels or task-local staged files. | Strongest source for per-step shape and dependency semantics, not automatically top-level workflow inputs. |
Documentation-observed: an entry workflow can read params; named workflows declare inputs with take:. Processes declare task inputs with qualifiers including val, path, env, stdin, tuple, and each. A process call must be supplied a channel, dataflow value, or regular value for each declared input.
Design inference: top-level Galaxy workflow inputs should be derived from parameter plus materialization surfaces, not from every process input:. Process inputs mostly describe task-local staging and downstream wiring. Named workflow take: entries matter when the named workflow is the real pipeline body invoked by a thin entry workflow, as in many nf-core pipelines.
Parameter surface
params values are the launch-time user interface. Defaults can be assigned in .nf files, nextflow.config, params files, or config profiles; command-line and config resolution order is defined by Nextflow’s pipeline parameter docs. The same parameter can be a scalar threshold, an input file, an input directory, a glob, a sample-sheet path, a reference bundle, an output directory, or a behavior toggle.
When nextflow_schema.json exists, it is the best available declaration for parameter type, description, requiredness, and path-like intent. nextflow-parameters-meta records the supported nf-schema vocabulary: type is limited to string, boolean, integer, and number; format can mark file-path, directory-path, path, or file-path-pattern; exists, mimetype, pattern, schema, enum, hidden, and description fields refine behavior.
Corpus-observed:
| Shape | Evidence |
|---|---|
All seven nf-core fixtures have root nextflow_schema.json. | nf-core__demo, nf-core__fetchngs, nf-core__rnaseq, nf-core__bacass, nf-core__hlatyping, nf-core__sarek, nf-core__taxprofiler. |
Two ad-hoc fixtures also have root nextflow_schema.json. | labsyspharm__mcmicro, epi2me-labs__wf-human-variation. |
Six ad-hoc fixtures lack a root nextflow_schema.json. | CRG-CNAG__CalliNGS-NF, JaneliaSciComp__nf-demos, ZuberLab__crispr-process-nf, biocorecrg__MOP2, replikation__What_the_Phage, ncbi__egapx. |
| nf-core schemas often make sample sheets explicit. | nf-core__rnaseq/nextflow_schema.json declares input as type: string, format: file-path, exists: true, schema: assets/schema_input.json, mimetype: text/csv, and required: ["input", "outdir"]. |
| Required parameters can be enforced imperatively when schema is absent or not authoritative. | labsyspharm__mcmicro/main.nf errors when !params.containsKey('in'); CRG-CNAG__CalliNGS-NF/main.nf assigns defaults directly to params.genome, params.variants, params.denylist, params.reads, and params.results. |
Design inference:
- Treat
nextflow_schema.jsonas high-quality but incomplete. It says what the user may supply; the entry workflow says how the value becomes dataflow. - Path-like
stringparameters are not automatically Galaxydatainputs. Adirectory-pathmay represent a data collection root, a reference bundle, a staged cache, or an output directory. outdir,publish_dir_mode, email, resource, and reporting parameters are workflow-control parameters. They usually should not become Galaxy workflow inputs unless the target harness intentionally exposes execution-control knobs.- Requiredness should combine schema
required, missing-default checks, imperativeerror, and runtimeifEmpty { error ... }guards.
Sample sheets and records
Sample sheets are the dominant structured input idiom in modern nf-core pipelines. A sample sheet is a file parameter at the launch surface, but after validation and parsing it becomes a channel of records or tuples with metadata and paths.
Documentation-observed: nf-schema samplesheetToList() validates a CSV/TSV sample sheet with a JSON schema and converts rows into typed Groovy values. The helper emits columns in schema property order, not source column order, and can mark fields as meta so metadata becomes a map in channel items.
Corpus-observed:
nf-core__rnaseq/workflows/rnaseq/main.nfimportssamplesheetToList, readsparams.inputthroughchannel.fromList(samplesheetToList(params.input, "${projectDir}/assets/schema_input.json")), then maps rows into[meta.id, meta + [single_end: ...], reads, genome_bam, transcriptome_bam], groups by sample, and branches into FASTQ or BAM paths.nf-core__taxprofiler/subworkflows/local/utils_nfcore_taxprofiler_pipeline/main.nfusessamplesheetToList(input, "assets/schema_input.json")for samples andsamplesheetToList(databases, "assets/schema_database.json")for database definitions.nf-core__sarek/subworkflows/local/utils_nfcore_sarek_pipeline/main.nfusessamplesheetToListfor both normal input and restart input.epi2me-labs__wf-human-variation/lib/ingress.nfvalidates a sample sheet, then buildsChannel.fromPath(sample_sheet).last().splitCsv(header: true, quote: '"').replikation__What_the_Phage/phage.nftreatsparams.fastaas either a CSV list that maps row 0 to sample name and row 1 to file path, or a direct FASTA file when list mode is off.
Design inference:
- A sample-sheet parameter often maps to a Galaxy collection input, not to a single file input, even though its schema type is
stringandformat: file-path. - Preserve row metadata.
val(meta)and sample-sheet columns carry identifiers, labels, grouping axes, single/paired flags, and optional analysis groups. They are not datasets, but they often determine Galaxy collection identifiers and branch conditions. - Prefer sample-sheet schema over ad-hoc filename inference when both exist. It carries column names, required columns, path columns, and metadata fields.
Path and glob inputs
Direct path/glob construction is common, especially outside nf-core. It is also present in nf-core subworkflows for references and optional resources.
Documentation-observed:
channel.fromPath()emits paths matching a file name or glob pattern. Options include existence checks and type filtering.channel.fromFilePairs()emits tuples of grouping key plus a sorted list of files.sizecontrols expected pair size andflat: trueflattens file pairs.file()returns onePath;files()returns a collection ofPathobjects.- A process
pathinput stages a file or collection of files into the task work directory.
Corpus-observed counts from the fixture corpus:
| Construct | Matches | Pipelines |
|---|---|---|
Channel.fromPath / channel.fromPath | 118 | 9 |
Channel.fromFilePairs | 7 | 2 |
splitCsv | 14 | 5 |
fromSamplesheet | 0 | 0 |
Representative observed forms:
CRG-CNAG__CalliNGS-NF/main.nfsetsparams.reads = "$baseDir/data/reads/rep1_{1,2}.fq.gz"and materializes it withChannel.fromFilePairs(params.reads).biocorecrg__MOP2/mop_preprocess/mop_preprocess.nfusesChannel.fromPath(params.fast5).ifEmpty { error ... }andChannel.fromFilePairs(params.fastq, size: 1).ifEmpty { error ... }.labsyspharm__mcmicro/main.nfderives multiple channels from subdirectories belowparams.in, includingmarkers.csv, raw-image files, staging directories, and precomputed intermediates.epi2me-labs__wf-human-variation/main.nfmaps optional BED/reference inputs withChannel.fromPath(params.bed, checkIfExists: true)or placeholder files when absent.
Design inference:
fromPathover one concrete file is a candidate Galaxy dataset input.fromPathover a directory or glob is a candidate Galaxy collection input, but the collection type depends on downstream tuple shape and process qualifiers.fromFilePairsis strong evidence for paired files, usually Galaxypairedorlist:paired, unlesssize: 1indicates grouped singleton files.- Placeholder paths such as
OPTIONAL_FILErepresent optional-branch plumbing, not real user inputs. ifEmpty { error ... }is requiredness evidence for the materialized dataflow, even if the parameter schema does not mark the parameter required.
Channels, values, and operators
Nextflow dataflow has two cardinality classes: channels and dataflow values. Channels are asynchronous sequences; dataflow values are singleton asynchronous values. Operators transform, filter, combine, fork, or reduce them. component-nextflow-channel-operators gives the detailed operator taxonomy; this note only records why operators matter for workflow I/O.
Documentation-observed:
mapis one-to-one except null results are not emitted.flatMapandflattenfan out values.collect,toList, and related operators fan in many values into one list-like value.groupTuplegroups tuple streams by key and emits grouped payload lists.join,combine,cross,mix, andconcatcombine channels with different key/order semantics.branchandmultiMapreturn multiple channels.
Design inference:
- Channel shape is part of the interface. A path parameter that becomes
tuple(meta, [fastq_1, fastq_2])should not translate like a scalar file parameter. - Operators between materialization and first process call may change the Galaxy input shape.
groupTuple()can turn rows into per-sample collections;branchcan split one input surface into alternative workflow paths;collect()can turn many files into one collection-valued dependency. - Some operators are nondeterministic or order-sensitive. Translation should prefer keyed metadata and explicit identifier matching over implicit channel order.
Process and named workflow surfaces
Process input: and output: blocks define task interface. Named workflow take: and emit: define reusable workflow interface. These are real I/O surfaces, but not always top-level pipeline I/O surfaces.
Corpus-observed counts from 700 .nf files:
| Surface | Matches |
|---|---|
input: blocks | 656 |
output: blocks | 677 |
take: blocks | 266 |
workflow/module emit: blocks | 244 |
path qualifier | 1229 |
val qualifier | 325 |
tuple qualifier | 1953 |
env qualifier | 20 |
each qualifier | 9 |
legacy file qualifier | 71 |
legacy set qualifier | 21 |
Representative observed forms:
nf-core__rnaseq/modules/nf-core/fastqc/main.nfuses inputtuple val(meta), path(reads)and emits named HTML, ZIP, and versions outputs.CRG-CNAG__CalliNGS-NF/modules.nfhas simple path inputs such aspath genome, and mixed tuple outputs such astuple val(replicateId), path(...bam), path(...bai).ZuberLab__crispr-process-nf/main.nfstill uses legacy DSL1-ishset val(lane), file(bam) from rawBamFilesandset val(lane), file("${lane}.fastq.gz") into fastqFilesFromBam.nf-core__rnaseq/workflows/rnaseq/main.nfdeclares a named workflowtake:with semantic channel names such asch_samplesheet,ch_fasta, andch_star_index, plus comments describing expected path shapes.
Documentation-observed process qualifier meanings:
| Qualifier | Meaning for interface extraction |
|---|---|
val | Scalar or metadata value. Not a dataset unless it is a path string intentionally passed as scalar. |
path / legacy file | Staged file or collection of files. Strong dataset evidence. |
tuple / legacy set | One channel item with multiple typed fields. Strong collection/record evidence, not automatically one Galaxy collection. |
env | Environment variable value. Usually scalar/tool parameter semantics. |
stdin / stdout | Stream surface. Rare in this corpus; translate only with per-process review. |
each | Repeater. Expands task combinations and often maps to parameter sweeps or collection map-over. |
Design inference:
- For top-level interface conversion, process inputs validate and refine shape; they should not create new workflow inputs unless their upstream source is a top-level parameter or external file construction.
- Named workflow
take:blocks are especially important in nf-core because the entry workflow often delegates to one substantive named workflow after initialization. tuple val(meta), path(...)is the core idiom for keyed dataset collections. Themetamap supplies identifiers; thepathfield supplies datasets.- Legacy
file/setmust be supported for ad-hoc pipelines even if new guidance preferspath/tuple.
What counts as output
Nextflow output has four distinct meanings. Confusing them produces bad Galaxy interfaces.
| Surface | Meaning | Top-level output confidence |
|---|---|---|
Process output: | Files/values emitted from a task into dataflow. | Medium. Internal unless published or exposed by workflow emits. |
Named workflow emit: | Outputs exposed by a named workflow call. | Medium to high when the named workflow is the main pipeline body. |
publishDir | Side-effect copy/link of selected process outputs into an output directory. | High for user-visible output intent; low for dataflow shape alone. |
Entry workflow publish: plus top-level output {} | Modern declared workflow outputs. | Highest when present. |
Documentation-observed:
- Workflow output blocks are available as preview in Nextflow 24.04, 24.10, and 25.04, and documented as added in 25.10.
- Workflow outputs are intended to replace
publishDir. - Entry workflow
publish:assigns channels to output names; top-leveloutput {}declares how those outputs are published. - Workflow output
pathcan be static, dynamic per value, or use>>to publish selected files. - Workflow output
indexcan write CSV, JSON, or YAML metadata preserving channel value structure. publishDiris asynchronous side-effect publishing. Downstream processes should use declared process outputs, not files in publish directories.
Corpus-observed:
- Only one fixture had a top-level
output {}block:nf-core__sarek/main.nfpublishesmultiqc = NFCORE_SAREK.out.multiqc_publishand declaresoutput { multiqc { path "multiqc" } }. - Inline
publishDirin.nffiles was observed primarily in ad-hoc fixtures: 113 matches across 7 pipelines. - nf-core publishing is primarily config-driven. For example,
nf-core__rnaseq/workflows/rnaseq/nextflow.configusesprocess.withNameselectors withpublishDir = [ path: { "${params.outdir}/..." }, mode: params.publish_dir_mode, pattern: '*.bam', saveAs: { ... } ]. nf-core__taxprofiler/conf/modules.configcontains manypublishDirblocks; module.nffiles mostly declare outputs and versions but leave publication policy to config.replikation__What_the_Phage/workflows/process/virsorter2/virsorter2_collect_data.nfpublishes a tarball with inlinepublishDir "${params.output}/${name}/raw_data", mode: 'copy', pattern: "virsorter2_results_${name}.tar.gz".ncbi__egapx/nf/ui.nfuses a UI/export process withpublishDir "${params.output}", mode: 'copy', saveAs: { ... }, many staged inputs, andpath "*", includeInputs: true.
Design inference:
- Prefer top-level workflow
output {}when present; it is explicit interface vocabulary. - Otherwise combine named workflow emits, process emits, and publication rules to identify user-visible outputs.
- Do not use
publishDiras dataflow. Use it to name, group, filter, and label final outputs. - Config-driven
publishDiris load-bearing for nf-core. Reading only.nffiles misses user-visible output intent. versions.ymlandtopic: versionsare provenance/reporting outputs. They should usually become a report/provenance output class, not primary scientific outputs.
nf-core versus ad-hoc posture
The fixture corpus shows two different extraction modes.
| Feature | nf-core fixtures | Ad-hoc fixtures |
|---|---|---|
| Parameter schema | Always root nextflow_schema.json. | Mixed; two roots have schemas, six do not. |
| Sample sheet | Usually samplesheetToList plus assets/schema_*.json. | Direct fromPath, fromFilePairs, splitCsv, directory discovery. |
| Module I/O | Heavy tuple val(meta), path(...), named emit, versions channels. | Mixed DSL2 and legacy DSL1-ish file / set; less uniform names. |
| Publishing | Mostly config publishDir selectors. | Often inline publishDir in .nf files. |
| Requiredness | Schema required plus validation helpers. | Imperative error, ifEmpty, defaults, and custom parsers. |
Design inference: a robust summarizer needs two paths. The nf-core path should trust schema and convention but still inspect workflow construction. The ad-hoc path should mine defaults, guards, channel factories, process qualifiers, and publication side effects more aggressively.
Extraction order for Foundry Molds
For summarize-nextflow and downstream interface Molds, use this order:
- Inventory launch params. Read
nextflow_schema.jsonif present; also readparamsdefaults in.configand.nffiles. Record type, format, default, requiredness, enum, path constraints, and descriptions. - Classify control params. Separate data-bearing parameters from output, resource, profile, reporting, and behavior toggles. Keep toggles if they alter workflow shape.
- Resolve sample-sheet schemas. When a param property has
schema: assets/schema_*.jsonor the workflow callssamplesheetToList, parse the sample-sheet schema as the structured input shape. - Trace materialization. Follow each input-like param into
fromPath,fromFilePairs,fromList,splitCsv,file,files, placeholder channels, andifEmptyguards. - Read component boundaries. Use named workflow
take:/emit:and processinput:/output:to refine channel shapes, file arity, optionality, and labels. - Apply operator semantics. Use component-nextflow-channel-operators to account for grouping, branching, fan-in, fan-out, joins, and reductions before the first task or final output.
- Identify output intent. Prefer workflow output blocks. Otherwise inspect named emits plus
publishDirin.nfand.configfiles. Keep publication grouping separate from dataflow edges. - Emit confidence. Mark schema-backed, sample-sheet-backed, and direct channel-backed facts as high confidence. Mark glob-only file typing, dynamic closures, and output side effects as lower confidence.
Mapping reminders
params.inputis only a name. Determine whether it is a sample sheet, direct dataset, directory, glob, accession list, or workflow mode switch.val(meta)is metadata, not data. Use it for Galaxy collection identifiers, labels, tags, and synchronization.tupleis a record, not automatically a Galaxy paired collection. Pairing requires path count and semantic evidence.pathwitharity: '1'is stronger single-file evidence than a glob with no arity.optional: trueoutputs emit nothing when absent; Galaxy output handling may need conditional or optional-output modeling.- Multiple queue channels to one process can be nondeterministic; prefer keyed
join/combineevidence before assuming synchronized collections. - Topic channels are implicit fan-in. In this corpus they mostly carry versions, but the extractor should not assume topic edges are normal explicit wires.
- Workflow
output {}andpublishDirboth describe user-visible files, but only workflowpublish:participates in the modern top-level workflow output interface.
Open questions
ShouldResolved 2026-05-08 — no.summary-nextflowdistinguish launch parameters from materialized workflow inputs as separate arrays?summary-nextflowstays Nextflow-faithful:params[]is the launch surface,sample_sheets[]is the structured-bridge surface, and materialization stays insideworkflow.channels[]. The input-vs-intermediate-vs-reference classification is a target concern and lives in nextflow-summary-to-galaxy-interface /nextflow-summary-to-cwl-interface. Three small Nextflow-faithful enrichments toworkflow.channels[]were added so the converter does mechanical lookups instead of re-parsingsourcestrings:construct(typed enum),from_param(FK intoparams[]for direct references), andrequired_runtime(boolean from.ifEmpty { error … }guards). One-hop Groovy bindings (def reads = params.reads; Channel.fromPath(reads)) are deferred to jmchilton/foundry#211. Tracking issue jmchilton/foundry#179.Should sample-sheet schemas become first-class structured inputs instead of prose inside a parameter entry?Resolved 2026-05-05 — yes. Promoted toSummary.sample_sheets[]in summary-nextflow rev 6; resolution rationale and Galaxy-side mapping in galaxy-sample-sheet-collections; tracking issue jmchilton/foundry#177.How should Galaxy targets expose Nextflow execution-control parameters such asResolved 2026-05-06 for v1 — classify by effect in nextflow-params-to-galaxy-inputs: exclude publish/runtime controls by default; keep booleans/enums only when they alter workflow shape, tool choice, or command outputs.outdir,publish_dir_mode, email, andsave_*toggles?- Do workflow output
indexfiles deserve a target-side Galaxy pattern for preserving sample metadata beside published datasets? - How much legacy DSL1 support should the cast skill keep for
file/setpipelines versus flagging them as low-confidence?