Home Research

Nextflow workflow I/O semantics

Defines Nextflow workflow inputs and outputs from docs plus observed fixture pipeline structures.

Raw
Revised
2026-05-08
Rev
4
component

Nextflow workflow I/O semantics

Use this note when deciding what counts as a workflow input or output in a Nextflow pipeline before translating the interface into Galaxy or CWL. The central problem: Nextflow does not have one historical interface surface. Real pipelines spread interface meaning across params, nextflow_schema.json, sample-sheet schemas, entry workflow channel construction, process input: / output: blocks, named workflow take: / emit: blocks, publishDir, and newer workflow publish: / output {} blocks.

Evidence quality is split below:

  • Documentation-observed claims come from Nextflow, nf-core, and nf-schema docs.
  • Corpus-observed claims come from the pinned fixture corpus in workflow-fixtures/pipelines/.
  • Design inference states how the Foundry should use those facts.

What counts as input

Nextflow workflow input has three layers. The correct abstraction depends on which layer is being translated.

LayerSurfaceMeaningTranslation value
Parameter surfaceparams.*, params {}, nextflow_schema.json, params files, configUser-facing values supplied at launch. May be scalars, paths, directories, glob patterns, sample sheets, toggles, tool choices, resource knobs, and output settings.Strongest source for Galaxy top-level scalar parameters and candidate dataset inputs.
Materialization surfacechannel.fromPath, channel.fromFilePairs, channel.fromList(samplesheetToList(...)), splitCsv, file(), files()How launch values become channels or dataflow values. Encodes dataset-vs-collection shape.Strongest source for Galaxy dataset and collection shapes.
Component surfacenamed workflow take:, process input: declarationsInternal function boundary. Names semantic channels or task-local staged files.Strongest source for per-step shape and dependency semantics, not automatically top-level workflow inputs.

Documentation-observed: an entry workflow can read params; named workflows declare inputs with take:. Processes declare task inputs with qualifiers including val, path, env, stdin, tuple, and each. A process call must be supplied a channel, dataflow value, or regular value for each declared input.

Design inference: top-level Galaxy workflow inputs should be derived from parameter plus materialization surfaces, not from every process input:. Process inputs mostly describe task-local staging and downstream wiring. Named workflow take: entries matter when the named workflow is the real pipeline body invoked by a thin entry workflow, as in many nf-core pipelines.

Parameter surface

params values are the launch-time user interface. Defaults can be assigned in .nf files, nextflow.config, params files, or config profiles; command-line and config resolution order is defined by Nextflow’s pipeline parameter docs. The same parameter can be a scalar threshold, an input file, an input directory, a glob, a sample-sheet path, a reference bundle, an output directory, or a behavior toggle.

When nextflow_schema.json exists, it is the best available declaration for parameter type, description, requiredness, and path-like intent. nextflow-parameters-meta records the supported nf-schema vocabulary: type is limited to string, boolean, integer, and number; format can mark file-path, directory-path, path, or file-path-pattern; exists, mimetype, pattern, schema, enum, hidden, and description fields refine behavior.

Corpus-observed:

ShapeEvidence
All seven nf-core fixtures have root nextflow_schema.json.nf-core__demo, nf-core__fetchngs, nf-core__rnaseq, nf-core__bacass, nf-core__hlatyping, nf-core__sarek, nf-core__taxprofiler.
Two ad-hoc fixtures also have root nextflow_schema.json.labsyspharm__mcmicro, epi2me-labs__wf-human-variation.
Six ad-hoc fixtures lack a root nextflow_schema.json.CRG-CNAG__CalliNGS-NF, JaneliaSciComp__nf-demos, ZuberLab__crispr-process-nf, biocorecrg__MOP2, replikation__What_the_Phage, ncbi__egapx.
nf-core schemas often make sample sheets explicit.nf-core__rnaseq/nextflow_schema.json declares input as type: string, format: file-path, exists: true, schema: assets/schema_input.json, mimetype: text/csv, and required: ["input", "outdir"].
Required parameters can be enforced imperatively when schema is absent or not authoritative.labsyspharm__mcmicro/main.nf errors when !params.containsKey('in'); CRG-CNAG__CalliNGS-NF/main.nf assigns defaults directly to params.genome, params.variants, params.denylist, params.reads, and params.results.

Design inference:

  • Treat nextflow_schema.json as high-quality but incomplete. It says what the user may supply; the entry workflow says how the value becomes dataflow.
  • Path-like string parameters are not automatically Galaxy data inputs. A directory-path may represent a data collection root, a reference bundle, a staged cache, or an output directory.
  • outdir, publish_dir_mode, email, resource, and reporting parameters are workflow-control parameters. They usually should not become Galaxy workflow inputs unless the target harness intentionally exposes execution-control knobs.
  • Requiredness should combine schema required, missing-default checks, imperative error, and runtime ifEmpty { error ... } guards.

Sample sheets and records

Sample sheets are the dominant structured input idiom in modern nf-core pipelines. A sample sheet is a file parameter at the launch surface, but after validation and parsing it becomes a channel of records or tuples with metadata and paths.

Documentation-observed: nf-schema samplesheetToList() validates a CSV/TSV sample sheet with a JSON schema and converts rows into typed Groovy values. The helper emits columns in schema property order, not source column order, and can mark fields as meta so metadata becomes a map in channel items.

Corpus-observed:

  • nf-core__rnaseq/workflows/rnaseq/main.nf imports samplesheetToList, reads params.input through channel.fromList(samplesheetToList(params.input, "${projectDir}/assets/schema_input.json")), then maps rows into [meta.id, meta + [single_end: ...], reads, genome_bam, transcriptome_bam], groups by sample, and branches into FASTQ or BAM paths.
  • nf-core__taxprofiler/subworkflows/local/utils_nfcore_taxprofiler_pipeline/main.nf uses samplesheetToList(input, "assets/schema_input.json") for samples and samplesheetToList(databases, "assets/schema_database.json") for database definitions.
  • nf-core__sarek/subworkflows/local/utils_nfcore_sarek_pipeline/main.nf uses samplesheetToList for both normal input and restart input.
  • epi2me-labs__wf-human-variation/lib/ingress.nf validates a sample sheet, then builds Channel.fromPath(sample_sheet).last().splitCsv(header: true, quote: '"').
  • replikation__What_the_Phage/phage.nf treats params.fasta as either a CSV list that maps row 0 to sample name and row 1 to file path, or a direct FASTA file when list mode is off.

Design inference:

  • A sample-sheet parameter often maps to a Galaxy collection input, not to a single file input, even though its schema type is string and format: file-path.
  • Preserve row metadata. val(meta) and sample-sheet columns carry identifiers, labels, grouping axes, single/paired flags, and optional analysis groups. They are not datasets, but they often determine Galaxy collection identifiers and branch conditions.
  • Prefer sample-sheet schema over ad-hoc filename inference when both exist. It carries column names, required columns, path columns, and metadata fields.

Path and glob inputs

Direct path/glob construction is common, especially outside nf-core. It is also present in nf-core subworkflows for references and optional resources.

Documentation-observed:

  • channel.fromPath() emits paths matching a file name or glob pattern. Options include existence checks and type filtering.
  • channel.fromFilePairs() emits tuples of grouping key plus a sorted list of files. size controls expected pair size and flat: true flattens file pairs.
  • file() returns one Path; files() returns a collection of Path objects.
  • A process path input stages a file or collection of files into the task work directory.

Corpus-observed counts from the fixture corpus:

ConstructMatchesPipelines
Channel.fromPath / channel.fromPath1189
Channel.fromFilePairs72
splitCsv145
fromSamplesheet00

Representative observed forms:

  • CRG-CNAG__CalliNGS-NF/main.nf sets params.reads = "$baseDir/data/reads/rep1_{1,2}.fq.gz" and materializes it with Channel.fromFilePairs(params.reads).
  • biocorecrg__MOP2/mop_preprocess/mop_preprocess.nf uses Channel.fromPath(params.fast5).ifEmpty { error ... } and Channel.fromFilePairs(params.fastq, size: 1).ifEmpty { error ... }.
  • labsyspharm__mcmicro/main.nf derives multiple channels from subdirectories below params.in, including markers.csv, raw-image files, staging directories, and precomputed intermediates.
  • epi2me-labs__wf-human-variation/main.nf maps optional BED/reference inputs with Channel.fromPath(params.bed, checkIfExists: true) or placeholder files when absent.

Design inference:

  • fromPath over one concrete file is a candidate Galaxy dataset input.
  • fromPath over a directory or glob is a candidate Galaxy collection input, but the collection type depends on downstream tuple shape and process qualifiers.
  • fromFilePairs is strong evidence for paired files, usually Galaxy paired or list:paired, unless size: 1 indicates grouped singleton files.
  • Placeholder paths such as OPTIONAL_FILE represent optional-branch plumbing, not real user inputs.
  • ifEmpty { error ... } is requiredness evidence for the materialized dataflow, even if the parameter schema does not mark the parameter required.

Channels, values, and operators

Nextflow dataflow has two cardinality classes: channels and dataflow values. Channels are asynchronous sequences; dataflow values are singleton asynchronous values. Operators transform, filter, combine, fork, or reduce them. component-nextflow-channel-operators gives the detailed operator taxonomy; this note only records why operators matter for workflow I/O.

Documentation-observed:

  • map is one-to-one except null results are not emitted.
  • flatMap and flatten fan out values.
  • collect, toList, and related operators fan in many values into one list-like value.
  • groupTuple groups tuple streams by key and emits grouped payload lists.
  • join, combine, cross, mix, and concat combine channels with different key/order semantics.
  • branch and multiMap return multiple channels.

Design inference:

  • Channel shape is part of the interface. A path parameter that becomes tuple(meta, [fastq_1, fastq_2]) should not translate like a scalar file parameter.
  • Operators between materialization and first process call may change the Galaxy input shape. groupTuple() can turn rows into per-sample collections; branch can split one input surface into alternative workflow paths; collect() can turn many files into one collection-valued dependency.
  • Some operators are nondeterministic or order-sensitive. Translation should prefer keyed metadata and explicit identifier matching over implicit channel order.

Process and named workflow surfaces

Process input: and output: blocks define task interface. Named workflow take: and emit: define reusable workflow interface. These are real I/O surfaces, but not always top-level pipeline I/O surfaces.

Corpus-observed counts from 700 .nf files:

SurfaceMatches
input: blocks656
output: blocks677
take: blocks266
workflow/module emit: blocks244
path qualifier1229
val qualifier325
tuple qualifier1953
env qualifier20
each qualifier9
legacy file qualifier71
legacy set qualifier21

Representative observed forms:

  • nf-core__rnaseq/modules/nf-core/fastqc/main.nf uses input tuple val(meta), path(reads) and emits named HTML, ZIP, and versions outputs.
  • CRG-CNAG__CalliNGS-NF/modules.nf has simple path inputs such as path genome, and mixed tuple outputs such as tuple val(replicateId), path(...bam), path(...bai).
  • ZuberLab__crispr-process-nf/main.nf still uses legacy DSL1-ish set val(lane), file(bam) from rawBamFiles and set val(lane), file("${lane}.fastq.gz") into fastqFilesFromBam.
  • nf-core__rnaseq/workflows/rnaseq/main.nf declares a named workflow take: with semantic channel names such as ch_samplesheet, ch_fasta, and ch_star_index, plus comments describing expected path shapes.

Documentation-observed process qualifier meanings:

QualifierMeaning for interface extraction
valScalar or metadata value. Not a dataset unless it is a path string intentionally passed as scalar.
path / legacy fileStaged file or collection of files. Strong dataset evidence.
tuple / legacy setOne channel item with multiple typed fields. Strong collection/record evidence, not automatically one Galaxy collection.
envEnvironment variable value. Usually scalar/tool parameter semantics.
stdin / stdoutStream surface. Rare in this corpus; translate only with per-process review.
eachRepeater. Expands task combinations and often maps to parameter sweeps or collection map-over.

Design inference:

  • For top-level interface conversion, process inputs validate and refine shape; they should not create new workflow inputs unless their upstream source is a top-level parameter or external file construction.
  • Named workflow take: blocks are especially important in nf-core because the entry workflow often delegates to one substantive named workflow after initialization.
  • tuple val(meta), path(...) is the core idiom for keyed dataset collections. The meta map supplies identifiers; the path field supplies datasets.
  • Legacy file / set must be supported for ad-hoc pipelines even if new guidance prefers path / tuple.

What counts as output

Nextflow output has four distinct meanings. Confusing them produces bad Galaxy interfaces.

SurfaceMeaningTop-level output confidence
Process output:Files/values emitted from a task into dataflow.Medium. Internal unless published or exposed by workflow emits.
Named workflow emit:Outputs exposed by a named workflow call.Medium to high when the named workflow is the main pipeline body.
publishDirSide-effect copy/link of selected process outputs into an output directory.High for user-visible output intent; low for dataflow shape alone.
Entry workflow publish: plus top-level output {}Modern declared workflow outputs.Highest when present.

Documentation-observed:

  • Workflow output blocks are available as preview in Nextflow 24.04, 24.10, and 25.04, and documented as added in 25.10.
  • Workflow outputs are intended to replace publishDir.
  • Entry workflow publish: assigns channels to output names; top-level output {} declares how those outputs are published.
  • Workflow output path can be static, dynamic per value, or use >> to publish selected files.
  • Workflow output index can write CSV, JSON, or YAML metadata preserving channel value structure.
  • publishDir is asynchronous side-effect publishing. Downstream processes should use declared process outputs, not files in publish directories.

Corpus-observed:

  • Only one fixture had a top-level output {} block: nf-core__sarek/main.nf publishes multiqc = NFCORE_SAREK.out.multiqc_publish and declares output { multiqc { path "multiqc" } }.
  • Inline publishDir in .nf files was observed primarily in ad-hoc fixtures: 113 matches across 7 pipelines.
  • nf-core publishing is primarily config-driven. For example, nf-core__rnaseq/workflows/rnaseq/nextflow.config uses process.withName selectors with publishDir = [ path: { "${params.outdir}/..." }, mode: params.publish_dir_mode, pattern: '*.bam', saveAs: { ... } ].
  • nf-core__taxprofiler/conf/modules.config contains many publishDir blocks; module .nf files mostly declare outputs and versions but leave publication policy to config.
  • replikation__What_the_Phage/workflows/process/virsorter2/virsorter2_collect_data.nf publishes a tarball with inline publishDir "${params.output}/${name}/raw_data", mode: 'copy', pattern: "virsorter2_results_${name}.tar.gz".
  • ncbi__egapx/nf/ui.nf uses a UI/export process with publishDir "${params.output}", mode: 'copy', saveAs: { ... }, many staged inputs, and path "*", includeInputs: true.

Design inference:

  • Prefer top-level workflow output {} when present; it is explicit interface vocabulary.
  • Otherwise combine named workflow emits, process emits, and publication rules to identify user-visible outputs.
  • Do not use publishDir as dataflow. Use it to name, group, filter, and label final outputs.
  • Config-driven publishDir is load-bearing for nf-core. Reading only .nf files misses user-visible output intent.
  • versions.yml and topic: versions are provenance/reporting outputs. They should usually become a report/provenance output class, not primary scientific outputs.

nf-core versus ad-hoc posture

The fixture corpus shows two different extraction modes.

Featurenf-core fixturesAd-hoc fixtures
Parameter schemaAlways root nextflow_schema.json.Mixed; two roots have schemas, six do not.
Sample sheetUsually samplesheetToList plus assets/schema_*.json.Direct fromPath, fromFilePairs, splitCsv, directory discovery.
Module I/OHeavy tuple val(meta), path(...), named emit, versions channels.Mixed DSL2 and legacy DSL1-ish file / set; less uniform names.
PublishingMostly config publishDir selectors.Often inline publishDir in .nf files.
RequirednessSchema required plus validation helpers.Imperative error, ifEmpty, defaults, and custom parsers.

Design inference: a robust summarizer needs two paths. The nf-core path should trust schema and convention but still inspect workflow construction. The ad-hoc path should mine defaults, guards, channel factories, process qualifiers, and publication side effects more aggressively.

Extraction order for Foundry Molds

For summarize-nextflow and downstream interface Molds, use this order:

  1. Inventory launch params. Read nextflow_schema.json if present; also read params defaults in .config and .nf files. Record type, format, default, requiredness, enum, path constraints, and descriptions.
  2. Classify control params. Separate data-bearing parameters from output, resource, profile, reporting, and behavior toggles. Keep toggles if they alter workflow shape.
  3. Resolve sample-sheet schemas. When a param property has schema: assets/schema_*.json or the workflow calls samplesheetToList, parse the sample-sheet schema as the structured input shape.
  4. Trace materialization. Follow each input-like param into fromPath, fromFilePairs, fromList, splitCsv, file, files, placeholder channels, and ifEmpty guards.
  5. Read component boundaries. Use named workflow take: / emit: and process input: / output: to refine channel shapes, file arity, optionality, and labels.
  6. Apply operator semantics. Use component-nextflow-channel-operators to account for grouping, branching, fan-in, fan-out, joins, and reductions before the first task or final output.
  7. Identify output intent. Prefer workflow output blocks. Otherwise inspect named emits plus publishDir in .nf and .config files. Keep publication grouping separate from dataflow edges.
  8. Emit confidence. Mark schema-backed, sample-sheet-backed, and direct channel-backed facts as high confidence. Mark glob-only file typing, dynamic closures, and output side effects as lower confidence.

Mapping reminders

  • params.input is only a name. Determine whether it is a sample sheet, direct dataset, directory, glob, accession list, or workflow mode switch.
  • val(meta) is metadata, not data. Use it for Galaxy collection identifiers, labels, tags, and synchronization.
  • tuple is a record, not automatically a Galaxy paired collection. Pairing requires path count and semantic evidence.
  • path with arity: '1' is stronger single-file evidence than a glob with no arity.
  • optional: true outputs emit nothing when absent; Galaxy output handling may need conditional or optional-output modeling.
  • Multiple queue channels to one process can be nondeterministic; prefer keyed join / combine evidence before assuming synchronized collections.
  • Topic channels are implicit fan-in. In this corpus they mostly carry versions, but the extractor should not assume topic edges are normal explicit wires.
  • Workflow output {} and publishDir both describe user-visible files, but only workflow publish: participates in the modern top-level workflow output interface.

Open questions

  • Should summary-nextflow distinguish launch parameters from materialized workflow inputs as separate arrays? Resolved 2026-05-08 — no. summary-nextflow stays Nextflow-faithful: params[] is the launch surface, sample_sheets[] is the structured-bridge surface, and materialization stays inside workflow.channels[]. The input-vs-intermediate-vs-reference classification is a target concern and lives in nextflow-summary-to-galaxy-interface / nextflow-summary-to-cwl-interface. Three small Nextflow-faithful enrichments to workflow.channels[] were added so the converter does mechanical lookups instead of re-parsing source strings: construct (typed enum), from_param (FK into params[] for direct references), and required_runtime (boolean from .ifEmpty { error … } guards). One-hop Groovy bindings (def reads = params.reads; Channel.fromPath(reads)) are deferred to jmchilton/foundry#211. Tracking issue jmchilton/foundry#179.
  • Should sample-sheet schemas become first-class structured inputs instead of prose inside a parameter entry? Resolved 2026-05-05 — yes. Promoted to Summary.sample_sheets[] in summary-nextflow rev 6; resolution rationale and Galaxy-side mapping in galaxy-sample-sheet-collections; tracking issue jmchilton/foundry#177.
  • How should Galaxy targets expose Nextflow execution-control parameters such as outdir, publish_dir_mode, email, and save_* toggles? Resolved 2026-05-06 for v1 — classify by effect in nextflow-params-to-galaxy-inputs: exclude publish/runtime controls by default; keep booleans/enums only when they alter workflow shape, tool choice, or command outputs.
  • Do workflow output index files deserve a target-side Galaxy pattern for preserving sample metadata beside published datasets?
  • How much legacy DSL1 support should the cast skill keep for file / set pipelines versus flagging them as low-confidence?

Incoming References (9)