Structured contract copied for validation or lookup.
- Purpose
- Validate the emitted Nextflow summary JSON and provide downstream consumers the output contract.
Read a Nextflow pipeline source tree (nf-core or ad-hoc DSL2) and emit a structured JSON summary for downstream translation Molds.
1 non-index Markdown file with frontmatter.
source-specific fields are coherent.
eval.md declares cases and check type.
7 typed references; 0 resolver issues.
All on-demand references describe triggers.
Hypothesis references include verification.
Typed Mold references describe what casting consumes and when the generated skill should load each artifact.
Background synthesis loaded by explicit progressive-disclosure metadata.
A structured JSON summary of a Nextflow pipeline, including its interface, processes, data flow, software environment, and test fixtures.
Read a Nextflow pipeline source tree (nf-core or ad-hoc DSL2) and emit a structured JSON summary describing its processes, channels, conditionals, containers, parameters, and test fixtures. Source-specific (Nextflow), target-agnostic. The summary is the input to every downstream Mold in the NEXTFLOW → GALAXY and NEXTFLOW → CWL pipelines: nextflow-summary-to-galaxy-interface, nextflow-summary-to-galaxy-data-flow, nextflow-summary-to-cwl-interface, nextflow-summary-to-cwl-data-flow, author-galaxy-tool-wrapper (for the container/conda block), nextflow-test-to-galaxy-test-plan, and nextflow-test-to-cwl-test-plan (for the test-fixture block).
This Mold owns only the read-and-structure step. Every cross-source-and-target translation lives downstream; this Mold is responsible for surfacing what exists in the NF tree honestly, not for reshaping it toward Galaxy or CWL idioms.
The output schema is per-source by design — see gxy-sketches-alignment for why a forced-shared cross-source summary shape was rejected.
The Mold expects:
- … SketchSource semantics from gxy-sketches.
- A profile (test, test_full, …) selecting which conf/<profile>.config to read for fixtures. Defaults to test.
- … test_fixtures.inputs[].path.
Whole-pipeline only. The Mold does not accept “summarize this single subworkflow” subset hints; subset summarization is an open question (see Non-goals).
A single JSON document conforming to summary-nextflow (packages/summarize-nextflow/src/schema/summary-nextflow.schema.json). Sketch shape:
{
  "source": { // mirrors SketchSource
    "ecosystem": "nf-core" | "nextflow",
    "workflow": "rnaseq",
    "url": "https://github.com/nf-core/rnaseq",
    "version": "3.14.0", // tag or commit SHA
    "license": "MIT",
    "slug": "nf-core-rnaseq"
  },
  "params": [
    { "name": "input", "type": "path", "default": null,
      "description": "Samplesheet CSV", "required": true }
  ],
  "sample_sheets": [
    { "param": "input",
      "schema_path": "assets/schema_input.json",
      "discovered_via": "nf-schema",
      "format": "csv", "header": true,
      "columns": [
        { "name": "sample", "type": "string", "kind": "meta", "required": true,
          "pattern": "^\\S+$" },
        { "name": "fastq_1", "type": "string", "kind": "data", "format": "file-path",
          "required": true, "exists": true, "pattern": "^\\S+\\.f(ast)?q\\.gz$" },
        { "name": "fastq_2", "type": "string", "kind": "data", "format": "file-path",
          "required": false, "exists": true, "pattern": "^\\S+\\.f(ast)?q\\.gz$" },
        { "name": "strandedness", "type": "string", "kind": "meta", "required": true,
          "enum": ["forward", "reverse", "unstranded", "auto"] }
      ] }
  ],
  "profiles": ["test", "test_full", "docker", "singularity", "conda"],
  "tools": [ // mirrors gxy-sketches ToolSpec, augmented
    { "name": "fastp", "version": "0.23.4",
      "biocontainer": "biocontainers/fastp:0.23.4--h5f740d0_0", // accepts quay.io/ or docker.io biocontainers/ alias
      "bioconda": "bioconda::fastp=0.23.4",
      "docker": null,
      "singularity": "https://depot.galaxyproject.org/singularity/fastp:0.23.4--h5f740d0_0",
      "wave": null } // Seqera Wave / community-cr registry
  ],
  "processes": [
    { "name": "MINIMAP2_ALIGN", // canonical name
      "aliases": ["MINIMAP2_CONSENSUS", "MINIMAP2_POLISH"], // re-imported under multiple names; edges reference the alias
      "module_path": "modules/nf-core/minimap2/align/main.nf",
      "tool": "minimap2_mulled", // FK into tools[].name
      "container": "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? '<sing-uri>' : '<other-uri>' }", // verbatim directive
      "conda": "${moduleDir}/environment.yml", // verbatim directive
      "inputs": [ { "name": "reads", "shape": "tuple(val(meta), path(reads))", "description": "...", "topic": null } ],
      "outputs": [ { "name": "paf", "shape": "tuple(val(meta), path(\"*.paf\")) optional", "description": "...", "topic": null },
                   { "name": "versions", "shape": "path(\"versions.yml\")", "description": "tool versions YAML", "topic": null } ],
      "when": null,
      "script_summary": "Align reads against reference, emit PAF or BAM.",
      "publish_dir": null }
  ],
  "subworkflows": [
    { "name": "FASTQ_TRIM_FASTP_FASTQC",
      "path": "subworkflows/nf-core/fastq_trim_fastp_fastqc/main.nf",
      "kind": "pipeline",
      "calls": ["FASTP", "FASTQC_RAW", "FASTQC_TRIM"],
      "inputs": [], "outputs": [] },
    { "name": "PIPELINE_INITIALISATION",
      "path": "subworkflows/local/utils_nfcore_<name>_pipeline/main.nf",
      "kind": "utility", // composes free functions, no process invocations
      "calls": [],
      "inputs": [], "outputs": [
        { "name": "samplesheet", "shape": "tuple(meta, path)", "description": "validated --input", "topic": null }
      ] }
  ],
  "workflow": {
    "name": "RNASEQ",
    "channels": [
      { "name": "ch_samplesheet",
        "source": "Channel.fromList(samplesheetToList(params.input, '...'))",
        "shape": "tuple(meta, [path,path])",
        "construct": "samplesheetToList",
        "from_param": "input",
        "required_runtime": false }
    ],
    "edges": [
      { "from": "ch_samplesheet", "to": "FASTP", "via": [] },
      { "from": "FASTP.out.reads", "to": "STAR_ALIGN",
        "via": ["map", "join"] }
    ],
    "conditionals": [
      { "guard": "params.skip_alignment", "branch": "alternate",
        "affects": ["STAR_ALIGN"] }
    ]
  },
  "test_fixtures": {
    "profile": "test",
    "inputs": [ /* TestDataRef-shaped */ ],
    "outputs": [ /* ExpectedOutputRef-shaped */ ]
  },
  "nf_tests": [
    { "name": "-profile test_dfast",
      "path": "tests/dfast.nf.test",
      "profiles": ["test_dfast"],
      "params_overrides": { "outdir": "$outputDir" },
      "assert_workflow_success": true,
      "snapshot": {
        "captures": ["succeeded_task_count", "versions_yml", "stable_names", "stable_paths"],
        "helpers": ["getAllFilesFromDir", "removeNextflowVersion"],
        "ignore_files": ["tests/.nftignore", "tests/.nftignore_files_entirely"],
        "ignore_globs": [],
        "snap_path": "tests/dfast.nf.test.snap"
      },
      "prose_assertions": [] }
  ]
}
Field-name parity with gxy-sketches (SketchSource, ToolSpec, TestDataRef, ExpectedOutputRef) is intentional and load-bearing — see gxy-sketches-alignment §1-3.
The cast skill is not a single LLM prompt over the source tree. It is a small program with one or two embedded LLM calls. The split is:
- Deterministic: read nextflow.config and nextflow_schema.json, regex-tokenize process blocks for typed fields (name, container, conda, declared IO channel names, when: guards, publishDir), read nf-core module meta.yml verbatim, enumerate include { X } from '...' for the call graph, resolve biocontainer image strings.
- LLM: one-line summary of each script: body, reconciliation of operator-chained channel paths (A | map | join(B) | groupTuple) into the workflow edges[], free-text description / notes fields, IO inference when meta.yml is absent and the script is the only signal.
Everything the schema demands as a typed enum or path is deterministic. Free-text fields are LLM. The schema enforces that boundary by typing.
Branch shallow on layout:
- nf-core: nextflow.config declares manifest.name = 'nf-core/...'; modules/nf-core/, subworkflows/nf-core/, and nextflow_schema.json are present. Prefer meta.yml as IO ground truth.
- Ad-hoc DSL2: no nextflow_schema.json, no module meta.yml. Falls back to script:-block IO inference. Consult component-nextflow-pipeline-anatomy when layout differs from nf-core conventions in ways these rules do not cover.
- … source block and exit early with a warnings[] entry. Out of scope for v1.
Real pipelines have multiple named workflow blocks: typically an anonymous workflow {} entrypoint in main.nf that wires PIPELINE_INITIALISATION → NFCORE_<NAME> → PIPELINE_COMPLETION, plus a substantive named workflow under workflows/<name>.nf. Selection rule for the primary workflow: pick the named workflow that invokes the most pipeline processes. The anonymous workflow {} glue and the NFCORE_<NAME> wrapper land in subworkflows[], marked kind: utility and kind: pipeline respectively.
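The selection rule for the primary workflow can be sketched as a small counting function. This is an illustrative sketch only: the function name, the input shape (a map from workflow name to the names it calls), the choice to count direct calls rather than transitive ones, and the alphabetical tie-break are all assumptions, not part of the documented procedure.

```python
from typing import Dict, List, Optional, Set


def select_primary_workflow(
    workflow_calls: Dict[str, List[str]],
    process_names: Set[str],
) -> Optional[str]:
    """Return the named workflow that directly invokes the most pipeline processes."""
    best_name: Optional[str] = None
    best_count = -1
    # Iterate in sorted order so ties resolve deterministically (assumed tie-break).
    for name in sorted(workflow_calls):
        count = sum(1 for callee in workflow_calls[name] if callee in process_names)
        if count > best_count:
            best_name, best_count = name, count
    return best_name


# Hypothetical call graph: the NFCORE_<NAME> wrapper only calls the named workflow.
calls = {
    "NFCORE_RNASEQ": ["RNASEQ"],
    "RNASEQ": ["FASTP", "STAR_ALIGN", "MULTIQC"],
    "PIPELINE_INITIALISATION": [],
}
processes = {"FASTP", "STAR_ALIGN", "MULTIQC"}
primary = select_primary_workflow(calls, processes)
```

With this input the wrapper and the utility workflow both score zero, so the substantive workflow is selected and the other two would be left for subworkflows[].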
Populate source from git remote get-url, git rev-parse HEAD (or the user-supplied pin), manifest.name / manifest.homePage / manifest.version in nextflow.config, and LICENSE filename detection. slug is kebab of <owner>-<repo> for nf-core, kebab of repo basename otherwise.
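As a rough illustration of the slug rule, the sketch below derives a slug from a git remote URL. The helper names (kebab, derive_slug) and the exact kebab normalization (lowercase, runs of non-alphanumerics collapsed to a single hyphen) are assumptions; the source only states which name parts feed the slug.

```python
import re


def kebab(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def derive_slug(remote_url: str, ecosystem: str) -> str:
    """Derive the slug from the output of `git remote get-url`."""
    # Drop a trailing slash and ".git", then split on "/" and ":" so that both
    # https:// and git@host:owner/repo forms yield owner and repo as the last parts.
    path = re.sub(r"\.git$", "", remote_url.rstrip("/"))
    parts = re.split(r"[/:]", path)
    owner, repo = parts[-2], parts[-1]
    if ecosystem == "nf-core":
        return kebab(f"{owner}-{repo}")
    return kebab(repo)


slug_nfcore = derive_slug("https://github.com/nf-core/rnaseq", "nf-core")
slug_ssh = derive_slug("git@github.com:nf-core/rnaseq.git", "nf-core")
# Hypothetical ad-hoc repository URL, used only for illustration.
slug_adhoc = derive_slug("https://github.com/someone/My_Pipeline.git", "nextflow")
```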
Read nextflow.config params { ... } block for defaults. When nextflow_schema.json exists (nf-core), prefer it as the source of truth for type, description, and required — it is real JSON Schema, copy verbatim. Some params are computed at config-load time (for example params.fasta = getGenomeAttribute('fasta') in main.nf) and will not appear in nextflow_schema.json; include them with a description noting the dynamic source. Enumerate profiles { ... } keys.
Sample-sheet inputs are the dominant structured-input idiom in modern nf-core pipelines and the most lossy thing to leave as prose inside params[].description. For each candidate sample-sheet parameter, populate one sample_sheets[] entry capturing the row schema deterministically. Discovery has four branches, recorded in discovered_via:
- nf-schema: the param’s nextflow_schema.json entry has a schema: keyword pointing at a sibling JSON Schema file (assets/schema_*.json). Read that file. Each property in the row schema maps to one SampleSheetColumn. Preserve property order, not source-column order: samplesheetToList() emits columns in property order, and downstream channel item layout depends on it.
- samplesheetToList: the workflow imports samplesheetToList from nf-schema and calls it on the param. When the call cites a schema path, follow it. Without a schema path, emit the entry with schema_path: null and infer columns from a splitCsv-shaped fallback if any; otherwise emit columns: [] and a warnings[] note.
- splitCsv: a Channel.fromPath(params.X).splitCsv(header: true) materialization. Header inference only: emit columns by name, leave type: string, and infer kind from downstream path() consumption when traceable, else meta. Mark discovered_via: splitCsv.
- ad-hoc: pipeline-specific CSV/TSV parsing detected from script bodies (e.g. row-zero/row-one indexing). Emit a minimal entry with columns: [] plus a warnings[] advisory; downstream Molds will need to handle these by hand.
Column field rules:
- kind: data when the nf-schema format is file-path/directory-path/path, or when the meta: annotation is absent and the value is consumed as a path() downstream. meta otherwise (including all meta: true annotations and all non-path scalars). Nest the nf-schema meta: annotation here even when implicit; translation Molds key on it to decide which columns become Galaxy column_definitions[] versus element/inner-collection slots.
- type: copy verbatim from the row schema (string/integer/number/boolean). Path columns are string with a format qualifier; do not collapse path into a synthetic type.
- required, default, enum, pattern, exists, mimetype, description: copy verbatim when present, leaving null/empty defaults otherwise.
This step does not reshape onto any target idiom (Galaxy sample_sheet:paired vs list:paired is not decided here). It records what the source pipeline declares; the variant choice belongs to nextflow-summary-to-galaxy-interface and nextflow-summary-to-cwl-interface.
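The column rules above can be sketched as a small mapping from a row schema to column entries. This is a minimal sketch under stated assumptions: the row schema is taken to be a JSON Schema array whose items carry properties and required (the layout commonly seen in nf-core assets/schema_input.json), the function names are hypothetical, and only a subset of the column fields is copied.

```python
from typing import Any, Dict, List

PATH_FORMATS = {"file-path", "directory-path", "path"}


def classify_kind(prop: Dict[str, Any], consumed_as_path: bool = False) -> str:
    """Apply the kind rule to one row-schema property."""
    if prop.get("format") in PATH_FORMATS:
        return "data"
    # No meta annotation, and the value was traced to a path() input downstream.
    if "meta" not in prop and consumed_as_path:
        return "data"
    return "meta"


def columns_from_row_schema(row_schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emit one column per property, preserving property order."""
    items = row_schema.get("items", {})
    required = set(items.get("required", []))
    columns = []
    # dicts keep insertion order, so json.load() output preserves property order.
    for name, prop in items.get("properties", {}).items():
        columns.append({
            "name": name,
            "type": prop.get("type"),  # copied verbatim, never collapsed to "path"
            "kind": classify_kind(prop),
            "format": prop.get("format"),
            "required": name in required,
        })
    return columns


# Hypothetical row schema, trimmed to three properties.
row_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "sample": {"type": "string", "pattern": "^\\S+$", "meta": ["id"]},
            "fastq_1": {"type": "string", "format": "file-path", "exists": True},
            "strandedness": {"type": "string", "enum": ["forward", "reverse"]},
        },
        "required": ["sample", "fastq_1"],
    },
}
columns = columns_from_row_schema(row_schema)
```

Keeping classify_kind separate makes the one non-syntactic input, whether a column is consumed as a path() downstream, an explicit argument instead of hidden state.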
For each process <NAME> { ... } in main.nf, workflows/, modules/**, subworkflows/**:
- Copy container, conda, publishDir, when: directives verbatim into processes[].container / processes[].conda. Modern nf-core directives are ternary expressions (workflow.containerEngine == 'singularity' ? <sing-uri> : <docker-uri>) and file references (${moduleDir}/environment.yml); keep the directive text intact and resolve into tools[] separately (§5).
- Parse input: and output: blocks for declared channel names and shapes: typed channels (tuple val(meta), path(reads)) become shape strings ("tuple(meta, [path])"); arity is preserved as a string, not structured.
- Walk include { ... } statements across the pipeline (main.nf, workflows/, subworkflows/**) to populate processes[].aliases. include { MINIMAP2_ALIGN as MINIMAP2_CONSENSUS } adds MINIMAP2_CONSENSUS to the MINIMAP2_ALIGN process’s aliases[]. The same module can be re-imported under multiple aliases (bacass aliases MINIMAP2_ALIGN three times). Edges reference the alias name; the canonical name is the FK target.
- Detect topic: <name> annotations on outputs (Nextflow 24+ channel topics; nf-core templates emit tuple(val("${task.process}"), val('toolname'), eval(...)) topic: versions for version aggregation). Record the topic name in ChannelIO.topic.
- When meta.yml exists, use it for description and IO documentation rather than parsing the script: block.
- Summarize the script: body in one line. Pass the script verbatim plus the declared IO; ask only for what the tool does.
Walk per-process container and conda directives. Container directives are usually ternary; extract both branches:
- The singularity ? branch typically yields an https://depot.galaxyproject.org/singularity/<name>:<version>--<build> URL → tools[].singularity.
- quay.io/biocontainers/<name>:<version>--<build> → tools[].biocontainer.
- biocontainers/<name>:<version>--<build> (docker.io alias for the same biocontainer image) → tools[].biocontainer (same field; both forms are biocontainer images).
- community.wave.seqera.io/library/<name>:<version>--<digest> or https://community-cr-prod.seqera.io/.../sha256/<digest>/data → tools[].wave.
- … tools[].docker.
Conda directives are usually file references to ${moduleDir}/environment.yml; read the file and extract its dependencies: list. Each bioconda::<name>=<version> entry becomes a tools[] entry with tools[].bioconda set to the original dependency string. Multi-tool environments are common (minimap2 + samtools + htslib, racon + multiqc); keep every Bioconda dependency rather than selecting the first. Legacy literal-string directives (conda "bioconda::<name>=<version>") feed the same field.
Tool name and version are typically derivable from any of the resolved fields. Deduplicate by (name, version) across processes; one entry per tool. processes[].tool is a foreign key into tools[].name. This block is the bridge to author-galaxy-tool-wrapper — it consumes container/conda info to choose or justify the UDT container.
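The container routing and the (name, version) deduplication might look roughly like the sketch below. All function names are hypothetical, the ternary is split with a plain regex rather than a Groovy parser, and routing any unrecognized image to the docker field is an assumption.

```python
import re
from typing import Dict, List, Optional, Tuple


def ternary_branches(directive: str) -> List[str]:
    """Return the quoted image strings from both branches of a ternary directive."""
    tail = directive.split("?", 1)[-1]  # drop the condition before the first "?"
    return re.findall(r"""['"]([^'"]+)['"]""", tail)


def classify_image(uri: str) -> str:
    """Map an image string to the tools[] field that should hold it."""
    if uri.startswith("https://depot.galaxyproject.org/singularity/"):
        return "singularity"
    if uri.startswith(("quay.io/biocontainers/", "biocontainers/")):
        return "biocontainer"
    if uri.startswith(("community.wave.seqera.io/library/",
                       "https://community-cr-prod.seqera.io/")):
        return "wave"
    return "docker"  # assumption: anything unrecognized is a plain docker image


def name_version(uri: str) -> Optional[Tuple[str, str]]:
    """Parse <name>:<version>[--<build>] from the last path segment, if present."""
    match = re.search(r"/([^/:]+):([^/:]+?)(?:--[^/]+)?$", uri)
    return (match.group(1), match.group(2)) if match else None


def merge_tools(uris: List[str]) -> List[Dict[str, Optional[str]]]:
    """Deduplicate by (name, version): one tools[] entry per tool."""
    tools: Dict[Tuple[str, str], Dict[str, Optional[str]]] = {}
    for uri in uris:
        key = name_version(uri)
        if key is None:
            continue  # e.g. digest-only Wave URLs carry no name:version pair
        entry = tools.setdefault(key, {
            "name": key[0], "version": key[1], "biocontainer": None,
            "bioconda": None, "docker": None, "singularity": None, "wave": None,
        })
        entry[classify_image(uri)] = uri
    return list(tools.values())


directive = (
    "workflow.containerEngine == 'singularity' ? "
    "'https://depot.galaxyproject.org/singularity/fastp:0.23.4--h5f740d0_0' : "
    "'biocontainers/fastp:0.23.4--h5f740d0_0'"
)
tools = merge_tools(ternary_branches(directive))
```

Both branches resolve to the same (name, version) pair here, so they collapse into a single entry with two populated registry fields.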
Enumerate the top-level workflow’s include statements and channel construction (Channel.fromPath, Channel.fromFilePairs, Channel.fromList(samplesheetToList(...)), splitCsv, file()/files(), params.*, channel.empty(), channel.topic('<name>')). For operator chains, the deterministic parser records the literal chain (["map", "join", "groupTuple"] in via). Reconciling chained operators into a coherent from → to edge is the second LLM call: given the literal chain, the source channel shape, and the downstream process’s declared input shape, emit the resolved edge.
For each emitted workflow.channels[] entry, populate three classified fields alongside the verbatim source:
- construct: typed enum reflecting the channel’s primary materialization factory or shape-determining operator. Selection precedence: (1) samplesheetToList when the chain contains samplesheetToList(...); (2) splitCsv when the chain ends in .splitCsv(header: true) over a path; (3) otherwise the outermost factory (Channel.fromPath → fromPath, Channel.fromFilePairs → fromFilePairs, Channel.fromList → fromList, file(...) → file, files(...) → files, Channel.of → of, Channel.value → value, Channel.empty → empty, Channel.topic → topic); (4) other for derived/operator-only constructions.
- from_param: FK into params[].name when the construction expression directly references params.X (e.g. Channel.fromPath(params.reads), samplesheetToList(params.input, ...), file(params.fasta)). v1 is direct-only; one-hop Groovy bindings (def reads = params.reads; Channel.fromPath(reads)) are deferred to jmchilton/foundry#211. Null when no direct reference, or when construct is not data-bearing (empty, of, value, topic, other).
- required_runtime: true when the construction chain ends in .ifEmpty { error ... } (or an equivalent imperative emptiness-throw guard). Captures runtime requiredness even when the param’s nf-schema entry does not mark it required. False otherwise.
All three fields are syntactic: regex-level extraction over the construction expression, no LLM call.
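Because the three fields are syntactic, they can be approximated with a few regular expressions over the verbatim source string. The sketch below is illustrative only: the function names and regexes are assumptions, and it ignores corner cases such as multi-line closures or nested braces inside the ifEmpty block.

```python
import re
from typing import Optional

FACTORIES = {"fromPath", "fromFilePairs", "fromList", "of", "value", "empty", "topic"}
NOT_DATA_BEARING = {"empty", "of", "value", "topic", "other"}


def classify_construct(source: str) -> str:
    """Apply the precedence rules (1) to (4) to a channel construction expression."""
    if "samplesheetToList(" in source:
        return "samplesheetToList"
    if re.search(r"\.splitCsv\s*\([^()]*\)\s*$", source):
        return "splitCsv"
    factory = re.match(r"\s*[Cc]hannel\.(\w+)", source)
    if factory and factory.group(1) in FACTORIES:
        return factory.group(1)
    helper = re.match(r"\s*(files?)\s*\(", source)
    if helper:
        return helper.group(1)
    return "other"


def from_param(source: str) -> Optional[str]:
    """Return the directly referenced param name, or None (direct references only)."""
    if classify_construct(source) in NOT_DATA_BEARING:
        return None
    ref = re.search(r"\bparams\.(\w+)", source)
    return ref.group(1) if ref else None


def required_runtime(source: str) -> bool:
    """True when the chain ends in an .ifEmpty { error ... } guard."""
    return re.search(r"\.ifEmpty\s*\{\s*error\b[^{}]*\}\s*$", source) is not None


samplesheet = "Channel.fromList(samplesheetToList(params.input, '...'))"
guarded = "Channel.fromPath(params.reads).ifEmpty { error 'no reads' }"
```

For the samplesheet expression this yields the same construct, from_param, and required_runtime values shown for ch_samplesheet in the sketch shape.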
Workflow-level conditionals (if (params.skip_alignment) { ... }) emit conditionals[] entries with the guard, the branch (alternate vs default), and the set of processes affected.
Subworkflows split into two kinds:
- kind: pipeline. Invokes pipeline processes (data-flow contributor). Covers the NFCORE_<NAME> wrapper and any nested subworkflows/local/ subworkflow that calls processes.
- kind: utility. Composes free-function calls only (paramsHelp, samplesheetToList, completionEmail, imNotification). Covers nf-core template subworkflows like PIPELINE_INITIALISATION and PIPELINE_COMPLETION. Subworkflow.calls is empty for utilities; their job is to produce channels (e.g. the validated samplesheet) the primary workflow consumes.
Free-function calls in the workflow body itself (paramsSummaryMap, softwareVersionsToYAML, methodsDescriptionText) are not modeled as processes or subworkflows. Their channel outputs flow into the primary workflow’s channels[]; the function names are nf-core template idiom, not pipeline-specific signal. Operator chains with deeply nested closures may produce edges flagged with low confidence in notes.
Two artifacts come out of this step: test_fixtures (data shape of the selected profile’s input) and nf_tests[] (every tests/*.nf.test file).
test_fixtures — read conf/<profile>.config (default conf/test.config) for params.input (samplesheet URL) and any other URL-shaped params. For nf-core pipelines, follow the samplesheet URL into the nf-core/test-datasets repo if a single fetch is enough to enumerate the file paths it references; otherwise emit the samplesheet URL alone as the input. The samplesheet URL may be a runtime concatenation (params.pipelines_testdata_base_path + 'foo.csv'); resolve at config-load semantics and record the resolved URL.
When fixture fetching is enabled, hash each fetched remote file with SHA-1. When a test-data directory is provided, write the samplesheet and every referenced remote file under that directory using a deterministic URL-derived path and record that local filesystem path in path while preserving the original url.
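A minimal sketch of the hashing and local-path bookkeeping follows. The source does not spell out the URL-to-path mapping or the name of the hash field, so the host-plus-path layout, the sha1 key, the helper names, and the example URL are all hypothetical.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def sha1_hex(data: bytes) -> str:
    """SHA-1 digest of already-fetched file content."""
    return hashlib.sha1(data).hexdigest()


def local_path_for(url: str, test_data_dir: str) -> str:
    """Map a URL to a stable path under the test-data directory (assumed scheme)."""
    parsed = urlparse(url)
    return str(PurePosixPath(test_data_dir) / parsed.netloc / parsed.path.lstrip("/"))


def fixture_entry(url: str, content: bytes, test_data_dir: str) -> dict:
    """Build one test_fixtures.inputs[] entry, keeping the original url."""
    return {
        "url": url,
        "path": local_path_for(url, test_data_dir),
        "sha1": sha1_hex(content),  # hypothetical field name
    }


# Hypothetical URL; empty content stands in for a fetched file.
entry = fixture_entry("https://example.org/test-datasets/samplesheet.csv", b"", "test-data")
```

Deriving the path purely from the URL means repeated runs write each fixture to the same location without consulting any index.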
Each entry follows TestDataRef (inputs) / ExpectedOutputRef (outputs) field names verbatim. The path vs url rules from gxy-sketches’ TestDataRef carry over, with one extension: path may be the local fetched path for a remote URL. The “must be under test_data/” constraint does not — see gxy-sketches-alignment §1.
nf_tests[] — enumerate every tests/*.nf.test file. Real pipelines have one .nf.test per test profile (bacass has 9). For each:
- name = the description string passed to test("...").
- path = repo-relative file path.
- profiles[] = file-level profile "<name>" declaration plus any per-test config overrides.
- params_overrides = the when { params { ... } } block as a key→value map.
- assert_workflow_success = true when an assert workflow.success (or equivalent) clause is present.
- snapshot = structured SnapshotFixture when an assert snapshot(...).match() clause is present, else null. nf-core templates use a near-uniform snapshot pattern; extract:
  - captures[] = logical names of values passed into snapshot(...) (typical set: succeeded_task_count, versions_yml, stable_names, stable_paths).
  - helpers[] = nf-test helper functions invoked (getAllFilesFromDir, removeNextflowVersion, …).
  - ignore_files[] = repo-relative paths passed as ignoreFile: to helpers (e.g. tests/.nftignore).
  - ignore_globs[] = inline ignore: [...] glob list from helpers.
  - snap_path = repo-relative path of the corresponding .nf.test.snap file.
- prose_assertions[] = any other complex/non-snapshot assertions, summarized to prose strings. Empty for snapshot-only tests (the common nf-core case).
Consult component-nextflow-testing when fixtures use a layout outside conf/test.config + nf-test (e.g. legacy test/ scripts, external test harnesses) or when assertions are non-snapshot equality / regex / containsString checks.
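Part of the nf_tests[] extraction can be sketched with regular expressions over the test file text. This is a simplified illustration, not the actual extractor: the function name and regexes are assumptions, the embedded test file is a hypothetical one written in the common nf-core style, and captures[], ignore_globs[], params_overrides, and prose_assertions[] are left out for brevity.

```python
import re

KNOWN_HELPERS = ["getAllFilesFromDir", "removeNextflowVersion"]

# Hypothetical test file in the near-uniform nf-core template style.
NF_TEST = '''
nextflow_pipeline {
    name "Test pipeline"
    script "../main.nf"
    profile "test_dfast"

    test("-profile test_dfast") {
        when {
            params { outdir = "$outputDir" }
        }
        then {
            assertAll(
                { assert workflow.success },
                { assert snapshot(
                    workflow.trace.succeeded().size(),
                    getAllFilesFromDir(params.outdir, ignoreFile: 'tests/.nftignore')
                ).match() }
            )
        }
    }
}
'''


def summarize_nf_test(path: str, text: str) -> dict:
    """Extract a subset of one nf_tests[] entry from a .nf.test file."""
    name = re.search(r'\btest\(\s*"([^"]+)"\s*\)', text)
    snapshot = None
    if re.search(r"snapshot\s*\(.*?\)\s*\.match\(\)", text, re.S):
        snapshot = {
            "helpers": [h for h in KNOWN_HELPERS if re.search(rf"\b{h}\s*\(", text)],
            "ignore_files": re.findall(r"ignoreFile:\s*['\"]([^'\"]+)['\"]", text),
            "snap_path": path + ".snap",
        }
    return {
        "name": name.group(1) if name else None,
        "path": path,
        "profiles": re.findall(r'^\s*profile\s+"([^"]+)"', text, re.M),
        "assert_workflow_success":
            re.search(r"assert\s+workflow\.success\b", text) is not None,
        "snapshot": snapshot,
    }


summary = summarize_nf_test("tests/dfast.nf.test", NF_TEST)
```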
Validate the assembled object before emitting: run foundry validate-summary-nextflow summary-nextflow.json. The subcommand is shipped by @galaxy-foundry/foundry and can be invoked from npm with npx --package @galaxy-foundry/foundry foundry validate-summary-nextflow summary-nextflow.json. The standalone summarize-nextflow bin (from @galaxy-foundry/summarize-nextflow) self-validates by default and is the better gate when the skill is also producing the summary. On schema failure, the cast skill should fail loud — the downstream Molds bind to the schema and will produce worse errors later. additionalProperties: false at every level catches drift early; do not add extra fields to work around a mismatch.
The procedure assumes — and the cast skill must surface in warnings[] when relevant — the following NF realities:
- … workflow { ... } block); emit a single warning and exit with the provenance block only.
- meta.yml may lie. nf-core module meta.yml is hand-authored and can drift from the actual script: IO. When the LLM-inferred IO disagrees with meta.yml, prefer meta.yml and surface the disagreement as a warning rather than overriding it.
- … "tuple(meta, [path,path])" is enough for downstream Molds to reason about; structured channel typing is a research project. Downstream Molds that need structure must parse the string.
- … map { ... } with substantial Groovy logic) may produce edges flagged with low confidence in notes.
- include aliasing is followed one level. include { FASTP as TRIM_PROC } from '...' resolves to FASTP in processes[].name and the alias is recorded in the call graph. Multi-level aliasing chains are not chased.
- … environment.yml files.
- … conf/test.config + nf-test, or on snapshot/assertion patterns the structured fallback does not capture well.