Home Research

Iwc Map Over Lifecycle Survey

Survey of IWC map-over lifecycle recipes, with a Nextflow-to-Galaxy crosswalk for collection construction, cleanup, reshape, reduce, and publish phases.

Raw
Revised
2026-05-03
Rev
1
component

IWC map-over lifecycle survey

Source corpus: 120 cleaned gxformat2 workflows under $IWC_FORMAT2/, with structural scans over $IWC_SKELETONS/. Nextflow comparison uses local fixtures under workflow-fixtures/pipelines/ plus existing Foundry notes. This survey is intentionally lifecycle-shaped: it composes existing operation pattern pages rather than rediscovering each collection tool.

Scope is the end-to-end shape around Galaxy collection map-over:

  1. Prepare a collection from user input, manifest rows, existing datasets, or domain fan-out.
  2. Map a domain tool over the collection.
  3. Clean empty or failed per-element outputs.
  4. Synchronize, relabel, reshape, flatten, or unbox collection shape.
  5. Reduce or aggregate mapped outputs into report-ready files, tables, or bundles.
  6. Publish final datasets/collections, sometimes behind conditional gates.

Out of scope:

1. Lifecycle shapes observed

Shape A — manifest to collection to mapped fetch to reshaped outputs

The SRA manifest workflow shows the fullest prepare-map-reshape lifecycle. A tabular manifest is cut down to SRA accessions, split into one collection element per row, mapped through fasterq_dump, then relabeled and reshaped into paired/single outputs.

Evidence:

  • $IWC_FORMAT2/data-fetching/sra-manifest-to-concatenated-fastqs/sra-manifest-to-concatenated-fastqs.gxwf.yml:175 uses split_file_to_collection.
  • $IWC_FORMAT2/data-fetching/sra-manifest-to-concatenated-fastqs/sra-manifest-to-concatenated-fastqs.gxwf.yml:204 maps fasterq_dump over the collection.
  • $IWC_FORMAT2/data-fetching/sra-manifest-to-concatenated-fastqs/sra-manifest-to-concatenated-fastqs.gxwf.yml:238 and :261 relabel paired and single outputs with __RELABEL_FROM_FILE__.
  • $IWC_FORMAT2/data-fetching/sra-manifest-to-concatenated-fastqs/sra-manifest-to-concatenated-fastqs.gxwf.yml:283 and :328 reshape with __APPLY_RULES__.

Lifecycle recipe: manifest/table -> collection -> mapped fetch -> relabel -> reshape -> publish paired/single outputs.

Existing pages cover the phase leaves: tabular-to-collection-by-row, regex-relabel-via-tabular, and collection-build-list-paired-with-apply-rules. What is missing is the end-to-end recipe that tells a translator when these pages compose.

Shape B — mapped search produces sparse results, cleanup syncs siblings, non-empty result gates reports

The MGnify rRNA prediction workflow shows the most important cleanup-and-publish lifecycle. Multiple per-sample steps can emit empty collections. The workflow drops empties, uses identifiers from the cleaned collection to filter sibling inputs, then gates Krona and BIOM export on computed non-empty booleans.

Evidence:

  • $IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:352 and :367 filter empty SSU/LSU BED outputs.
  • $IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:407 and :427 extract collection identifiers.
  • $IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:447 and :470 use __FILTER_FROM_FILE__ to sync sibling collections to the cleaned result set.
  • $IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:1361 starts an embedded Map empty/not empty collection to boolean subworkflow.
  • $IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:1486, :1514, :1552, :1588, and :1626 gate Krona/BIOM publishing steps downstream.

Lifecycle recipe: mapped search -> FILTER_EMPTY -> identifiers -> FILTER_FROM_FILE sibling sync -> non-empty boolean -> conditional report/export.

Existing pages cover the leaves: collection-cleanup-after-mapover-failure, sync-collections-by-identifier, and conditional-gate-on-nonempty-result. The lifecycle-level value is the sequence and the decision point: per-element cleanup is not enough if a sibling collection must stay aligned, and report export should be gated at the whole-result level.

Shape C — domain fan-out, axis swap, relabel, remap

The influenza consensus/subtyping workflow shows a dense collection-axis lifecycle. Domain fan-out creates a nested shape whose useful map-over axis is not the one Galaxy currently exposes. The workflow uses Apply Rules, identifier extraction, text replacement, relabeling, and another Apply Rules pass before downstream consensus steps.

Evidence:

  • $IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:943 uses __APPLY_RULES__ for axis reshaping.
  • $IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:974 and :995 extract collection identifiers.
  • $IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:1016 rewrites identifiers with tp_find_and_replace.
  • $IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:1046 relabels with __RELABEL_FROM_FILE__.
  • $IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:1068 reshapes again with __APPLY_RULES__.
  • $IWC_FORMAT2/virology/influenza-isolates-consensus-and-subtyping/influenza-consensus-and-subtyping.gxwf.yml:1117 runs ivar_consensus on the corrected shape.

Lifecycle recipe: domain fan-out -> reshape axes -> derive labels -> relabel -> reshape again -> mapped consensus.

This is high-density but narrow evidence. It probably belongs as a recipe pattern or section in a lifecycle MOC before being treated as a broad operation.

Shape D — sibling collection order harmonization before mapped paired processing

The pox-virus half-genome workflow shows sibling collection ordering as lifecycle setup. Pool-specific data is derived, Pool2 is sorted by Pool1 identifiers, and both pools then feed mapped paired processing.

Evidence:

  • $IWC_FORMAT2/virology/pox-virus-amplicon/pox-virus-half-genome.gxwf.yml:363 uses split_file_to_collection.
  • $IWC_FORMAT2/virology/pox-virus-amplicon/pox-virus-half-genome.gxwf.yml:391 uses __SORTLIST__.
  • $IWC_FORMAT2/virology/pox-virus-amplicon/pox-virus-half-genome.gxwf.yml:480 maps fastp over the second pool after sorting.
  • $IWC_FORMAT2/virology/pox-virus-amplicon/pox-virus-half-genome.gxwf.yml:1171 uses samtools_merge as a later collection-style reduction.
  • $IWC_FORMAT2/virology/pox-virus-amplicon/pox-virus-half-genome.gxwf.yml:1315 flattens later nested output.

Lifecycle recipe: derive/split sibling collection -> sort by sibling identifiers -> mapped paired processing -> reduce/flatten.

This composes harmonize-by-sortlist-from-identifiers, map-over semantics, and collection-flatten-after-fanout.

Shape E — fan-in bundle, downstream collection consumer, pooled flatten

MAGs generation shows map-over lifecycle in the opposite direction: multiple tool outputs are bundled into a collection, consumed by a downstream refinement tool, then flattened for pooled downstream processing.

Evidence:

  • $IWC_FORMAT2/microbiome/mags-building/MAGs-generation.gxwf.yml:961 builds a collection with __BUILD_LIST__.
  • $IWC_FORMAT2/microbiome/mags-building/MAGs-generation.gxwf.yml:995 consumes that collection with binette.
  • $IWC_FORMAT2/microbiome/mags-building/MAGs-generation.gxwf.yml:1037 flattens nested bins with __FLATTEN__.

Lifecycle recipe: parallel outputs -> BUILD_LIST fan-in -> downstream collection consumer -> FLATTEN pooled result.

Existing coverage: collection-build-named-bundle and collection-flatten-after-fanout. The lifecycle-level point is that collection construction is not just input preparation; it also appears as mid-workflow fan-in before a collection-aware tool.

2. Phase inventory, demoted

The earlier collection survey already counted individual tools. For lifecycle purposes, the inventory is more useful as phase membership:

Lifecycle phaseCommon IWC mechanismExisting page coverage
Prepare from rows or manifestsplit_file_to_collection, __BUILD_LIST__, Apply Rules paired mappingtabular-to-collection-by-row, collection-build-named-bundle, collection-build-list-paired-with-apply-rules
Map domain toolNative Galaxy collection map-over by connecting collection inputgalaxy-collection-patterns, galaxy-collection-semantics
Drop unusable elements__FILTER_EMPTY_DATASETS__, __FILTER_FAILED_DATASETS__collection-cleanup-after-mapover-failure
Sync sibling collectionscollection_element_identifiers, __FILTER_FROM_FILE__, __SORTLIST__, __RELABEL_FROM_FILE__sync-collections-by-identifier, harmonize-by-sortlist-from-identifiers, regex-relabel-via-tabular
Reshape axes__APPLY_RULES__, __FLATTEN__, __EXTRACT_DATASET__collection-swap-nesting-with-apply-rules, collection-flatten-after-fanout, collection-unbox-singleton
Aggregate tabular/report outputcollapse_dataset, collection_column_join, domain reducers, report toolstabular-concatenate-collection-to-table, tabular-pivot-collection-to-wide
Publishworkflow outputs, __BUILD_LIST__, conditional when, singleton extractioncollection-build-named-bundle, conditional-gate-on-nonempty-result, collection-unbox-singleton

3. Nextflow channel-shape implications

The broad Nextflow anatomy note, component-nextflow-pipeline-anatomy, is still a stub. Two narrower notes are usable prior work: nextflow-to-galaxy-channel-shape-mapping and nextflow-operators-to-galaxy-collection-recipes. Local Nextflow fixtures were materialized under workflow-fixtures/pipelines/ for this pass.

Clean mappings

Nextflow patternGalaxy/IWC lifecycle analogueEvidence and caveat
Samplesheet rows become repeated tuple(meta, path) recordsGalaxy workflow input list / list:paired, or manifest-to-collection preparationnf-core/demo samplesheet normalization in $PIPELINES/nf-core__demo/subworkflows/local/utils_nfcore_demo_pipeline/main.nf:102-120; Galaxy analogue in $IWC_FORMAT2/transcriptomics/rnaseq-pe/rnaseq-pe.gxwf.yml:5-11 and SRA manifest split at $IWC_FORMAT2/data-fetching/sra-manifest-to-concatenated-fastqs/sra-manifest-to-concatenated-fastqs.gxwf.yml:175. Rich meta fields still need a representation choice.
tuple(meta, path) map-overNative Galaxy collection map-overnf-core/rnaseq branches normalized inputs in $PIPELINES/nf-core__rnaseq/workflows/rnaseq/main.nf:152-164; Galaxy examples are ordinary collection inputs to mapped tools, e.g. MGnify list input at $IWC_FORMAT2/amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:5-9.
collect(), toList(), collectFile() for report or file materializationAggregate/report tool, collapse_dataset, collection_column_join, or one explicit output filenf-core/demo MultiQC collection in $PIPELINES/nf-core__demo/workflows/demo.nf:106-115; Galaxy tabular aggregation examples include $IWC_FORMAT2/sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:413-433.
mix() of final report/version channelsReport aggregation or output bundlenf-core/taxprofiler mixes optional MultiQC inputs in $PIPELINES/nf-core__taxprofiler/workflows/taxprofiler.nf:350-393; Galaxy bundle analogue is __BUILD_LIST__ in $IWC_FORMAT2/amplicon/qiime2/qiime2-III-VI-downsteam/QIIME2-VI-diversity-metrics-and-estimations.gxwf.yml:337-374.

Mappings that need review

Nextflow patternGalaxy pressure pointLikely Galaxy strategy
branch per elementGalaxy when is step-level, not arbitrary per-element channel routingUse workflow booleans or data-derived booleans for whole-step gates; use explicit filters/classifiers for per-element routing. MGnify’s non-empty boolean gate is the grounded whole-step pattern.
join() / combine(by:) keyed pairingGalaxy map-over is safe only when identifiers and structure already alignUse identifier extraction, filtering, sorting, or relabeling; review remainder, duplicate keys, and unmatched elements.
groupTuple() / transpose()Galaxy collection types are structured but not arbitrary tuple recordsUse nested collections, Apply Rules, __FLATTEN__, or domain reducers only when the grouping axis is real in Galaxy.
Unkeyed combine() / Cartesian expansionWeak IWC support for cross-product collection toolsTreat as user-review trigger unless a concrete domain operation justifies it.
publishDir path rulesGalaxy outputs are labeled datasets/collections, not filesystem layoutsTranslate intent into workflow output labels, collection bundles, or report artifacts; do not preserve path layout literally.

Nextflow-first recipe boundaries

The Nextflow-first layer probably should not replace Galaxy operation patterns. It should route from source idioms to the Galaxy pages that implement them:

This is a strong argument for a source-pattern layer above the Galaxy pattern layer; see open questions.

4. Candidate pattern boundaries

Keep

  1. galaxy-map-over-lifecycle-patterns MOC. Existing pages are phase-specific. A MOC can route authors through prepare -> map -> cleanup -> sync -> reshape -> aggregate -> publish without duplicating operation details. Evidence spans MGnify, SRA manifest, influenza, pox-virus, and MAGs.

  2. manifest-to-mapped-collection-lifecycle recipe. Scope: a tabular manifest/list becomes a collection, a mapped tool runs once per row/key, and outputs are relabeled or reshaped. Strongest evidence: SRA manifest and pox-virus.

  3. cleanup-sync-and-publish-nonempty-results recipe. Scope: mapped outputs may be empty; clean them, use identifiers to keep siblings aligned, and gate reports on non-empty results. Strongest evidence: MGnify rRNA prediction.

  4. reshape-relabel-remap-by-collection-axis recipe. Scope: domain fan-out creates the wrong axis; use Apply Rules and relabeling to expose the downstream map-over axis. Evidence is dense but mostly influenza; keep as a recipe page with confidence caveats, not a broad operation.

  5. Nextflow-source-pattern MOC. Scope: source-facing pages that explain Nextflow channel/operator idioms and list Galaxy pattern pages that implement each idiom. This is not an IWC pattern page in the same sense as collection leaves; it is a translation index over them.

  1. Cleanup variants belong in collection-cleanup-after-mapover-failure.
  2. Sibling sync belongs in sync-collections-by-identifier and harmonize-by-sortlist-from-identifiers.
  3. Collection-to-tabular reduction belongs in tabular-concatenate-collection-to-table and tabular-pivot-collection-to-wide.
  4. Output singleton extraction and bundling belong in collection-unbox-singleton and collection-build-named-bundle.

Drop

  1. Tool-anchored lifecycle pages such as FILTER_EMPTY lifecycle or Apply Rules lifecycle.
  2. Generic map-over-in-galaxy as an operation page. Map-over itself is semantics; corpus-backed reusable value is in the lifecycle recipes around it.
  3. Unkeyed Nextflow combine / Cartesian product as a corpus-backed pattern. Treat it as review-trigger until stronger IWC evidence appears.

5. Open questions

  1. Should galaxy-map-over-lifecycle-patterns be a pattern_kind: moc page under content/patterns/, or remain a research survey until one conversion workflow uses it?
  2. Should lifecycle recipes be standalone recipe pages, or sections inside the lifecycle MOC that link to existing operations?
  3. Is dense single-workflow evidence enough for reshape-relabel-remap-by-collection-axis, or should it stay survey-only until a second workflow attests the full sequence?
  4. Should source-pattern pages be a new note type, a research/component subtype convention, or ordinary pattern pages with source-specific metadata?
  5. If source-pattern pages list implemented_by Galaxy pattern pages, should that field be schema-validated as wiki links?
  6. How prescriptive should Nextflow-first pages be when the mapping is lossy: default recipe plus review triggers, or review-only until verified in a conversion?

Incoming References (12)