White Paper: nf-to-galaxy Skill Family
A technical description of the nf-to-galaxy skill bundle as it exists in
galaxy-skills/wf_dev (path: nf-to-galaxy/). This document describes
the skills’ structure, content, methodology, and assumed runtime context.
It does not evaluate fit against any external goal.
1. Shape and Packaging
nf-to-galaxy is a router skill with three sub-skills and a flat collection
of reference markdown files. The directory layout:

    nf-to-galaxy/
      SKILL.md                                    # router
      README.md                                   # navigation index
      nf-process-to-galaxy-tool/SKILL.md          # sub-skill: 1 process -> 1 tool
      nf-subworkflow-to-galaxy-workflow/SKILL.md  # sub-skill: subworkflow -> .ga
      nf-pipeline-to-galaxy-workflow/SKILL.md     # sub-skill: full pipeline -> N .ga + tools
      nextflow-galaxy-terminology.md              # concept mapping reference
      process-to-tool.md                          # detailed process->XML reference
      workflow-to-ga.md                           # detailed .ga authoring reference
      container-mapping.md                        # container -> bioconda lookup
      datatype-mapping.md                         # nf glob pattern -> Galaxy datatype lookup
      check-tool-availability.md                  # tool discovery procedure
      tool-sources.md                             # placement decisions for new tools
      testing-and-validation.md                   # routing stub
      scripts/check_tool.sh                       # simple availability helper
      examples/                                   # CAPHEINE-based worked examples

Sub-skills are entered through the router (SKILL.md) which dispatches by
conversion granularity: single process, single subworkflow, or whole
pipeline. The router is decision-tree only; it carries no conversion logic.
Total content: ~5,100 lines of markdown. Body text is roughly 80 % reference material (mappings, examples, caveats) and 20 % procedural step-lists in the sub-skill files.
2. Conceptual Model
The skill imposes a fixed mapping between Nextflow and Galaxy concepts:
| Nextflow concept | Galaxy concept | Cardinality |
|---|---|---|
| Process | Tool (XML wrapper) | 1 : 1 |
| Module (directory) | Tool directory | structural |
| Subworkflow | Workflow (.ga) or subworkflow | 1 : 1..N |
| Workflow (top-level) | Workflow (.ga) | 1 : 1..N |
| Container declaration | <requirements> package | 1 : 1 |
| path('*.ext') glob | Galaxy datatype string | lookup |
| Channel connection | Step input_connection | 1 : 1 |
| flatten() / scatter | Dataset collection | pattern |
| collect() / gather | Collection-aware tool input | pattern |
| task.ext.args | Advanced section / explicit param | translation |
| when: / conditional | Workflow variants | duplication |
The golden rule stated repeatedly across the skill: one Nextflow process =
one Galaxy tool XML. Subworkflows decompose into a sequence of tool steps with
explicit input_connections. Pipelines are expected to decompose into a
family of smaller .ga workflows (“splitter pattern”) rather than a single
mega-workflow when the source has many flags or optional branches.
3. Conversion Methodology
Each sub-skill enforces the same outer loop:
- Scope clarification with the user: required inputs include workflow name, author, license, annotation, tags. Placeholder values are explicitly forbidden.
- Static analysis of the Nextflow source: enumerate processes, channels, conditionals; produce a DAG.
- Tool discovery before authoring anything: see § 4.
- Plan presentation and approval gate: the user must confirm the plan before any .xml or .ga is emitted.
- Generation: XML for tools, JSON for workflows.
- Validation: Planemo lint/test for tools; Galaxy import + warning triage for workflows.
- Documentation: usage notes, version disclosure, divergence from the Nextflow source.
The pipeline-level sub-skill additionally requires a “best-practice bioinformatics sanity check” gate: if the requested scope omits steps the agent considers standard (sort/index, QC, MultiQC aggregation), the agent must flag and ask before proceeding.
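The sanity-check gate can be pictured as a set difference between a checklist and the requested scope. The sketch below is illustrative only: the checklist holds just the three examples named above, and the function names and question wording are assumptions, not taken from the skill.

```python
# Hypothetical checklist; the entries mirror the three examples named in the text.
STANDARD_STEPS = {"sort/index", "QC", "MultiQC aggregation"}


def missing_standard_steps(requested_steps):
    """Return standard steps absent from the requested scope, sorted for stable output."""
    return sorted(STANDARD_STEPS - set(requested_steps))


def sanity_gate(requested_steps):
    """Return (ok, question). When ok is False, question is what the agent asks the user."""
    missing = missing_standard_steps(requested_steps)
    if not missing:
        return True, ""
    return False, "Requested scope omits: " + ", ".join(missing) + ". Proceed anyway?"


ok, question = sanity_gate(["alignment", "QC"])
print(ok, question)
```

The gate returns a question rather than raising, which matches the described behaviour of flagging and asking instead of refusing.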
4. Tool Discovery Procedure
check-tool-availability.md defines a fixed search order:
- Local clone of galaxyproject/tools-iuc.
- tools-iuc on GitHub.
- Other known repositories: genouest/galaxy-tools, bgruening/galaxytools, ARTbio/tools-artbio, galaxyproject/tools-devteam.
- Galaxy Main ToolShed search.
- Web search.
A distinction is drawn between installed on target instance, available to install (wrapper exists somewhere), and missing (no wrapper at all). Only the third case authorizes new tool authoring. Discovery output is recorded as an explicit tool inventory table (process / tool / status / action) that becomes input to the conversion plan.
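A minimal sketch of the three-way status distinction and the resulting inventory table follows. The column names come from the description above; the action strings, the example process and tool names, and the two lookup sets are hypothetical stand-ins for whatever the discovery step actually finds.

```python
from dataclasses import dataclass


@dataclass
class InventoryRow:
    process: str
    tool: str
    status: str  # "installed", "available", or "missing"
    action: str


def classify(process, tool, installed, wrapped):
    """Classify one process against the target instance and the known wrappers."""
    if tool in installed:
        return InventoryRow(process, tool, "installed", "use as-is")
    if tool in wrapped:
        return InventoryRow(process, tool, "available", "request installation")
    # Only this case authorizes authoring a new tool.
    return InventoryRow(process, tool, "missing", "author new tool")


# Hypothetical discovery results for three hypothetical processes.
installed = {"tool_a"}
wrapped = {"tool_a", "tool_b"}
inventory = [
    classify("PROCESS_A", "tool_a", installed, wrapped),
    classify("PROCESS_B", "tool_b", installed, wrapped),
    classify("PROCESS_C", "tool_c", installed, wrapped),
]

print("| process | tool | status | action |")
print("|---|---|---|---|")
for row in inventory:
    print(f"| {row.process} | {row.tool} | {row.status} | {row.action} |")
```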
A check_tool.sh shell helper composes search URLs for several repositories;
no automated cross-repo querying is shipped beyond that. A separate
galaxy-integration skill (referenced but outside this directory) provides
MCP / BioBlend-based live querying of a Galaxy instance.
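For illustration, URL composition of the kind the helper performs might look like the sketch below. It is written in Python rather than shell, covers only the GitHub repositories named in the search order, and assumes GitHub's per-repository search URL shape; the URLs that check_tool.sh actually emits are not reproduced here and may differ.

```python
from urllib.parse import quote_plus

# Repositories listed in the search order above.
GITHUB_REPOS = [
    "galaxyproject/tools-iuc",
    "genouest/galaxy-tools",
    "bgruening/galaxytools",
    "ARTbio/tools-artbio",
    "galaxyproject/tools-devteam",
]


def compose_search_urls(tool_name):
    """Map each repository to a search URL for tool_name (URL shape is an assumption)."""
    query = quote_plus(tool_name)
    return {repo: f"https://github.com/{repo}/search?q={query}" for repo in GITHUB_REPOS}


for repo, url in compose_search_urls("hyphy fel").items():
    print(repo, url)
```

The function only builds links for a human or agent to follow; like the shipped helper, it performs no querying itself.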
5. Process → Tool Translation
process-to-tool.md defines a per-element transformation:
- Container (biocontainers/<pkg>:<ver>--<hash>) → <requirement type="package" version="<ver>"><pkg></requirement>. A lookup table for ~14 common tools is included; otherwise the package name is parsed from the container string or environment.yml.
- path() inputs → <param type="data" format="...">. Format is selected via datatype-mapping.md, which tabulates ~30 file-extension → Galaxy datatype mappings (sequence, alignment, tree, annotation, tabular, JSON, HyPhy-specific).
- val() inputs → typed <param> (select, boolean, int, float, text).
- Optional inputs → optional="true" plus Cheetah #if guards in the command block.
- Outputs → <data> elements; emit: name becomes the name attribute. versions.yml is dropped (Galaxy tracks versions itself).
- Script → <command detect_errors="exit_code"><![CDATA[…]]></command>, with single-quoted variable interpolations and CDATA wrapping.
- task.ext.args → either an “advanced” section with a free-text param or explicit per-flag parameters.
A worked example (HYPHY_FEL) is included in full.
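Two of the per-element transformations, container → requirement and path() → data param, can be sketched as below. This is an illustration under stated assumptions, not the skill's procedure: the regular expression, the function names, the example container string, and the three-entry extension table are all hypothetical (the table in datatype-mapping.md is larger and is not reproduced here).

```python
import re
import xml.etree.ElementTree as ET

# Matches the tail of a container reference: biocontainers/<pkg>:<ver>--<hash>
CONTAINER_RE = re.compile(r"biocontainers/(?P<pkg>[^:/\s]+):(?P<ver>.+?)--(?P<build>\S+)$")

# Illustrative subset only; NOT the table from datatype-mapping.md.
EXT_TO_DATATYPE = {"fasta": "fasta", "json": "json", "tsv": "tabular"}


def requirement_from_container(container):
    """Build a <requirement> element string from a biocontainers reference."""
    match = CONTAINER_RE.search(container.strip())
    if match is None:
        raise ValueError(f"not a biocontainers reference: {container!r}")
    req = ET.Element("requirement", {"type": "package", "version": match.group("ver")})
    req.text = match.group("pkg")
    return ET.tostring(req, encoding="unicode")


def data_param_from_path(name, glob, optional=False):
    """Build a <param type="data"> string; raises KeyError for an unmapped extension."""
    ext = glob.rsplit(".", 1)[-1].lower()
    attrs = {"name": name, "type": "data", "format": EXT_TO_DATATYPE[ext]}
    if optional:
        attrs["optional"] = "true"
    return ET.tostring(ET.Element("param", attrs), encoding="unicode")


# Hypothetical container string, used only to exercise the parser.
print(requirement_from_container("quay.io/biocontainers/examplepkg:1.2.3--h0000000_0"))
# -> <requirement type="package" version="1.2.3">examplepkg</requirement>
print(data_param_from_path("alignment", "*.fasta", optional=True))
```

Raising on an unmapped extension, rather than defaulting to a generic datatype, keeps the lookup failure visible so the format can be chosen deliberately.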
6. Workflow → .ga Translation
workflow-to-ga.md describes the .ga JSON schema as understood by the
skill:
- Workflow inputs are data_input / data_collection_input steps.
- Tool steps carry tool_id, tool_version, inputs (parameters and connections), and label.
- Connections are by upstream step id + output_name.
- Collections are produced and consumed via discover_datasets and collection-aware operators (__FLATTEN__, __MERGE_COLLECTION__, __FILTER_FAILED_DATASETS__).
- Three authoring routes are presented: Galaxy UI export (recommended), programmatic via galaxy-mcp, and direct hand-written JSON (“error-prone”).
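The step and connection model in the list above can be made concrete with a small skeleton builder. This is a sketch, not a complete .ga file: the key names follow Galaxy's exported workflow JSON as commonly seen, real exports carry further fields (positions, outputs, and so on) that are omitted here, and the tool ID, version, labels, and the input parameter name are placeholders.

```python
import json
import uuid


def new_uuid():
    """Return a fresh UUID4 string for a workflow or step uuid field."""
    return str(uuid.uuid4())


def data_input_step(step_id, label):
    """Workflow input step; its single output is referenced as 'output'."""
    return {
        "id": step_id,
        "type": "data_input",
        "label": label,
        "tool_id": None,
        "tool_state": json.dumps({"optional": False}),
        "input_connections": {},
        "uuid": new_uuid(),
    }


def tool_step(step_id, label, tool_id, tool_version, connections):
    """Tool step; connections maps input name -> (upstream step id, output_name)."""
    return {
        "id": step_id,
        "type": "tool",
        "label": label,
        "tool_id": tool_id,
        "tool_version": tool_version,
        "tool_state": json.dumps({}),
        "input_connections": {
            param: {"id": source_id, "output_name": output_name}
            for param, (source_id, output_name) in connections.items()
        },
        "uuid": new_uuid(),
    }


def build_workflow(name, annotation):
    steps = [
        data_input_step(0, "Input dataset"),
        # Placeholders: resolve tool_id / tool_version against the target instance.
        tool_step(1, "Example tool step", "PLACEHOLDER_TOOL_ID", "PLACEHOLDER_VERSION",
                  {"input": (0, "output")}),
    ]
    return {
        "a_galaxy_workflow": "true",
        "format-version": "0.1",
        "name": name,
        "annotation": annotation,
        "uuid": new_uuid(),
        "steps": {str(step["id"]): step for step in steps},
    }


workflow = build_workflow("example-workflow", "Illustrative skeleton only")
print(json.dumps(workflow, indent=2))
```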
The skill encodes a substantial caveats section that dominates the
.ga-authoring procedure. The most-emphasized failure modes:
- UUID validity: every uuid field must be a real UUID4 and unique across the workflow; descriptive strings cause import errors.
- Tool-ID / owner / version mismatch: a tool may exist under a different ToolShed owner than expected, and +galaxyN suffixes vary per instance. The agent is instructed to resolve against the target instance when accessible and otherwise to mark tool_id / tool_version as placeholders.
- Tool semantics ≠ tool existence: finding a same-named wrapper does not prove it does what the Nextflow step does. Example given: seqkit_split2 exists but splits into chunks, not one-record-per-dataset.
- input_connections parameter-name mismatches: tool XML often exposes inputs through conditional paths (reference_cond|reference_history, not reference). Wrong key → silent dataset-empty warnings on import.
- Conditional selectors in tool_state: connecting an upstream dataset is insufficient if a select-driven branch was not chosen; Galaxy will treat the input as missing.
- Galaxy import-warning interpretation: distinct categories (benign default-fill, real bug, environment mismatch) are listed with recommended responses. The agent is told to ask the user to paste the warning report after import.
These caveats are repeated nearly verbatim in
nf-pipeline-to-galaxy-workflow/SKILL.md, indicating they were considered
load-bearing across all .ga-emitting flows.
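Two of these caveats, UUID validity and input_connections key mismatches, lend themselves to a mechanical pre-import check. The sketch below is not part of the skill, which ships no such validator; the function names are hypothetical, and the set of valid input paths is assumed to have been read from the tool XML by hand.

```python
import uuid


def uuid_problems(workflow):
    """Report uuid fields that are not UUID4 or that repeat within the workflow."""
    fields = []
    if "uuid" in workflow:
        fields.append(("workflow", workflow["uuid"]))
    for key, step in workflow.get("steps", {}).items():
        if "uuid" in step:
            fields.append((f"step {key}", step["uuid"]))

    problems, seen = [], set()
    for where, value in fields:
        try:
            parsed = uuid.UUID(str(value))
        except ValueError:
            problems.append(f"{where}: not a UUID: {value!r}")
            continue
        if parsed.version != 4:
            problems.append(f"{where}: not a version-4 UUID: {value!r}")
        if parsed in seen:
            problems.append(f"{where}: duplicate UUID: {value!r}")
        seen.add(parsed)
    return problems


def unknown_connection_keys(step, valid_input_paths):
    """Return input_connections keys that the tool XML does not expose."""
    return sorted(k for k in step.get("input_connections", {}) if k not in valid_input_paths)


shared = str(uuid.uuid4())
example = {
    "uuid": shared,
    "steps": {
        "0": {"uuid": "align-step"},  # descriptive string, not a UUID
        "1": {"uuid": shared,         # duplicates the workflow-level uuid
              "input_connections": {"reference": {"id": 0, "output_name": "output"}}},
    },
}
print(uuid_problems(example))
print(unknown_connection_keys(example["steps"]["1"], {"reference_cond|reference_history"}))
```

A check like this catches only the structural cases; the semantic caveats (wrapper behaviour, conditional branch selection) still require reading the tool XML.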
7. Validation Surface
The skill assumes external validators:
- Planemo for tool-level XML lint and test execution.
- Galaxy instance import for workflow-level structural validation; the instance’s import warnings are the primary signal.
- Manual semantic comparison of Galaxy outputs against Nextflow outputs for end-to-end validation.
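As a sketch of the tool-level half of this surface, the two Planemo invocations can be wrapped as follows. The wrapper functions are hypothetical; only the planemo lint and planemo test subcommands themselves are assumed, and nothing runs unless planemo is found on the PATH.

```python
import shutil
import subprocess


def planemo_commands(tool_xml):
    """Return the lint and test invocations for one tool XML file."""
    return [["planemo", "lint", tool_xml], ["planemo", "test", tool_xml]]


def run_validators(tool_xml):
    """Run both commands and return their exit codes, or None if planemo is absent."""
    if shutil.which("planemo") is None:
        return None
    return [subprocess.run(cmd, check=False).returncode for cmd in planemo_commands(tool_xml)]
```

Calling run_validators("my_tool.xml") would yield a list of two exit codes, where zero indicates that the corresponding command succeeded.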
There is no in-skill schema-driven validator. Static checks reduce to “read the tool XML and compare strings.” Type-level checks of input/output compatibility between connected steps are not described.
8. Inputs, Outputs, and User-Interaction Pattern
Inputs the agent expects to be available:
- Nextflow source tree (processes, subworkflows, main workflow).
- Optional clone of tools-iuc (improves discovery latency).
- Optional credentials for a target Galaxy instance (enables tool-version resolution and import-warning triage).
- User responses for metadata, scope, and tool-placement decisions.
Outputs produced:
- Galaxy tool XML files (one per missing process).
- Galaxy workflow .ga JSON files (one per subworkflow / workflow variant).
- Optional auxiliary files: macros.xml, test data, README.
Interaction pattern: the skill is explicitly gated. Approval checkpoints exist at scope confirmation, tool placement (tools-iuc vs. custom), and post-plan-pre-implementation. Caveats are surfaced to the user rather than auto-resolved.
9. Dependencies on Other Skills
The skill references but does not contain:
- ../../galaxy-integration/: MCP / BioBlend integration for live tool lookup, workflow import, invocation monitoring.
- ../../tool-dev/: generic Galaxy tool authoring procedure (used when the chosen placement is tools-iuc).
Discovery of tools on a live instance, programmatic workflow import, and
invocation monitoring are delegated to galaxy-integration.
10. Coverage and Known Gaps
Explicitly handled:
- Linear pipelines, parallel branches, scatter / gather via collections.
- Container-based requirement extraction.
- A fixed set of common datatype extensions.
- A small set of common bioconda packages.
- HyPhy-specific output typing.
Explicitly limited or workaround-only:
- Conditionals beyond simple selector branches: the recommended workaround is “publish multiple workflow variants.”
- Dynamic file patterns: collections + discover_datasets, with manual pattern authoring.
- Groovy logic in scripts: pre-processing tools or manual translation; no automated approach.
- task.ext.args configuration layering: flattened to either explicit parameters or a free-text advanced field.
- Meta-map propagation: discarded; element identifiers used instead.
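The task.ext.args flattening noted in the list above amounts to splitting an argument string into flag/value pairs, each of which can then become an explicit parameter. The sketch below shows one way to do that split; the function name and the example flags are hypothetical, and the heuristic treats any token starting with "-" as a flag, so negative numeric values would be misread.

```python
import shlex


def split_ext_args(ext_args):
    """Split an ext.args string into (flag, value) pairs; value is None for bare flags."""
    tokens = shlex.split(ext_args)
    pairs = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("-") and "=" in token:
            # --flag=value form
            flag, value = token.split("=", 1)
            pairs.append((flag, value))
            i += 1
        elif token.startswith("-") and i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            # --flag value form
            pairs.append((token, tokens[i + 1]))
            i += 2
        else:
            # bare flag or positional token
            pairs.append((token, None))
            i += 1
    return pairs


print(split_ext_args("--pvalue 0.1 --verbose --mode=fast"))
```

Anything the split cannot classify confidently is a candidate for the free-text advanced field rather than an explicit parameter.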
Not addressed:
- Sub-workflow nesting beyond a brief mention of Galaxy 21.05+ syntax.
- Programmatic emission of .ga from a structured intermediate representation; the skill assumes either UI export or hand-written JSON.
- Differential / round-trip fidelity (Galaxy → Nextflow direction).
- Type-checked connection compatibility prior to instance import.
11. Example Material
examples/capheine-mapping.md documents an end-to-end conversion of the
CAPHEINE viral-genomics pipeline. Highlights:
- 15/15 tools were already present in tools-iuc; no new tool authoring was required.
- Conversion reduced to workflow assembly (two .ga files: preprocessing, analyses).
- The example is referenced from all three sub-skills as the canonical case study; it represents the “ideal case” where tool discovery succeeds for every process.
tool-checking-example.md walks the discovery procedure for a single tool
across each repository in the search order.
12. Summary
nf-to-galaxy is a prose-and-procedure skill that translates
Nextflow source artifacts into Galaxy artifacts by:
- modeling Nextflow → Galaxy as a fixed concept-mapping table,
- discovering existing wrappers before generating new ones,
- generating tool XML from per-element transformations,
- generating workflow .ga JSON from a DAG with explicit step / connection modeling,
- validating externally via Planemo and Galaxy instance import,
- gating each major step on user approval.
Conversion correctness is enforced by read-the-source discipline (read the actual tool XML, the actual ToolShed owner, the actual installed version) and by post-hoc instance feedback (import warnings). There is no programmatic schema or static type system in the loop; correctness obligations rest on the agent’s adherence to the procedure and on validators run after generation.