gxformat2: Parsing and Syntax in Galaxy

How Galaxy parses Format2 (gxformat2) YAML workflows, covering the complete syntax specification, conversion pipeline, and integration points.

Format Detection

Galaxy detects Format2 workflows through two markers:

class: GalaxyWorkflow — primary marker, checked by artifact_class() in lib/galaxy/managers/executables.py
yaml_content — secondary marker, indicates wrapped YAML content from a prior export

Detection entry points:

artifact_class() — inspects the dict for class field; also handles CWL $graph docs by resolving target object by object_id (defaults to "main")
normalize_workflow_format() in WorkflowContentsManager — triggers conversion when class == "GalaxyWorkflow" or "yaml_content" exists
WES service (lib/galaxy/webapps/galaxy/services/wes.py) — separate detection via _determine_workflow_type()
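
The two markers can be checked in a few lines of Python. The following is a minimal sketch rather than Galaxy's actual code: `looks_like_format2` is a hypothetical helper name, and it skips the `$graph` resolution that artifact_class() performs.

```python
def looks_like_format2(as_dict):
    """Return True if a parsed workflow dict carries either Format2 marker.

    Hypothetical helper for illustration only; Galaxy's real checks live in
    artifact_class() and normalize_workflow_format().
    """
    if not isinstance(as_dict, dict):
        return False
    # Primary marker: the document declares itself a GalaxyWorkflow.
    if as_dict.get("class") == "GalaxyWorkflow":
        return True
    # Secondary marker: YAML text wrapped in a dict by a prior export.
    return "yaml_content" in as_dict


native = {"a_galaxy_workflow": "true", "steps": {}}
format2 = {"class": "GalaxyWorkflow", "steps": {}}
wrapped = {"yaml_content": "class: GalaxyWorkflow\nsteps: {}\n"}
```

Only `format2` and `wrapped` would be routed through the conversion pipeline described next; `native` is already in the target format.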

Conversion Pipeline

All Format2 workflows are converted to Galaxy native JSON before the import/construction pipeline processes them. The stack never operates on Format2 directly.

Format2 YAML dict
  │
  ▼
normalize_workflow_format()                    # lib/galaxy/managers/workflows.py:620
  │
  ├── artifact_class() detects "GalaxyWorkflow"
  │
  ├── Format2ConverterGalaxyInterface()        # line 2273, stub implementation
  ├── ImportOptions(deduplicate_subworkflows=True)
  │
  └── python_to_workflow(as_dict, galaxy_interface,
  │       workflow_directory=..., import_options=...)
  │                                            # gxformat2.converter
  ▼
Native Galaxy JSON dict (same as .ga format)
  │
  ▼
RawWorkflowDescription(as_dict, workflow_path)
  │
  ▼
build_workflow_from_raw_description()          # standard import pipeline

Format2ConverterGalaxyInterface is a minimal stub — import_workflow() raises NotImplementedError. Nested Format2 subworkflow imports go through the standard Galaxy recursive build path, not through this interface.

gxformat2 Converter Internals

Entry point: python_to_workflow() in gxformat2/converter.py.

Phase 1 — Graph Preprocessing

_preprocess_graphs() handles $graph multi-workflow documents:

Identifies main workflow as entry point
Registers other workflows by their id for reference via #graph_id
With deduplicate_subworkflows=True, shared subworkflows are stored once in converted["subworkflows"]
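
The splitting step can be pictured as follows. This is a simplified sketch under stated assumptions, not the body of _preprocess_graphs(): the names `split_graph` and `resolve_run` are hypothetical, and conversion and deduplication of the registered workflows are left out.

```python
def split_graph(doc, main_id="main"):
    """Split a $graph document into (main workflow, {id: other workflow})."""
    if "$graph" not in doc:
        # A plain single-workflow document is its own entry point.
        return doc, {}
    main = None
    others = {}
    for entry in doc["$graph"]:
        if entry.get("id") == main_id:
            main = entry
        else:
            others[entry["id"]] = entry
    if main is None:
        raise ValueError(f"no workflow with id {main_id!r} in $graph")
    return main, others


def resolve_run(run, others):
    """Resolve a `run: "#graph_id"` reference against the registry."""
    if isinstance(run, str) and run.startswith("#"):
        return others[run[1:]]
    return run  # inline workflow dict or other form, left untouched


doc = {
    "$graph": [
        {"id": "helper_workflow", "class": "GalaxyWorkflow"},
        {
            "id": "main",
            "class": "GalaxyWorkflow",
            "steps": {"nested": {"run": "#helper_workflow"}},
        },
    ]
}
main, others = split_graph(doc)
helper = resolve_run(main["steps"]["nested"]["run"], others)
```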

Phase 2 — Input Conversion

convert_inputs_to_steps() transforms top-level inputs: into native input step dicts and prepends them to the step list. Each input becomes a step with a proper type (data_input, data_collection_input, or parameter_input).

Phase 3 — Step Transformation

Per-step dispatch via transform_{step_type}():

Transform function	Handles
`transform_tool()`	Tool steps — resolves `state`/`runtime_inputs`/`tool_state`, builds PJAs from `out` specs
`transform_subworkflow()`	Subworkflow steps — recursive conversion via `run_workflow_to_step()`
`transform_data_input()`	Dataset input scaffolding
`transform_data_collection_input()`	Collection input scaffolding
`transform_parameter_input()`	Parameter input scaffolding
`transform_pause()`	Pause/review step scaffolding

ConversionContext tracks label→step_id mappings. SubworkflowConversionContext delegates graph-level state to its parent context.

Phase 4 — Connection Population

_populate_input_connections() converts CWL-style in/connect references to native input_connections with numeric step IDs. Handles:

Simple references: step_name → default output
Qualified references: step_name/output_name
Array sources: [step1/out, step2/out]
$link directives in state — replaced with {"__class__": "ConnectedValue"}
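
The reference forms listed above could be resolved roughly like this. The sketch is illustrative and is not the code of _populate_input_connections(): `parse_source` and `to_input_connections` are hypothetical names, the `default_output` value is an assumed placeholder for a step's default output, and every connection is emitted as a list for simplicity.

```python
def parse_source(reference):
    """Split "step/output" into (step_label, output_name).

    output_name is None for a bare "step" reference, which leaves the
    choice of default output to the caller.
    """
    step_label, sep, output_name = reference.partition("/")
    return step_label, (output_name if sep else None)


def to_input_connections(in_spec, label_to_id, default_output="output"):
    """Build native-style input_connections from a simplified in: mapping.

    Assumes every value is a string or a list of strings.
    """
    connections = {}
    for input_name, sources in in_spec.items():
        if isinstance(sources, str):
            sources = [sources]
        connections[input_name] = [
            {"id": label_to_id[label], "output_name": output or default_output}
            for label, output in map(parse_source, sources)
        ]
    return connections


label_to_id = {"step1": 0, "step2": 1}
result = to_input_connections(
    {"input1": "step1", "input2": ["step1/out", "step2/out"]},
    label_to_id,
)
```

Here `label_to_id` plays the role of the label→step_id mapping that ConversionContext maintains.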

Phase 5 — Output Processing

Maps outputs[].outputSource → workflow_outputs entries on target steps.

Complete Format2 Syntax

Workflow Root

class: GalaxyWorkflow               # REQUIRED — format marker
label: "My Workflow"                # workflow name (also accepts legacy `name`)
doc: "Description of workflow"      # documentation (also accepts `annotation`)
format-version: v2.0                # optional, defaults to v2.0

inputs:   { ... }                   # workflow inputs
outputs:  { ... }                   # workflow outputs
steps:    { ... }                   # workflow steps

# Metadata
tags: [genomics, rna-seq]
uuid: "xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx"
license: "MIT"                      # SPDX identifier
release: "0.1.14"
creator:
  - class: Person
    name: Jane Doe
    identifier: "https://orcid.org/0000-0001-2345-6789"
  - class: Organization
    name: Example Lab

# Report template
report:
  markdown: |
    # Workflow Report
    ```galaxy
    history_dataset_as_image(output="plot")
    ```

Inputs

Inputs can be a dict (keyed by label) or a list (with explicit id fields).

Shorthand Forms

inputs:
  my_dataset: data                  # dataset input
  my_collection: collection         # collection input
  my_text: text                     # text parameter
  my_int: integer                   # integer parameter
  my_float: float                   # float parameter
  my_bool: boolean                  # boolean parameter
  my_color: color                   # color picker
  multi_text: [string]              # array of strings

Expanded Forms

inputs:
  aligned_reads:
    type: data                      # or File (alias)
    doc: "Aligned reads in BAM format"
    optional: false
    format: bam                     # single format
    # format: [bam, sam]            # or list of formats

  paired_fastqs:
    type: collection                # or data_collection
    collection_type: "list:paired"  # list, paired, list:paired, list:list
    optional: false

  num_lines:
    type: integer                   # or int
    default: 5
    optional: true

  seed_text:
    type: text                      # or string
    default: "hello"
    restrictions: ["opt1", "opt2", "opt3"]  # dropdown values

  sample_sheet_input:
    type: data_collection
    collection_type: sample_sheet
    column_definitions:
      - name: treatment
        type: string
        default_value: control
        restrictions: [treatment, control]

Type Aliases

Format2 type	Native type	Notes
`data`, `File`	`data_input`	Dataset
`collection`, `data_collection`	`data_collection_input`	Collection
`string`, `text`	`parameter_input`	`parameter_type: "text"`
`int`, `integer`	`parameter_input`	`parameter_type: "integer"`
`float`	`parameter_input`	`parameter_type: "float"`
`boolean`	`parameter_input`	`parameter_type: "boolean"`
`color`	`parameter_input`	`parameter_type: "color"`
`[type]`	`parameter_input`	`multiple: true` in tool_state
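
The alias table translates directly into a lookup. The sketch below only restates the table in Python; `NATIVE_STEP_TYPES` and `native_input_type` are hypothetical names, not identifiers from gxformat2.

```python
# Format2 type -> (native step type, parameter_type or None)
NATIVE_STEP_TYPES = {
    "data": ("data_input", None),
    "File": ("data_input", None),
    "collection": ("data_collection_input", None),
    "data_collection": ("data_collection_input", None),
    "string": ("parameter_input", "text"),
    "text": ("parameter_input", "text"),
    "int": ("parameter_input", "integer"),
    "integer": ("parameter_input", "integer"),
    "float": ("parameter_input", "float"),
    "boolean": ("parameter_input", "boolean"),
    "color": ("parameter_input", "color"),
}


def native_input_type(format2_type):
    """Return (native step type, parameter_type, multiple) for an input type."""
    multiple = False
    if isinstance(format2_type, list):
        # `[type]` denotes an array of that parameter type.
        (format2_type,) = format2_type
        multiple = True
    step_type, parameter_type = NATIVE_STEP_TYPES[format2_type]
    return step_type, parameter_type, multiple
```

For example, `native_input_type(["string"])` yields a `parameter_input` with `parameter_type` `"text"` and the multiple flag set.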

List Form

inputs:
  - id: the_input
    type: data
  - id: the_param
    type: integer
    default: 5
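
Because both shapes carry the same information, a consumer can normalize them to one form before further processing. A minimal sketch, assuming a hypothetical helper `inputs_as_list` that is not part of gxformat2:

```python
def inputs_as_list(inputs):
    """Normalize dict-form or list-form inputs to a list of dicts with `id`."""
    if isinstance(inputs, list):
        return [dict(entry) for entry in inputs]
    normalized = []
    for label, spec in inputs.items():
        if not isinstance(spec, dict):
            # Shorthand such as `my_dataset: data` or `multi_text: [string]`.
            spec = {"type": spec}
        normalized.append({"id": label, **spec})
    return normalized


dict_form = {"the_input": "data", "the_param": {"type": "integer", "default": 5}}
list_form = [
    {"id": "the_input", "type": "data"},
    {"id": "the_param", "type": "integer", "default": 5},
]
```

With these inputs, `inputs_as_list(dict_form)` and `inputs_as_list(list_form)` produce the same list.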

Outputs

Outputs declare which step outputs are workflow-level results.

outputs:
  trimmed_reads:
    outputSource: cutadapt/out_pairs
  quality_report:
    outputSource: fastqc/html_file

outputSource: step_label/output_name — qualified reference
outputSource: step_label — defaults to first/primary output
Legacy source key also accepted

Steps

Steps can be a dict (keyed by label) or a list (with explicit id fields). Dict form is standard.

Tool Steps

steps:
  trim_reads:
    tool_id: toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/4.4+galaxy0
    tool_version: "4.4+galaxy0"                  # optional
    tool_shed_repository:                         # optional, for Tool Shed provenance
      changeset_revision: 8c0175e03cee
      name: cutadapt
      owner: lparsons
      tool_shed: toolshed.g2.bx.psu.edu
    doc: "Trim adapters and low-quality bases"    # optional
    position: {left: 400, top: 200}               # optional, editor layout

    in:                                            # input connections
      library|input_1: raw_reads
      anno|reference: annotation_file

    state:                                         # tool parameter values
      adapter_options:
        action: trim
      quality_cutoff: 20

    out:                                           # output actions (PJAs)
      out_pairs:
        rename: "Trimmed Reads"
      report:
        hide: true

    runtime_inputs:                                # runtime-settable params
      - quality_cutoff

Input Connections (`in`)

Three syntactic forms, all equivalent:

Simple reference (shorthand):

in:
  input1: upstream_step                    # default output of step
  input2: upstream_step/specific_output    # qualified output

Source dict:

in:
  input1:
    source: upstream_step/output_name
  input2:
    source: upstream_step                  # default output
    default: fallback_value                # default if disconnected

Multiple sources (multi-data inputs):

in:
  input1:
    source:
      - step1/output
      - step2/output

Nested parameter addressing uses pipe notation inherited from Galaxy’s tool form hierarchy:

in:
  "library|input_1": raw_reads             # section.param
  "seed_source|seed": seed_input           # conditional.param
  "queries_0|input2": extra_data           # repeat[0].param

Alternative Connection Syntax: `connect`

The connect key is equivalent to in and is retained for backward compatibility:

steps:
  my_step:
    tool_id: cat1
    connect:
      input1: upstream/output

If a step defines both in and connect, the connections from the two keys are merged during conversion.

Tool State (`state`)

Structured tool parameters. Preferred over tool_state for human readability:

state:
  simple_param: value
  nested:
    conditional_selector: option_a
    nested_param: value
  repeat_param:
    - element1
    - element2
  connected_input:
    $link: upstream_step/output            # connection via $link

$link directives are replaced with {"__class__": "ConnectedValue"} in the native conversion and the connection is recorded in input_connections.
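
The replacement can be sketched as a recursive walk over the state dict. This is an illustration under assumptions, not the gxformat2 implementation: `extract_links` is a hypothetical name, nested paths are joined with the pipe notation described earlier, and lists (repeats) are not traversed.

```python
def extract_links(state, prefix=""):
    """Replace {"$link": ...} values with ConnectedValue markers.

    Returns (new_state, connections), where connections maps the
    pipe-joined parameter path to the referenced source.
    """
    new_state = {}
    connections = {}
    for key, value in state.items():
        path = f"{prefix}|{key}" if prefix else key
        if isinstance(value, dict) and set(value) == {"$link"}:
            new_state[key] = {"__class__": "ConnectedValue"}
            connections[path] = value["$link"]
        elif isinstance(value, dict):
            new_state[key], nested = extract_links(value, path)
            connections.update(nested)
        else:
            new_state[key] = value
    return new_state, connections


state = {
    "simple_param": "value",
    "nested": {
        "conditional_selector": "option_a",
        "connected_input": {"$link": "upstream_step/output"},
    },
}
new_state, connections = extract_links(state)
```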

$link vs in: Both create connections, but $link lives inside state (useful for deeply nested tool params), while in is top-level on the step.

tool_state: Low-level alternative — JSON-encoded strings per parameter, matching native format exactly. Cannot be used simultaneously with state.

Runtime Inputs

steps:
  my_step:
    tool_id: random_lines1
    runtime_inputs:
      - num_lines                          # these become RuntimeValue markers
    state:
      num_lines: 1                         # default value (overridden at runtime)

Params listed in runtime_inputs get {"__class__": "RuntimeValue"} in the native tool_state.
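
As a rough sketch of that substitution (assuming top-level parameter names only, and a hypothetical helper `apply_runtime_inputs` that is not part of gxformat2), each value is JSON-encoded per parameter and the listed parameters are overwritten with the marker:

```python
import json


def apply_runtime_inputs(state, runtime_inputs):
    """Return a tool_state-like dict with RuntimeValue markers applied."""
    # Encode each parameter separately, mirroring the per-parameter
    # JSON strings used by tool_state.
    tool_state = {key: json.dumps(value) for key, value in state.items()}
    for name in runtime_inputs:
        # The configured default is replaced by the runtime marker.
        tool_state[name] = json.dumps({"__class__": "RuntimeValue"})
    return tool_state


tool_state = apply_runtime_inputs({"num_lines": 1, "seed": "abc"}, ["num_lines"])
```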

Post-Job Actions (`out`)

out:
  output_name:
    hide: true                             # HideDatasetAction
    rename: "New Name"                     # RenameDatasetAction
    change_datatype: fasta                 # ChangeDatatypeAction
    delete_intermediate_datasets: true     # DeleteIntermediatesAction
    add_tags: [tag1, tag2]                 # TagDatasetAction
    remove_tags: [old_tag]                 # RemoveTagDatasetAction
    set_columns: [col1, col2, col3]        # ColumnSetAction

PJA mapping:

Format2 key	Native Galaxy PJA
`hide`	`HideDatasetAction`
`rename`	`RenameDatasetAction`
`change_datatype`	`ChangeDatatypeAction`
`delete_intermediate_datasets`	`DeleteIntermediatesAction`
`add_tags`	`TagDatasetAction`
`remove_tags`	`RemoveTagDatasetAction`
`set_columns`	`ColumnSetAction`
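
The mapping table can be applied mechanically to an out: block. The sketch below is illustrative: `PJA_CLASSES` and `out_to_pjas` are hypothetical names, the tuple shape is a simplification of Galaxy's native post-job-action dicts, and skipping a `false` value is an assumption made for the example.

```python
PJA_CLASSES = {
    "hide": "HideDatasetAction",
    "rename": "RenameDatasetAction",
    "change_datatype": "ChangeDatatypeAction",
    "delete_intermediate_datasets": "DeleteIntermediatesAction",
    "add_tags": "TagDatasetAction",
    "remove_tags": "RemoveTagDatasetAction",
    "set_columns": "ColumnSetAction",
}


def out_to_pjas(out_spec):
    """Flatten an out: mapping into (action_type, output_name, argument) tuples."""
    actions = []
    for output_name, options in out_spec.items():
        for key, argument in options.items():
            if key not in PJA_CLASSES:
                raise KeyError(f"unknown output action {key!r}")
            if argument is False:
                # Assumption: e.g. `hide: false` requests no action.
                continue
            actions.append((PJA_CLASSES[key], output_name, argument))
    return actions


actions = out_to_pjas(
    {"out_pairs": {"rename": "Trimmed Reads"}, "report": {"hide": True}}
)
```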

Pause Steps

steps:
  review_qc:
    type: pause
    in:
      input: upstream_step/output

Conditional Execution (`when`)

steps:
  conditional_step:
    tool_id: some_tool
    when: "$(inputs.run_this != 'skip')"   # ECMAScript 5.1 expression
    in:
      run_this: boolean_input
      data_input: upstream/output

The when expression is evaluated at runtime. If it evaluates to false, the step and downstream dependents are skipped.

Subworkflows

Three mechanisms:

Inline Subworkflow

steps:
  nested:
    run:
      class: GalaxyWorkflow
      inputs:
        inner_input: data
      outputs:
        inner_output:
          outputSource: inner_step/out_file1
      steps:
        inner_step:
          tool_id: cat1
          in:
            input1: inner_input
    in:
      inner_input: outer_step/output

External File Import

steps:
  nested:
    run:
      "@import": ./path/to/subworkflow.gxwf.yml
    in:
      input_name: upstream/output

The path is resolved relative to workflow_directory.

`$graph` Multi-Workflow Document

$graph:
  - id: helper_workflow
    class: GalaxyWorkflow
    inputs:
      helper_input: data
    outputs:
      helper_output:
        outputSource: step/output
    steps:
      step:
        tool_id: cat1
        in:
          input1: helper_input

  - id: main
    class: GalaxyWorkflow
    inputs:
      the_input: data
    outputs:
      the_output:
        outputSource: nested/helper_output
    steps:
      nested:
        run: "#helper_workflow"
        in:
          helper_input: the_input

main is the entry point. Other workflows are referenced by #graph_id. With deduplicate_subworkflows=True (Galaxy’s default), shared subworkflows are stored once in the converted output’s subworkflows map.

Report Template

report:
  markdown: |
    # Analysis Report

    ## Inputs
    ```galaxy
    invocation_inputs()
    ```

    ## Results
    ```galaxy
    history_dataset_as_image(output="plot")
    ```

    ```galaxy
    history_dataset_as_table(output="counts")
    ```

Galaxy template directives supported:

invocation_inputs(), invocation_outputs()
history_dataset_display(output="..."), history_dataset_as_image(output="...")
history_dataset_as_table(output="..."), history_dataset_peek(output="...")
history_dataset_info(input="..."), history_dataset_collection_display(input="...")
workflow_display()
job_parameters(step="..."), job_metrics(step="...")
tool_stdout(step="..."), tool_stderr(step="...")

Export Path (Native → Format2)

Galaxy can export native workflows back to Format2 via from_galaxy_native():

# Excerpt from lib/galaxy/managers/workflows.py; f is the open output file
wf_dict = self._workflow_to_dict_export(trans, stored_workflow, workflow=workflow)
wf_dict = from_galaxy_native(wf_dict, None, json_wrapper=True)
f.write(wf_dict["yaml_content"])

Export artifacts produced by store_workflow_artifacts():

File	Format	Method
`name.ga`	Native Galaxy JSON	`json.dump(wf_dict)`
`name.gxwf.yml`	Format2 YAML	`from_galaxy_native()`
`name.abstract.cwl`	CWL v1.2 abstract	`from_dict()` from `gxformat2.abstract`
`name.html`	Cytoscape visualization	`to_cytoscape()` (optional, may fail)

API export styles:

style="export" or style="ga" → native JSON
style="format2" → Format2 dict
style="format2_wrapped_yaml" → {"yaml_content": "<yaml>"}

The reverse conversion (from_galaxy_native() in gxformat2/export.py) iterates native steps and dispatches by module type:

data_input/data_collection_input/parameter_input → Format2 inputs: entries
tool → step dict with tool_id, recovered state (parsed from JSON string)
subworkflow → recursive call, result embedded as run:
pause → step with type: pause

Connections are reversed from native input_connections to CWL-style in: dicts. PJAs are reversed to out: specs.
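
The connection reversal can be sketched as below. This is not the code in gxformat2/export.py: `connections_to_in` is a hypothetical name, and `id_to_label` stands in for whatever step-id-to-label lookup the exporter builds.

```python
def connections_to_in(input_connections, id_to_label):
    """Turn native input_connections into a Format2-style in: mapping."""
    in_spec = {}
    for input_name, conns in input_connections.items():
        if isinstance(conns, dict):
            # A single connection may be stored as a bare dict.
            conns = [conns]
        sources = [
            f"{id_to_label[conn['id']]}/{conn['output_name']}" for conn in conns
        ]
        # Use the shorthand string when there is exactly one source.
        in_spec[input_name] = sources[0] if len(sources) == 1 else sources
    return in_spec


native_connections = {
    "input1": {"id": 0, "output_name": "output"},
    "input2": [
        {"id": 0, "output_name": "out"},
        {"id": 1, "output_name": "out"},
    ],
}
in_spec = connections_to_in(native_connections, {0: "step1", 1: "step2"})
```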

Schema Validation

Format2 workflows are validated against a schema-salad schema (v19.09):

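from gxformat2.lint import lint_format2, lint_ga  # lint_ga is the counterpart for native .ga dicts
from gxformat2.linting import LintContext
from gxformat2.yaml import ordered_load

path = "/path/to/workflow.gxwf.yml"
with open(path) as f:
    workflow_dict = ordered_load(f)  # parse the YAML into a dict

ctx = LintContext()
lint_format2(ctx, workflow_dict, path=path)
ctx.print_messages()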

Checks performed:

Structural correctness via schema-salad validation
Required keys (class, steps, outputs)
Workflow outputs exist and have labels
Step errors (for example, warnings about tools that are not installed)
Report markdown validity
Input default type validation (e.g., string default for integer input = error)
PJA type validation (e.g., hide: "moocow" = error, must be bool)

Exit codes: 0 success, 1 warnings, 2 errors, 3 parse failure.
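
A wrapper script might derive these codes as follows. The sketch is hypothetical: `lint_exit_code` is not a gxformat2 function, and the precedence (errors outrank warnings) is an assumption inferred from the ordering of the codes.

```python
def lint_exit_code(warnings, errors, parsed=True):
    """Map lint results to the exit codes listed above."""
    if not parsed:
        return 3  # the document could not be parsed at all
    if errors:
        return 2  # assumption: errors take precedence over warnings
    if warnings:
        return 1
    return 0


codes = [
    lint_exit_code([], []),
    lint_exit_code(["output has no label"], []),
    lint_exit_code(["output has no label"], ["bad default type"]),
    lint_exit_code([], [], parsed=False),
]
```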

Normalization Layer

gxformat2/normalize.py provides format-agnostic views:

steps_normalized() — all steps (inputs + tool/subworkflow) as a flat normalized list
inputs_normalized() — just input steps
outputs_normalized() — just outputs
NormalizedWorkflow — deep-copies and normalizes: replaces anonymous output references, ensures implicit out dicts

This layer is used by the abstract CWL export and the Cytoscape visualization to work with a uniform step representation regardless of input format.

Legacy Syntax Notes

Legacy	Modern	Notes
`name`	`label`	Workflow name
`outputs` (step-level)	`out`	Step output actions
`source`	`outputSource`	Workflow output reference
`step#output`	`step/output`	Connection syntax (opt-in via `GXFORMAT2_SUPPORT_LEGACY_CONNECTIONS=1`)
List-format inputs with `id`	Dict-format inputs	Both supported
List-format steps with `id`	Dict-format steps	Both supported

File Index

Component	File
Format detection	`lib/galaxy/managers/executables.py` — `artifact_class()`
Conversion orchestration	`lib/galaxy/managers/workflows.py` — `normalize_workflow_format()`
Galaxy interface stub	`lib/galaxy/managers/workflows.py` — `Format2ConverterGalaxyInterface`
Format2→native converter	`gxformat2/converter.py` — `python_to_workflow()`
Native→Format2 exporter	`gxformat2/export.py` — `from_galaxy_native()`
Type system / model	`gxformat2/model.py` — type aliases, connection handling, input conversion
Normalization	`gxformat2/normalize.py` — format-agnostic views
Schema validation	`gxformat2/lint.py` + `gxformat2/schema/v19_09/`
Abstract CWL export	`gxformat2/abstract.py` — `from_dict()`
YAML utilities	`gxformat2/yaml.py` — `ordered_load()`, `ordered_dump()`
Test fixtures (gxformat2)	`gxformat2/tests/example_wfs.py`
Test fixtures (Galaxy)	`lib/galaxy_test/base/workflow_fixtures.py`
WES integration	`lib/galaxy/webapps/galaxy/services/wes.py`
