
Component Gxformat2 Parsing And Syntax

Format2 YAML detection, conversion pipeline to Galaxy JSON before import processing

Revised: 2026-04-22
Revision: 2
Related Notes: Component - Format2 Workflows (gxformat2), Component - Workflow Format (.ga), Component - Workflow Format Differences, Component - Workflow Import, Dependency - gxformat2

gxformat2: Parsing and Syntax in Galaxy

This note describes how Galaxy parses Format2 (gxformat2) YAML workflows: the full syntax, the conversion pipeline, and the integration points.

Format Detection

Galaxy detects Format2 workflows through two markers:

  1. class: GalaxyWorkflow — primary marker, checked by artifact_class() in lib/galaxy/managers/executables.py
  2. yaml_content — secondary marker, indicates wrapped YAML content from a prior export

Detection entry points:

  • artifact_class() — inspects the dict for class field; also handles CWL $graph docs by resolving target object by object_id (defaults to "main")
  • normalize_workflow_format() in WorkflowContentsManager — triggers conversion when class == "GalaxyWorkflow" or "yaml_content" exists
  • WES service (lib/galaxy/webapps/galaxy/services/wes.py) — separate detection via _determine_workflow_type()
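The two markers above can be sketched as a single predicate. This is a minimal illustration, not Galaxy's code: is_format2 is a hypothetical helper, and the real artifact_class() additionally handles $graph documents.

```python
def is_format2(as_dict: dict) -> bool:
    """Return True if the dict looks like a Format2 workflow (hypothetical helper)."""
    # Primary marker: the explicit class field.
    if as_dict.get("class") == "GalaxyWorkflow":
        return True
    # Secondary marker: YAML text wrapped by a prior export.
    return "yaml_content" in as_dict


native = {"a_galaxy_workflow": "true", "steps": {}}
format2 = {"class": "GalaxyWorkflow", "steps": {}}
wrapped = {"yaml_content": "class: GalaxyWorkflow\nsteps: {}\n"}

print(is_format2(native), is_format2(format2), is_format2(wrapped))
```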

Conversion Pipeline

All Format2 workflows are converted to Galaxy native JSON before the import/construction pipeline processes them. Nothing downstream of that conversion operates on Format2 directly.

Format2 YAML dict
    │
    ▼
normalize_workflow_format()                    # lib/galaxy/managers/workflows.py:620
    │
    ├── artifact_class() detects "GalaxyWorkflow"
    │
    ├── Format2ConverterGalaxyInterface()      # line 2273, stub implementation
    ├── ImportOptions(deduplicate_subworkflows=True)
    │
    └── python_to_workflow(as_dict, galaxy_interface,
    │       workflow_directory=..., import_options=...)
    │                                          # gxformat2.converter
    ▼
Native Galaxy JSON dict (same as .ga format)
    │
    ▼
RawWorkflowDescription(as_dict, workflow_path)
    │
    ▼
build_workflow_from_raw_description()          # standard import pipeline

Format2ConverterGalaxyInterface is a minimal stub — import_workflow() raises NotImplementedError. Nested Format2 subworkflow imports go through the standard Galaxy recursive build path, not through this interface.

gxformat2 Converter Internals

Entry point: python_to_workflow() in gxformat2/converter.py.

Phase 1 — Graph Preprocessing

_preprocess_graphs() handles $graph multi-workflow documents:

  • Identifies main workflow as entry point
  • Registers other workflows by their id for reference via #graph_id
  • With deduplicate_subworkflows=True, shared subworkflows are stored once in converted["subworkflows"]
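A rough sketch of the $graph handling described above, using plain dicts. The function names split_graph and resolve_run are hypothetical and only illustrate the idea of picking main and registering the other workflows by id; they are not the gxformat2 internals.

```python
def split_graph(doc):
    """Return (main_workflow, registry) for a document that may use $graph."""
    if "$graph" not in doc:
        return doc, {}
    registry = {wf["id"]: wf for wf in doc["$graph"]}
    main = registry.pop("main")  # "main" is the entry point
    return main, registry


def resolve_run(run, registry):
    """Resolve a step's run value; "#some_id" points into the registry."""
    if isinstance(run, str) and run.startswith("#"):
        return registry[run[1:]]
    return run  # inline workflow dict (other forms are out of scope here)


doc = {
    "$graph": [
        {"id": "helper_workflow", "class": "GalaxyWorkflow", "steps": {}},
        {
            "id": "main",
            "class": "GalaxyWorkflow",
            "steps": {"nested": {"run": "#helper_workflow"}},
        },
    ]
}
main, registry = split_graph(doc)
helper = resolve_run(main["steps"]["nested"]["run"], registry)
print(helper["id"])
```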

Phase 2 — Input Conversion

convert_inputs_to_steps() transforms top-level inputs: into native input step dicts and prepends them to the step list. Each input becomes a step with a proper type (data_input, data_collection_input, or parameter_input).

Phase 3 — Step Transformation

Per-step dispatch via transform_{step_type}():

Transform function                  Handles
transform_tool()                    Tool steps: resolves state/runtime_inputs/tool_state, builds PJAs from out specs
transform_subworkflow()             Subworkflow steps: recursive conversion via run_workflow_to_step()
transform_data_input()              Dataset input scaffolding
transform_data_collection_input()   Collection input scaffolding
transform_parameter_input()         Parameter input scaffolding
transform_pause()                   Pause/review step scaffolding

ConversionContext tracks label→step_id mappings. SubworkflowConversionContext delegates graph-level state to parent.

Phase 4 — Connection Population

_populate_input_connections() converts CWL-style in/connect references to native input_connections with numeric step IDs. Handles:

  • Simple references: step_name → default output
  • Qualified references: step_name/output_name
  • Array sources: [step1/out, step2/out]
  • $link directives in state — replaced with {"__class__": "ConnectedValue"}
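The reference forms listed above can be resolved roughly as follows. This is a sketch under assumptions: label_to_id stands in for the label-to-step-id mapping that ConversionContext tracks, the helper names are hypothetical, and "output" as the default output name is a placeholder choice rather than a documented rule.

```python
def resolve_reference(ref, label_to_id):
    """Turn "step" or "step/output" into a native connection dict."""
    step_label, _, output_name = ref.partition("/")
    # Assumption: an unqualified reference falls back to an output called "output".
    return {"id": label_to_id[step_label], "output_name": output_name or "output"}


def populate_input_connections(step_in, label_to_id):
    """Convert a Format2 in: mapping to native-style input_connections."""
    connections = {}
    for input_name, value in step_in.items():
        # Accept both the shorthand string and the {"source": ...} dict form.
        source = value.get("source") if isinstance(value, dict) else value
        if source is None:
            continue  # e.g. an entry that only carries a default
        sources = source if isinstance(source, list) else [source]
        connections[input_name] = [resolve_reference(s, label_to_id) for s in sources]
    return connections


label_to_id = {"raw_reads": 0, "step1": 1, "step2": 2}
step_in = {
    "input1": "raw_reads",
    "input2": {"source": ["step1/out", "step2/out"]},
}
connections = populate_input_connections(step_in, label_to_id)
print(connections)
```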

Phase 5 — Output Processing

Maps outputs[].outputSource → workflow_outputs entries on target steps.
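As an illustration of this mapping, the sketch below attaches each declared workflow output to the step it points at. The helper name and the steps_by_label structure are hypothetical, and the default output name "output" is an assumption.

```python
def attach_workflow_outputs(outputs, steps_by_label):
    """Record each workflow output on the step named by its outputSource."""
    for label, spec in outputs.items():
        # The legacy "source" key is accepted as a fallback.
        source = spec.get("outputSource") or spec.get("source")
        step_label, _, output_name = source.partition("/")
        entry = {"label": label, "output_name": output_name or "output"}
        steps_by_label[step_label].setdefault("workflow_outputs", []).append(entry)


steps = {"cutadapt": {}, "fastqc": {}}
outputs = {
    "trimmed_reads": {"outputSource": "cutadapt/out_pairs"},
    "quality_report": {"outputSource": "fastqc/html_file"},
}
attach_workflow_outputs(outputs, steps)
print(steps["cutadapt"]["workflow_outputs"])
```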

Complete Format2 Syntax

Workflow Root

class: GalaxyWorkflow               # REQUIRED — format marker
label: "My Workflow"                # workflow name (also accepts legacy `name`)
doc: "Description of workflow"      # documentation (also accepts `annotation`)
format-version: v2.0                # optional, defaults to v2.0

inputs:   { ... }                   # workflow inputs
outputs:  { ... }                   # workflow outputs
steps:    { ... }                   # workflow steps

# Metadata
tags: [genomics, rna-seq]
uuid: "xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx"
license: "MIT"                      # SPDX identifier
release: "0.1.14"
creator:
  - class: Person
    name: Jane Doe
    identifier: "https://orcid.org/0000-0001-2345-6789"
  - class: Organization
    name: Example Lab

# Report template
report:
  markdown: |
    # Workflow Report
    ```galaxy
    history_dataset_as_image(output="plot")
    ```

Inputs

Inputs can be a dict (keyed by label) or a list (with explicit id fields).

Shorthand Forms

inputs:
  my_dataset: data                  # dataset input
  my_collection: collection         # collection input
  my_text: text                     # text parameter
  my_int: integer                   # integer parameter
  my_float: float                   # float parameter
  my_bool: boolean                  # boolean parameter
  my_color: color                   # color picker
  multi_text: [string]              # array of strings

Expanded Forms

inputs:
  aligned_reads:
    type: data                      # or File (alias)
    doc: "Aligned reads in BAM format"
    optional: false
    format: bam                     # single format
    # format: [bam, sam]            # or list of formats

  paired_fastqs:
    type: collection                # or data_collection
    collection_type: "list:paired"  # list, paired, list:paired, list:list
    optional: false

  num_lines:
    type: integer                   # or int
    default: 5
    optional: true

  seed_text:
    type: text                      # or string
    default: "hello"
    restrictions: ["opt1", "opt2", "opt3"]  # dropdown values

  sample_sheet_input:
    type: data_collection
    collection_type: sample_sheet
    column_definitions:
      - name: treatment
        type: string
        default_value: control
        restrictions: [treatment, control]

Type Aliases

Format2 type                  Native type             Notes
data, File                    data_input              Dataset
collection, data_collection   data_collection_input   Collection
string, text                  parameter_input         parameter_type: "text"
int, integer                  parameter_input         parameter_type: "integer"
float                         parameter_input         parameter_type: "float"
boolean                       parameter_input         parameter_type: "boolean"
color                         parameter_input         parameter_type: "color"
[type]                        parameter_input         multiple: true in tool_state
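The alias table can be read as a lookup from Format2 type to native step type. The sketch below applies it to both shorthand and expanded inputs; input_to_step is a hypothetical helper, and the flat result dict is a simplification (in native JSON, parameter_type and multiple live inside tool_state).

```python
# (native step type, parameter_type) per Format2 type, following the table above.
TYPE_ALIASES = {
    "data": ("data_input", None),
    "File": ("data_input", None),
    "collection": ("data_collection_input", None),
    "data_collection": ("data_collection_input", None),
    "string": ("parameter_input", "text"),
    "text": ("parameter_input", "text"),
    "int": ("parameter_input", "integer"),
    "integer": ("parameter_input", "integer"),
    "float": ("parameter_input", "float"),
    "boolean": ("parameter_input", "boolean"),
    "color": ("parameter_input", "color"),
}


def input_to_step(label, spec):
    """Build a simplified native input step from a Format2 input definition."""
    if not isinstance(spec, dict):
        spec = {"type": spec}  # shorthand form, e.g. my_int: integer
    raw_type = spec["type"]
    multiple = isinstance(raw_type, list)  # [type] means an array parameter
    if multiple:
        raw_type = raw_type[0]
    step_type, parameter_type = TYPE_ALIASES[raw_type]
    step = {"label": label, "type": step_type}
    if parameter_type is not None:
        step["parameter_type"] = parameter_type
    if multiple:
        step["multiple"] = True
    return step


print(input_to_step("my_dataset", "data"))
print(input_to_step("num_lines", {"type": "int", "default": 5}))
print(input_to_step("multi_text", ["string"]))
```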

List Form

inputs:
  - id: the_input
    type: data
  - id: the_param
    type: integer
    default: 5

Outputs

Outputs declare which step outputs are workflow-level results.

outputs:
  trimmed_reads:
    outputSource: cutadapt/out_pairs
  quality_report:
    outputSource: fastqc/html_file

outputSource accepts these forms:

  • outputSource: step_label/output_name (qualified reference)
  • outputSource: step_label (defaults to the first/primary output)
  • The legacy source key is also accepted

Steps

Steps can be a dict (keyed by label) or a list (with explicit id fields). Dict form is standard.

Tool Steps

steps:
  trim_reads:
    tool_id: toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/4.4+galaxy0
    tool_version: "4.4+galaxy0"                  # optional
    tool_shed_repository:                         # optional, for Tool Shed provenance
      changeset_revision: 8c0175e03cee
      name: cutadapt
      owner: lparsons
      tool_shed: toolshed.g2.bx.psu.edu
    doc: "Trim adapters and low-quality bases"    # optional
    position: {left: 400, top: 200}               # optional, editor layout

    in:                                            # input connections
      library|input_1: raw_reads
      anno|reference: annotation_file

    state:                                         # tool parameter values
      adapter_options:
        action: trim
      quality_cutoff: 20

    out:                                           # output actions (PJAs)
      out_pairs:
        rename: "Trimmed Reads"
      report:
        hide: true

    runtime_inputs:                                # runtime-settable params
      - quality_cutoff

Input Connections (in)

Three syntactic forms are accepted:

Simple reference (shorthand):

in:
  input1: upstream_step                    # default output of step
  input2: upstream_step/specific_output    # qualified output

Source dict:

in:
  input1:
    source: upstream_step/output_name
  input2:
    source: upstream_step                  # default output
    default: fallback_value                # default if disconnected

Multiple sources (multi-data inputs):

in:
  input1:
    source:
      - step1/output
      - step2/output

Nested parameter addressing uses pipe notation inherited from Galaxy’s tool form hierarchy:

in:
  "library|input_1": raw_reads             # section.param
  "seed_source|seed": seed_input           # conditional.param
  "queries_0|input2": extra_data           # repeat[0].param

Alternative Connection Syntax: connect

The connect key is equivalent to in but coexists for backward compatibility:

steps:
  my_step:
    tool_id: cat1
    connect:
      input1: upstream/output

Both keys may appear on the same step; the converter merges the connections from in and connect.

Tool State (state)

Structured tool parameters. Preferred over tool_state for human readability:

state:
  simple_param: value
  nested:
    conditional_selector: option_a
    nested_param: value
  repeat_param:
    - element1
    - element2
  connected_input:
    $link: upstream_step/output            # connection via $link

$link directives are replaced with {"__class__": "ConnectedValue"} in the native conversion and the connection is recorded in input_connections.
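One way to picture the $link replacement is a recursive walk over state that swaps each {"$link": ...} for the ConnectedValue marker and records the connection under a pipe-delimited path. This is a sketch, not the gxformat2 implementation: extract_links is a hypothetical name, and the path scheme simply follows the pipe notation shown earlier (section|param, repeat_0|param).

```python
def extract_links(state, prefix="", connections=None):
    """Return (native_state, connections) with $link entries replaced."""
    if connections is None:
        connections = {}
    if isinstance(state, dict):
        if "$link" in state:
            connections[prefix] = state["$link"]
            return {"__class__": "ConnectedValue"}, connections
        converted = {}
        for key, value in state.items():
            path = f"{prefix}|{key}" if prefix else key
            converted[key], _ = extract_links(value, path, connections)
        return converted, connections
    if isinstance(state, list):
        # Repeat elements are addressed as name_0, name_1, ...
        items = [
            extract_links(value, f"{prefix}_{index}", connections)[0]
            for index, value in enumerate(state)
        ]
        return items, connections
    return state, connections


state = {
    "quality_cutoff": 20,
    "library": {"input_1": {"$link": "raw_reads"}},
    "queries": [{"input2": {"$link": "extra/out"}}],
}
native_state, links = extract_links(state)
print(native_state)
print(links)
```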

$link vs in: Both create connections, but $link lives inside state (useful for deeply nested tool params), while in is top-level on the step.

tool_state: Low-level alternative — JSON-encoded strings per parameter, matching native format exactly. Cannot be used simultaneously with state.

Runtime Inputs

steps:
  my_step:
    tool_id: random_lines1
    runtime_inputs:
      - num_lines                          # these become RuntimeValue markers
    state:
      num_lines: 1                         # default value (overridden at runtime)

Params listed in runtime_inputs get {"__class__": "RuntimeValue"} in the native tool_state.
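A minimal sketch of that substitution, assuming only top-level parameter names (nested, pipe-addressed names are left out). apply_runtime_inputs is a hypothetical helper, not a gxformat2 function.

```python
def apply_runtime_inputs(state, runtime_inputs):
    """Return a copy of state with runtime parameters replaced by markers."""
    tool_state = dict(state)
    for name in runtime_inputs:
        tool_state[name] = {"__class__": "RuntimeValue"}
    return tool_state


tool_state = apply_runtime_inputs({"num_lines": 1, "seed": 42}, ["num_lines"])
print(tool_state)
```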

Post-Job Actions (out)

out:
  output_name:
    hide: true                             # HideDatasetAction
    rename: "New Name"                     # RenameDatasetAction
    change_datatype: fasta                 # ChangeDatatypeAction
    delete_intermediate_datasets: true     # DeleteIntermediatesAction
    add_tags: [tag1, tag2]                 # TagDatasetAction
    remove_tags: [old_tag]                 # RemoveTagDatasetAction
    set_columns: [col1, col2, col3]        # ColumnSetAction

PJA mapping:

Format2 key                    Native Galaxy PJA
hide                           HideDatasetAction
rename                         RenameDatasetAction
change_datatype                ChangeDatatypeAction
delete_intermediate_datasets   DeleteIntermediatesAction
add_tags                       TagDatasetAction
remove_tags                    RemoveTagDatasetAction
set_columns                    ColumnSetAction
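The mapping can be applied to an out: block roughly as below. This sketch only derives the action type and the target output; the per-action argument encoding is deliberately omitted, and both the helper name and the action_type + output_name key scheme should be treated as assumptions here.

```python
PJA_TYPES = {
    "hide": "HideDatasetAction",
    "rename": "RenameDatasetAction",
    "change_datatype": "ChangeDatatypeAction",
    "delete_intermediate_datasets": "DeleteIntermediatesAction",
    "add_tags": "TagDatasetAction",
    "remove_tags": "RemoveTagDatasetAction",
    "set_columns": "ColumnSetAction",
}


def out_to_pjas(out):
    """Map a Format2 out: block to simplified post-job action records."""
    actions = {}
    for output_name, spec in out.items():
        for key, value in spec.items():
            if value is False:
                continue  # e.g. hide: false requests no action
            action_type = PJA_TYPES[key]
            # Assumed key scheme: action type followed by the output name.
            actions[f"{action_type}{output_name}"] = {
                "action_type": action_type,
                "output_name": output_name,
            }
    return actions


pjas = out_to_pjas({
    "out_pairs": {"rename": "Trimmed Reads"},
    "report": {"hide": True},
})
print(sorted(pjas))
```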

Pause Steps

steps:
  review_qc:
    type: pause
    in:
      input: upstream_step/output

Conditional Execution (when)

steps:
  conditional_step:
    tool_id: some_tool
    when: "$(inputs.run_this != 'skip')"   # ECMAScript 5.1 expression
    in:
      run_this: boolean_input
      data_input: upstream/output

The when expression is evaluated at runtime. If it evaluates to false, the step and downstream dependents are skipped.

Subworkflows

Three mechanisms:

Inline Subworkflow

steps:
  nested:
    run:
      class: GalaxyWorkflow
      inputs:
        inner_input: data
      outputs:
        inner_output:
          outputSource: inner_step/out_file1
      steps:
        inner_step:
          tool_id: cat1
          in:
            input1: inner_input
    in:
      inner_input: outer_step/output

External File Import

steps:
  nested:
    run:
      "@import": ./path/to/subworkflow.gxwf.yml
    in:
      input_name: upstream/output

Resolved relative to workflow_directory.

$graph Multi-Workflow Document

$graph:
  - id: helper_workflow
    class: GalaxyWorkflow
    inputs:
      helper_input: data
    outputs:
      helper_output:
        outputSource: step/output
    steps:
      step:
        tool_id: cat1
        in:
          input1: helper_input

  - id: main
    class: GalaxyWorkflow
    inputs:
      the_input: data
    outputs:
      the_output:
        outputSource: nested/helper_output
    steps:
      nested:
        run: "#helper_workflow"
        in:
          helper_input: the_input

main is the entry point. Other workflows are referenced by #graph_id. With deduplicate_subworkflows=True (Galaxy’s default), shared subworkflows are stored once in the converted output’s subworkflows map.

Report Template

report:
  markdown: |
    # Analysis Report

    ## Inputs
    ```galaxy
    invocation_inputs()
    ```

    ## Results
    ```galaxy
    history_dataset_as_image(output="plot")
    ```

    ```galaxy
    history_dataset_as_table(output="counts")
    ```

Galaxy template directives supported:

  • invocation_inputs(), invocation_outputs()
  • history_dataset_display(output="..."), history_dataset_as_image(output="...")
  • history_dataset_as_table(output="..."), history_dataset_peek(output="...")
  • history_dataset_info(input="..."), history_dataset_collection_display(input="...")
  • workflow_display()
  • job_parameters(step="..."), job_metrics(step="...")
  • tool_stdout(step="..."), tool_stderr(step="...")

Export Path (Native → Format2)

Galaxy can export native workflows back to Format2 via from_galaxy_native():

# lib/galaxy/managers/workflows.py
wf_dict = self._workflow_to_dict_export(trans, stored_workflow, workflow=workflow)
wf_dict = from_galaxy_native(wf_dict, None, json_wrapper=True)
f.write(wf_dict["yaml_content"])

Export artifacts produced by store_workflow_artifacts():

File                Format                    Method
name.ga             Native Galaxy JSON        json.dump(wf_dict)
name.gxwf.yml       Format2 YAML              from_galaxy_native()
name.abstract.cwl   CWL v1.2 abstract         from_dict() from gxformat2.abstract
name.html           Cytoscape visualization   to_cytoscape() (optional, may fail)

API export styles:

  • style="export" or style="ga" → native JSON
  • style="format2" → Format2 dict
  • style="format2_wrapped_yaml" → {"yaml_content": "<yaml>"}

The reverse conversion (from_galaxy_native() in gxformat2/export.py) iterates native steps and dispatches by module type:

  • data_input/data_collection_input/parameter_input → Format2 inputs: entries
  • tool → step dict with tool_id, recovered state (parsed from JSON string)
  • subworkflow → recursive call, result embedded as run:
  • pause → step with type: pause

Connections are reversed from native input_connections to CWL-style in: dicts. PJAs are reversed to out: specs.
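The connection reversal can be sketched as the inverse of the import-side lookup. The helper name and the id_to_label mapping are hypothetical, and this version always emits qualified step/output references, which may differ from what the real exporter writes for input steps.

```python
def connections_to_in(input_connections, id_to_label):
    """Convert native input_connections back to a Format2-style in: mapping."""
    step_in = {}
    for input_name, conns in input_connections.items():
        if isinstance(conns, dict):
            conns = [conns]  # tolerate a single connection dict
        refs = [f"{id_to_label[c['id']]}/{c['output_name']}" for c in conns]
        # One source becomes a plain string; several become a source list.
        step_in[input_name] = refs[0] if len(refs) == 1 else {"source": refs}
    return step_in


id_to_label = {0: "raw_reads", 1: "step1", 2: "step2"}
input_connections = {
    "input1": [{"id": 0, "output_name": "output"}],
    "input2": [
        {"id": 1, "output_name": "out"},
        {"id": 2, "output_name": "out"},
    ],
}
step_in = connections_to_in(input_connections, id_to_label)
print(step_in)
```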

Schema Validation

Format2 workflows are validated against a schema-salad schema (v19.09):

from gxformat2.lint import lint_format2, lint_ga
from gxformat2.linting import LintContext

ctx = LintContext()
lint_format2(ctx, workflow_dict, path="/path/to/workflow.gxwf.yml")
ctx.print_messages()

Checks performed:

  • Structural correctness via schema-salad validation
  • Required keys (class, steps, outputs)
  • Workflow outputs exist and have labels
  • Step errors (tool not installed warnings)
  • Report markdown validity
  • Input default type validation (e.g., string default for integer input = error)
  • PJA type validation (e.g., hide: "moocow" = error, must be bool)

Exit codes: 0 success, 1 warnings, 2 errors, 3 parse failure.
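To make the default-type check and the exit codes concrete, here is a small standalone sketch. It does not call gxformat2; lint_input_default and exit_code are hypothetical helpers, and accepting an integer default for a float input is an assumption.

```python
EXPECTED_TYPES = {
    "int": int, "integer": int,
    "float": (int, float),  # assumption: integer defaults are fine for floats
    "string": str, "text": str,
    "boolean": bool,
}


def lint_input_default(input_type, default):
    """Return an error message if default does not fit input_type, else None."""
    expected = EXPECTED_TYPES.get(input_type)
    if expected is None or default is None:
        return None
    # bool is a subclass of int in Python, so reject it for numeric inputs.
    wrong_bool = expected is not bool and isinstance(default, bool)
    if wrong_bool or not isinstance(default, expected):
        return f"default {default!r} is not valid for a {input_type} input"
    return None


def exit_code(errors, warnings, parse_failed=False):
    """Map lint results to the exit codes listed above."""
    if parse_failed:
        return 3
    if errors:
        return 2
    if warnings:
        return 1
    return 0


errors = [m for m in [lint_input_default("integer", "five")] if m]
print(errors, exit_code(errors, warnings=[]))
```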

Normalization Layer

gxformat2/normalize.py provides format-agnostic views:

  • steps_normalized() — all steps (inputs + tool/subworkflow) as a flat normalized list
  • inputs_normalized() — just input steps
  • outputs_normalized() — just outputs
  • NormalizedWorkflow — deep-copies and normalizes: replaces anonymous output references, ensures implicit out dicts

This layer is used by the abstract CWL export and the Cytoscape visualization to work with a uniform step representation regardless of input format.

Legacy Syntax Notes

Legacy                       Modern               Notes
name                         label                Workflow name
outputs (step-level)         out                  Step output actions
source                       outputSource         Workflow output reference
step#output                  step/output          Connection syntax (opt-in via GXFORMAT2_SUPPORT_LEGACY_CONNECTIONS=1)
List-format inputs with id   Dict-format inputs   Both supported
List-format steps with id    Dict-format steps    Both supported

File Index

Component                   File                                         Symbol / notes
Format detection            lib/galaxy/managers/executables.py           artifact_class()
Conversion orchestration    lib/galaxy/managers/workflows.py             normalize_workflow_format()
Galaxy interface stub       lib/galaxy/managers/workflows.py             Format2ConverterGalaxyInterface
Format2→native converter    gxformat2/converter.py                       python_to_workflow()
Native→Format2 exporter     gxformat2/export.py                          from_galaxy_native()
Type system / model         gxformat2/model.py                           type aliases, connection handling, input conversion
Normalization               gxformat2/normalize.py                       format-agnostic views
Schema validation           gxformat2/lint.py + gxformat2/schema/v19_09/
Abstract CWL export         gxformat2/abstract.py                        from_dict()
YAML utilities              gxformat2/yaml.py                            ordered_load(), ordered_dump()
Test fixtures (gxformat2)   gxformat2/tests/example_wfs.py
Test fixtures (Galaxy)      lib/galaxy_test/base/workflow_fixtures.py
WES integration             lib/galaxy/webapps/galaxy/services/wes.py
