Galaxy Workflow Formats: gxformat2 vs Native (.ga)

A comparison of Galaxy’s two workflow serialization formats — the native JSON format (.ga) and the Format2 YAML format (.gxwf.yml) — covering design philosophy, structural differences, and practical implications.

Design Philosophy

Native (.ga) is the canonical internal format. It mirrors the database schema directly — steps keyed by numeric IDs, connections as step-ID references, tool state as double-encoded JSON strings. It was designed for machine consumption: lossless round-trips through Galaxy’s ORM, every field present, nothing inferred.

Format2 (.gxwf.yml) was designed for human authorship. It borrows CWL conventions — labeled steps, in/source connection syntax, top-level inputs/outputs sections. It optimizes for readability by inferring defaults, using semantic labels instead of numeric IDs, and representing tool state as structured YAML instead of JSON strings. It is always converted to native format before Galaxy processes it.

Structural Comparison

Document Root

Aspect	Native (.ga)	Format2 (.gxwf.yml)
Format marker	`"a_galaxy_workflow": "true"`	`class: GalaxyWorkflow`
Format version	`"format-version": "0.1"` (frozen since inception)	`format-version: v2.0` (optional)
Serialization	JSON	YAML
Workflow name	`"name"`	`label` (also accepts `name`)
Description	`"annotation"`	`doc` (also accepts `annotation`)
Inputs	Encoded as steps with `type: data_input` etc.	Top-level `inputs:` section
Outputs	`workflow_outputs` arrays embedded in steps	Top-level `outputs:` with `outputSource`
Steps	Dict keyed by string integers (`"0"`, `"1"`, …)	Dict keyed by semantic labels

Both formats share identical support for: tags, uuid, license, release, creator, report, readme, help, logo_url, doi, source_metadata.

Inputs

Native: Inputs are steps. A dataset input is a step with type: data_input and relevant configuration in tool_state:

{
  "0": {
    "id": 0,
    "type": "data_input",
    "label": "Input BAM file",
    "name": "Input dataset",
    "tool_state": "{\"optional\": false, \"format\": [\"bam\"], \"tag\": null}",
    "input_connections": {},
    "outputs": [],
    "workflow_outputs": []
  }
}

Format2: Inputs are first-class citizens with a dedicated section:

inputs:
  input_bam:
    type: data
    doc: "Aligned reads in BAM format"
    format: bam
    optional: false

Format2 shorthand: input_bam: data — a single word declares a required dataset input.

Key differences:

Format2 uses type aliases (data instead of data_input, integer instead of parameter_input with parameter_type: integer)
Format2 flattens tool_state fields (format, optional, collection_type) to top-level input properties
Format2 supports array type notation ([string] for multi-valued params)
Native represents every input as a full step object with position, uuid, errors, outputs etc.

Steps

Native: Steps are a dict keyed by string integers. Each step carries the full module state:

{
  "1": {
    "id": 1,
    "type": "tool",
    "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/featurecounts/2.1.1+galaxy0",
    "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/featurecounts/2.1.1+galaxy0",
    "tool_version": "2.1.1+galaxy0",
    "tool_shed_repository": {
      "changeset_revision": "...",
      "name": "featurecounts",
      "owner": "iuc",
      "tool_shed": "toolshed.g2.bx.psu.edu"
    },
    "tool_state": "{\"alignment\": {\"__class__\": \"ConnectedValue\"}, ...}",
    "input_connections": { ... },
    "outputs": [ ... ],
    "workflow_outputs": [ ... ],
    "post_job_actions": { ... },
    "position": {"left": 660, "top": 340},
    "annotation": "",
    "label": "Count features",
    "uuid": "...",
    "when": null,
    "errors": null
  }
}

Format2: Steps are a dict keyed by labels. Only non-default values need to be specified:

steps:
  count_features:
    tool_id: toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/featurecounts/2.1.1+galaxy0
    in:
      alignment: map_reads/mapped
      anno|reference_gene_sets: annotation
    state:
      anno:
        anno_select: history
        gff_feature_type: exon
    out:
      feature_counts:
        rename: "Gene Counts"

Key differences:

Format2 steps are identified by label, native by numeric ID
Format2 infers type: tool as default (explicit type only needed for pause)
Format2 omits content_id (redundant with tool_id), name, errors, outputs, id
Format2 omits position and uuid unless explicitly set
tool_state becomes state — structured YAML, not a JSON string

Connections

This is the most significant syntactic difference between the formats.

Native: Connections are in input_connections, referencing steps by numeric ID:

"input_connections": {
  "library|input_1": {
    "id": 0,
    "output_name": "output"
  },
  "anno|reference_gene_sets": {
    "id": 4,
    "output_name": "output"
  }
}

Format2: Connections use CWL-style in with label-based references:

in:
  library|input_1: raw_reads
  anno|reference_gene_sets: annotation_file

Or the expanded form:

in:
  library|input_1:
    source: raw_reads/output

Differences in detail:

Aspect	Native	Format2
Step references	Numeric ID (`"id": 0`)	Label (`raw_reads`)
Output references	Always explicit (`"output_name": "output"`)	Implicit default or explicit (`step/output`)
Syntax	`input_connections` dict of `{id, output_name}`	`in` dict of strings or `{source}` objects
Multiple sources	Array of connection dicts	`source: [step1/out, step2/out]`
Defaults	Not in connections	`default: value` alongside `source`
Subworkflow routing	`input_subworkflow_step_id`	Implicit from subworkflow input labels

Format2 also supports an alternative connect key (equivalent to in) and $link directives inside state for deeply nested parameter connections:

state:
  nested_section:
    deep_param:
      $link: upstream_step/output

This has no native equivalent — in native format, any connected value is {"__class__": "ConnectedValue"} in tool_state and the actual connection source is in input_connections.

Tool State

Native: A JSON-encoded string inside JSON. Double-encoded with special markers:

"tool_state": "{\"alignment\": {\"__class__\": \"ConnectedValue\"}, \"anno\": {\"anno_select\": \"history\", \"__current_case__\": 2, \"reference_gene_sets\": {\"__class__\": \"ConnectedValue\"}, \"gff_feature_type\": \"exon\"}, \"__page__\": null, \"__rerun_remap_job_id__\": null}"

Format2: Structured YAML under state, only non-connected, non-default values:

state:
  anno:
    anno_select: history
    gff_feature_type: exon

What Format2 omits:

{"__class__": "ConnectedValue"} — replaced by in connections or $link
{"__class__": "RuntimeValue"} — replaced by runtime_inputs list
__current_case__ — inferred from selector values during conversion
__page__, __rerun_remap_job_id__ — internal bookkeeping, always null/0

Workflow Outputs

Native: Declared per-step in workflow_outputs arrays:

{
  "1": {
    "workflow_outputs": [
      {
        "label": "Gene Counts",
        "output_name": "feature_counts",
        "uuid": "9bcb277a-..."
      }
    ]
  }
}

Steps with workflow_outputs: [] have all outputs hidden.

Format2: Top-level outputs section with outputSource:

outputs:
  gene_counts:
    outputSource: count_features/feature_counts

Format2 is more explicit — outputs are declared centrally rather than distributed across steps. The label (gene_counts) is separate from the step’s output name (feature_counts).

Post-Job Actions

Native: Verbose PJA objects keyed by {action_type}{output_name}:

"post_job_actions": {
  "HideDatasetActionout_pairs": {
    "action_type": "HideDatasetAction",
    "output_name": "out_pairs",
    "action_arguments": {}
  },
  "RenameDatasetActionreport": {
    "action_type": "RenameDatasetAction",
    "output_name": "report",
    "action_arguments": {"newname": "Quality Report"}
  }
}

Format2: Concise out dict on the step:

out:
  out_pairs:
    hide: true
  report:
    rename: "Quality Report"

The Format2 out dict merges output-level declarations (PJAs) with the workflow output designation. If a step output appears in both out and the top-level outputs, it’s both a workflow output and has PJAs applied.

Subworkflows

Native: Subworkflow is embedded as a full .ga document inside the step, with input_subworkflow_step_id routing:

{
  "8": {
    "type": "subworkflow",
    "input_connections": {
      "PE fastq input": {
        "id": 0,
        "input_subworkflow_step_id": 0,
        "output_name": "output"
      }
    },
    "subworkflow": {
      "a_galaxy_workflow": "true",
      "format-version": "0.1",
      "steps": { ... }
    }
  }
}

Format2: Three mechanisms, all more concise:

# 1. Inline
steps:
  nested:
    run:
      class: GalaxyWorkflow
      inputs: { inner_input: data }
      steps: { ... }
    in:
      inner_input: upstream/output

# 2. File import
steps:
  nested:
    run:
      "@import": ./subworkflow.gxwf.yml

# 3. $graph reference
$graph:
  - id: helper
    class: GalaxyWorkflow
    ...
  - id: main
    class: GalaxyWorkflow
    steps:
      nested:
        run: "#helper"

Format2 advantages:

@import keeps workflows modular — no giant embedded documents
$graph enables deduplication of shared subworkflows
Connections are by label, not by input_subworkflow_step_id

Conditional Execution

Both formats support when expressions. The syntax is nearly identical:

Native:

{
  "when": "$(inputs.when)",
  "input_connections": {
    "when": {"id": 6, "output_name": "output"}
  }
}

Format2:

when: "$(inputs.run_this != 'skip')"
in:
  run_this: boolean_input

Format2 can use more expressive when expressions directly referencing input names, while native format conventionally routes through a when pseudo-input.

Comments (Editor Annotations)

Native: Full comment objects in a top-level comments array with position, size, color, type, data, and parent-child relationships (child_steps, child_comments).

Format2: No comment support. Comments are visual editor constructs and are lost during Format2 export. They only exist in native format.

Information Preserved and Lost

Native → Format2 (export via `from_galaxy_native()`)

Preserved: Workflow name, inputs, outputs, steps, connections, tool state values, PJAs, subworkflows, when expressions, annotations/doc, metadata (license, creator, tags, etc.), report.

Lost or degraded:

Editor comments (text, markdown, frame, freehand) — no Format2 representation
Step positions — preserved if present but not required
Step UUIDs — preserved if present but not required
__current_case__ markers — inferred, not explicit
errors field on steps — omitted
tool_shed_repository — preserved but often omitted in hand-authored Format2
Double-encoded tool_state internals (__page__, __rerun_remap_job_id__)
Output type declarations on steps (outputs array)
Step name field (the tool’s display name)

Format2 → Native (import via `python_to_workflow()`)

Added/generated:

Numeric step IDs assigned in order
a_galaxy_workflow: "true" and format-version: "0.1" markers
input_connections from in/connect/$link
tool_state (JSON-encoded) from state
__current_case__ computed from conditional selector values
__page__: null, __rerun_remap_job_id__: null
ConnectedValue markers for connected params
RuntimeValue markers for runtime_inputs params
post_job_actions from out specs
workflow_outputs on steps from top-level outputs
Input steps from inputs: section

Lost: Nothing — Format2 is a subset. All Format2 constructs have native equivalents. The conversion is lossless in the Format2→native direction.

Size and Readability Comparison

A representative workflow (RNA-seq, 15 tool steps, 4 inputs, 6 outputs):

Metric	Native (.ga)	Format2 (.gxwf.yml)
Approximate lines	~800-1200	~100-180
Size ratio	1.0x	~0.15x
Step definition overhead	~30-50 lines/step	~5-15 lines/step
Input declaration	~15 lines/input	~1-3 lines/input

The 5-8x size reduction comes from:

No double-encoded JSON strings
No redundant fields (content_id = tool_id, name, errors, outputs)
No numeric IDs (labels serve as keys)
Structured state vs JSON-in-JSON
Concise connection syntax
Top-level inputs/outputs vs embedded-in-steps
YAML’s inherent conciseness vs JSON’s verbosity

Practical Implications

When to Use Format2

Hand-authoring workflows (IWC best practice)
Version control (smaller diffs, meaningful labels)
Documentation and sharing (human-readable)
CI/CD pipelines for workflow testing (Planemo)
Modular workflow composition (@import, $graph)

When to Use Native (.ga)

Galaxy UI export (default format)
Preserving editor layout and comments
Maximum fidelity round-trips through Galaxy
Tooling that expects .ga format
When exact tool_state reproduction matters

Round-Trip Fidelity

Format2 → native → Format2

This round-trip is lossy: editor comments are lost, some formatting preferences may change, field ordering may differ. The gxformat2 test suite verifies that round-tripped workflows still lint clean, but byte-level equivalence is not guaranteed.

native → Format2 → native

This round-trip is also lossy in minor ways: __current_case__ values may differ if the tool isn’t available for validation, errors fields are dropped, step ordering may change within equivalence classes.

For practical purposes, both formats produce identical runtime behavior when imported into Galaxy — the same steps execute with the same parameters and connections. The differences are cosmetic and metadata-level.

Deep Dive: `tool_state` vs `state` in Format2

The Format2 specification accepts two different keys for tool parameter state: state and tool_state. These represent fundamentally different encoding contracts and take entirely different code paths during conversion to native format.

The Two Keys

state — the human-authored approach. Values are structured YAML (dicts, lists, scalars). This is what Format2 was designed for and what human authors should use:

steps:
  count_features:
    tool_id: random_lines1
    state:
      num_lines: 2
      seed_source:
        seed_source_selector: set_seed
        seed: asdf

tool_state — the machine-generated approach. Values are pre-serialized (each top-level value is already a JSON string). This is what gxformat2’s own from_galaxy_native() export function produces:

steps:
  count_features:
    tool_id: random_lines1
    tool_state:
      num_lines: '2'
      seed_source: '{"seed_source_selector": "set_seed", "seed": "asdf"}'

Conversion Logic (Format2 → Native)

The converter in gxformat2/converter.py (lines 400–414) uses a mutually exclusive if-elif:

tool_state = {"__page__": 0}

if "state" in step or runtime_inputs:
    step_state = step.pop("state", {})
    step_state = setup_connected_values(step_state, append_to=connect)
    for key, value in step_state.items():
        tool_state[key] = json.dumps(value)       # <-- per-value encoding
    for runtime_input in runtime_inputs:
        tool_state[runtime_input] = json.dumps({"__class__": "RuntimeValue"})
elif "tool_state" in step:
    tool_state.update(step.get("tool_state"))      # <-- direct merge, no encoding

Then, regardless of which path was taken:

step["tool_state"] = json.dumps(tool_state)        # <-- outer encoding

The state path does two levels of JSON encoding: each top-level value is individually json.dumps()’d, then the entire dict is json.dumps()’d. This produces the double-encoded format Galaxy expects (a JSON string containing JSON strings as values). Before encoding, setup_connected_values() recursively walks the state dict and replaces any $link references with {"__class__": "ConnectedValue"} markers, appending the actual connection targets to the connect dict.

The tool_state path does one level of JSON encoding: values are merged as-is (assumed to already be JSON strings), then the entire dict is json.dumps()’d. This path exists to support round-tripping — the export (from_galaxy_native()) produces tool_state with pre-encoded string values, and re-import consumes them without additional encoding.

Conversion Logic (Native → Format2)

The export in gxformat2/export.py (lines 138–141) does a straightforward decode:

tool_state = json.loads(step["tool_state"])   # parse outer JSON string
tool_state.pop("__page__", None)
tool_state.pop("__rerun_remap_job_id__", None)
step_dict["tool_state"] = tool_state           # store as structured dict

The export always uses the key tool_state, never state. This means the exported Format2 preserves the per-value JSON string encoding from the native format. The values remain as JSON strings (e.g., '{"seed_source_selector": "set_seed"}'), not decoded back to structured YAML. This is a deliberate round-trip design choice — it avoids the need for the export to understand the tool’s parameter structure in order to decide which values are dicts vs. JSON strings of dicts.

Encoding Example

Starting from human-authored Format2 with state:

state:
  anno:
    anno_select: history
    gff_feature_type: exon
  num_lines: 5

After per-value json.dumps (step 1):

tool_state = {
    "__page__": 0,
    "anno": '{"anno_select": "history", "gff_feature_type": "exon"}',
    "num_lines": "5"
}

After outer json.dumps (step 2) — final native tool_state:

"{\"__page__\": 0, \"anno\": \"{\\\"anno_select\\\": \\\"history\\\", \\\"gff_feature_type\\\": \\\"exon\\\"}\", \"num_lines\": \"5\"}"

Each top-level value is a JSON string inside a JSON string — the double-encoding characteristic of native .ga tool state.

No Schema Involvement in Conversion

The Format2 → native conversion is purely algorithmic, operating only on the workflow data itself. It does not:

Load or consult tool XML/YAML definitions
Validate parameter names, types, or values against any tool schema
Use the ImporterGalaxyInterface for state conversion (that interface exists only for importing subworkflows and tools)
Infer __current_case__ for conditionals (unlike what Galaxy does during step injection)
Check whether parameter values are valid for their declared types

The algorithm is simple: iterate top-level keys, json.dumps() each value, wrap in outer json.dumps(). The only “intelligence” is setup_connected_values(), which recursively finds $link markers and replaces them with ConnectedValue — but even this is pattern-matching on the data, not consulting a schema.

On the Galaxy side, the imported native-format workflow goes through ToolModule.recover_state(), which by default also just deserializes the JSON tool_state without validation. Full tool-aware validation (filling defaults, checking parameter values) only happens later during step injection (WorkflowModuleInjector) when the workflow is prepared for execution, and optionally during recover_state with fill_defaults=True.

Implications of Schema-Free Conversion

The absence of schema validation during conversion has several consequences:

Invalid state passes silently — A Format2 workflow with state: {nonexistent_param: 42} converts to native format without error. The bad parameter only surfaces (if at all) when Galaxy tries to execute the step.
No type coercion — If a tool expects an integer but the Format2 state provides a string, the string gets json.dumps()’d and stored. There’s no opportunity to catch the mismatch early.
Round-trip asymmetry — The export uses tool_state (preserving per-value encoding), not state (structured values). A human-authored workflow using state will look different after a native → Format2 round-trip, even though the semantics are identical.
Conditional cases are opaque — Native format includes __current_case__ markers computed from the tool definition. Format2’s state key omits these (they’re inferred later by Galaxy). If the tool isn’t available during import, the case inference may produce different results.
No completeness checking — There’s no way to know during conversion whether required parameters are missing, since “required” is defined by the tool, not the workflow format.

Future Direction: Schema-Validated Tool State

Galaxy already has a sophisticated tool state specification infrastructure (see Tool State Specification component research) with 12 distinct state representations, each backed by dynamically generated Pydantic models. Two representations are directly relevant to workflow state:

workflow_step: Validates unlinked parameter values (data inputs are always None)
workflow_step_linked: Validates parameters where connected inputs are {"__class__": "ConnectedValue"}

These models are generated from tool parameter definitions and enforce strict typing (StrictInt, StrictBool, etc.), range constraints, and structural correctness (conditional discriminators, repeat bounds, section nesting).

Bringing this validation infrastructure into the Format2 conversion pipeline could address the gaps above. A schema-aware conversion could:

Validate early — Reject invalid parameter names, types, or values during state → tool_state conversion rather than deferring to runtime
Infer __current_case__ — Compute conditional case indices from selector values using the tool’s parameter tree, matching what Galaxy does during step injection
Fill defaults — Populate missing parameters with tool-defined defaults, producing more complete native tool_state
Enable a state export path — With knowledge of the tool’s parameter structure, from_galaxy_native() could decode per-value JSON strings back to structured values and emit state instead of tool_state, producing cleaner human-readable output
Power linting — The gxformat2 linter could validate state values against the tool schema, catching errors before the workflow is even imported

The main constraint is that gxformat2 is designed to work without Galaxy’s tool infrastructure — it’s a standalone library. Schema-aware features would need to be optional (activated when tool definitions are available) while preserving the current schema-free path as the default. The ImporterGalaxyInterface or a similar mechanism could be extended to provide tool parameter metadata to the converter when available.

Version History

Native format has used format-version: "0.1" since its inception. New features are added without version bumps — the format version is effectively frozen.

Format2 uses format-version: v2.0. The schema is defined in schema-salad (v19.09 currently) and validated during linting. Format2 has evolved to add features like when expressions, $graph documents, and sample sheet collection types.

File Extension Conventions

Extension	Format	Context
`.ga`	Native JSON	Galaxy UI export, traditional
`.gxwf.yml`	Format2 YAML	Modern convention, IWC standard
`.gxwf.json`	Format2 as JSON	Rare, API responses with `style=format2` + JSON download
`.abstract.cwl`	CWL abstract	Non-executable CWL representation
`.yml` / `.yaml`	Format2 YAML	Also accepted, less specific

Component Workflow Format Differences

Galaxy Workflow Formats: gxformat2 vs Native (.ga)

Design Philosophy

Structural Comparison

Document Root

Inputs

Steps

Connections

Tool State

Workflow Outputs

Post-Job Actions

Subworkflows

Conditional Execution

Comments (Editor Annotations)

Information Preserved and Lost

Native → Format2 (export via `from_galaxy_native()`)

Format2 → Native (import via `python_to_workflow()`)

Size and Readability Comparison

Practical Implications

When to Use Format2

When to Use Native (.ga)

Round-Trip Fidelity

Deep Dive: `tool_state` vs `state` in Format2

The Two Keys

Conversion Logic (Format2 → Native)

Conversion Logic (Native → Format2)

Encoding Example

No Schema Involvement in Conversion

Implications of Schema-Free Conversion

Future Direction: Schema-Validated Tool State

Version History

File Extension Conventions

Incoming References (5)

Galaxy Workflow Formats: gxformat2 vs Native (.ga)

Design Philosophy

Structural Comparison

Document Root

Inputs

Steps

Connections

Tool State

Workflow Outputs

Post-Job Actions

Subworkflows

Conditional Execution

Comments (Editor Annotations)

Information Preserved and Lost

Native → Format2 (export via from_galaxy_native())

Format2 → Native (import via python_to_workflow())

Size and Readability Comparison

Practical Implications

When to Use Format2

When to Use Native (.ga)

Round-Trip Fidelity

Deep Dive: tool_state vs state in Format2

The Two Keys

Conversion Logic (Format2 → Native)

Conversion Logic (Native → Format2)

Encoding Example

No Schema Involvement in Conversion

Implications of Schema-Free Conversion

Future Direction: Schema-Validated Tool State

Version History

File Extension Conventions

Incoming References (5)

Native → Format2 (export via `from_galaxy_native()`)

Format2 → Native (import via `python_to_workflow()`)

Deep Dive: `tool_state` vs `state` in Format2