
IWC Shortcuts and Anti-Patterns

What IWC test suites cut corners on (accepted) vs what's a code smell — existence-only probes, sim_size deltas, image dim checks, label coupling.

2026-05-03 · Rev 2 · component

IWC test-suite shortcuts and anti-patterns

Purpose

When an agent translates or authors a Galaxy workflow for IWC submission, the test suite it writes will be reviewed against IWC’s de facto style — not against an idealized assertion ladder. That style routinely tolerates assertions that look weak in isolation. This note distinguishes the corner-cutting that is normal and accepted in the corpus from the patterns that an agent should treat as smells worth flagging.

This note owns accepted-vs-smell calls. For positive workflow-structure guidance on label stability, checkpoint promotion, and collection identifier design, use galaxy-workflow-testability-design.

Grounding: 115 *-tests.yml files under workflow-fixtures/iwc-src/workflows/ (mirror of galaxyproject/iwc), prior synthesis in galaxy-brain/vault/projects/workflow_state/skills/COMPONENT_GALAXY_WORKFLOW_TESTING.md. Path citations below are relative to iwc-src/workflows/ unless absolute.

TL;DR rules of thumb

  1. Default to tolerant assertions. compare: sim_size + delta:, has_image_* + delta:, has_text substring, has_h5_keys, has_n_lines + delta: are the IWC vocabulary. Strict compare: diff or exact-file: is the exception, used only when the upstream tool is fully deterministic on fixed inputs.
  2. No negative tests. expect_failure: does not appear in the corpus. Don’t author one.
  3. No checksums. md5: / checksum: do not appear on outputs in the corpus. SHA-1 hashes are used on inputs (integrity of remote fetch), never on output assertions.
  4. Preserve labels. Inputs and outputs are referenced by label. Renaming silently breaks tests; treat label changes as breaking changes that require a sibling -tests.yml update.
  5. Big data goes to Zenodo. In-repo test-data/ is for toy fixtures and expected outputs only.
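
Put together, the default vocabulary reads like this hypothetical -tests.yml skeleton (labels, URL, hash, and values are invented for illustration — not taken from any corpus file):

```yaml
- doc: Smoke test on canonical inputs (hypothetical labels throughout)
  job:
    Input Reads:                    # input referenced by its workflow label
      class: File
      location: https://zenodo.org/record/0000000/files/reads.fastqsanger.gz
      filetype: fastqsanger.gz
      hashes:
        - hash_function: SHA-1      # integrity of the remote fetch (inputs only)
          hash_value: da39a3ee5e6b4b0d3255bfef95601890afd80709
  outputs:
    Summary Report:
      asserts:
        has_text:
          text: "Filtered Reads"    # stable substring, not exact content
        has_n_lines:
          n: 120
          delta: 20                 # tolerant line count
    Coverage Plot:
      file: expected/coverage.png
      compare: sim_size
      delta: 5000                   # size band, not byte equality
```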

1. Existence-only content probes

Accepted

The HyPhy test files reduce JSON output validation to “starts with {”:

  • comparative_genomics/hyphy/hyphy-core-tests.yml:32-71 — four output collections (meme_output, prime_output, busted_output, fel_output), each gene element asserts has_text: text: "{" and nothing else.
  • comparative_genomics/hyphy/hyphy-compare-tests.yml, capheine-core-and-compare-tests.yml — same pattern across the rest of the family.

This is accepted because:

  • HyPhy’s MEME/PRIME/BUSTED/FEL/CFEL/RELAX statistical outputs embed run-dependent floats throughout (likelihoods, AIC, posterior probabilities). Substring assertions on numeric fields would fail intermittently.
  • Selecting any specific gene name or category in the JSON would couple the test to internal HyPhy keying that the wrapper has changed across versions.
  • The assertion does verify “the tool ran, produced JSON, did not crash, and the collection structure matches expected element identifiers” — which is genuinely useful given HyPhy’s history of opaque failures.

A corpus-wide scan finds 298 lines matching the existence-style has_text: pattern (grep "has_text:\s*$\|text: \"{\"$" across the *-tests.yml files). It is widespread, not a HyPhy quirk.

Other variants in the same accepted-shortcut family:

  • First-line-of-header probes: amplicon/amplicon-mgnify/.../mgnify-amplicon-pipeline-v5-rrna-prediction-tests.yml:46-49 asserts has_text: "# mapseq v1.2.6 (Jan 20 2023)" — version banner only.
  • has_n_columns schema probes: same file lines 50-52, 67-69 — “the table has 15 columns” / “4 columns” with no row-content check.
  • has_h5_keys structure probes: scRNAseq/scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:27-32, 159-166 — confirms AnnData has obs/louvain, var/highly_variable, uns/rank_genes_groups, etc., but says nothing about cluster labels or values. Also imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml for Merged anndata.
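
The accepted structure-only shapes, sketched with hypothetical labels and values (the element_tests nesting is the corpus shape for collection outputs):

```yaml
  outputs:
    meme_output:                    # collection output
      element_tests:
        gene1:                      # element identifier
          asserts:
            has_text:
              text: "{"             # existence-only: JSON came out, tool ran
    mapseq_table:
      asserts:
        has_text:
          text: "# mapseq v1.2.6"   # version-banner probe on the header line
        has_n_columns:
          n: 15                     # schema probe, no row content
    anndata_out:
      asserts:
        has_h5_keys:
          keys: "obs/louvain,var/highly_variable"   # structure, not values
```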

Smell

Existence probes on deterministic outputs. If the underlying tool is deterministic on fixed inputs (alignment, simple QC stats, well-defined transformations), reducing to has_text: "{" is laziness — the agent should at least pull a known stable substring from the expected JSON.

Heuristic for an agent: if the source workflow’s tool is on the “stochastic / floating-point heavy / version-fragile” list (HyPhy, RepeatModeler, scanpy plots, MCMC samplers, ML inference), existence probes are accepted. Otherwise, prefer has_text against a stable token from a real output.


2. Size-only comparisons (compare: sim_size + delta:)

Accepted

Canonical example: repeatmasking/RepeatMasking-Workflow-tests.yml:11-46 — every output is compare: sim_size with delta: 30000 (30 KB) on small outputs and delta: 90000000 (90 MB!) on the Stockholm seed-alignment file. RepeatModeler’s discovered repeat families differ run-to-run; only output magnitude is reproducible.

grep "compare: sim_size" returns 9 files using this pattern:

  • repeatmasking/RepeatMasking-Workflow-tests.yml, repeatmasking/Repeat-masking-with-RepeatModeler-and-RepeatMasker-tests.yml (RepeatModeler — large delta band)
  • epigenetics/hic-hicup-cooler/chic-fastq-to-cool-hicup-cooler-tests.yml, epigenetics/hic-hicup-cooler/hic-fastq-to-cool-hicup-cooler-tests.yml (HiC matrices)
  • genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences-tests.yml
  • VGP-assembly-v2/kmer-profiling-hifi-trio-VGP2/kmer-profiling-hifi-trio-VGP2-tests.yml
  • genome-assembly/polish-with-long-reads/Assembly-polishing-with-long-reads-tests.yml
  • scRNAseq/baredsc/baredSC-1d-logNorm-tests.yml, scRNAseq/baredsc/baredSC-2d-logNorm-tests.yml (Bayesian sampling — uses delta_frac: here)

Delta-magnitude survey

From grep "delta: [0-9]+" distribution across the corpus:

| Delta band | Count | Typical use |
| --- | --- | --- |
| 4–100 (tiny) | ~20 | image pixel dimensions, line counts |
| 1K–10K | ~40 | small text/tabular outputs, plot PNGs |
| 25K–100K | ~25 | mid-size reports, multi-page plots |
| 200K–1M | ~10 | report HTML, BAM stats |
| 1M–10M | ~10 | medium BAM/BCF/cool files |
| 10M+ (up to 90M) | ~7 | RepeatModeler libraries, large alignments |

The 90 MB delta on RepeatMasking-Workflow-tests.yml:20 is at the extreme. It says “this output is somewhere between zero bytes and 180 MB” — effectively only catches the empty-output failure mode. Accepted because RepeatModeler’s seed alignments are known to vary by tens of MB across runs.

delta_frac:

Used in 3 files (scRNAseq/baredsc/*, genome-assembly/polish-with-long-reads/*). Preferred over absolute delta: when the expected output size scales with input. An agent translating a workflow whose output size depends on input volume should consider delta_frac: over delta:.
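
A minimal delta_frac: sketch (output label, expected file, and fraction are illustrative):

```yaml
  outputs:
    Posterior Samples:
      file: expected/posterior.npz
      compare: sim_size
      delta_frac: 0.1     # pass if observed size is within 10% of expected size
```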

Smell

compare: sim_size on outputs from a deterministic tool. If the tool is bwa/bowtie2/samtools-sort with fixed seeds and pinned versions, there’s no excuse for size-only — compare: diff (with modest lines_diff:) or content assertions are appropriate.

Also a smell: stacking size + has_image_* checks on a PNG without any content assertion when the workflow’s claim is about the data shown (e.g., a clustering plot). The corpus does this routinely (Scanpy file below) — accepted, but a translated workflow that has a more deterministic plotter should do better.


3. Image plot assertions

Accepted

scRNAseq/scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:33-205 is the dense example. ~15 PNG outputs each get the same triple:

UMAP of louvain:
  has_size:
    size: 68416
    delta: 6000
  has_image_width:
    width: 601
    delta: 30
  has_image_height:
    height: 429
    delta: 25

What this catches:

  • Plot was rendered (non-zero size).
  • Render dimensions are stable (matplotlib defaults didn’t drift, theme didn’t change).
  • Approximate file size hasn’t shifted by an order of magnitude (no catastrophic content change like all-white or all-noise).

What this misses: cluster assignments wrong, axes mislabeled, points in wrong positions, colors swapped, the wrong subset plotted, NaN handling regression. Two visually different UMAPs can have identical width/height/size-within-10%.

Other observed image-assertion users (grep "has_image"):

  • imaging/tissue-microarray-analysis/tissue-microarray-analysis/tissue-micro-array-analysis-tests.yml
  • imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml

The TMA tests use a friendlier shorthand: has_size: size: 181K, delta: 50K (human-readable units).
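
That shorthand, in assertion form (label hypothetical, values loosely modeled on the TMA style):

```yaml
    Spatial Scatterplot Montage:
      asserts:
        has_size:
          size: 181K      # human-readable units accepted
          delta: 50K
```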

Smell

Asserting has_image_width/has_image_height with zero delta on a tool that re-encodes (PNG round-trips through matplotlib) is brittle. The corpus uses 5–10% deltas; an agent emitting delta: 0 should be flagged unless the renderer is byte-stable.

Note: has_image_channels and has_image_center_of_mass are documented (Galaxy XSD) but not observed in the sampled corpus. An agent with a deterministic mask/segmentation output could use has_image_center_of_mass to actually verify spatial correctness — this would be an upgrade over the current corpus norm, not a smell.


4. Happy-path-only culture (expect_failure:)

grep -r "expect_failure" over all 115 tests files returns zero hits. The IWC corpus has no negative tests. Period.

This is a structural property of IWC: the workflows are published artifacts intended to succeed on canonical inputs. Adversarial / error-path testing happens in tool wrappers, not in workflow tests.

Implication for an agent: Do not author expect_failure: cases when translating a workflow. If the source pipeline (e.g., nf-core) had a “fail on bad reference” test, drop it — it doesn’t belong in IWC. If the validation logic is important, it should be in a wrapper-level tool test, not a workflow test.


5. md5: / checksum: rarity

grep -r "md5:\|checksum:" over *-tests.yml: zero hits.

hashes: blocks with SHA-1 are pervasive — but exclusively on inputs (hash_function: SHA-1 paired with a remote location:), to guard against silent corruption of the fetched fixture. Output assertions never use them.
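
The input-side shape, with placeholder label, URL, and digest:

```yaml
  job:
    Reference Genome:
      class: File
      location: https://zenodo.org/record/0000000/files/reference.fa.gz
      hashes:
        - hash_function: SHA-1
          hash_value: 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
```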

Accepted. The reason is empirical: outputs of real bioinformatics tools (BAM with PG headers and timestamps, VCFs with command-line provenance, JSON with run dates) are almost never byte-stable across runs. A checksum: would fail intermittently.

Smell: an agent emitting checksum: or md5: on a workflow output. Even for “fully deterministic” tools, embedded provenance breaks checksums. Use compare: diff + lines_diff: instead, or content assertions.


6. Output label coupling

Test files key outputs by workflow label, with spaces, capitals, punctuation preserved verbatim. Examples:

  • scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:27 — Anndata with Celltype Annotation: (spaces, mixed case)
  • scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:33 — UMAP of louvain and top ranked genes:
  • imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml — Merged anndata:, Spatial Scatterplot Montage:
  • consensus-from-variation-tests.yml:30 — multisample_consensus_fasta: (snake-case style)

Both styles coexist. Snake-case (older / SARS-CoV-2 family) and natural-language-with-spaces (newer / scanpy / TMA) are equally valid. Reviewers do not enforce a single convention.
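
Keyed verbatim in outputs:, the two styles look like this (labels borrowed from the examples above; assertion bodies hypothetical):

```yaml
  outputs:
    multisample_consensus_fasta:          # snake-case style
      asserts:
        has_text:
          text: ">"                       # hypothetical FASTA header probe
    Anndata with Celltype Annotation:     # spaces and mixed case, unquoted YAML key
      asserts:
        has_h5_keys:
          keys: "obs/cell_type"           # hypothetical key
```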

Coupling consequences

Renaming an output label in the .ga without updating the sibling -tests.yml is a silent breakage:

  • Test references the old label → key not present in invocation outputs → assertion mismatch surfaces as an opaque “output not found” error in planemo.
  • planemo workflow_lint --iwc enforces that workflow outputs are labeled, not that test labels match.

Discipline observed: every output a test asserts on is a labeled workflow output. The corpus does not assert on positional / unlabeled outputs.

Smell: a translated workflow with unlabeled outputs that later need test coverage. Agent should label every output it intends to assert on, before writing assertions.

Positive design guidance now lives in galaxy-workflow-testability-design: pick stable labels before test authoring and treat renames as test-breaking API changes.


7. Intermediate-step output gap

-tests.yml can only assert on workflow-level outputs (entries in the .ga’s top-level outputs). Intermediate step results are inaccessible to assertions.

Observed workaround across the corpus: promote the intermediate to a workflow output. This is visible indirectly — many workflows expose what would naturally be intermediates as labeled outputs solely for testability:

  • scanpy-clustering exposes Initial Anndata General Info, Anndata with raw attribute, Plot highly variable, Elbow plot of PCs and variance — these are mid-pipeline checkpoints surfaced specifically to be assertable. Compare counts: 22 outputs asserted vs the 7-or-so “user-meaningful” final artifacts.
  • MAGs-generation-tests.yml exposes a Full MultiQC Report even though MultiQC is logically intermediate to MAG annotation.

Cost: “test-only” outputs clutter the workflow’s user-facing output list. Reviewers tolerate this in exchange for testability.

Accepted shortcut: promoting an intermediate to a workflow output for test purposes. Not a smell.

Smell: asserting on a step output via some side-channel (e.g., relying on Galaxy collection ordering, indexing into tool_state). The corpus does not do this and an agent should not invent it.

Positive design guidance now lives in galaxy-workflow-testability-design: promote assertable checkpoints deliberately, especially when final reports or plots can only support weak smoke tests.


8. Remote-data fragility

Pattern

Overwhelming preference for Zenodo as the input store. Every remote location: is paired with a SHA-1 hashes: block. Examples already cited in nearly every snippet above.

Non-Zenodo remote sources observed (from grep "ftp.sra.ebi\|ftp://\|figshare\|github.com"):

  • virology/pox-virus-amplicon/pox-virus-half-genome-tests.yml:27,34,48,55 — ftp://ftp.sra.ebi.ac.uk/... SRA fastqs
  • amplicon/amplicon-mgnify/.../mgnify-amplicon-pipeline-v5-rrna-prediction-tests.yml:21 — ftp://ftp.ebi.ac.uk/... reference DB
  • data-fetching/parallel-accession-download/parallel-accession-download-tests.yml, VGP-assembly-v2/Plot-Nx-Size/... — accession-driven
  • variant-calling/generic-variant-calling-wgs-pe/Generic-variation-analysis-on-WGS-PE-data-tests.yml — github raw URLs

The fragility

  • Zenodo is a single point of failure across CI. A Zenodo outage breaks every IWC PR concurrently. SHA-1 hashes guard against silent corruption but provide no mitigation for outages or HTTP 503s.
  • EBI/SRA FTP is even less reliable — observed flake-prone in the broader Galaxy CI history.
  • No retry / backoff configured at the test-format level; planemo-ci-action’s defaults handle transient failures only via re-running the chunk job manually.

Accepted

This is just life in the IWC. Don’t try to “fix” it in a translated workflow by inlining large data — reviewers will push back (see §10).

Smell

Inputs hosted on a contributor’s personal endpoint, S3 bucket, or Dropbox. Reviewers ask for migration to Zenodo before merge.


9. compare: diff on timestamped outputs

compare: diff usage from grep:

  • sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation-tests.yml:32 — compare: diff, lines_diff: 6 on annotated VCF
  • sars-cov-2-variant-calling/sars-cov-2-pe-illumina-wgs-variant-calling/pe-wgs-variation-tests.yml:35 — same
  • imaging/fluorescence-nuclei-segmentation-and-counting/segmentation-and-counting-tests.yml:16 — lines_diff: 0
  • variant-calling/variation-reporting/Generic-variation-analysis-reporting-tests.yml:21,25,29,33 — lines_diff: 6
  • variant-calling/generic-variant-calling-wgs-pe/Generic-variation-analysis-on-WGS-PE-data-tests.yml:39,45,55 — lines_diff: 6

The lines_diff: 6 constant is suspicious — 6 lines is the typical VCF header preamble that embeds ##fileDate=... and ##source=.... The use is defensible: tolerance window matches the known mutable header lines, content lines are diffed strictly.

Smell

  • compare: diff with lines_diff: 0 on a file that contains any timestamp, command-line capture, version banner, or random tie-break (hash-ordered dictionaries in Python output, etc.). The single observed lines_diff: 0 case (segmentation-and-counting-tests.yml:16) appears to be on a numeric tabular output where it’s defensible — verify content type before flagging.
  • compare: diff on a BAM. BAM headers include @PG lines with full command lines and Galaxy job IDs. Use has_size + content extracts via has_archive_member or samtools view-piped XML asserts — not byte-level diff.

Recommended replacement when timestamps appear:

  • For VCF: compare: diff, lines_diff: <header-line-count> (corpus convention is 6).
  • For tabular reports: has_text against stable column headers + has_n_columns.
  • For HTML reports (MultiQC etc.): has_text substring on stable section names — example short-read-qc-trimming/short-read-quality-control-and-trimming-tests.yml:25-28 asserts "Filtered Reads" substring on the MultiQC HTML report rather than diffing.
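
Those replacements as hypothetical assertion blocks (labels, counts, and substrings are illustrative):

```yaml
  outputs:
    Annotated VCF:
      file: expected/annotated.vcf
      compare: diff
      lines_diff: 6           # tolerate the mutable ##fileDate / ##source header lines
    Stats Table:
      asserts:
        has_text:
          text: "mean_depth"  # stable column-header token
        has_n_columns:
          n: 4
    MultiQC Report:
      asserts:
        has_text:
          text: "Filtered Reads"   # stable section name instead of a diff
```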

10. Reviewer-feedback recurring asks

Synthesized from the COMPONENT_GALAXY_WORKFLOW_TESTING.md analysis (section 9 and the “Common PR-review feedback” subsection), plus structural observation of what every accepted workflow has and rejected drafts apparently lack:

| Reviewer ask | Where it bites | Source |
| --- | --- | --- |
| Creator identifier must be a full ORCID URI (https://orcid.org/...), not a bare ID | .ga frontmatter creator: block; most common lint failure | planemo PR #1458; consensus-from-variation.ga:4-10 shows the conformant shape |
| Move large inputs to Zenodo: inline path: test-data/big.bam for >1 MB inputs gets pushback | -tests.yml job inputs | iwc/workflows/README.md (per prior analysis) |
| Bump release and add a CHANGELOG.md entry in the same PR | .ga release: and sibling CHANGELOG.md; enforced by bump_version.py | iwc/workflows/README.md:217-247 |
| Generate tests via planemo workflow_test_init --from_invocation <id>, not by hand; reviewers push back on hand-authored job blocks | Any new test contribution | help.galaxyproject.org thread 13903 |
| Don’t use compare: diff on outputs that embed timestamps; switch to has_text / has_n_lines with delta: | See §9 | Recurring review comment |
| Add labeled outputs for any output you assert on; unlabeled outputs are caught by planemo workflow_lint --iwc | .ga outputs | §6 above |
| Hashes on every remote location: SHA-1 block paired with the URL; reviewers spot-check | -tests.yml job inputs | Universal in the corpus; missing hashes get flagged |

Smell to flag for an agent submission

  • Bare ORCID ID (0000-0002-...) in creator: instead of full URL.
  • Test job referencing >1 MB local fixture instead of a Zenodo URL.
  • PR that bumps a workflow without CHANGELOG / release: bump.
  • Hand-authored -tests.yml that reads “too clean” — reviewers know --from_invocation output has a recognizable fingerprint.

Summary cheatsheet for the implement-galaxy-workflow-test mold

Use these freely (accepted shortcuts):

  • has_text: text: "{" for stochastic JSON outputs.
  • compare: sim_size, delta: for non-deterministic file outputs; pick delta from §2 distribution by tool family.
  • has_image_width/height/has_size triple with 5–10% delta for matplotlib plots.
  • has_h5_keys for AnnData/HDF5 — assert structure not values.
  • Promoting intermediates to workflow outputs to make them assertable.
  • Labels with spaces, mixed case, punctuation as the output key.
  • SHA-1 hashes on every input location:.

Avoid (smells reviewers or future-you will catch):

  • expect_failure: — not an IWC pattern.
  • md5: / checksum: on outputs.
  • compare: diff, lines_diff: 0 on anything containing timestamps, BAM @PG lines, or Python dict ordering.
  • has_image_* with zero delta.
  • Existence-only has_text: "{" on outputs from a deterministic tool.
  • Asserting on positional/unlabeled outputs.
  • Inlining >1 MB binary fixtures in test-data/.
  • Bare ORCID identifier in .ga frontmatter.

Review-time checklist before submission:

  1. Every output asserted on has a label in the .ga.
  2. Every remote location: has a hashes: SHA-1 block.
  3. Inputs >1 MB live on Zenodo, not in test-data/.
  4. release: bumped and CHANGELOG.md updated if .ga changed.
  5. creator.identifier: is a full https://orcid.org/... URL.
  6. Test was generated by planemo workflow_test_init --from_invocation, not hand-written.

Sources

  • Positive workflow-structure guidance: galaxy-workflow-testability-design.
  • Prior synthesis: /Users/jxc755/projects/repositories/galaxy-brain/vault/projects/workflow_state/skills/COMPONENT_GALAXY_WORKFLOW_TESTING.md (sections 2c, 2d, 2e, 9).
  • Corpus root: /Users/jxc755/projects/repositories/workflow-fixtures/iwc-src/workflows/ (115 *-tests.yml files across 22 categories).
  • Specific files cited:
    • comparative_genomics/hyphy/hyphy-core-tests.yml, hyphy-compare-tests.yml, capheine-core-and-compare-tests.yml
    • repeatmasking/RepeatMasking-Workflow-tests.yml, Repeat-masking-with-RepeatModeler-and-RepeatMasker-tests.yml
    • scRNAseq/scanpy-clustering/Preprocessing-and-Clustering-of-single-cell-RNA-seq-data-with-Scanpy-tests.yml
    • read-preprocessing/short-read-qc-trimming/short-read-quality-control-and-trimming-tests.yml
    • virology/pox-virus-amplicon/pox-virus-half-genome-tests.yml
    • metabolomics/lcms-preprocessing/Mass_spectrometry__LC-MS_preprocessing_with_XCMS-tests.yml
    • sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation-tests.yml
    • sars-cov-2-variant-calling/sars-cov-2-pe-illumina-wgs-variant-calling/pe-wgs-variation-tests.yml
    • sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation-tests.yml
    • variant-calling/variation-reporting/Generic-variation-analysis-reporting-tests.yml
    • variant-calling/generic-variant-calling-wgs-pe/Generic-variation-analysis-on-WGS-PE-data-tests.yml
    • imaging/fluorescence-nuclei-segmentation-and-counting/segmentation-and-counting-tests.yml
    • imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml
    • amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction-tests.yml
    • epigenetics/atacseq/atacseq-tests.yml
    • epigenetics/hic-hicup-cooler/chic-fastq-to-cool-hicup-cooler-tests.yml, hic-fastq-to-cool-hicup-cooler-tests.yml
    • microbiome/mags-building/MAGs-generation-tests.yml
    • genome-assembly/polish-with-long-reads/Assembly-polishing-with-long-reads-tests.yml
    • scRNAseq/baredsc/baredSC-1d-logNorm-tests.yml, baredSC-2d-logNorm-tests.yml
  • Corpus-wide grep tallies:
    • expect_failure: 0 hits.
    • md5:|checksum: on outputs: 0 hits.
    • compare: sim_size: 9 files.
    • compare: diff: 5 files (variant-calling and imaging).
    • has_image_*: 3 files.
    • Existence-style has_text: patterns: ~298 line matches.
    • delta_frac:: 3 files.

Incoming References (10)

  • implement-galaxy-workflow-test (related note) — Assemble Galaxy workflow test fixtures and assertions.
  • Galaxy Workflow Testability Design (related note) — Design guidance for Galaxy workflow inputs, outputs, and checkpoints that make IWC-style workflow tests possible.
  • Iwc Conditionals Survey (related note) — Corpus survey of Galaxy conditional step usage in IWC, covering when-gates, boolean shims, and routed output selection.
  • Iwc Map Over Lifecycle Survey (related note) — Survey of IWC map-over lifecycle recipes, with a Nextflow-to-Galaxy crosswalk for collection construction, cleanup, reshape, reduce, and publish phases.
  • Iwc Tabular Operations Survey (related note) — Corpus survey of tabular tools and operations across IWC workflows; map for the operation pattern hierarchy on row/column data manipulation.
  • Iwc Test Data Conventions (related note) — How IWC workflows organize and reference test data — Zenodo-first, SHA-1 integrity, collection shapes, CVMFS gotchas.
  • Iwc Transformations Survey (related note) — Corpus survey of collection-shape transformations across IWC: built-in collection ops, toolshed transformers, and the multi-step recipes that bracket map-over.
  • Nextflow nf-test snapshots to Galaxy/Planemo assertions (related note) — Translates nf-test snapshot assertions into Galaxy workflow test-format assertions, broken out by module-level vs pipeline-level test shape.
  • Planemo Asserts Idioms (related note) — Decision and idiom guide for picking planemo workflow-test assertions: which family per output type, how to size tolerances, when to validate.
  • Galaxy workflow test format (related note) — JSON Schema for the planemo workflow test format (`<workflow>-tests.yml`), vendored from `@galaxy-tool-util/schema`.