IWC test-suite shortcuts and anti-patterns

Purpose

When an agent translates or authors a Galaxy workflow for IWC submission, the test suite it writes will be reviewed against IWC’s de facto style — not against an idealized assertion ladder. That style routinely tolerates assertions that look weak in isolation. This note distinguishes the corner-cutting that is normal and accepted in the corpus from the patterns that an agent should treat as smells worth flagging.

This note owns accepted-vs-smell calls. For positive workflow-structure guidance behind label stability, checkpoint promotion, and collection identifier design, use galaxy-workflow-testability-design.

Grounding: 115 *-tests.yml files under workflow-fixtures/iwc-src/workflows/ (mirror of galaxyproject/iwc), prior synthesis in galaxy-brain/vault/projects/workflow_state/skills/COMPONENT_GALAXY_WORKFLOW_TESTING.md. Path citations below are relative to iwc-src/workflows/ unless absolute.

TL;DR rules of thumb

Default to tolerant assertions. compare: sim_size + delta:, has_image_* + delta:, has_text substring, has_h5_keys, has_n_lines + delta: are the IWC vocabulary. Strict compare: diff or exact-file: is the exception, used only when the upstream tool is fully deterministic on fixed inputs.
No negative tests. expect_failure: does not appear in the corpus. Don’t author one.
No checksums. md5: / checksum: do not appear on outputs in the corpus. SHA-1 hashes are used on inputs (integrity of remote fetch), never on output assertions.
Preserve labels. Inputs and outputs are referenced by label. Renaming silently breaks tests; treat label changes as breaking changes that require a sibling -tests.yml update.
Big data goes to Zenodo. In-repo test-data/ is for toy fixtures and expected outputs only.

1. Existence-only content probes

Accepted

The HyPhy test files reduce JSON output validation to “starts with {”:

comparative_genomics/hyphy/hyphy-core-tests.yml:32-71 — four output collections (meme_output, prime_output, busted_output, fel_output), each gene element asserts has_text: text: "{" and nothing else.
comparative_genomics/hyphy/hyphy-compare-tests.yml, capheine-core-and-compare-tests.yml — same pattern across the rest of the family.

This is accepted because:

HyPhy’s MEME/PRIME/BUSTED/FEL/CFEL/RELAX statistical outputs embed run-dependent floats throughout (likelihoods, AIC, posterior probabilities). Substring assertions on numeric fields would fail intermittently.
Selecting any specific gene name or category in the JSON would couple the test to internal HyPhy keying that the wrapper has changed across versions.
The assertion does verify “the tool ran, produced JSON, did not crash, and the collection structure matches expected element identifiers” — which is genuinely useful given HyPhy’s history of opaque failures.
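As a rough illustration of the shape described above, an existence-only probe on a collection output might look like the fragment below. This is a sketch, not a copy of the HyPhy test file: meme_output is a label cited above, while gene_a and gene_b are hypothetical element identifiers.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
# gene_a / gene_b are hypothetical element identifiers.
outputs:
  meme_output:
    element_tests:
      gene_a:
        asserts:
          has_text:
            text: "{"   # only proves the element exists and looks like JSON
      gene_b:
        asserts:
          has_text:
            text: "{"
```

Even this weak form fails when an element identifier is missing, which is the collection-structure check the rationale above relies on.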

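A quick scan with grep "has_text:\s*$\|text: \"{\"$" yields 298 matching lines across many files. The first alternative matches every block-style has_text: line, so treat 298 as an upper bound on existence-style probes rather than an exact count. Either way, the pattern is widespread, not a HyPhy quirk.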

Other variants in the same accepted-shortcut family:

First-line-of-header probes: amplicon/amplicon-mgnify/.../mgnify-amplicon-pipeline-v5-rrna-prediction-tests.yml:46-49 asserts has_text: "# mapseq v1.2.6 (Jan 20 2023)" — version banner only.
has_n_columns schema probes: same file lines 50-52, 67-69 — “the table has 15 columns” / “4 columns” with no row-content check.
has_h5_keys structure probes: scRNAseq/scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:27-32, 159-166 — confirms AnnData has obs/louvain, var/highly_variable, uns/rank_genes_groups, etc., but says nothing about cluster labels or values. Also imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml for Merged anndata.
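The schema and structure probes in this family could be sketched as follows. The pairing of labels and assertions is illustrative only: "Taxonomy table" is a hypothetical label, and the key list is abbreviated from the keys named above.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
outputs:
  # Hypothetical label for a 15-column tabular output.
  Taxonomy table:
    asserts:
      has_n_columns:
        n: 15            # column count only, no row-content check
  # Structure probe on an AnnData (HDF5) output.
  Anndata with Celltype Annotation:
    asserts:
      has_h5_keys:
        keys: "obs/louvain,var/highly_variable,uns/rank_genes_groups"
```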

Smell

Existence probes on deterministic outputs. If the underlying tool is deterministic on fixed inputs (alignment, simple QC stats, well-defined transformations), reducing to has_text: "{" is laziness; an agent should at least pull a known stable substring from the expected JSON.

Heuristic for an agent: if the source workflow’s tool is on the “stochastic / floating-point heavy / version-fragile” list (HyPhy, RepeatModeler, scanpy plots, MCMC samplers, ML inference), existence probes are accepted. Otherwise, prefer has_text against a stable token from a real output.
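For the deterministic case, a slightly stronger probe might look like this sketch. Both the label and the token are hypothetical; in practice the token should be copied from a real output of the workflow.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
outputs:
  # Hypothetical label; the quoted token is a hypothetical stable JSON key.
  qc_summary_json:
    asserts:
      has_text:
        text: '"total_sequences"'
```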

2. Size-only comparisons (`compare: sim_size` + `delta:`)

Accepted

Canonical example: repeatmasking/RepeatMasking-Workflow-tests.yml:11-46 — every output is compare: sim_size with delta: 30000 (30 KB) on small outputs and delta: 90000000 (90 MB!) on the Stockholm seed-alignment file. RepeatModeler’s discovered repeat families differ run-to-run; only output magnitude is reproducible.
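A minimal sketch of the size-only pattern, assuming hypothetical labels and expected-file paths; only the two delta values are taken from the description above.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
# Labels and paths are hypothetical.
outputs:
  repeat_library:
    path: test-data/repeat_library.fasta
    compare: sim_size
    delta: 30000          # bytes
  seed_alignments:
    path: test-data/seed_alignments.stk
    compare: sim_size
    delta: 90000000       # bytes; the extreme end of the corpus
```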

grep "compare: sim_size" returns 9 files using this pattern:

repeatmasking/RepeatMasking-Workflow-tests.yml, repeatmasking/Repeat-masking-with-RepeatModeler-and-RepeatMasker-tests.yml (RepeatModeler — large delta band)
epigenetics/hic-hicup-cooler/chic-fastq-to-cool-hicup-cooler-tests.yml, epigenetics/hic-hicup-cooler/hic-fastq-to-cool-hicup-cooler-tests.yml (HiC matrices)
genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences-tests.yml
VGP-assembly-v2/kmer-profiling-hifi-trio-VGP2/kmer-profiling-hifi-trio-VGP2-tests.yml
genome-assembly/polish-with-long-reads/Assembly-polishing-with-long-reads-tests.yml
scRNAseq/baredsc/baredSC-1d-logNorm-tests.yml, scRNAseq/baredsc/baredSC-2d-logNorm-tests.yml (Bayesian sampling — uses delta_frac: here)

Delta-magnitude survey

From grep "delta: [0-9]+" distribution across the corpus:

| Delta band | Count | Typical use |
| --- | --- | --- |
| 4–100 (tiny) | ~20 | image pixel dimensions, line counts |
| 1K–10K | ~40 | small text/tabular outputs, plot PNGs |
| 25K–100K | ~25 | mid-size reports, multi-page plots |
| 200K–1M | ~10 | report HTML, BAM stats |
| 1M–10M | ~10 | medium BAM/BCF/cool files |
| 10M+ (up to 90M) | ~7 | RepeatModeler libraries, large alignments |

The 90 MB delta on RepeatMasking-Workflow-tests.yml:20 is at the extreme. The check passes for any output within 90 MB of the expected size, so if the expected file is itself on the order of 90 MB the accepted range runs from roughly zero bytes to 180 MB, and the assertion effectively catches only the empty-output failure mode. Accepted because RepeatModeler’s seed alignments are known to vary by tens of MB across runs.

`delta_frac:`

Used in 3 files (scRNAseq/baredsc/*, genome-assembly/polish-with-long-reads/*). Preferred over absolute delta: when the expected output size scales with input. An agent translating a workflow whose output size depends on input volume should consider delta_frac: over delta:.
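A relative tolerance could be expressed as in the sketch below. The label, path, and the 0.1 fraction are assumptions chosen for illustration, not values from the corpus.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
# Label, path, and fraction are hypothetical.
outputs:
  polished_assembly:
    path: test-data/polished_assembly.fasta
    compare: sim_size
    delta_frac: 0.1       # tolerance as a fraction of the expected size
```

The appeal is that the tolerance scales with the expected file, so the same assertion stays meaningful if the test fixture is later swapped for a larger or smaller one.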

Smell

compare: sim_size on outputs from a deterministic tool. If the tool is bwa/bowtie2/samtools-sort with fixed seeds and pinned versions, there’s no excuse for size-only — compare: diff (with modest lines_diff:) or content assertions are appropriate.

Also a smell: stacking size + has_image_* checks on a PNG without any content assertion when the workflow’s claim is about the data shown (e.g., a clustering plot). The corpus does this routinely (Scanpy file below) — accepted, but a translated workflow that has a more deterministic plotter should do better.

3. Image plot assertions

Accepted

scRNAseq/scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:33-205 is the dense example. ~15 PNG outputs each get the same triple:

UMAP of louvain:
  has_size:
    size: 68416
    delta: 6000
  has_image_width:
    width: 601
    delta: 30
  has_image_height:
    height: 429
    delta: 25

What this catches:

Plot was rendered (non-zero size).
Render dimensions are stable (matplotlib defaults didn’t drift, theme didn’t change).
Approximate file size hasn’t shifted by an order of magnitude (no catastrophic content change like all-white or all-noise).

What this misses: cluster assignments wrong, axes mislabeled, points in wrong positions, colors swapped, the wrong subset plotted, NaN handling regression. Two visually different UMAPs can have identical width/height/size-within-10%.

Other observed image-assertion users (grep "has_image"):

imaging/tissue-microarray-analysis/tissue-microarray-analysis/tissue-micro-array-analysis-tests.yml
imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml

The TMA tests use a friendlier shorthand: has_size: size: 181K, delta: 50K (human-readable units).

Smell

Asserting has_image_width/has_image_height with zero delta on a tool that re-encodes (PNG round-trips through matplotlib) is brittle. The corpus uses 5–10% deltas; an agent emitting delta: 0 should be flagged unless the renderer is byte-stable.

Note: has_image_channels and has_image_center_of_mass are documented (Galaxy XSD) but not observed in the sampled corpus. An agent with a deterministic mask/segmentation output could use has_image_center_of_mass to actually verify spatial correctness; this would be an upgrade over the current corpus norm, not a smell.
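Since the corpus has no example of this assertion, the following is only a sketch of what it might look like. The label and coordinates are hypothetical, and the parameter names (center_of_mass, eps) are written from my reading of the Galaxy assertion schema and should be checked against the XSD before use.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
# Label and values are hypothetical; verify parameter names against the XSD.
outputs:
  nuclei_mask:
    asserts:
      has_image_center_of_mass:
        center_of_mass: "256.0, 256.0"   # assumed "horizontal, vertical" order
        eps: 1.0                         # assumed tolerance on each coordinate
```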

4. Happy-path-only culture (`expect_failure:`)

grep -r "expect_failure" over all 115 tests files returns zero hits. The IWC corpus has no negative tests. Period.

This is a structural property of IWC: the workflows are published artifacts intended to succeed on canonical inputs. Adversarial / error-path testing happens in tool wrappers, not in workflow tests.

Implication for an agent: Do not author expect_failure: cases when translating a workflow. If the source pipeline (e.g., nf-core) had a “fail on bad reference” test, drop it — it doesn’t belong in IWC. If the validation logic is important, it should be in a wrapper-level tool test, not a workflow test.

5. `md5:` / `checksum:` rarity

grep -r "md5:\|checksum:" over *-tests.yml: zero hits.

SHA-1 hashes: blocks are pervasive — but exclusively on inputs (hash_function: SHA-1 paired with a remote location:), to guard against silent corruption of the fetched fixture. Output assertions never use them.

Accepted. The reason is empirical: outputs of real bioinformatics tools (BAM with PG headers and timestamps, VCFs with command-line provenance, JSON with run dates) are almost never byte-stable across runs. A checksum: would fail intermittently.

Smell: an agent emitting checksum: or md5: on a workflow output. Even for “fully deterministic” tools, embedded provenance breaks checksums. Use compare: diff + lines_diff: instead, or content assertions.

6. Output label coupling

Test files key outputs by workflow label, with spaces, capitals, punctuation preserved verbatim. Examples:

scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:27 — Anndata with Celltype Annotation: (spaces, mixed case)
scanpy-clustering/.../Preprocessing-...-Scanpy-tests.yml:33 — UMAP of louvain and top ranked genes:
imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml — Merged anndata:, Spatial Scatterplot Montage:
consensus-from-variation-tests.yml:30 — multisample_consensus_fasta: (snake-case style)

Both styles coexist. Snake-case (older / SARS-CoV-2 family) and natural-language-with-spaces (newer / scanpy / TMA) are equally valid. Reviewers do not enforce a single convention.
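To make the coupling concrete, the sketch below keys two outputs by the labels quoted above. The two labels come from different workflows and are combined here only for illustration, and the assertions attached to them are assumptions.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
# Each key must match the workflow output label character for character.
outputs:
  Anndata with Celltype Annotation:      # natural-language label with spaces
    asserts:
      has_h5_keys:
        keys: "obs/louvain"
  multisample_consensus_fasta:           # snake_case label
    asserts:
      has_text:
        text: ">"                        # FASTA header marker
```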

Coupling consequences

Renaming an output label in the .ga without updating the sibling -tests.yml is a silent breakage:

Test references the old label → key not present in invocation outputs → assertion mismatch surfaces as an opaque “output not found” error in planemo.
planemo workflow_lint --iwc enforces that workflow outputs are labeled, not that test labels match.

Discipline observed: every output a test asserts on is a labeled workflow output. The corpus does not assert on positional / unlabeled outputs.

Smell: a translated workflow with unlabeled outputs that later need test coverage. An agent should label every output it intends to assert on, before writing assertions.

Positive design guidance now lives in galaxy-workflow-testability-design: pick stable labels before test authoring and treat renames as test-breaking API changes.

7. Intermediate-step output gap

-tests.yml can only assert on workflow-level outputs (entries in the .ga’s top-level outputs). Intermediate step results are inaccessible to assertions.

Observed workaround across the corpus: promote the intermediate to a workflow output. This is visible indirectly — many workflows expose what would naturally be intermediates as labeled outputs solely for testability:

scanpy-clustering exposes Initial Anndata General Info, Anndata with raw attribute, Plot highly variable, Elbow plot of PCs and variance — these are mid-pipeline checkpoints surfaced specifically to be assertable. Compare counts: 22 outputs asserted vs the 7-or-so “user-meaningful” final artifacts.
MAGs-generation-tests.yml exposes a Full MultiQC Report even though MultiQC is logically intermediate to MAG annotation.

Cost: “test-only” outputs clutter the workflow’s user-facing output list. Reviewers tolerate this in exchange for testability.

Accepted shortcut: promoting an intermediate to a workflow output for test purposes. Not a smell.

Smell: asserting on a step output via some side-channel (e.g., relying on Galaxy collection ordering, indexing into tool_state). The corpus does not do this and an agent should not invent it.

Positive design guidance now lives in galaxy-workflow-testability-design: promote assertable checkpoints deliberately, especially when final reports or plots can only support weak smoke tests.

8. Remote-data fragility

Pattern

The corpus overwhelmingly prefers Zenodo as the input store. Every remote location: is paired with a SHA-1 hashes: block. Examples appear in nearly every test file cited above.
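The location-plus-hash pairing might look like the following sketch. Everything specific here is a placeholder: the input label, the Zenodo record number, the file name, the filetype, and the hash value are not real.

```yaml
# Fragment of a *-tests.yml entry (job section only).
# All identifiers and the hash are placeholders.
job:
  Input reads:
    class: File
    location: https://zenodo.org/records/0000000/files/reads.fastq.gz
    filetype: fastqsanger.gz
    hashes:
      - hash_function: SHA-1
        # Quoted so that YAML keeps the placeholder as a string.
        hash_value: "0000000000000000000000000000000000000000"
```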

Non-Zenodo remote sources observed (from grep "ftp.sra.ebi\|ftp://\|figshare\|github.com"):

virology/pox-virus-amplicon/pox-virus-half-genome-tests.yml:27,34,48,55 — ftp://ftp.sra.ebi.ac.uk/... SRA fastqs
amplicon/amplicon-mgnify/.../mgnify-amplicon-pipeline-v5-rrna-prediction-tests.yml:21 — ftp://ftp.ebi.ac.uk/... reference DB
data-fetching/parallel-accession-download/parallel-accession-download-tests.yml, VGP-assembly-v2/Plot-Nx-Size/... — accession-driven
variant-calling/generic-variant-calling-wgs-pe/Generic-variation-analysis-on-WGS-PE-data-tests.yml — github raw URLs

The fragility

Zenodo is a single point of failure across CI. A Zenodo outage breaks every IWC PR concurrently. SHA-1 hashes guard against silent corruption but provide no mitigation for outages or HTTP 503s.
EBI/SRA FTP is even less reliable — observed flake-prone in the broader Galaxy CI history.
No retry / backoff is configured at the test-format level; with planemo-ci-action’s defaults, transient failures are handled only by manually re-running the failed chunk job.

Accepted

This is just life in the IWC. Don’t try to “fix” it in a translated workflow by inlining large data — reviewers will push back (see §10).

Smell

Inputs hosted on a contributor’s personal endpoint, S3 bucket, or Dropbox. Reviewers ask for migration to Zenodo before merge.

9. `compare: diff` on timestamped outputs

compare: diff usage from grep:

sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation-tests.yml:32 — compare: diff, lines_diff: 6 on annotated VCF
sars-cov-2-variant-calling/sars-cov-2-pe-illumina-wgs-variant-calling/pe-wgs-variation-tests.yml:35 — same
imaging/fluorescence-nuclei-segmentation-and-counting/segmentation-and-counting-tests.yml:16 — lines_diff: 0
variant-calling/variation-reporting/Generic-variation-analysis-reporting-tests.yml:21,25,29,33 — lines_diff: 6
variant-calling/generic-variant-calling-wgs-pe/Generic-variation-analysis-on-WGS-PE-data-tests.yml:39,45,55 — lines_diff: 6

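The lines_diff: 6 constant looks suspicious at first, but 6 lines is the typical VCF header preamble that embeds ##fileDate=... and ##source=.... The use is defensible: the tolerance window matches the known mutable header lines, and content lines are still diffed strictly. Note that lines_diff is a budget on differing lines anywhere in the file, so the strictness on content lines holds only to the extent that the header consumes that budget.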
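The VCF convention described above can be sketched as follows, with a hypothetical label and expected-file path.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
# Label and path are hypothetical.
outputs:
  annotated_variants:
    path: test-data/annotated_variants.vcf
    compare: diff
    lines_diff: 6         # budget for the mutable header lines
```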

Smell

compare: diff with lines_diff: 0 on a file that contains any timestamp, command-line capture, version banner, or random tie-break (hash-ordered dictionaries in Python output, etc.). The single observed lines_diff: 0 case (segmentation-and-counting-tests.yml:16) appears to be on a numeric tabular output where it’s defensible — verify content type before flagging.
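compare: diff on a BAM. BAM headers include @PG lines with full command lines and Galaxy job IDs. Use has_size plus content assertions on a text extract (for example, samtools view output promoted to a workflow output, per §7) rather than a byte-level diff; has_archive_member targets tar/zip-style archives, so it is unlikely to apply to a BAM directly.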

Recommended replacement when timestamps appear:

For VCF: compare: diff, lines_diff: <header-line-count> (corpus convention is 6).
For tabular reports: has_text against stable column headers + has_n_columns.
For HTML reports (MultiQC etc.): has_text substring on stable section names — example short-read-qc-trimming/short-read-quality-control-and-trimming-tests.yml:25-28 asserts "Filtered Reads" substring on the MultiQC HTML report rather than diffing.
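The tabular and HTML replacements above might be written as in this sketch. Both labels, the column header token, and the column count are hypothetical; only the "Filtered Reads" substring comes from the cited MultiQC example.

```yaml
# Fragment of a *-tests.yml entry (outputs section only).
# Labels, header token, and column count are hypothetical.
outputs:
  summary_table:
    asserts:
      has_text:
        text: "sample_id"        # stable column header
      has_n_columns:
        n: 4
  multiqc_html_report:
    asserts:
      has_text:
        text: "Filtered Reads"   # stable section name, not a diff
```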

10. Reviewer-feedback recurring asks

Synthesized from the COMPONENT_GALAXY_WORKFLOW_TESTING.md analysis (section 9 and the “Common PR-review feedback” subsection) plus structural observation of what every accepted workflow has and rejected drafts apparently lack:

| Reviewer ask | Where it bites | Source |
| --- | --- | --- |
| Creator `identifier:` must be a full ORCID URI (`https://orcid.org/...`), not bare ID. | `.ga` frontmatter `creator:` block. Most common lint failure. | planemo PR #1458; `consensus-from-variation.ga:4-10` shows the conformant shape. |
| Move large inputs to Zenodo. Inline `path: test-data/big.bam` for >1 MB inputs gets pushback. | `-tests.yml` job inputs. | `iwc/workflows/README.md` (per prior analysis). |
| Bump `release` + add CHANGELOG.md entry in the same PR. | `.ga` `release:` and sibling `CHANGELOG.md`. Enforced by `bump_version.py`. | `iwc/workflows/README.md:217-247`. |
| Generate tests via `planemo workflow_test_init --from_invocation <id>`, not by hand. Reviewers push back on hand-authored job blocks. | Any new test contribution. | help.galaxyproject.org thread 13903. |
| Don’t use `compare: diff` on outputs that embed timestamps. Switch to `has_text`/`has_n_lines` with `delta:`. | See §9. | Recurring review comment. |
| Add labeled outputs for any output you assert on. Unlabeled outputs caught by `planemo workflow_lint --iwc`. | `.ga` outputs. | §6 above. |
| Hashes on every remote `location:`. SHA-1 block paired with the URL. Reviewers spot-check. | `-tests.yml` job inputs. | Universal in the corpus; missing hashes get flagged. |

Smell to flag for an agent submission

Bare ORCID ID (0000-0002-...) in creator: instead of full URL.
Test job referencing >1 MB local fixture instead of a Zenodo URL.
PR that bumps a workflow without CHANGELOG / release: bump.
Hand-authored -tests.yml that reads “too clean” — reviewers know --from_invocation output has a recognizable fingerprint.

Summary cheatsheet for the implement-galaxy-workflow-test mold

Use these freely (accepted shortcuts):

has_text: text: "{" for stochastic JSON outputs.
compare: sim_size, delta: for non-deterministic file outputs; pick delta from §2 distribution by tool family.
has_image_width/height/has_size triple with 5–10% delta for matplotlib plots.
has_h5_keys for AnnData/HDF5 — assert structure not values.
Promoting intermediates to workflow outputs to make them assertable.
Labels with spaces, mixed case, punctuation as the output key.
SHA-1 hashes on every input location:.

Avoid (smells reviewers or future-you will catch):

expect_failure: — not an IWC pattern.
md5: / checksum: on outputs.
compare: diff, lines_diff: 0 on anything containing timestamps, BAM @PG lines, or Python dict ordering.
has_image_* with zero delta.
Existence-only has_text: "{" on outputs from a deterministic tool.
Asserting on positional/unlabeled outputs.
Inlining >1 MB binary fixtures in test-data/.
Bare ORCID identifier in .ga frontmatter.

Review-time checklist before submission:

Every output asserted on has a label in the .ga.
Every remote location: has a hashes: SHA-1 block.
Inputs >1 MB live on Zenodo, not in test-data/.
release: bumped and CHANGELOG.md updated if .ga changed.
creator.identifier: is a full https://orcid.org/... URL.
Test was generated by planemo workflow_test_init --from_invocation, not hand-written.

Sources

Positive workflow-structure guidance: galaxy-workflow-testability-design.
Prior synthesis: /Users/jxc755/projects/repositories/galaxy-brain/vault/projects/workflow_state/skills/COMPONENT_GALAXY_WORKFLOW_TESTING.md (sections 2c, 2d, 2e, 9).
Corpus root: /Users/jxc755/projects/repositories/workflow-fixtures/iwc-src/workflows/ (115 *-tests.yml files across 22 categories).
Specific files cited:
- comparative_genomics/hyphy/hyphy-core-tests.yml, hyphy-compare-tests.yml, capheine-core-and-compare-tests.yml
- repeatmasking/RepeatMasking-Workflow-tests.yml, Repeat-masking-with-RepeatModeler-and-RepeatMasker-tests.yml
- scRNAseq/scanpy-clustering/Preprocessing-and-Clustering-of-single-cell-RNA-seq-data-with-Scanpy-tests.yml
- read-preprocessing/short-read-qc-trimming/short-read-quality-control-and-trimming-tests.yml
- virology/pox-virus-amplicon/pox-virus-half-genome-tests.yml
- metabolomics/lcms-preprocessing/Mass_spectrometry__LC-MS_preprocessing_with_XCMS-tests.yml
- sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation-tests.yml
- sars-cov-2-variant-calling/sars-cov-2-pe-illumina-wgs-variant-calling/pe-wgs-variation-tests.yml
- sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation-tests.yml
- variant-calling/variation-reporting/Generic-variation-analysis-reporting-tests.yml
- variant-calling/generic-variant-calling-wgs-pe/Generic-variation-analysis-on-WGS-PE-data-tests.yml
- imaging/fluorescence-nuclei-segmentation-and-counting/segmentation-and-counting-tests.yml
- imaging/tissue-microarray-analysis/multiplex-tissue-microarray-analysis/multiplex-tma-tests.yml
- amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction-tests.yml
- epigenetics/atacseq/atacseq-tests.yml
- epigenetics/hic-hicup-cooler/chic-fastq-to-cool-hicup-cooler-tests.yml, hic-fastq-to-cool-hicup-cooler-tests.yml
- microbiome/mags-building/MAGs-generation-tests.yml
- genome-assembly/polish-with-long-reads/Assembly-polishing-with-long-reads-tests.yml
- scRNAseq/baredsc/baredSC-1d-logNorm-tests.yml, baredSC-2d-logNorm-tests.yml
Corpus-wide grep tallies:
- expect_failure: 0 hits.
- md5:|checksum: on outputs: 0 hits.
- compare: sim_size: 9 files.
- compare: diff: 5 files (variant-calling and imaging).
- has_image_*: 3 files.
- Existence-style has_text: patterns: ~298 line matches.
- delta_frac:: 3 files.
