INITIAL_CORPUS_INGESTION

Initial IWC Integration

(Filename retained as INITIAL_CORPUS_INGESTION.md for continuity, but “ingestion” overstates what survives. After deconstruction, the Foundry has no IWC ingestion pipeline, no exemplar mirror, and no workflow-fixtures runtime dependency. This document describes the lighter integration that replaces the earlier sketch.)

What changed and why

Earlier sketch: ingest gxformat2-cleaned IWC workflows from workflow-fixtures/ into per-workflow exemplar pages with auto-generated structural metadata (step counts, tool lists, test fixtures) and hand-curated annotations preserved across re-ingestion.

Problem: the unique value per exemplar page is the hand-curated ## Patterns demonstrated cross-reference. Everything else (step count, tool list, has_tests, the auto-generated Steps body) is just re-rendering of upstream .gxwf.yml content. The <!-- foundry:hand-curated --> markers, idempotent regeneration, drift warnings, and structural-change detection are a lot of machinery to protect a paragraph of hand-written cross-reference per workflow.

workflow-fixtures was R&D scratch for pattern-research. Treating it as Foundry-runtime input recreates the “we mirror IWC” problem in lighter clothing. Agents have gxwf and can clean / inspect IWC directly. Foundry contributors can use workflow-fixtures locally as an authoring aid without the Foundry depending on it.

What the Foundry does instead

  1. Patterns cite IWC by URL, in the page body. A pattern’s ## Exemplars section lists IWC workflows that demonstrate it, each as a free-form Markdown link ([bacterial-genomics/...](https://github.com/galaxyproject/iwc/blob/<sha>/workflows/...)) with one-liner author commentary. Pin to a specific commit SHA when stability matters; pin to main when freshness matters. Author choice per citation; no enforced policy.

  2. Inline excerpts when they earn it. A pattern author may paste 10–30 lines of cleaned gxformat2 directly into a pattern body to illustrate an idiom. The cleaning is done at authoring time by the human running gxwf locally (probably against workflow-fixtures, or against a fresh clone, or against a raw IWC URL — the Foundry doesn’t care). The excerpt is committed verbatim into the pattern page; no build-time regeneration; rot is rot.

  3. iwc/* tag family for category aggregation. Patterns (and Molds, when applicable) carry one or more iwc/<category> tags. The category vocabulary lives in meta_tags.yml, hand-maintained, seeded by a one-time inventory of current IWC categories.

  4. Per-category aggregation surfaces come from the tag system:

    • The standard tags/iwc/<category> page (free, falls out of tags/[...tag].astro) lists all patterns + Molds tagged that way.
    • A generated content/iwc-overview.md (--check-gated) groups every iwc/* tag in a single dashboard with counts and a one-line description per category. Single landing page for “what does the Foundry have for variant-calling, RNA-seq, …”.
    • A new dashboard_sections.json block surfaces an “IWC Coverage” section on the main dashboard.
  5. compare-against-iwc-exemplar (the Mold) operates against live IWC. The cast skill loads with instructions to fetch IWC at runtime via WebFetch / gxwf, not against Foundry-hosted exemplar pages. The Mold’s source artifact describes the procedure, not a corpus index.

What this gives up

iwc/* vocabulary

Hand-maintained in meta_tags.yml. Seeded once by inventorying IWC categories. Add as needed when a new pattern lands; never auto-synced.

Source of truth: top-level directories under ~/projects/repositories/iwc/workflows/. Each is a science-domain category. Slugs are normalized to lowercase + dashes (bacterial_genomicsiwc/bacterial-genomics). Per-.ga tags fields exist for finer labels (e.g., mpvx, generic) but are not the categorization spine — they may inform a finer tag tier later if needed; not in v1.

Concrete seed (as of iwc@<seed-commit-sha>):

iwc/amplicon:
  description: "Amplicon sequencing analysis"
iwc/bacterial-genomics:
  description: "Bacterial genome analysis"
iwc/comparative-genomics:
  description: "Comparative-genomics workflows"
iwc/computational-chemistry:
  description: "Computational-chemistry workflows"
iwc/data-fetching:
  description: "Data-acquisition and fetching workflows"
iwc/epigenetics:
  description: "Epigenetics (ChIP-seq, ATAC-seq, methylation, …)"
iwc/genome-annotation:
  description: "Genome-annotation workflows"
iwc/genome-assembly:
  description: "Genome-assembly workflows"
iwc/imaging:
  description: "Image-analysis workflows"
iwc/metabolomics:
  description: "Metabolomics workflows"
iwc/microbiome:
  description: "Microbiome / metagenomics workflows"
iwc/proteomics:
  description: "Proteomics workflows"
iwc/read-preprocessing:
  description: "Read pre-processing (trimming, filtering, QC)"
iwc/repeatmasking:
  description: "Repeat-masking workflows"
iwc/sars-cov-2-variant-calling:
  description: "SARS-CoV-2 variant calling"
iwc/scrnaseq:
  description: "Single-cell RNA-seq workflows"
iwc/transcriptomics:
  description: "Bulk RNA-seq / transcriptomics workflows"
iwc/variant-calling:
  description: "Variant-calling workflows (DNA-seq, somatic, germline)"
iwc/vgp-assembly:
  description: "VGP (Vertebrate Genomes Project) assembly workflows"
iwc/virology:
  description: "Virology / viral-genomics workflows"

A one-time tsx scripts/seed-iwc-tags.ts script reads <iwc-clone>/workflows/, emits this YAML block. Not idempotent in the sense of re-syncing on demand — re-running it generates a candidate patch the maintainer applies by hand to meta_tags.yml. After v1 seeding, vocabulary churn (rare) is hand-edited.

Cross-cutting cases worth a note for v1 seeding:

Generated iwc-overview.md

Auto-generated, --check-gated, drift detection wired into CI. Format:

# IWC Coverage Overview

Generated from `iwc/*` tag membership across patterns and Molds. Run `npm run iwc-overview` to regenerate.

## Variant Calling (`iwc/variant-calling`)

> Variant-calling workflows (DNA-seq, somatic, germline)

Patterns:
- [[galaxy-paired-collection-vcf-merge]] — paired collection → merged VCF idiom
- [[galaxy-conditional-calling-mode]]    — somatic vs. germline branching

Molds:
- [[implement-galaxy-tool-step]]         (general; tagged for relevance)

## RNA-seq (`iwc/rna-seq`)

> RNA-seq quantification, splicing, differential expression

Patterns:
- [[galaxy-tabular-counts-pivot]]        — STAR/Salmon counts → tidy tabular

Counts in the dashboard surface (“Variant Calling — 4 patterns, 2 Molds”) let a contributor immediately see which areas are well-covered vs thinly covered.

Validation

The validator gains one small Foundry-specific check (replacing the earlier validateExemplarPin and validateMoldRefs):

That’s the entirety of “IWC integration” enforcement. Citations in pattern bodies are not validated (URLs are URLs; the cost of brokenness is moderate, the cost of automated link-check at scale is real).

What lives where (summary)

v1 minimum

To exercise this lighter integration:

  1. Seed iwc/* categories in meta_tags.yml (one-time scrape or hand-typed).
  2. Author 2–3 patterns end-to-end with ## Exemplars sections citing IWC URLs and one inline excerpt.
  3. Wire the iwc-overview.md generator + --check drift gate.
  4. Confirm a Mold can wiki-link a tagged pattern and that casting resolves the citations as live URLs (not as embedded mirrors).

If the loop holds, scale to more patterns. No further integration tooling planned for v1.

Open questions