Alignment: gxy-sketches ↔ Galaxy Workflow Foundry
Project-infrastructure research for an adjacent workflow-sketch project. This note records where
the Foundry’s per-source summary Molds (summarize-paper, summarize-nextflow, summarize-cwl) and
a possible future summarize-galaxy-workflow should align with gxy-sketches, so the two projects
stay legible to each other without becoming dependent.
gxy-sketches is not a Foundry project. It is owned by another contributor (boss). The Foundry does not consume it at runtime, casting time, or build time. “Alignment” here means: shared field names, shared mental model for test manifests, and a documented mapping for vocabularies — nothing more invasive.
What gxy-sketches is
Repo: /Users/jxc755/projects/repositories/gxy-sketches (no public URL recorded here).
- Python project. typer CLI: gxy-sketches ingest|generate|validate|list.
- Unit: a sketch directory under sketches/<domain>/<slug>/ containing SKETCH.md (YAML frontmatter + markdown body), optional test_data/, optional expected_output/.
- Sources (v1): nf-core (Nextflow) and Galaxy IWC (gxformat2 .ga + planemo *-tests.yml). Snakemake / WDL deferred to v2.
- Pipeline: ingest clones a source repo into workflows_cache/, builds a WorkflowRecord with an attached TestManifest; generate calls Claude (prompt-cached system prompt) and writes one SKETCH.md per workflow; validate lints the corpus.
- Consumer: gxy3 (chat-driven bioinformatics desktop app) loads sketches/**/*.md the way Claude Code loads skills. The sketch teaches the agent when to pick this analysis class and the high-level recipe; it is a routing / decision aid, not a constructor.
- Sketch body shape (fixed, enforced by prompt): # Title → ## When to use this sketch → ## Do not use when → ## Analysis outline → ## Key parameters → ## Test data → ## Reference workflow.
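The fixed body shape can be checked mechanically. The following is a minimal sketch, assuming the section titles listed above are matched exactly; check_body_shape is a hypothetical helper written for this note, not part of gxy-sketches, whose own validator may work differently.

```python
import re

# Section order copied from this note; the checker itself is illustrative only.
EXPECTED_H2 = [
    "When to use this sketch",
    "Do not use when",
    "Analysis outline",
    "Key parameters",
    "Test data",
    "Reference workflow",
]


def check_body_shape(body: str) -> list:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    # "^# " only matches level-1 headings; "## ..." lines do not match it.
    h1 = re.findall(r"^# (.+)$", body, flags=re.MULTILINE)
    h2 = [h.strip() for h in re.findall(r"^## (.+)$", body, flags=re.MULTILINE)]
    if len(h1) != 1:
        problems.append(f"expected exactly one '# Title' heading, found {len(h1)}")
    if h2 != EXPECTED_H2:
        problems.append(f"unexpected '## ' sections or order: {h2}")
    return problems
```

This naive version would also count heading-like lines inside fenced code blocks, so a real check would need to skip fences first.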
Pydantic types worth knowing (src/gxy_sketches/schema.py):
- WorkflowRecord (ingestor output): ecosystem, slug, display_name, source_url, version, license, harvested files: list[WorkflowFile], test_manifest: TestManifest | None, tool_versions.
- TestManifest: inputs: list[TestDataRef], outputs: list[ExpectedOutputRef], output_source_map.
- TestDataRef: role, path (under test_data/), url, sha1, filetype, description. Either path or url required.
- ExpectedOutputRef: role, path (under expected_output/), url, kind, description, assertions: list[str]. Needs at least one of path/url/assertions.
- ToolSpec: name, version. Bare-string back-compat in input.
- SketchFrontmatter: strict (extra="forbid"); fields: name (kebab, 3-80), description (30-600), domain (enum), organism_class, input_data, source: SketchSource, tools: list[ToolSpec], tags, test_data, expected_output.
- SketchSource: ecosystem, workflow, url, version, license, slug.
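The two "at least one of" rules are the part of these types most likely to matter for alignment. The sketch below restates them with stdlib dataclasses rather than pydantic so it runs without dependencies. It is not the code in schema.py: which fields are optional, and the choice to make role required, are assumptions made here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TestDataRef:
    role: str  # assumed required; the note does not say
    path: Optional[str] = None
    url: Optional[str] = None
    sha1: Optional[str] = None
    filetype: Optional[str] = None
    description: Optional[str] = None

    def __post_init__(self) -> None:
        # Rule from the note: either path or url is required (both is fine).
        if not (self.path or self.url):
            raise ValueError("TestDataRef needs path or url")


@dataclass
class ExpectedOutputRef:
    role: str  # assumed required; the note does not say
    path: Optional[str] = None
    url: Optional[str] = None
    kind: Optional[str] = None
    description: Optional[str] = None
    assertions: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Rule from the note: at least one of path / url / assertions.
        if not (self.path or self.url or self.assertions):
            raise ValueError("ExpectedOutputRef needs path, url, or assertions")
```

An output described only by assertions, with no file at all, is valid under this rule, which is worth preserving on the Foundry side.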
domain enum (from prompts.py): variant-calling | assembly | rna-seq | single-cell | metagenomics | epigenomics | proteomics | phylogenetics | long-read | amplicon | structural-variants | qc | annotation | other.
Why it is adjacent, not the same target
| | gxy-sketches | Foundry |
|---|---|---|
| Question answered | “Which analysis class do I pick for this user request?” | “How do I build this Galaxy/CWL workflow from this source?” |
| Consumer | gxy3 agent at routing time | Harnesses doing source→target translation, validation, debug |
| Per-workflow unit | One SKETCH.md (frontmatter + decision-aid prose) | summarize-<source> Mold output (structured JSON per [[summary-<source>]], schema in packages/summary-<source>-schema/src/) |
| Source coverage | nf-core, IWC (v1) | paper, nextflow, cwl. No IWC ingest — IWC cited by URL in pattern bodies (see CORPUS_INGESTION.md) |
| Output shape | Markdown + YAML frontmatter | JSON Schema-validated structured data |
| Test fixtures | Bundled into the sketch dir, capped at 5 MB total | Referenced as data; no bundling, no size cap |
| Generation | One-shot LLM call per workflow, prompt-cached system prompt | Per-Mold cast, per-kind dispatch over typed references |
Alignment moves
These are additive recommendations, not shared runtime contracts.
1. Test-fixture field-name parity
The Foundry’s summary-nextflow.schema.json and summary-cwl.schema.json will need a test-fixture sub-block. Adopt TestDataRef / ExpectedOutputRef field names verbatim:
- inputs[] items: role, path, url, sha1, filetype, description.
- outputs[] items: role, path, url, kind, description, assertions[].
- path / url semantics: either / or / both, same rules as gxy-sketches.
Rationale: a Foundry summary becomes structurally consumable by anyone using gxy-sketches’ TestManifest shape and vice versa. No code dependency; just naming convergence. Cheap to do; hard to retrofit.
Drop: gxy-sketches’ constraint that path must be under test_data/. That is a sketch-bundle invariant (the validator enforces it because the sketch directory bundles fixtures), whereas the Foundry’s summary describes a workflow’s test fixtures as data, so bundling is out of scope. Also drop the 5 MB cap: it too is a sketch-bundle invariant, irrelevant to summary schemas.
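Because parity here is naming convergence rather than a code dependency, it can be guarded by a small check over the schema’s declared property names. The sketch below assumes a hypothetical JSON Schema fragment for one inputs[] item (example_input_item); the real summary-nextflow.schema.json layout is not specified in this note. The fragment expresses “either / or / both” with anyOf and deliberately carries no test_data/ prefix constraint on path.

```python
# Field names copied from this note.
INPUT_FIELDS = {"role", "path", "url", "sha1", "filetype", "description"}
OUTPUT_FIELDS = {"role", "path", "url", "kind", "description", "assertions"}

# Hypothetical schema fragment for one inputs[] item; not the Foundry's actual schema.
example_input_item = {
    "type": "object",
    "properties": {
        "role": {"type": "string"},
        "path": {"type": "string"},  # no "must be under test_data/" pattern
        "url": {"type": "string"},
        "sha1": {"type": "string"},
        "filetype": {"type": "string"},
        "description": {"type": "string"},
    },
    # path, url, or both must be present.
    "anyOf": [{"required": ["path"]}, {"required": ["url"]}],
}


def parity_report(item_schema: dict, expected: set) -> dict:
    """Compare declared property names against the expected field names."""
    declared = set(item_schema.get("properties", {}))
    return {
        "missing": sorted(expected - declared),
        "extra": sorted(declared - expected),
    }
```

A check like this could run in either project’s test suite without importing anything from the other, which keeps the alignment additive.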
2. Tool-spec parity
summarize-galaxy-tool and summarize-nextflow / summarize-cwl all need to enumerate tools with versions. Match ToolSpec(name, version). Same field names.
3. Source-record parity
When the Foundry’s summarize-paper / summarize-nextflow / summarize-cwl (and a future summarize-galaxy-workflow, see §5) emit a “this is where I came from” block, mirror SketchSource field names: ecosystem, workflow, url, version, license, slug.
The Foundry’s ecosystem vocabulary should be a superset: gxy-sketches has nf-core | iwc | snakemake-workflows | wdl. The Foundry’s source axis is currently paper | nextflow | cwl; if the Foundry adds an IWC summarizer (§5), use iwc not a new term.
4. Domain ↔ iwc/* vocabulary mapping
gxy-sketches’ fixed domain enum overlaps heavily with Foundry’s iwc/* tag family (iwc/variant-calling, iwc/rna-seq, …). They will not be identical — gxy-sketches’ domain is a single value chosen by the LLM; the Foundry’s iwc/* is a multi-tag classification seeded from IWC directory layout (see CORPUS_INGESTION.md). Document the mapping in meta_tags.yml descriptions for iwc/* keys; do not force a merge.
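A documented mapping can stay a plain lookup table. The sketch below is illustrative only: the domain values are copied from the enum above, but the table holds just the two iwc/* tags this note names, and iwc_tags_for_domain is a hypothetical helper. Unmapped domains return an empty list rather than a guessed tag, which matches the “do not force a merge” stance.

```python
# Domain values copied from the gxy-sketches enum quoted in this note.
DOMAINS = (
    "variant-calling", "assembly", "rna-seq", "single-cell", "metagenomics",
    "epigenomics", "proteomics", "phylogenetics", "long-read", "amplicon",
    "structural-variants", "qc", "annotation", "other",
)

# Only the pairs named in this note; every other entry is intentionally absent.
DOMAIN_TO_IWC = {
    "variant-calling": ["iwc/variant-calling"],
    "rna-seq": ["iwc/rna-seq"],
}


def iwc_tags_for_domain(domain: str) -> list:
    """Map a single-valued sketch domain to zero or more iwc/* tags."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown gxy-sketches domain: {domain!r}")
    # One domain may map to several tags, since iwc/* is multi-valued.
    return list(DOMAIN_TO_IWC.get(domain, []))
```

The list-valued table reflects the asymmetry described above: domain is a single value, while iwc/* is a multi-tag classification.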
5. Inventory gap surfaced by alignment: summarize-galaxy-workflow
gxy-sketches treats IWC (gxformat2 .ga files + planemo *-tests.yml) as a first-class source. The Foundry has no summarize-galaxy-workflow Mold — its source axis is paper/nextflow/cwl only. Worth adding to the inventory:
- It would serve compare-against-iwc-exemplar directly (the structural diff Mold needs a structured view of the exemplar workflow it compares against; that view is currently unspecified).
- It would let a future sketch cast target be populated from IWC workflows entirely inside the Foundry pipeline, without re-implementing gxy-sketches’ IWC ingestor.
- It mirrors summarize-cwl cleanly (same target, Galaxy, on the input side).
Add to MOLDS.md as a candidate; do not commit until the first walk-through.
6. Inventory gap: decision-aid layer
The sketch body’s ## When to use / ## Do not use when sections are a routing/decision aid, not technical content. The Foundry’s summarize-* Molds are technical (DAG, containers, IO). The Foundry has no Mold today that produces “given this workflow, when should an agent reach for it?” content.
If a future sketch cast target is added (§7), it will need either:
- A new Mold (derive-routing-aid or similar) that consumes a Foundry summary and emits the decision-aid sections, or
- A source outside the Foundry for the decision-aid sections (gxy-sketches’ own LLM prompt, the harness, …).
This is a deferred design question, not a v1 commit. Recorded here so it is not lost when the alignment is revisited.
7. Possible cast target: sketch
A plausible — not committed — Foundry cast target is sketch: a summarize-nextflow / summarize-galaxy-workflow / summarize-cwl Mold cast emits a SKETCH.md-shaped artifact (frontmatter + the technical body sections). gxy-sketches today derives this content via a single LLM call against the raw workflow files; a Foundry-cast version would be derived from a structured summary instead.
Open questions if/when this is pursued:
- Does the cast emit only the technical sections (## Analysis outline, ## Key parameters, ## Test data, ## Reference workflow) and leave the routing sections to gxy-sketches’ own pipeline?
- Or does the Foundry add a routing-aid Mold (§6) and own the whole sketch?
- Either way, this is a v2+ concern. v1: align field names, do not entangle pipelines.
What is not aligned
Explicit non-goals, so future contributors do not retrofit them:
- Storage backends. gxy-sketches is Python + pydantic + typer + frontmatter lib + plain markdown. The Foundry is TypeScript + Astro + Ajv. No code sharing.
- Validators. gxy-sketches’ validator enforces a sketch-directory bundle contract (file presence, orphan files, 5 MB cap, name uniqueness). The Foundry’s validator enforces a content-collection contract (frontmatter schema, wiki-link integrity, tag coherence, Mold ref resolution). Different jobs.
- 5 MB test-fixture cap. Sketch-bundle-only invariant. Do not propagate into Foundry summary schemas.
- LLM generation pipeline. gxy-sketches: one-shot prompt-cached call per workflow, JSON output, deterministic frontmatter fill-in by the writer. Foundry: per-Mold cast with per-kind dispatch over typed references (patterns LLM-condensed, schemas verbatim, CLI command pages → JSON sidecars). Different shapes for different jobs.
- IWC mirroring. gxy-sketches clones IWC into workflows_cache/. The Foundry does not; patterns cite IWC by URL (CORPUS_INGESTION.md). If summarize-galaxy-workflow lands (§5), revisit whether the Foundry needs an IWC clone for that Mold’s runtime; the answer should still be “no” if the generated skill operates on URLs supplied at runtime.
Site placement
This is project-infrastructure research, not runtime Mold knowledge. It belongs on the design
landing page with other developer-facing records. It should not be pulled into cast artifacts unless
a future sketch target makes that relationship explicit.
Open questions
- Is it worth adding a summarize-galaxy-workflow Mold note to the inventory before walk-throughs, or should that wait until walk-throughs force it?
- Does gxy-sketches’ boss want field-name parity adopted on their side too (e.g., would they rename anything to match Foundry choices), or is parity a one-way concession from Foundry?
- Is a sketch cast target on the v2+ roadmap, or does it stay out of scope permanently, with gxy-sketches owning sketch generation end-to-end?
- gxy-sketches v2 lists Snakemake + WDL; the Foundry’s source axis would need snakemake / wdl if alignment is symmetric. Defer until those v2 ingestors land in gxy-sketches.