INITIAL_GXY_SKETCHES_ALIGNMENT

Alignment: gxy-sketches ↔ Galaxy Workflow Foundry

Adjacent project, different target, overlapping ingest sources. This doc records the seams where the Foundry’s per-source summary Molds (summarize-paper, summarize-nextflow, summarize-cwl) and the eventual summarize-galaxy-workflow should align with gxy-sketches, so the two projects stay legible to each other without becoming dependent.

gxy-sketches is not a Foundry project. It is owned by another contributor (boss). The Foundry does not consume it at runtime, casting time, or build time. “Alignment” here means: shared field names, shared mental model for test manifests, and a documented mapping for vocabularies — nothing more invasive.

What gxy-sketches is

Repo: /Users/jxc755/projects/repositories/gxy-sketches (no public URL recorded here).

Pydantic types worth knowing (src/gxy_sketches/schema.py):

domain enum (from prompts.py): variant-calling | assembly | rna-seq | single-cell | metagenomics | epigenomics | proteomics | phylogenetics | long-read | amplicon | structural-variants | qc | annotation | other.

Why it is adjacent, not the same target

| | gxy-sketches | Foundry |
|---|---|---|
| Question answered | “Which analysis class do I pick for this user request?” | “How do I build this Galaxy/CWL workflow from this source?” |
| Consumer | gxy3 agent at routing time | Harnesses doing source→target translation, validation, debug |
| Per-workflow unit | One SKETCH.md (frontmatter + decision-aid prose) | summarize-<source> Mold output (structured JSON per schemas/summary-<source>.schema.json) |
| Source coverage | nf-core, IWC (v1) | paper, nextflow, cwl. No IWC ingest; IWC is cited by URL in pattern bodies (see INITIAL_CORPUS_INGESTION.md) |
| Output shape | Markdown + YAML frontmatter | JSON Schema-validated structured data |
| Test fixtures | Bundled into the sketch dir, capped at 5 MB total | Referenced as data; no bundling, no size cap |
| Generation | One-shot LLM call per workflow, prompt-cached system prompt | Per-Mold cast, per-kind dispatch over typed references |

Concrete alignment moves

All of these are cheap and additive, and none of them blocks either project. Each is a recommendation, not a contract.

1. Test-fixture field-name parity

The Foundry’s summary-nextflow.schema.json and summary-cwl.schema.json will need a test-fixture sub-block. Adopt TestDataRef / ExpectedOutputRef field names verbatim:

Rationale: a Foundry summary becomes structurally consumable by anyone using gxy-sketches’ TestManifest shape, and vice versa. There is no code dependency, just naming convergence. This is cheap to do now and hard to retrofit later.

Drop: gxy-sketches’ constraint that path must be under test_data/. That is a sketch-bundle invariant: the validator enforces it because the sketch directory bundles fixtures. The Foundry’s summary describes a workflow’s test fixtures as data; bundling is out of scope. Also drop the 5 MB cap, which is likewise a sketch-bundle invariant and irrelevant to summary schemas.
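To make the difference concrete, here is a minimal sketch of the two acceptance rules for a fixture path. The function names and the exact checks are assumptions for illustration only; the real gxy-sketches validator is not reproduced here and may differ.

```python
from pathlib import PurePosixPath


def is_bundled_fixture_path(path: str) -> bool:
    """Hypothetical sketch-bundle rule: a relative path that sits under test_data/."""
    p = PurePosixPath(path)
    return (
        not p.is_absolute()
        and ".." not in p.parts
        and len(p.parts) > 1
        and p.parts[0] == "test_data"
    )


def accepted_by_foundry_summary(path: str) -> bool:
    """Hypothetical Foundry rule: the summary only records a reference,
    so any non-empty string is accepted and nothing is bundled."""
    return bool(path)


examples = ["test_data/reads.fastq", "https://example.org/reads.fastq"]
results = {e: (is_bundled_fixture_path(e), accepted_by_foundry_summary(e)) for e in examples}
```

The same field name (path) carries both kinds of value; only the sketch side adds the location constraint.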

2. Tool-spec parity

summarize-galaxy-tool and summarize-nextflow / summarize-cwl all need to enumerate tools with versions. Match ToolSpec(name, version). Same field names.
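A minimal sketch of what matching ToolSpec(name, version) could look like on the Foundry side. gxy-sketches defines its types with Pydantic; this example uses a standard-library dataclass as a stand-in, and the str field types, the tools_block helper, and the tool names are all assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Stand-in for the ToolSpec(name, version) shape; field types are assumed."""
    name: str
    version: str


def tools_block(pairs):
    """Hypothetical helper: build a summary's tool list as plain dicts
    that carry exactly the two shared keys."""
    return [asdict(ToolSpec(name=n, version=v)) for n, v in pairs]


# Placeholder tool names and versions, not taken from any real workflow.
block = tools_block([("example-aligner", "1.0.0"), ("example-caller", "2.3")])
```

Because only the key names are shared, either project can change how it builds or validates the list without touching the other.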

3. Source-record parity

When the Foundry’s summarize-paper / summarize-nextflow / summarize-cwl (and a future summarize-galaxy-workflow, see §5) emit a “this is where I came from” block, mirror SketchSource field names: ecosystem, workflow, url, version, license, slug.

The Foundry’s ecosystem vocabulary should be a superset: gxy-sketches has nf-core | iwc | snakemake-workflows | wdl. The Foundry’s source axis is currently paper | nextflow | cwl; if the Foundry adds an IWC summarizer (§5), use iwc, not a new term.
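One way to picture the source record and the superset rule is sketched below. The six field names come from SketchSource as listed above; the field types, which fields are optional, the SourceRecord class name, and the reading of "superset" as a plain union with the Foundry's source axis are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional

GXY_SKETCHES_ECOSYSTEMS = frozenset({"nf-core", "iwc", "snakemake-workflows", "wdl"})
FOUNDRY_SOURCE_AXIS = frozenset({"paper", "nextflow", "cwl"})
# Assumption: the Foundry vocabulary is the union, so every gxy-sketches term stays valid.
FOUNDRY_ECOSYSTEMS = GXY_SKETCHES_ECOSYSTEMS | FOUNDRY_SOURCE_AXIS


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical Foundry-side mirror of the SketchSource field names."""
    ecosystem: str
    workflow: str
    url: str
    version: Optional[str] = None   # optionality is assumed
    license: Optional[str] = None   # optionality is assumed
    slug: Optional[str] = None      # optionality is assumed

    def __post_init__(self):
        # Reject terms outside the shared vocabulary instead of coining new ones.
        if self.ecosystem not in FOUNDRY_ECOSYSTEMS:
            raise ValueError(f"unknown ecosystem: {self.ecosystem!r}")


# Placeholder workflow name and URL.
record = SourceRecord(ecosystem="iwc", workflow="example-workflow",
                      url="https://example.org/example-workflow")
```

An IWC summarizer would then emit ecosystem="iwc", which a gxy-sketches consumer already understands.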

4. Domain ↔ iwc/* vocabulary mapping

gxy-sketches’ fixed domain enum overlaps heavily with Foundry’s iwc/* tag family (iwc/variant-calling, iwc/rna-seq, …). They will not be identical — gxy-sketches’ domain is a single value chosen by the LLM; the Foundry’s iwc/* is a multi-tag classification seeded from IWC directory layout (see INITIAL_CORPUS_INGESTION.md). Document the mapping in meta_tags.yml descriptions for iwc/* keys; do not force a merge.
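The documented mapping could be as small as a lookup from a single domain value to zero or more iwc/* tags. In the sketch below, the domain list is the enum quoted earlier in this doc; the mapping table is hypothetical and contains only the two iwc/* tags named above, and the function name is an assumption.

```python
# The gxy-sketches domain enum, as listed in this doc.
DOMAINS = (
    "variant-calling", "assembly", "rna-seq", "single-cell", "metagenomics",
    "epigenomics", "proteomics", "phylogenetics", "long-read", "amplicon",
    "structural-variants", "qc", "annotation", "other",
)

# Hypothetical mapping: only the two tags named in this doc are filled in.
# Unmapped domains stay unmapped rather than being forced into a tag.
DOMAIN_TO_IWC_TAGS = {
    "variant-calling": ["iwc/variant-calling"],
    "rna-seq": ["iwc/rna-seq"],
}


def iwc_tags_for(domain: str) -> list:
    """Return the iwc/* tags documented for a domain, or an empty list."""
    if domain not in DOMAINS:
        raise ValueError(f"not a gxy-sketches domain: {domain!r}")
    return list(DOMAIN_TO_IWC_TAGS.get(domain, []))


tags = iwc_tags_for("rna-seq")
```

Returning a list keeps the one-to-many direction open: a single domain may correspond to several iwc/* tags, and some domains may have none.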

5. Inventory gap surfaced by alignment: summarize-galaxy-workflow

gxy-sketches treats IWC (gxformat2 .ga files + planemo *-tests.yml) as a first-class source. The Foundry has no summarize-galaxy-workflow Mold — its source axis is paper/nextflow/cwl only. Worth adding to the inventory:

Add to INITIAL_MOLDS.md as a candidate; do not commit until the first walk-through.

6. Inventory gap: decision-aid layer

The sketch body’s ## When to use / ## Do not use when sections are a routing/decision aid, not technical content. The Foundry’s summarize-* Molds are technical (DAG, containers, IO). The Foundry has no Mold today that produces “given this workflow, when should an agent reach for it?” content.

If a future sketch cast target is added (§7), it will need either:

This is a deferred design question, not a v1 commit. Recorded here so it is not lost when the alignment is revisited.

7. Future cast target: sketch

A plausible — not committed — Foundry cast target is sketch: a summarize-nextflow / summarize-galaxy-workflow / summarize-cwl Mold cast emits a SKETCH.md-shaped artifact (frontmatter + the technical body sections). gxy-sketches today derives this content via a single LLM call against the raw workflow files; a Foundry-cast version would be derived from a structured summary instead.

Open questions if/when this is pursued:

What is not aligned

Explicit non-goals, so future contributors do not retrofit them:

Linking moves

Done in this commit:

Suggested future, not done here:

Open questions