IDEA_GTN_BACKGROUND


Pitch framing

The Galaxy Workflow Foundry establishes a reusable architectural pattern, Pattern 1: KB → Compile → Skill, in which a schema-bound, version-controlled, community-curated knowledge base is compiled at build time into target-portable agent skills with full provenance back to source. To prove the pattern generalizes beyond workflow conversion, we propose a second instantiation in a structurally different domain: compiling Galaxy Training Network (GTN) tutorials into learner-facing tutoring skills. GTN brings 525+ peer-reviewed tutorials with a strict YAML frontmatter contract, a community of 527 contributors, ELIXIR alignment, and active i18n work. Together these make it the ideal second-domain proof that compile-time KB→skill is a discipline-wide capability, not a single-purpose trick.

1. GTN current state

The Galaxy Training Network is hosted at training.galaxyproject.org and developed in the open at github.com/galaxyproject/training-material (1.1k forks, 36k+ commits, CC-BY 4.0 content, MIT code). As of mid-2026 the site reports 525 tutorials across 35 topics, 527 contributors, and roughly 11 years of project history. Coverage spans the full Galaxy methodology surface (assembly, epigenetics, metabolomics, proteomics, sequence analysis, single-cell, transcriptomics, variant analysis, imaging, machine learning, climate, computational chemistry, ecology, digital humanities), plus 55 admin and 39 developer tutorials.

Two anchor papers: Batut et al., Cell Systems 6(6):752–758 (2018), “Community-Driven Data Analysis Training for Biology”; and Hiltemann, Rasche, et al., PLOS Computational Biology 19(1):e1010752 (2023), “Galaxy Training: A Powerful Framework for Teaching!” A 2023 CoRDI paper documents FAIR-aligned scaling and TIaaS (Training Infrastructure as a Service), which served 17,000+ students across 330+ events between 2018 and 2022.

Frontmatter schema (per the GTN’s own “GTN Metadata” tutorial and the in-repo bin/schema-*.yaml files):

i18n status (2026): The EU-funded BioNT consortium (Digital Europe grant 101100604, 2023–2026) has produced human-quality translations of key GTN tutorials and the full FAQ set into German, Spanish, and Italian, with the Freiburg Galaxy Team as lead partner. The schema already supports translation linkage — the bottleneck is content volume, not infrastructure.
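The translation linkage can be illustrated with a minimal sketch of how a compiler might fan out one skill target per language. It assumes the `lang` and `translations` frontmatter keys hold a language code and a list of language codes respectively; the `SkillTarget` type, the `skill_targets` function, the default of `en`, and the tutorial slug are hypothetical names introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SkillTarget:
    """One compile target: a tutorial rendered as a skill in one language."""
    tutorial: str  # tutorial slug (hypothetical identifier)
    lang: str      # language code of the emitted skill


def skill_targets(slug: str, frontmatter: dict) -> list[SkillTarget]:
    """Return one compile target per language declared in the frontmatter."""
    # Assumption: a missing `lang` key means the tutorial is written in English.
    source_lang = frontmatter.get("lang", "en")
    translations = frontmatter.get("translations") or []
    # Source language first, then translations, duplicates dropped in order.
    langs = list(dict.fromkeys([source_lang, *translations]))
    return [SkillTarget(slug, lang) for lang in langs]


targets = skill_targets(
    "example-tutorial",
    {"lang": "en", "translations": ["es", "de", "es"]},
)
print([t.lang for t in targets])  # ['en', 'es', 'de']
```

Under this design a newly shipped translation changes only the frontmatter, so the set of emitted skills grows without any change to the compiler itself.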

2. Existing GTN tooling

The repo ships a substantial validation and build apparatus that compile-time skill generation can ride on:

Already structured: frontmatter, hands-on boxes (numbered steps + nested Tip/Solution/Comment/Question), interactive tours (YAML in config/plugins/tours/, step selector + content), workflows (JSON), quizzes, FAQs, slide decks (with speaker notes consumed by TTS).

Still prose-only: free-text rationale paragraphs inside hands-on blocks, screenshots, narrative bridges between steps. This is exactly the surface where a compile step adds value: agents need both the structured action and the surrounding pedagogical context, with the latter surfaced as structured intent.
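As a rough sketch of what such a compile step might do, the following parser splits a hands-on box into numbered steps, pulling out the tool ID and parameters as structured action and keeping the remaining free text as rationale. It assumes the blockquote layout and the `{% tool [Label](tool_id) %}` tag as used in GTN hands-on boxes; the exact markup should be checked against the repo. The `Step` type, the `parse_hands_on` function, and the sample text (including its tool ID) are illustrative, not taken from any real tutorial.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Optional

# Assumed tag shape: {% tool [Label](tool_id) %}
TOOL_TAG = re.compile(r"\{%\s*tool\s*\[(?P<label>[^\]]+)\]\((?P<tool_id>[^)]+)\)\s*%\}")
STEP_START = re.compile(r"^(\d+)\.\s+(.*)$")
# Assumed parameter shape: - *"Parameter name"*: value
PARAM = re.compile(r'^-\s+.*\*"(?P<name>[^"]+)"\*:\s*(?P<value>.+)$')


@dataclass
class Step:
    number: int
    action: str
    tool_id: Optional[str] = None
    params: dict[str, str] = field(default_factory=dict)
    rationale: list[str] = field(default_factory=list)


def parse_hands_on(block: str) -> list[Step]:
    """Split a hands-on blockquote into steps with action, params, rationale."""
    steps: list[Step] = []
    for raw in block.splitlines():
        line = re.sub(r"^>\s?", "", raw).strip()  # drop the blockquote marker
        if not line or line.startswith("<hands-on-title>") or line.startswith("{:"):
            continue  # skip blanks, the box title, and the kramdown class line
        start = STEP_START.match(line)
        if start:
            text = start.group(2)
            tool = TOOL_TAG.search(text)
            steps.append(Step(
                number=int(start.group(1)),
                action=TOOL_TAG.sub(r"\g<label>", text),
                tool_id=tool.group("tool_id") if tool else None,
            ))
        elif steps:
            param = PARAM.match(line)
            if param:
                steps[-1].params[param.group("name")] = param.group("value").strip("`")
            else:
                steps[-1].rationale.append(line)  # prose kept as pedagogical context
    return steps


SAMPLE = """\
> <hands-on-title>Quality control</hands-on-title>
>
> 1. {% tool [FastQC](toolshed.example.org/repos/demo/fastqc/fastqc/1.0) %} with the following parameters:
>    - *"Raw read data from your current history"*: `reads.fastq`
>
>    FastQC gives a first impression of read quality before trimming.
>
> 2. Inspect the generated HTML report
{: .hands_on}
"""

for step in parse_hands_on(SAMPLE):
    print(step.number, step.tool_id, step.params, step.rationale)
```

Keeping the rationale attached to the step it follows is the design choice that matters here: the agent receives the action and the reason for it as one record rather than as separate retrieved passages.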

3. Why GTN is the right second domain

5. Agent-readiness story: what a compiled GTN skill does that vanilla RAG can’t

Concrete scenarios — each impossible with prose RAG, native to compiled skills:

(a) Walk a user through differential expression (DE) analysis on their own Galaxy account. The compiled skill carries the structured hands-on sequence: tool ID, version, parameter dict, and expected input collection type. The agent invokes each tool via Galaxy’s Tool Request API against the learner’s history rather than paraphrasing a command. The tutorial’s linked tested workflow is the ground-truth oracle.

(b) Adapt to the user’s actual dataset. Frontmatter declares EDAM ontology terms and expected input shape (paired collection, FASTQ, count matrix, etc.). Compile step emits a precondition check; agent can recognize “your dataset is single-end, the tutorial assumes paired-end” and adapt or redirect to the matching tutorial — discoverable because GTN tutorials are cross-linked via requirements / follow_up_training.

(c) Detect history divergence. Each hands-on step has a known post-state (datasets produced, tool invocations recorded). The compiled skill encodes these as checkpoints; the agent compares them against the learner’s actual Galaxy history (now accessible via the History Graph API from PR 21932, covered in this vault’s research corpus) and surfaces “you skipped FastQC; recommend re-running before proceeding.”

(d) Citation per step. Every compiled action carries provenance: source tutorial slug, git commit SHA, contributor list, Zenodo DOI for data, EDAM terms, license. Agent surfaces these inline — solving the “where did the AI get this from” problem that’s blocking AI adoption in regulated/scientific contexts.

(e) i18n by construction. Because lang and translations are first-class frontmatter, the compiler emits one skill per language from the same source graph. A Spanish-speaking learner gets a Spanish-grounded tutor automatically when BioNT (or future translators) ship the translation — no separate prompt engineering, no model fine-tuning.
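Scenarios (c) and (d) can be sketched together: checkpoints that carry their own provenance, compared against the tools actually run in a learner's history. This is a minimal illustration under stated assumptions, not the Foundry's implementation. The `Provenance` and `Checkpoint` types, both functions, and all sample values are hypothetical, and the history is a plain list of tool IDs standing in for whatever the Galaxy API would return; no real API call is made.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    tutorial: str  # source tutorial slug
    commit: str    # git commit SHA the skill was compiled from
    license: str


@dataclass(frozen=True)
class Checkpoint:
    step: int
    tool_id: str   # tool expected to have run by the end of this step
    provenance: Provenance


def find_divergence(checkpoints: list[Checkpoint],
                    history_tool_ids: list[str]) -> list[Checkpoint]:
    """Return checkpoints whose tool never appears in the learner's history."""
    seen = set(history_tool_ids)
    return [c for c in checkpoints if c.tool_id not in seen]


def advise(missing: list[Checkpoint]) -> list[str]:
    """Render one cited recommendation per missing checkpoint."""
    return [
        f"Step {c.step}: no run of {c.tool_id} found; recommend re-running "
        f"before proceeding (source: {c.provenance.tutorial}"
        f"@{c.provenance.commit[:7]}, {c.provenance.license})"
        for c in missing
    ]


# Hypothetical sample data.
prov = Provenance("example-tutorial", "0123456789abcdef", "CC-BY-4.0")
checkpoints = [
    Checkpoint(1, "fastqc", prov),
    Checkpoint(2, "cutadapt", prov),
]
learner_history = ["upload1", "cutadapt"]  # stand-in for tools seen in the history

for message in advise(find_divergence(checkpoints, learner_history)):
    print(message)
```

Because each checkpoint holds a reference to its provenance record, every recommendation the agent surfaces can cite the tutorial and commit it was compiled from without a separate lookup.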

6. Galaxy Training community fit

GTN governance is mature and well-documented: topic maintainers safeguard per-domain content quality; the broader steering function is anchored by the long-tenured lead maintainers — Bérénice Batut (Freiburg, now Mulhouse), Saskia Hiltemann (Erasmus MC), Anthony Bretaudeau (INRAE), Helena Rasche (Erasmus MC), with Hans-Rudolf Hotz, Wendi Bacon, Nicola Soranzo, and others as regular reviewers. The 2023 PLOS CB paper documents 2,500+ PRs reviewed since 2016.

GTN as an ELIXIR resource: GTN is listed as an ELIXIR service (elixir-europe.org/services/galaxy-training-network), integrated with TeSS (the ELIXIR training portal) via BioSchemas markup, and aligned with the SPLASH recommendations for training life-cycle management. ELIXIR-UK, ELIXIR-IT, ELIXIR-DE all host Galaxy training instances. A 2025 GTN news item (“Enhancing Scientific Training: The Galaxy Training Network’s Role in the ELIXIR Training Life-Cycle”) formalizes this position.

Letter of support content: a GTN LoS would attest to (i) the schema stability and contributor velocity that make compile-time generation tractable; (ii) community willingness to accept compiler-driven frontmatter extensions (precedent: the typed contributions dict and edam_ontology were both added via an ADR-style process); (iii) co-development capacity through the annual GCC and the Galaxy Smörgåsbord training event; (iv) ELIXIR-aligned dissemination paths.

7. Suggested LOI landscape-analysis paragraph (≤200 words)

Compile-time generation of agent skills from curated knowledge bases is a new architectural pattern with no production precedent in life sciences. The narrowest adjacent work is Jeremy Howard’s llms.txt (Sept 2024) — an inference-time discovery file now served by Anthropic, Cloudflare, and Vercel — which addresses context-window fit but provides no schema contract, no provenance ledger, and no executable action surface. Anthropic’s Agent Skills format (2025) defines the bundle shape via progressive disclosure but ships no domain compiler. Khan Academy’s Khanmigo proves that curriculum-grounded tutors materially outperform unbounded chat by passing structured mastery state to GPT-4, but the curriculum is proprietary and the compiler is closed. The Carpentries’ 2025 community sessions on LLMs in workshops conclude that clumsy AI insertion harms learners and that deliberate curriculum-AI integration is required — yet provide no tooling. No prior work compiles a versioned, schema-bound, peer-reviewed scientific curriculum into provenance-bearing agent skills. The Galaxy Training Network — 525 tutorials, 527 contributors, ELIXIR-aligned, i18n-native — is uniquely positioned as the substrate, and the Workflow Foundry’s already-shipped KB→Compile→Skill pattern is the architectural template.

8. Risks and weaknesses

Open questions for the human

Sources