EVAL_SEARCH_SKILL

Evaluation review: find-shed-tool skill plans

Subagent review of PLAN_SEARCH_CLI.md and STRETCH_TOOLSHED.md from a testability/evaluability standpoint, plus concrete proposals for validating the skill itself (not the underlying CLI’s unit tests).

Captured verbatim from a research subagent run; kept here as input to the skill-design loop.


1. Plan critique

PLAN_SEARCH_CLI.md is sound and evaluable for the CLI half. Stages 1–3 have crisp inputs/outputs, JSON envelope locked, exit codes specified — easy to assert on. A few gaps that matter for skill eval (not CLI eval):

STRETCH_TOOLSHED.md is fine as a roadmap — clear effort/blast-radius framing. One eval-relevant note: B3 (embedding reranker) and A1 (carry version on hits) would each change the skill’s expected outputs. Any fixture corpus needs versioning so it can be re-baselined when those land. Worth a one-line note in the stretch doc that fixtures are tied to a search-backend snapshot version.

2. What “the skill works” means — concrete capability claims

An evaluation should verify each of these:

  1. Unambiguous tool name (“fastqc”, “samtools sort”, “bwa mem”) → picks iuc/<tool> (or documented preferred owner) in ≤3 CLI calls.
  2. Capability phrase (“trim adapters from paired-end fastq”) → picks a plausible tool from {trimmomatic, cutadapt, fastp, trim_galore} and explains the choice; eval grades “is this in the acceptable set” not “is this exactly X”.
  3. Multi-owner collision (e.g. tool_id bwa under devteam and iuc) → does not silently pick one; either applies the documented owner allowlist or surfaces ambiguity to the user.
  4. Zero hits (“kjsdfgh”) → exits cleanly with a “no tool found” message, does not loop endlessly or fabricate a trsToolId.
  5. Stale-index surface (recently published tool not yet indexed) → degrades gracefully; documents the staleness caveat to the user.
  6. Resolved version is sensible — the picked (trsToolId, version) is one TRS actually returns (no hallucinated versions).
  7. Cache landing — at end of run, galaxy-tool-cache info <trsId> returns the picked tool; downstream skills can consume it.
  8. Bounded cost — skill completes in ≤N CLI shell-outs and ≤M tokens for a typical query (set a budget so regressions are visible).
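
Claims 6 and 8 are the most mechanical to assert on. A minimal sketch of such a check, assuming the harness can summarize each run into a small record; RunRecord, check_run, the budgets, and the version strings are all hypothetical names and values for illustration, not part of gxwf or the skill:

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical summary of one skill run, assembled by the eval harness."""
    picked_tool_id: str
    picked_version: str
    cli_calls: int
    tokens: int


def check_run(run: RunRecord, known_versions: list[str],
              max_calls: int, max_tokens: int) -> list[str]:
    """Return failure messages; an empty list means the run passed."""
    failures = []
    # Claim 6: the picked version must appear in the recorded version listing.
    if run.picked_version not in known_versions:
        failures.append(
            f"version {run.picked_version!r} not listed for {run.picked_tool_id}"
        )
    # Claim 8: bounded cost, with budgets (N, M) chosen by the eval author.
    if run.cli_calls > max_calls:
        failures.append(f"{run.cli_calls} CLI calls exceeds budget {max_calls}")
    if run.tokens > max_tokens:
        failures.append(f"{run.tokens} tokens exceeds budget {max_tokens}")
    return failures


# Made-up values: a run that picks an unlisted version and overshoots the call budget.
run = RunRecord("iuc/fastqc", "9.9", cli_calls=4, tokens=1200)
for failure in check_run(run, known_versions=["1.0", "1.1"], max_calls=3, max_tokens=5000):
    print(failure)
```

Returning a list of messages rather than a single boolean keeps regressions visible per claim when one run fails several checks at once.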

3. Evaluation approaches, ordered by cost/value

A. Static fixture replay (cheapest, build first). Record gxwf tool-search "<q>" --json and gxwf tool-versions <id> --json outputs for a fixed corpus of ~30–50 queries against the live shed once, commit as JSON. Add a --fixture-dir flag (or GXWF_FIXTURE_DIR env) that the CLI honors instead of HTTP. Eval = “given fixture for query Q, the skill picks expected trsToolId X”. Captures pick-logic regressions without flakiness. Limit: doesn’t catch “does the skill pick the right query to search for” — the input is the query, not the prose intent.
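
The record/replay seam described in (A) could look roughly like the following. This is a sketch only: the file-naming scheme, the function names, and the {"hits": []} payload are assumptions made for illustration, and the real JSON envelope is whatever PLAN_SEARCH_CLI.md locks down.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fixture_path(fixture_dir: Path, command: str, arg: str) -> Path:
    """Map a (command, argument) pair to a stable, filesystem-safe file name."""
    digest = hashlib.sha256(arg.encode("utf-8")).hexdigest()[:16]
    return fixture_dir / f"{command}-{digest}.json"


def record(fixture_dir: Path, command: str, arg: str, payload: dict) -> Path:
    """Store one captured CLI response so later runs can replay it."""
    path = fixture_path(fixture_dir, command, arg)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def replay(fixture_dir: Path, command: str, arg: str) -> dict:
    """Return the recorded response, failing loudly when none was captured."""
    path = fixture_path(fixture_dir, command, arg)
    if not path.exists():
        raise FileNotFoundError(f"no fixture recorded for {command} {arg!r}")
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp)
    # Placeholder payload; a real fixture would hold the captured CLI output.
    record(fixtures, "tool-search", "fastqc", {"hits": []})
    print(replay(fixtures, "tool-search", "fastqc"))
```

Raising on a missing fixture, instead of falling back to HTTP, is what keeps the replay path free of flakiness: an unrecorded query shows up as a hard failure rather than a silent live call.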

B. Golden intent → pick dataset. Hand-curate ~30 (user prose, expected tool family) tuples drawn from: nf-core module names (good “I want tool X” cases), tools-iuc README descriptions (capability phrasing), workflow-reports examples that name tools, and 5–10 deliberately bad/ambiguous prompts. Sourcing is cheap: CAPHEINE (15 tools) is already available as a starter set per SKILLS_NF.md:262. Score = exact match for unambiguous names, set membership for capability phrases, “ambiguity surfaced” boolean for collisions.
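
The three scoring modes in (B) can be sketched as a single function. The case kinds, argument names, and tool identifiers below are illustrative assumptions, not a committed schema:

```python
from typing import Optional, Union


def score_case(kind: str, expected: Union[str, set, None],
               picked: Optional[str], surfaced_ambiguity: bool) -> bool:
    """Score one golden case according to its kind."""
    if kind == "unambiguous":
        # Exact match against the single expected pick.
        return picked == expected
    if kind == "capability":
        # Any member of the acceptable tool family passes.
        return picked in expected
    if kind == "collision":
        # Only the "ambiguity surfaced" flag matters, not which tool was picked.
        return surfaced_ambiguity
    raise ValueError(f"unknown case kind: {kind!r}")


trimmers = {"trimmomatic", "cutadapt", "fastp", "trim_galore"}
print(score_case("unambiguous", "iuc/fastqc", "iuc/fastqc", False))  # True
print(score_case("capability", trimmers, "fastp", False))            # True
print(score_case("collision", None, "devteam/bwa", False))           # False
```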

C. Agent-loop eval (medium cost, highest signal). Run the skill end-to-end via a Claude Code subagent against the fixture backend (A) and the intent corpus (B). The subagent runs the skill, the transcript is captured, and a grader (separate model call or human) scores it per the rubric in §2. This is what catches “skill picks wrong query term” because the skill drives the queries. Cap at ~50 cases initially; a full run takes ~10–20 min and should launch as one cost-bounded action.
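
The shape of that loop, independent of how the subagent is actually launched, might be sketched as below. run_skill and grade are injected callables and purely hypothetical; the stubs stand in for the real Claude Code invocation and grader call, neither of which is specified here.

```python
from typing import Callable


def run_eval(cases: list[dict],
             run_skill: Callable[[str], str],
             grade: Callable[[dict, str], bool],
             max_cases: int = 50) -> dict:
    """Run each intent through the skill, grade the transcript, tally results."""
    results = {}
    for case in cases[:max_cases]:  # hard cap keeps the run cost-bounded
        transcript = run_skill(case["intent"])
        results[case["id"]] = grade(case, transcript)
    return {
        "passed": sum(results.values()),
        "total": len(results),
        "results": results,
    }


# Stubs for illustration only; a real harness would shell out to the agent
# and call a grader model or queue the transcript for human review.
def fake_run_skill(intent: str) -> str:
    return f"transcript for: {intent}"


def fake_grade(case: dict, transcript: str) -> bool:
    return case["intent"] in transcript


cases = [
    {"id": "c1", "intent": "run fastqc on my reads"},
    {"id": "c2", "intent": "trim adapters from paired-end fastq"},
]
summary = run_eval(cases, fake_run_skill, fake_grade)
print(f"{summary['passed']}/{summary['total']} passed")
```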

D. Live shed runs, env-gated. EVAL_LIVE=1 npm run eval:skill — same intent corpus but hitting toolshed.g2.bx.psu.edu. Run nightly or weekly; flake budget tolerated. Catches search-backend drift, dead approved boost regressions, etc. Don’t gate CI on it.

E. Adversarial battery. Add to (B): typos (“fastqcc”), generic terms (“aligner”, “qc”), case variants (“FastQC” vs “fastqc” exercising the case asymmetry per research §6), single-letter queries (wildcard pathology), tool_id collisions (bwa), known-renamed tools. Grade on graceful degradation, not on picking “the right” answer.
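
Some of the adversarial inputs in (E) can be derived mechanically from the golden tool names. A small sketch, with hypothetical variant labels; it covers only typo, case, and single-letter variants, since collisions and renamed tools need hand curation:

```python
def adversarial_variants(name: str) -> dict[str, str]:
    """Derive a few stress-test queries from one known tool name."""
    if not name:
        raise ValueError("tool name must be non-empty")
    return {
        # Doubling the last character turns "fastqc" into "fastqcc".
        "typo_doubled_last_char": name + name[-1],
        # Upper-casing probes case asymmetry in the search backend.
        "upper_case": name.upper(),
        # A single letter probes the wildcard pathology.
        "single_letter": name[0],
    }


for label, query in adversarial_variants("fastqc").items():
    print(f"{label}: {query}")
```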

F. Skill-vs-skill A/B harness. Once you have (C), running two skill prompts (or two model versions) against the same fixture is free. Useful for prompt iteration on SKILL.md.
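
Given per-case pass/fail results from two runs over the same fixtures, the A/B comparison reduces to a diff. A minimal sketch, assuming each run yields a mapping from a hypothetical case id to a boolean:

```python
def compare_runs(results_a: dict[str, bool],
                 results_b: dict[str, bool]) -> dict[str, list[str]]:
    """List cases whose outcome changed between run A (baseline) and run B."""
    shared = sorted(results_a.keys() & results_b.keys())
    return {
        "regressed": [c for c in shared if results_a[c] and not results_b[c]],
        "fixed": [c for c in shared if not results_a[c] and results_b[c]],
    }


baseline = {"c1": True, "c2": False, "c3": True}
candidate = {"c1": True, "c2": True, "c3": False}
print(compare_runs(baseline, candidate))
# {'regressed': ['c3'], 'fixed': ['c2']}
```

Only cases present in both runs are compared, so adding new golden cases between prompt iterations does not register as a regression.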

4. What to instrument in CLI/skill to make eval cheap

5. Concrete recommendations — build first vs. defer

Build first (in this order):

  1. Fixture-replay seam in CLI (one flag, ~half-day). Unlocks A and C.
  2. Static fixture corpus + golden picks for 20 cases drawn from CAPHEINE + 5 ambiguous + 5 zero-hit. Markdown table of (query, expected pick or expected behavior) committed alongside the skill. Sub-day to author.
  3. Agent-loop harness — minimal: a script that invokes the skill via Claude Code on each intent, captures transcript+log, runs a grader prompt, prints pass/fail. ~1–2 days.

Defer:

The pattern matches workflow-reports (walked examples) but adds the fixture+harness layer. Don’t over-build: 20 golden cases + replay + simple grader beats 200 cases with no harness.

6. Open questions