PR #22692 Research: LLM Eval Harness for Agents
PR Overview
| Field | Value |
|---|---|
| Author | dannon |
| State | MERGED |
| Created | 2026-05-14 |
| Merged | 2026-05-21 |
| Merge SHA | 1f1eb3eaf1 |
Verified against origin/dev | 3eba535f3b |
| Labels | kind/enhancement, area/testing, area/API |
| Branch | dannon/agent-evals-harness |
Builds directly on PR 21942 - Shared Agent Operations and MCP Server (extends AgentOperationsManager + the in-process MCP server) and the agent fleet from PR 21434 - AI Agent Framework and ChatGXY. Complements PR 22070 - Static YAML Agent Backend for Deterministic Testing — static backend tests the plumbing deterministically; this harness measures real-LLM behavior. Explicitly does not supersede tcollins2011’s pytest+HTML-dashboard eval work in dannon/galaxy#64; rubrics and pricing here are adapted from it.
Summary
Adds an on-demand (not-CI) real-LLM evaluation harness under test/evals/ that runs curated datasets against the agents in lib/galaxy/agents/, scores each model with deterministic + LLM-as-judge evaluators, and emits a markdown comparison table with baseline diffing. Built on pydantic-evals; ships two runners sharing dataset/evaluator/report code — a fast mocked-trans CLI (run_evals.py) and a live-Galaxy pytest runner (test/integration/test_live_evals.py). The harness is deliberately excluded from CI: real LLMs are slow, flaky, and cost money — it’s a tool for picking default models and iterating on prompts with measurement instead of vibes. The PR also lands an agent-fleet change that fell out of the GCC2026 demo work: list_file_source_templates / list_user_file_sources on AgentOperationsManager, exposed both as router fast-path tools and as MCP tools, plus a router-prompt section teaching the file-source configure-then-export flow — so “How do I upload this to Omero?” stops being refused as out-of-scope.
Architecture
Eval harness layout — test/evals/ (all NEW, 19 tracked files)
Run as python -m evals.run_evals from the test/ directory (cd test first) — the package imports as evals relative to that CWD; from repo root it is physically test/evals/. See Cross-checks. Layout per README.md:
__init__.py— package docstring.datasets/— one module per dataset;__init__.pyre-exports seven*_dataset()builders.evaluators.py— custom scorers.judge.py— builds anOpenAIChatModelfor LLM-as-judge.specs.py—SPECSregistry mapping dataset name →build_*(task + evaluators).tasks.py— wraps agents as pydantic-evals task callables.run_evals.py— the CLI (+722 lines, largest file in the PR).pricing.py— per-million-token cost table (adapted from PR #64eval_utils.py).models.yaml.sample— sample model config; realmodels.yamlis gitignored.results/.gitkeep— placeholder;results/*gitignored.seed_staining_quantification_history.py— seeds a demo history for the live runner.
run_evals.py CLI
- Module docstring (lines 1–18) documents invocation
cd test; python -m evals.run_evals. Readsevals/models.yaml, falling back toevals/models.yaml.sample. - Relative package imports (lines 44–47):
from .judge import build_judge_model,from .pricing import model_cost,from .specs import SPECS,from .tasks import make_deps. parse_args()(line 517) flags:--models,--datasets(default = allSPECS.keys()),--judge-model(defaultgpt-oss-120b),--model-config,--include-galaxy-required,--only,--max-concurrency(default 4),--repeat(default 1),--results-dir(defaultevals/results),--no-write,--baseline.run_eval_suite()(line 581) is pure runtime — no argparse / file IO — so both the CLI and the live pytest runner call it with differentdeps_factorycallables.- Pass thresholds:
1.0for deterministic scorers,0.7forLLMJudge(lines 224, 292, 345). - Output path
results_dir / f"{stamp}-{slug}-{_git_sha()}.md"(line 657);_git_sha()at line 75; report footer stamps the SHA (line 402). Baselines are saved outside the repo with the SHA in the filename, so any baseline pairs back to the commit it was generated from. Writes both.mdand a JSON sidecar.
Datasets — test/evals/datasets/
Seven datasets, all registered in SPECS (specs.py lines 169–177):
- routing.py —
(query, expected handoff target)cases forQueryRouterAgent; deterministicHandoffMatch. Docstring records it was seeded from theTEST_QUERIESformerly intest_agents.py(theTestAgentConsistencyLiveLLMclass — see Tests). - error_analysis.py — prose failure descriptions for
ErrorAnalysisAgent;MustMentionkeyword check + per-caseLLMJudge. - tool_recommendation.py — “what tool for X?” for
ToolRecommendationAgent;MustMentionAny+ optionalLLMJudge. Runs without a live toolbox, so it measures model prior knowledge, not grounded search. - router_tool_use.py — inventory/availability queries that should trigger router fast-path tool calls; deterministic
ToolCallMatch(OR-semantics overexpected_tool_calls). First caseneoform_what_tools_installedis the explicit regression target for #21661 (router answered “what tools are installed?” with an essay instead of callingsearch_tools). - bioinformatics_workflows.py — analysis-shaped questions,
LLMJudge-only; rubrics adapted from PR #64test_bioinformatics_workflows.py. - orchestrator_planning.py — compound queries;
OrchestratorPlanIncludesagainstagents_used. Adapted from PR #64test_agent_routing.py. - staining_quantification.py — end-to-end bioimaging use case (brightfield RGB, color deconvolution, per-ROI quant, Omero export); per-case
LLMJudgerubrics. This is the GCC2026 “Live26” demo content the body alludes to; some cases carryrequires_galaxy=Trueand are off by default. Contains theomero_upload_guidancecase (line 90), also present in routing.py (line 201).
Evaluators — evaluators.py
Custom pydantic_evals Evaluator subclasses: HandoffMatch (line 20, exact route match), MustMention (line 28, all keywords), MustMentionAny (line 44, any keyword), OrchestratorPlanIncludes (line 60, requires agent_type == 'orchestrator' AND a non-empty agents_used ∩ expected_agents_any), ToolCallMatch (line 83, any expected tool name appears in output['tool_calls']). LLMJudge is the pydantic-evals built-in, used directly in dataset modules.
Judge / pricing / tasks
judge.py—build_judge_model(model_name, base_url, api_key) -> OpenAIChatModel(line 11) wraps anOpenAIProviderso the judge calls the same proxy as the agents under test.pricing.py—_PRICINGdict (input/output $/M tokens) across Anthropic/OpenAI/Google/local/SambaNova;model_cost()returns cost, unknown/local → $0.tasks.py—make_deps()(line 66) buildsGalaxyAgentDependenciesfrom aMagicMocktrans/config. It special-cases theFileSourceInstancesManager(_app_getitem, line 109) so the new file-source fast-path tools don’t crash under mocked deps — returns empty summaries / index.make_live_deps()(line 123) builds deps from a realtransfor the integration runner. Task factories wrap each agent:make_router_task,make_router_content_task,make_error_analysis_task,make_tool_recommendation_task,make_orchestrator_plan_task(readsorchestrator._get_agent_plan),make_router_inspect_task(bypassesprocess(), reads raw tool calls via_extract_tool_calls).
Agent-operations file-source methods — lib/galaxy/agents/operations.py
class AgentOperationsManager (line 58):
list_file_source_templates(self) -> dict[str, Any]— line 1035. Readsself.file_source_instances_manager.summaries, dumps each non-hidden template (model_dump(mode="json")), returns{"templates": [...], "count": n}.list_user_file_sources(self) -> dict[str, Any]— line 1050. RaisesValueError("User must be authenticated")if notrans.user; elsefile_source_instances_manager.index(self.trans), returns{"file_sources": [...], "count": n}.- Backing property
file_source_instances_manager(line 168) lazily importsgalaxy.managers.file_source_instances.FileSourceInstancesManagerand resolves viaself.app[...].
Router fast-path additions — lib/galaxy/agents/router.py
Two new @agent.tools registered in _register_fast_path_tools() (line 89), each running the ops method off-thread via anyio.to_thread.run_sync: list_file_source_templates (line 198) and list_user_file_sources (line 212). HEAD has 10 @agent.tool decorators (8 pre-existing browsing tools + these 2).
Router prompt — lib/galaxy/agents/prompts/router.md
- Scope bullet (line 14) for remote data repositories Galaxy integrates as file sources (Omero, Dropbox, S3, Zenodo, Invenio, Google Drive…).
- Fast-path tool entries for the two new tools (lines 56–57).
- New “Remote data repositories (file sources)” subsection (lines 59–67) explaining the plugin-template-then-instance model and a 3-step configure-then-export flow.
- Two “Important Distinctions” lines (159–160) routing Omero/Dropbox/S3 upload questions to direct answers via the new tools.
MCP tool exposure — lib/galaxy/webapps/galaxy/api/mcp.py
@mcp.tool() def list_file_source_templates(...) (line 1031) and @mcp.tool() def list_user_file_sources(...) (line 1043), each wrapping the corresponding ops-manager method under _mcp_error_handler.
.gitignore
Adds test/evals/models.yaml and test/evals/results/* with a !test/evals/results/.gitkeep negation.
Tests
Added (present at HEAD)
test/unit/app/managers/test_AgentOperationsManager.py(+65) — three net-new file-source tests:test_list_file_source_templates_filters_hidden(line 320, hidden templates excluded),test_list_user_file_sources(line 352, index→dump→count→fields),test_list_user_file_sources_requires_user(line 378,ValueErrorwhen no user).test/integration/test_live_evals.py(+179) — live-Galaxy runner wrappingevals.run_evals; seeds a staining history, runsrequires_galaxy=Truecases, writes a report. Gated byGALAXY_TEST_ENABLE_LIVE_LLM/GALAXY_TEST_LIVE_EVALS.- The seven eval datasets are themselves the bulk of the new behavioral surface but run on-demand, not in CI.
Modified — test/unit/app/test_agents.py (+11 / −78)
The 78 deletions removed two groups: (1) the model-capability-table tests (test_supports_structured_output_capability_table_match and siblings, plus their _capability_for_model / _load_model_capabilities imports), and (2) the entire TestAgentConsistencyLiveLLM class — the hardcoded query→route consistency tests the new routing dataset was meant to replace. The class docstring was rewritten from “three classes” to “two classes” and pointed readers at evals/.
The deletions did not durably stick (see Cross-checks): both groups are present again at the verified SHA.
Cross-checks vs PR body
- Body says
evals/+python -m evals.run_evals; files live attest/evals/. NOT a contradiction. The README (lines 98–107) andrun_evals.pydocstring (lines 3–7) both instructcd testfirst, after which the package imports asevalsand the relative imports resolve; defaultsevals/models.yamlandevals/resultsare relative totest/. ⚠️ Flag for readers: you mustcd test— running from repo root fails. list_file_source_templates/list_user_file_sourcesonAgentOperationsManager? YES — operations.py lines 1035 and 1050. ✓- Exposed as router fast-path tools? YES — router.py lines 198 and 212. ✓
- Exposed as MCP tools? YES — mcp.py lines 1031 and 1043. ✓
- Router prompt teaches configure-then-export flow? YES — router.md lines 59–67. ✓
- #21661 regression case present in
router_tool_use.py? YES —neoform_what_tools_installed,expected_tool_calls=["search_tools","get_server_info"], docstring cites the issue comment. ✓ - “live26_demo” dataset / “live26_omero_upload_guidance” case. ⚠️ FALSE as named: there is no
live26_demodataset and nolive26_omero_upload_guidancecase in the shipped code. The content shipped as thestaining_quantificationdataset with a case namedomero_upload_guidance(nolive26_prefix). The body’s “live26” naming is descriptive/aspirational, not a shipped identifier. - “red until IWC reintroduction lands.” The body said several demo cases stay red until the IWC and GTN branches merge. The IWC ops have since landed in
operations.py/mcp.py(see Changes since merge), but the entiretest/evals/tree is untouched since the merge — so no case was re-graded in-repo. Whether the red cases ever went green is not recorded here. - Provenance to PR #64 — confirmed in file headers (
pricing.py,bioinformatics_workflows.py,orchestrator_planning.py). ✓ - “86 passed, 4 skipped locally” — not independently runnable here (needs a live LLM); not contradicted.
Unresolved questions
- The PR deleted the capability-table tests and
TestAgentConsistencyLiveLLM, but both are back at HEAD — capability tests were already restored in the merge commit itself (merge with dev kept them), andTestAgentConsistencyLiveLLMwas re-added by post-merge commit948dd067(“producer/critic reflection loops”). Was the re-addition intentional, or an accidental merge revert? Theroutingdataset and the class it was meant to replace now coexist. - Harness is explicitly not-CI; nothing imports
evalsin CI, soSPECS/tasks.py/ dataset modules can drift out of sync with the agents undetected. Should there be a CI import-smoke test? tasks.make_depsdispatches theFileSourceInstancesManagermock by__name__string match — brittle if the manager class is renamed.- The body promised demo cases go green once IWC/GTN land; IWC landed but the eval tree is untouched. Where, if anywhere, were the red cases verified green?
tool_recommendationruns without a live toolbox, so it scores model prior knowledge rather than grounded search — is that the intended signal, or a coverage gap vs production behavior?
Changes since merge
Per-file git log 1f1eb3e..origin/dev:
test/evals/(entire tree) — UNTOUCHED since merge. The harness as shipped == HEAD.lib/galaxy/agents/operations.py— substantial IWC follow-up, none touching the file-source methods:ec647adb(search_iwc_workflows),499a5c14(get_iwc_workflow_details),857912f1(recommend_iwc_workflows),a93fcfa8(import_workflow_from_iwcop+MCP),a59bbefa,8740c102,1cd60568,56e09000,ded3e326,90c8911a. The “IWC reintroduction” branch the body said was still pending has since landed.lib/galaxy/webapps/galaxy/api/mcp.py— same IWC follow-ups; file-source MCP tools unchanged.lib/galaxy/agents/router.py—d4859b8d(propagate ChatGXY entity context through router); fast-path file-source tools unchanged.lib/galaxy/agents/prompts/router.md—1d922c22(@mention support for datasets/histories).test/unit/app/test_agents.py—948dd067re-addedTestAgentConsistencyLiveLLM(now line 1730); the capability-table tests were already restored at merge time.test/integration/test_live_evals.py,test/unit/app/managers/test_AgentOperationsManager.py, and the dataset modules — no follow-up commits.
File path migration
No touched file has been renamed/removed/relocated since the merge. The only path nuance is the evals/ vs test/evals/ working-directory convention (above), not an actual move.
Related
- PR 21942 - Shared Agent Operations and MCP Server — introduced
AgentOperationsManager+ the in-process MCP server; this PR adds the two file-source methods and MCP tools directly on top. - PR 21434 - AI Agent Framework and ChatGXY — defines the pydantic-ai multi-agent fleet (router, error analysis, tool recommendation, orchestrator) that the harness evaluates.
- PR 22070 - Static YAML Agent Backend for Deterministic Testing — the deterministic counterpart; static backend tests plumbing in CI, this harness measures real-LLM quality out of CI.
- PR 21692 - Standardize Agent API Schemas — establishes the
AgentResponseshape (agent_type,content,metadata) the eval tasks read. - Component - Agents Backend — backend agent architecture; the system under evaluation.
- Component - Agents UX — ChatGXY/GalaxyWizard surfaces whose routing decisions this harness scores.