TEST_AGENT_TOOLS_ISSUE

Test agent tool dispatch against a real DB in an event loop (aiocop regression guard)

Summary

Add an integration test that exercises the pydantic-ai @agent.tool dispatch chain end-to-end against a real Galaxy DB inside the FastAPI event loop, using pydantic_ai.models.test.TestModel to deterministically force the LLM to call every registered tool. This closes the coverage gap that let blocking sync Session I/O in lib/galaxy/agents/history_tools.py ship undetected (galaxyproject/galaxy#22361 review by @mvdbeek), and turns the aiocop middleware into a real regression guard for the agent tool path.

Placed under test/integration/ (not lib/galaxy_test/api/) because the test requires a custom Galaxy config — the test_model agent registry backend — which only the integration test driver supports per-class via handle_galaxy_config_kwds. See doc/source/dev/writing_tests.md decision tree and the cited precedent test/integration/test_event_loop_blocking.py, which uses the same aiocop assertion pattern.

Why / context

@mvdbeek (galaxyproject/galaxy#22361 review, 2026-05-19) flagged that history_tools.py declared async def functions doing synchronous SQLAlchemy I/O, awaited from async def pydantic-ai tools running on the uvicorn event loop — blocking every concurrent request on the worker. Fix shipped in c8b77adfde (threadpool offload via anyio.to_thread.run_sync(partial(...)) matching error_analysis.py:79 and router.py precedent).

His explicit follow-up question — “Our aiocop integration should have flagged this, is this executed in our test suite?” — exposes the real gap: no. aiocop is configured correctly (on by default for run_tests.sh API/integration suites, fails any request with X-Aiocop-Violations severity ≥ 50 via lib/galaxy_test/base/api.py:237), but nothing drives the agent tool dispatch path against a real DB inside an event loop:

So aiocop never sees the path, and the next analogous regression slips the same way. galaxyproject/galaxy-architecture#20 adds a gx-review-async-sync review-time check for the source pattern — this issue is the runtime-test complement.

Goal

A regression test that:

  1. Runs under ./run_tests.sh -integration (CI), no external LLM, no opt-in env var.
  2. Drives the real PageAssistantAgent (real registry, real @agent.tool registration, real await self.agent.run(...)) against a real DB and real history/page data.
  3. Forces pydantic-ai to invoke every registered tool deterministically (via TestModel(call_tools='all')). For PageAssistantAgent the registered tool surface is list_history_datasets, get_dataset_info, get_dataset_peek, get_collection_structure, resolve_hid (the @agent.tool callables in page_assistant.py:224-317), each delegating to the corresponding public function in history_tools.py (the agent tool is list_history_datasets; the wrapped helper is list_history_items).
  4. Asserts aiocop did not detect a blocking-I/O violation on the request — i.e., the existing _check_aiocop_violations in lib/galaxy_test/base/api.py is the assertion; no new assertion code needed.
  5. Would have failed prior to c8b77adfde and passes after — proving it’s a real guard.

Why TestModel works (mechanism)

pydantic_ai.models.test.TestModel default behavior (call_tools='all') iterates the agent’s registered function tools and emits a ToolCallPart for each, with auto-generated arguments derived from each tool’s schema (pydantic_ai/models/test.py:175-181, tool-call emission on first request at :216-228). That fires each registered @agent.tool callable in page_assistant.py:224-317, which awaits the corresponding public function in history_tools.py (list_history_items, get_dataset_info, get_dataset_peek, get_collection_structure, resolve_hid). Each public function now wraps its sync body in await anyio.to_thread.run_sync(partial(_list_history_items_impl, ...)) etc., so the session.execute(...) calls run on a worker thread. Without the fix, the _impl body runs on the event-loop thread and aiocop’s audit hook observes the blocking syscalls there.

Agent.override(model=...) (pydantic_ai/agent/abstract.py:1423) is the canonical scoped-swap mechanism, but it’s a per-call context manager applied at the call site — not a fit for swapping inside a long-lived BaseGalaxyAgent.agent that the test code can’t easily reach. The implementation uses a registry-level swap (see Design alternatives below).

What can go wrong (load-bearing details)

Two failure modes the design must address explicitly — caught during review and verified against aiocop.core.blocking_io.BLOCKING_EVENTS_DICT and pydantic-ai source.

1. The default AIOCOP_FAIL_SEVERITY = 50 threshold may not catch a sync session.execute on a warm pooled connection. Severity table:

EventSeverity
socket.getaddrinfo, socket.socket.connect50 (caught by default _check_aiocop_violations)
socket.socket.send, recv, sendall, recvfrom, …10 (NOT caught)
sqlite3.connect10
ssl.SSLSocket.send/recv/read/write10

A first DB call in a fresh SQLAlchemy pool emits getaddrinfo + socket.connect (50, caught). Subsequent calls reusing pooled connections emit only send/recv (10, not caught). The Galaxy test server warms its pool at startup, so by the time the test request fires, queries are likely pool-reuse and severity-10-only. Therefore the test cannot rely on the default _check_aiocop_violations assertion alone — it must directly inspect the X-Aiocop-Violations header and fail on any socket-send/recv event (which only an event-loop-thread sync DB call can produce inside an async handler). The default high-severity check stays as a separate safety net.

2. PageAssistantAgent declares structured output_type (page_assistant.py:184-211: ToolOutput(FullReplacementEdit, ...), ToolOutput(SectionPatchEdit, ...)). After the tool-call round, TestModel emits a second response choosing one output tool via _get_output (pydantic_ai/models/test.py:183-205) and auto-generates args via gen_tool_args against the chosen pydantic schema. If auto-generated args fail validation, pydantic-ai retries inside agent.run, and BaseGalaxyAgent._run_with_retry (base.py:415+) retries the whole run on top — burning request budget and potentially returning a 500 that masks the actual aiocop check. Mitigation: the registry swap must pass custom_output_args to TestModel with a hand-crafted valid FullReplacementEdit (e.g. {"new_content": "", "reasoning": ""}) so the second-round output validates immediately.

Acceptance criteria

Proposed implementation

1. Add a TestModel-backed registry (mirror StaticAgentRegistry)

Pattern is established by galaxyproject/galaxy#22070: factory.py checks inference_services.<switch> and returns an alternate AgentRegistry. Add a parallel switch:

Alternative to a registry subclass: a TestModelAgentDecorator that wraps any AgentRegistry.get_agent result and patches .agent lazily. Smaller surface, but harder to reason about for new maintainers. The subclass route mirrors StaticAgentRegistry exactly and is the cited precedent.

2. Add the test

test/integration/test_page_assistant_tool_dispatch.py (new) — integration test (not API) because it needs handle_galaxy_config_kwds to enable the test_model agent backend per server run. Mirrors the @pytest.mark.skipif(not AIOCOP_ENABLED, ...) gate from test/integration/test_event_loop_blocking.py. Sketch:

import pytest

from galaxy_test.base.api import AIOCOP_ENABLED
from galaxy_test.base.populators import DatasetPopulator, skip_without_agents
from galaxy_test.driver import integration_util


# Severity-10 socket events that should never appear on an async handler's
# request — their presence means a sync DB call ran on the event-loop thread.
_LOOP_BLOCKING_SOCKET_EVENTS = (
    "socket.socket.send", "socket.socket.sendall", "socket.socket.sendto",
    "socket.socket.recv", "socket.socket.recv_into", "socket.socket.recvfrom",
)


@pytest.mark.skipif(not AIOCOP_ENABLED, reason="GALAXY_TEST_AIOCOP not set")
@skip_without_agents
class TestPageAssistantToolDispatchIntegration(integration_util.IntegrationTestCase):
    """Drives the real PageAssistantAgent tool chain against a real DB.
    Guards against blocking sync I/O in @agent.tool wrappers (aiocop)."""

    @classmethod
    def handle_galaxy_config_kwds(cls, config):
        super().handle_galaxy_config_kwds(config)
        # Swap to the TestModel-backed registry; the real registry constructs
        # real agents, then TestModelAgentRegistry overrides each
        # BaseGalaxyAgent.agent.model with TestModel(call_tools='all').
        config["inference_services"] = {"test_model": True}

    def setUp(self):
        super().setUp()
        self.dataset_populator = DatasetPopulator(self.galaxy_interactor)

    def test_page_assistant_tool_chain_does_not_block_event_loop(self):
        history_id = self.dataset_populator.new_history()
        # Seed lazy-load paths: an HDA (creating_job lazy load) and a
        # collection (collection/elements/elem.hda lazy loads in
        # _get_collection_structure_impl).
        self.dataset_populator.new_dataset(history_id, content="seq\n")
        # … create a paired/list collection via DatasetCollectionPopulator
        # … create a history-attached page (populator added in #22361)
        page_id = self.dataset_populator.new_history_page(
            history_id=history_id, title="aiocop guard"
        )["id"]
        self.dataset_populator.wait_for_history(history_id)

        response = self._post(
            "chat",
            data={"query": "draft a summary", "page_id": page_id},
            json=True,
        )
        self._assert_status_code_is_ok(response)

        # Direct header inspection: catches severity-10 socket send/recv too,
        # which the default _check_aiocop_violations (>=50) would let through
        # on warm pooled DB connections.
        header = response.headers.get("x-aiocop-violations", "")
        fields = dict(p.split("=", 1) for p in header.split(";") if "=" in p)
        first = fields.get("first", "")
        assert not any(ev in first for ev in _LOOP_BLOCKING_SOCKET_EVENTS), (
            f"Sync DB on event loop detected: {header!r}"
        )
        # Sanity that TestModel swap landed (otherwise tools never fired and
        # the guard is a no-op). Use response metadata showing tools were
        # invoked — see Q6 for the exact assertion shape.

3. Verify the guard

Reproduction:

GALAXY_TEST_AIOCOP=1 ./run_tests.sh -integration test/integration/test_page_assistant_tool_dispatch.py

Before merge, revert the threadpool-offload changes on history_tools.py (pre-rebase SHA c8b77adfde; post-#22361-merge: the equivalent diff within the squashed merge commit) locally and rerun. Confirm the test fails with Sync DB on event loop detected: …socket.socket.send… (or recv). Capture the X-Aiocop-Violations header from the regressed run in the PR description as evidence — not “tests pass green,” but the actual blocking-event readout the test caught.

Design alternatives considered

Use feedback_decision_doc_authoring UPPER_SNAKE option naming. Recommendation: TEST_MODEL_VIA_REGISTRY. Brief table; full write-ups below.

OptionOne-linerIn scope?
TEST_MODEL_VIA_REGISTRYNew TestModelAgentRegistry swap mirroring StaticAgentRegistry, full real dispatch chain✅ Recommended
DIRECT_TOOL_INVOCATIONAPI/Python test that calls await list_history_items(real_session, ...) directly, no LLMUseful, narrower
LIVE_LLM_IN_CIRun with a real LLM in CI to exercise dispatch❌ Rejected
STATIC_BACKEND_ONLYTry to make #22070’s static backend cover this❌ Fundamentally cannot
MONKEYPATCH_AT_TEST_SETUPPatch BaseGalaxyAgent._create_agent in test fixtures❌ Test server is out-of-process

DIRECT_TOOL_INVOCATION

LIVE_LLM_IN_CI (rejected)

STATIC_BACKEND_ONLY (rejected)

MONKEYPATCH_AT_TEST_SETUP (rejected)

Open questions

Out of scope

Risks

References