PR #21706 Research Summary: Data Analysis Agent Integration
PR Metadata
| Field | Value |
|---|---|
| Title | Data analysis agent integration |
| Number | #21706 |
| URL | https://github.com/galaxyproject/galaxy/pull/21706 |
| State | OPEN (not merged) |
| Author | qchiujunhao (JunhaoQiu) |
| Base Branch | dev |
| Head Branch | da-integration |
| Changed Files | 34 |
| Additions | 8,614 |
| Deletions | 103 |
| Labels | area/documentation, area/UI-UX, area/testing, kind/feature, area/API, area/util, area/dependencies, area/testing/integration |
Summary
This PR integrates a “chat with your data” interactive tool into Galaxy’s agent framework. It adds a new data_analysis agent type that uses DSPy (CodeReact/ReAct modules) to generate Python analysis code, which is then executed in the browser using Pyodide (a WebAssembly Python runtime). The generated code can load user-selected datasets, produce plots/tables/files, and upload the resulting artifacts back to Galaxy as history datasets.
The architecture is a multi-step loop:
- User selects datasets in the ChatGXY UI and asks a question
- The router agent detects a data analysis request and hands off to
DataAnalysisAgent DataAnalysisAgentuses DSPy to plan the analysis and generate Python code- The backend returns a
pyodide_taskin the response metadata - The frontend’s Pyodide Web Worker executes the code in the browser
- Results (stdout/stderr/artifacts) are POSTed back to the server via
/api/chat/exchange/{id}/pyodide_result - The server generates a follow-up agent response incorporating the execution results
- WebSocket streaming pushes the follow-up to the UI in real time
Architecture Overview
User Query + Dataset Selection
|
v
ChatGXY.vue (frontend)
|
v
POST /api/chat (with dataset_ids)
|
v
QueryRouterAgent --> hand_off_to_data_analysis
|
v
DataAnalysisAgent.process()
|
v
GalaxyDSPyPlanner.plan() (DSPy CodeReact/ReAct)
|
v
Response with metadata.pyodide_task = { code, packages, files, alias_map }
|
v
Frontend: usePyodideRunner -> pyodide.worker.ts (Web Worker)
|
v
Pyodide executes Python, generates artifacts
|
v
POST /api/chat/exchange/{id}/artifacts (upload each artifact as HDA)
POST /api/chat/exchange/{id}/pyodide_result (submit execution results)
|
v
ChatExecutionService -> DataAnalysisAgent (follow-up reasoning)
|
v
WebSocket broadcast -> ChatGXY.vue updates UI
File Inventory
New Files (14 files — none exist in current codebase)
| File | Lines | Purpose |
|---|---|---|
client/src/components/ChatGXY/pyodide.worker.ts | ~570 | Web Worker: loads Pyodide, downloads datasets to virtual FS, executes Python, collects artifacts |
client/src/composables/usePyodideRunner.ts | ~146 | Vue composable: manages Worker lifecycle, pending task map, result routing |
lib/galaxy/agents/data_analysis.py | ~1047 | DataAnalysisAgent class: orchestrates DSPy planning, builds pyodide tasks, manages execution lifecycle |
lib/galaxy/agents/dataset_resolver.py | ~62 | Resolves dataset references (by name/HID/ID) for DSPy tool calls |
lib/galaxy/agents/dspy_adapter.py | ~634 | DSPy integration: GalaxyDSPyPlanner, CodeCaptureTool, DatasetLookupTool, GalaxyDataAnalysisModule |
lib/galaxy/agents/examples.json | ~56 (JSON) | Few-shot examples for DSPy planner (EDA, correlation, histograms, ML, etc.) |
lib/galaxy/chat_dspy.py | ~1940 | Vendored DSPy workflow from ChatAnalysis project (reference implementation) |
lib/galaxy/managers/chat_execution.py | ~315 | ChatExecutionService: handles pyodide result submissions, follow-up reasoning, artifact collection creation |
lib/galaxy/util/pyodide.py | ~259 | Shared utilities: infer_requirements_from_python(), merge_execution_metadata(), dataset_descriptors_from_files() |
doc/source/dev/data_analysis_temp_overview.md | ~39 | Temporary implementation notes |
test/unit/agents/__init__.py | 2 | Package init |
test/unit/agents/test_dspy_adapter_requirements.py | ~36 | Unit tests for infer_requirements_from_python() |
test/unit/managers/__init__.py | 2 | Package init |
test/unit/managers/test_chat_execution_metadata.py | ~85 | Unit tests for merge_execution_metadata() |
Modified Files (20 files)
| File | Current Status | Changes in PR |
|---|---|---|
.gitignore | EXISTS | Adds logs*.md pattern |
client/src/api/index.ts | EXISTS, no PR changes present | Adds HDACustom type export |
client/src/components/ChatGXY.vue | EXISTS, no PR changes present | Massive expansion (~1800+ lines): dataset selection, Pyodide execution loop, WebSocket streaming, artifact display, collapsible intermediate steps, analysis steps UI. Renames dataset_analyzer -> data_analysis, faChartBar -> faMicroscope |
client/src/components/ChatGXY/ActionCard.vue | EXISTS, no PR changes present | Minor: adds shouldShowArtifacts function |
client/src/composables/agentActions.ts | EXISTS, no PR changes present | Adds PYODIDE_EXECUTE action type |
lib/galaxy/agents/__init__.py | EXISTS, no PR changes present | Registers DataAnalysisAgent with try/except for optional deps |
lib/galaxy/agents/base.py | EXISTS, no PR changes present | Adds DATA_ANALYSIS to AgentType, adds USE_PYDANTIC_AGENT flag, makes agent attribute Optional |
lib/galaxy/agents/orchestrator.py | EXISTS, no PR changes present | Adds data_analysis to available agents list |
lib/galaxy/agents/prompts/orchestrator.md | EXISTS, no PR changes present | Adds data_analysis to agent list |
lib/galaxy/agents/prompts/router.md | EXISTS, no PR changes present | Adds routing rules for hand_off_to_data_analysis |
lib/galaxy/agents/router.py | EXISTS, no PR changes present | Adds _create_data_analysis_handoff(), sentinel-based handoff to DataAnalysisAgent in process(), dataset context injection |
lib/galaxy/dependencies/__init__.py | EXISTS, no PR changes present | Adds check_dspy_ai() method |
lib/galaxy/dependencies/conditional-requirements.txt | EXISTS, no PR changes present | Adds dspy-ai==2.5.43 and itsdangerous==2.2.0 |
lib/galaxy/managers/agents.py | EXISTS, no PR changes present | Adds dataset token signing (itsdangerous), _execute_data_analysis_dspy() method, _build_pyodide_task(), _dataset_descriptors() |
lib/galaxy/managers/chat.py | EXISTS, no PR changes present | Adds execution_result role handling, dataset_ids in conversation data, richer message format for agents |
lib/galaxy/schema/agents.py | EXISTS, no PR changes present | Adds PYODIDE_EXECUTE action type, Artifact, ExecutionTask, ExecutionResult models |
lib/galaxy/schema/schema.py | EXISTS, no PR changes present | Adds agent_type, dataset_ids to ChatPayload; new PyodideResultPayload model; adds exchange_id, agent_response, dataset_ids, routing_info to ChatResponse; extra="allow" on ChatResponse |
lib/galaxy/webapps/galaxy/api/chat.py | EXISTS, no PR changes present | Adds 6+ new endpoints: artifact upload, dataset download (signed token), WebSocket streaming, pyodide result submission; adds execution timeout/expiry logic, artifact URL refresh |
pyproject.toml | EXISTS, no PR changes present | Adds agents optional dependency group |
test/integration/test_agents.py | EXISTS, no PR changes present | Adds test_chat_with_dataset_context_records_execution_metadata |
Key Patterns and Conventions
1. Optional Dependency Pattern
DSPy and itsdangerous are optional. All imports use try/except guards:
try:
from .data_analysis import DataAnalysisAgent
except ImportError:
DataAnalysisAgent = None
2. USE_PYDANTIC_AGENT Flag
DataAnalysisAgent sets USE_PYDANTIC_AGENT = False because it uses DSPy instead of pydantic-ai. BaseGalaxyAgent.__init__() checks this flag.
3. Sentinel-Based Handoff
The router uses a string sentinel __HANDOFF_TO_DATA_ANALYSIS__:: to signal handoff. When process() sees this prefix, it instantiates DataAnalysisAgent and runs it with the full context.
4. Signed Dataset Download Tokens
Uses itsdangerous.URLSafeTimedSerializer to create time-limited signed tokens for dataset downloads. The Pyodide worker fetches datasets via /api/chat/datasets/{id}/download?token=....
5. WebSocket Streaming
A per-exchange WebSocket at /api/chat/exchange/{id}/stream pushes exec_followup messages to the frontend after pyodide result processing completes.
6. Pyodide Task Lifecycle
States: pending -> running -> completed/error/timeout. The frontend tracks these via metadata.pyodide_status and pyodideExecutions reactive map.
7. Collapsible Intermediate Messages
Multi-step analysis produces intermediate messages that are auto-collapsed in the UI. Only the final “complete” message is shown expanded.
API Changes
New Endpoints
| Method | Path | Purpose |
|---|---|---|
| GET | /api/chat/datasets/{dataset_id}/download | Stream dataset file with signed token auth |
| POST | /api/chat/exchange/{exchange_id}/artifacts | Upload Pyodide-generated artifacts as HDAs |
| POST | /api/chat/exchange/{exchange_id}/pyodide_result | Submit execution results, trigger follow-up |
| WS | /api/chat/exchange/{exchange_id}/stream | Real-time execution follow-up delivery |
Modified Endpoints
| Method | Path | Changes |
|---|---|---|
| POST | /api/chat | Accepts agent_type and dataset_ids in body; returns dataset_ids, agent_response, exchange_id |
| GET | /api/chat/exchange/{id}/messages | Returns execution_result role messages, dataset_ids, refreshed artifact URLs, stale pyodide task expiry |
Schema Changes
ChatPayload: Added agent_type: Optional[str], dataset_ids: Optional[list[str]]
ChatResponse: Added model_config = ConfigDict(extra="allow"), exchange_id, agent_response, dataset_ids, routing_info
New PyodideResultPayload: task_id, stdout, stderr, artifacts, metadata, success
AgentResponse (agents.py): Added PYODIDE_EXECUTE action type
New models: Artifact, ExecutionTask, ExecutionResult
Frontend Changes
ChatGXY.vue (massive expansion)
- Dataset selector: Uses
FormDatacomponent to let users pick datasets from current history - Pyodide execution loop: Manages Web Worker lifecycle, task queuing, retry limits
- WebSocket integration: Opens per-exchange WS connection for real-time updates
- Artifact display: Image previews, download buttons, size formatting
- Analysis steps UI: Thought/Action/Observation/Conclusion step rendering
- Collapsible messages: Auto-collapse intermediate steps, expand final results
- Busy state:
isChatBusycomputed ref blocks input during execution
New Composable: usePyodideRunner.ts
- Singleton Web Worker pattern
- Promise-based task execution with stdout/stderr/status hooks
- Automatic artifact buffer extraction from Worker messages
New Worker: pyodide.worker.ts
- Loads Pyodide v0.26.1 from CDN
- Package installation via
loadPackage+ micropip fallback - Virtual filesystem:
/datafor inputs,/tmp/galaxy/outputs_dir/generated_filefor outputs - Seeds Python env with
load_dataset(),get_dataset_path()helpers - Pre-execution AST syntax check
- Post-execution scalar summary + new file detection
- Artifact collection with MIME type guessing
Cross-Reference with Current Codebase
Files That Need to Be Created (not in current codebase)
All 14 new files from the PR are absent. Key ones:
lib/galaxy/agents/data_analysis.py(1047 lines)lib/galaxy/agents/dspy_adapter.py(634 lines)lib/galaxy/managers/chat_execution.py(315 lines)lib/galaxy/util/pyodide.py(259 lines)client/src/composables/usePyodideRunner.ts(146 lines)client/src/components/ChatGXY/pyodide.worker.ts(570 lines)
Files That Exist But Have NOT Been Modified
Every existing file that the PR modifies (20 files) currently shows no trace of the PR’s changes. Specifically:
HDACustomtype is not exported inclient/src/api/index.tsdataset_analyzerstill exists (not renamed todata_analysis) inChatGXY.vuePYODIDE_EXECUTEnot inagentActions.tsorschema/agents.pyDATA_ANALYSISnot inAgentTypeclassUSE_PYDANTIC_AGENTflag not inbase.py- No handoff to data_analysis in
router.py - No
check_dspy_aiindependencies/__init__.py - No
dspy-aioritsdangerousinconditional-requirements.txt - No
dataset_idsinChatPayloadorChatResponse - No
PyodideResultPayloadinschema.py - No artifact/streaming/pyodide endpoints in
api/chat.py - No
ChatExecutionServiceanywhere
Notable Divergence Points
The current codebase’s ChatGXY.vue still has dataset_analyzer (line 83) with faChartBar. The PR renames this to data_analysis with faMicroscope.
The current base.py has agent: Agent[GalaxyAgentDependencies, Any] as required (not Optional). The PR makes it Optional and adds the USE_PYDANTIC_AGENT bypass.
Design Decisions
-
Browser-side execution: Code runs in Pyodide (WASM Python in browser) rather than server-side. This avoids sandboxing concerns on the server but limits available packages to those compiled for WASM.
-
DSPy over pydantic-ai: The data analysis agent uses DSPy’s CodeReact/ReAct modules instead of pydantic-ai. This is why
USE_PYDANTIC_AGENT = Falseexists. -
Signed download tokens: Dataset downloads for Pyodide use
itsdangeroussigned tokens with configurable TTL (default 600s) rather than session cookies, since the Web Worker can’t access cookies. -
Vendored reference:
lib/galaxy/chat_dspy.py(1940 lines) is a vendored copy from the ChatAnalysis project. It importspsycopg2,pandas,dotenv,niceguietc. and appears to be a reference/legacy implementation that is NOT used at runtime. -
Artifact-as-HDA pattern: Generated files are uploaded as standard Galaxy HDAs, maintaining compatibility with the existing history system. Multiple artifacts get grouped into a list collection.
-
WebSocket for follow-ups: Instead of polling, the UI opens a WebSocket per exchange to receive execution follow-up messages in real time.
-
Intermediate message collapsing: Multi-turn agent reasoning (plan -> execute -> observe -> refine) is collapsed in the UI to keep the conversation clean.
Dependencies Added
dspy-ai==2.5.43(conditional: when AI backend configured)itsdangerous==2.2.0(conditional: when AI backend configured)
Test Coverage
- Integration test:
test_chat_with_dataset_context_records_execution_metadata— verifies dataset_ids flow through chat API and persist in exchange messages - Unit tests:
test_dspy_adapter_requirements.py— testsinfer_requirements_from_python()(import mapping, explicit markers, stdlib filtering)test_chat_execution_metadata.py— testsmerge_execution_metadata()(pyodide_context preference, descriptor fallback, plot/file inference from artifacts)
Potential Integration Concerns
-
lib/galaxy/chat_dspy.py(1940 lines) imports heavy dependencies (psycopg2,pandas,nicegui) that may not be available. It appears to be unused reference code — may want to exclude from integration. -
The PR is based on
devbranch, notmaster. There may be changes in dev that haven’t reached the currenthistory_markdownbranch. -
ChatResponsegetsConfigDict(extra="allow")which is a breaking schema change — may affect API clients. -
The
USE_PYDANTIC_AGENTpattern inBaseGalaxyAgentchanges the constructor signature for all agents. -
The
agent: Optional[Agent]change in base.py could have downstream effects if any code assumesself.agentis always set. -
The
effective_agent_typelogic inapi/chat.pyprefers bodyagent_typeover query param — potential breaking change for existing callers using query param.