Dashboard

Pr 21706 Data Analysis Agent Integration

Integrates data analysis agent with DSPy generating Python code executed in browser via Pyodide

Raw
Revised:
2026-05-21
Revision:
4
GitHub PR:
#21706
Related Notes:
Component - Agents Backend, Component - Agents UX, PR 21434 - AI Agent Framework and ChatGXY, PR 21692 - Standardize Agent API Schemas, PR 21942 - Shared Agent Operations and MCP Server, PR 22070 - Static YAML Agent Backend for Deterministic Testing

PR #21706 Research Summary: Data Analysis Agent Integration

PR Metadata

FieldValue
TitleData analysis agent integration
Number#21706
URLhttps://github.com/galaxyproject/galaxy/pull/21706
StateOPEN (not merged)
Authorqchiujunhao (JunhaoQiu)
Base Branchdev
Head Branchda-integration
Changed Files34
Additions8,614
Deletions103
Labelsarea/documentation, area/UI-UX, area/testing, kind/feature, area/API, area/util, area/dependencies, area/testing/integration

Summary

This PR integrates a “chat with your data” interactive tool into Galaxy’s agent framework. It adds a new data_analysis agent type that uses DSPy (CodeReact/ReAct modules) to generate Python analysis code, which is then executed in the browser using Pyodide (a WebAssembly Python runtime). The generated code can load user-selected datasets, produce plots/tables/files, and upload the resulting artifacts back to Galaxy as history datasets.

The architecture is a multi-step loop:

  1. User selects datasets in the ChatGXY UI and asks a question
  2. The router agent detects a data analysis request and hands off to DataAnalysisAgent
  3. DataAnalysisAgent uses DSPy to plan the analysis and generate Python code
  4. The backend returns a pyodide_task in the response metadata
  5. The frontend’s Pyodide Web Worker executes the code in the browser
  6. Results (stdout/stderr/artifacts) are POSTed back to the server via /api/chat/exchange/{id}/pyodide_result
  7. The server generates a follow-up agent response incorporating the execution results
  8. WebSocket streaming pushes the follow-up to the UI in real time

Architecture Overview

User Query + Dataset Selection
        |
        v
  ChatGXY.vue (frontend)
        |
        v
  POST /api/chat  (with dataset_ids)
        |
        v
  QueryRouterAgent --> hand_off_to_data_analysis
        |
        v
  DataAnalysisAgent.process()
        |
        v
  GalaxyDSPyPlanner.plan()  (DSPy CodeReact/ReAct)
        |
        v
  Response with metadata.pyodide_task = { code, packages, files, alias_map }
        |
        v
  Frontend: usePyodideRunner -> pyodide.worker.ts (Web Worker)
        |
        v
  Pyodide executes Python, generates artifacts
        |
        v
  POST /api/chat/exchange/{id}/artifacts  (upload each artifact as HDA)
  POST /api/chat/exchange/{id}/pyodide_result  (submit execution results)
        |
        v
  ChatExecutionService -> DataAnalysisAgent (follow-up reasoning)
        |
        v
  WebSocket broadcast -> ChatGXY.vue updates UI

File Inventory

New Files (14 files — none exist in current codebase)

FileLinesPurpose
client/src/components/ChatGXY/pyodide.worker.ts~570Web Worker: loads Pyodide, downloads datasets to virtual FS, executes Python, collects artifacts
client/src/composables/usePyodideRunner.ts~146Vue composable: manages Worker lifecycle, pending task map, result routing
lib/galaxy/agents/data_analysis.py~1047DataAnalysisAgent class: orchestrates DSPy planning, builds pyodide tasks, manages execution lifecycle
lib/galaxy/agents/dataset_resolver.py~62Resolves dataset references (by name/HID/ID) for DSPy tool calls
lib/galaxy/agents/dspy_adapter.py~634DSPy integration: GalaxyDSPyPlanner, CodeCaptureTool, DatasetLookupTool, GalaxyDataAnalysisModule
lib/galaxy/agents/examples.json~56 (JSON)Few-shot examples for DSPy planner (EDA, correlation, histograms, ML, etc.)
lib/galaxy/chat_dspy.py~1940Vendored DSPy workflow from ChatAnalysis project (reference implementation)
lib/galaxy/managers/chat_execution.py~315ChatExecutionService: handles pyodide result submissions, follow-up reasoning, artifact collection creation
lib/galaxy/util/pyodide.py~259Shared utilities: infer_requirements_from_python(), merge_execution_metadata(), dataset_descriptors_from_files()
doc/source/dev/data_analysis_temp_overview.md~39Temporary implementation notes
test/unit/agents/__init__.py2Package init
test/unit/agents/test_dspy_adapter_requirements.py~36Unit tests for infer_requirements_from_python()
test/unit/managers/__init__.py2Package init
test/unit/managers/test_chat_execution_metadata.py~85Unit tests for merge_execution_metadata()

Modified Files (20 files)

FileCurrent StatusChanges in PR
.gitignoreEXISTSAdds logs*.md pattern
client/src/api/index.tsEXISTS, no PR changes presentAdds HDACustom type export
client/src/components/ChatGXY.vueEXISTS, no PR changes presentMassive expansion (~1800+ lines): dataset selection, Pyodide execution loop, WebSocket streaming, artifact display, collapsible intermediate steps, analysis steps UI. Renames dataset_analyzer -> data_analysis, faChartBar -> faMicroscope
client/src/components/ChatGXY/ActionCard.vueEXISTS, no PR changes presentMinor: adds shouldShowArtifacts function
client/src/composables/agentActions.tsEXISTS, no PR changes presentAdds PYODIDE_EXECUTE action type
lib/galaxy/agents/__init__.pyEXISTS, no PR changes presentRegisters DataAnalysisAgent with try/except for optional deps
lib/galaxy/agents/base.pyEXISTS, no PR changes presentAdds DATA_ANALYSIS to AgentType, adds USE_PYDANTIC_AGENT flag, makes agent attribute Optional
lib/galaxy/agents/orchestrator.pyEXISTS, no PR changes presentAdds data_analysis to available agents list
lib/galaxy/agents/prompts/orchestrator.mdEXISTS, no PR changes presentAdds data_analysis to agent list
lib/galaxy/agents/prompts/router.mdEXISTS, no PR changes presentAdds routing rules for hand_off_to_data_analysis
lib/galaxy/agents/router.pyEXISTS, no PR changes presentAdds _create_data_analysis_handoff(), sentinel-based handoff to DataAnalysisAgent in process(), dataset context injection
lib/galaxy/dependencies/__init__.pyEXISTS, no PR changes presentAdds check_dspy_ai() method
lib/galaxy/dependencies/conditional-requirements.txtEXISTS, no PR changes presentAdds dspy-ai==2.5.43 and itsdangerous==2.2.0
lib/galaxy/managers/agents.pyEXISTS, no PR changes presentAdds dataset token signing (itsdangerous), _execute_data_analysis_dspy() method, _build_pyodide_task(), _dataset_descriptors()
lib/galaxy/managers/chat.pyEXISTS, no PR changes presentAdds execution_result role handling, dataset_ids in conversation data, richer message format for agents
lib/galaxy/schema/agents.pyEXISTS, no PR changes presentAdds PYODIDE_EXECUTE action type, Artifact, ExecutionTask, ExecutionResult models
lib/galaxy/schema/schema.pyEXISTS, no PR changes presentAdds agent_type, dataset_ids to ChatPayload; new PyodideResultPayload model; adds exchange_id, agent_response, dataset_ids, routing_info to ChatResponse; extra="allow" on ChatResponse
lib/galaxy/webapps/galaxy/api/chat.pyEXISTS, no PR changes presentAdds 6+ new endpoints: artifact upload, dataset download (signed token), WebSocket streaming, pyodide result submission; adds execution timeout/expiry logic, artifact URL refresh
pyproject.tomlEXISTS, no PR changes presentAdds agents optional dependency group
test/integration/test_agents.pyEXISTS, no PR changes presentAdds test_chat_with_dataset_context_records_execution_metadata

Key Patterns and Conventions

1. Optional Dependency Pattern

DSPy and itsdangerous are optional. All imports use try/except guards:

try:
    from .data_analysis import DataAnalysisAgent
except ImportError:
    DataAnalysisAgent = None

2. USE_PYDANTIC_AGENT Flag

DataAnalysisAgent sets USE_PYDANTIC_AGENT = False because it uses DSPy instead of pydantic-ai. BaseGalaxyAgent.__init__() checks this flag.

3. Sentinel-Based Handoff

The router uses a string sentinel __HANDOFF_TO_DATA_ANALYSIS__:: to signal handoff. When process() sees this prefix, it instantiates DataAnalysisAgent and runs it with the full context.

4. Signed Dataset Download Tokens

Uses itsdangerous.URLSafeTimedSerializer to create time-limited signed tokens for dataset downloads. The Pyodide worker fetches datasets via /api/chat/datasets/{id}/download?token=....

5. WebSocket Streaming

A per-exchange WebSocket at /api/chat/exchange/{id}/stream pushes exec_followup messages to the frontend after pyodide result processing completes.

6. Pyodide Task Lifecycle

States: pending -> running -> completed/error/timeout. The frontend tracks these via metadata.pyodide_status and pyodideExecutions reactive map.

7. Collapsible Intermediate Messages

Multi-step analysis produces intermediate messages that are auto-collapsed in the UI. Only the final “complete” message is shown expanded.


API Changes

New Endpoints

MethodPathPurpose
GET/api/chat/datasets/{dataset_id}/downloadStream dataset file with signed token auth
POST/api/chat/exchange/{exchange_id}/artifactsUpload Pyodide-generated artifacts as HDAs
POST/api/chat/exchange/{exchange_id}/pyodide_resultSubmit execution results, trigger follow-up
WS/api/chat/exchange/{exchange_id}/streamReal-time execution follow-up delivery

Modified Endpoints

MethodPathChanges
POST/api/chatAccepts agent_type and dataset_ids in body; returns dataset_ids, agent_response, exchange_id
GET/api/chat/exchange/{id}/messagesReturns execution_result role messages, dataset_ids, refreshed artifact URLs, stale pyodide task expiry

Schema Changes

ChatPayload: Added agent_type: Optional[str], dataset_ids: Optional[list[str]]

ChatResponse: Added model_config = ConfigDict(extra="allow"), exchange_id, agent_response, dataset_ids, routing_info

New PyodideResultPayload: task_id, stdout, stderr, artifacts, metadata, success

AgentResponse (agents.py): Added PYODIDE_EXECUTE action type

New models: Artifact, ExecutionTask, ExecutionResult


Frontend Changes

ChatGXY.vue (massive expansion)

  • Dataset selector: Uses FormData component to let users pick datasets from current history
  • Pyodide execution loop: Manages Web Worker lifecycle, task queuing, retry limits
  • WebSocket integration: Opens per-exchange WS connection for real-time updates
  • Artifact display: Image previews, download buttons, size formatting
  • Analysis steps UI: Thought/Action/Observation/Conclusion step rendering
  • Collapsible messages: Auto-collapse intermediate steps, expand final results
  • Busy state: isChatBusy computed ref blocks input during execution

New Composable: usePyodideRunner.ts

  • Singleton Web Worker pattern
  • Promise-based task execution with stdout/stderr/status hooks
  • Automatic artifact buffer extraction from Worker messages

New Worker: pyodide.worker.ts

  • Loads Pyodide v0.26.1 from CDN
  • Package installation via loadPackage + micropip fallback
  • Virtual filesystem: /data for inputs, /tmp/galaxy/outputs_dir/generated_file for outputs
  • Seeds Python env with load_dataset(), get_dataset_path() helpers
  • Pre-execution AST syntax check
  • Post-execution scalar summary + new file detection
  • Artifact collection with MIME type guessing

Cross-Reference with Current Codebase

Files That Need to Be Created (not in current codebase)

All 14 new files from the PR are absent. Key ones:

  • lib/galaxy/agents/data_analysis.py (1047 lines)
  • lib/galaxy/agents/dspy_adapter.py (634 lines)
  • lib/galaxy/managers/chat_execution.py (315 lines)
  • lib/galaxy/util/pyodide.py (259 lines)
  • client/src/composables/usePyodideRunner.ts (146 lines)
  • client/src/components/ChatGXY/pyodide.worker.ts (570 lines)

Files That Exist But Have NOT Been Modified

Every existing file that the PR modifies (20 files) currently shows no trace of the PR’s changes. Specifically:

  • HDACustom type is not exported in client/src/api/index.ts
  • dataset_analyzer still exists (not renamed to data_analysis) in ChatGXY.vue
  • PYODIDE_EXECUTE not in agentActions.ts or schema/agents.py
  • DATA_ANALYSIS not in AgentType class
  • USE_PYDANTIC_AGENT flag not in base.py
  • No handoff to data_analysis in router.py
  • No check_dspy_ai in dependencies/__init__.py
  • No dspy-ai or itsdangerous in conditional-requirements.txt
  • No dataset_ids in ChatPayload or ChatResponse
  • No PyodideResultPayload in schema.py
  • No artifact/streaming/pyodide endpoints in api/chat.py
  • No ChatExecutionService anywhere

Notable Divergence Points

The current codebase’s ChatGXY.vue still has dataset_analyzer (line 83) with faChartBar. The PR renames this to data_analysis with faMicroscope.

The current base.py has agent: Agent[GalaxyAgentDependencies, Any] as required (not Optional). The PR makes it Optional and adds the USE_PYDANTIC_AGENT bypass.


Design Decisions

  1. Browser-side execution: Code runs in Pyodide (WASM Python in browser) rather than server-side. This avoids sandboxing concerns on the server but limits available packages to those compiled for WASM.

  2. DSPy over pydantic-ai: The data analysis agent uses DSPy’s CodeReact/ReAct modules instead of pydantic-ai. This is why USE_PYDANTIC_AGENT = False exists.

  3. Signed download tokens: Dataset downloads for Pyodide use itsdangerous signed tokens with configurable TTL (default 600s) rather than session cookies, since the Web Worker can’t access cookies.

  4. Vendored reference: lib/galaxy/chat_dspy.py (1940 lines) is a vendored copy from the ChatAnalysis project. It imports psycopg2, pandas, dotenv, nicegui etc. and appears to be a reference/legacy implementation that is NOT used at runtime.

  5. Artifact-as-HDA pattern: Generated files are uploaded as standard Galaxy HDAs, maintaining compatibility with the existing history system. Multiple artifacts get grouped into a list collection.

  6. WebSocket for follow-ups: Instead of polling, the UI opens a WebSocket per exchange to receive execution follow-up messages in real time.

  7. Intermediate message collapsing: Multi-turn agent reasoning (plan -> execute -> observe -> refine) is collapsed in the UI to keep the conversation clean.


Dependencies Added

  • dspy-ai==2.5.43 (conditional: when AI backend configured)
  • itsdangerous==2.2.0 (conditional: when AI backend configured)

Test Coverage

  • Integration test: test_chat_with_dataset_context_records_execution_metadata — verifies dataset_ids flow through chat API and persist in exchange messages
  • Unit tests:
    • test_dspy_adapter_requirements.py — tests infer_requirements_from_python() (import mapping, explicit markers, stdlib filtering)
    • test_chat_execution_metadata.py — tests merge_execution_metadata() (pyodide_context preference, descriptor fallback, plot/file inference from artifacts)

Potential Integration Concerns

  1. lib/galaxy/chat_dspy.py (1940 lines) imports heavy dependencies (psycopg2, pandas, nicegui) that may not be available. It appears to be unused reference code — may want to exclude from integration.

  2. The PR is based on dev branch, not master. There may be changes in dev that haven’t reached the current history_markdown branch.

  3. ChatResponse gets ConfigDict(extra="allow") which is a breaking schema change — may affect API clients.

  4. The USE_PYDANTIC_AGENT pattern in BaseGalaxyAgent changes the constructor signature for all agents.

  5. The agent: Optional[Agent] change in base.py could have downstream effects if any code assumes self.agent is always set.

  6. The effective_agent_type logic in api/chat.py prefers body agent_type over query param — potential breaking change for existing callers using query param.

Incoming References (6)