Multiple History Support for ID-Based Workflow Extraction
Executive Summary
Yes, the plan should be updated to remove the single-history limitation.
The single-history restriction in WORKFLOW_EXTRACTION_HID_TO_ID_ISSUE.md is an artificial constraint inherited from HID-based assumptions, not a technical requirement. The extraction logic does not fundamentally require datasets to be in one history; it needs only a set of datasets and jobs and the relationships between them. ID-based extraction makes it possible to support cross-history datasets, which would fix the “copied dataset problem” (#9161, #13823) more comprehensively.
Analysis: Why Single-History Exists in Current Plan
In the HID-Based System
The current HID-based extraction requires a single history because:
- HID is history-scoped: HIDs only have meaning within a specific history. hid=5 in History A is unrelated to hid=5 in History B.
- WorkflowSummary iterates one history: Line 282 of extract.py (for content in self.history.visible_contents:) builds the hid_to_output_pair mapping by scanning one history.
- HID lookup requires context: The hid() method (lines 259-275) tries to map cross-history objects back to “current history” HIDs, but this is fragile.
In the Proposed ID-Based System
The plan carries over from_history_id as a required parameter, but this is not technically necessary when using IDs:
# From the plan (lines 226-232):
if dataset.history_id != history.id:
    # Cross-history dataset - could be supported in future
    raise exceptions.RequestParameterInvalidException(
        f"Dataset {dataset_id} is not in history {history.id}"
    )
This check is artificially restrictive. With ID-based lookup:
- Dataset ID uniquely identifies the dataset across all histories
- No HID collision is possible
- Connection tracing uses IDs not HIDs
Technical Feasibility of Cross-History Extraction
What Would Need to Change
1. API Changes
Current plan:
from_history_id: DecodedDatabaseIdField # Required
input_dataset_ids: List[DecodedDatabaseIdField]
Updated for cross-history:
from_history_id: Optional[DecodedDatabaseIdField] = None # Optional, for UI context only
input_dataset_ids: List[DecodedDatabaseIdField] # Required
input_dataset_collection_ids: List[DecodedDatabaseIdField]
The from_history_id becomes optional/context-only rather than required.
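As a rough sketch of the resulting request shape, the example below uses a plain dataclass with int IDs in place of DecodedDatabaseIdField. The ExtractionRequest class name is hypothetical, and job_ids is included only because extract_steps_by_ids takes it as an argument. It is meant to show that a request is complete without naming any history.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionRequest:
    """Hypothetical request shape; plain ints stand in for DecodedDatabaseIdField."""
    input_dataset_ids: List[int]
    input_dataset_collection_ids: List[int] = field(default_factory=list)
    job_ids: List[int] = field(default_factory=list)
    from_history_id: Optional[int] = None  # context only, not used for validation


# ID-only request: no history is named at all.
id_only = ExtractionRequest(input_dataset_ids=[101, 202], job_ids=[7])

# Request from the existing UI: the open history is passed along as context.
with_context = ExtractionRequest(input_dataset_ids=[101], from_history_id=1)
```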
2. Extraction Function Changes
Remove history-scoped validation:
# OLD (restrictive):
if dataset.history_id != history.id:
    raise RequestParameterInvalidException(...)
# NEW (permissive with access check):
if not self.hda_manager.is_accessible(dataset, trans.user):
    raise ItemAccessibilityException(...)
3. WorkflowSummary Changes
For ID-based extraction, WorkflowSummary would not iterate history.visible_contents. Instead:
def extract_steps_by_ids(trans, input_dataset_ids, input_collection_ids, job_ids):
    # Load datasets directly by ID
    for dataset_id in input_dataset_ids:
        dataset = trans.sa_session.get(model.HDA, dataset_id)
        # No history check needed, just permission check
    # Load jobs directly by ID
    for job_id in job_ids:
        job = trans.sa_session.get(model.Job, job_id)
        # Trace connections via IDs not HIDs
4. Connection Mapping Uses IDs
Current (HID-based):
hid_to_output_pair[hid] = (step, "output")
# Later...
if other_hid in hid_to_output_pair:
    # connect
ID-based:
id_to_output_pair[(content_type, content_id)] = (step, "output")
# Later...
if (content_type, content_id) in id_to_output_pair:
    # connect
This is already proposed in the plan. It naturally supports cross-history because IDs are globally unique.
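A self-contained sketch of the ID-keyed mapping follows. The job records are hypothetical dictionaries, not Galaxy Job objects, and the loop assumes jobs are visited in an order where producers come before consumers. It shows how connections fall out of (content_type, content_id) lookups with no reference to any history.

```python
# Hypothetical job records: each lists the (content_type, id) of its inputs and outputs.
jobs = [
    {"step": "step_1", "inputs": [("dataset", 10)], "outputs": [("dataset", 11)]},
    {"step": "step_2", "inputs": [("dataset", 11)], "outputs": [("dataset", 12)]},
]
input_ids = [("dataset", 10)]  # items the user marked as workflow inputs

# Seed the mapping with one input step per selected input.
id_to_output_pair = {}
for key in input_ids:
    id_to_output_pair[key] = (f"input_{key[1]}", "output")

connections = []
for job in jobs:
    # Connect each job input to whichever step produced it, if that step is known.
    for key in job["inputs"]:
        if key in id_to_output_pair:
            source_step, source_output = id_to_output_pair[key]
            connections.append((source_step, source_output, job["step"]))
    # Register this job's outputs so later jobs can connect to them.
    for key in job["outputs"]:
        id_to_output_pair[key] = (job["step"], "output")
```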
Benefits of Cross-History Extraction
1. Fixes Copied Dataset Problem Completely
From WORKFLOW_EXTRACTION_ISSUES.md (#9161):
When datasets are copied from other histories: all connections are broken; includes tools from original history.
With cross-history ID extraction:
- User copies dataset from History A to History B
- Runs tools on copy in History B
- Extraction request includes: datasets from History B, jobs from History B
- The copied dataset is marked as an input (its ID in B)
- Connections to jobs in B work correctly via IDs
- No “foreign jobs” pulled from History A
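The scenario above can be walked through with a toy model. The extract function and the job dictionaries are hypothetical and much simpler than the real extraction code. The sketch assumes the copy in History B has its own ID (20) distinct from the original in History A (10), and that only History B jobs are passed in.

```python
def extract(jobs, input_ids):
    """Toy extraction: returns (step names, connections) for the selected jobs only."""
    producers = {key: f"input_{key[1]}" for key in input_ids}
    steps, connections = list(producers.values()), []
    for job in jobs:  # assumes producers are listed before consumers
        steps.append(job["name"])
        for key in job["inputs"]:
            if key in producers:
                connections.append((producers[key], job["name"]))
        for key in job["outputs"]:
            producers[key] = job["name"]
    return steps, connections


# History A: an upstream tool produced dataset 10. This job is never selected.
job_a = {"name": "tool_in_A", "inputs": [("dataset", 1)], "outputs": [("dataset", 10)]}
# Dataset 10 was copied into History B as dataset 20, then processed there.
job_b1 = {"name": "tool_b1", "inputs": [("dataset", 20)], "outputs": [("dataset", 21)]}
job_b2 = {"name": "tool_b2", "inputs": [("dataset", 21)], "outputs": [("dataset", 22)]}

steps, connections = extract([job_b1, job_b2], input_ids=[("dataset", 20)])
```

Because the request names only the copy and the History B jobs, tool_in_A never appears among the steps, and both connections in History B are resolved.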
2. Supports Job Cache Outputs
When job caching returns outputs from a job in another history, ID-based extraction can still trace connections because it doesn’t require all outputs to be in the “current” history.
3. More Flexible Workflow Construction
Users could potentially construct workflows from multiple analysis sessions across histories:
- Select outputs from History A (training data processing)
- Select outputs from History B (validation data processing)
- Combine into single workflow
Permission Model Considerations
Current State
The existing code has minimal permission checking in extract.py:
# Line 97-98:
# Find each job, for security we (implicitly) check that they are
# associated with a job in the current history.
This “implicit” security relies on:
- User can only access their own history’s contents
- History access is checked at API level (history_manager.get_accessible)
Required Changes for Cross-History
Explicit Dataset Access Checks:
for dataset_id in input_dataset_ids:
    dataset = trans.sa_session.get(model.HDA, dataset_id)
    if not dataset:
        raise ObjectNotFound(f"Dataset {dataset_id} not found")
    # Explicit permission check
    if not hda_manager.is_accessible(dataset, trans.user):
        raise ItemAccessibilityException(
            f"Dataset {dataset_id} is not accessible to user"
        )
Explicit Job Access Checks:
for job_id in job_ids:
    job = job_manager.get_accessible_job(trans, job_id)
    # This already exists in jobs.py:340
Permission Scenarios:
| Scenario | Should Work? | Permission Check |
|---|---|---|
| User’s own dataset from another history | Yes | dataset.user_id == trans.user.id |
| Shared dataset user can access | Yes | Galaxy’s standard dataset access rules |
| Published dataset | Yes | dataset.dataset.published |
| Private dataset from another user | No | Access check fails |
| Anonymous user’s dataset (session-based) | Complex | May need session tracking |
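
The first four rows of the table can be expressed as a toy predicate. This is a deliberately simplified model: ToyDataset and is_accessible are hypothetical, Galaxy's real access rules are role-based and richer than an owner/share/published triple, and the anonymous-session row is left out because its behaviour is an open question.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ToyDataset:
    """Simplified stand-in; not how Galaxy stores ownership or permissions."""
    owner: str
    shared_with: Set[str] = field(default_factory=set)
    published: bool = False


def is_accessible(dataset: ToyDataset, user: Optional[str]) -> bool:
    """Toy rule: owner, explicit share, or published. Anonymous sessions are not modelled."""
    if dataset.published:
        return True
    if user is None:
        return False
    return user == dataset.owner or user in dataset.shared_with


scenarios = {
    "own dataset, other history": is_accessible(ToyDataset(owner="alice"), "alice"),
    "shared with user": is_accessible(ToyDataset(owner="bob", shared_with={"alice"}), "alice"),
    "published": is_accessible(ToyDataset(owner="bob", published=True), "alice"),
    "private, other user": is_accessible(ToyDataset(owner="bob"), "alice"),
}
```

Note that the owning history never enters the predicate, which is the property cross-history extraction relies on.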
Recommendation
Use existing HDAManager.get_accessible() or similar for each dataset/collection. This leverages Galaxy’s existing permission model without reinventing it.
UI Implications
Current UI Flow
- User opens extraction from current history
- UI shows all jobs/datasets from that history
- User selects items
- Extraction runs on single history
Cross-History UI Options
Option A: History-Scoped UI (Minimal Change)
- Keep current UI showing one history at a time
- User selects items from current history
- For copied datasets, UI could show “(copied from History X)”
- API accepts cross-history IDs but UI doesn’t expose it directly
Option B: Multi-History Selection (Future Enhancement)
- UI allows browsing multiple histories
- User can select items from different histories
- More complex but more powerful
Recommendation: Start with Option A. The primary benefit of cross-history support is handling copied datasets correctly, which happens transparently when using IDs. The UI can show the current history but the backend accepts any accessible dataset.
Required Changes to the Plan
1. Make from_history_id Optional
# Changed from required to optional
from_history_id: Optional[DecodedDatabaseIdField] = None
# If provided, used for:
# - Backward compatibility
# - UI context (which history was open)
# - Fallback for HID-based params
2. Remove Cross-History Validation Error
In extract_steps_by_ids():
# REMOVE this code:
if dataset.history_id != history.id:
    raise exceptions.RequestParameterInvalidException(...)
# REPLACE with:
self.hda_manager.error_unless_accessible(dataset, trans.user)
3. Add Permission Checks
def extract_steps_by_ids(trans, input_dataset_ids, ...):
    hda_manager = trans.app.hda_manager  # or inject via DI
    for dataset_id in input_dataset_ids:
        dataset = trans.sa_session.get(model.HDA, dataset_id)
        if not dataset:
            raise ObjectNotFound(f"Dataset {dataset_id} not found")
        hda_manager.error_unless_accessible(dataset, trans.user)
        # ... rest of step creation
4. Update API Documentation
:param from_history_id: Optional. The history context for extraction.
    Not required for ID-based extraction but may be used for UI purposes.
:param input_dataset_ids: Dataset IDs to use as workflow inputs.
    Datasets may be from any history the user can access.
5. Update Test Cases
Add tests for:
- Extracting with datasets from different histories (same user)
- Permission denied for inaccessible cross-history dataset
- Mixed: some datasets from current history, some from another
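The three cases could look roughly like the sketch below. Everything here is a stand-in: DATASETS is a toy fixture, resolve_inputs and AccessDenied are hypothetical names, and ownership is used as a crude proxy for accessibility. Real tests would go through Galaxy's API test helpers and populate actual histories.

```python
class AccessDenied(Exception):
    """Stand-in for the accessibility error raised by the real code."""


# Toy fixture: dataset id -> (history id, owner). Not Galaxy's data model.
DATASETS = {1: ("hist_a", "alice"), 2: ("hist_b", "alice"), 3: ("hist_c", "bob")}


def resolve_inputs(dataset_ids, user):
    """Hypothetical helper: return owning histories, or raise if a dataset is not the user's."""
    histories = []
    for dataset_id in dataset_ids:
        history, owner = DATASETS[dataset_id]
        if owner != user:
            raise AccessDenied(dataset_id)
        histories.append(history)
    return histories


def test_datasets_from_different_histories_same_user():
    assert resolve_inputs([1, 2], "alice") == ["hist_a", "hist_b"]


def test_inaccessible_cross_history_dataset_is_rejected():
    try:
        resolve_inputs([3], "alice")
    except AccessDenied:
        return
    raise AssertionError("expected AccessDenied")


def test_mixed_current_and_other_history():
    current = "hist_a"
    histories = resolve_inputs([1, 2], "alice")
    assert current in histories and any(h != current for h in histories)
```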
Recommendation
Update the plan to support cross-history extraction. Specifically:
- Phase 1 (Initial Implementation):
  - Add ID-based params as planned
  - Remove the history_id != history.id validation
  - Add explicit permission checks per dataset/collection
  - Keep from_history_id optional (for UI context, backward compat)
- Phase 2 (Vue UI):
  - UI continues to show current history
  - Copied datasets work correctly without special handling
  - Future: consider multi-history selection UI
- Documentation:
  - Document that ID-based extraction supports cross-history datasets
  - Note permission requirements
Justification:
- Fixes copied dataset bugs (#9161, #13823) more completely
- Minimal additional complexity (just permission checks)
- ID-based lookup naturally supports cross-history
- Single-history was only needed due to HID semantics
Unresolved Questions
- Should from_history_id be required for backward compat? Or can we make it fully optional from the start?
- Anonymous users: If a session-based user copies datasets between histories, does session tracking support cross-history access?
- Job access: When job cache is used, the job may be in a different history. Should we allow referencing jobs from other histories if user can access them?
- UI discovery: How does a user discover/select datasets from other histories for extraction? Is this needed in initial implementation or can we rely on copied datasets being in current history?
- Shared histories: If History A is shared with user B, can user B extract workflows using datasets from History A? (Probably yes if access checks pass, but need to verify.)
- Collection elements: For copied collections, do all elements need to be accessible, or just the HDCA itself?