Galaxy Implicit Dataset Conversion - Research Findings
Executive Summary
Galaxy’s “implicit dataset conversion” is a transparent mechanism allowing users to pass a History Dataset Association (HDA) of one datatype to a tool accepting a different datatype, provided a converter exists. The key insight: implicitly converted datasets are full HDAs with their own database IDs, linked to parent datasets through the ImplicitlyConvertedDatasetAssociation table, but marked invisible (visible=False) in history views.
1. Core Data Structure: ImplicitlyConvertedDatasetAssociation
Location: lib/galaxy/model/__init__.py:6877-6930
class ImplicitlyConvertedDatasetAssociation(Base, Serializable):
    __tablename__ = "implicitly_converted_dataset_association"

    id = Column(Integer, primary_key=True)
    create_time = Column(DateTime, default=now)
    update_time = Column(DateTime, default=now, onupdate=now)
    hda_id = Column(Integer, ForeignKey("history_dataset_association.id"), index=True, nullable=True)
    hda_parent_id = Column(Integer, ForeignKey("history_dataset_association.id"), index=True, nullable=True)
    ldda_id = Column(Integer, ForeignKey("library_dataset_dataset_association.id"), index=True, nullable=True)
    ldda_parent_id = Column(Integer, ForeignKey("library_dataset_dataset_association.id"), index=True, nullable=True)
    type = Column(TrimmedString(255))  # Target extension (e.g., "tabular", "gff")
    metadata_safe = Column(Boolean)
    deleted = Column(Boolean, default=False)
Key Fields:
- hda_id / ldda_id: The converted dataset
- hda_parent_id / ldda_parent_id: The original dataset
- type: Target extension (e.g., “tabular”, “gff”, “bigwig”)
Navigation via bidirectional relationships on DatasetInstance:
- hda.implicitly_converted_datasets → association rows for datasets converted FROM this one
- hda.implicitly_converted_parent_datasets → association rows for the parent datasets this one was converted FROM
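The two-way navigation can be illustrated with a minimal in-memory sketch. The `HDA`, `Assoc`, and `link` names below are hypothetical stand-ins, not Galaxy classes; the sketch only assumes that each relationship yields association rows pointing at a parent and a converted dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HDA:
    """Hypothetical stand-in for a HistoryDatasetAssociation."""
    id: int
    hid: int
    extension: str
    visible: bool = True
    # Association rows where this HDA is the parent (conversions made FROM it).
    implicitly_converted_datasets: List["Assoc"] = field(default_factory=list)
    # Association rows where this HDA is the converted result.
    implicitly_converted_parent_datasets: List["Assoc"] = field(default_factory=list)


@dataclass
class Assoc:
    """Hypothetical stand-in for one ImplicitlyConvertedDatasetAssociation row."""
    parent: HDA
    dataset: HDA
    type: str  # target extension
    deleted: bool = False


def link(parent: HDA, converted: HDA, target_ext: str) -> Assoc:
    """Record one conversion on both sides of the relationship."""
    assoc = Assoc(parent=parent, dataset=converted, type=target_ext)
    parent.implicitly_converted_datasets.append(assoc)
    converted.implicitly_converted_parent_datasets.append(assoc)
    return assoc


fasta = HDA(id=1, hid=5, extension="fasta")
tabular = HDA(id=2, hid=5, extension="tabular", visible=False)
link(fasta, tabular, "tabular")

# Navigate parent -> converted and converted -> parent.
children = [a.dataset.id for a in fasta.implicitly_converted_datasets]
parents = [a.parent.id for a in tabular.implicitly_converted_parent_datasets]
```

Note that the converted item gets its own `id` while reusing the parent's `hid`, which matches the design described in the summary.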
2. Datatype Converter System
2.1 Converter Registration
Location: lib/galaxy/datatypes/registry.py:665-687
Converters are registered in datatypes_conf.xml with <converter> elements:
<converter file="fasta_to_tabular_converter.xml" target_datatype="tabular"/>
Registry loads converters into:
- registry.converter_tools: Set of all converter tools
- registry.datatype_converters: Dict mapping source_ext -> {target_ext: converter_tool}
- registry.converter_deps: Dict tracking multi-step conversion dependencies
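As a rough illustration of how such a mapping could be built, the sketch below parses a trimmed, hypothetical configuration fragment. It assumes `<converter>` elements are nested inside `<datatype extension="...">` elements, and it stores converter file names where the real registry would hold loaded tool objects. `load_converters` is an illustrative helper, not a Galaxy function.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, trimmed-down configuration fragment for illustration only.
CONF = """
<datatypes>
  <registration>
    <datatype extension="fasta" type="galaxy.datatypes.sequence:Fasta">
      <converter file="fasta_to_tabular_converter.xml" target_datatype="tabular"/>
    </datatype>
    <datatype extension="bed" type="galaxy.datatypes.interval:Bed">
      <converter file="bed_to_gff_converter.xml" target_datatype="gff"/>
    </datatype>
  </registration>
</datatypes>
"""


def load_converters(xml_text):
    """Build source_ext -> {target_ext: converter_file} from <converter> elements."""
    datatype_converters = defaultdict(dict)
    root = ET.fromstring(xml_text)
    for datatype in root.iter("datatype"):
        source_ext = datatype.get("extension")
        for conv in datatype.findall("converter"):
            target_ext = conv.get("target_datatype")
            datatype_converters[source_ext][target_ext] = conv.get("file")
    return dict(datatype_converters)


converters = load_converters(CONF)
```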
2.2 Converter Discovery
Key Methods in lib/galaxy/datatypes/registry.py:
- get_converters_by_datatype(ext) (line 875): Returns dict of all conversions FROM an extension
- get_converter_by_target_type(source_ext, target_ext) (line 890): Returns specific converter tool
- find_conversion_destination_for_dataset_by_extensions() (line 897-956): Core logic to find if conversion is needed/available
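The first two lookups can be sketched over a plain nested dict. This is a simplification: it keys on exact extension strings and uses made-up string values in place of converter tool objects, whereas the real registry may also take datatype inheritance into account.

```python
# Hypothetical registry contents; values stand in for converter tool objects.
DATATYPE_CONVERTERS = {
    "fasta": {"tabular": "fasta_to_tabular_converter"},
    "bed": {"gff": "bed_to_gff_converter"},
    "bam": {"bigwig": "bam_to_bigwig_converter"},
}


def get_converters_by_datatype(ext):
    """Return every conversion available FROM the given extension."""
    return dict(DATATYPE_CONVERTERS.get(ext, {}))


def get_converter_by_target_type(source_ext, target_ext):
    """Return the converter for one source/target pair, or None if absent."""
    return DATATYPE_CONVERTERS.get(source_ext, {}).get(target_ext)


bed_targets = sorted(get_converters_by_datatype("bed"))
missing = get_converter_by_target_type("bed", "bigwig")
```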
2.3 Converter Tools
~100 converter tools in lib/galaxy/datatypes/converters/ directory. Examples:
- fasta_to_tabular_converter.xml
- bed_to_gff_converter.xml
- bam_to_bigwig_converter.xml
Converters are normal Galaxy tools with single input, single output.
3. Conversion Trigger Point
Location: lib/galaxy/tools/actions/__init__.py:164-184
When tool inputs are collected via process_dataset():
def process_dataset(data, formats=None):
    # trans, parent, history and param_values are not parameters here;
    # they are taken from the enclosing scope.
    direct_match, target_ext, converted_dataset = data.find_conversion_destination(formats)
    if not direct_match and target_ext:
        if converted_dataset:
            data = converted_dataset  # Use existing conversion
        else:
            data = data.get_converted_dataset(
                trans,
                target_ext,
                target_context=parent,
                history=history,
                use_cached_job=param_values.get("__use_cached_job__", False),
            )
    return data
Flow:
- For each dataset parameter, calls find_conversion_destination(accepted_formats)
- Returns (direct_match, target_ext, existing_converted_dataset)
- If conversion needed and exists → use existing
- If conversion needed but doesn’t exist → call get_converted_dataset()
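The flow above can be exercised end to end with a small stand-in. `FakeDataset` is hypothetical and matches formats by exact string; it exists only to show that a second request for the same target format reuses the first conversion instead of running another one.

```python
class FakeDataset:
    """Hypothetical dataset that tracks its own implicit conversions."""

    def __init__(self, ext, convertible_to=()):
        self.ext = ext
        self.convertible_to = tuple(convertible_to)
        self.converted = {}  # target_ext -> FakeDataset
        self.conversions_run = 0

    def find_conversion_destination(self, formats):
        """Return (direct_match, target_ext, existing_converted_dataset)."""
        if self.ext in formats:
            return True, None, None
        for target in formats:
            if target in self.convertible_to:
                return False, target, self.converted.get(target)
        return False, None, None

    def get_converted_dataset(self, target_ext):
        """Pretend to run a converter and remember the result."""
        self.conversions_run += 1
        new = FakeDataset(target_ext)
        self.converted[target_ext] = new
        return new


def process_dataset(data, formats):
    direct_match, target_ext, converted = data.find_conversion_destination(formats)
    if not direct_match and target_ext:
        # Reuse an existing conversion, otherwise create one.
        data = converted if converted else data.get_converted_dataset(target_ext)
    return data


fasta = FakeDataset("fasta", convertible_to=["tabular"])
first = process_dataset(fasta, ["tabular"])
second = process_dataset(fasta, ["tabular"])
unchanged = process_dataset(fasta, ["fasta", "tabular"])
```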
4. Conversion Execution Flow
4.1 DatasetInstance.get_converted_dataset()
Location: lib/galaxy/model/__init__.py:5476-5531
- Check if converter exists; raise NoConverterException if not
- Check metadata-based conversions (e.g., BAM index via get_metadata_dataset())
- Check existing conversions via get_converted_files_by_type()
- Resolve dependencies recursively via get_converted_dataset_deps()
- Execute converter via datatype.convert_dataset()
4.2 Data.convert_dataset()
Location: lib/galaxy/datatypes/data.py:829-880
def convert_dataset(self, trans, original_dataset, target_type, ...):
    converter = trans.app.datatypes_registry.get_converter_by_target_type(
        original_dataset.extension, target_type
    )
    params = {"input1": original_dataset, "__target_datatype__": target_type}

    # Check for cached job if requested
    if use_cached_job:
        completed_jobs = converter.completed_jobs(trans, params)
        if completed_jobs:
            # dict views are not subscriptable in Python 3, so take the first output name via iteration
            return completed_jobs[0].get_output(next(iter(converter.outputs)))

    # Execute converter
    converted_dataset = converter.execute(trans, params, history=history)
    original_dataset.attach_implicitly_converted_dataset(session, converted_dataset, target_type)
    return converted_dataset
4.3 Attaching Converted Dataset
Location: lib/galaxy/model/__init__.py:5533-5540
def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    new_dataset.name = self.name
    self.copy_attributes(new_dataset)
    assoc = ImplicitlyConvertedDatasetAssociation(
        parent=self, file_type=target_ext, dataset=new_dataset, metadata_safe=False
    )
    session.add(new_dataset)
    session.add(assoc)
5. Checking for Existing Conversions
Location: lib/galaxy/model/__init__.py:5452-5463
def get_converted_files_by_type(self, file_type, include_errored=False):
    for assoc in self.implicitly_converted_datasets:
        if not assoc.deleted and assoc.type == file_type:
            item = assoc.dataset or assoc.dataset_ldda
            valid_states = (
                (Dataset.states.ERROR, *Dataset.valid_input_states)
                if include_errored
                else Dataset.valid_input_states
            )
            if not item.deleted and item.state in valid_states:
                return item
    return None
This prevents redundant conversions by checking if the conversion was already performed.
6. Dataset Matcher System
Location: lib/galaxy/tools/parameters/dataset_matcher.py
6.1 Match Classes
class HdaDirectMatch:
    implicit_conversion = False

class HdaImplicitMatch:
    implicit_conversion = True
    target_ext: str  # Format to convert to
    original_hda: HDA  # Parent before conversion
    hda: HDA  # Converted HDA (or original if not yet converted)
6.2 Matching Logic
DatasetMatcher.valid_hda_match() (line 113-136):
def valid_hda_match(self, hda, check_implicit_conversions=True):
    direct_match, target_ext, converted_dataset = hda.find_conversion_destination(formats)
    if direct_match:
        return HdaDirectMatch(hda)
    else:
        if not check_implicit_conversions:
            return False
        if target_ext:
            original_hda = hda
            if converted_dataset:
                hda = converted_dataset
            return HdaImplicitMatch(hda, target_ext, original_hda)
        else:
            return False
7. API Endpoint
Location: lib/galaxy/webapps/galaxy/api/tools.py:802-849
POST /api/tools/{tool_id}/conversion
Payload:
{
  "id": "<dataset_id>",
  "src": "hda",
  "source_type": "fasta",
  "target_type": "tabular",
  "history_id": "<optional>"
}
Executes converter and returns result.
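A client call might be assembled roughly as follows. The endpoint path and payload keys are taken from the notes above; the server URL, the API key, the converter tool id, and the use of an `x-api-key` header are assumptions for illustration. The request is only built here, not sent.

```python
import json
import urllib.request

GALAXY_URL = "https://galaxy.example.org"  # hypothetical server
API_KEY = "YOUR_API_KEY"                   # placeholder credential
TOOL_ID = "CONVERTER_fasta_to_tabular"     # assumed converter tool id


def build_conversion_request(base_url, tool_id, payload, api_key):
    """Build (but do not send) a POST request for the conversion endpoint."""
    url = f"{base_url}/api/tools/{tool_id}/conversion"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # Header-based authentication is an assumption of this sketch.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )


payload = {
    "id": "<dataset_id>",  # placeholder, as in the payload above
    "src": "hda",
    "source_type": "fasta",
    "target_type": "tabular",
}
request = build_conversion_request(GALAXY_URL, TOOL_ID, payload, API_KEY)
# To actually send it: urllib.request.urlopen(request)
```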
8. UI Display
Location: lib/galaxy/tools/parameters/basic.py:2350-2480
The tool parameter building groups HDAs by HID and shows conversion info:
matches_by_hid: dict[int, list] = {}
for hda in history.active_visible_datasets_and_roles:
    match = dataset_matcher.hda_match(hda)
    if match:
        # setdefault avoids a KeyError the first time a HID is seen
        matches_by_hid.setdefault(match.hda.hid, []).append(match)

for matches in matches_by_hid.values():
    match = matches[0]
    # Prefer original HDA over already-converted
    if len(matches) > 1:
        match = next(
            (m for m in matches if len(m.hda.implicitly_converted_parent_datasets) == 0),
            match,
        )
    # Display name shows "(as format)" for implicit conversions
    m_name = (
        f"{match.original_hda.name} (as {match.target_ext})"
        if match.implicit_conversion
        else match.hda.name
    )
9. HID and Visibility Mechanism
Critical Design Points:
- Converted HDA is a NEW HDA with its own database ID
- Created with visible=False → hidden from history UI by default
- Same HID as parent: The converted dataset shares the parent’s HID
- User experience: User sees original dataset at HID, tool receives converted version transparently
The HID sharing means:
- Query by HID returns multiple HDAs (original + conversions)
- UI groups by HID and prefers showing the original
- ImplicitlyConvertedDatasetAssociation table enables discovery of relationships
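A toy history makes the consequences of HID sharing concrete. The `Item` class and its `is_conversion` flag are hypothetical; in Galaxy that information would come from the association table rather than a field on the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical history item; is_conversion stands in for an association lookup."""
    id: int
    hid: int
    name: str
    visible: bool
    is_conversion: bool


history = [
    Item(id=10, hid=1, name="reads.fasta", visible=True, is_conversion=False),
    Item(id=11, hid=1, name="reads.fasta", visible=False, is_conversion=True),
    Item(id=12, hid=2, name="peaks.bed", visible=True, is_conversion=False),
]

# A query by HID can return several items: the original plus its conversions.
by_hid = defaultdict(list)
for item in history:
    by_hid[item.hid].append(item)

# Prefer the original for display, falling back to the first item at that HID.
shown = {
    hid: next((i for i in items if not i.is_conversion), items[0])
    for hid, items in by_hid.items()
}

# The default history view only lists visible items.
visible_ids = [i.id for i in history if i.visible]
```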
10. Caching & Job Reuse
Location: lib/galaxy/datatypes/data.py:850-860
if use_cached_job:
    completed_jobs = converter.completed_jobs(trans, params)
    if completed_jobs:
        return completed_jobs[0].get_output(...)
- Checks for previous identical conversions via converter.completed_jobs()
- If found and use_cached_job=True, reuses result
- Avoids re-execution of expensive conversions
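The reuse behaviour can be sketched with an in-memory cache. `ConverterJobCache` and its key scheme (tool id plus JSON-serialised parameters) are assumptions for illustration; how Galaxy decides that two jobs are equivalent is not covered by these notes.

```python
import json


class ConverterJobCache:
    """Hypothetical in-memory stand-in for looking up completed converter jobs."""

    def __init__(self):
        self._completed = {}
        self.executions = 0

    @staticmethod
    def _key(tool_id, params):
        # Sorting keys makes the cache key independent of dict ordering.
        return (tool_id, json.dumps(params, sort_keys=True))

    def run(self, tool_id, params, use_cached_job=False):
        key = self._key(tool_id, params)
        if use_cached_job and key in self._completed:
            return self._completed[key]  # reuse the earlier result
        self.executions += 1  # pretend to run the converter
        output = f"output-of-{tool_id}-{self.executions}"
        self._completed[key] = output
        return output


cache = ConverterJobCache()
params = {"input1": "hda-1", "__target_datatype__": "tabular"}
a = cache.run("fasta_to_tabular", params, use_cached_job=True)
b = cache.run("fasta_to_tabular", params, use_cached_job=True)
runs_with_cache = cache.executions
c = cache.run("fasta_to_tabular", params, use_cached_job=False)
```

With `use_cached_job=False` the converter runs again even though an identical job exists, which mirrors the flag guarding the lookup in the snippet above.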
11. Special Cases
11.1 Metadata Conversions
Location: lib/galaxy/model/__init__.py:5547-5560
Some “conversions” are actually metadata files (e.g., BAM index):
def get_metadata_dataset(self, dataset, name):
    # Returns metadata file as fake HDA
    # No actual conversion needed
    ...
11.2 Multi-Step Conversions
converter_deps dictionary tracks dependencies:
- Example: fasta → bed might require fasta → gff → bed
- get_converted_dataset_deps() recursively resolves chain
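One way to picture the recursive resolution is the sketch below. The shape of `CONVERTER_DEPS` (source_ext -> target_ext -> list of prerequisite target extensions) and the `resolve_conversion_order` helper are assumptions, and the fasta/gff/bed chain is the hypothetical example from the list above rather than a confirmed Galaxy converter chain.

```python
# Hypothetical dependency table: producing "bed" from "fasta" first needs "gff".
CONVERTER_DEPS = {"fasta": {"bed": ["gff"]}}


def resolve_conversion_order(source_ext, target_ext, deps, _path=()):
    """Return target extensions in the order they must be produced."""
    if target_ext in _path:
        raise ValueError(f"cyclic converter dependency at {target_ext!r}")
    order = []
    for dep in deps.get(source_ext, {}).get(target_ext, []):
        # Resolve each prerequisite first, keeping one entry per extension.
        for ext in resolve_conversion_order(source_ext, dep, deps, _path + (target_ext,)):
            if ext not in order:
                order.append(ext)
    order.append(target_ext)
    return order


chain = resolve_conversion_order("fasta", "bed", CONVERTER_DEPS)
single = resolve_conversion_order("fasta", "tabular", CONVERTER_DEPS)
```

Tracking the current path, rather than a global visited set, lets two prerequisites share a common dependency without being reported as a cycle.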
11.3 Library Datasets (LDDA)
Parallel FK fields support library datasets:
- ldda_id / ldda_parent_id
- Same association table, different dataset type
12. Tests
12.1 Unit Tests
test/unit/app/tools/test_data_parameters.py:71-142:
- test_field_implicit_conversion_new: Tests “(as tabular)” display when not yet converted
- test_field_implicit_conversion_existing: Tests using existing converted HDA
test/unit/app/tools/test_dataset_matcher.py:46-75:
- test_valid_hda_implicit_convered: Tests matching already-converted dataset
- test_hda_match_implicit_can_convert: Tests matching when conversion needed
- test_hda_match_properly_skips_conversion: Tests check_implicit_conversions=False
12.2 Integration Tests
test/integration/test_extended_metadata.py: Integration tests for conversion + metadata
13. Data Flow Diagram
Tool Execution Request
│
▼
For Each Data Input Parameter
│
▼
process_dataset(hda)
│
▼
hda.find_conversion_destination(required_formats)
│
├──► Direct Match?
│ YES → Use HDA as-is
│ NO → Check if conversion possible
│
▼
Conversion Possible?
│
├──► YES → Check if Already Converted
│ │ │
│ │ ├──► Already Converted? (via ImplicitlyConvertedDatasetAssociation)
│ │ │ YES → Use Converted Dataset
│ │ │ NO → get_converted_dataset()
│ │
│ NO → Error or Fallback
│
▼
get_converted_dataset(trans, target_ext)
│
▼
Execute Converter Tool (via job queue)
│
▼
original_hda.attach_implicitly_converted_dataset(converted_hda)
│
▼
Create ImplicitlyConvertedDatasetAssociation Link
│
▼
Return converted_hda (visible=False, shares parent HID)
14. Key Files Summary
| File | Lines | Purpose |
|---|---|---|
| lib/galaxy/model/__init__.py | 5452-5560, 6877-6930 | Model definitions, conversion methods |
| lib/galaxy/datatypes/data.py | 829-880 | convert_dataset() implementation |
| lib/galaxy/datatypes/registry.py | 665-687, 875-956 | Converter registration and discovery |
| lib/galaxy/tools/actions/__init__.py | 164-184 | Conversion trigger in tool execution |
| lib/galaxy/tools/parameters/basic.py | 2350-2480 | UI parameter building with conversion display |
| lib/galaxy/tools/parameters/dataset_matcher.py | 90-186 | Dataset matching logic |
| lib/galaxy/webapps/galaxy/api/tools.py | 802-849 | API endpoint for explicit conversion |
| test/unit/app/tools/test_data_parameters.py | 71-142 | Unit tests |
| test/unit/app/tools/test_dataset_matcher.py | 46-75 | Matcher unit tests |
15. Open Questions
- Purging strategy: When are implicitly converted datasets purged vs kept?
- Collection conversions: How do implicit conversions work with dataset collections?
- Security model: Are implicit conversions subject to same permission checks?
- Workflow extraction: How does current extraction handle implicit conversions when HID has multiple datasets?
- Performance: What’s the overhead of conversion discovery for large histories?