Component Implicit Dataset Conversion

Transparent datatype conversion mechanism via ImplicitlyConvertedDatasetAssociation, invisible HDAs

Revised: 2026-02-05
Revision: 1

Galaxy Implicit Dataset Conversion - Research Findings

Executive Summary

Galaxy’s “implicit dataset conversion” is a transparent mechanism allowing users to pass a History Dataset Association (HDA) of one datatype to a tool accepting a different datatype, provided a converter exists. The key insight: implicitly converted datasets are full HDAs with their own database IDs, linked to parent datasets through the ImplicitlyConvertedDatasetAssociation table, but marked invisible (visible=False) in history views.

1. Core Data Structure: ImplicitlyConvertedDatasetAssociation

Location: lib/galaxy/model/__init__.py:6877-6930

class ImplicitlyConvertedDatasetAssociation(Base, Serializable):
    __tablename__ = "implicitly_converted_dataset_association"

    id = Column(Integer, primary_key=True)
    create_time = Column(DateTime, default=now)
    update_time = Column(DateTime, default=now, onupdate=now)
    hda_id = Column(Integer, ForeignKey("history_dataset_association.id"), index=True, nullable=True)
    hda_parent_id = Column(Integer, ForeignKey("history_dataset_association.id"), index=True, nullable=True)
    ldda_id = Column(Integer, ForeignKey("library_dataset_dataset_association.id"), index=True, nullable=True)
    ldda_parent_id = Column(Integer, ForeignKey("library_dataset_dataset_association.id"), index=True, nullable=True)
    type = Column(TrimmedString(255))  # Target extension (e.g., "tabular", "gff")
    metadata_safe = Column(Boolean)
    deleted = Column(Boolean, default=False)

Key Fields:

  • hda_id/ldda_id: The converted dataset
  • hda_parent_id/ldda_parent_id: The original dataset
  • type: Target extension (e.g., “tabular”, “gff”, “bigwig”)

Navigation via bidirectional relationships on DatasetInstance:

  • hda.implicitly_converted_datasets → datasets converted FROM this one
  • hda.implicitly_converted_parent_datasets → parent datasets this was converted FROM
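
To make the two navigation directions concrete, here is a minimal sketch using plain dataclasses. Hda, Assoc, and link() are hypothetical stand-ins for illustration only; they are not Galaxy's SQLAlchemy models, and only the attribute names implicitly_converted_datasets and implicitly_converted_parent_datasets are taken from the text above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Hda:
    """Hypothetical stand-in for a HistoryDatasetAssociation."""
    id: int
    extension: str
    visible: bool = True
    # Associations in which this HDA is the parent (conversions FROM it).
    implicitly_converted_datasets: List["Assoc"] = field(default_factory=list)
    # Associations in which this HDA is the converted result.
    implicitly_converted_parent_datasets: List["Assoc"] = field(default_factory=list)


@dataclass(eq=False)
class Assoc:
    """Hypothetical stand-in for ImplicitlyConvertedDatasetAssociation."""
    parent: Hda
    dataset: Hda
    type: str  # target extension
    deleted: bool = False


def link(parent: Hda, converted: Hda) -> Assoc:
    """Record that `converted` was implicitly produced from `parent`."""
    assoc = Assoc(parent=parent, dataset=converted, type=converted.extension)
    parent.implicitly_converted_datasets.append(assoc)
    converted.implicitly_converted_parent_datasets.append(assoc)
    return assoc


fasta = Hda(id=1, extension="fasta")
tabular = Hda(id=2, extension="tabular", visible=False)
link(fasta, tabular)

# Parent -> converted, and converted -> parent.
converted = [a.dataset for a in fasta.implicitly_converted_datasets]
parents = [a.parent for a in tabular.implicitly_converted_parent_datasets]
```

Both directions go through the same association object, which is why a single row in the table is enough to answer "what was this converted to?" and "what was this converted from?".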

2. Datatype Converter System

2.1 Converter Registration

Location: lib/galaxy/datatypes/registry.py:665-687

Converters are registered in datatypes_conf.xml with <converter> elements:

<converter file="fasta_to_tabular_converter.xml" target_datatype="tabular"/>

Registry loads converters into:

  • registry.converter_tools: Set of all converter tools
  • registry.datatype_converters: Dict mapping source_ext -> {target_ext: converter_tool}
  • registry.converter_deps: Dict tracking multi-step conversion dependencies
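
The shape of datatype_converters can be illustrated by parsing a small configuration fragment. This is a sketch under assumptions: it assumes each <converter> element is nested inside a <datatype extension="..."> element, it stores converter file names where the real registry stores loaded tool objects, and load_converters() is a hypothetical helper, not a registry method.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed layout of a datatypes_conf.xml fragment (illustrative only).
CONF = """
<datatypes>
  <registration>
    <datatype extension="fasta" type="galaxy.datatypes.sequence:Fasta">
      <converter file="fasta_to_tabular_converter.xml" target_datatype="tabular"/>
    </datatype>
    <datatype extension="bed" type="galaxy.datatypes.interval:Bed">
      <converter file="bed_to_gff_converter.xml" target_datatype="gff"/>
    </datatype>
  </registration>
</datatypes>
""".strip()


def load_converters(xml_text):
    """Build source_ext -> {target_ext: converter_file} from <converter> elements."""
    converters = defaultdict(dict)
    root = ET.fromstring(xml_text)
    for datatype in root.iter("datatype"):
        source_ext = datatype.get("extension")
        for conv in datatype.findall("converter"):
            converters[source_ext][conv.get("target_datatype")] = conv.get("file")
    return dict(converters)


datatype_converters = load_converters(CONF)
```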

2.2 Converter Discovery

Key Methods in lib/galaxy/datatypes/registry.py:

  • get_converters_by_datatype(ext) (line 875): Returns a dict of all conversions FROM an extension
  • get_converter_by_target_type(source_ext, target_ext) (line 890): Returns the specific converter tool
  • find_conversion_destination_for_dataset_by_extensions() (lines 897-956): Core logic that decides whether a conversion is needed and available
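
The three-part result of conversion discovery, (direct_match, target_ext, converted_dataset), can be sketched with a toy lookup. This is a simplification under stated assumptions: it compares extensions as plain strings (the real registry may also consider datatype class relationships), and CONVERTERS, the existing mapping, and this standalone function are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical converter table: source_ext -> {target_ext: converter name}.
CONVERTERS: Dict[str, Dict[str, str]] = {
    "fasta": {"tabular": "fasta_to_tabular"},
    "bed": {"gff": "bed_to_gff"},
}


def find_conversion_destination(
    source_ext: str,
    accepted: Iterable[str],
    existing: Optional[Dict[str, str]] = None,
) -> Tuple[bool, Optional[str], Optional[str]]:
    """Return (direct_match, target_ext, converted_dataset).

    `existing` maps target extensions to already-converted datasets.
    """
    accepted = list(accepted)
    existing = existing or {}
    if source_ext in accepted:
        return True, None, None  # the tool accepts the dataset as-is
    available = CONVERTERS.get(source_ext, {})
    for target_ext in accepted:
        if target_ext in available:
            # Conversion is possible; report an existing result if there is one.
            return False, target_ext, existing.get(target_ext)
    return False, None, None  # no direct match and no converter


direct = find_conversion_destination("fasta", ["fasta"])
needs_conversion = find_conversion_destination("fasta", ["tabular"])
already_converted = find_conversion_destination("fasta", ["tabular"], {"tabular": "hda-2"})
impossible = find_conversion_destination("fasta", ["bam"])
```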

2.3 Converter Tools

Roughly 100 converter tools live in the lib/galaxy/datatypes/converters/ directory. Examples:

  • fasta_to_tabular_converter.xml
  • bed_to_gff_converter.xml
  • bam_to_bigwig_converter.xml

Converters are ordinary Galaxy tools with a single input and a single output.

3. Conversion Trigger Point

Location: lib/galaxy/tools/actions/__init__.py:164-184

When tool inputs are collected via process_dataset():

def process_dataset(data, formats=None):
    direct_match, target_ext, converted_dataset = data.find_conversion_destination(formats)
    if not direct_match and target_ext:
        if converted_dataset:
            data = converted_dataset  # Use existing conversion
        else:
            data = data.get_converted_dataset(
                trans,
                target_ext,
                target_context=parent,
                history=history,
                use_cached_job=param_values.get("__use_cached_job__", False),
            )
    return data

Flow:

  1. For each dataset parameter, calls find_conversion_destination(accepted_formats)
  2. Returns (direct_match, target_ext, existing_converted_dataset)
  3. If conversion needed and exists → use existing
  4. If conversion needed but doesn’t exist → call get_converted_dataset()
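
The four steps above imply that a converter runs at most once per (dataset, target format) pair as long as the earlier result stays usable. The following sketch demonstrates that behaviour with toy callables; resolve_input(), find(), convert(), and the string "datasets" are hypothetical and only mimic the control flow.

```python
from typing import Callable, Dict, List, Optional, Tuple

FindResult = Tuple[bool, Optional[str], Optional[str]]


def resolve_input(
    data: str,
    find: Callable[[str], FindResult],
    convert: Callable[[str, str], str],
) -> str:
    """Pick the dataset a tool should receive, following the four steps."""
    direct_match, target_ext, converted = find(data)
    if direct_match or not target_ext:
        return data  # direct match, or nothing to convert to
    if converted is not None:
        return converted  # reuse the existing conversion
    return convert(data, target_ext)  # trigger a new conversion


# Toy state: conversions that already exist, and a log of converter runs.
existing: Dict[Tuple[str, str], str] = {}
runs: List[Tuple[str, str]] = []


def find(data: str) -> FindResult:
    if data.endswith(".tabular"):
        return True, None, None
    return False, "tabular", existing.get((data, "tabular"))


def convert(data: str, target_ext: str) -> str:
    runs.append((data, target_ext))
    result = f"{data}->{target_ext}"
    existing[(data, target_ext)] = result
    return result


first = resolve_input("reads.fasta", find, convert)   # runs the converter
second = resolve_input("reads.fasta", find, convert)  # reuses the result
direct = resolve_input("table.tabular", find, convert)
```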

4. Conversion Execution Flow

4.1 DatasetInstance.get_converted_dataset()

Location: lib/galaxy/model/__init__.py:5476-5531

  1. Check if converter exists; raise NoConverterException if not
  2. Check metadata-based conversions (e.g., BAM index via get_metadata_dataset())
  3. Check existing conversions via get_converted_files_by_type()
  4. Resolve dependencies recursively via get_converted_dataset_deps()
  5. Execute converter via datatype.convert_dataset()

4.2 Data.convert_dataset()

Location: lib/galaxy/datatypes/data.py:829-880

def convert_dataset(self, trans, original_dataset, target_type, ...):
    converter = trans.app.datatypes_registry.get_converter_by_target_type(
        original_dataset.extension, target_type
    )
    params = {"input1": original_dataset, "__target_datatype__": target_type}

    # Check for cached job if requested
    if use_cached_job:
        completed_jobs = converter.completed_jobs(trans, params)
        if completed_jobs:
            # dict views are not subscriptable in Python 3; take the first output name
            return completed_jobs[0].get_output(next(iter(converter.outputs)))

    # Execute converter
    converted_dataset = converter.execute(trans, params, history=history)
    original_dataset.attach_implicitly_converted_dataset(session, converted_dataset, target_type)
    return converted_dataset

4.3 Attaching Converted Dataset

Location: lib/galaxy/model/__init__.py:5533-5540

def attach_implicitly_converted_dataset(self, session, new_dataset, target_ext: str):
    new_dataset.name = self.name
    self.copy_attributes(new_dataset)
    assoc = ImplicitlyConvertedDatasetAssociation(
        parent=self, file_type=target_ext, dataset=new_dataset, metadata_safe=False
    )
    session.add(new_dataset)
    session.add(assoc)

5. Checking for Existing Conversions

Location: lib/galaxy/model/__init__.py:5452-5463

def get_converted_files_by_type(self, file_type, include_errored=False):
    for assoc in self.implicitly_converted_datasets:
        if not assoc.deleted and assoc.type == file_type:
            item = assoc.dataset or assoc.dataset_ldda
            valid_states = (
                (Dataset.states.ERROR, *Dataset.valid_input_states)
                if include_errored
                else Dataset.valid_input_states
            )
            if not item.deleted and item.state in valid_states:
                return item
    return None

This prevents redundant conversions by checking if the conversion was already performed.

6. Dataset Matcher System

Location: lib/galaxy/tools/parameters/dataset_matcher.py

6.1 Match Classes

class HdaDirectMatch:
    implicit_conversion = False

class HdaImplicitMatch:
    implicit_conversion = True
    target_ext: str           # Format to convert to
    original_hda: HDA         # Parent before conversion
    hda: HDA                  # Converted HDA (or original if not yet converted)

6.2 Matching Logic

DatasetMatcher.valid_hda_match() (lines 113-136):

def valid_hda_match(self, hda, check_implicit_conversions=True):
    direct_match, target_ext, converted_dataset = hda.find_conversion_destination(formats)
    if direct_match:
        return HdaDirectMatch(hda)
    else:
        if not check_implicit_conversions:
            return False
        if target_ext:
            original_hda = hda
            if converted_dataset:
                hda = converted_dataset
            return HdaImplicitMatch(hda, target_ext, original_hda)
        else:
            return False

7. API Endpoint

Location: lib/galaxy/webapps/galaxy/api/tools.py:802-849

POST /api/tools/{tool_id}/conversion

Payload:

{
    "id": "<dataset_id>",
    "src": "hda",
    "source_type": "fasta",
    "target_type": "tabular",
    "history_id": "<optional>"
}

Executes converter and returns result.
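
A request for this endpoint could be assembled as follows. This is a sketch only: the server URL, tool ID, dataset ID, and API key are placeholders, authenticating with an x-api-key header is an assumption about the deployment, build_conversion_request() is a hypothetical helper, and the request is built but not sent.

```python
import json
import urllib.request


def build_conversion_request(base_url, tool_id, dataset_id, source_type, target_type, api_key):
    """Build (but do not send) a POST request for the conversion endpoint."""
    payload = {
        "id": dataset_id,
        "src": "hda",
        "source_type": source_type,
        "target_type": target_type,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/tools/{tool_id}/conversion",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the server accepts an API key in this header.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_conversion_request(
    base_url="https://galaxy.example.org",  # placeholder server
    tool_id="some_tool",                    # placeholder tool ID
    dataset_id="abc123",                    # placeholder encoded dataset ID
    source_type="fasta",
    target_type="tabular",
    api_key="YOUR_API_KEY",                 # placeholder credential
)
# To actually send it: urllib.request.urlopen(request)
```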

8. UI Display

Location: lib/galaxy/tools/parameters/basic.py:2350-2480

The tool parameter building groups HDAs by HID and shows conversion info:

matches_by_hid: dict[int, list] = {}
for hda in history.active_visible_datasets_and_roles:
    match = dataset_matcher.hda_match(hda)
    if match:
        # setdefault avoids a KeyError the first time a HID is seen
        matches_by_hid.setdefault(match.hda.hid, []).append(match)

for matches in matches_by_hid.values():
    match = matches[0]
    # Prefer original HDA over already-converted
    if len(matches) > 1:
        match = next((m for m in matches
                     if len(m.hda.implicitly_converted_parent_datasets) == 0), match)

    # Display name shows "(as format)" for implicit conversions
    m_name = (
        f"{match.original_hda.name} (as {match.target_ext})"
        if match.implicit_conversion
        else match.hda.name
    )

9. HID and Visibility Mechanism

Critical Design Points:

  1. Converted HDA is a NEW HDA with its own database ID
  2. Created with visible=False → hidden from history UI by default
  3. Same HID as parent: The converted dataset shares the parent’s HID
  4. User experience: User sees original dataset at HID, tool receives converted version transparently

The HID sharing means:

  • A query by HID can return multiple HDAs (original + conversions)
  • UI groups by HID and prefers showing the original
  • ImplicitlyConvertedDatasetAssociation table enables discovery of relationships
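
The consequence of HID sharing can be shown with a small in-memory history. In this sketch the Hda dataclass, group_by_hid(), and displayed_hda() are hypothetical, and visibility is used as a simple proxy for "is the original"; the UI code quoted in section 8 instead checks implicitly_converted_parent_datasets.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Hda:
    """Hypothetical minimal HDA: database id, HID, extension, visibility."""
    id: int
    hid: int
    extension: str
    visible: bool = True


def group_by_hid(hdas: List[Hda]) -> Dict[int, List[Hda]]:
    """Group HDAs by their history ID (HID)."""
    groups: Dict[int, List[Hda]] = defaultdict(list)
    for hda in hdas:
        groups[hda.hid].append(hda)
    return dict(groups)


def displayed_hda(group: List[Hda]) -> Optional[Hda]:
    """Prefer the visible original; fall back to the first entry."""
    return next((h for h in group if h.visible), group[0] if group else None)


history = [
    Hda(id=10, hid=1, extension="fasta"),
    Hda(id=11, hid=1, extension="tabular", visible=False),  # implicit conversion
    Hda(id=12, hid=2, extension="bed"),
]
groups = group_by_hid(history)
shown = displayed_hda(groups[1])
```

HID 1 resolves to two database rows here, which is why any code that looks datasets up by HID has to decide which of them it means.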

10. Caching & Job Reuse

Location: lib/galaxy/datatypes/data.py:850-860

if use_cached_job:
    completed_jobs = converter.completed_jobs(trans, params)
    if completed_jobs:
        return completed_jobs[0].get_output(...)

  • Checks for previous identical conversions via converter.completed_jobs()
  • If found and use_cached_job=True, reuses result
  • Avoids re-execution of expensive conversions
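
The reuse behaviour can be sketched as a cache keyed on the converter and its parameters. ConverterJobCache is a hypothetical illustration: it does not reflect how completed_jobs() looks jobs up in the database, only the observable effect that an identical request with use_cached_job=True skips execution.

```python
import json
from typing import Callable, Dict, Tuple


class ConverterJobCache:
    """Hypothetical cache keyed on (converter id, canonicalised parameters)."""

    def __init__(self) -> None:
        self._completed: Dict[Tuple[str, str], str] = {}
        self.executions = 0

    @staticmethod
    def _key(converter_id: str, params: dict) -> Tuple[str, str]:
        # sort_keys makes the key independent of dict insertion order
        return converter_id, json.dumps(params, sort_keys=True)

    def run(
        self,
        converter_id: str,
        params: dict,
        execute: Callable[[dict], str],
        use_cached_job: bool = False,
    ) -> str:
        key = self._key(converter_id, params)
        if use_cached_job and key in self._completed:
            return self._completed[key]  # reuse the earlier output
        self.executions += 1
        output = execute(params)
        self._completed[key] = output
        return output


def fake_converter(params: dict) -> str:
    return f"{params['input1']} as {params['__target_datatype__']}"


cache = ConverterJobCache()
params = {"input1": "dataset-1", "__target_datatype__": "tabular"}
a = cache.run("fasta_to_tabular", params, fake_converter, use_cached_job=True)
b = cache.run("fasta_to_tabular", params, fake_converter, use_cached_job=True)
c = cache.run("fasta_to_tabular", params, fake_converter)  # cache bypassed
```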

11. Special Cases

11.1 Metadata Conversions

Location: lib/galaxy/model/__init__.py:5547-5560

Some “conversions” are actually metadata files (e.g., BAM index):

def get_metadata_dataset(self, dataset, name):
    # Returns the metadata file wrapped as a fake HDA;
    # no actual conversion is needed.
    ...

11.2 Multi-Step Conversions

converter_deps dictionary tracks dependencies:

  • Example: fasta → bed might require fasta → gff → bed
  • get_converted_dataset_deps() recursively resolves chain
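
Recursive dependency resolution can be sketched as a depth-first walk over a dependency table. The table layout, the conversion_order() helper, and the cycle check are assumptions made for illustration, and the fasta → gff → bed chain is the hypothetical example from the list above rather than a verified converter chain.

```python
from typing import Dict, List, Tuple

# Hypothetical dependency table: (source_ext, target_ext) -> formats needed first.
CONVERTER_DEPS: Dict[Tuple[str, str], List[str]] = {
    ("fasta", "bed"): ["gff"],
    ("fasta", "gff"): [],
}


def conversion_order(source_ext: str, target_ext: str, _path: Tuple[str, ...] = ()) -> List[str]:
    """List target formats in the order they must be produced from source_ext."""
    if target_ext in _path:
        raise ValueError(f"circular converter dependency at {target_ext!r}")
    order: List[str] = []
    for dep in CONVERTER_DEPS.get((source_ext, target_ext), []):
        # Resolve each dependency first, skipping formats already scheduled.
        for ext in conversion_order(source_ext, dep, _path + (target_ext,)):
            if ext not in order:
                order.append(ext)
    order.append(target_ext)
    return order


chain = conversion_order("fasta", "bed")
single = conversion_order("fasta", "gff")
```

Tracking the current path (rather than a global "seen" set) lets two converters share a dependency without that being mistaken for a cycle.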

11.3 Library Datasets (LDDA)

Parallel FK fields support library datasets:

  • ldda_id / ldda_parent_id
  • Same association table, different dataset type

12. Tests

12.1 Unit Tests

test/unit/app/tools/test_data_parameters.py:71-142:

  • test_field_implicit_conversion_new: Tests “(as tabular)” display when not yet converted
  • test_field_implicit_conversion_existing: Tests using existing converted HDA

test/unit/app/tools/test_dataset_matcher.py:46-75:

  • test_valid_hda_implicit_convered: Tests matching already-converted dataset
  • test_hda_match_implicit_can_convert: Tests matching when conversion needed
  • test_hda_match_properly_skips_conversion: Tests check_implicit_conversions=False

12.2 Integration Tests

test/integration/test_extended_metadata.py: Integration tests for conversion + metadata

13. Data Flow Diagram

Tool Execution Request
        │
        ▼
For Each Data Input Parameter
        │
        ▼
process_dataset(hda)
        │
        ▼
hda.find_conversion_destination(required_formats)
        │
        ├──► Direct Match?
        │    YES → Use HDA as-is
        │    NO  → Check if conversion possible
        │
        ▼
Conversion Possible?
        │
        ├──► YES → Check if Already Converted
        │    │     │
        │    │     ├──► Already Converted? (via ImplicitlyConvertedDatasetAssociation)
        │    │     │    YES → Use Converted Dataset
        │    │     │    NO  → get_converted_dataset()
        │    │
        │    NO → Error or Fallback
        │
        ▼
get_converted_dataset(trans, target_ext)
        │
        ▼
Execute Converter Tool (via job queue)
        │
        ▼
original_hda.attach_implicitly_converted_dataset(converted_hda)
        │
        ▼
Create ImplicitlyConvertedDatasetAssociation Link
        │
        ▼
Return converted_hda (visible=False, shares parent HID)

14. Key Files Summary

File | Lines | Purpose
lib/galaxy/model/__init__.py | 5452-5560, 6877-6930 | Model definitions, conversion methods
lib/galaxy/datatypes/data.py | 829-880 | convert_dataset() implementation
lib/galaxy/datatypes/registry.py | 665-687, 875-956 | Converter registration and discovery
lib/galaxy/tools/actions/__init__.py | 164-184 | Conversion trigger in tool execution
lib/galaxy/tools/parameters/basic.py | 2350-2480 | UI parameter building with conversion display
lib/galaxy/tools/parameters/dataset_matcher.py | 90-186 | Dataset matching logic
lib/galaxy/webapps/galaxy/api/tools.py | 802-849 | API endpoint for explicit conversion
test/unit/app/tools/test_data_parameters.py | 71-142 | Unit tests
test/unit/app/tools/test_dataset_matcher.py | 46-75 | Matcher unit tests

15. Open Questions

  1. Purging strategy: When are implicitly converted datasets purged vs kept?
  2. Collection conversions: How do implicit conversions work with dataset collections?
  3. Security model: Are implicit conversions subject to same permission checks?
  4. Workflow extraction: How does current extraction handle implicit conversions when HID has multiple datasets?
  5. Performance: What’s the overhead of conversion discovery for large histories?