Galaxy Dataset Collection Model Layer
Comprehensive reference for the model layer that underpins Galaxy’s dataset collection system.
Table of Contents
- Core Model Classes
- Database Tables and Columns
- SQLAlchemy Relationships
- Collection Type Plugin System
- Type Description and Registry
- Collection Structure and Matching
- Builder Pattern
- Subcollection Splitting and Adapters
- Implicit Collections
- Manager Layer
- Workflow Integration Models
- Collection Creation Flow
- Key Enums and Constants
- File Index
1. Core Model Classes
All core model classes live in lib/galaxy/model/__init__.py.
1.1 DatasetCollection
The root model representing a collection of datasets. This is the “pure data” object — it has no notion of history or library; it just holds the collection type, population state, and elements.
Table: dataset_collection
class DatasetCollection(Base, Dictifiable, UsesAnnotations, Serializable):
    __tablename__ = "dataset_collection"
Key Fields:
| Field | Type | Description |
|---|---|---|
| id | int (PK) | Primary key |
| collection_type | str(255) | e.g. "list", "paired", "list:paired" |
| populated_state | str(64) | "new", "ok", or "failed" |
| populated_state_message | TEXT | Error message if population failed |
| element_count | int (nullable) | Number of direct child elements |
| fields | JSON (nullable) | For record type: field definitions |
| column_definitions | JSON (nullable) | For sample_sheet type: column schema |
| create_time / update_time | datetime | Timestamps |
Key Properties:
- elements: ORM relationship to DatasetCollectionElement, ordered by element_index
- populated: True only if this collection AND all nested subcollections have populated_state == "ok"
- populated_optimized: DB-optimized version of populated that queries nested collections via SQL
- has_subcollections: True if ":" appears in collection_type
- allow_implicit_mapping: False for record type, True for everything else
- dataset_instances: Flattened list of all leaf HistoryDatasetAssociation objects (recursive)
- dataset_elements: Flattened list of all leaf DatasetCollectionElement objects (recursive)
- has_deferred_data: Whether any leaf dataset has state DEFERRED
- elements_datatypes: Aggregated extension counts across all leaf datasets
- elements_states: Aggregated state counts across all leaf datasets
- elements_deleted: Whether any leaf dataset is deleted
- dataset_action_tuples: Permission tuples for all leaf datasets
- state: Currently hardcoded to return "ok" (TODO in code)
Key Methods:
- __getitem__(key): Access element by integer index or string identifier
- copy(...): Deep-copy the collection and all elements
- copy_from(other): Copy elements from another collection into this one
- replace_elements_with_copies(replacements, history): Replace elements in-place with copies
- replace_failed_elements(replacements): Replace specific failed HDAs
- finalize(collection_type_description): Mark as populated, recursively
- mark_as_populated() / handle_population_failed(message): State transitions
- validate(): Ensures collection_type is set
- _build_nested_collection_attributes_stmt(...): Builds a SQL query that traverses nested collections via self-joins, used by many summary properties
- dataset_elements_and_identifiers(identifiers=None): Used in remote tool evaluation to walk elements with their identifier paths
- element_identifiers_extensions_paths_and_metadata_files: Returns tuples of (identifiers, extension, path, metadata_files) for all leaf datasets
The Nested Query Builder:
_build_nested_collection_attributes_stmt is a critical internal method. It builds a SQL SELECT that self-joins dataset_collection and dataset_collection_element tables once per nesting level (determined by counting ":" in collection_type). This avoids loading all elements into Python for summary computations like state counts, extension sets, and permission tuples.
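The "one self-join per nesting level" idea can be illustrated without Galaxy or SQLAlchemy. The following is a minimal sketch using the standard library's sqlite3 module; the function build_leaf_query and the reduced table layout are assumptions made for illustration, and unlike the real method it joins only the element table and selects only hda_id.

```python
import sqlite3


def build_leaf_query(collection_type: str) -> str:
    """Return SQL selecting leaf hda_ids, using one element-table alias per nesting level."""
    depth = collection_type.count(":") + 1  # "list:paired" has two levels
    sql = f"SELECT e{depth - 1}.hda_id FROM dataset_collection_element AS e0"
    for level in range(1, depth):
        # Each deeper level joins on the child collection of the level above it.
        sql += (
            f" JOIN dataset_collection_element AS e{level}"
            f" ON e{level}.dataset_collection_id = e{level - 1}.child_collection_id"
        )
    order = ", ".join(f"e{level}.element_index" for level in range(depth))
    return f"{sql} WHERE e0.dataset_collection_id = :root_id ORDER BY {order}"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_collection_element ("
    "id INTEGER PRIMARY KEY, dataset_collection_id INTEGER, hda_id INTEGER, "
    "child_collection_id INTEGER, element_index INTEGER, element_identifier TEXT)"
)
# (dataset_collection_id, hda_id, child_collection_id, element_index, element_identifier)
rows = [
    (1, None, 2, 0, "sample1"),
    (1, None, 3, 1, "sample2"),
    (2, 10, None, 0, "forward"),
    (2, 11, None, 1, "reverse"),
    (3, 12, None, 0, "forward"),
    (3, 13, None, 1, "reverse"),
]
conn.executemany(
    "INSERT INTO dataset_collection_element "
    "(dataset_collection_id, hda_id, child_collection_id, element_index, element_identifier) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)
leaf_hda_ids = [row[0] for row in conn.execute(build_leaf_query("list:paired"), {"root_id": 1})]
print(leaf_hda_ids)
```

A flat "list" needs no join at all, while "list:list:paired" needs two; the leaf rows come back in a single query instead of being loaded level by level in Python.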
1.2 DatasetCollectionElement
An element within a DatasetCollection. An element points to exactly one of: an HDA, an LDDA, or a child DatasetCollection (for nesting).
Table: dataset_collection_element
class DatasetCollectionElement(Base, Dictifiable, Serializable):
    __tablename__ = "dataset_collection_element"
Key Fields:
| Field | Type | Description |
|---|---|---|
| id | int (PK) | Primary key |
| dataset_collection_id | int (FK) | Parent collection |
| hda_id | int (FK, nullable) | Points to HDA if leaf element |
| ldda_id | int (FK, nullable) | Points to LDDA if leaf element |
| child_collection_id | int (FK, nullable) | Points to nested DatasetCollection |
| element_index | int | Ordering index within parent |
| element_identifier | str(255) | Human-readable name (e.g. "forward", "sample1") |
| columns | JSON (nullable) | For sample_sheet elements: row metadata values |
Constraint: Exactly one of hda_id, ldda_id, or child_collection_id should be non-null (enforced in __init__, not as a DB constraint).
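Because the rule lives in application code rather than in the schema, it helps to see what such a check looks like. This is a minimal sketch, not Galaxy's __init__ logic: the function name element_type mirrors the property described below, the error handling is an assumption, and the UNINITIALIZED_ELEMENT placeholder case is not modelled.

```python
from typing import Optional


def element_type(
    hda_id: Optional[int] = None,
    ldda_id: Optional[int] = None,
    child_collection_id: Optional[int] = None,
) -> str:
    """Return which target is set, insisting that exactly one of the three is non-null."""
    targets = {"hda": hda_id, "ldda": ldda_id, "dataset_collection": child_collection_id}
    populated = [name for name, value in targets.items() if value is not None]
    if len(populated) != 1:
        raise ValueError(f"expected exactly one target, got {populated or 'none'}")
    return populated[0]


print(element_type(hda_id=5))
print(element_type(child_collection_id=9))
```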
Key Properties:
- element_type: Returns "hda", "ldda", or "dataset_collection"
- is_collection: True if element_type == "dataset_collection"
- element_object: Returns whichever of hda, ldda, or child_collection is set
- dataset_instance: Returns the HDA/LDDA; raises AttributeError if this is a nested collection
- dataset: Returns the underlying Dataset object
- has_deferred_data: Delegates to element_object
- auto_propagated_tags: Tags from the first dataset instance that should auto-propagate
Special Sentinel:
DatasetCollectionElement.UNINITIALIZED_ELEMENT is an object() sentinel used during pre-creation of implicit collections. Elements are created with this placeholder and later populated when jobs complete.
1.3 DatasetCollectionInstance (Abstract Base)
Not a database table itself — a Python-level mixin/base class that provides shared behavior for HDCA and LDCA.
class DatasetCollectionInstance(HasName, UsesCreateAndUpdateTime):
Key Properties (delegated to .collection):
- state
- populated
- dataset_instances
- has_deferred_data
Key Methods:
- _base_to_dict(view): Common dict serialization
- set_from_dict(new_data): Update from dict, delegates to collection + own editable keys
1.4 HistoryDatasetCollectionAssociation (HDCA)
Associates a DatasetCollection with a History. This is the primary user-facing collection object.
Table: history_dataset_collection_association
class HistoryDatasetCollectionAssociation(
    Base, DatasetCollectionInstance, HasTags, Dictifiable, UsesAnnotations, Serializable
):
Key Fields:
| Field | Type | Description |
|---|---|---|
| id | int (PK) | Primary key |
| collection_id | int (FK) | Points to DatasetCollection |
| history_id | int (FK) | Parent history |
| name | str(255) | Display name |
| hid | int | History item ID (ordering within history) |
| visible | bool | Whether shown in history panel |
| deleted | bool | Soft-delete flag |
| copied_from_history_dataset_collection_association_id | int (FK, nullable) | Copy provenance |
| implicit_output_name | str(255) (nullable) | Tool output name if this is an implicit collection |
| job_id | int (FK, nullable) | The single job that produced this (for non-mapped tools) |
| implicit_collection_jobs_id | int (FK, nullable) | The ImplicitCollectionJobs group (for mapped tools) |
| create_time / update_time | datetime | Timestamps |
Key Relationships:
- collection: The DatasetCollection object
- history: The parent History
- implicit_input_collections: List of ImplicitlyCreatedDatasetCollectionInput (which input collections produced this)
- implicit_collection_jobs: The ImplicitCollectionJobs object (job group)
- job: Single Job (for non-implicit collections created by a tool)
- tags, annotations, ratings: Standard Galaxy associations
- creating_job_associations: JobToOutputDatasetCollectionAssociation (viewonly)
- copied_from_history_dataset_collection_association: Self-referential copy chain
- tool_request_association: Link to ToolRequestImplicitCollectionAssociation
Key Properties:
- job_source_type: Returns "ImplicitCollectionJobs" or "Job" or None
- job_source_id: Returns implicit_collection_jobs_id or job_id
- job_state_summary: Aggregate JobStateSummary (named tuple of state counts) computed via SQL
- dataset_dbkeys_and_extensions_summary: Tuple of (dbkeys, extensions, states, deleted, store_times)
- history_content_type: Always "dataset_collection"
- type_id: Hybrid property: "dataset_collection-{id}"
- waiting_for_elements: Checks job states + collection population state
Key Methods:
- copy(...): Deep copy: creates new HDCA with copied collection and elements
- add_implicit_input_collection(name, hdca): Records which input HDCA was mapped over
- find_implicit_input_collection(name): Looks up input HDCA by parameter name
- to_hda_representative(multiple=False): Gets representative HDA(s) from the collection
- contains_collection(collection_id): Uses recursive CTE SQL to check membership
- touch(): Flags for update (triggers update_time change)
1.5 LibraryDatasetCollectionAssociation (LDCA)
Associates a DatasetCollection with a library folder. Much simpler than HDCA.
Table: library_dataset_collection_association
Key Fields: id, collection_id (FK), folder_id (FK), name, deleted
Relationships: collection, folder, tags, annotations, ratings
1.6 CollectionStateSummary
A NamedTuple used to aggregate state information across all leaf datasets:
class CollectionStateSummary(NamedTuple):
    dbkeys: list[Union[str, None]]
    extensions: list[str]
    states: dict[str, int]
    deleted: int
2. Database Tables and Columns
Core Collection Tables
dataset_collection
├── id (PK)
├── collection_type (VARCHAR 255)
├── populated_state (VARCHAR 64, default "ok")
├── populated_state_message (TEXT)
├── element_count (INT, nullable)
├── fields (JSON, nullable) -- record type field definitions
├── column_definitions (JSON, nullable) -- sample_sheet column definitions
├── create_time (DATETIME)
└── update_time (DATETIME)
dataset_collection_element
├── id (PK)
├── dataset_collection_id (FK -> dataset_collection.id, indexed)
├── hda_id (FK -> history_dataset_association.id, nullable, indexed)
├── ldda_id (FK -> library_dataset_dataset_association.id, nullable, indexed)
├── child_collection_id (FK -> dataset_collection.id, nullable, indexed)
├── element_index (INT)
├── element_identifier (VARCHAR 255)
└── columns (JSON, nullable) -- sample_sheet row values
history_dataset_collection_association
├── id (PK)
├── collection_id (FK -> dataset_collection.id, indexed)
├── history_id (FK -> history.id, indexed)
├── name (VARCHAR 255)
├── hid (INT)
├── visible (BOOL)
├── deleted (BOOL, default false)
├── copied_from_history_dataset_collection_association_id (FK -> self.id)
├── implicit_output_name (VARCHAR 255, nullable)
├── job_id (FK -> job.id, nullable, indexed)
├── implicit_collection_jobs_id (FK -> implicit_collection_jobs.id, nullable, indexed)
├── create_time (DATETIME)
└── update_time (DATETIME, indexed)
library_dataset_collection_association
├── id (PK)
├── collection_id (FK -> dataset_collection.id, indexed)
├── folder_id (FK -> library_folder.id, indexed)
├── name (VARCHAR 255)
└── deleted (BOOL, default false)
Job-Collection Association Tables
job_to_input_dataset_collection
├── id (PK)
├── job_id (FK -> job.id, indexed)
├── dataset_collection_id (FK -> history_dataset_collection_association.id, indexed)
├── name (VARCHAR 255)
└── adapter (JSON, nullable) -- serialized adapter for ephemeral collections
job_to_input_dataset_collection_element
├── id (PK)
├── job_id (FK -> job.id, indexed)
├── dataset_collection_element_id (FK -> dataset_collection_element.id, indexed)
├── name (VARCHAR 255)
└── adapter (JSON, nullable)
job_to_output_dataset_collection
├── id (PK)
├── job_id (FK -> job.id, indexed)
├── dataset_collection_id (FK -> history_dataset_collection_association.id, indexed)
└── name (VARCHAR 255)
job_to_implicit_output_dataset_collection
├── id (PK)
├── job_id (FK -> job.id, indexed)
├── dataset_collection_id (FK -> dataset_collection.id, indexed)
└── name (VARCHAR 255)
Note the distinction: job_to_output_dataset_collection points to HDCA (the instance), while job_to_implicit_output_dataset_collection points to DatasetCollection (the raw collection). Many jobs map to one HDCA for output collections (when mapping over), but each job maps to at most one DatasetCollection per output.
Implicit Collection Tables
implicit_collection_jobs
├── id (PK)
└── populated_state (VARCHAR 64, default "new")
implicit_collection_jobs_job_association
├── id (PK)
├── implicit_collection_jobs_id (FK -> implicit_collection_jobs.id, indexed)
├── job_id (FK -> job.id, indexed)
└── order_index (INT)
implicitly_created_dataset_collection_inputs
├── id (PK)
├── dataset_collection_id (FK -> history_dataset_collection_association.id, indexed)
├── input_dataset_collection_id (FK -> history_dataset_collection_association.id, indexed)
└── name (VARCHAR 255)
tool_request_implicit_collection_association
├── id (PK)
├── tool_request_id (FK -> tool_request.id, indexed)
├── dataset_collection_id (FK -> history_dataset_collection_association.id, indexed)
└── output_name (VARCHAR 255)
3. SQLAlchemy Relationships
DatasetCollection
DatasetCollection.elements -> [DatasetCollectionElement] (ordered by element_index)
DatasetCollectionElement
DCE.collection -> DatasetCollection (parent, via dataset_collection_id)
DCE.hda -> HistoryDatasetAssociation
DCE.ldda -> LibraryDatasetDatasetAssociation
DCE.child_collection -> DatasetCollection (nested child, via child_collection_id)
HistoryDatasetCollectionAssociation
HDCA.collection -> DatasetCollection
HDCA.history -> History (back_populates="dataset_collections")
HDCA.implicit_input_collections -> [ImplicitlyCreatedDatasetCollectionInput]
HDCA.implicit_collection_jobs -> ImplicitCollectionJobs (uselist=False)
HDCA.job -> Job (uselist=False)
HDCA.tags -> [HistoryDatasetCollectionTagAssociation]
HDCA.annotations -> [HDCAAnnotationAssociation]
HDCA.ratings -> [HDCACollectionRatingAssociation]
HDCA.creating_job_associations -> [JobToOutputDatasetCollectionAssociation] (viewonly)
HDCA.copied_from_hdca -> HDCA (self-referential)
HDCA.tool_request_association -> ToolRequestImplicitCollectionAssociation
ImplicitCollectionJobs
ICJ.jobs -> [ImplicitCollectionJobsJobAssociation] (back_populates)
ImplicitCollectionJobsJobAssociation
ICJJA.implicit_collection_jobs -> ImplicitCollectionJobs
ICJJA.job -> Job
ImplicitlyCreatedDatasetCollectionInput
ICDCI.input_dataset_collection -> HDCA (the input that was mapped over)
4. Collection Type Plugin System
Directory: lib/galaxy/model/dataset_collections/types/
Each collection type is a plugin class inheriting from BaseDatasetCollectionType.
BaseDatasetCollectionType (Abstract Base)
class BaseDatasetCollectionType(metaclass=ABCMeta):
    collection_type: str

    @abstractmethod
    def generate_elements(
        self, dataset_instances: DatasetInstanceMapping, **kwds
    ) -> Iterable[DatasetCollectionElement]:
        """Generate elements from a mapping of identifier->dataset/collection."""

    def _validation_failed(self, message):
        raise ObjectAttributeInvalidException(message)
DatasetInstanceMapping is Mapping[str, Union[DatasetCollection, DatasetInstance]] — an ordered dict from element identifier to either a dataset or a nested collection.
ListDatasetCollectionType
File: types/list.py
collection_type: "list"
The simplest type. Accepts any number of elements with arbitrary identifiers. Just yields a DatasetCollectionElement for each entry in dataset_instances.
def generate_elements(self, dataset_instances, **kwds):
    for identifier, element in dataset_instances.items():
        yield DatasetCollectionElement(element=element, element_identifier=identifier)
No validation on identifiers, no fixed element count.
PairedDatasetCollectionType
File: types/paired.py
collection_type: "paired"
Fixed structure: exactly two elements with identifiers "forward" and "reverse".
Also provides prototype_elements() which yields two placeholder elements — used when the structure must be known before actual data exists (e.g. for pre-creating implicit output collections).
Constants: FORWARD_IDENTIFIER = "forward", REVERSE_IDENTIFIER = "reverse"
PairedOrUnpairedDatasetCollectionType
File: types/paired_or_unpaired.py
collection_type: "paired_or_unpaired"
A union type: either 1 element ("unpaired") or 2 elements ("forward" + "reverse"). Validates element count is 1 or 2.
Constant: SINGLETON_IDENTIFIER = "unpaired"
This is used for genomics workflows where samples may be single-end or paired-end reads. A paired_or_unpaired input can consume either a paired collection (treated as its forward/reverse case) or a single dataset (wrapped as unpaired).
RecordDatasetCollectionType
File: types/record.py
collection_type: "record"
CWL-style heterogeneous named fields. Requires a fields keyword argument providing the schema. Validates that supplied elements match the field definitions.
Also provides prototype_elements(fields=...) for pre-creation.
Important: DatasetCollection.allow_implicit_mapping returns False for record types, meaning records cannot be mapped over.
SampleSheetDatasetCollectionType
File: types/sample_sheet.py
collection_type: "sample_sheet"
A list-like collection where each element carries additional column metadata (columns field on DatasetCollectionElement). Requires rows and column_definitions keyword arguments. Validates each row against the column definitions.
Collection Type Nesting
Collection types compose via ":" separator. For "list:paired":
- The outer level is a list (arbitrary count, arbitrary identifiers)
- Each element at the outer level is a DatasetCollection of type "paired"
- Those inner collections each have "forward" and "reverse" elements
The nesting is stored as:
DatasetCollection(collection_type="list:paired")
├── DatasetCollectionElement(element_identifier="sample1", child_collection_id=X)
│   └── DatasetCollection(collection_type="paired")
│       ├── DatasetCollectionElement(element_identifier="forward", hda_id=A)
│       └── DatasetCollectionElement(element_identifier="reverse", hda_id=B)
└── DatasetCollectionElement(element_identifier="sample2", child_collection_id=Y)
    └── DatasetCollection(collection_type="paired")
        ├── DatasetCollectionElement(element_identifier="forward", hda_id=C)
        └── DatasetCollectionElement(element_identifier="reverse", hda_id=D)
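To make the nesting concrete, the structure above can be modelled as plain nested dictionaries and flattened recursively, which is roughly what the flattened properties such as dataset_instances expose. This is only a sketch: the dict representation and the helper walk_leaves are illustrative assumptions, not Galaxy objects.

```python
def walk_leaves(collection, path=()):
    """Yield (identifier_path, leaf) pairs, descending into nested collections."""
    for identifier, element in collection.items():
        if isinstance(element, dict):
            # A nested collection: recurse and extend the identifier path.
            yield from walk_leaves(element, path + (identifier,))
        else:
            yield path + (identifier,), element


# Stand-in for the list:paired collection drawn above; letters stand in for HDAs.
list_paired = {
    "sample1": {"forward": "A", "reverse": "B"},
    "sample2": {"forward": "C", "reverse": "D"},
}
leaves = list(walk_leaves(list_paired))
for identifier_path, hda in leaves:
    print(":".join(identifier_path), hda)
```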
5. Type Description and Registry
CollectionTypeDescriptionFactory
File: lib/galaxy/model/dataset_collections/type_description.py
Factory for creating CollectionTypeDescription instances:
COLLECTION_TYPE_DESCRIPTION_FACTORY = CollectionTypeDescriptionFactory()
The factory holds a reference to the type registry (though it doesn’t heavily use it yet).
CollectionTypeDescription
The workhorse class for reasoning about collection types at an abstract level. Wraps a collection type string (e.g. "list:paired") and provides methods for querying type relationships.
Key Methods:
| Method | Description |
|---|---|
| rank_collection_type() | Outermost type. "list:paired" -> "list" |
| child_collection_type() | Inner type after removing rank. "list:paired" -> "paired" |
| child_collection_type_description() | Returns a new CollectionTypeDescription for the child type |
| has_subcollections() | True if ":" in type |
| subcollection_type_description() | Same as child_collection_type_description() |
| effective_collection_type(subcollection_type) | Computes remaining type after consuming a subcollection. "list:list:paired".effective("paired") -> "list:list" |
| has_subcollections_of_type(other) | Whether this type contains proper subcollections of another type |
| is_subcollection_of_type(other) | Inverse of above |
| can_match_type(other) | Whether two types are compatible for linked matching |
| multiply(other) | Combines types: "list" * "paired" -> "list:paired" |
| rank_type_plugin() | Returns the plugin instance for the outermost rank |
| dimension | Number of type components + 1. "list" = 2, "list:paired" = 3 |
| validate() | Checks against COLLECTION_TYPE_REGEX |
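Most of the methods in the table reduce to string operations on the ":"-separated type. The sketch below reproduces the examples from the table as standalone functions; it is an approximation that assumes plain suffix handling and ignores the paired_or_unpaired and "single_datasets" special cases, so it should not be read as the class's actual implementation.

```python
def rank_collection_type(collection_type: str) -> str:
    """Outermost component, e.g. "list:paired" -> "list"."""
    return collection_type.split(":")[0]


def child_collection_type(collection_type: str) -> str:
    """Everything after the outermost component, e.g. "list:paired" -> "paired"."""
    return collection_type.partition(":")[2]


def effective_collection_type(collection_type: str, subcollection_type: str) -> str:
    """Type left to map over after consuming a trailing subcollection type."""
    suffix = ":" + subcollection_type
    if not collection_type.endswith(suffix):
        raise ValueError(f"{collection_type!r} has no subcollections of type {subcollection_type!r}")
    return collection_type[: -len(suffix)]


def multiply(outer: str, inner: str) -> str:
    """Combine two types, e.g. "list" * "paired" -> "list:paired"."""
    return f"{outer}:{inner}"


def dimension(collection_type: str) -> int:
    """Number of type components plus one."""
    return len(collection_type.split(":")) + 1


print(effective_collection_type("list:list:paired", "paired"))
print(multiply("list", "paired"), dimension("list:paired"))
```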
Special Type Matching for paired_or_unpaired:
has_subcollections_of_type:
- paired_or_unpaired is treated as a subcollection of anything except "paired" (since paired exactly matches it)
- "single_datasets" is a valid subcollection type for any collection
can_match_type:
- paired can match paired_or_unpaired
- Types ending in :paired_or_unpaired can match the corresponding plain list or the paired list variant
Collection Type Validation Regex
COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*"
    r"|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)
Valid types: any combination of list, paired, paired_or_unpaired, record separated by :, OR sample_sheet optionally followed by one of :paired, :record, :paired_or_unpaired.
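A quick way to check these rules is to run a few type strings through the pattern. The regex below is copied from the definition above; the helper is_valid_collection_type is a name introduced here for illustration only.

```python
import re

COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*"
    r"|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)


def is_valid_collection_type(collection_type: str) -> bool:
    """True if the string is accepted by the validation pattern."""
    return COLLECTION_TYPE_REGEX.match(collection_type) is not None


for candidate in ["list:paired", "sample_sheet:paired", "sample_sheet:list", "list:sample_sheet", "list:"]:
    print(candidate, is_valid_collection_type(candidate))
```

Note that sample_sheet may only appear as the outermost rank, and it cannot wrap a list.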
DatasetCollectionTypesRegistry
File: lib/galaxy/model/dataset_collections/registry.py
PLUGIN_CLASSES = [
    ListDatasetCollectionType,
    PairedDatasetCollectionType,
    RecordDatasetCollectionType,
    PairedOrUnpairedDatasetCollectionType,
    SampleSheetDatasetCollectionType,
]

class DatasetCollectionTypesRegistry:
    def __init__(self):
        self.__plugins = {p.collection_type: p() for p in PLUGIN_CLASSES}

    def get(self, plugin_type) -> BaseDatasetCollectionType
    def prototype(self, plugin_type, fields=None) -> DatasetCollection
The prototype() method creates an empty DatasetCollection with placeholder elements matching the type structure. Only works for types with prototype_elements (paired and record).
Singleton: DATASET_COLLECTION_TYPES_REGISTRY = DatasetCollectionTypesRegistry()
6. Collection Structure and Matching
6.1 Structure Module
File: lib/galaxy/model/dataset_collections/structure.py
Provides tree-based reasoning about collection structure for matching and output computation.
Leaf
Represents a terminal node (single dataset). Singleton: leaf = Leaf().
- is_leaf = True
- multiply(other) = other.clone() (leaf times anything = that thing)
- __len__() = 1
UninitializedTree
A tree whose children are not yet known — only the collection type is known. Used for output structures before jobs have run.
- children_known = False
- multiply(other): Produces another UninitializedTree with combined type
Tree
A fully materialized tree with known children.
class Tree(BaseTree):
    def __init__(self, children, collection_type_description, when_values=None):
        self.children = children  # list of (identifier, substructure) tuples
        self.when_values = when_values
Key Methods:
- Tree.for_dataset_collection(dc, ctd): Static factory. Recursively walks a DatasetCollection’s elements to build the tree. Leaf elements become leaf nodes; subcollections become subtrees.
- walk_collections(hdca_dict): Iterates over the tree structure, yielding (element_dict, when_value) tuples at leaf positions. Each element_dict maps input names to their corresponding DatasetCollectionElement at that position. This is the core iterator used during job expansion.
- can_match(other_structure): Checks structural compatibility:
  - Collection types must be compatible (via can_match_type)
  - Same number of children at each level
  - Nested structure matches recursively
- multiply(other_structure): Computes the cross-product structure. If mapping a list[a,b] over a tool that outputs paired, the result is list:paired with {a: paired, b: paired}.
- clone(): Deep copy.
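The can_match and multiply rules can be sketched on a toy tree. The class SketchTree below is a hypothetical stand-in, not Galaxy's Tree: it stores the collection type as a plain string, uses None for the leaf singleton, and approximates can_match_type with string equality.

```python
LEAF = None  # stand-in for the leaf singleton


class SketchTree:
    def __init__(self, collection_type, children):
        self.collection_type = collection_type
        self.children = children  # list of (identifier, SketchTree or LEAF)

    def clone(self):
        return SketchTree(
            self.collection_type,
            [(i, c if c is LEAF else c.clone()) for i, c in self.children],
        )

    def can_match(self, other):
        # Assumption: equality stands in for the real can_match_type check.
        if self.collection_type != other.collection_type:
            return False
        if len(self.children) != len(other.children):
            return False
        for (_, mine), (_, theirs) in zip(self.children, other.children):
            if (mine is LEAF) != (theirs is LEAF):
                return False
            if mine is not LEAF and not mine.can_match(theirs):
                return False
        return True

    def multiply(self, other):
        # Every leaf position is replaced by a copy of the other structure.
        children = [
            (i, other.clone() if c is LEAF else c.multiply(other))
            for i, c in self.children
        ]
        return SketchTree(f"{self.collection_type}:{other.collection_type}", children)


outer = SketchTree("list", [("a", LEAF), ("b", LEAF)])
paired = SketchTree("paired", [("forward", LEAF), ("reverse", LEAF)])
product_tree = outer.multiply(paired)
print(product_tree.collection_type, [i for i, _ in product_tree.children])
```

Here mapping list[a,b] over a paired output gives a list:paired structure whose children a and b are each a paired subtree, matching the multiply example above.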
get_structure()
def get_structure(dataset_collection_instance, collection_type_description, leaf_subcollection_type=None):
Converts a DatasetCollectionInstance (HDCA or DCE) into a Tree. If leaf_subcollection_type is specified, computes the effective type after consuming that subcollection type.
tool_output_to_structure()
def tool_output_to_structure(get_sliced_input_collection_structure, tool_output, collections_manager):
Determines the output structure for a tool output. Handles three cases:
- structured_like: Output mirrors an input’s structure
- collection_type_source: Output type derived from an input’s type
- Explicit collection_type: Fixed output type
6.2 Matching Module
File: lib/galaxy/model/dataset_collections/matching.py
CollectionsToMatch
Accumulates collections that need to be matched for tool execution:
class CollectionsToMatch:
    def add(self, input_name, hdca, subcollection_type=None, linked=True):
Each added collection records:
- hdca: The HDCA being mapped over
- subcollection_type: The subcollection type to consume (e.g. "paired" when mapping list:paired over a paired input)
- linked: Whether this collection should be linked (dot-product) or unlinked (cross-product) with others
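The difference between linked and unlinked matching is the difference between pairing elements position by position and pairing every element with every other. A small sketch with the standard library, using made-up element names:

```python
from itertools import product

reads = ["reads_a", "reads_b"]
references = ["ref_1", "ref_2"]

# Linked (dot-product): elements at the same position are paired, one job per position.
linked_jobs = list(zip(reads, references))

# Unlinked (cross-product): every combination is paired, one job per combination.
unlinked_jobs = list(product(reads, references))

print(len(linked_jobs), linked_jobs)
print(len(unlinked_jobs), unlinked_jobs)
```

Linked collections must therefore have compatible structures, while unlinked collections multiply the number of jobs.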
MatchingCollections
Result of matching collections. Created by MatchingCollections.for_collections(collections_to_match, type_descriptions).
class MatchingCollections:
    linked_structure: Tree        # Structure for linked collections
    unlinked_structures: [Tree]   # Structures for cross-product collections
    collections: dict             # input_name -> hdca
    subcollection_types: dict     # input_name -> subcollection_type
Key Methods:
- slice_collections(): Calls linked_structure.walk_collections() to iterate over element slices. Each slice is a dict mapping input names to their DatasetCollectionElement at that position.
- structure: The effective mapping structure: cross-product of unlinked structures multiplied by the linked structure. Returns None if everything is a leaf (no mapping).
- is_mapped_over(input_name): Whether a specific input is being mapped over.
- map_over_action_tuples(input_name): Permission tuples for the mapped-over collection.
6.3 Query Module
File: lib/galaxy/model/dataset_collections/query.py
HistoryQuery determines which collections in a history are valid for a given tool parameter:
class HistoryQuery:
    def direct_match(self, hdca) -> bool  # Can this HDCA be used directly?
    def can_map_over(self, hdca)  # Can this HDCA be mapped over? Returns type description or False.
Key behavior: from_collection_types sorts type descriptions by dimension (descending) so that subcollection mapping defaults to providing the tool as much data as possible.
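The effect of the descending sort can be shown with a small sketch. The helper pick_subcollection_type is hypothetical and uses a simple suffix test in place of the real type-compatibility checks; it only illustrates why trying the highest-dimension type first hands the tool the largest possible subcollection.

```python
def dimension(collection_type: str) -> int:
    """Number of type components plus one, as in CollectionTypeDescription.dimension."""
    return len(collection_type.split(":")) + 1


def pick_subcollection_type(accepted_types, hdca_type):
    """Return the largest accepted type that the HDCA can be split into, or None."""
    for candidate in sorted(accepted_types, key=dimension, reverse=True):
        # Assumption: a trailing-suffix test approximates "has subcollections of type".
        if hdca_type.endswith(":" + candidate):
            return candidate
    return None


# A tool input accepting either "paired" or "list:paired", given a list:list:paired HDCA.
choice = pick_subcollection_type(["paired", "list:paired"], "list:list:paired")
print(choice)
```

With this ordering each job receives a whole list:paired rather than a single pair, and the remaining outer list is what gets mapped over.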
7. Builder Pattern
File: lib/galaxy/model/dataset_collections/builder.py
build_collection()
Top-level function to create a DatasetCollection:
def build_collection(type, dataset_instances, collection=None, associated_identifiers=None,
fields=None, column_definitions=None, rows=None) -> DatasetCollection:
- Creates or reuses a DatasetCollection
- Calls set_collection_elements() which invokes type.generate_elements() on the plugin
- Assigns element_index to each element
- Adds elements to collection
- Updates element_count
CollectionBuilder
Functional builder for constructing collections programmatically (used when creating implicit output collections):
builder = CollectionBuilder(collection_type_description)
builder.add_dataset("sample1", hda1)
builder.add_dataset("sample2", hda2)
collection = builder.build()
For nested collections:
builder = CollectionBuilder(ctd_for("list:paired"))
level = builder.get_level("sample1") # returns sub-builder for "paired"
level.add_dataset("forward", hda_f)
level.add_dataset("reverse", hda_r)
collection = builder.build()
Key Methods:
- get_level(identifier): Returns a sub-CollectionBuilder for nested collections
- add_dataset(identifier, dataset_instance): Adds a leaf element
- build(): Constructs the DatasetCollection, calling the type plugin’s generate_elements
- build_elements(): Recursively builds element dict (sub-builders -> sub-collections)
- replace_elements_in_collection(template, replacements): Creates a new collection with replaced datasets
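The recursive sub-builder idea can be sketched in a few lines. SketchCollectionBuilder below is a hypothetical stand-in that produces plain dictionaries instead of DatasetCollection objects and skips the type plugins entirely; it only shows how get_level, add_dataset, and build fit together.

```python
class SketchCollectionBuilder:
    def __init__(self, collection_type):
        self.collection_type = collection_type
        self._elements = {}  # identifier -> dataset or nested builder, in insertion order

    def get_level(self, identifier):
        """Return (creating if needed) the sub-builder for a nested element."""
        child_type = self.collection_type.partition(":")[2]
        if not child_type:
            raise ValueError(f"{self.collection_type!r} has no nested level")
        return self._elements.setdefault(identifier, SketchCollectionBuilder(child_type))

    def add_dataset(self, identifier, dataset):
        self._elements[identifier] = dataset

    def build(self):
        elements = []
        for index, (identifier, value) in enumerate(self._elements.items()):
            # Sub-builders become nested collections; anything else is a leaf.
            built = value.build() if isinstance(value, SketchCollectionBuilder) else value
            elements.append(
                {"element_index": index, "element_identifier": identifier, "element": built}
            )
        return {
            "collection_type": self.collection_type,
            "element_count": len(elements),
            "elements": elements,
        }


builder = SketchCollectionBuilder("list:paired")
level = builder.get_level("sample1")
level.add_dataset("forward", "hda_f")
level.add_dataset("reverse", "hda_r")
collection = builder.build()
print(collection["collection_type"], collection["element_count"])
```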
BoundCollectionBuilder
Extends CollectionBuilder for modifying an existing (unpopulated) DatasetCollection:
class BoundCollectionBuilder(CollectionBuilder):
    def __init__(self, dataset_collection):
        # Reads type from the existing collection
        # Fails if collection is already populated

    def populate_partial(self):
        # Adds elements without marking as populated

    def populate(self):
        # Adds elements AND marks as populated
This is used during job completion to fill in elements of pre-created implicit collections.
8. Subcollection Splitting and Adapters
8.1 Subcollection Splitting
File: lib/galaxy/model/dataset_collections/subcollections.py
When mapping a nested collection over a subcollection type, the collection must be split into its subcollections.
def split_dataset_collection_instance(hdca, collection_type) -> SplitReturnType:
For example, splitting a list:paired by "paired" yields a list of DatasetCollectionElement objects, each pointing to a paired child collection.
Special case: splitting by "single_datasets" wraps each leaf element in a PromoteCollectionElementToCollectionAdapter, effectively treating each dataset as a paired_or_unpaired collection with a single "unpaired" element.
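Splitting amounts to descending through the outer levels until the remaining type equals the requested subcollection type. A minimal sketch on nested dictionaries follows; split_collection is a hypothetical helper, it uses a plain suffix test for type compatibility, and it does not model the "single_datasets" adapter case.

```python
def split_collection(collection_type, elements, subcollection_type):
    """Return (identifier_path, subcollection) pairs for each subcollection of the given type."""
    if not collection_type.endswith(":" + subcollection_type):
        raise ValueError(f"cannot split {collection_type!r} by {subcollection_type!r}")
    # Number of outer levels to walk through before reaching the subcollections.
    levels = collection_type.count(":") - subcollection_type.count(":")

    def descend(node, depth, path):
        if depth == 0:
            yield path, node
        else:
            for identifier, child in node.items():
                yield from descend(child, depth - 1, path + (identifier,))

    return list(descend(elements, levels, ()))


list_paired = {
    "sample1": {"forward": "A", "reverse": "B"},
    "sample2": {"forward": "C", "reverse": "D"},
}
pairs = split_collection("list:paired", list_paired, "paired")
for path, pair in pairs:
    print(path, pair)
```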
8.2 Adapters
File: lib/galaxy/model/dataset_collections/adapters.py
Adapters create ephemeral/pseudo collections that don’t exist in the database but present a collection-like interface for tool processing.
CollectionAdapter (Abstract Base)
Defines the interface: dataset_action_tuples, dataset_states_and_extensions_summary, dataset_instances, elements, to_adapter_model(), adapting, collection, collection_type.
DCECollectionAdapter
Wraps a DatasetCollectionElement to act as a collection. If the element has a child collection, delegates to it; if it’s a leaf dataset, wraps it.
PromoteCollectionElementToCollectionAdapter
Extends DCECollectionAdapter. Promotes a single leaf element to act as a paired_or_unpaired collection (the "single_datasets" subcollection mapping case).
- collection_type always returns "paired_or_unpaired"
PromoteDatasetToCollection
Wraps a single HDA to act as either a list or paired_or_unpaired collection.
- For paired_or_unpaired, the element identifier is "unpaired"
- For list, the element identifier is the HDA’s name
PromoteDatasetsToCollection
Wraps multiple HDAs to act as either a paired or paired_or_unpaired collection.
TransientCollectionAdapterDatasetInstanceElement
Lightweight stand-in for DatasetCollectionElement used by adapters. Has element_identifier, hda, child_collection=None, element_object, dataset_instance, is_collection=False, columns=None.
recover_adapter()
Reconstructs an adapter from its serialized model (stored in job_to_input_dataset_collection.adapter JSON field).
Adapter Serialization
Each adapter has a to_adapter_model() method that returns a Pydantic model describing how to reconstruct it. This is stored as JSON in the adapter column of the job input association tables.
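The serialize-then-recover round trip can be sketched with the standard library. Everything named here is an assumption made for illustration: the dataclass PromoteDatasetModel stands in for the Pydantic model, its field names are invented, and recover_adapter_sketch returns a plain dict rather than an adapter object. Only the identifier rule ("unpaired" versus the HDA's name) comes from the PromoteDatasetToCollection description above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PromoteDatasetModel:
    """Hypothetical stand-in for the model returned by to_adapter_model()."""

    adapter_type: str
    collection_type: str
    hda_id: int


def serialize_adapter(model: PromoteDatasetModel) -> str:
    """Produce the JSON that would be stored in the adapter column."""
    return json.dumps(asdict(model))


def recover_adapter_sketch(payload: str, load_hda):
    """Rebuild a collection-like view from the stored JSON and an HDA loader."""
    data = json.loads(payload)
    if data["adapter_type"] != "PromoteDatasetToCollection":
        raise ValueError(f"unknown adapter type {data['adapter_type']!r}")
    hda = load_hda(data["hda_id"])
    if data["collection_type"] == "paired_or_unpaired":
        identifier = "unpaired"
    else:
        identifier = hda["name"]
    return {"collection_type": data["collection_type"], "elements": {identifier: hda}}


hdas = {7: {"id": 7, "name": "reads.fastq"}}
payload = serialize_adapter(PromoteDatasetModel("PromoteDatasetToCollection", "list", 7))
recovered = recover_adapter_sketch(payload, hdas.__getitem__)
print(payload)
print(recovered)
```

Storing only identifiers and the adapter kind keeps the JSON small, since the datasets themselves are reloaded from the database at recovery time.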
9. Implicit Collections
“Implicit collections” are output collections created automatically when mapping a collection over tool inputs. The system pre-creates these before jobs run and populates them as jobs complete.
9.1 ImplicitCollectionJobs
Groups all jobs that were created by mapping a collection over a tool.
Table: implicit_collection_jobs
class ImplicitCollectionJobs(Base, Serializable):
    populated_state: str  # "new", "ok", or "failed"
    jobs: [ImplicitCollectionJobsJobAssociation]
States (inner enum populated_states):
- NEW: Job associations not yet finalized
- OK: All job associations set and fixed
- FAILED: Problem populating job associations
Key Properties:
- job_list: Returns all associated Job objects
- get_job_attributes(attributes): Efficient query for specific job attributes
9.2 ImplicitCollectionJobsJobAssociation
Links an ImplicitCollectionJobs to a specific Job with an ordering index.
Fields: implicit_collection_jobs_id, job_id, order_index
9.3 ImplicitlyCreatedDatasetCollectionInput
Records which input collections were used to create an implicit output collection. Stored on the HDCA.
Fields: dataset_collection_id (the output HDCA), input_dataset_collection_id (the input HDCA), name (the input parameter name)
9.4 ToolRequestImplicitCollectionAssociation
Links a ToolRequest to its implicit output collections.
Fields: tool_request_id, dataset_collection_id (output HDCA), output_name
9.5 How Implicit Collections Are Created
- Pre-creation (DatasetCollectionManager.precreate_dataset_collection_instance):
  - Determines the output structure by multiplying the mapping structure with the output structure
  - Creates DatasetCollection objects with populated_state="new" (or UNINITIALIZED_ELEMENT sentinels)
  - Creates the HDCA, sets implicit_output_name and implicit_input_collections
  - For known structures (children_known=True), creates placeholder elements pointing to UNINITIALIZED_ELEMENT or sub-collections
- Job creation: Each element position in the mapping structure becomes a separate job. An ImplicitCollectionJobs object groups them.
- Job completion: The BoundCollectionBuilder is used to populate the pre-created collection with actual output datasets. populate_partial() adds elements incrementally; populate() marks the collection as complete.
- Finalization: DatasetCollection.finalize() recursively marks all subcollections as populated.
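The lifecycle above (pre-create with placeholders, fill in as jobs finish, then mark complete) can be sketched with plain Python objects. This is a simplified model under stated assumptions, not Galaxy's implementation: `ToyCollection` and `ToyBoundBuilder` are hypothetical classes, `UNINITIALIZED` stands in for the UNINITIALIZED_ELEMENT sentinel, and datasets are represented by strings.

```python
UNINITIALIZED = object()  # stand-in for the UNINITIALIZED_ELEMENT sentinel


class ToyCollection:
    def __init__(self, identifiers):
        self.populated_state = "new"
        # identifier -> dataset, sub-collection, or placeholder
        self.elements = {identifier: UNINITIALIZED for identifier in identifiers}

    @property
    def populated(self):
        # True only if this collection and every nested one is "ok".
        if self.populated_state != "ok":
            return False
        return all(
            child.populated
            for child in self.elements.values()
            if isinstance(child, ToyCollection)
        )

    def finalize(self):
        # Recursively mark this collection and its sub-collections as populated.
        self.populated_state = "ok"
        for child in self.elements.values():
            if isinstance(child, ToyCollection):
                child.finalize()


class ToyBoundBuilder:
    """Bound to an existing, pre-created collection."""

    def __init__(self, collection):
        self.collection = collection
        self.pending = {}

    def add_dataset(self, identifier, dataset):
        self.pending[identifier] = dataset

    def populate_partial(self):
        # Replace placeholders for the outputs that are ready so far.
        self.collection.elements.update(self.pending)
        self.pending.clear()

    def populate(self):
        self.populate_partial()
        self.collection.populated_state = "ok"


outputs = ToyCollection(["sample1", "sample2"])
builder = ToyBoundBuilder(outputs)
builder.add_dataset("sample1", "hda-1")
builder.populate_partial()
partially_filled = outputs.elements["sample2"] is UNINITIALIZED and not outputs.populated
builder.add_dataset("sample2", "hda-2")
builder.populate()
```

Pre-creating the container means the history can show the output collection immediately, while `populated` stays false until every job has reported in.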
10. Manager Layer
10.1 DatasetCollectionManager
File: lib/galaxy/managers/collections.py
Primary service interface for collection operations. Manages:
- Collection creation (user-initiated and implicit)
- Pre-creation of implicit output collections
- Collection matching for tool execution
- CRUD operations
- Rule-based collection building
Constructor dependencies: model (ORM session), security, hda_manager, history_manager, ldda_manager, short_term_storage_monitor
Key Method: create()
def create(self, trans, parent, name, collection_type, element_identifiers=None,
elements=None, ...) -> DatasetCollectionInstance:
Flow:
- Validates and sanitizes element identifiers (unless trusted)
- Creates the DatasetCollection via create_dataset_collection()
- Wraps in an HDCA or LDCA via _create_instance_for_collection()
- Handles tags (from list or dict), implicit inputs, implicit output name
- Persists to database
Key Method: create_dataset_collection()
def create_dataset_collection(self, trans, collection_type, element_identifiers=None,
elements=None, ...) -> DatasetCollection:
Flow:
- Gets CollectionTypeDescription for the type
- If element_identifiers provided (API request), resolves them to actual objects via __load_elements()
- If elements provided (internal), recursively creates subcollections if needed
- Calls builder.build_collection() with the type plugin
- Sets collection_type on the result
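The two entry paths in this flow (identifiers from an API request versus already-resolved elements) can be sketched as follows. This is an illustration under assumptions, not the manager's code: `create_toy_collection`, `load_elements`, and the `lookup` table are hypothetical, collections are modelled as plain dicts, and nested dicts on the internal path stand in for subcollections still to be created.

```python
def load_elements(element_identifiers, lookup):
    """Resolve API-style identifiers to objects, recursing into new collections."""
    elements = {}
    for identifier in element_identifiers:
        name, src = identifier["name"], identifier["src"]
        if src == "new_collection":
            elements[name] = create_toy_collection(
                identifier["collection_type"],
                element_identifiers=identifier["element_identifiers"],
                lookup=lookup,
            )
        else:
            # Hypothetical lookup keyed by (src, id) instead of real managers.
            elements[name] = lookup[(src, identifier["id"])]
    return elements


def create_toy_collection(collection_type, element_identifiers=None, elements=None, lookup=None):
    if element_identifiers is not None:
        # API path: identifiers must be resolved to objects first.
        elements = load_elements(element_identifiers, lookup or {})
    elif elements is None:
        raise ValueError("either element_identifiers or elements is required")
    else:
        # Internal path: nested dicts become subcollections of the remaining type.
        _, _, subtype = collection_type.partition(":")
        elements = {
            name: create_toy_collection(subtype, elements=value) if isinstance(value, dict) else value
            for name, value in elements.items()
        }
    return {"collection_type": collection_type, "elements": elements}


lookup = {("hda", 1): "HDA#1", ("hda", 2): "HDA#2"}
from_api = create_toy_collection(
    "list:paired",
    element_identifiers=[
        {
            "name": "sample1",
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [
                {"name": "forward", "src": "hda", "id": 1},
                {"name": "reverse", "src": "hda", "id": 2},
            ],
        }
    ],
    lookup=lookup,
)
internal = create_toy_collection(
    "list:paired",
    elements={"sample1": {"forward": "HDA#1", "reverse": "HDA#2"}},
)
```

Both paths converge on the same nested result, which is why the later build step does not need to know where the elements came from.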
Key Method: precreate_dataset_collection_instance()
For implicit collections — creates the HDCA with a pre-created (possibly partially initialized) collection.
Key Method: match_collections()
def match_collections(self, collections_to_match) -> MatchingCollections:
return MatchingCollections.for_collections(collections_to_match, self.collection_type_descriptions)
Key Method: apply_rules()
Applies rule-based collection manipulation. Takes an existing HDCA, flattens it into tabular data plus the corresponding sources, applies a rule set, and then builds new elements from the result.
10.2 HDCAManager
File: lib/galaxy/managers/hdcas.py
CRUD manager for HDCAs. Extends ModelManager, AccessibleManagerMixin, OwnableManagerMixin, PurgableManagerMixin, AnnotatableManagerMixin.
Key Method: map_datasets(content, fn, *parents)
Recursively iterates over all datasets in a collection, calling fn on each. Used for bulk attribute updates.
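A minimal sketch of this traversal, assuming nested dicts in place of collections and strings in place of datasets (both hypothetical simplifications; the real method walks ORM elements). The `*parents` argument accumulates the chain of containers, so `fn` can see where each dataset sits.

```python
def map_datasets(content, fn, *parents):
    """Call fn(dataset, *parents) for every leaf, depth first, and collect results."""
    results = []
    for identifier, child in content.items():
        path = (identifier,) + parents  # nearest parent first
        if isinstance(child, dict):
            results.extend(map_datasets(child, fn, *path))
        else:
            results.append(fn(child, *path))
    return results


nested = {
    "sample1": {"forward": "s1_R1.fq", "reverse": "s1_R2.fq"},
    "sample2": {"forward": "s2_R1.fq", "reverse": "s2_R2.fq"},
}
labels = map_datasets(
    nested,
    lambda dataset, *parents: "/".join(reversed(parents)) + " -> " + dataset,
)
```

Returning a flat list keeps bulk updates simple: the caller gets one result per leaf dataset regardless of nesting depth.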
Ownership: Checked via history.user == user (or anonymous session match).
10.3 Serializers
lib/galaxy/managers/hdcas.py also contains the serializers:
- DCESerializer: Serializes DatasetCollectionElement (delegates to HDASerializer or DCSerializer based on element type)
- DCSerializer: Serializes DatasetCollection
- DCASerializer: Abstract base for instance serializers (proxies most fields to DCSerializer)
- HDCASerializer: Serializes HDCA with full history context (hid, visible, deleted, job state summary, tags, etc.)
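The delegation between element and collection serializers can be pictured with three mutually recursive functions. This is only a sketch: the function names are hypothetical stand-ins for the serializer classes, inputs are plain dicts, and the output keys shown are assumptions rather than the full serialized schema.

```python
def serialize_dataset(hda):
    # Stand-in for HDASerializer.
    return {"model_class": "HistoryDatasetAssociation", "id": hda["id"]}


def serialize_collection(collection):
    # Stand-in for DCSerializer: serializes the collection and each element.
    return {
        "model_class": "DatasetCollection",
        "collection_type": collection["collection_type"],
        "elements": [serialize_element(e) for e in collection["elements"]],
    }


def serialize_element(element):
    # Stand-in for DCESerializer: delegate according to what the element holds.
    child = element["object"]
    is_collection = "collection_type" in child
    return {
        "element_identifier": element["identifier"],
        "element_type": "dataset_collection" if is_collection else "hda",
        "object": serialize_collection(child) if is_collection else serialize_dataset(child),
    }


pair = {
    "collection_type": "paired",
    "elements": [
        {"identifier": "forward", "object": {"id": 1}},
        {"identifier": "reverse", "object": {"id": 2}},
    ],
}
serialized = serialize_element({"identifier": "sample1", "object": pair})
```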
11. Workflow Integration Models
WorkflowRequestToInputDatasetCollectionAssociation
Table: workflow_request_to_input_collection_dataset
Records which HDCA was used as input to a workflow invocation step.
Fields: workflow_invocation_id, workflow_step_id, dataset_collection_id (FK to HDCA), name, request (JSON)
WorkflowInvocationOutputDatasetCollectionAssociation
Table: workflow_invocation_output_dataset_collection_association
Records HDCA outputs of a workflow invocation.
Fields: workflow_invocation_id, workflow_step_id, dataset_collection_id (FK to HDCA), workflow_output_id
WorkflowInvocationStepOutputDatasetCollectionAssociation
Table: workflow_invocation_step_output_dataset_collection_association
Records HDCA outputs at the individual step level.
Fields: workflow_invocation_step_id, workflow_step_id, dataset_collection_id (FK to HDCA), output_name
12. Collection Creation Flow
User-Initiated Collection Creation (API)
API Request (element_identifiers, collection_type)
│
▼
DatasetCollectionManager.create()
│
├── validate_input_element_identifiers()
│
├── create_dataset_collection()
│ │
│ ├── CollectionTypeDescriptionFactory.for_collection_type()
│ │
│ ├── __load_elements() or __recursively_create_collections_for_elements()
│ │ │
│ │ └── For each identifier: resolve src (hda/ldda/hdca/new_collection)
│ │ - hda: hda_manager.get_accessible()
│ │ - ldda: ldda_manager.get() + to_history_dataset_association()
│ │ - hdca: __get_history_collection_instance().collection
│ │ - new_collection: recursive create_dataset_collection()
│ │
│ └── builder.build_collection(type_plugin, elements)
│ │
│ └── type_plugin.generate_elements(elements)
│ └── yields DatasetCollectionElement objects
│
├── _create_instance_for_collection()
│ │
│ ├── Create HDCA or LDCA
│ ├── Set implicit_input_collections if applicable
│ ├── parent.add_dataset_collection() for HIDs
│ └── Apply tags
│
└── __persist() -> flush to DB
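To make the first step of this flow concrete, here is a hedged sketch of a request payload and the kind of checks an identifier validator might run. The `src` values come from the diagram above; everything else is an assumption: `validate_identifiers` and `is_valid` are hypothetical helpers, the specific rules (required name, unique names per level, an id for non-nested sources) are illustrative, and the ids are placeholders.

```python
ALLOWED_SRC = {"hda", "ldda", "hdca", "new_collection"}


def validate_identifiers(element_identifiers):
    """Raise ValueError on the first problem found (hypothetical rules)."""
    seen = set()
    for identifier in element_identifiers:
        name = identifier.get("name")
        src = identifier.get("src")
        if not name:
            raise ValueError("element identifier is missing 'name'")
        if name in seen:
            raise ValueError(f"duplicate element name: {name!r}")
        seen.add(name)
        if src not in ALLOWED_SRC:
            raise ValueError(f"unknown src for {name!r}: {src!r}")
        if src == "new_collection":
            # Nested collections are validated level by level.
            validate_identifiers(identifier.get("element_identifiers", []))
        elif "id" not in identifier:
            raise ValueError(f"{name!r} needs an 'id'")


def is_valid(element_identifiers):
    try:
        validate_identifiers(element_identifiers)
    except ValueError:
        return False
    return True


payload = {
    "collection_type": "paired",
    "name": "sample1 reads",
    "element_identifiers": [
        {"name": "forward", "src": "hda", "id": "hda-id-1"},  # placeholder ids
        {"name": "reverse", "src": "hda", "id": "hda-id-2"},
    ],
}
duplicated = [payload["element_identifiers"][0]] * 2
```

Rejecting bad input before any object is resolved keeps the later steps free of partial state.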
Implicit Collection Creation (Tool Execution)
Tool execution with collection input (map-over scenario)
│
▼
ExecutionTracker.precreate_output_collections()
│
├── Compute _mapped_output_structure()
│ │
│ ├── tool_output_to_structure() -- output's own structure
│ └── mapping_structure.multiply(output_structure) -- combined
│
├── DatasetCollectionManager.precreate_dataset_collection_instance()
│ │
│ ├── precreate_dataset_collection(structure)
│ │ │
│ │ ├── If structure unknown: DatasetCollection(populated=False)
│ │ ├── If structure known:
│ │ │ ├── Create DatasetCollection(populated=False)
│ │ │ └── For each (identifier, substructure) in children:
│ │ │ ├── Leaf: UNINITIALIZED_ELEMENT placeholder
│ │ │ └── Subcollection: recursive precreate_dataset_collection()
│ │ └── Create DatasetCollectionElement for each
│ │
│ └── _create_instance_for_collection() with implicit_output_name
│
├── Create ImplicitCollectionJobs
│
└── For each element position in mapping structure:
├── Create Job
├── Create ImplicitCollectionJobsJobAssociation(order_index=i)
└── Create JobToOutputDatasetCollectionAssociation
(many jobs -> one HDCA)
--- Later, when each job completes ---
BoundCollectionBuilder(pre_created_collection)
.add_dataset(identifier, output_hda)
.populate_partial() or .populate()
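The `multiply` step in this flow is the part that decides the shape of the implicit output. A rough sketch, assuming a structure is either a leaf or a `(collection_type, children)` tuple: every leaf of the mapping structure is replaced by the output's own structure, and the collection types are joined with ":". The tuple representation, `LEAF`, and `leaf_paths` are hypothetical; Galaxy's structure classes carry more information.

```python
import copy

LEAF = None  # stand-in for a single dataset position


def multiply(mapping, output):
    """Replace every leaf of `mapping` with a copy of `output`."""
    if mapping is LEAF:
        return copy.deepcopy(output)
    if output is LEAF:
        # A tool output that is a plain dataset leaves the mapping shape as is.
        return copy.deepcopy(mapping)
    collection_type, children = mapping
    return (
        f"{collection_type}:{output[0]}",
        {identifier: multiply(child, output) for identifier, child in children.items()},
    )


def leaf_paths(structure, prefix=()):
    """List the identifier path to every leaf, in order."""
    if structure is LEAF:
        return [prefix]
    _, children = structure
    paths = []
    for identifier, child in children.items():
        paths.extend(leaf_paths(child, prefix + (identifier,)))
    return paths


# Mapping a tool over a two-element list, where each run emits a pair.
mapping = ("list", {"sample1": LEAF, "sample2": LEAF})
output = ("paired", {"forward": LEAF, "reverse": LEAF})
combined = multiply(mapping, output)
job_positions = leaf_paths(mapping)  # one job per mapping leaf
```

In this example two jobs run (one per list element) while the pre-created output has four leaf positions, two per job.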
13. Key Enums and Constants
DatasetCollectionPopulatedState
File: lib/galaxy/schema/schema.py
class DatasetCollectionPopulatedState(str, Enum):
NEW = "new" # Unpopulated elements
OK = "ok" # Elements populated (HDAs may still have errors)
FAILED = "failed" # Problem populating, won't be populated
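Because the enum mixes in `str`, its members compare equal to the raw strings stored in the populated_state column and serialize as plain strings. The sketch below redefines the enum locally so it runs on its own; `is_terminal` is a hypothetical helper, not part of the schema module.

```python
import json
from enum import Enum


class DatasetCollectionPopulatedState(str, Enum):
    NEW = "new"
    OK = "ok"
    FAILED = "failed"


def is_terminal(state) -> bool:
    """Hypothetical helper: anything other than NEW will not change further."""
    return DatasetCollectionPopulatedState(state) is not DatasetCollectionPopulatedState.NEW


from_db = DatasetCollectionPopulatedState("ok")  # look up a member by stored value
as_json = json.dumps({"populated_state": from_db})
```

Looking a member up by value raises `ValueError` for an unknown string, which gives a cheap sanity check on data read back from storage.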
ImplicitCollectionJobs.populated_states
class populated_states(str, Enum):
NEW = "new" # Unpopulated job associations
OK = "ok" # Job associations set and fixed
FAILED = "failed" # Issues populating job associations
Type Constants
# In paired.py
FORWARD_IDENTIFIER = "forward"
REVERSE_IDENTIFIER = "reverse"
# In paired_or_unpaired.py
SINGLETON_IDENTIFIER = "unpaired"
# In type_description.py
COLLECTION_TYPE_REGEX = re.compile(
r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*"
r"|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)
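The regex above can be exercised directly to see which type strings it admits. The pattern is copied from the constant shown; `is_valid_collection_type` and the sample lists are illustrative additions. Note that, as written, sample_sheet is only accepted on its own or as the outermost type with one of three specific inner types.

```python
import re

COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*"
    r"|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)


def is_valid_collection_type(collection_type: str) -> bool:
    """Illustrative wrapper: True if the whole string matches the pattern."""
    return COLLECTION_TYPE_REGEX.match(collection_type) is not None


accepted = [
    "list",
    "paired",
    "list:paired",
    "list:list:paired_or_unpaired",
    "record",
    "sample_sheet",
    "sample_sheet:paired",
]
rejected = [
    "sample_sheet:list",   # not one of the enumerated sample_sheet forms
    "list:sample_sheet",   # sample_sheet cannot be nested inside another type
    "list:",               # trailing separator
    "List",                # matching is case sensitive
    "",
]
```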
DATA_COLLECTION_FIELDS
DATA_COLLECTION_FIELDS = list[dict[str, Any]] # JSON-serializable field definitions for record types
14. File Index
| File | Contents |
|---|---|
| lib/galaxy/model/__init__.py | Core model classes: DatasetCollection, DatasetCollectionElement, DatasetCollectionInstance, HistoryDatasetCollectionAssociation, LibraryDatasetCollectionAssociation, ImplicitCollectionJobs, ImplicitCollectionJobsJobAssociation, ImplicitlyCreatedDatasetCollectionInput, ToolRequestImplicitCollectionAssociation, all job association classes, workflow association classes, tag/annotation/rating associations |
| lib/galaxy/model/dataset_collections/__init__.py | Empty (package marker) |
| lib/galaxy/model/dataset_collections/types/__init__.py | BaseDatasetCollectionType abstract base class |
| lib/galaxy/model/dataset_collections/types/list.py | ListDatasetCollectionType |
| lib/galaxy/model/dataset_collections/types/paired.py | PairedDatasetCollectionType |
| lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py | PairedOrUnpairedDatasetCollectionType |
| lib/galaxy/model/dataset_collections/types/record.py | RecordDatasetCollectionType |
| lib/galaxy/model/dataset_collections/types/sample_sheet.py | SampleSheetDatasetCollectionType |
| lib/galaxy/model/dataset_collections/types/sample_sheet_util.py | SampleSheetColumnDefinitionModel, validation utilities for sample sheet columns and rows |
| lib/galaxy/model/dataset_collections/types/semantics.py | YAML-to-Markdown doc generator for collection_semantics.yml |
| lib/galaxy/model/dataset_collections/type_description.py | CollectionTypeDescription, CollectionTypeDescriptionFactory, COLLECTION_TYPE_DESCRIPTION_FACTORY, type validation regex |
| lib/galaxy/model/dataset_collections/registry.py | DatasetCollectionTypesRegistry, DATASET_COLLECTION_TYPES_REGISTRY |
| lib/galaxy/model/dataset_collections/matching.py | CollectionsToMatch, MatchingCollections |
| lib/galaxy/model/dataset_collections/structure.py | Leaf, UninitializedTree, Tree, get_structure, tool_output_to_structure |
| lib/galaxy/model/dataset_collections/builder.py | build_collection, set_collection_elements, CollectionBuilder, BoundCollectionBuilder |
| lib/galaxy/model/dataset_collections/subcollections.py | split_dataset_collection_instance, _split_dataset_collection |
| lib/galaxy/model/dataset_collections/adapters.py | CollectionAdapter, DCECollectionAdapter, PromoteCollectionElementToCollectionAdapter, PromoteDatasetToCollection, PromoteDatasetsToCollection, TransientCollectionAdapterDatasetInstanceElement, recover_adapter |
| lib/galaxy/model/dataset_collections/query.py | HistoryQuery (matches collections to tool parameters) |
| lib/galaxy/model/dataset_collections/auto_pairing.py | auto_pair(), automatic forward/reverse pairing heuristics |
| lib/galaxy/model/dataset_collections/auto_identifiers.py | filename_to_element_identifier(), fill_in_identifiers() |
| lib/galaxy/managers/collections.py | DatasetCollectionManager (primary service layer) |
| lib/galaxy/managers/hdcas.py | HDCAManager, DCESerializer, DCSerializer, DCASerializer, HDCASerializer |
| lib/galaxy/schema/schema.py | DatasetCollectionPopulatedState enum, SampleSheetColumnDefinitions, SampleSheetRow types |