Dataset Collection Creation API - Deep Dive
Overview
Dataset collections are containers that group related datasets together. Galaxy supports several collection types (list, paired, record, sample_sheet, paired_or_unpaired) and nested compositions of these (e.g. list:paired).
There are two creation paths:
- Direct creation: POST /api/dataset_collections creates a collection from existing datasets already in a history
- Fetch-based creation: POST /api/tools/fetch uploads new data and creates the collection in one step
Both produce the same result: a HistoryDatasetCollectionAssociation (HDCA) in a history.
1. API Endpoint
File: lib/galaxy/webapps/galaxy/api/dataset_collections.py
POST /api/dataset_collections
- Request body: CreateNewCollectionPayload
- Response: HDCADetailed
The endpoint delegates to DatasetCollectionsService.create().
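As a rough illustration of calling this endpoint from a client, the sketch below prepares (but does not send) the POST request using only the standard library. The server URL and key are placeholders, and the x-api-key header is an assumption about how the client authenticates; adjust it to whatever your Galaxy instance expects.

```python
import json
import urllib.request


def build_create_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /api/dataset_collections."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dataset_collections",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the API key is passed in an x-api-key header.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_create_request(
    "https://galaxy.example.org",  # hypothetical server
    "my-api-key",  # hypothetical key
    {"collection_type": "list", "history_id": "<id>", "element_identifiers": []},
)
# urllib.request.urlopen(request) would send it; the JSON body of the
# response is expected to follow the HDCADetailed schema.
```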
2. Request Schema — CreateNewCollectionPayload
File: lib/galaxy/schema/schema.py:1769
| Field | Type | Default | Description |
|---|---|---|---|
| collection_type | CollectionType (str) | None | e.g. "list", "paired", "list:paired", "record", "sample_sheet", "sample_sheet:paired", "paired_or_unpaired" |
| element_identifiers | list[CollectionElementIdentifier] | None | Elements to include |
| name | str | None | Display name |
| instance_type | "history" \| "library" | "history" | Where to create the collection |
| history_id | DecodedDatabaseIdField | None | Required when instance_type="history" |
| folder_id | LibraryFolderDatabaseIdField | None | Required when instance_type="library" |
| hide_source_items | bool | False | Hide original HDAs after collection creation |
| copy_elements | bool | True | Copy source HDAs vs reference them |
| fields | str \| list[FieldDict] | [] | For record type: field definitions, or "auto" to guess them from identifiers |
| column_definitions | SampleSheetColumnDefinitions | None | For sample_sheet type: column schema |
| rows | SampleSheetRows (dict) | None | For sample_sheet type: {element_name: [col_values...]} |
CollectionElementIdentifier
File: lib/galaxy/schema/schema.py:1740
| Field | Type | Description |
|---|---|---|
| name | str | Element identifier name (e.g. "forward", "data1", "sample1") |
| src | CollectionSourceType | Source: "hda", "ldda", "hdca", "new_collection" |
| id | DecodedDatabaseIdField | ID of existing dataset/collection (for hda/ldda/hdca) |
| collection_type | CollectionType | For src="new_collection": the sub-collection type |
| element_identifiers | list[CollectionElementIdentifier] | For src="new_collection": nested elements |
| tags | list[str] | Tags for this element |
CollectionSourceType enum
hda — existing HistoryDatasetAssociation
ldda — existing LibraryDatasetDatasetAssociation
hdca — existing HistoryDatasetCollectionAssociation (nesting existing collections)
new_collection — inline sub-collection definition (with nested element_identifiers)
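The split between sources that carry an id and the one that carries nested identifiers can be sketched with two small builders. The function names below are hypothetical helpers for illustration, not part of Galaxy.

```python
from typing import Any

# Sources that point at an existing object and therefore carry an "id".
EXISTING_SOURCES = {"hda", "ldda", "hdca"}


def existing_element(name: str, src: str, encoded_id: str) -> dict[str, Any]:
    """Identifier for an existing dataset or collection."""
    if src not in EXISTING_SOURCES:
        raise ValueError(f"src must be one of {sorted(EXISTING_SOURCES)}, got {src!r}")
    return {"name": name, "src": src, "id": encoded_id}


def new_collection_element(
    name: str, collection_type: str, children: list[dict[str, Any]]
) -> dict[str, Any]:
    """Identifier for an inline sub-collection: nested identifiers, no id."""
    return {
        "name": name,
        "src": "new_collection",
        "collection_type": collection_type,
        "element_identifiers": children,
    }


pair = new_collection_element(
    "sample1",
    "paired",
    [existing_element("forward", "hda", "<id>"), existing_element("reverse", "hda", "<id>")],
)
```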
3. Response Schema — HDCADetailed
File: lib/galaxy/schema/schema.py:1249
Extends HDCASummary which extends HDCACommon.
Key fields: id, name, collection_type, elements (list of DCESummary), element_count, populated, populated_state, contents_url, collection_id, column_definitions, implicit_collection_jobs_id, tags, elements_datatypes, elements_states.
Each element (DCESummary) contains: element_identifier, element_index, element_type, object (the HDA or nested DC), columns (for sample sheets), model_class.
4. Collection Types
Type Registry
File: lib/galaxy/model/dataset_collections/registry.py
PLUGIN_CLASSES = [
ListDatasetCollectionType,
PairedDatasetCollectionType,
RecordDatasetCollectionType,
PairedOrUnpairedDatasetCollectionType,
SampleSheetDatasetCollectionType,
]
A singleton DatasetCollectionTypesRegistry maps collection_type strings to plugin instances. All plugins extend BaseDatasetCollectionType and implement generate_elements().
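The registry pattern described here can be approximated in a few lines. This is a simplified stand-in: the class names, the generate_elements() signature, and the plugin_for() helper are assumptions made for illustration and do not mirror Galaxy's actual classes.

```python
from typing import Iterator


class ListType:
    collection_type = "list"

    def generate_elements(self, objects: dict[str, object]) -> Iterator[tuple[str, object]]:
        # A list simply yields each element under its own name.
        yield from objects.items()


class PairedType:
    collection_type = "paired"

    def generate_elements(self, objects: dict[str, object]) -> Iterator[tuple[str, object]]:
        # A pair must contain exactly "forward" and "reverse".
        if set(objects) != {"forward", "reverse"}:
            raise ValueError("paired collections need exactly 'forward' and 'reverse'")
        for name in ("forward", "reverse"):
            yield name, objects[name]


# One shared mapping from type string to plugin instance.
REGISTRY = {plugin.collection_type: plugin for plugin in (ListType(), PairedType())}


def plugin_for(collection_type: str):
    try:
        return REGISTRY[collection_type]
    except KeyError:
        raise ValueError(f"Unknown collection type {collection_type!r}") from None
```

Keeping type-specific rules inside each plugin means the code that builds a collection only needs the lookup and a single generate_elements() call.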
4a. list
File: lib/galaxy/model/dataset_collections/types/list.py
Flat list of arbitrarily-named elements. Simply yields each element with its name.
{
"collection_type": "list",
"instance_type": "history",
"history_id": "<id>",
"element_identifiers": [
{"name": "data1", "src": "hda", "id": "<id>"},
{"name": "data2", "src": "hda", "id": "<id>"},
{"name": "data3", "src": "hda", "id": "<id>"}
]
}
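A payload like the one above can be generated from a name-to-ID mapping. The helper below is a hypothetical convenience function; the IDs are placeholders for encoded HDA IDs.

```python
def list_payload(history_id: str, hda_ids_by_name: dict[str, str], name: str = "my list") -> dict:
    """Build a direct-creation payload for a flat list of existing HDAs."""
    return {
        "collection_type": "list",
        "instance_type": "history",
        "history_id": history_id,
        "name": name,
        "element_identifiers": [
            {"name": element_name, "src": "hda", "id": hda_id}
            for element_name, hda_id in hda_ids_by_name.items()
        ],
    }


payload = list_payload("<history id>", {"data1": "<id1>", "data2": "<id2>"})
```

Because the element names are dictionary keys, they are unique by construction, which sidesteps the duplicate-name check on the server.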
4b. paired
File: lib/galaxy/model/dataset_collections/types/paired.py
Exactly two elements named "forward" and "reverse".
{
"collection_type": "paired",
"instance_type": "history",
"history_id": "<id>",
"element_identifiers": [
{"name": "forward", "src": "hda", "id": "<id>"},
{"name": "reverse", "src": "hda", "id": "<id>"}
]
}
4c. record
File: lib/galaxy/model/dataset_collections/types/record.py
CWL-style record with named fields. Requires a fields parameter defining field names/types, or fields="auto" to guess from identifiers.
{
"collection_type": "record",
"instance_type": "history",
"history_id": "<id>",
"name": "a record",
"fields": [
{"name": "condition", "type": "File"},
{"name": "control1", "type": "File"},
{"name": "control2", "type": "File"}
],
"element_identifiers": [
{"name": "condition", "src": "hda", "id": "<id>"},
{"name": "control1", "src": "hda", "id": "<id>"},
{"name": "control2", "src": "hda", "id": "<id>"}
]
}
Validation: field count must match element count, field names must match element identifiers.
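The two record checks can be sketched client-side as follows. This is an illustration of the rule, not the plugin's code; in particular, comparing names position by position is an assumption, since the text above only says the names must match.

```python
def validate_record_fields(fields: list[dict], element_identifiers: list[dict]) -> None:
    """Raise ValueError if fields and elements disagree in count or names."""
    if len(fields) != len(element_identifiers):
        raise ValueError(
            f"{len(fields)} fields defined but {len(element_identifiers)} elements supplied"
        )
    field_names = [field["name"] for field in fields]
    element_names = [element["name"] for element in element_identifiers]
    # Assumption: names are compared in order, field i against element i.
    if field_names != element_names:
        raise ValueError(f"field names {field_names} do not match elements {element_names}")


validate_record_fields(
    [{"name": "condition", "type": "File"}, {"name": "control1", "type": "File"}],
    [{"name": "condition", "src": "hda", "id": "<id>"}, {"name": "control1", "src": "hda", "id": "<id>"}],
)
```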
4d. paired_or_unpaired
File: lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py
Either 1 element (unpaired) or 2 (paired with forward/reverse).
{
"collection_type": "paired_or_unpaired",
"instance_type": "history",
"history_id": "<id>",
"element_identifiers": [
{"name": "unpaired", "src": "hda", "id": "<id>"}
]
}
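The one-or-two rule can be expressed as a small classifier. Requiring the single element to be named "unpaired" is an assumption taken from the example above; the function itself is hypothetical.

```python
def classify_paired_or_unpaired(element_identifiers: list[dict]) -> str:
    """Return "unpaired" or "paired", or raise ValueError for any other shape."""
    names = [element["name"] for element in element_identifiers]
    if names == ["unpaired"]:
        return "unpaired"
    if sorted(names) == ["forward", "reverse"]:
        return "paired"
    raise ValueError(f"expected ['unpaired'] or forward/reverse, got {names}")
```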
4e. sample_sheet
File: lib/galaxy/model/dataset_collections/types/sample_sheet.py
A list with per-element metadata columns. Requires column_definitions and rows.
{
"collection_type": "sample_sheet",
"instance_type": "history",
"history_id": "<id>",
"name": "my sample sheet",
"column_definitions": [
{"type": "int", "name": "replicate", "optional": false},
{"type": "string", "name": "condition", "optional": false}
],
"rows": {
"sample1": [1, "control"],
"sample2": [2, "treatment"]
},
"element_identifiers": [
{"name": "sample1", "src": "hda", "id": "<id>"},
{"name": "sample2", "src": "hda", "id": "<id>"}
]
}
Column types: int, string, boolean, element_identifier (cross-references another element).
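To make the relationship between column_definitions and rows concrete, here is a minimal row checker. It is an illustrative reading of the four column types, not the logic in sample_sheet_util.py; allowing None for optional columns is an assumption.

```python
def validate_rows(column_definitions: list[dict], rows: dict[str, list]) -> None:
    """Raise ValueError when a row does not fit the column definitions."""
    element_names = set(rows)
    for element_name, row in rows.items():
        if len(row) != len(column_definitions):
            raise ValueError(f"{element_name}: expected {len(column_definitions)} values")
        for definition, value in zip(column_definitions, row):
            if value is None and definition.get("optional"):
                continue  # assumption: optional columns may be left empty
            kind = definition["type"]
            if kind == "int":
                ok = isinstance(value, int) and not isinstance(value, bool)
            elif kind == "string":
                ok = isinstance(value, str)
            elif kind == "boolean":
                ok = isinstance(value, bool)
            elif kind == "element_identifier":
                ok = value in element_names  # must name another element
            else:
                raise ValueError(f"unknown column type {kind!r}")
            if not ok:
                raise ValueError(f"{element_name}: {value!r} is not a valid {kind}")


validate_rows(
    [{"type": "int", "name": "replicate", "optional": False},
     {"type": "string", "name": "condition", "optional": False}],
    {"sample1": [1, "control"], "sample2": [2, "treatment"]},
)
```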
4f. Nested types (colon notation)
Types can be composed: "list:paired", "list:list", "sample_sheet:paired", etc. The string is split on : — the first segment is the “rank” (outer) type, the rest describe the inner structure.
Example: list:paired — use src="new_collection" for inner collections:
{
"collection_type": "list:paired",
"instance_type": "history",
"history_id": "<id>",
"name": "a nested collection",
"element_identifiers": [
{
"name": "test_level_1",
"src": "new_collection",
"collection_type": "paired",
"element_identifiers": [
{"name": "forward", "src": "hda", "id": "<id>"},
{"name": "reverse", "src": "hda", "id": "<id>"}
]
}
]
}
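The nested element_identifiers for a list:paired collection follow a regular shape, so they can be generated from a mapping of sample name to (forward, reverse) IDs. The helper is hypothetical and the IDs are placeholders.

```python
def list_of_pairs_identifiers(pairs: dict[str, tuple[str, str]]) -> list[dict]:
    """Build list:paired identifiers, one inline paired sub-collection per sample."""
    return [
        {
            "name": sample,
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [
                {"name": "forward", "src": "hda", "id": forward_id},
                {"name": "reverse", "src": "hda", "id": reverse_id},
            ],
        }
        for sample, (forward_id, reverse_id) in pairs.items()
    ]


identifiers = list_of_pairs_identifiers({"sample1": ("<f1>", "<r1>"), "sample2": ("<f2>", "<r2>")})
```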
Example: sample_sheet:paired — sample sheet wrapping paired collections:
{
"collection_type": "sample_sheet:paired",
"instance_type": "history",
"history_id": "<id>",
"column_definitions": [{"type": "int", "name": "replicate", "optional": false}],
"rows": {"sample1": [42]},
"element_identifiers": [
{
"name": "sample1",
"src": "new_collection",
"collection_type": "paired",
"element_identifiers": [
{"name": "forward", "src": "hda", "id": "<id>"},
{"name": "reverse", "src": "hda", "id": "<id>"}
]
}
]
}
Nested type regex (from type_description.py):
^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*
|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$
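The regex above and the rank split can be tried out directly. The pattern is copied from the text; split_rank() is a hypothetical helper that mirrors the "first segment is the rank" description.

```python
import re
from typing import Optional

COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*"
    r"|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)


def split_rank(collection_type: str) -> tuple[str, Optional[str]]:
    """Split a type string into (rank type, inner type or None)."""
    if not COLLECTION_TYPE_REGEX.match(collection_type):
        raise ValueError(f"invalid collection type {collection_type!r}")
    rank, _, inner = collection_type.partition(":")
    return rank, inner or None
```

Note that the pattern only allows sample_sheet as the outermost type, and only over paired, record, or paired_or_unpaired, so "sample_sheet:list" and "list:sample_sheet" are both rejected.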
5. Plumbing — The Creation Call Chain
API endpoint (dataset_collections.py)
└─ DatasetCollectionsService.create() [services/dataset_collections.py]
   ├─ api_payload_to_create_params() [managers/collections_util.py]
   │  ├─ validates required params (collection_type, element_identifiers)
   │  └─ validate_column_definitions() [types/sample_sheet_util.py]
   │
   └─ DatasetCollectionManager.create() [managers/collections.py:180]
      ├─ validate_input_element_identifiers() [managers/collections_util.py:52]
      │  ├─ no __object__ injection
      │  ├─ all elements have names
      │  ├─ no duplicate names
      │  ├─ src in {hda, hdca, ldda, new_collection}
      │  └─ new_collection requires element_identifiers
      │
      ├─ create_dataset_collection() [managers/collections.py:309]
      │  ├─ CollectionTypeDescriptionFactory.for_collection_type()
      │  │  → CollectionTypeDescription wrapping the type string
      │  │
      │  ├─ _element_identifiers_to_elements() [managers/collections.py:403]
      │  │  ├─ for nested types: __recursively_create_collections_for_identifiers()
      │  │  └─ __load_elements(): resolves src/id → actual model objects
      │  │
      │  ├─ rank_type_plugin()
      │  │  → looks up type string in DatasetCollectionTypesRegistry
      │  │
      │  └─ builder.build_collection(type_plugin, elements, ...)
      │     └─ set_collection_elements()
      │        └─ type_plugin.generate_elements()
      │           → yields DatasetCollectionElement objects
      │
      └─ _create_instance_for_collection() [managers/collections.py:250]
         ├─ creates HistoryDatasetCollectionAssociation (or LDCA)
         ├─ wires up implicit inputs/outputs (for workflow-generated collections)
         ├─ applies tags
         └─ persists to database
Key function: _element_identifiers_to_elements()
File: lib/galaxy/managers/collections.py:403
Resolves identifier dicts into actual model objects:
- For nested types, recursively builds inner DatasetCollections first
- Resolves src="hda" → loads HDA by ID, src="hdca" → loads collection
- Returns an ordered dict of {name: model_object}
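The resolution steps above can be sketched with a plain dictionary standing in for the database. Both resolve() and the lookup mapping are hypothetical; the real function loads model objects through the manager and builds DatasetCollection objects for nested types.

```python
def resolve(element_identifiers: list[dict], lookup: dict[tuple[str, str], object]) -> dict:
    """Turn identifier dicts into {name: object}, recursing into inline sub-collections."""
    resolved = {}
    for identifier in element_identifiers:
        name = identifier["name"]
        if identifier["src"] == "new_collection":
            # Inner collections are built first, then attached under their name.
            resolved[name] = resolve(identifier["element_identifiers"], lookup)
        else:
            # Stand-in for loading an HDA/LDDA/HDCA by its ID.
            resolved[name] = lookup[(identifier["src"], identifier["id"])]
    return resolved  # regular dicts preserve insertion order


lookup = {("hda", "1"): "HDA-1", ("hda", "2"): "HDA-2"}
elements = resolve(
    [{
        "name": "sample1",
        "src": "new_collection",
        "collection_type": "paired",
        "element_identifiers": [
            {"name": "forward", "src": "hda", "id": "1"},
            {"name": "reverse", "src": "hda", "id": "2"},
        ],
    }],
    lookup,
)
```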
Key function: builder.build_collection()
File: lib/galaxy/model/dataset_collections/builder.py:27
Creates a DatasetCollection model, then calls set_collection_elements() which delegates to the type plugin’s generate_elements() to produce DatasetCollectionElement objects with proper indices.
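The index assignment can be pictured as an enumerate over whatever the type plugin yields. The Element dataclass and the function below are simplified stand-ins (the field names are borrowed from DCESummary), not the builder module's actual code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Element:
    element_identifier: str
    element_index: int
    obj: object


def set_collection_elements(generated: Iterable[tuple[str, object]]) -> list[Element]:
    """Number the (name, object) pairs produced by a type plugin, starting at 0."""
    return [Element(name, index, obj) for index, (name, obj) in enumerate(generated)]


elements = set_collection_elements([("forward", "HDA-1"), ("reverse", "HDA-2")])
```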
6. The Fetch Path (Alternative Creation)
Endpoint: POST /api/tools/fetch
Instead of creating datasets first and then referencing them, the fetch API uploads data and creates the collection in a single request. The payload uses a targets array:
{
"history_id": "<id>",
"targets": [
{
"destination": {"type": "hdca"},
"elements": [
{"src": "pasted", "paste_content": "data...", "name": "data1"},
{"src": "url", "url": "https://...", "name": "data2"},
{"src": "files", "dbkey": "hg19", "info": "..."}
],
"collection_type": "list",
"name": "My Collection",
"tags": ["name:mytag"]
}
]
}
Element src values for fetch: "pasted", "url", "files" (multipart upload).
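A fetch payload for a flat list of pasted elements can be assembled as follows. The helper name is hypothetical; the keys are the ones shown in the payload above.

```python
def fetch_list_payload(history_id: str, pasted_by_name: dict[str, str], name: str) -> dict:
    """Build a /api/tools/fetch payload that creates a list from pasted content."""
    return {
        "history_id": history_id,
        "targets": [
            {
                "destination": {"type": "hdca"},
                "collection_type": "list",
                "name": name,
                "elements": [
                    {"src": "pasted", "paste_content": content, "name": element_name}
                    for element_name, content in pasted_by_name.items()
                ],
            }
        ],
    }


payload = fetch_list_payload("<id>", {"data1": "1\t2\t3", "data2": "4\t5\t6"}, "My Collection")
```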
For nested fetch collections:
{
"targets": [{
"destination": {"type": "hdca"},
"collection_type": "list:list",
"elements": [
{
"name": "samp1",
"elements": [
{"src": "files", "dbkey": "hg19"}
]
}
]
}]
}
For sample_sheet fetch:
{
"targets": [{
"destination": {"type": "hdca"},
"collection_type": "sample_sheet",
"column_definitions": [{"type": "int", "name": "replicate", "optional": false}],
"elements": [
{"src": "url", "url": "...", "name": "sample1", "row": [42]}
]
}]
}
Note: row on each element (fetch path) vs rows dict on the payload (direct path).
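The difference between the two shapes is mechanical, so a direct-style rows dict can be folded into fetch-style elements. The helper below is a hypothetical converter written for illustration.

```python
def attach_rows(elements: list[dict], rows: dict[str, list]) -> list[dict]:
    """Copy each element and add its "row", looked up by element name."""
    return [{**element, "row": rows[element["name"]]} for element in elements]


fetch_elements = attach_rows(
    [{"src": "url", "url": "https://example.org/s1.txt", "name": "sample1"}],  # placeholder URL
    {"sample1": [42]},
)
```

A missing name raises KeyError, which surfaces a mismatch between elements and rows before the request is sent.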
7. Testing
File: lib/galaxy_test/api/test_dataset_collections.py
Test Matrix
| Test | Collection Type | Creation Path | What’s Tested |
|---|---|---|---|
| test_create_pair_from_history | paired | fetch | Basic pair creation, 2 elements |
| test_create_list_from_history | list | direct | Basic list creation, 3 elements |
| test_create_list_of_existing_pairs | list (of hdca) | direct | Nesting existing collections via src="hdca" |
| test_create_list_of_new_pairs | list:paired | direct | Nested creation via src="new_collection" |
| test_create_paried_or_unpaired | paired_or_unpaired | direct | Single-element unpaired |
| test_create_record | record | direct | Record with explicit fields |
| test_record_requires_fields | record | direct | 400 when fields missing |
| test_record_auto_fields | record | direct | fields="auto" |
| test_record_field_validation | record | direct | Wrong count / wrong names → 400 |
| test_sample_sheet_requires_columns | sample_sheet | direct | Columns on response elements |
| test_sample_sheet_column_definition_problems | sample_sheet | direct | Invalid column defs → 400 |
| test_sample_sheet_element_identifier_column_type | sample_sheet | direct | element_identifier column type |
| test_sample_sheet_validating_against_column_definition | sample_sheet | direct | Type mismatch + validator failure |
| test_sample_sheet_of_pairs_creation | sample_sheet:paired | direct | Nested sample sheet |
| test_sample_sheet_map_over_preserves_columns | sample_sheet | direct | Columns survive tool mapping |
| test_copy_sample_sheet_collection | sample_sheet | direct | Columns survive copy |
| test_upload_collection | list | fetch | File upload with tags |
| test_upload_nested | list:list | fetch | Nested fetch upload |
| test_upload_collection_from_url | list | fetch | URL-based upload |
| test_upload_collection_deferred | list | fetch | Deferred (lazy) upload |
| test_upload_flat_sample_sheet | sample_sheet | fetch | Sample sheet via fetch |
| test_upload_sample_sheet_paired | sample_sheet:paired | fetch | Nested sample sheet via fetch |
| test_enforces_unique_names | list | direct | Duplicate names → 400 |
| test_hda_security | paired | direct | Cannot use another user’s HDA → 403 |
Test Helper: _check_create_response
Handles both direct and fetch responses:
def _check_create_response(self, create_response):
    self._assert_status_code_is(create_response, 200)
    dataset_collection = create_response.json()
    if "output_collections" in dataset_collection:
        # fetch response: follow up with GET
        dataset_collection = dataset_collection["output_collections"][0]
        dataset_collection = self._get(f"dataset_collections/{dataset_collection['id']}").json()
    self._assert_has_keys(dataset_collection, "elements", "url", "name", "collection_type", "element_count")
    return dataset_collection
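The branching in this helper comes down to telling the two response shapes apart. A standalone sketch of that step, using a hypothetical function and made-up IDs, looks like this:

```python
def created_collection_id(response_json: dict) -> str:
    """Return the new collection's ID from either a direct or a fetch response."""
    if "output_collections" in response_json:
        # Fetch response: the collection is listed under output_collections.
        return response_json["output_collections"][0]["id"]
    # Direct response: the body is the collection itself.
    return response_json["id"]


direct_response = {"id": "abc123", "collection_type": "list"}  # made-up ID
fetch_response = {"output_collections": [{"id": "def456"}]}  # made-up ID
```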
Standalone helpers
def assert_one_collection_created_in_history(dataset_populator, history_id):
    # Lists history contents, asserts exactly 1 collection, returns full details
    ...

def upload_flat_sample_sheet(dataset_populator):
    # Creates a sample_sheet via fetch and validates columns
    ...
8. Test Populator Helpers
File: lib/galaxy_test/base/populators.py
DatasetCollectionPopulator
Central test helper class. Key methods:
Identifier builders
| Method | Returns | Description |
|---|---|---|
| list_identifiers(history_id, contents) | [{name, src, id}...] | Creates N HDAs, returns element identifiers for a list |
| pair_identifiers(history_id, contents) | [{name:"forward",...}, {name:"reverse",...}] | Creates 2 HDAs, returns forward/reverse identifiers |
| nested_collection_identifiers(history_id, collection_type) | nested identifier tree | Recursively builds identifiers for list:paired etc. |
Payload builders
| Method | Description |
|---|---|
| create_list_payload(history_id, **kwds) | Builds payload for list creation (delegates to fetch or direct based on direct_upload kwarg) |
| create_pair_payload(history_id, **kwds) | Builds payload for pair creation |
Both call __create_payload() which dispatches:
- direct_upload=True (default) → __create_payload_fetch() → builds a targets-based fetch payload
- direct_upload=False → __create_payload_collection() → builds an element_identifiers-based direct payload
Collection creators
| Method | Description |
|---|---|
| create_list_in_history(history_id) | Creates a list in history |
| create_pair_in_history(history_id) | Creates a pair in history |
| create_list_of_pairs_in_history(history_id) | Creates list:paired via upload_collection() |
| create_list_of_list_in_history(history_id) | Creates list:list (or deeper) by chaining: first creates an inner list, then wraps it via create_nested_collection() |
| upload_collection(history_id, collection_type, elements) | Generic fetch-based upload |
| create_nested_collection(history_id, collection_type, collection) | Creates nested collection from existing HDCA IDs via src="hdca" |
| copy_collection(history_id, hdca_id) | Copies a collection via POST histories/{id}/contents/dataset_collections |
Dispatch logic — __create(payload)
def __create(self, payload, wait=False):
    if "targets" not in payload:
        return self._create_collection(payload)  # POST /api/dataset_collections
    else:
        return self.dataset_populator.fetch(payload)  # POST /api/tools/fetch
Sample sheet helpers
| Method | Description |
|---|---|
| download_workbook(collection_type, column_definitions) | Downloads XLSX workbook template |
| download_workbook_for_collection(hdca_id, column_definitions) | Downloads workbook for existing collection |
| parse_workbook(xlsx_content, collection_type, column_definitions) | Parses uploaded XLSX into rows |
| parse_workbook_for_collection(hdca_id, xlsx_content, column_definitions) | Parses XLSX against existing collection |
9. Validation Summary
Validation happens at multiple layers:
- Schema level (CreateNewCollectionPayload): Pydantic type validation
- api_payload_to_create_params(): requires collection_type + element_identifiers
- validate_column_definitions(): validates sample sheet column defs against SampleSheetColumnDefinitionModel
- validate_input_element_identifiers(): no __object__ injection, names required, no duplicates, valid src
- Type plugin generate_elements(): type-specific validation:
  - record: field count/names must match elements
  - paired: expects forward/reverse
  - paired_or_unpaired: 1 or 2 elements only
  - sample_sheet: validates row data types against column definitions
- Security: user must own/have access to referenced HDAs; 403 otherwise
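The element identifier checks listed above can be approximated in a short recursive function. This is a sketch of the rules as described in this document, not a copy of validate_input_element_identifiers(); the function name, error messages, and the decision to recurse into inline sub-collections are assumptions.

```python
VALID_SOURCES = {"hda", "hdca", "ldda", "new_collection"}


def check_element_identifiers(element_identifiers: list[dict]) -> None:
    """Raise ValueError on the first identifier that breaks a rule."""
    seen_names = set()
    for identifier in element_identifiers:
        if "__object__" in identifier:
            raise ValueError("__object__ may not be supplied by the client")
        name = identifier.get("name")
        if not name:
            raise ValueError("every element needs a name")
        if name in seen_names:
            raise ValueError(f"duplicate element name {name!r}")
        seen_names.add(name)
        src = identifier.get("src")
        if src not in VALID_SOURCES:
            raise ValueError(f"invalid src {src!r}")
        if src == "new_collection":
            if "element_identifiers" not in identifier:
                raise ValueError("new_collection requires element_identifiers")
            # Assumption: nested identifiers are checked with the same rules.
            check_element_identifiers(identifier["element_identifiers"])
```

Names only need to be unique within one level, so two sub-collections may each contain a "forward" element.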
10. Key Files
| File | Purpose |
|---|---|
| lib/galaxy/webapps/galaxy/api/dataset_collections.py | FastAPI endpoint |
| lib/galaxy/webapps/galaxy/services/dataset_collections.py | Service layer |
| lib/galaxy/managers/collections.py | Core manager with create(), create_dataset_collection() |
| lib/galaxy/managers/collections_util.py | Payload parsing, element identifier validation |
| lib/galaxy/model/dataset_collections/registry.py | Type plugin registry |
| lib/galaxy/model/dataset_collections/type_description.py | CollectionTypeDescription: nested type parsing |
| lib/galaxy/model/dataset_collections/builder.py | build_collection(), set_collection_elements() |
| lib/galaxy/model/dataset_collections/types/__init__.py | BaseDatasetCollectionType abstract base |
| lib/galaxy/model/dataset_collections/types/list.py | List type plugin |
| lib/galaxy/model/dataset_collections/types/paired.py | Paired type plugin |
| lib/galaxy/model/dataset_collections/types/record.py | Record type plugin |
| lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py | PairedOrUnpaired type plugin |
| lib/galaxy/model/dataset_collections/types/sample_sheet.py | SampleSheet type plugin |
| lib/galaxy/model/dataset_collections/types/sample_sheet_util.py | Column definition & row validation |
| lib/galaxy/schema/schema.py | Pydantic models (CreateNewCollectionPayload, CollectionElementIdentifier, HDCADetailed) |
| lib/galaxy_test/api/test_dataset_collections.py | API tests |
| lib/galaxy_test/base/populators.py | DatasetCollectionPopulator test helper |