Dataset Collection Creation API - Deep Dive

Overview

Dataset collections are containers that group related datasets together. Galaxy supports several collection types (list, paired, record, sample_sheet, paired_or_unpaired) and nested compositions of these (e.g. list:paired).

There are two creation paths:

Direct creation — POST /api/dataset_collections — creates a collection from existing datasets already in a history
Fetch-based creation — POST /api/tools/fetch — uploads new data and creates the collection in one step

Both produce the same result: a HistoryDatasetCollectionAssociation (HDCA) in a history.

1. API Endpoint

File: lib/galaxy/webapps/galaxy/api/dataset_collections.py

POST /api/dataset_collections

Request body: CreateNewCollectionPayload
Response: HDCADetailed

The endpoint delegates to DatasetCollectionsService.create().
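
As a rough illustration of what a client call to this endpoint could look like, the sketch below prepares (but does not send) the POST request using only the standard library. The server URL, the helper name build_create_request, and the use of an x-api-key header for authentication are assumptions made for this example, not details taken from the endpoint code.

```python
import json
import urllib.request


def build_create_request(galaxy_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare, without sending, a POST /api/dataset_collections request."""
    return urllib.request.Request(
        url=f"{galaxy_url.rstrip('/')}/api/dataset_collections",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumption: API-key header authentication
        },
        method="POST",
    )


# Hypothetical server and placeholder IDs, for illustration only.
request = build_create_request(
    "https://galaxy.example.org/",
    "<api key>",
    {
        "collection_type": "list",
        "instance_type": "history",
        "history_id": "<id>",
        "element_identifiers": [{"name": "data1", "src": "hda", "id": "<id>"}],
    },
)
```

Sending is left to the caller (for example with urllib.request.urlopen(request)); on success the response body would be the HDCADetailed JSON described in section 3.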

2. Request Schema — `CreateNewCollectionPayload`

File: lib/galaxy/schema/schema.py:1769

Field	Type	Default	Description
`collection_type`	`CollectionType` (str)	`None`	e.g. `"list"`, `"paired"`, `"list:paired"`, `"record"`, `"sample_sheet"`, `"sample_sheet:paired"`, `"paired_or_unpaired"`
`element_identifiers`	`list[CollectionElementIdentifier]`	`None`	Elements to include
`name`	`str`	`None`	Display name
`instance_type`	`"history"` | `"library"`	`"history"`	Where to create the collection
`history_id`	`DecodedDatabaseIdField`	`None`	Required when `instance_type="history"`
`folder_id`	`LibraryFolderDatabaseIdField`	`None`	Required when `instance_type="library"`
`hide_source_items`	`bool`	`False`	Hide original HDAs after collection creation
`copy_elements`	`bool`	`True`	Copy source HDAs vs reference them
`fields`	`str | list[FieldDict]`	`[]`	For `record` type: field definitions, or `"auto"` to guess them from the element identifiers
`column_definitions`	`SampleSheetColumnDefinitions`	`None`	For `sample_sheet` type: column schema
`rows`	`SampleSheetRows` (dict)	`None`	For `sample_sheet` type: `{element_name: [col_values...]}`

`CollectionElementIdentifier`

File: lib/galaxy/schema/schema.py:1740

Field	Type	Description
`name`	`str`	Element identifier name (e.g. `"forward"`, `"data1"`, `"sample1"`)
`src`	`CollectionSourceType`	Source: `"hda"`, `"ldda"`, `"hdca"`, `"new_collection"`
`id`	`DecodedDatabaseIdField`	ID of existing dataset/collection (for `hda`/`ldda`/`hdca`)
`collection_type`	`CollectionType`	For `src="new_collection"`: the sub-collection type
`element_identifiers`	`list[CollectionElementIdentifier]`	For `src="new_collection"`: nested elements
`tags`	`list[str]`	Tags for this element

`CollectionSourceType` enum

hda             — existing HistoryDatasetAssociation
ldda            — existing LibraryDatasetDatasetAssociation
hdca            — existing HistoryDatasetCollectionAssociation (nesting existing collections)
new_collection  — inline sub-collection definition (with nested element_identifiers)
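
To make the difference between the src values concrete, here is a small sketch of helpers that build element identifiers as plain dicts. The helper names (hda_identifier, new_collection_identifier, list_of_pairs_payload) are hypothetical and exist only in this example; the dict keys follow the CollectionElementIdentifier fields listed above.

```python
from typing import Any

Identifier = dict[str, Any]


def hda_identifier(name: str, hda_id: str) -> Identifier:
    """Reference an existing history dataset (src="hda")."""
    return {"name": name, "src": "hda", "id": hda_id}


def new_collection_identifier(name: str, collection_type: str, children: list[Identifier]) -> Identifier:
    """Describe an inline sub-collection (src="new_collection")."""
    return {
        "name": name,
        "src": "new_collection",
        "collection_type": collection_type,
        "element_identifiers": children,
    }


def list_of_pairs_payload(history_id: str, pairs: dict[str, tuple[str, str]]) -> dict[str, Any]:
    """Build a list:paired payload from {sample_name: (forward_id, reverse_id)}."""
    return {
        "collection_type": "list:paired",
        "instance_type": "history",
        "history_id": history_id,
        "element_identifiers": [
            new_collection_identifier(
                sample,
                "paired",
                [hda_identifier("forward", fwd), hda_identifier("reverse", rev)],
            )
            for sample, (fwd, rev) in pairs.items()
        ],
    }


payload = list_of_pairs_payload("<history id>", {"sample1": ("<fwd id>", "<rev id>")})
```

Note that an "hda" identifier carries an id, while a "new_collection" identifier carries a collection_type and nested element_identifiers instead.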

3. Response Schema — `HDCADetailed`

File: lib/galaxy/schema/schema.py:1249

Extends HDCASummary, which extends HDCACommon.

Key fields: id, name, collection_type, elements (list of DCESummary), element_count, populated, populated_state, contents_url, collection_id, column_definitions, implicit_collection_jobs_id, tags, elements_datatypes, elements_states.

Each element (DCESummary) contains: element_identifier, element_index, element_type, object (the HDA or nested DC), columns (for sample sheets), model_class.

4. Collection Types

Type Registry

File: lib/galaxy/model/dataset_collections/registry.py

PLUGIN_CLASSES = [
    ListDatasetCollectionType,
    PairedDatasetCollectionType,
    RecordDatasetCollectionType,
    PairedOrUnpairedDatasetCollectionType,
    SampleSheetDatasetCollectionType,
]

A singleton DatasetCollectionTypesRegistry maps collection_type strings to plugin instances. All plugins extend BaseDatasetCollectionType and implement generate_elements().
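
The registry-plus-plugin arrangement can be pictured with the toy sketch below. It is not Galaxy's code: the class names, the tuple shape yielded by generate_elements(), and the error handling are all simplifications assumed for illustration. Only the general idea, a mapping from type string to a plugin instance that yields indexed elements, comes from the description above.

```python
from typing import Iterator


class BaseType:
    """Toy stand-in for a collection type plugin."""

    collection_type = ""

    def generate_elements(self, elements: dict[str, object]) -> Iterator[tuple[int, str, object]]:
        raise NotImplementedError


class ListType(BaseType):
    collection_type = "list"

    def generate_elements(self, elements):
        # A list accepts any names and keeps the given order.
        for index, (name, obj) in enumerate(elements.items()):
            yield index, name, obj


class PairedType(BaseType):
    collection_type = "paired"

    def generate_elements(self, elements):
        # A pair must consist of exactly "forward" and "reverse".
        if set(elements) != {"forward", "reverse"}:
            raise ValueError("paired collections need 'forward' and 'reverse'")
        yield 0, "forward", elements["forward"]
        yield 1, "reverse", elements["reverse"]


class TypesRegistry:
    """Maps collection_type strings to one plugin instance each."""

    def __init__(self, plugin_classes):
        self._plugins = {cls.collection_type: cls() for cls in plugin_classes}

    def get(self, collection_type: str) -> BaseType:
        try:
            return self._plugins[collection_type]
        except KeyError:
            raise ValueError(f"unknown collection type: {collection_type!r}") from None


REGISTRY = TypesRegistry([ListType, PairedType])
indexed = list(REGISTRY.get("list").generate_elements({"data1": "hda-1", "data2": "hda-2"}))
```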

4a. `list`

File: lib/galaxy/model/dataset_collections/types/list.py

Flat list of arbitrarily-named elements. Simply yields each element with its name.

{
  "collection_type": "list",
  "instance_type": "history",
  "history_id": "<id>",
  "element_identifiers": [
    {"name": "data1", "src": "hda", "id": "<id>"},
    {"name": "data2", "src": "hda", "id": "<id>"},
    {"name": "data3", "src": "hda", "id": "<id>"}
  ]
}

4b. `paired`

File: lib/galaxy/model/dataset_collections/types/paired.py

Exactly two elements named "forward" and "reverse".

{
  "collection_type": "paired",
  "instance_type": "history",
  "history_id": "<id>",
  "element_identifiers": [
    {"name": "forward", "src": "hda", "id": "<id>"},
    {"name": "reverse", "src": "hda", "id": "<id>"}
  ]
}

4c. `record`

File: lib/galaxy/model/dataset_collections/types/record.py

CWL-style record with named fields. Requires a fields parameter defining field names/types, or fields="auto" to guess from identifiers.

{
  "collection_type": "record",
  "instance_type": "history",
  "history_id": "<id>",
  "name": "a record",
  "fields": [
    {"name": "condition", "type": "File"},
    {"name": "control1", "type": "File"},
    {"name": "control2", "type": "File"}
  ],
  "element_identifiers": [
    {"name": "condition", "src": "hda", "id": "<id>"},
    {"name": "control1", "src": "hda", "id": "<id>"},
    {"name": "control2", "src": "hda", "id": "<id>"}
  ]
}

Validation: the field count must match the element count, and the field names must match the element identifiers.
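
The record checks can be sketched as a small standalone function. This is an illustration under assumptions, not the plugin's actual code: the function name resolve_record_fields is hypothetical, names are compared position by position, and "auto" is assumed to produce one File field per element identifier.

```python
def resolve_record_fields(fields, element_identifiers):
    """Return the field definitions for a record, or raise ValueError."""
    names = [identifier["name"] for identifier in element_identifiers]
    if not fields:
        raise ValueError("record collections require 'fields'")
    if fields == "auto":
        # Assumption: every guessed field is a File named after its identifier.
        return [{"name": name, "type": "File"} for name in names]
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    for field, name in zip(fields, names):
        # Assumption: fields and identifiers are compared in order.
        if field["name"] != name:
            raise ValueError(f"field {field['name']!r} does not match element {name!r}")
    return list(fields)


identifiers = [
    {"name": "condition", "src": "hda", "id": "<id>"},
    {"name": "control1", "src": "hda", "id": "<id>"},
]
guessed = resolve_record_fields("auto", identifiers)
```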

4d. `paired_or_unpaired`

File: lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py

Either one element (named "unpaired") or two elements (named "forward" and "reverse").

{
  "collection_type": "paired_or_unpaired",
  "instance_type": "history",
  "history_id": "<id>",
  "element_identifiers": [
    {"name": "unpaired", "src": "hda", "id": "<id>"}
  ]
}

4e. `sample_sheet`

File: lib/galaxy/model/dataset_collections/types/sample_sheet.py

A list with per-element metadata columns. Requires column_definitions and rows.

{
  "collection_type": "sample_sheet",
  "instance_type": "history",
  "history_id": "<id>",
  "name": "my sample sheet",
  "column_definitions": [
    {"type": "int", "name": "replicate", "optional": false},
    {"type": "string", "name": "condition", "optional": false}
  ],
  "rows": {
    "sample1": [1, "control"],
    "sample2": [2, "treatment"]
  },
  "element_identifiers": [
    {"name": "sample1", "src": "hda", "id": "<id>"},
    {"name": "sample2", "src": "hda", "id": "<id>"}
  ]
}

Column types: int, string, boolean, element_identifier (cross-references another element).
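
A minimal sketch of how rows might be checked against column definitions follows. The function names and the exact rules are assumptions for illustration (in particular, that None is accepted only for optional columns and that an element_identifier value must name an element of the same collection); the real checks live in sample_sheet_util.py and may differ.

```python
def validate_sample_sheet_rows(column_definitions, rows, element_names):
    """Raise ValueError if any row disagrees with the column definitions."""
    names = set(element_names)
    for element_name, row in rows.items():
        if element_name not in names:
            raise ValueError(f"row for unknown element {element_name!r}")
        if len(row) != len(column_definitions):
            raise ValueError(
                f"{element_name!r}: expected {len(column_definitions)} values, got {len(row)}"
            )
        for column, value in zip(column_definitions, row):
            _check_value(column, value, names)


def _check_value(column, value, names):
    column_type = column["type"]
    if value is None:
        # Assumption: only optional columns may be left empty.
        if column.get("optional", False):
            return
        raise ValueError(f"column {column['name']!r} is required")
    if column_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif column_type == "string":
        ok = isinstance(value, str)
    elif column_type == "boolean":
        ok = isinstance(value, bool)
    elif column_type == "element_identifier":
        # Assumption: the value must name another element in this collection.
        ok = isinstance(value, str) and value in names
    else:
        raise ValueError(f"unknown column type {column_type!r}")
    if not ok:
        raise ValueError(f"column {column['name']!r}: {value!r} is not a valid {column_type}")


columns = [
    {"type": "int", "name": "replicate", "optional": False},
    {"type": "string", "name": "condition", "optional": False},
]
validate_sample_sheet_rows(
    columns,
    {"sample1": [1, "control"], "sample2": [2, "treatment"]},
    ["sample1", "sample2"],
)
```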

4f. Nested types (colon notation)

Types can be composed: "list:paired", "list:list", "sample_sheet:paired", etc. The string is split on : — the first segment is the “rank” (outer) type, the rest describe the inner structure.

Example: list:paired — use src="new_collection" for inner collections:

{
  "collection_type": "list:paired",
  "instance_type": "history",
  "history_id": "<id>",
  "name": "a nested collection",
  "element_identifiers": [
    {
      "name": "test_level_1",
      "src": "new_collection",
      "collection_type": "paired",
      "element_identifiers": [
        {"name": "forward", "src": "hda", "id": "<id>"},
        {"name": "reverse", "src": "hda", "id": "<id>"}
      ]
    }
  ]
}

Example: sample_sheet:paired — sample sheet wrapping paired collections:

{
  "collection_type": "sample_sheet:paired",
  "instance_type": "history",
  "history_id": "<id>",
  "column_definitions": [{"type": "int", "name": "replicate", "optional": false}],
  "rows": {"sample1": [42]},
  "element_identifiers": [
    {
      "name": "sample1",
      "src": "new_collection",
      "collection_type": "paired",
      "element_identifiers": [
        {"name": "forward", "src": "hda", "id": "<id>"},
        {"name": "reverse", "src": "hda", "id": "<id>"}
      ]
    }
  ]
}

Nested type regex (from type_description.py):

^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$
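
The sketch below compiles the regex shown above and splits a type string into its rank (outer) type and the remaining inner structure. The constant and function names are chosen for this example and are not claimed to match type_description.py.

```python
import re
from typing import Optional

COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*"
    r"|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)


def split_collection_type(collection_type: str) -> tuple[str, Optional[str]]:
    """Return (rank_type, inner_type); inner_type is None for flat types."""
    if not COLLECTION_TYPE_REGEX.match(collection_type):
        raise ValueError(f"invalid collection type: {collection_type!r}")
    rank, _, rest = collection_type.partition(":")
    return rank, rest or None


rank, inner = split_collection_type("list:list:paired")
```

Under this regex, sample_sheet may only appear as the outermost type and can wrap at most one inner level, so a string such as "sample_sheet:list" or "list:sample_sheet" is rejected.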

5. Plumbing — The Creation Call Chain

API endpoint (dataset_collections.py)
  └─ DatasetCollectionsService.create()                    [services/dataset_collections.py]
       ├─ api_payload_to_create_params()                   [managers/collections_util.py]
       │    ├─ validates required params (collection_type, element_identifiers)
       │    └─ validate_column_definitions()               [types/sample_sheet_util.py]
       │
       └─ DatasetCollectionManager.create()                [managers/collections.py:180]
            ├─ validate_input_element_identifiers()        [managers/collections_util.py:52]
            │    ├─ no __object__ injection
            │    ├─ all elements have names
            │    ├─ no duplicate names
            │    ├─ src in {hda, hdca, ldda, new_collection}
            │    └─ new_collection requires element_identifiers
            │
            ├─ create_dataset_collection()                 [managers/collections.py:309]
            │    ├─ CollectionTypeDescriptionFactory.for_collection_type()
            │    │    → CollectionTypeDescription wrapping the type string
            │    │
            │    ├─ _element_identifiers_to_elements()     [managers/collections.py:403]
            │    │    ├─ for nested types: __recursively_create_collections_for_identifiers()
            │    │    └─ __load_elements() — resolves src/id → actual model objects
            │    │
            │    ├─ rank_type_plugin()
            │    │    → looks up type string in DatasetCollectionTypesRegistry
            │    │
            │    └─ builder.build_collection(type_plugin, elements, ...)
            │         └─ set_collection_elements()
            │              └─ type_plugin.generate_elements()
            │                   → yields DatasetCollectionElement objects
            │
            └─ _create_instance_for_collection()           [managers/collections.py:250]
                 ├─ creates HistoryDatasetCollectionAssociation (or LDCA)
                 ├─ wires up implicit inputs/outputs (for workflow-generated collections)
                 ├─ applies tags
                 └─ persists to database

Key function: `_element_identifiers_to_elements()`

File: lib/galaxy/managers/collections.py:403

Resolves identifier dicts into actual model objects:

For nested types, recursively builds inner DatasetCollections first
Resolves src="hda" → loads HDA by ID, src="hdca" → loads collection
Returns an ordered dict of {name: model_object}
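
The recursive resolution described above can be sketched with plain dicts. Here load_object is a hypothetical callback standing in for the database lookups, and nested collections are represented as dicts rather than DatasetCollection models, so this shows only the shape of the recursion, not the manager's implementation.

```python
def identifiers_to_elements(element_identifiers, load_object):
    """Resolve identifier dicts into an ordered {name: object} mapping."""
    elements = {}
    for identifier in element_identifiers:
        name = identifier["name"]
        if identifier["src"] == "new_collection":
            # Build the inner collection first, then attach it under its name.
            elements[name] = {
                "collection_type": identifier["collection_type"],
                "elements": identifiers_to_elements(
                    identifier["element_identifiers"], load_object
                ),
            }
        else:
            # hda / ldda / hdca: look up the existing object by source and ID.
            elements[name] = load_object(identifier["src"], identifier["id"])
    return elements


# Fake storage used only for this example.
fake_db = {("hda", "f1"): "HDA f1", ("hda", "r1"): "HDA r1"}
resolved = identifiers_to_elements(
    [
        {
            "name": "test_level_1",
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [
                {"name": "forward", "src": "hda", "id": "f1"},
                {"name": "reverse", "src": "hda", "id": "r1"},
            ],
        }
    ],
    lambda src, object_id: fake_db[(src, object_id)],
)
```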

Key function: `builder.build_collection()`

File: lib/galaxy/model/dataset_collections/builder.py:27

Creates a DatasetCollection model, then calls set_collection_elements() which delegates to the type plugin’s generate_elements() to produce DatasetCollectionElement objects with proper indices.

6. The Fetch Path (Alternative Creation)

Endpoint: POST /api/tools/fetch

Instead of creating datasets first and then referencing them, the fetch API uploads the data and creates the collection in a single request. The payload uses a targets array:

{
  "history_id": "<id>",
  "targets": [
    {
      "destination": {"type": "hdca"},
      "elements": [
        {"src": "pasted", "paste_content": "data...", "name": "data1"},
        {"src": "url", "url": "https://...", "name": "data2"},
        {"src": "files", "dbkey": "hg19", "info": "..."}
      ],
      "collection_type": "list",
      "name": "My Collection",
      "tags": ["name:mytag"]
    }
  ]
}

Element src values used in these examples: "pasted", "url", and "files" (multipart upload).

For nested fetch collections:

{
  "targets": [{
    "destination": {"type": "hdca"},
    "collection_type": "list:list",
    "elements": [
      {
        "name": "samp1",
        "elements": [
          {"src": "files", "dbkey": "hg19"}
        ]
      }
    ]
  }]
}

For sample_sheet fetch:

{
  "targets": [{
    "destination": {"type": "hdca"},
    "collection_type": "sample_sheet",
    "column_definitions": [{"type": "int", "name": "replicate", "optional": false}],
    "elements": [
      {"src": "url", "url": "...", "name": "sample1", "row": [42]}
    ]
  }]
}

Note: on the fetch path each element carries its own row list, whereas on the direct path the payload carries a single rows dict keyed by element name.
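
To illustrate the difference in shape, the sketch below converts a direct-path rows dict into fetch-style elements that each carry a row. The function name rows_to_fetch_elements and the sources mapping are hypothetical, introduced only for this example.

```python
def rows_to_fetch_elements(rows, sources):
    """Turn {element_name: [values...]} into fetch elements with a per-element row.

    sources maps each element name to its fetch source fields,
    for example {"src": "url", "url": "https://..."}.
    """
    elements = []
    for name, row in rows.items():
        element = dict(sources[name])  # copy so the caller's dict is untouched
        element["name"] = name
        element["row"] = list(row)
        elements.append(element)
    return elements


fetch_elements = rows_to_fetch_elements(
    {"sample1": [42]},
    {"sample1": {"src": "url", "url": "https://example.org/sample1.txt"}},
)
```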

7. Testing

File: lib/galaxy_test/api/test_dataset_collections.py

Test Matrix

Test	Collection Type	Creation Path	What’s Tested
`test_create_pair_from_history`	`paired`	fetch	Basic pair creation, 2 elements
`test_create_list_from_history`	`list`	direct	Basic list creation, 3 elements
`test_create_list_of_existing_pairs`	`list` (of `hdca`)	direct	Nesting existing collections via `src="hdca"`
`test_create_list_of_new_pairs`	`list:paired`	direct	Nested creation via `src="new_collection"`
`test_create_paried_or_unpaired`	`paired_or_unpaired`	direct	Single-element unpaired
`test_create_record`	`record`	direct	Record with explicit fields
`test_record_requires_fields`	`record`	direct	400 when fields missing
`test_record_auto_fields`	`record`	direct	`fields="auto"`
`test_record_field_validation`	`record`	direct	Wrong count / wrong names → 400
`test_sample_sheet_requires_columns`	`sample_sheet`	direct	Columns on response elements
`test_sample_sheet_column_definition_problems`	`sample_sheet`	direct	Invalid column defs → 400
`test_sample_sheet_element_identifier_column_type`	`sample_sheet`	direct	`element_identifier` column type
`test_sample_sheet_validating_against_column_definition`	`sample_sheet`	direct	Type mismatch + validator failure
`test_sample_sheet_of_pairs_creation`	`sample_sheet:paired`	direct	Nested sample sheet
`test_sample_sheet_map_over_preserves_columns`	`sample_sheet`	direct	Columns survive tool mapping
`test_copy_sample_sheet_collection`	`sample_sheet`	direct	Columns survive copy
`test_upload_collection`	`list`	fetch	File upload with tags
`test_upload_nested`	`list:list`	fetch	Nested fetch upload
`test_upload_collection_from_url`	`list`	fetch	URL-based upload
`test_upload_collection_deferred`	`list`	fetch	Deferred (lazy) upload
`test_upload_flat_sample_sheet`	`sample_sheet`	fetch	Sample sheet via fetch
`test_upload_sample_sheet_paired`	`sample_sheet:paired`	fetch	Nested sample sheet via fetch
`test_enforces_unique_names`	`list`	direct	Duplicate names → 400
`test_hda_security`	`paired`	direct	Cannot use another user’s HDA → 403

Test Helper: `_check_create_response`

Handles both direct and fetch responses:

def _check_create_response(self, create_response):
    self._assert_status_code_is(create_response, 200)
    dataset_collection = create_response.json()
    if "output_collections" in dataset_collection:
        # fetch response — follow up with GET
        dataset_collection = dataset_collection["output_collections"][0]
        dataset_collection = self._get(f"dataset_collections/{dataset_collection['id']}").json()
    self._assert_has_keys(dataset_collection, "elements", "url", "name", "collection_type", "element_count")
    return dataset_collection

Standalone helpers

def assert_one_collection_created_in_history(dataset_populator, history_id):
    """List history contents, assert exactly one collection exists, and return its full details."""
    ...

def upload_flat_sample_sheet(dataset_populator):
    """Create a sample_sheet collection via fetch and validate its columns."""
    ...

8. Test Populator Helpers

File: lib/galaxy_test/base/populators.py

`DatasetCollectionPopulator`

Central test helper class. Key methods:

Identifier builders

Method	Returns	Description
`list_identifiers(history_id, contents)`	`[{name, src, id}...]`	Creates N HDAs, returns element identifiers for a list
`pair_identifiers(history_id, contents)`	`[{name:"forward",...}, {name:"reverse",...}]`	Creates 2 HDAs, returns forward/reverse identifiers
`nested_collection_identifiers(history_id, collection_type)`	nested identifier tree	Recursively builds identifiers for `list:paired` etc.

Payload builders

Method	Description
`create_list_payload(history_id, **kwds)`	Builds payload for list creation (delegates to fetch or direct based on `direct_upload` kwarg)
`create_pair_payload(history_id, **kwds)`	Builds payload for pair creation

Both call __create_payload() which dispatches:

direct_upload=True (default) → __create_payload_fetch() → builds a targets-based fetch payload
direct_upload=False → __create_payload_collection() → builds an element_identifiers-based direct payload

Collection creators

Method	Description
`create_list_in_history(history_id)`	Creates a list in history
`create_pair_in_history(history_id)`	Creates a pair in history
`create_list_of_pairs_in_history(history_id)`	Creates `list:paired` via `upload_collection()`
`create_list_of_list_in_history(history_id)`	Creates `list:list` (or deeper) by chaining — first creates an inner list, then wraps it via `create_nested_collection()`
`upload_collection(history_id, collection_type, elements)`	Generic fetch-based upload
`create_nested_collection(history_id, collection_type, collection)`	Creates nested collection from existing HDCA IDs via `src="hdca"`
`copy_collection(history_id, hdca_id)`	Copies a collection via `POST histories/{id}/contents/dataset_collections`

Dispatch logic — `__create(payload)`

def __create(self, payload, wait=False):
    if "targets" not in payload:
        return self._create_collection(payload)   # POST /api/dataset_collections
    else:
        return self.dataset_populator.fetch(payload)  # POST /api/tools/fetch

Sample sheet helpers

Method	Description
`download_workbook(collection_type, column_definitions)`	Downloads XLSX workbook template
`download_workbook_for_collection(hdca_id, column_definitions)`	Downloads workbook for existing collection
`parse_workbook(xlsx_content, collection_type, column_definitions)`	Parses uploaded XLSX into rows
`parse_workbook_for_collection(hdca_id, xlsx_content, column_definitions)`	Parses XLSX against existing collection

9. Validation Summary

Validation happens at multiple layers:

Schema level (CreateNewCollectionPayload) — Pydantic type validation
api_payload_to_create_params() — requires collection_type + element_identifiers
validate_column_definitions() — validates sample sheet column defs against SampleSheetColumnDefinitionModel
validate_input_element_identifiers() — no __object__ injection, names required, no duplicates, valid src
Type plugin generate_elements() — type-specific validation:
- record: field count/names must match elements
- paired: expects forward/reverse
- paired_or_unpaired: 1 or 2 elements only
- sample_sheet: validates row data types against column definitions
Security — user must own/have access to referenced HDAs; 403 otherwise
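
The element identifier checks listed above can be approximated by the following sketch. It mirrors the listed rules (no __object__ key, names required, no duplicate names, a valid src, and nested element_identifiers for new_collection) but is an illustration with a hypothetical function name, not the code in collections_util.py; in particular, raising ValueError stands in for whatever exception produces the 400 response.

```python
ALLOWED_SRCS = {"hda", "hdca", "ldda", "new_collection"}


def check_element_identifiers(element_identifiers):
    """Raise ValueError on the first identifier that breaks a rule."""
    seen_names = set()
    for identifier in element_identifiers:
        if "__object__" in identifier:
            raise ValueError("__object__ may not be supplied by the client")
        name = identifier.get("name")
        if not name:
            raise ValueError("every element needs a name")
        if name in seen_names:
            raise ValueError(f"duplicate element name: {name!r}")
        seen_names.add(name)
        src = identifier.get("src")
        if src not in ALLOWED_SRCS:
            raise ValueError(f"invalid src: {src!r}")
        if src == "new_collection":
            if "element_identifiers" not in identifier:
                raise ValueError("new_collection requires element_identifiers")
            # Names only need to be unique within one level, so recurse afresh.
            check_element_identifiers(identifier["element_identifiers"])


check_element_identifiers(
    [
        {"name": "data1", "src": "hda", "id": "<id>"},
        {"name": "data2", "src": "hda", "id": "<id>"},
    ]
)
```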

10. Key Files

File	Purpose
`lib/galaxy/webapps/galaxy/api/dataset_collections.py`	FastAPI endpoint
`lib/galaxy/webapps/galaxy/services/dataset_collections.py`	Service layer
`lib/galaxy/managers/collections.py`	Core manager with `create()`, `create_dataset_collection()`
`lib/galaxy/managers/collections_util.py`	Payload parsing, element identifier validation
`lib/galaxy/model/dataset_collections/registry.py`	Type plugin registry
`lib/galaxy/model/dataset_collections/type_description.py`	`CollectionTypeDescription` — nested type parsing
`lib/galaxy/model/dataset_collections/builder.py`	`build_collection()`, `set_collection_elements()`
`lib/galaxy/model/dataset_collections/types/__init__.py`	`BaseDatasetCollectionType` abstract base
`lib/galaxy/model/dataset_collections/types/list.py`	List type plugin
`lib/galaxy/model/dataset_collections/types/paired.py`	Paired type plugin
`lib/galaxy/model/dataset_collections/types/record.py`	Record type plugin
`lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py`	PairedOrUnpaired type plugin
`lib/galaxy/model/dataset_collections/types/sample_sheet.py`	SampleSheet type plugin
`lib/galaxy/model/dataset_collections/types/sample_sheet_util.py`	Column definition & row validation
`lib/galaxy/schema/schema.py`	Pydantic models (`CreateNewCollectionPayload`, `CollectionElementIdentifier`, `HDCADetailed`)
`lib/galaxy_test/api/test_dataset_collections.py`	API tests
`lib/galaxy_test/base/populators.py`	`DatasetCollectionPopulator` test helper
