Component Collection Creation API

Two-path collection creation: direct POST with element identifiers or fetch API uploading new data atomically

Revised: 2026-04-22
Revision: 2
Related Notes: Component - Auto Pairing, Component - Collection API, Component - Collection Models, Component - Data Fetch, Problem - Workflow Test Collection Inputs

Dataset Collection Creation API - Deep Dive

Overview

Dataset collections are containers that group related datasets together. Galaxy supports several collection types (list, paired, record, sample_sheet, paired_or_unpaired) and nested compositions of these (e.g. list:paired).

There are two creation paths:

  1. Direct creation: POST /api/dataset_collections creates a collection from datasets that already exist in a history
  2. Fetch-based creation: POST /api/tools/fetch uploads new data and creates the collection in one step

Both produce the same result: a HistoryDatasetCollectionAssociation (HDCA) in a history.


1. API Endpoint

File: lib/galaxy/webapps/galaxy/api/dataset_collections.py

POST /api/dataset_collections
  • Request body: CreateNewCollectionPayload
  • Response: HDCADetailed

The endpoint delegates to DatasetCollectionsService.create().
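As a rough illustration of calling this endpoint, the sketch below builds (but does not send) the POST request with Python's standard library. The server URL, the API key, and the build_create_request helper are placeholders invented for this example, and authenticating through an x-api-key header is an assumption about the target server.

```python
import json
import urllib.request


def build_create_request(base_url, api_key, payload):
    """Build a POST request for /api/dataset_collections without sending it."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dataset_collections",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Placeholder values: substitute a real server, key, and encoded IDs.
payload = {
    "collection_type": "list",
    "instance_type": "history",
    "history_id": "<id>",
    "element_identifiers": [{"name": "data1", "src": "hda", "id": "<id>"}],
}
request = build_create_request("https://galaxy.example.org", "<api-key>", payload)
# urllib.request.urlopen(request) would send it; the JSON body of the
# response is expected to follow the HDCADetailed schema.
```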


2. Request Schema — CreateNewCollectionPayload

File: lib/galaxy/schema/schema.py:1769

| Field | Type | Default | Description |
|---|---|---|---|
| collection_type | CollectionType (str) | None | e.g. "list", "paired", "list:paired", "record", "sample_sheet", "sample_sheet:paired", "paired_or_unpaired" |
| element_identifiers | list[CollectionElementIdentifier] | None | Elements to include |
| name | str | None | Display name |
| instance_type | "history" or "library" | "history" | Where to create the collection |
| history_id | DecodedDatabaseIdField | None | Required when instance_type="history" |
| folder_id | LibraryFolderDatabaseIdField | None | Required when instance_type="library" |
| hide_source_items | bool | False | Hide original HDAs after collection creation |
| copy_elements | bool | True | Copy source HDAs vs reference them |
| fields | str or list[FieldDict] | [] | For record type: field definitions. "auto" to guess from identifiers |
| column_definitions | SampleSheetColumnDefinitions | None | For sample_sheet type: column schema |
| rows | SampleSheetRows (dict) | None | For sample_sheet type: {element_name: [col_values...]} |

CollectionElementIdentifier

File: lib/galaxy/schema/schema.py:1740

| Field | Type | Description |
|---|---|---|
| name | str | Element identifier name (e.g. "forward", "data1", "sample1") |
| src | CollectionSourceType | Source: "hda", "ldda", "hdca", "new_collection" |
| id | DecodedDatabaseIdField | ID of existing dataset/collection (for hda/ldda/hdca) |
| collection_type | CollectionType | For src="new_collection": the sub-collection type |
| element_identifiers | list[CollectionElementIdentifier] | For src="new_collection": nested elements |
| tags | list[str] | Tags for this element |

CollectionSourceType enum

hda             — existing HistoryDatasetAssociation
ldda            — existing LibraryDatasetDatasetAssociation
hdca            — existing HistoryDatasetCollectionAssociation (nesting existing collections)
new_collection  — inline sub-collection definition (with nested element_identifiers)

3. Response Schema — HDCADetailed

File: lib/galaxy/schema/schema.py:1249

Extends HDCASummary which extends HDCACommon.

Key fields: id, name, collection_type, elements (list of DCESummary), element_count, populated, populated_state, contents_url, collection_id, column_definitions, implicit_collection_jobs_id, tags, elements_datatypes, elements_states.

Each element (DCESummary) contains: element_identifier, element_index, element_type, object (the HDA or nested DC), columns (for sample sheets), model_class.


4. Collection Types

Type Registry

File: lib/galaxy/model/dataset_collections/registry.py

PLUGIN_CLASSES = [
    ListDatasetCollectionType,
    PairedDatasetCollectionType,
    RecordDatasetCollectionType,
    PairedOrUnpairedDatasetCollectionType,
    SampleSheetDatasetCollectionType,
]

A singleton DatasetCollectionTypesRegistry maps collection_type strings to plugin instances. All plugins extend BaseDatasetCollectionType and implement generate_elements().
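The registry pattern described here can be sketched with a toy version. Everything below (BaseType, ListType, PairedType, Registry) is a simplified stand-in written for this note, not Galaxy's classes: the real plugins yield DatasetCollectionElement objects and have a different generate_elements() signature.

```python
class BaseType:
    """Simplified stand-in for BaseDatasetCollectionType."""

    collection_type = None

    def generate_elements(self, elements):
        raise NotImplementedError


class ListType(BaseType):
    collection_type = "list"

    def generate_elements(self, elements):
        # A list accepts arbitrary names and yields them in input order.
        yield from elements.items()


class PairedType(BaseType):
    collection_type = "paired"

    def generate_elements(self, elements):
        # A pair expects exactly the names "forward" and "reverse".
        if set(elements) != {"forward", "reverse"}:
            raise ValueError("paired collections need forward and reverse")
        yield "forward", elements["forward"]
        yield "reverse", elements["reverse"]


class Registry:
    """Maps collection_type strings to one plugin instance each."""

    def __init__(self, plugin_classes):
        self._plugins = {cls.collection_type: cls() for cls in plugin_classes}

    def get(self, collection_type):
        try:
            return self._plugins[collection_type]
        except KeyError:
            raise ValueError(f"unknown collection type: {collection_type}") from None


REGISTRY = Registry([ListType, PairedType])
pair = dict(
    REGISTRY.get("paired").generate_elements({"reverse": "r.fq", "forward": "f.fq"})
)
```

In this sketch the paired plugin normalizes element order, while the list plugin preserves whatever order it is given.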

4a. list

File: lib/galaxy/model/dataset_collections/types/list.py

Flat list of arbitrarily-named elements. Simply yields each element with its name.

{
  "collection_type": "list",
  "instance_type": "history",
  "history_id": "<id>",
  "element_identifiers": [
    {"name": "data1", "src": "hda", "id": "<id>"},
    {"name": "data2", "src": "hda", "id": "<id>"},
    {"name": "data3", "src": "hda", "id": "<id>"}
  ]
}

4b. paired

File: lib/galaxy/model/dataset_collections/types/paired.py

Exactly two elements named "forward" and "reverse".

{
  "collection_type": "paired",
  "instance_type": "history",
  "history_id": "<id>",
  "element_identifiers": [
    {"name": "forward", "src": "hda", "id": "<id>"},
    {"name": "reverse", "src": "hda", "id": "<id>"}
  ]
}

4c. record

File: lib/galaxy/model/dataset_collections/types/record.py

CWL-style record with named fields. Requires a fields parameter defining field names/types, or fields="auto" to guess from identifiers.

{
  "collection_type": "record",
  "instance_type": "history",
  "history_id": "<id>",
  "name": "a record",
  "fields": [
    {"name": "condition", "type": "File"},
    {"name": "control1", "type": "File"},
    {"name": "control2", "type": "File"}
  ],
  "element_identifiers": [
    {"name": "condition", "src": "hda", "id": "<id>"},
    {"name": "control1", "src": "hda", "id": "<id>"},
    {"name": "control2", "src": "hda", "id": "<id>"}
  ]
}

Validation: field count must match element count, field names must match element identifiers.
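A minimal sketch of that validation rule follows. The function name check_record_fields and the use of ValueError are illustrative choices, and comparing names position by position is an assumption; the note only says that counts and names must match.

```python
def check_record_fields(fields, element_identifiers):
    """Raise ValueError unless fields and elements agree in count and names."""
    field_names = [field["name"] for field in fields]
    element_names = [element["name"] for element in element_identifiers]
    if len(field_names) != len(element_names):
        raise ValueError(
            f"record declares {len(field_names)} fields "
            f"but {len(element_names)} elements were supplied"
        )
    # Assumption: names are compared in order, field i against element i.
    for field_name, element_name in zip(field_names, element_names):
        if field_name != element_name:
            raise ValueError(
                f"element {element_name!r} does not match field {field_name!r}"
            )


fields = [
    {"name": "condition", "type": "File"},
    {"name": "control1", "type": "File"},
]
elements = [
    {"name": "condition", "src": "hda", "id": "<id>"},
    {"name": "control1", "src": "hda", "id": "<id>"},
]
check_record_fields(fields, elements)  # passes silently
```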

4d. paired_or_unpaired

File: lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py

Either 1 element (unpaired) or 2 (paired with forward/reverse).

{
  "collection_type": "paired_or_unpaired",
  "instance_type": "history",
  "history_id": "<id>",
  "element_identifiers": [
    {"name": "unpaired", "src": "hda", "id": "<id>"}
  ]
}

4e. sample_sheet

File: lib/galaxy/model/dataset_collections/types/sample_sheet.py

A list with per-element metadata columns. Requires column_definitions and rows.

{
  "collection_type": "sample_sheet",
  "instance_type": "history",
  "history_id": "<id>",
  "name": "my sample sheet",
  "column_definitions": [
    {"type": "int", "name": "replicate", "optional": false},
    {"type": "string", "name": "condition", "optional": false}
  ],
  "rows": {
    "sample1": [1, "control"],
    "sample2": [2, "treatment"]
  },
  "element_identifiers": [
    {"name": "sample1", "src": "hda", "id": "<id>"},
    {"name": "sample2", "src": "hda", "id": "<id>"}
  ]
}

Column types: int, string, boolean, element_identifier (cross-references another element).
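To make the column types concrete, here is a hedged sketch of checking one row against its column definitions. validate_row is a hypothetical helper, not the function in sample_sheet_util.py. Two assumptions are built in: an optional column may hold None, and an element_identifier value must name an element of the same collection.

```python
def validate_row(row, column_definitions, element_names):
    """Raise ValueError if a row does not fit the column definitions."""
    if len(row) != len(column_definitions):
        raise ValueError("row length does not match number of columns")
    for value, column in zip(row, column_definitions):
        if value is None:
            # Assumption: only optional columns may be left empty.
            if not column.get("optional", False):
                raise ValueError(f"column {column['name']!r} is required")
            continue
        kind = column["type"]
        if kind == "int":
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = isinstance(value, int) and not isinstance(value, bool)
        elif kind == "string":
            ok = isinstance(value, str)
        elif kind == "boolean":
            ok = isinstance(value, bool)
        elif kind == "element_identifier":
            # Assumption: the value must name an element in this collection.
            ok = value in element_names
        else:
            raise ValueError(f"unknown column type {kind!r}")
        if not ok:
            raise ValueError(f"{value!r} is not valid for column {column['name']!r}")


columns = [
    {"type": "int", "name": "replicate", "optional": False},
    {"type": "string", "name": "condition", "optional": False},
]
rows = {"sample1": [1, "control"], "sample2": [2, "treatment"]}
for row in rows.values():
    validate_row(row, columns, element_names=set(rows))
```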

4f. Nested types (colon notation)

Types can be composed: "list:paired", "list:list", "sample_sheet:paired", etc. The string is split on : — the first segment is the “rank” (outer) type, the rest describe the inner structure.

Example: list:paired — use src="new_collection" for inner collections:

{
  "collection_type": "list:paired",
  "instance_type": "history",
  "history_id": "<id>",
  "name": "a nested collection",
  "element_identifiers": [
    {
      "name": "test_level_1",
      "src": "new_collection",
      "collection_type": "paired",
      "element_identifiers": [
        {"name": "forward", "src": "hda", "id": "<id>"},
        {"name": "reverse", "src": "hda", "id": "<id>"}
      ]
    }
  ]
}
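Payloads like the list:paired example above are tedious to write by hand, so a small builder can help. The helper names below (paired_identifier, list_of_pairs_payload) are invented for this sketch; only the dictionary keys come from the schema described in this note.

```python
def paired_identifier(name, forward_id, reverse_id):
    """Build one inline 'paired' sub-collection identifier."""
    return {
        "name": name,
        "src": "new_collection",
        "collection_type": "paired",
        "element_identifiers": [
            {"name": "forward", "src": "hda", "id": forward_id},
            {"name": "reverse", "src": "hda", "id": reverse_id},
        ],
    }


def list_of_pairs_payload(history_id, name, pairs):
    """Build a list:paired payload from {sample_name: (forward_id, reverse_id)}."""
    return {
        "collection_type": "list:paired",
        "instance_type": "history",
        "history_id": history_id,
        "name": name,
        "element_identifiers": [
            paired_identifier(sample, forward_id, reverse_id)
            for sample, (forward_id, reverse_id) in pairs.items()
        ],
    }


payload = list_of_pairs_payload(
    "<history-id>",
    "a nested collection",
    {"test_level_1": ("<fwd-id>", "<rev-id>")},
)
```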

Example: sample_sheet:paired — sample sheet wrapping paired collections:

{
  "collection_type": "sample_sheet:paired",
  "instance_type": "history",
  "history_id": "<id>",
  "column_definitions": [{"type": "int", "name": "replicate", "optional": false}],
  "rows": {"sample1": [42]},
  "element_identifiers": [
    {
      "name": "sample1",
      "src": "new_collection",
      "collection_type": "paired",
      "element_identifiers": [
        {"name": "forward", "src": "hda", "id": "<id>"},
        {"name": "reverse", "src": "hda", "id": "<id>"}
      ]
    }
  ]
}

Nested type regex (from type_description.py):

^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*
  |sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$
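The regex and the split-on-colon rule can be tried out directly in Python. The pattern below is the one quoted above, assembled from a shared sub-pattern; split_rank is a hypothetical helper written for this note, not a function from type_description.py.

```python
import re

_SIMPLE = r"(list|paired|paired_or_unpaired|record)"
COLLECTION_TYPE_RE = re.compile(
    rf"^({_SIMPLE}(:{_SIMPLE})*"
    r"|sample_sheet|sample_sheet:paired|sample_sheet:record"
    r"|sample_sheet:paired_or_unpaired)$"
)


def split_rank(collection_type):
    """Return (rank_type, inner_type_or_None) for a valid collection type."""
    if not COLLECTION_TYPE_RE.match(collection_type):
        raise ValueError(f"invalid collection type: {collection_type!r}")
    # The first segment is the outer ("rank") type; the rest is the inner structure.
    rank, _, rest = collection_type.partition(":")
    return rank, rest or None


examples = ["list", "list:paired", "list:list:paired", "sample_sheet:paired"]
parsed = {name: split_rank(name) for name in examples}
```

Note that the pattern only allows sample_sheet as the outermost type and at most one level beneath it, so strings such as "list:sample_sheet" or "sample_sheet:list" are rejected.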

5. Plumbing — The Creation Call Chain

API endpoint (dataset_collections.py)
  └─ DatasetCollectionsService.create()                    [services/dataset_collections.py]
       ├─ api_payload_to_create_params()                   [managers/collections_util.py]
       │    ├─ validates required params (collection_type, element_identifiers)
       │    └─ validate_column_definitions()               [types/sample_sheet_util.py]

       └─ DatasetCollectionManager.create()                [managers/collections.py:180]
            ├─ validate_input_element_identifiers()        [managers/collections_util.py:52]
            │    ├─ no __object__ injection
            │    ├─ all elements have names
            │    ├─ no duplicate names
            │    ├─ src in {hda, hdca, ldda, new_collection}
            │    └─ new_collection requires element_identifiers

            ├─ create_dataset_collection()                 [managers/collections.py:309]
            │    ├─ CollectionTypeDescriptionFactory.for_collection_type()
            │    │    → CollectionTypeDescription wrapping the type string
            │    │
            │    ├─ _element_identifiers_to_elements()     [managers/collections.py:403]
            │    │    ├─ for nested types: __recursively_create_collections_for_identifiers()
            │    │    └─ __load_elements() — resolves src/id → actual model objects
            │    │
            │    ├─ rank_type_plugin()
            │    │    → looks up type string in DatasetCollectionTypesRegistry
            │    │
            │    └─ builder.build_collection(type_plugin, elements, ...)
            │         └─ set_collection_elements()
            │              └─ type_plugin.generate_elements()
            │                   → yields DatasetCollectionElement objects

            └─ _create_instance_for_collection()           [managers/collections.py:250]
                 ├─ creates HistoryDatasetCollectionAssociation (or LDCA)
                 ├─ wires up implicit inputs/outputs (for workflow-generated collections)
                 ├─ applies tags
                 └─ persists to database

Key function: _element_identifiers_to_elements()

File: lib/galaxy/managers/collections.py:403

Resolves identifier dicts into actual model objects:

  • For nested types, recursively builds inner DatasetCollections first
  • Resolves src="hda" → loads HDA by ID, src="hdca" → loads collection
  • Returns an ordered dict of {name: model_object}
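The recursion described in these bullets might look roughly like the sketch below. It is not the Galaxy implementation: identifiers_to_elements and the load callback are hypothetical, plain dicts stand in for model objects, and inner collections are represented as nested dicts rather than DatasetCollection instances.

```python
def identifiers_to_elements(element_identifiers, load):
    """Resolve identifier dicts to {name: object}, building inner levels first.

    `load(src, id)` is a caller-supplied lookup standing in for the database.
    """
    elements = {}  # dicts preserve insertion order, so element order is kept
    for identifier in element_identifiers:
        name = identifier["name"]
        if identifier["src"] == "new_collection":
            # Recurse so the inner collection exists before its parent.
            elements[name] = {
                "collection_type": identifier["collection_type"],
                "elements": identifiers_to_elements(
                    identifier["element_identifiers"], load
                ),
            }
        else:
            elements[name] = load(identifier["src"], identifier["id"])
    return elements


def fake_load(src, object_id):
    # Placeholder lookup: a real one would fetch an HDA, LDDA, or HDCA.
    return f"{src}:{object_id}"


resolved = identifiers_to_elements(
    [
        {
            "name": "sample1",
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [
                {"name": "forward", "src": "hda", "id": "1"},
                {"name": "reverse", "src": "hda", "id": "2"},
            ],
        },
        {"name": "existing", "src": "hdca", "id": "9"},
    ],
    fake_load,
)
```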

Key function: builder.build_collection()

File: lib/galaxy/model/dataset_collections/builder.py:27

Creates a DatasetCollection model, then calls set_collection_elements() which delegates to the type plugin’s generate_elements() to produce DatasetCollectionElement objects with proper indices.


6. The Fetch Path (Alternative Creation)

Endpoint: POST /api/tools/fetch

Instead of creating datasets first then referencing them, the fetch API uploads data and creates collections atomically. The payload uses a targets array:

{
  "history_id": "<id>",
  "targets": [
    {
      "destination": {"type": "hdca"},
      "elements": [
        {"src": "pasted", "paste_content": "data...", "name": "data1"},
        {"src": "url", "url": "https://...", "name": "data2"},
        {"src": "files", "dbkey": "hg19", "info": "..."}
      ],
      "collection_type": "list",
      "name": "My Collection",
      "tags": ["name:mytag"]
    }
  ]
}

Element src values for fetch: "pasted", "url", "files" (multipart upload).

For nested fetch collections:

{
  "targets": [{
    "destination": {"type": "hdca"},
    "collection_type": "list:list",
    "elements": [
      {
        "name": "samp1",
        "elements": [
          {"src": "files", "dbkey": "hg19"}
        ]
      }
    ]
  }]
}

For sample_sheet fetch:

{
  "targets": [{
    "destination": {"type": "hdca"},
    "collection_type": "sample_sheet",
    "column_definitions": [{"type": "int", "name": "replicate", "optional": false}],
    "elements": [
      {"src": "url", "url": "...", "name": "sample1", "row": [42]}
    ]
  }]
}

Note: the fetch path takes a row list on each element, whereas the direct path takes a single rows dict on the payload, keyed by element name.
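The difference between the two shapes can be shown as a small conversion. rows_to_fetch_elements is a hypothetical helper, and the urls mapping is an assumed input for this example; it simply moves each entry of a direct-path rows dict onto the matching fetch element.

```python
def rows_to_fetch_elements(rows, urls):
    """Turn a direct-path `rows` dict into fetch elements carrying `row`.

    rows: {element_name: [col_values...]} as used by POST /api/dataset_collections
    urls: {element_name: url} giving the data source for each element (assumed)
    """
    return [
        {"src": "url", "url": urls[name], "name": name, "row": list(row)}
        for name, row in rows.items()
    ]


direct_rows = {"sample1": [1, "control"], "sample2": [2, "treatment"]}
source_urls = {
    "sample1": "https://example.org/sample1.fastq",
    "sample2": "https://example.org/sample2.fastq",
}
fetch_elements = rows_to_fetch_elements(direct_rows, source_urls)
```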


7. Testing

File: lib/galaxy_test/api/test_dataset_collections.py

Test Matrix

| Test | Collection Type | Creation Path | What's Tested |
|---|---|---|---|
| test_create_pair_from_history | paired | fetch | Basic pair creation, 2 elements |
| test_create_list_from_history | list | direct | Basic list creation, 3 elements |
| test_create_list_of_existing_pairs | list (of hdca) | direct | Nesting existing collections via src="hdca" |
| test_create_list_of_new_pairs | list:paired | direct | Nested creation via src="new_collection" |
| test_create_paried_or_unpaired | paired_or_unpaired | direct | Single-element unpaired |
| test_create_record | record | direct | Record with explicit fields |
| test_record_requires_fields | record | direct | 400 when fields missing |
| test_record_auto_fields | record | direct | fields="auto" |
| test_record_field_validation | record | direct | Wrong count / wrong names → 400 |
| test_sample_sheet_requires_columns | sample_sheet | direct | Columns on response elements |
| test_sample_sheet_column_definition_problems | sample_sheet | direct | Invalid column defs → 400 |
| test_sample_sheet_element_identifier_column_type | sample_sheet | direct | element_identifier column type |
| test_sample_sheet_validating_against_column_definition | sample_sheet | direct | Type mismatch + validator failure |
| test_sample_sheet_of_pairs_creation | sample_sheet:paired | direct | Nested sample sheet |
| test_sample_sheet_map_over_preserves_columns | sample_sheet | direct | Columns survive tool mapping |
| test_copy_sample_sheet_collection | sample_sheet | direct | Columns survive copy |
| test_upload_collection | list | fetch | File upload with tags |
| test_upload_nested | list:list | fetch | Nested fetch upload |
| test_upload_collection_from_url | list | fetch | URL-based upload |
| test_upload_collection_deferred | list | fetch | Deferred (lazy) upload |
| test_upload_flat_sample_sheet | sample_sheet | fetch | Sample sheet via fetch |
| test_upload_sample_sheet_paired | sample_sheet:paired | fetch | Nested sample sheet via fetch |
| test_enforces_unique_names | list | direct | Duplicate names → 400 |
| test_hda_security | paired | direct | Cannot use another user's HDA → 403 |

Test Helper: _check_create_response

Handles both direct and fetch responses:

def _check_create_response(self, create_response):
    self._assert_status_code_is(create_response, 200)
    dataset_collection = create_response.json()
    if "output_collections" in dataset_collection:
        # fetch response — follow up with GET
        dataset_collection = dataset_collection["output_collections"][0]
        dataset_collection = self._get(f"dataset_collections/{dataset_collection['id']}").json()
    self._assert_has_keys(dataset_collection, "elements", "url", "name", "collection_type", "element_count")
    return dataset_collection

Standalone helpers

def assert_one_collection_created_in_history(dataset_populator, history_id):
    """List history contents, assert exactly 1 collection, return its full details."""
    ...

def upload_flat_sample_sheet(dataset_populator):
    """Create a sample_sheet via fetch and validate its columns."""
    ...

8. Test Populator Helpers

File: lib/galaxy_test/base/populators.py

DatasetCollectionPopulator

Central test helper class. Key methods:

Identifier builders

| Method | Returns | Description |
|---|---|---|
| list_identifiers(history_id, contents) | [{name, src, id}...] | Creates N HDAs, returns element identifiers for a list |
| pair_identifiers(history_id, contents) | [{name:"forward",...}, {name:"reverse",...}] | Creates 2 HDAs, returns forward/reverse identifiers |
| nested_collection_identifiers(history_id, collection_type) | nested identifier tree | Recursively builds identifiers for list:paired etc. |

Payload builders

| Method | Description |
|---|---|
| create_list_payload(history_id, **kwds) | Builds payload for list creation (delegates to fetch or direct based on direct_upload kwarg) |
| create_pair_payload(history_id, **kwds) | Builds payload for pair creation |

Both call __create_payload() which dispatches:

  • direct_upload=True (default) → __create_payload_fetch() → builds a targets-based fetch payload
  • direct_upload=False → __create_payload_collection() → builds an element_identifiers-based direct payload

Collection creators

| Method | Description |
|---|---|
| create_list_in_history(history_id) | Creates a list in history |
| create_pair_in_history(history_id) | Creates a pair in history |
| create_list_of_pairs_in_history(history_id) | Creates list:paired via upload_collection() |
| create_list_of_list_in_history(history_id) | Creates list:list (or deeper) by chaining: first creates an inner list, then wraps it via create_nested_collection() |
| upload_collection(history_id, collection_type, elements) | Generic fetch-based upload |
| create_nested_collection(history_id, collection_type, collection) | Creates nested collection from existing HDCA IDs via src="hdca" |
| copy_collection(history_id, hdca_id) | Copies a collection via POST histories/{id}/contents/dataset_collections |

Dispatch logic — __create(payload)

def __create(self, payload, wait=False):
    if "targets" not in payload:
        return self._create_collection(payload)   # POST /api/dataset_collections
    else:
        return self.dataset_populator.fetch(payload)  # POST /api/tools/fetch

Sample sheet helpers

| Method | Description |
|---|---|
| download_workbook(collection_type, column_definitions) | Downloads XLSX workbook template |
| download_workbook_for_collection(hdca_id, column_definitions) | Downloads workbook for existing collection |
| parse_workbook(xlsx_content, collection_type, column_definitions) | Parses uploaded XLSX into rows |
| parse_workbook_for_collection(hdca_id, xlsx_content, column_definitions) | Parses XLSX against existing collection |

9. Validation Summary

Validation happens at multiple layers:

  1. Schema level (CreateNewCollectionPayload) — Pydantic type validation
  2. api_payload_to_create_params() — requires collection_type + element_identifiers
  3. validate_column_definitions() — validates sample sheet column defs against SampleSheetColumnDefinitionModel
  4. validate_input_element_identifiers() — no __object__ injection, names required, no duplicates, valid src
  5. Type plugin generate_elements() — type-specific validation:
    • record: field count/names must match elements
    • paired: expects forward/reverse
    • paired_or_unpaired: 1 or 2 elements only
    • sample_sheet: validates row data types against column definitions
  6. Security — user must own/have access to referenced HDAs; 403 otherwise

10. Key Files

| File | Purpose |
|---|---|
| lib/galaxy/webapps/galaxy/api/dataset_collections.py | FastAPI endpoint |
| lib/galaxy/webapps/galaxy/services/dataset_collections.py | Service layer |
| lib/galaxy/managers/collections.py | Core manager with create(), create_dataset_collection() |
| lib/galaxy/managers/collections_util.py | Payload parsing, element identifier validation |
| lib/galaxy/model/dataset_collections/registry.py | Type plugin registry |
| lib/galaxy/model/dataset_collections/type_description.py | CollectionTypeDescription: nested type parsing |
| lib/galaxy/model/dataset_collections/builder.py | build_collection(), set_collection_elements() |
| lib/galaxy/model/dataset_collections/types/__init__.py | BaseDatasetCollectionType abstract base |
| lib/galaxy/model/dataset_collections/types/list.py | List type plugin |
| lib/galaxy/model/dataset_collections/types/paired.py | Paired type plugin |
| lib/galaxy/model/dataset_collections/types/record.py | Record type plugin |
| lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py | PairedOrUnpaired type plugin |
| lib/galaxy/model/dataset_collections/types/sample_sheet.py | SampleSheet type plugin |
| lib/galaxy/model/dataset_collections/types/sample_sheet_util.py | Column definition & row validation |
| lib/galaxy/schema/schema.py | Pydantic models (CreateNewCollectionPayload, CollectionElementIdentifier, HDCADetailed) |
| lib/galaxy_test/api/test_dataset_collections.py | API tests |
| lib/galaxy_test/base/populators.py | DatasetCollectionPopulator test helper |
