Dashboard

Component Collections Records

Heterogeneous fixed-shape collection: CWL-derived `fields` schema of named typed slots; no implicit mapping

Raw
Revised:
2026-05-09
Revision:
1
Related Notes:
Component - Collection Models, Component - Collection API, Component - Collections - Sample Sheets Backend, Component - Collections - Paired or Unpaired

The record Collection Type in Galaxy

A technical reference for the backend implementation of record collections and their associated fields schema.


Table of Contents

  1. Introduction
  2. Data Model
  3. Type Plugin System
  4. Type System and Matching
  5. Tool Declaration and Execution
  6. API and Collection Creation
  7. Workflow Integration
  8. Field Schema Detail
  9. Collection Semantics Specification
  10. Testing Coverage
  11. Implementation Details
  12. Relationship to Other Collection Types
  13. Limitations and Future Work

1. Introduction

The Problem

Galaxy’s original collection types — list, paired, and (later) paired_or_unpaired — describe homogeneous structures: every element of a list is the same kind of dataset, and a paired collection always contains exactly the same two slots (forward, reverse). This works for the dominant biology pattern of “n samples of the same shape,” but it has no way to express a fixed, named, heterogeneous bundle of datasets where each slot has a different role.

Concrete cases:

  • A trio analysis with parent, mother, father BAMs.
  • A reference bundle with genome, gtf, index files of different formats.
  • A CWL tool that declares an InputRecordSchema with named fields.

Before records, these had to be passed as separate top-level inputs, losing the structural grouping and forcing tools or workflows to maintain the field/dataset correspondence externally.

What Records Solve

A record collection is a fixed-length, named-slot, heterogeneous collection: a collection whose shape is described by a schema of fields, where each field has a name (matching the element identifier) and a type (drawn from a CWL subset). The collection’s structure — count and ordering of slots — is fixed at creation by its fields schema.

This gives Galaxy a first-class representation of grouped, role-named datasets that mirrors the CWL record pattern.

How They Differ from Lists and Sample Sheets

Propertylistrecordsample_sheet
Element countArbitraryFixed by schemaArbitrary
Element identifiersUser-definedSchema-defined field namesUser-defined
Element homogeneityHomogeneousHeterogeneousHomogeneous
Schema column on collectionNonefields (JSON)column_definitions (JSON)
Per-element metadataNoneNonecolumns (per-element row)
Allow implicit mappingYesNoYes
Composable as innerYesYesNo
Composable as outerYesYesYes (always)

The defining feature is heterogeneity: a record’s elements are not interchangeable. That semantic shift drives the central design decision — records do not allow implicit mapping — because mapping a single-input tool over heterogeneous slots is not type-safe.

Historical Context

Records as an in-memory concept have existed since Galaxy’s CWL-conformance work (the earliest WIP commits date back several years). They became a persistable collection shape only in late 2024, when migration ec25b23b08e2_implement_fixed_length_collections.py (2024-12-09) added the fields JSON column to dataset_collection along with the adapter columns on job_to_input_* tables (the related “dataset adapters” feature). Sample sheets followed in 2025 (revision 3af58c192752), adding column_definitions alongside the existing fields column.


2. Data Model

Database Schema Changes

File: lib/galaxy/model/migrations/alembic/versions_gxy/ec25b23b08e2_implement_fixed_length_collections.py:31-37

def upgrade():
    with transaction():
        add_column(dataset_collection_table, Column("fields", JSONType(), default=None))
        add_column(job_to_input_dataset_table, Column("adapter", JSONType(), default=None))
        add_column(job_to_input_dataset_collection_table, Column("adapter", JSONType(), default=None))
        add_column(job_to_input_dataset_collection_element_table, Column("adapter", JSONType(), default=None))

fields is nullable; pre-existing collections are unaffected.

DatasetCollection: fields

File: lib/galaxy/model/__init__.py:7113-7116

# if collection_type is 'record' (heterogenous collection)
fields: Mapped[Optional[DATA_COLLECTION_FIELDS]] = mapped_column(JSONType)
# if collection_type is 'sample_sheet' (collection of rows that datasets with extra column metadata)
column_definitions: Mapped[Optional[SampleSheetColumnDefinitions]] = mapped_column(JSONType)

Type alias (lib/galaxy/model/__init__.py:284):

DATA_COLLECTION_FIELDS = list[dict[str, Any]]

Intentionally weakly typed at the model layer — the structural schema lives in lib/galaxy/tool_util_models/tool_source.py. The constructor accepts fields and assigns directly (__init__.py:7135-7150).

DatasetCollectionElement (Unchanged for Records)

Unlike sample sheets, records do not add a per-element column. The schema is entirely on the parent DatasetCollection.fields. Per-element identity comes from the standard element_identifier, with the constraint that:

  • The Nth element_identifier must equal fields[N]["name"].
  • The collection has exactly len(fields) elements.

FieldDict (Field Schema)

File: lib/galaxy/tool_util_models/tool_source.py:171-183

# For fields... just implementing a subset of CWL for Galaxy flavors of these objects so far.
CwlType = Literal["File", "null", "boolean", "int", "float", "string"]
FieldType = Union[CwlType, List[CwlType]]

@with_config(ConfigDict(extra="forbid"))
class FieldDict(TypedDict, closed=True):
    name: str
    type: FieldType
    format: NotRequired[Optional[str]]
  • type is a deliberate subset of CWL. In practice, the workflow editor surfaces only File and ["File", "null"] (optional file). The other CWL primitive types (int, float, string, boolean) are accepted by the model but not exposed in UI.
  • format is an optional Galaxy datatype hint, rarely used.
  • The TypedDict is closed (extra="forbid"), so unknown keys are rejected.

Storage Layout Example

A record with fields = [{"name": "parent", "type": "File"}, {"name": "child", "type": "File"}]:

DatasetCollection
  id: 100
  collection_type: "record"
  fields: [{"name":"parent","type":"File"},{"name":"child","type":"File"}]
  column_definitions: NULL

  DatasetCollectionElement
    element_identifier: "parent"   element_index: 0   hda_id: 501
  DatasetCollectionElement
    element_identifier: "child"    element_index: 1   hda_id: 502

A list:record collection has the fields schema on each inner record collection. The outer list has no schema.

A sample_sheet:record carries both schemas: column_definitions on the outer sample_sheet and fields on each inner record. See Section 12.


3. Type Plugin System

Plugin Registration

File: lib/galaxy/model/dataset_collections/registry.py:11-17

PLUGIN_CLASSES: list[type[BaseDatasetCollectionType]] = [
    ListDatasetCollectionType,
    paired.PairedDatasetCollectionType,
    record.RecordDatasetCollectionType,
    paired_or_unpaired.PairedOrUnpairedDatasetCollectionType,
    sample_sheet.SampleSheetDatasetCollectionType,
]

RecordDatasetCollectionType

File: lib/galaxy/model/dataset_collections/types/record.py

class RecordDatasetCollectionType(BaseDatasetCollectionType):
    """Arbitrary CWL-style record type."""

    collection_type = "record"

    def generate_elements(self, dataset_instances, **kwds):
        fields = kwds.get("fields", None)
        if fields is None:
            raise RequestParameterMissingException(
                "Missing or null parameter 'fields' required for record types.")
        if len(dataset_instances) != len(fields):
            self._validation_failed("Supplied element do not match fields.")
        index = 0
        for identifier, element in dataset_instances.items():
            field = fields[index]
            if field["name"] != identifier:
                self._validation_failed("Supplied element do not match fields.")
            # TODO: validate type and such.
            association = DatasetCollectionElement(
                element=element, element_identifier=identifier)
            yield association
            index += 1

    def prototype_elements(self, fields=None, **kwds):
        if fields is None:
            raise RequestParameterMissingException(
                "Missing or null parameter 'fields' required for record types.")
        for field in fields:
            name = field.get("name", None)
            assert name
            assert field.get("type", "File")  # NS: this assert doesn't make sense as it is
            field_dataset = DatasetCollectionElement(
                element=HistoryDatasetAssociation(),
                element_identifier=name,
            )
            yield field_dataset

Key behaviors:

  1. fields is mandatory. Missing → RequestParameterMissingException.
  2. Cardinality must match. len(dataset_instances) == len(fields).
  3. Positional identifier match. The Nth supplied element identifier must equal fields[N]["name"]. Order matters; reordering is rejected.
  4. Type is not validated at the model layer. The # TODO: validate type and such. comment marks unfinished work — a field declared type: "int" will silently accept any HDA.
  5. prototype_elements creates placeholder DatasetCollectionElement rows backed by empty HistoryDatasetAssociation instances, so an unpopulated record collection can be persisted before its datasets exist (used by workflow scheduling for known-shape outputs). The inline # NS: comment flags that the assert field.get("type", "File") line is a no-op (the default makes the assertion always truthy).

“auto” Fields

File: lib/galaxy/model/dataset_collections/builder.py:62-63, 81-89

if type.collection_type == "record" and fields == "auto":
    fields = guess_fields(dataset_instances)
...
def guess_fields(dataset_instances):
    fields: list[FieldDict] = []
    for identifier, element in dataset_instances.items():
        if isinstance(element, DatasetCollection):
            return []
        else:
            fields.append({"type": "File", "name": identifier})
    return fields

Passing the literal string "auto" for fields synthesizes a flat [{type: "File", name: <id>}] schema from the supplied element identifiers. Exercised by test_record_auto_fields (lib/galaxy_test/api/test_dataset_collections.py:228).


4. Type System and Matching

Type Validation Regex

File: lib/galaxy/model/dataset_collections/type_description.py:15-17

COLLECTION_TYPE_REGEX = re.compile(
    r"^((list|paired|paired_or_unpaired|record)(:(list|paired|paired_or_unpaired|record))*"
    r"|sample_sheet|sample_sheet:paired|sample_sheet:record|sample_sheet:paired_or_unpaired)$"
)

Implications for record:

  • record may appear at any position in the standard :-separated rank chain. Valid: record, list:record, record:list, list:record:paired, sample_sheet:record.
  • Unlike sample_sheet, record is not restricted to the outermost rank.
  • record:record is regex-permitted but is not exercised in tests or surfaced in the workflow editor UI.

CollectionTypeDescription Carries fields

File: lib/galaxy/model/dataset_collections/type_description.py:26-49

def for_collection_type(self, collection_type, fields=None):
    return CollectionTypeDescription(collection_type, self, fields=fields)

class CollectionTypeDescription:
    def __init__(self, collection_type, factory, fields=None):
        ...
        self.fields = fields

The schema travels with the type description so workflow planning code can introspect record structure when emitting prototype output collections.

allow_implicit_mapping — The Central Asymmetry

File: lib/galaxy/model/__init__.py:7521-7523

@property
def allow_implicit_mapping(self):
    return self.collection_type != "record"

This single line drives much of the record’s distinct behavior. Records cannot be mapped over. A tool that consumes a single dataset cannot accept a record on its data input slot to produce an implicit output collection of records, because:

  • Record elements are heterogeneous by design (different field roles, possibly different formats).
  • There is no semantic guarantee the same tool can be applied to every slot.
  • An implicit output keyed by the same field names cannot be guaranteed type-correct.

Records must be consumed explicitly by a tool whose input declares collection_type="record", with the tool unpacking fields by name (e.g., $input.parent, $input['child']).

Matching

type_description.py:146-165 does no special-casing for record. Record matching is identity-only: record matches record, list:record matches list:record. Records do not interconvert with list, paired, or sample_sheet for matching purposes (with one exception in the sample_sheet:record case discussed in Section 12).

Composition Rules (UI-Surfaced)

Although the regex permits broader composition, the workflow editor (client/src/components/Workflow/Editor/Forms/FormCollectionType.vue:29-38) surfaces only:

  • record
  • list:record
  • sample_sheet:record

Other regex-valid combinations (record:list, record:paired) are reachable via API and YAML but are not first-class in the editor.


5. Tool Declaration and Execution

Tool Input Declaration (XML)

File: test/functional/tools/collection_record_test_two_files.xml:7

<param name="f1" type="data_collection" collection_type="record" label="Input collection" />

Tool Input Declaration (YAML)

File: test/functional/tools/parameters/gx_data_collection_record.yml:10-14

inputs:
  - name: parameter
    type: data_collection
    collection_type: record
    label: Input record collection

A union-type-style declaration is supported via comma: collection_type: "list,record" (gx_data_collection_list_or_record.yml:13) — accept either a list or a record.

DataCollectionToolParameter

File: lib/galaxy/tools/parameters/basic.py:2559-2560, 2680-2681

self._fields = input_source.get("fields", None)
self._column_definitions = input_source.get("column_definitions", None)
...
d["fields"] = self._fields
d["column_definitions"] = self._column_definitions

Declared fields flow through to the workflow editor as a structural hint: the editor knows what shape of record the tool consumes. Note: there is no record analogue of the sample-sheet column_definitions_compatible matcher (basic.py:2585-2589); record collections are filtered only by collection_type matching.

Runtime Wrapper

DatasetCollectionWrapper (lib/galaxy/tools/wrappers.py:643-707) provides field access via the standard element-identifier dictionary path:

<command>
  cat $f1.parent $f1['child'] >> $out1;
  echo 'Collection name: $f1.name'
</command>

The wrapper exposes:

  • $f1.<identifier> (attribute access) and $f1['<identifier>'] (item access) for each field’s dataset
  • $f1.name for the collection name
  • $f1.keys() to iterate identifiers

There is no record-specific wrapper method; records reuse the generic identifier-keyed access path. (Compare to sample sheets, which add a dedicated sample_sheet_row() method.)

Built-In Record Tools

There is no built-in record tool analogous to __SAMPLE_SHEET_TO_TABULAR__. Records are constructed via the generic POST /api/dataset_collections endpoint or via workflow inputs; there is no record-specific fetch helper, no record-targeted rule operation, and no record-to-tabular converter.

Test Tools Using Records

  • test/functional/tools/collection_record_test_two_files.xml — XML record consumer
  • test/functional/tools/parameters/gx_data_collection_record.yml — YAML record consumer
  • test/functional/tools/parameters/gx_data_collection_list_or_record.ymllist,record union

6. API and Collection Creation

Direct Collection Creation API

Endpoint: POST /api/dataset_collections

Schema: lib/galaxy/schema/schema.py:1830-1834

fields_: Optional[Union[str, list[FieldDict]]] = Field(
    default=[],
    description="List of fields to create for this collection. Set to 'auto' to guess fields from identifiers.",
    alias="fields",
)

The payload accepts either an explicit list[FieldDict] or the literal string "auto".

Manager Flow

File: lib/galaxy/managers/collections_util.py:36-49

column_definitions = payload.get("column_definitions", None)
validate_column_definitions(column_definitions)

params = dict(
    collection_type=payload.get("collection_type"),
    element_identifiers=payload.get("element_identifiers"),
    name=payload.get("name", None),
    ...
    fields=payload.get("fields", None),
    column_definitions=column_definitions,
    rows=payload.get("rows", None),
)

Note: there is no validate_fields() analogue to validate_column_definitions(). Field shape is enforced only at two points:

  1. Pydantic-level coercion against the closed FieldDict TypedDict when the request lands.
  2. Inside RecordDatasetCollectionType.generate_elements() — which checks count and positional name match only.

This is a thinner validation pipeline than sample sheets, and is one of the larger gaps in the record implementation.

Manager (lib/galaxy/managers/collections.py:199, 226, 319, 331, 360):

def create(self, ..., fields=None, column_definitions=None, rows=None):
    ...
    dataset_collection = self.create_dataset_collection(..., fields=fields, ...)

def create_dataset_collection(self, ..., fields=None, ...):
    collection_type_description = self.collection_type_descriptions.for_collection_type(
        collection_type, fields=fields)
    ...
    builder.build_collection(
        type_plugin, elements, fields=fields,
        column_definitions=column_definitions, rows=rows)

Builder

File: lib/galaxy/model/dataset_collections/builder.py:27-46

def build_collection(type, dataset_instances, collection=None,
                     associated_identifiers=None,
                     fields=None, column_definitions=None, rows=None):
    dataset_collection = collection or DatasetCollection(
        fields=fields, column_definitions=column_definitions)
    set_collection_elements(dataset_collection, type, dataset_instances,
                            associated_identifiers, fields=fields, rows=rows)
    return dataset_collection

set_collection_elements (builder.py:49-78) handles the "auto" sentinel for record only, and forwards fields into type.generate_elements().

Sample Payload

File: lib/galaxy_test/api/test_dataset_collections.py:180-209 (test_create_record)

payload = dict(
    name="a record",
    instance_type="history",
    history_id=history_id,
    element_identifiers=record_identifiers,
    collection_type="record",
    fields=[
        {"name": "condition", "type": "File"},
        {"name": "control1", "type": "File"},
        {"name": "control2", "type": "File"},
    ],
)

7. Workflow Integration

Workflow Input Module

File: lib/galaxy/workflow/modules.py:1180-1260

InputCollectionModule handles fields symmetrically with column_definitions.

Validation (modules.py:1189-1200):

def validate_state(self, state):
    collection_type = state.get("collection_type")
    fields = state.get("fields")
    if collection_type:
        collection_type_description = COLLECTION_TYPE_DESCRIPTION_FACTORY.for_collection_type(
            collection_type, fields=fields)
        collection_type_description.validate()
    if column_definitions := state.get("column_definitions"):
        validate_column_definitions(column_definitions)

Runtime input synthesis (modules.py:1215-1218):

if "column_definitions" in parameter_def:
    collection_param_source["column_definitions"] = parameter_def["column_definitions"]
if "fields" in parameter_def:
    collection_param_source["fields"] = parameter_def["fields"]

State parsing (modules.py:1253-1259):

if "fields" in inputs:
    fields = inputs["fields"]
else:
    fields = None
state_as_dict["fields"] = fields
state_as_dict["column_definitions"] = column_definitions

Workflow YAML Format

inputs:
  my_record:
    type: collection
    collection_type: record
    fields:
      - name: parent
        type: File
      - name: child
        type: File

Workflow Editor (Client-Side)

  • client/src/components/Workflow/Editor/Forms/FormCollectionType.vue:29-38 — exposes record, list:record, sample_sheet:record as selectable collection types.
  • client/src/components/Workflow/Editor/Forms/FormInputCollection.vue:81-87, 160 — when the chosen type is record-bearing, renders <FormRecordFieldDefinitions>; clears fields when a non-record type is chosen.
  • client/src/components/Workflow/Editor/Forms/FormRecordFieldDefinitions.vue — repeatable field list.
  • client/src/components/Workflow/Editor/Forms/FormRecordFieldDefinition.vue — a single field row (name + type), composing FormFieldType.
  • client/src/components/Workflow/Editor/Forms/FormFieldType.vue:40-43 — the editor exposes only File and File? (optional file). Other CwlType values from FieldDict (int, float, string, boolean) are accepted by the model but not user-selectable in UI.

8. Field Schema Detail

PropertyTypeNotes
namestrMust equal the element_identifier; positional
typeCwlType or List[CwlType]Subset of CWL; ["File","null"] indicates an optional file
formatOptional[str]Galaxy datatype hint; rarely used

CWL alignment: deliberately a subset of CWL record / InputRecordSchema. Galaxy implements only the structural shape needed for typed pipelines:

  • Single-typed fields per slot; no nested array or record types in the schema column.
  • No doc, label, or secondaryFiles exposed in the schema row (secondary files are handled separately at the dataset level).
  • File format is Galaxy-native, not CWL-native.

The FieldDict TypedDict is closed (extra="forbid" via with_config), so unknown keys in the JSON are rejected at the Pydantic boundary.


9. Collection Semantics Specification

File: lib/galaxy/model/dataset_collections/types/collection_semantics.yml

Records have no entries in the formal mapping/reduction specification. The doc-block on sample sheets explicitly contrasts mapping behavior with lists; records are absent because the central rule (allow_implicit_mapping = False) precludes most of the patterns the YAML describes:

  • Mapping over data inputs — not allowed for records.
  • Sub-collection mapping — not exercised for records.
  • Reduction by multiple=true — not exercised for records.

Implicit consequences (not codified in the YAML, but enforced by behavior):

  • A record cannot be mapped over a single-dataset tool input.
  • A list:record cannot be subcollection-mapped over a record-typed input the way list:paired maps over paired. There is no special-casing in effective_collection_type and no test for this pattern.
  • Records can only feed tools that explicitly declare collection_type="record" (or "list,record" etc.).

Adding record entries to collection_semantics.yml would clarify these rules and is identified as future work.


10. Testing Coverage

API Tests — lib/galaxy_test/api/test_dataset_collections.py

  • test_create_record (line 180) — canonical happy path
  • test_record_requires_fields (211) — 400 when fields is missing
  • test_record_auto_fields (228) — fields="auto" synthesizes a schema from identifiers
  • test_record_field_validation (246) — rejects too-few, too-many, mismatched-name field lists

Workflow API Tests

There are no record-specific tests in lib/galaxy_test/api/test_workflows.py. Record behavior in workflows is covered indirectly through shared collection paths.

Unit Tests

test/unit/data/dataset_collections/ contains no test_record* files. Record behavior is covered indirectly by test_type_descriptions.py (regex acceptance) and the API integration tests above. There is no validate_fields() unit test because no such validator exists.

Selenium Tests

No Selenium tests target the record creator or editor. Compare to sample sheets, which have dedicated test_collection_input_sample_sheet_chipseq_example* end-to-end tests. Records currently lack UI-level test coverage.

Test Tools

  • test/functional/tools/collection_record_test_two_files.xml — in-tool framework test, asserts on output content and stdout
  • test/functional/tools/parameters/gx_data_collection_record.yml — parameter-modeling test
  • test/functional/tools/parameters/gx_data_collection_list_or_record.ymllist,record union test

11. Implementation Details

Key Code Paths

Collection Creation (Direct API)

  1. lib/galaxy/webapps/galaxy/api/dataset_collections.py — API endpoint receives payload
  2. lib/galaxy/managers/collections_util.py:36-49 — extracts fields (no validator pre-pass)
  3. lib/galaxy/managers/collections.py:199-226DatasetCollectionManager.create() passes fields through
  4. lib/galaxy/managers/collections.py:319, 331, 360create_dataset_collection() resolves elements and dispatches to builder
  5. lib/galaxy/model/dataset_collections/builder.py:27-46 — builds DatasetCollection(fields=...) and calls set_collection_elements()
  6. lib/galaxy/model/dataset_collections/builder.py:62-63, 81-89 — handles "auto" sentinel for records
  7. lib/galaxy/model/dataset_collections/types/record.pygenerate_elements() validates count and positional name match, yields elements

Tool Execution with Records

  1. lib/galaxy/tools/parameters/basic.py:2559-2560, 2680-2681DataCollectionToolParameter reads and serializes fields
  2. lib/galaxy/tools/wrappers.py:643-707DatasetCollectionWrapper exposes elements via attribute and item access (no record-specific method)
  3. Tool command templates access fields by identifier: $input.parent, $input['child']

Workflow Input

  1. lib/galaxy/workflow/modules.py:1189-1200validate_state() builds a CollectionTypeDescription carrying fields
  2. lib/galaxy/workflow/modules.py:1215-1218 — propagates fields into the runtime parameter source
  3. lib/galaxy/workflow/modules.py:1253-1259 — preserves fields in serialized state

Validation Surface

Record validation is count-and-name only. The # TODO: validate type and such. comment in record.py marks the unfinished work of validating that a supplied dataset matches the field’s declared type (e.g., a File-typed field receiving a File-shaped element, or ["File","null"] permitting a missing slot).

There is no safe-validators allowlist (records do not accept user-supplied validators). There are no special-character restrictions on field names beyond the TypedDict shape check. Compare to sample-sheet column definitions, which run through validate_column_definitions() with explicit type, validator, and special-character checks.


12. Relationship to Other Collection Types

Comparison Matrix

Featurelistpairedpaired_or_unpairedrecordsample_sheet
Plugin filetypes/list.pytypes/paired.pytypes/paired_or_unpaired.pytypes/record.pytypes/sample_sheet.py
Element countArbitraryExactly 21 or 2Fixed by fieldsArbitrary
Fixed identifiersNoforward/reverseunpaired or forward/reverseField namesNo
Schema columnNoneNoneNonefieldscolumn_definitions
Per-element metadataNoneNoneNoneNonecolumns
allow_implicit_mappingTrueTrueTrueFalseTrue
prototype_elements()NoYesNoYesNo
"auto" builderNoNoNoYesNo
Can be inner typeYesYesYesYesNo
Can be outermostYesYesYesYesYes (always)

fields vs column_definitions

fieldscolumn_definitions
Used byrecordsample_sheet (and variants)
Stored onDatasetCollection.fieldsDatasetCollection.column_definitions
Element-side counterpartNone — identifier-positionalDatasetCollectionElement.columns
Schema describesNamed, typed, fixed slots (heterogeneity)Per-element metadata columns (homogeneity + sidecar)
Element countFixed by schema lengthArbitrary
Validators / restrictionsNot implementedvalidate_column_definitions, safe-validators allowlist
"auto" builderYesNo
Special-char checksNoneColumn names + string values

sample_sheet:record Composition

The two schemas are independent and coexist in a sample_sheet:record collection:

  • The outer DatasetCollection (rank sample_sheet) carries column_definitions. Outer DatasetCollectionElement rows carry columns (the per-record row of metadata values).
  • Each outer element’s child_collection is a record carrying its own fields schema.

This is the only “compound metadata” collection in the system. A column_definitions change has no effect on fields and vice versa. The regex and effective_collection_type treat sample_sheet:record as a sample_sheet of records (outer rank sample_sheet, inner record); _normalize_collection_type (type_description.py:216-223) normalizes sample_sheetlist for matching purposes, so a sample_sheet:record matches list:record for input-compatibility checks.

Implicit Mapping at the Outer Level

allow_implicit_mapping checks the outer collection_type only:

return self.collection_type != "record"

So a sample_sheet:record does allow implicit mapping (its outer collection_type string is "sample_sheet:record", not "record"). This is consistent: subcollection-mapping a sample_sheet:record over a record-typed input is the meaningful case, where each inner record is consumed as a unit. The effective_collection_type machinery strips the :record suffix, leaving sample_sheet as the implicit output shape.

A bare record (no outer wrapper), however, has collection_type == "record", and the rule fires: no implicit mapping.

Relationship to list

list:record is the canonical “n samples, each a heterogeneous bundle” pattern. The outer list is mappable; the inner record is consumed as a unit by tools declaring collection_type="record". This is structurally analogous to list:paired, with record filling the same role as paired (a fixed-shape inner unit).

Relationship to sample_sheet

The two schemas solve different problems and are not interchangeable:

  • Records describe structural heterogeneity: each slot is a different role with potentially a different format. Element count is fixed. No implicit mapping.
  • Sample sheets describe per-element metadata over homogeneous elements. Element count is arbitrary. Implicit mapping allowed.

When you need both — per-record metadata over heterogeneous bundles — compose them as sample_sheet:record.


13. Limitations and Future Work

Current Limitations

  1. No type-level field validation at element creation. RecordDatasetCollectionType.generate_elements() carries a # TODO: validate type and such. comment. A field declared type: "int" will silently accept a File-shaped element. Only count and positional name match are checked.

  2. No validate_fields() API-side pre-pass. Field shape is enforced only at the Pydantic boundary (closed TypedDict) and at element-yield time. There is no analogue to validate_column_definitions() running through a dedicated validator.

  3. prototype_elements field-type assert is a no-op. The line assert field.get("type", "File") always passes because "File" is truthy. Source-flagged with # NS: this assert doesn't make sense as it is. Either remove the assert or replace it with a real type check.

  4. Order-sensitive identifier match. The Nth supplied element identifier must equal fields[N]["name"]. Reordering produces _validation_failed. Sample sheets are order-insensitive on element identifier (they validate via rows[identifier] lookup). A dict-style match would be more ergonomic.

  5. Editor exposes only File / File?. Although FieldDict.type accepts the full CWL primitive subset (File, null, boolean, int, float, string), the workflow editor’s FormFieldType.vue only offers File and Optional File. Other types are reachable only via API/YAML.

  6. No record-specific built-in tools. Records have no analogue to __SAMPLE_SHEET_TO_TABULAR__ — there is no built-in way to extract records to tabular form, no record-aware fetch helper, and no record-targeted rule operations.

  7. No Selenium coverage. The record creation UI and record-bearing workflow runs are not exercised end-to-end at the Selenium level. Sample sheets, by contrast, have full Selenium coverage of the wizard and AG Grid editing flow.

  8. No collection_semantics.yml entries. The formal type-system specification is silent on records. Documenting the (mostly negative) rules would clarify the contract for tool authors and enable automated test generation.

  9. No workflow API tests. lib/galaxy_test/api/test_workflows.py has no record-specific tests. Record-bearing workflow inputs are covered only by shared collection paths.

  10. Weak schema column type. DATA_COLLECTION_FIELDS = list[dict[str, Any]] at the model layer trades safety for flexibility. Strengthening to a list[FieldDict] Pydantic-validated column would push more checking to the model boundary.

Areas for Improvement

  1. Field-level type validation at create time — close the # TODO: validate type and such. gap. Validate that supplied datasets match field-declared types, including optional (["File","null"]) handling.

  2. Workflow editor support for richer field types — expose the full CWL primitive subset, not just File/File?, so authors can declare typed parameter records.

  3. collection_semantics.yml record entries — document the absence of mapping rules explicitly, plus the positive rules (subcollection-consumption inside list:record, sample_sheet:record outer mapping behavior).

  4. Selenium coverage — end-to-end tests for record creation, record-bearing workflow runs, and record consumption.

  5. Record-aware built-in tooling — a __RECORD_TO_TABULAR__ or __RECORD_FIELD_EXTRACT__ helper analogous to the sample-sheet tabular tool would make records easier to inspect and convert.

  6. Auto-fields for nested casesguess_fields() returns [] for collections containing inner DatasetCollection elements. A richer auto-detection that handles list:record or sample_sheet:record cases would smooth API ergonomics.

  7. Cross-references and constraint validators — analogous to the sample-sheet element_identifier column type, records could grow validators for constraints between fields (e.g., “field index must reference field genome”).

  8. Weakening positional-order match — accept dataset_instances as a dict keyed by identifier and validate against fields by name, decoupling supply order from schema order.

Future Plans

The roadmap for richer per-element metadata and column-typed validation is described in detail in Component - Collections - Sample Sheets Backend. Several of the directions there — richer column types, column metadata propagation, cross-collection references, formal collection-semantics entries — have direct analogues in the record space (typed slots, slot metadata propagation through tool execution, cross-record references, record semantics in collection_semantics.yml). Future record work should be planned alongside that effort to keep the two schemas (fields and column_definitions) evolving coherently.


Cross-References

  • lib/galaxy/model/dataset_collections/types/record.py — type plugin
  • lib/galaxy/model/dataset_collections/registry.py:11-17 — registration
  • lib/galaxy/model/dataset_collections/type_description.py:15-17 — regex
  • lib/galaxy/model/dataset_collections/builder.py:27-46, 62-63, 81-89 — builder + auto fields
  • lib/galaxy/model/__init__.py:284, 7113-7150, 7521-7523 — model column, ctor, mapping flag
  • lib/galaxy/model/migrations/alembic/versions_gxy/ec25b23b08e2_implement_fixed_length_collections.py — schema migration
  • lib/galaxy/tool_util_models/tool_source.py:171-183FieldDict / CwlType
  • lib/galaxy/schema/schema.py:1830-1834 — API payload schema
  • lib/galaxy/managers/collections.py:199-360, collections_util.py:36-49 — manager
  • lib/galaxy/tools/parameters/basic.py:2559-2560, 2680-2681 — tool parameter
  • lib/galaxy/tools/wrappers.py:643-707 — runtime wrapper
  • lib/galaxy/workflow/modules.py:1180-1260 — workflow input module
  • lib/galaxy_test/api/test_dataset_collections.py:180-278 — API tests
  • test/functional/tools/collection_record_test_two_files.xml, parameters/gx_data_collection_record.yml, parameters/gx_data_collection_list_or_record.yml — test tools
  • client/src/components/Workflow/Editor/Forms/FormCollectionType.vue, FormInputCollection.vue, FormRecordFieldDefinitions.vue, FormRecordFieldDefinition.vue, FormFieldType.vue — editor UI

Incoming References (4)