nf-schema sample sheet validation gaps in Galaxy

nf-schema validation mapped to Galaxy column_definitions: what survives, degrades, or is lost; Galaxy work items + cast loss-recording vocabulary.

Revised 2026-05-06 · Rev 1 · component

Use this note when nextflow-summary-to-galaxy-interface casts a Nextflow sample-sheet input to a Galaxy sample_sheet* collection input. It enumerates which nf-schema validation features survive, which degrade, and which are wholly lost; flags concrete Galaxy code paths to extend; and gives the cast Mold a vocabulary for recording validation losses.

Evidence quality:

  • Corpus-observed (CO) — pinned fixtures under workflow-fixtures/pipelines/.
  • Galaxy source (GS) — file paths in lib/galaxy/... from the dev branch.
  • gxformat2 source (FS) — gxformat2/....
  • External-doc (ED) — nf-schema spec + nf-core component docs.
  • Design inference (DI) — clearly marked.

TL;DR

  • nf-schema sample sheets carry four classes of constraint: per-cell type/pattern, per-cell path semantics (format, exists, mimetype), per-row dependent-requiredness (dependentRequired, anyOf), and cross-row uniqueness (uniqueEntries). Galaxy column_definitions natively covers only the first class, partially covers pattern/enum, and has zero expression for the per-row and cross-row classes.
  • Galaxy’s safe-validator allowlist has exactly three entries: regex, in_range, length (GS lib/galaxy/tool_util_models/parameter_validators.py:469-476). Anything richer (file existence, mimetype, conditional required, uniqueness) cannot be encoded and must be downgraded to prose, dropped, or promoted out of the sample sheet.
  • The [\w\-_ ?]* charset gate (GS lib/galaxy/model/dataset_collections/types/sample_sheet_util.py:117) rejects column values containing characters such as ; . : , and /. Concrete blocker found in corpus: the nf-core/taxprofiler db_type default "short;long" cannot be stored.
  • Dataset-typed columns are validated only for presence and element-identifier shape — Galaxy does not sniff the dataset against a format, does not check exists, does not enforce a path regex against the dataset’s filename. Datatype filtering happens at upload, not at sample-sheet construction.
  • gxformat2 has no formal column_definitions declaration in schema/v19_09/workflow.yml. Column metadata round-trips as additional state via FS gxformat2/normalized/_conversion.py:524-540. Adding strong validator support will likely require a gxformat2 schema rev and a Galaxy state-persistence rev.
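
The charset finding can be reproduced in isolation. The following is a minimal sketch, assuming the gate behaves like a full match of the value against the character class quoted above; `passes_value_gate` is a hypothetical name and not Galaxy's own function.

```python
import re

# Character class quoted in this note; assumed to apply to the whole value.
VALUE_GATE = re.compile(r"[\w\-_ ?]*")


def passes_value_gate(value: str) -> bool:
    """Return True when every character of value is inside the quoted class."""
    return VALUE_GATE.fullmatch(value) is not None


for candidate in ["short", "short_long", "short;long", "--threshold 0.5"]:
    print(f"{candidate!r}: {passes_value_gate(candidate)}")
```

Under this assumption the two nf-core-style values fail because of `;` and `.` respectively, while the rewritten `short_long` passes.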

nf-schema validation feature inventory

Per-cell (column-property) features

| Feature | What it enforces | Class | Evidence |
| --- | --- | --- | --- |
| type: string, integer, number, or boolean | scalar coercion | per-cell | CO nf-core__rnaseq/assets/schema_input.json:11 |
| type: ["string","integer"] (union) | accept either | per-cell | CO nf-core__taxprofiler/assets/schema_input.json:11 |
| type: array / type: object (nested) | nested structure within a cell | per-cell | ED nf-schema spec; rare in nf-core sample sheets |
| pattern (regex) | string regex | per-cell | CO nf-core__rnaseq/...:12 (^\S+$), :20 (path-pattern fastq) |
| enum | closed value set | per-cell | CO nf-core__rnaseq/...:33 (forward/reverse/...), nf-core__sarek/...:25 (XX/XY/NA), nf-core__taxprofiler/schema_database.json:12-27 (15 tools) |
| format: file-path | string is path-to-file | per-cell, nf-schema custom | CO nf-core__rnaseq/...:18; ED |
| format: directory-path | path-to-directory | per-cell, nf-schema custom | ED |
| format: path | file or directory | per-cell, nf-schema custom | CO nf-core__taxprofiler/schema_database.json:52 (db_path) |
| format: file-path-pattern | glob | per-cell, nf-schema custom | ED |
| format: email / uri / date etc. | standard JSON Schema formats | per-cell | ED |
| exists: true | path resolves on disk | per-cell, nf-schema custom | CO nf-core__rnaseq/...:19; CO nf-core__sarek/...:46-110 |
| mimetype | MIME for file at path | per-cell, nf-schema custom | ED |
| minimum/maximum | numeric bounds | per-cell | CO nf-core__rnaseq/...:64-65 (percent_mapped 0..100) |
| exclusiveMinimum/exclusiveMaximum | strict numeric bounds | per-cell | ED |
| multipleOf | divisibility | per-cell | ED, rare in samplesheets |
| minLength/maxLength | string length | per-cell | ED |
| default | default if cell omitted | per-cell | CO nf-core__sarek/...:27 (sex default NA), :33 (status default 0), nf-core__taxprofiler/schema_database.json:46 (db_type default short;long) |
| description / help_text | UI prose | identifier | universal |
| errorMessage | message override on failure | per-cell binding | CO nf-core__rnaseq/...:13,21,28... (every column) |
| deprecated: true | warn/error on use | per-cell | ED |
| hidden: true | UI hint | per-cell | ED |
| fa_icon | UI icon | identifier | ED |
| meta: ["id"] or "id" | column is a Nextflow meta-map field | identifier (channel-shaping) | CO nf-core__rnaseq/...:14, nf-core__sarek/...:15-34 |

Per-row features

| Feature | What it enforces | Evidence |
| --- | --- | --- |
| required: ["sample","fastq_1","strandedness"] | listed columns must be non-null | CO nf-core__rnaseq/...:70, nf-core__sarek/...:137, nf-core__taxprofiler/...:60 |
| dependentRequired: {"fastq_2": ["fastq_1"]} | column A present → column B required | CO nf-core__sarek/...:133-136 (R2 implies R1; spring_2 implies spring_1) |
| anyOf: [{dependentRequired: {...}}, ...] | at least one dependentRequired branch must hold (e.g. lane requires one of fastq_1/spring_1/bam) | CO nf-core__sarek/...:122-132 |
| oneOf / allOf of object schemas | exactly-one / all branches | CO nf-core__taxprofiler/schema_input.json:62-67 uses allOf for cross-row uniqueness |
| if/then/else (JSON Schema 2019-09+) | conditional required by branch | ED |
| Object-level pattern / property-name patterns | rare | ED |

Cross-row features

| Feature | What it enforces | Evidence |
| --- | --- | --- |
| uniqueEntries: ["lane","patient","sample"] | tuple unique across rows | CO nf-core__sarek/...:138 |
| uniqueEntries: ["fastq_1"] | column unique across rows | CO nf-core__taxprofiler/schema_input.json:63-65 |
| uniqueEntries: ["tool","db_name"] | composite key unique | CO nf-core__taxprofiler/schema_database.json:58 |
| uniqueItems: true (array level) | full-row uniqueness | ED |

Identifier / non-validation keywords

| Feature | Effect | Notes |
| --- | --- | --- |
| meta: ["sample"] / meta: "id" | column joins channel meta-map, not the data tuple | shapes samplesheetToList output (ED); orthogonal to validation |
| schema: assets/schema_input.json (on a param) | declares the file is itself a sample sheet to validate | CO nf-core__rnaseq/nextflow_schema.json (input param) |
| errorMessage | UX-only, not a validation primitive | universal in nf-core |
| description, help_text, fa_icon, hidden, deprecated | UI/help; preserve as description only | |

Params vs samplesheet vocabulary

ED docs/nextflow_schema/sample_sheet_schema_specification.md: most keys are shared. Divergences:

  • meta is samplesheet-only.
  • uniqueEntries is samplesheet-only.
  • hidden, fa_icon, help_text are mostly params-only.

samplesheetToList(file, schema) materializes the sheet as a list of [meta_map, data_field_1, ...] tuples (ED). Schema property order, not CSV column order, is the source of truth for tuple order. Meta-annotated columns collapse into the leading map.
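
The ordering rule can be illustrated with a small Python model. This is a sketch of the behaviour described above and not nf-schema's implementation: `row_to_tuple` is a hypothetical helper, dict insertion order stands in for schema property order, and missing cells fall back to `default` or `None` as a simplification.

```python
from typing import Any


def row_to_tuple(row: dict[str, Any], properties: dict[str, dict]) -> list[Any]:
    """Shape one parsed row as [meta_map, data_field_1, ...]."""
    meta: dict[str, Any] = {}
    data: list[Any] = []
    # Iterate in schema property order, never in CSV column order.
    for column, prop in properties.items():
        value = row.get(column, prop.get("default"))
        meta_keys = prop.get("meta")
        if meta_keys is None:
            data.append(value)
            continue
        # `meta` may be a single key or a list of keys.
        for key in [meta_keys] if isinstance(meta_keys, str) else meta_keys:
            meta[key] = value
    return [meta, *data]


properties = {
    "sample": {"type": "string", "meta": ["id"]},
    "fastq_1": {"type": "string", "format": "file-path"},
    "fastq_2": {"type": "string", "format": "file-path"},
    "strandedness": {"type": "string", "meta": ["strandedness"]},
}
# The CSV column order differs from the schema order on purpose.
csv_row = {
    "strandedness": "auto",
    "fastq_2": "s1_R2.fq.gz",
    "fastq_1": "s1_R1.fq.gz",
    "sample": "s1",
}
result = row_to_tuple(csv_row, properties)
print(result)
```

Both meta-annotated columns land in the leading map, and the two path columns follow in schema order regardless of how the CSV was laid out.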

Galaxy column_definitions capability inventory

Column definition vocabulary

SampleSheetColumnDefinition (GS lib/galaxy/tool_util_models/sample_sheet.py:39-47):

| Field | Type | Notes |
| --- | --- | --- |
| name | str | gated by [\w\-_ ?]* (GS sample_sheet_util.py:117) — no ., /, :, ,, quotes |
| description | str? | free text |
| type | "string", "int", "float", "boolean", or "element_identifier" | closed; no data, path, email, array/object, no `int |
| optional | bool | required by schema |
| default_value | scalar? | type-checked against type (GS sample_sheet_util.py:42-56) |
| validators | AnySafeValidatorModel[]? | allowlist of three (see below) |
| restrictions | scalar[]? | maps to nf-schema enum |
| suggestions | scalar[]? | UI dropdown hint, non-binding |

Allowed validators

GS lib/galaxy/tool_util_models/parameter_validators.py:469-476 — three discriminated subclasses tagged _safe = True:

| Validator | Fields | nf-schema analogue |
| --- | --- | --- |
| regex | expression, negate | pattern (caveat: regex.match not fullmatch — GS parameter_validators.py:182-187) |
| in_range | min, max, exclude_min, exclude_max, negate | minimum/maximum/exclusiveMinimum/exclusiveMaximum |
| length | min, max, negate | minLength/maxLength |

expression (Python eval) and all dataset-aware validators (metadata, dataset_metadata_in_data_table, dataset_ok, empty_dataset, …) are excluded (_safe: False or absent).
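
The anchoring caveat on the regex validator is plain Python `re` behaviour and can be shown without Galaxy. The sketch below assumes the validator applies `re.match` as reported above; `anchor_fully` is a hypothetical helper that a cast step could use to recover full-match semantics.

```python
import re


def anchor_fully(pattern: str) -> str:
    """Wrap a pattern so that re.match behaves like re.fullmatch."""
    # The non-capturing group keeps top-level alternations such as "a|b" intact;
    # a bare trailing "$" would bind only to the last alternative.
    return rf"(?:{pattern})\Z"


pattern = r"\S+\.bam"
value = "sample.bam.bai"

prefix_match = re.match(pattern, value) is not None
full_match = re.fullmatch(pattern, value) is not None
anchored_match = re.match(anchor_fully(pattern), value) is not None

print(prefix_match, full_match, anchored_match)
```

`re.match` accepts the value because the prefix `sample.bam` satisfies the pattern, while both the full match and the wrapped pattern reject it.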

What is enforced and where

| Stage | Mechanism | What it checks | Source |
| --- | --- | --- | --- |
| Workflow save | InputCollectionModule.save_to_step → validate_column_definitions | column-def schema well-formed; default-value type matches type; validators conform to safe allowlist | GS lib/galaxy/workflow/modules.py:1198-1199, tool_util_models/sample_sheet.py:42-56 |
| Collection construction (POST /api/dataset_collections, fetch, sample_sheet_workbook parse) | validate_row → validate_column_value per cell | row arity; cell type-coerces; restrictions membership; safe validators run statically | GS sample_sheet_util.py:97-174 |
| Element-identifier columns | validate_column_value checks value ∈ element_identifiers of same collection | within-collection cross-reference | GS sample_sheet_util.py:155-162 |
| Workflow form / runtime | DataCollectionToolParameter filters by column_definitions_compatible | structural compatibility (name + type + arity, in order) — no validator/restrictions check | GS tools/parameters/basic.py:2585-2588, sample_sheet_util.py:177-212 |
| Dataset column | none beyond presence / element_identifier shape | no datatype sniff vs a format, no path-exists, no mimetype, no per-row dataset validation | DI |
| Cross-row | none | no uniqueness, no aggregate constraint | DI — search “unique” in sample_sheet_util.py returns zero matches |
| Conditional/dependent required | none | optional/required is per-column only | DI |
| Row-level error escalation | RequestParameterInvalidException → API 400 | first-failure short-circuit, not aggregated | GS sample_sheet_util.py:104-112 |
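
The structural compatibility check in the "Workflow form / runtime" row can be modelled as follows. This is a sketch of the behaviour described in the table (name + type + arity, in order), not a copy of `column_definitions_compatible`; the function name and the plain-dict column shape are assumptions for illustration.

```python
from typing import Any, Sequence


def structurally_compatible(
    offered: Sequence[dict[str, Any]], required: Sequence[dict[str, Any]]
) -> bool:
    """Compare name and type position by position; ignore everything else."""
    if len(offered) != len(required):
        return False  # arity mismatch
    return all(
        o["name"] == r["name"] and o["type"] == r["type"]
        for o, r in zip(offered, required)
    )


required_columns = [
    {"name": "condition", "type": "string", "restrictions": ["treated", "control"]},
    {"name": "replicate", "type": "int"},
]
offered_columns = [
    {"name": "condition", "type": "string"},  # no restrictions declared
    {"name": "replicate", "type": "int"},
]
print(structurally_compatible(offered_columns, required_columns))
```

The example passes even though the offered sheet declares no restrictions, which is the gap the table points at: validators and restrictions do not take part in matching.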

Variant-specific differences

Per galaxy-sample-sheet-collections, all four variants (sample_sheet, sample_sheet:paired, sample_sheet:paired_or_unpaired, sample_sheet:record) share column_definitions semantics. Variant axis controls element shape, not column validation. Practical consequence: a paired-end nf-core sheet with one R1 path and an optional R2 path becomes sample_sheet:paired or sample_sheet:paired_or_unpaired — the path columns vanish from column_definitions because they become the dataset payload, leaving only metadata columns.

Round-trip through gxformat2

  • gxformat2 → Galaxy: import accepts column_definitions on data_collection inputs as additional state. gxformat2 v19_09 (FS schema/v19_09/workflow.yml) has no column_definitions field declaration. Field passes through FS gxformat2/normalized/_conversion.py:524-540.
  • Galaxy → gxformat2 export: same code path round-trips it. No silent drop observed for the in-allowlist subset.
  • DI: because gxformat2 has no schema-level declaration, additions like new validator types could in principle round-trip without a schema rev, but tooling that strict-validates against the SALAD schema (gxwf, IDE) will not understand them.
  • Existing example: gxformat2/examples/format2/synthetic-sample-sheet-input.gxwf.yml — only uses restrictions, name, default_value, optional.

Gap matrix

Support: N=Native, P=Partial, L=Lossy, A=Absent. Loss observable: cast / import / invocation / runtime. Foundry recommendation: preserve / record / promote / drop / refuse.

| nf-schema feature | Galaxy support | Loss observable | Foundry recommendation |
| --- | --- | --- | --- |
| type: string | N | | preserve |
| type: integer | N (int) | | preserve |
| type: number | N (float) | | preserve |
| type: boolean | N | | preserve |
| type: ["string","integer"] union | A | cast | promote to string; record loss_class: type_union_collapsed |
| type: array / object (nested cell) | A | cast | refuse; keep as scalar string JSON-encoded with warn |
| pattern | P (regex) | invocation | preserve via regex validator; record anchoring caveat (match vs fullmatch) |
| enum | N (restrictions) | | preserve |
| format: file-path | P | runtime | promote: column becomes dataset payload of sample_sheet* variant; the path itself disappears from column_definitions |
| format: directory-path | A | cast | refuse — Galaxy has no directory dataset; record loss |
| format: path (file or dir) | P | cast | treat as file-path; record loss_class: directory_path_unsupported if directory branch reachable |
| format: file-path-pattern (glob) | A | cast | refuse — promote to data_collection input outside the sample sheet |
| format: email / others | A | cast | preserve as regex if a regex is supplied; otherwise record loss |
| exists: true | A on column; partial via Galaxy runtime for dataset columns | cast / runtime | for non-data columns refuse and promote to data input. For data columns record loss_class: exists_implicit_via_dataset |
| mimetype | A | cast | record loss; recommend Galaxy datatype filter on a separate data input |
| minimum/maximum | N (in_range) | | preserve |
| exclusiveMinimum/exclusiveMaximum | N (in_range.exclude_*) | | preserve |
| multipleOf | A | invocation | drop with loss_class: numeric_multipleof_dropped |
| minLength/maxLength | N (length) | | preserve |
| Per-column required | N (optional: false) | | preserve |
| default | N (default_value) | | preserve |
| description | N | | preserve |
| errorMessage | A as binding; partial via custom messages | cast | preserve text into description; do not silently lose it |
| deprecated | A | cast | drop column |
| hidden | A | cast | drop |
| fa_icon | A | cast | drop silently |
| meta: [...] | N as identifier | | use to choose element_identifier for sample id; remaining meta columns survive as ordinary column_definitions |
| dependentRequired | A | cast | record + promote — emit composite scalar mode input when enumerable; otherwise record loss_class: conditional_required_dropped |
| anyOf of dependentRequired (sarek lane discriminator) | A | cast | refuse single-sheet mapping; offer split (paired sheet + record sheet) plus mode scalar; record loss |
| oneOf / if/then/else | A | cast | record loss; require interface decision |
| uniqueEntries (single col) | A | cast | record loss_class: cross_row_unique_dropped; rely on user discipline |
| uniqueEntries (composite key) | A | cast | record loss; consider promoting composite key column to element_identifier if it forms a primary key |
| uniqueItems (full row) | A | cast | record loss |
| samplesheetToList field-order rule | DI: gxformat2 column order is authoritative | | preserve order — emit column_definitions in nf-schema property order, not CSV order |
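
A few rows of the matrix can be turned into a per-property cast function. The sketch below is illustrative only: `cast_property` is a hypothetical name, the dict shapes for columns and validators are assumptions modelled on the field names listed in this note, and only the type, pattern, range, length, enum, default, and multipleOf rows are covered.

```python
from typing import Any

TYPE_MAP = {"string": "string", "integer": "int", "number": "float", "boolean": "boolean"}


def cast_property(name: str, prop: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Map one nf-schema property to a column definition plus loss classes."""
    losses: list[str] = []
    schema_type = prop.get("type", "string")
    if isinstance(schema_type, list):
        # Union types have no Galaxy counterpart: collapse to string.
        galaxy_type = "string"
        losses.append("type_union_collapsed")
    elif schema_type in TYPE_MAP:
        galaxy_type = TYPE_MAP[schema_type]
    else:
        # array / object cells are refused in the matrix.
        raise ValueError(f"{name}: unsupported cell type {schema_type!r}")

    column: dict[str, Any] = {"name": name, "type": galaxy_type}
    validators: list[dict[str, Any]] = []
    if "pattern" in prop:
        validators.append({"type": "regex", "expression": prop["pattern"]})
        losses.append("regex_anchor_drift")
    if "minimum" in prop or "maximum" in prop:
        validators.append(
            {"type": "in_range", "min": prop.get("minimum"), "max": prop.get("maximum")}
        )
    if "minLength" in prop or "maxLength" in prop:
        validators.append(
            {"type": "length", "min": prop.get("minLength"), "max": prop.get("maxLength")}
        )
    if "multipleOf" in prop:
        losses.append("numeric_multipleof_dropped")
    if "enum" in prop:
        column["restrictions"] = list(prop["enum"])
    if "default" in prop:
        column["default_value"] = prop["default"]
    if validators:
        column["validators"] = validators
    return column, losses


percent_mapped = cast_property("percent_mapped", {"type": "number", "minimum": 0, "maximum": 100})
lane = cast_property("lane", {"type": ["string", "integer"], "pattern": r"^\S+$"})
print(percent_mapped)
print(lane)
```

Returning the loss classes alongside the column keeps the recording step separate from the mapping step, so a caller can decide later whether a given loss should block the cast.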

Worked examples

nf-core/rnaseq assets/schema_input.json

Variant: sample_sheet:paired_or_unpaired (R2 optional, plus optional alt bam columns).

| Source column | nf-schema features | Galaxy decision | Loss class |
| --- | --- | --- | --- |
| sample | string, pattern: ^\S+$, meta: id, errorMessage | element_identifier; errorMessage → description | meta_id_promoted_to_element_identifier |
| fastq_1 | string, file-path, exists, fastq path-pattern, errorMessage | dataset payload (forward); column drops out of column_definitions | path_pattern_to_dataset_format |
| fastq_2 | same, optional | dataset payload (reverse, optional → paired_or_unpaired) | same |
| strandedness | enum 4-way, meta: strandedness | string with restrictions: [forward,reverse,unstranded,auto], required | none |
| seq_platform | string ^\S+$, meta | string with regex | none |
| seq_center | string ^\S+$, meta | string with regex | none |
| genome_bam | file-path, exists, \.bam$ | conflict — alternative input branch. Promote out of sample sheet to parallel data input or split into sample_sheet:record. | alternative_input_branch |
| transcriptome_bam | same | same | same |
| percent_mapped | number 0..100, meta | float with in_range(min=0, max=100) | none |

Items-level constraints: none — no cross-row losses for rnaseq.

nf-core/sarek assets/schema_input.json

Richest case in the corpus.

| Source column | nf-schema features | Galaxy decision | Loss class |
| --- | --- | --- | --- |
| patient | string ^\S+$, meta | string with regex, required | none |
| sample | string ^\S+$, meta | element_identifier | meta promotion |
| sex | enum XX/XY/NA, default NA, meta | string with restrictions, default_value: "NA" | none |
| status | integer enum 0/1, default 0, meta | int with restrictions: [0,1], default_value: 0 | none |
| lane | anyOf: [int, string] union, ^\S+$, meta | string (collapse), regex | type_union_collapsed |
| fastq_1 / fastq_2 | path-pattern fastq, exists | dataset payload of sample_sheet:paired_or_unpaired | path-format loss |
| spring_1 / spring_2 | spring fastq | alternative input — Galaxy has no spring datatype baseline; refuse or split | alternative_branch_unsupported_format |
| table | recalibration table | alternative branch, data input | branch |
| cram/crai, bam/bai | preprocessed alternative | split into parallel sample_sheet:record or data inputs | branch |
| contamination | number, exists: true (probably schema bug — number with exists) | float, drop exists | exists_on_non_string_dropped |
| vcf | path | alternative branch, data input | branch |
| variantcaller | string | string | none |

Items-level (the meat of the gap):

  • dependentRequired: {fastq_2: [fastq_1], spring_2: [spring_1]} → A. paired_or_unpaired makes R2-without-R1 unrepresentable; spring_2-without-spring_1 would still be possible if both columns existed. Record loss_class: dependentRequired_partially_structural.
  • anyOf: [{dependentRequired: {lane: [fastq_1]}}, {lane: [spring_1]}, {lane: [bam]}] → A. Encodes “lane requires one of the data branches.” Mitigation: a top-level scalar data_source: fastq/spring/bam enum input plus three sample-sheet inputs (gated by docs). Record loss_class: discriminated_union_required.
  • required: ["patient","sample"] → N.
  • uniqueEntries: ["lane","patient","sample"] → A. Composite key. Record loss_class: cross_row_unique_composite.

nf-core/taxprofiler assets/schema_database.json

Reference/database sheet — typically a separate Galaxy input from the biological samplesheet (see nextflow-params-to-galaxy-inputs §Reference data).

| Source column | nf-schema features | Galaxy decision | Loss class |
| --- | --- | --- | --- |
| tool | enum (15 profilers), meta | string with restrictions: [bracken, centrifuge, ...] | none |
| db_name | string ^\S+$, meta | string with regex | none |
| db_params | string ^[^"']*$, meta | string with regex — caveat: column-value [\w\-_ ?]* gate (GS sample_sheet_util.py:117) is stricter than the source regex. CLI flag strings will commonly fail Galaxy’s gate (e.g. --threshold 0.5 contains ., which is outside the class). Record loss_class: galaxy_value_charset_overrestrictive. | per-cell value-charset |
| db_type | enum short / long / short;long, default short;long | string restrictions, default_value: "short;long" — value short;long contains ; which is rejected by [\w\-_ ?]*. Galaxy will refuse this value at row submission. | per-cell value-charset (blocking) |
| db_path | string, format=path, exists | dataset payload of sample_sheet — but format: path means dir-or-file, and Galaxy has no directory dataset; if a tool needs directory, split into data input plus document | directory-path-unsupported |

Items-level: uniqueEntries: ["tool","db_name"] → A. Record loss.

The db_type finding is the highest-impact concrete blocker discovered: a literal nf-core enum value cannot be stored in a Galaxy column_definition cell. It must be rewritten (short_long), promoted to a separate scalar, or the gate loosened (Galaxy work item W1 below).

Galaxy implementation roadmap

Prioritized by frequency-of-bite across the 8 nf-core fixtures.

W1. Loosen sample-sheet column-value charset to allow nf-core-canonical values

Problem. has_special_characters (GS lib/galaxy/model/dataset_collections/types/sample_sheet_util.py:116-119) rejects column values matching anything outside [\w\-_ ?]*. nf-core enums and free-text routinely contain ;, ., ,, :, /, =, ', ". Concrete blocker: db_type: "short;long" (taxprofiler default).

Fix shape. Distinguish three charset gates:

  1. column name (current strict gate — keep, serializes into TSV header).
  2. element_identifier value (must serialize cleanly into TSV — keep current gate).
  3. arbitrary cell value (relax to “no control characters, no embedded newline/tab; CSV-escapable”) — strip_control_characters plus a CSV-escapability check, not the current word-character regex.
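
The split between the strict and relaxed gates could look like the sketch below. The two function names follow the touch-point proposal in this section, but the bodies are assumptions: the strict gate is modelled as a full match against the quoted class, and the relaxed gate rejects only Unicode control characters (which include tab and newline). A CSV-escapability check is left out.

```python
import re
import unicodedata

STRICT_GATE = re.compile(r"[\w\-_ ?]*")


def validate_identifier_charset(value: str) -> None:
    """Strict gate, kept for column names and element_identifier values."""
    if STRICT_GATE.fullmatch(value) is None:
        raise ValueError(f"identifier contains unsupported characters: {value!r}")


def validate_value_charset(value: str) -> None:
    """Relaxed gate for ordinary cells: reject control characters only."""
    for char in value:
        # Unicode category "Cc" covers tab, newline, carriage return, and other controls.
        if unicodedata.category(char) == "Cc":
            raise ValueError(f"cell value contains control character {char!r}")


validate_value_charset("short;long")  # accepted by the relaxed gate
validate_value_charset("--threshold 0.5")  # accepted by the relaxed gate
```

With this split the taxprofiler default passes as a cell value while still being rejected as an identifier, which keeps the TSV header and element-identifier guarantees unchanged.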

Touch points:

  • GS lib/galaxy/model/dataset_collections/types/sample_sheet_util.py:116-129,155-162 — split validate_no_special_characters into validate_identifier_charset and validate_value_charset; only call the strict one for element_identifier.
  • GS lib/galaxy/tools/sample_sheet_to_tabular.xml — verify TSV escapes \t, \n correctly (introduce CSV-mode or quote rule).

Risk. Medium. Touches collection-build-time validation and downstream TSV writers.

Size. S–M. gxformat2 rev? No. Existing PRs. None found; Galaxy issue #20831 (Sample Sheets follow-up) is the umbrella.

W2. Add unique column-level flag and unique_entries items-level flag

Problem. Cross-row uniqueness is the second-most-common nf-core constraint. Corpus examples: sarek ["lane","patient","sample"]; taxprofiler-input ["fastq_1"]/["fastq_2"]/["fasta"]/["sample","run_accession"]; taxprofiler-database ["tool","db_name"].

Fix shape.

  • Single-column unique: extend SampleSheetColumnDefinition with unique: bool = False (GS lib/galaxy/tool_util_models/sample_sheet.py).
  • Composite unique: add a top-level unique_entries: List[List[str]] to the collection’s column_definitions envelope (currently flat list — wrap as {"columns": [...], "unique_entries": [...]} or attach a sibling JSON column on dataset_collection). The wrapper option is more future-proof.
  • Validation: in validate_row only the current row is visible; cross-row check runs after all rows are collected. Plug into SampleSheetDatasetCollectionType.generate_elements (GS types/sample_sheet.py:18-40) — accumulate seen tuples, raise on duplicate.
  • Workbook parser (GS types/sample_sheet_workbook.py) needs the same check at upload.
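
The accumulate-and-raise step can be sketched independently of Galaxy's collection types. `check_unique_entries` is a hypothetical helper; rows are plain dicts with hashable scalar values, and a missing column is treated as `None` for the purpose of the key.

```python
from typing import Any, Iterable, Sequence


def check_unique_entries(
    rows: Iterable[dict[str, Any]], unique_entries: Sequence[Sequence[str]]
) -> None:
    """Raise on the first row that repeats a key tuple seen in an earlier row."""
    # One "seen" map per declared key, so several keys are checked in one pass.
    seen: list[dict[tuple, int]] = [{} for _ in unique_entries]
    for index, row in enumerate(rows):
        for key_columns, seen_keys in zip(unique_entries, seen):
            key = tuple(row.get(column) for column in key_columns)
            if key in seen_keys:
                raise ValueError(
                    f"row {index} duplicates row {seen_keys[key]} "
                    f"on {list(key_columns)}: {key}"
                )
            seen_keys[key] = index


database_rows = [
    {"tool": "kraken2", "db_name": "db1"},
    {"tool": "bracken", "db_name": "db1"},
    {"tool": "kraken2", "db_name": "db1"},
]
try:
    check_unique_entries(database_rows, [["tool", "db_name"]])
except ValueError as error:
    print(error)
```

Because the check needs every row, it belongs after per-row validation has finished, which matches the placement suggested above.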

Risk. Medium — schema migration touches dataset_collection.column_definitions shape; backwards compat required (accept both list and {columns, unique_entries} envelope).
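
One way to meet the backwards-compatibility requirement is to normalize both stored shapes into the envelope on read. The following is a sketch under that assumption; `normalize_column_definitions` is a hypothetical name and the envelope keys are the ones proposed in the fix shape above.

```python
from typing import Any, Union


def normalize_column_definitions(stored: Union[list, dict]) -> dict[str, Any]:
    """Accept the legacy flat list or the proposed envelope; return the envelope."""
    if isinstance(stored, list):
        # Legacy shape: a bare list of column definitions, no cross-row keys.
        return {"columns": list(stored), "unique_entries": []}
    if isinstance(stored, dict) and "columns" in stored:
        return {
            "columns": list(stored["columns"]),
            "unique_entries": [list(key) for key in stored.get("unique_entries", [])],
        }
    raise ValueError("unrecognized column_definitions shape")


legacy = [{"name": "tool", "type": "string"}]
print(normalize_column_definitions(legacy))
```

Readers would then only ever see the envelope, and writers could keep emitting the flat list whenever `unique_entries` is empty.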

Size. M. gxformat2 rev? Yes for typed authoring tools.

W3. Express conditional / dependent required at the column-definitions level

Problem. Sarek’s dependentRequired and anyOf-of-dependentRequired are routine; without them the cast Mold either over-promotes columns to required (creating impossible workflows) or under-promotes (silent runtime errors).

Fix shape. Two options.

  • (a) Rule-based — add requires: [column_name] on a column for dependentRequired. Validate in validate_row. Covers sarek’s R2→R1 and spring2→spring1.
  • (b) Discriminator-based — add envelope-level discriminator: {column: "data_source", branches: {fastq: [fastq_1], spring: [spring_1], bam: [bam]}}. Covers sarek’s anyOf-of-dependentRequired. Closer to JSON-Schema’s oneOf but constrained to a single discriminator column, which is the only shape nf-core uses.
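
Both options reduce to small per-row checks. The sketch below assumes the `requires` and `discriminator` shapes proposed above and uses hypothetical helper names; a cell counts as present when it is neither `None` nor the empty string.

```python
from typing import Any


def is_present(row: dict[str, Any], column: str) -> bool:
    value = row.get(column)
    return value is not None and value != ""


def check_requires(row: dict[str, Any], column_definitions: list[dict[str, Any]]) -> list[str]:
    """Option (a): a present column demands every column in its `requires` list."""
    errors = []
    for column in column_definitions:
        if not is_present(row, column["name"]):
            continue
        for needed in column.get("requires", []):
            if not is_present(row, needed):
                errors.append(f"{column['name']} requires {needed}")
    return errors


def check_discriminator(row: dict[str, Any], discriminator: dict[str, Any]) -> list[str]:
    """Option (b): the discriminator value selects which columns must be present."""
    branch = row.get(discriminator["column"])
    required = discriminator["branches"].get(branch)
    if required is None:
        return [f"unknown branch {branch!r} in column {discriminator['column']}"]
    return [f"branch {branch!r} requires {name}" for name in required if not is_present(row, name)]


columns = [{"name": "fastq_1"}, {"name": "fastq_2", "requires": ["fastq_1"]}]
discriminator = {
    "column": "data_source",
    "branches": {"fastq": ["fastq_1"], "spring": ["spring_1"], "bam": ["bam"]},
}
print(check_requires({"fastq_2": "s1_R2.fq.gz"}, columns))
print(check_discriminator({"data_source": "spring"}, discriminator))
```

Returning a list of messages instead of raising keeps both checks usable from a first-failure validator and from an aggregating one.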

Touch points:

  • GS lib/galaxy/tool_util_models/sample_sheet.py — extend model.
  • GS lib/galaxy/model/dataset_collections/types/sample_sheet_util.py:97-113 — extend validate_row.
  • GS lib/galaxy/tools/parameters/basic.py:2585-2588 (column_definitions_compatible) — decide whether dependent-required participates in compatibility.

Risk. Medium-high. Affects compatibility matching, which gates editor-time wiring.

Size. M. gxformat2 rev? Yes. Existing PRs. None found.

W4. Add a path column type with optional datatype enforcement

Problem. nf-schema format: file-path columns become Galaxy datasets, but the path-pattern regex (^\S+\.bam$ etc.) is lost.

Fix shape.

  • (a, smaller) For path-bearing columns, record a format (Galaxy datatype) hint on column_definitions; have DataCollectionToolParameter filter compatible collections by it (GS tools/parameters/basic.py:2585-2588). Piggybacks on existing Galaxy datatype machinery.
  • (b, larger) Promote nf-schema pattern on path columns to a Galaxy datatype lookup at cast time and assert at upload — out of scope for Galaxy; cast Mold concern via nextflow-path-glob-to-galaxy-datatype.

Risk. Low for (a). Size. S. gxformat2 rev? Optional.

W5. Carry errorMessage and description end-to-end

Problem. nf-core authors write errorMessage on every column. Galaxy’s only landing slot is column_definition.description; safe-allowlist validators don’t accept user-supplied messages — regex/in_range/length models have no message field (GS parameter_validators.py:160-245).

Fix shape. Add message: Optional[str] to RegexParameterValidatorModel, InRangeParameterValidatorModel, LengthParameterValidatorModel. Plumb through default_message override.

Risk. Low. Size. S. gxformat2 rev? No.

W6. Distinguish “directory” vs “file” path columns

Problem. nf-schema has format: directory-path and format: path (file-or-dir), but Galaxy has no directory dataset. taxprofiler db_path is the routine case.

Fix shape. No Galaxy code change recommended now — cast-time refusal and documented loss. Galaxy roadmap item only if directory support arrives via something like CWL Directory inputs.

W7. Aggregate-row error reporting

Problem. validate_row short-circuits on first failure (GS sample_sheet_util.py:104-112). Users uploading a 200-row sheet get one error at a time. nf-schema reports all errors per row.

Fix shape. Collect errors into a list; raise a single RequestParameterInvalidException with structured payload (row index → field → message).
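
The aggregation pattern can be sketched as follows. `SampleSheetValidationError` is a hypothetical stand-in for Galaxy's `RequestParameterInvalidException`, and `validate_cell` is any per-cell callable that raises `ValueError` on failure; neither reflects Galaxy's actual signatures.

```python
from typing import Any, Callable, Iterable


class SampleSheetValidationError(ValueError):
    """Carries every failure as {row index: {field: message}}."""

    def __init__(self, errors: dict[int, dict[str, str]]):
        self.errors = errors
        count = sum(len(fields) for fields in errors.values())
        super().__init__(f"{count} invalid cell(s) in {len(errors)} row(s)")


def validate_rows(
    rows: Iterable[dict[str, Any]], validate_cell: Callable[[str, Any], None]
) -> None:
    """Run validate_cell on every cell and raise once with all failures."""
    errors: dict[int, dict[str, str]] = {}
    for index, row in enumerate(rows):
        for field, value in row.items():
            try:
                validate_cell(field, value)
            except ValueError as exc:
                errors.setdefault(index, {})[field] = str(exc)
    if errors:
        raise SampleSheetValidationError(errors)


def reject_semicolon(field: str, value: Any) -> None:
    # Toy per-cell rule used only to exercise the aggregation.
    if ";" in str(value):
        raise ValueError(f"{field} contains ';'")


sheet = [{"db_type": "short;long", "tool": "kraken2"}, {"db_type": "short", "tool": "a;b"}]
try:
    validate_rows(sheet, reject_semicolon)
except SampleSheetValidationError as error:
    print(error, error.errors)
```

Keeping the summary in the exception message and the detail in a structured attribute lets existing clients read the text while newer ones render per-row feedback.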

Risk. Low (backward compatible if message text preserved). Size. S.

Priority summary

| # | Title | Bite frequency | Risk | Size |
| --- | --- | --- | --- | --- |
| W1 | Loosen value charset | every taxprofiler / db_params-style | Med | S–M |
| W2 | uniqueEntries | sarek + taxprofiler (3 of 8) | Med | M |
| W3 | dependentRequired / discriminator | sarek (highest schema complexity) | Med-High | M |
| W4 | per-column path/format hint | rnaseq / sarek alt-branches | Low | S |
| W5 | errorMessage round-trip | universal | Low | S |
| W7 | aggregate row errors | quality-of-life | Low | S |
| W6 | directory paths | taxprofiler db_path | High | L |

Cast Mold loss-recording guidance

Cast Mold should write a per-column entry into the interface brief whenever an nf-schema feature is mapped to Galaxy column_definitions. Record shape:

column_loss_records:
  - source_column: db_type
    nf_schema_features:
      type: string
      enum: ["short", "long", "short;long"]
      default: "short;long"
      meta: ["db_type"]
    galaxy_column_definition:
      name: db_type
      type: string
      restrictions: ["short", "long", "short;long"]
      default_value: "short;long"
      optional: true
    loss_class: galaxy_value_charset_overrestrictive
    loss_severity: blocking
    mitigation: rename canonical value "short;long" to "short_long" and remap upstream; record CLI mismatch

loss_class enum

| loss_class | When |
| --- | --- |
| none | feature preserved exactly |
| regex_anchor_drift | pattern preserved as regex validator (Galaxy uses match, not fullmatch) |
| type_union_collapsed | nf-schema ["string","integer"] collapsed to string |
| numeric_multipleof_dropped | multipleOf not expressible |
| path_pattern_to_dataset_format | path column became dataset payload; per-pattern check lost |
| directory_path_unsupported | format: directory-path |
| path_glob_unsupported | format: file-path-pattern |
| mimetype_dropped | mimetype lost |
| exists_dropped | exists: true dropped (non-data column) |
| exists_implicit_via_dataset | exists satisfied because column is a Galaxy dataset |
| errorMessage_dropped | nf-schema errorMessage lost (until W5) |
| dependentRequired_dropped | per-row dependent required not expressible |
| dependentRequired_partially_structural | satisfied by paired/paired_or_unpaired structure |
| discriminated_union_required | anyOf-of-dependentRequired not expressible |
| cross_row_unique_dropped | single-column uniqueEntries |
| cross_row_unique_composite | composite-key uniqueEntries |
| galaxy_value_charset_overrestrictive | column value contains a char outside [\w\-_ ?] |
| meta_id_promoted_to_element_identifier | meta: ["id"] mapped to element_identifier |
| alternative_input_branch | column belongs to an anyOf-style alt branch; promoted out of sample sheet |
| deprecated_dropped | deprecated: true columns excluded |

loss_severity enum

| Value | Meaning |
| --- | --- |
| none | round-trip exact |
| cosmetic | UI prose lost, behavior identical |
| informational | constraint not enforced, but unlikely to mis-fire |
| behavioral | constraint not enforced; user discipline required |
| blocking | feature fundamentally cannot be expressed; user must be redirected (alt branch, refuse) |

Refuse vs map-and-warn

Refuse mapping (push back to interface brief as a question) when:

  • A format: directory-path column is required.
  • A format: file-path-pattern column would be the dataset payload.
  • A discriminated anyOf-of-dependentRequired cannot be modeled by a single sample-sheet variant.
  • Column value charset is fundamentally incompatible (e.g. nf-schema enum value contains ;).
  • The sheet uses a nested schema: reference to validate a per-row file (a sheet of sheets).

Map-and-warn (record loss, proceed) when:

  • multipleOf, exclusive*, minLength/maxLength.
  • pattern (regex anchoring caveat).
  • errorMessage, description, help_text (cosmetic until W5).
  • uniqueEntries (record behavioral; user discipline).
  • dependentRequired already structurally satisfied by the variant.
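
The two lists can be approximated by a small decision function. This is a partial sketch: `decide` is a hypothetical name, it covers only the format, enum-charset, and a few map-and-warn keys from the lists above, and it models the charset gate as a full match against the quoted class.

```python
import re
from typing import Any

STRICT_GATE = re.compile(r"[\w\-_ ?]*")
# Subset of the map-and-warn keys listed above.
WARN_KEYS = ("multipleOf", "pattern", "errorMessage")


def decide(prop: dict[str, Any], *, required: bool = False, is_payload: bool = False) -> str:
    """Return "refuse", "map_and_warn", or "preserve" for one nf-schema property."""
    fmt = prop.get("format")
    if fmt == "directory-path" and required:
        return "refuse"
    if fmt == "file-path-pattern" and is_payload:
        return "refuse"
    # An enum value the gate cannot store makes the column unmappable as-is.
    if any(STRICT_GATE.fullmatch(str(value)) is None for value in prop.get("enum", [])):
        return "refuse"
    if any(key in prop for key in WARN_KEYS):
        return "map_and_warn"
    return "preserve"


print(decide({"type": "string", "enum": ["short", "long", "short;long"]}))
print(decide({"type": "string", "pattern": r"^\S+$"}))
```

Refusal checks run first so that a property with both a blocking and a cosmetic issue is never silently downgraded to a warning.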

Open questions

  • gxformat2 schema authority for column_definitions. Today round-trips as additional state. Should W2/W3 trigger a real v19_09 rev plus IDE bindings, or remain in additional state? Punt: small-rev once W2 lands.
  • column_definitions_compatible (GS sample_sheet_util.py:177-212) compares only name + type. Should validators participate in compatibility? Tightening would break editor wiring of legacy sample sheets.
  • RegexParameterValidatorModel uses regex.match, not fullmatch — should the sample-sheet caller force fullmatch semantics by appending $? Cast Mold could do this transparently and record under regex_anchor_drift.
  • nf-schema meta mapping: is every meta-marked column promoted to element_identifier, or only meta: id? Convention is meta: ["id"] for primary key, others as ordinary metadata. Confirm with sarek (six meta columns).
  • For sample sheets that mix multiple “branches” (sarek: fastq vs spring vs bam vs cram vs vcf), should the cast Mold always split into multiple Galaxy inputs gated by a scalar mode, or attempt sample_sheet:record? Likely a per-pipeline interface decision.
  • Should W3’s “discriminator” be added as gxformat2 schema or as a Galaxy-only extension? Affects portability to CWL.
  • Galaxy issue #20541 (Custom Tabular Inputs for Workflows) — long-term home for richer column-validation, or extend sample_sheet codepath?