EXTRACT_IO_VALIDATION_PLAN

EXTRACT_IO_VALIDATION_PLAN

Make input-name and output-label uniqueness validation symmetric across the workflow-extraction frontend and backend.

Status: IMPLEMENTED

All three changes + tests landed on branch extract_io_validation. Frontend vitest 29/29 green locally; backend API tests authored (py_compile-clean) and deferred to CI (worktree had no .venv). Four files touched: services/workflows.py, WorkflowExtractionForm.vue, test_workflow_extraction.py, WorkflowExtractionForm.test.ts.

Two deltas vs. the plan as written (see Implementation notes at bottom):

Problem

Workflow extraction names inputs (step labels) and labels outputs (workflow output labels). Both namespaces require uniqueness for a valid workflow, but enforcement is lopsided — the only uniqueness check that exists today is the output check, and it lives backend-only:

frontend uniquenessbackend uniqueness
inputsnone (empty-only)none — silently drops the dup’s label (extract.py:161-163,177-179)
outputsnone (empty-only)hard 400 (services/workflows.py:342-346)

Consequences:

Target end state — all four cells enforced:

frontendbackend
inputsinline disabled + reasonhard 400
outputsinline disabled + reasonhard 400 (already done)

Targeting: dev (verified)

This work targets dev, independent of the notebook-extraction commit (44dd1b1d65) on this branch:

No dependency on notebook extraction. Can branch from dev.

Changes

1. BACKEND — input-name validation (hard 400)

Add a module-level helper _validate_input_names(dataset_names, dataset_collection_names) in lib/galaxy/webapps/galaxy/services/workflows.py, called from both service entry points so the by-ids and HID/from-history API paths are guarded identically (DECISION: BACKEND_VALIDATION_LOCATION):

Input step labels share one namespace across datasets and collections (the single step_labels set in extract_steps), so the helper validates the combined provided names. It only inspects names that were actually supplied — the no-names default path ("Input Dataset" constants) is untouched, so legacy unnamed extraction does not regress. Three checks, each raising RequestParameterInvalidException naming the offender:

  1. Non-empty — reject a provided name that is empty / whitespace-only (parity with _sanitize_output_label; UI already blocks this, so it guards direct API callers).
  2. Length — reject a provided name >255 chars. WorkflowStep.label is Unicode(255) (model/__init__.py:9224); an over-long input name is a commit-time error today. Reject, not truncate, to keep accepted names verbatim (DECISION: INPUT_NAMES_RAW).
  3. Uniqueness — reject duplicates in the combined list, compared raw (no strip/collapse) to match how extract_steps keys step_labels.

The silent-dedup in extract_steps stays as a safety net for the no-names default path.

2. FRONTEND — output-label uniqueness (inline)

In WorkflowExtractionForm.vue, add hasDuplicateOutputLabels over the already-built selectedOutputLabels array (which is filtered to exposed + has-output_name + non-empty). Detect duplicate normalized label values. Wire into submissionDisabled and add a submissionDisabledMsg branch (e.g. “Exposed output labels must be unique”).

Normalize the same way the backend’s _sanitize_output_label does — .trim() plus collapse of internal whitespace — so the frontend check and the backend guard agree exactly (see DECISION: NORMALIZATION_PARITY).

3. FRONTEND — input-name uniqueness (inline)

Add hasDuplicateInputNames over selectedInputs (the flattened checked-input list). Detect duplicate newName values across all selected input steps (datasets + collections together — single namespace). Wire into submissionDisabled + submissionDisabledMsg (e.g. “Workflow input names must be unique”). Required so that change #1 doesn’t recreate the submit-time-400 disconnect for inputs. Compare raw newName (backend uses input names raw, no sanitize).

Design decisions

DECISION: BACKEND_VALIDATION_LOCATION

Where to enforce input-name validation on the backend. Note both API paths forward input names — extract_from_history (HID) at services/workflows.py:238 and extract_by_ids at :260 — so a by-ids-only check would leave the HID path on silent-dedup.

DECISION: NORMALIZATION_PARITY (parity — chosen)

Frontend uniqueness must compare the same normalized form each side keys on, or a residual disconnect remains:

DECISION: EMPTY_INPUT_NAMES (reject — chosen)

The backend helper rejects empty / whitespace-only provided input names, matching the output path. The UI already blocks empties via hasUnnamedSelectedInputs, so this is a safety net for direct API callers.

DECISION: INPUT_NAMES_RAW (keep raw — chosen)

Accepted input names are stored verbatim — no whitespace-collapse, no silent truncation — so generated input step labels are unchanged (avoids snapshot/iwc drift). The only limits are hard rejections: empty and >255. Truncation was rejected because it would alter the user’s chosen label without telling them.

Test plan (red → green)

Backend (lib/galaxy_test/api/test_workflow_extraction.py)

Mirror the existing output tests (test_extract_duplicate_output_label_rejected, test_extract_distinct_outputs_with_duplicate_label_string_rejected) using _assert_extract_rejected + _seed_two_inputs_and_run_cat1:

Frontend (WorkflowExtractionForm.test.ts)

The file already exists (notebook commit added cases). Add, asserting on submissionDisabled / submissionDisabledMsg:

Run: backend via /galaxy-backend-tests (single suite, one at a time); frontend vitest under the venv-bootstrapped node per repo convention.

Resolved decisions

  1. Backend location — shared _validate_input_names helper, called from both extract_from_history and extract_by_ids (covers HID + by-ids).
  2. Normalization parity — frontend output check collapses internal whitespace to mirror _sanitize_output_label (trim + \s+ → single space).
  3. Empty input names — backend rejects empty / whitespace-only provided names (parity with outputs).
  4. Input names raw — accepted names stored verbatim; limits enforced by rejection (empty, >255), never truncation.

Open questions

None outstanding.

Implementation notes

HID-path endpoint correction

The test plan said to reuse TestWorkflowExtractionApi’s HID fixtures, but that class drives the legacy POST /workflows?from_history_id= endpoint (api/workflows.py:287), which calls extract_workflow directly, bypasses WorkflowsService entirely, and never forwards dataset_names. The validated extract_from_history service is served by the new POST /api/histories/{history_id}/extract_workflow route (api/histories.py:887). The HID-path test (test_extract_from_history_duplicate_input_names_rejected) posts there directly with dataset_hids + dataset_names. The legacy endpoint never names inputs, so leaving it unguarded is fine.

Output-label truncation parity (added)

See the addendum under DECISION: NORMALIZATION_PARITY. Frontend now trim → collapse \s+ → slice(0,255); paired truncation-collision tests added front and back.

Deferred (out of plan scope — backend-only safety nets)

Test results