Auto-Pairing in Galaxy
Auto-pairing is Galaxy’s mechanism for automatically matching forward/reverse paired-end sequencing datasets by filename convention. It exists as parallel implementations in the frontend (TypeScript) and backend (Python), both validated against a shared YAML test specification.
Test Specification Schema
client/src/components/Collections/auto_pairing_spec.yml and lib/galaxy/model/dataset_collections/auto_pairing_spec.yml are identical YAML files defining expected auto-pairing behavior. Each entry:
- doc: Human-readable description of the test case
  inputs:
    - filename_R1.fastq
    - filename_R2.fastq
  paired:
    <expected_pair_name>:
      forward: <forward_filename>
      reverse: <reverse_filename>
Fields:
- doc - description
- inputs - flat list of filenames to pair
- paired - map of pair name -> {forward, reverse} filenames
The pair name key (e.g. input541) is what the algorithm should derive as the identifier for the paired collection element. Adding a new test case means adding an entry to both YAML files and running both the frontend and backend test suites.
Frontend tests: client/src/components/Collections/pairing.test.ts - loads the YAML, creates {name} objects, calls autoPairWithCommonFilters(), asserts pair names and forward/reverse assignments.
Backend tests: test/unit/data/dataset_collections/test_auto_pairing.py - loads the YAML via resource_string, creates MockDataset objects, calls auto_pair(), asserts the same.
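As a rough illustration of the spec-driven approach, the sketch below iterates spec entries and compares a pairing function's output with each entry's paired map. It is not Galaxy's test code: the entry is written inline as a Python dict instead of being loaded from YAML, and run_spec, toy_pair, and the example filenames are hypothetical names chosen for this sketch.

```python
from typing import Callable

# Hypothetical spec entry, shaped like the YAML schema described above.
SPEC = [
    {
        "doc": "Simple _R1/_R2 pair",
        "inputs": ["input541_R1.fastq", "input541_R2.fastq"],
        "paired": {
            "input541": {
                "forward": "input541_R1.fastq",
                "reverse": "input541_R2.fastq",
            }
        },
    },
]

PairFn = Callable[[list[str]], dict[str, dict[str, str]]]


def run_spec(entries: list[dict], pair_fn: PairFn) -> list[str]:
    """Return the doc strings of every entry whose pairing result is wrong."""
    failures = []
    for entry in entries:
        if pair_fn(entry["inputs"]) != entry["paired"]:
            failures.append(entry["doc"])
    return failures


def toy_pair(names: list[str]) -> dict[str, dict[str, str]]:
    """Stand-in pairing function that only understands _R1/_R2 (assumption)."""
    partial: dict[str, dict[str, str]] = {}
    for name in names:
        if "_R1" in name:
            partial.setdefault(name.split("_R1")[0], {})["forward"] = name
        elif "_R2" in name:
            partial.setdefault(name.split("_R2")[0], {})["reverse"] = name
    # Keep only names that found both a forward and a reverse file.
    return {key: value for key, value in partial.items() if len(value) == 2}


failures = run_spec(SPEC, toy_pair)
```

Because the runner only depends on the entry shape, a new case is a data change: append an entry and both the forward/reverse assignment and the derived pair name are checked.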
Filter Patterns
Both implementations define the same COMMON_FILTERS:
| Key | Forward | Reverse | Example |
|---|---|---|---|
| illumina | _1 | _2 | sample_1.fastq / sample_2.fastq |
| Rs | _R1 | _R2 | sample_R1.fastq / sample_R2.fastq |
| dot12s | .1.fastq | .2.fastq | sample.1.fastq / sample.2.fastq |
Filter detection priority: The guessInitialFilterType() function counts how many input filenames contain each pattern. The pattern with the highest count wins; ties default to illumina. Detection order matters: .1.fastq/.2.fastq is checked first (since _1/_2 would also match .1.fastq names), then _R1/_R2, then _1/_2.
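The counting and tie-breaking rules can be sketched as follows. This is a Python approximation written for illustration, not the code from either implementation: it uses plain substring checks, counts each filename for the first pattern it matches in the stated detection order, and returns None when no pattern matches at all (that last behavior is an assumption).

```python
from typing import Optional

# Filter key -> (forward pattern, reverse pattern), mirroring the table above.
COMMON_FILTERS = {
    "illumina": ("_1", "_2"),
    "Rs": ("_R1", "_R2"),
    "dot12s": (".1.fastq", ".2.fastq"),
}

# Detection order as described above: dot12s, then Rs, then illumina.
DETECTION_ORDER = ("dot12s", "Rs", "illumina")


def guess_initial_filter_type(names: list[str]) -> Optional[str]:
    """Count pattern hits per filter key and return the most common key."""
    counts = {key: 0 for key in COMMON_FILTERS}
    for name in names:
        for key in DETECTION_ORDER:
            forward, reverse = COMMON_FILTERS[key]
            if forward in name or reverse in name:
                counts[key] += 1
                break  # each name is counted for at most one pattern
    if not any(counts.values()):
        return None  # assumption: nothing matched, so no initial filter
    best = max(counts.values())
    winners = [key for key, count in counts.items() if count == best]
    # A unique winner is returned; any tie falls back to "illumina".
    return winners[0] if len(winners) == 1 else "illumina"
```

For example, a list of sample_R1.fastq and sample_R2.fastq is counted twice under Rs and never under illumina, so Rs is selected.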
Frontend Implementation
Core Algorithm (client/src/components/Collections/pairing.ts)
Entry points, from highest to lowest level:
- autoPairWithCommonFilters(elements, willRemoveExtensions) - Guesses filter type, splits elements, runs pairing. Returns {filterType, forwardFilter, reverseFilter, pairs, unpaired}.
- splitIntoPairedAndUnpaired(elements, forwardFilter, reverseFilter, willRemoveExtensions) - Given explicit filters, splits elements into forward/reverse lists via regex, runs autoDetectPairs, returns AutoPairingResult<T>. Returns all elements as unpaired if either filter is empty.
- autoDetectPairs(listA, listB, forwardFilter, reverseFilter, willRemoveExtensions) - Two-pass matching:
  - Pass 1 (matchOnlyIfExact): After stripping filter strings, only pairs items whose names are identical (score = 1.0, threshold 0.6). Handles the common case where sample_R1.fastq and sample_R2.fastq become sample.fastq and sample.fastq.
  - Pass 2 (matchOnPercentOfStartingAndEndingLCS): For remaining unpaired items, uses LCS-based fuzzy matching (threshold 0.99). Handles cases with minor naming variations.
- statelessAutoPairFnBuilder(match, scoreThreshold, ...) - Factory that builds a pairing function from a scoring function. Iterates listA, scores each against all of listB, splices out the best match if it exceeds the threshold. The splice prevents double-pairing.
- guessNameForPair(fwd, rev, forwardFilter, reverseFilter, willRemoveExtensions) - Generates the pair identifier by stripping filters, computing the LCS of the remaining names, stripping URL prefixes and extensions. Falls back to fwd_and_rev if LCS is empty.
- naiveStartingAndEndingLCS(s1, s2) - Concatenates the longest common prefix and longest common suffix of two strings. This is not a true LCS but works well for filenames that differ only in a short middle segment (the filter pattern).
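The two-pass idea and the prefix-plus-suffix "LCS" can be sketched in Python as below. This is an illustration of the technique, not a port of pairing.ts: the snake_case names are hypothetical, filters are removed with plain substring replacement rather than regex, and the fuzzy score (LCS length divided by the longer name's length) is an assumption about how a percent-of-LCS score might be computed.

```python
from typing import Callable

Score = Callable[[str, str], float]


def naive_starting_and_ending_lcs(s1: str, s2: str) -> str:
    """Concatenate the longest common prefix and longest common suffix."""
    limit = min(len(s1), len(s2))
    prefix = 0
    while prefix < limit and s1[prefix] == s2[prefix]:
        prefix += 1
    suffix = 0
    # Bound the suffix so it cannot overlap the prefix already matched.
    while suffix < limit - prefix and s1[-1 - suffix] == s2[-1 - suffix]:
        suffix += 1
    return s1[:prefix] + s1[len(s1) - suffix:]


def match_only_if_exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def match_on_percent_of_lcs(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return len(naive_starting_and_ending_lcs(a, b)) / longest


def build_pairer(score: Score, threshold: float):
    """Build a pairing function that removes matched items from both lists."""

    def pair(list_a, list_b, fwd_filter, rev_filter):
        pairs = []
        index_a = 0
        while index_a < len(list_a):
            name_a = list_a[index_a].replace(fwd_filter, "", 1)
            best_index, best_score = -1, 0.0
            for index_b, candidate in enumerate(list_b):
                current = score(name_a, candidate.replace(rev_filter, "", 1))
                if current > best_score:
                    best_index, best_score = index_b, current
            if best_index >= 0 and best_score > threshold:
                # Removing both items prevents either from being paired twice.
                pairs.append((list_a.pop(index_a), list_b.pop(best_index)))
            else:
                index_a += 1
        return pairs

    return pair


def auto_detect_pairs(list_a, list_b, fwd_filter, rev_filter):
    """Pass 1 pairs exact matches; pass 2 tries fuzzy matching on the rest."""
    exact = build_pairer(match_only_if_exact, 0.6)
    fuzzy = build_pairer(match_on_percent_of_lcs, 0.99)
    pairs = exact(list_a, list_b, fwd_filter, rev_filter)
    pairs += fuzzy(list_a, list_b, fwd_filter, rev_filter)
    return pairs
```

Running the exact pass first means well-named files are paired and removed before the fuzzy pass sees them, so the fuzzy pass only has to consider the leftovers.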
Key Types
interface HasName { name: string | null; }
type CommonFiltersType = "illumina" | "Rs" | "dot12s";
type GenericPair<T> = { forward: T; reverse: T; name: string; };
type AutoPairingResult<T> = { pairs: GenericPair<T>[]; unpaired: T[]; forwardFilter: string; reverseFilter: string; };
Vue Composables
usePairing.ts - useAutoPairing<T>(): Top-level composable used by wizard components. Wraps autoPairWithCommonFilters() and exposes reactive refs for pairs, unpaired, countPaired, countUnpaired, currentForwardFilter, currentReverseFilter. Provides the AutoPairing Vue component reference and an autoPair(selectedItems) function.
usePairingSummary.ts - usePairingSummary<T>(props): Lower-level composable used by AutoPairing.vue. Wraps splitIntoPairedAndUnpaired() and generates human-readable summary text. Differentiates messaging for list:paired (unpaired datasets excluded) vs list:paired_or_unpaired (unpaired datasets included).
Vue Components
AutoPairing.vue - The auto-pairing UI panel. Displays filter controls, auto-matched pairs list, and unmatched datasets. Operates in two modes:
- wizard mode: step in the ListWizard flow, “next” button advances
- modal mode: dialog with “Apply Auto Pairing” / “Cancel” buttons, used by PairedOrUnpairedListCollectionCreator
Props: elements, collectionType, forwardFilter, reverseFilter, removeExtensions, extensions, mode, showHid
Emits: on-apply, on-update, on-cancel
PairingFilterInputGroup.vue - Dropdown + two text inputs for selecting/customizing forward and reverse filter patterns. The dropdown offers the three COMMON_FILTERS presets plus a “Clear All Filtering” option. Custom regex can be typed directly into the inputs.
PairedOrUnpairedListCollectionCreator.vue - The main collection builder for list:paired and list:paired_or_unpaired types. Uses AG Grid to display paired and unpaired datasets. Supports manual pairing via click/drag, pair swapping, unpairing. Integrates auto-pairing as initial state.
PairedDatasetCellComponent.vue - AG Grid cell renderer for paired datasets. Shows forward/reverse indicators, swap/unpair/pair actions.
Extension Stripping (stripExtension.ts)
When removeExtensions is enabled, identifiers are stripped of file extensions before display. Handles compound extensions (.fastq.gz, .fastq.bz2) by first removing secondary extensions (.gz, .bz2, .tgz, .crai, .bai), then the primary extension. The useUpdateIdentifiersForRemoveExtensions composable manages toggling this on/off while preserving user edits.
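The two-step stripping order can be sketched as follows. This is a Python illustration of the idea rather than the logic in stripExtension.ts: the function name is hypothetical, and treating everything after the last remaining dot as the primary extension is an assumption.

```python
# Secondary extensions named above; removed first when present.
SECONDARY_EXTENSIONS = (".gz", ".bz2", ".tgz", ".crai", ".bai")


def strip_extension(name: str) -> str:
    """Strip one secondary extension (if any), then the primary extension."""
    for secondary in SECONDARY_EXTENSIONS:
        if name.endswith(secondary):
            name = name[: -len(secondary)]
            break
    stem, dot, _extension = name.rpartition(".")
    # Names without a dot, and dotfiles such as ".hidden", are left unchanged.
    return stem if dot and stem else name
```

With this ordering, sample.fastq.gz loses .gz first and .fastq second, which is why compound extensions collapse to the bare sample name.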
Integration Points
ListWizard.vue: Uses useAutoPairing() to auto-pair on initialization. Uses pair count to infer builder type (list vs list:paired). The auto-pairing step appears as a wizard page.
SampleSheetWizard.vue: Same pattern, extended to sample sheet workflows with URI-based elements.
Backend Implementation
Core Module (lib/galaxy/model/dataset_collections/auto_pairing.py)
Simpler than the frontend - uses only exact matching (no LCS fuzzy pass).
auto_pair(elements) -> AutoPairResponse[T]: Guesses filter type, splits elements, builds a PartialPair dict keyed by base name (filter stripped, extensions stripped via filename_to_element_identifier). Matches forward/reverse by identical base name. Elements that don’t pair up go to unpaired.
paired_element_list_identifier(forward, reverse) -> str: Generates pair name from two filenames. Used by workbook/fetch integration to auto-label pairs.
longest_prefix(s1, s2): Simple common prefix (not LCS like frontend).
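The exact-match strategy can be sketched as a dictionary keyed by base name. This is a simplified illustration, not the contents of auto_pairing.py: it works on plain filename strings, takes the filters as arguments instead of guessing them, uses a crude base_name helper as a stand-in for filename_to_element_identifier, and leaves duplicate forward or reverse files unpaired (an assumption).

```python
def base_name(name: str, filter_text: str) -> str:
    """Remove the filter and everything after the first dot (crude stand-in)."""
    return name.replace(filter_text, "", 1).split(".", 1)[0]


def auto_pair_exact(names, forward_filter, reverse_filter):
    """Pair names whose base names are identical; return (paired, unpaired)."""
    partial: dict[str, dict[str, str]] = {}
    unpaired: list[str] = []
    for name in names:
        if forward_filter in name:
            slot, filter_text = "forward", forward_filter
        elif reverse_filter in name:
            slot, filter_text = "reverse", reverse_filter
        else:
            unpaired.append(name)  # matches neither filter
            continue
        entry = partial.setdefault(base_name(name, filter_text), {})
        if slot in entry:
            unpaired.append(name)  # assumption: duplicates stay unpaired
        else:
            entry[slot] = name
    paired: dict[str, tuple[str, str]] = {}
    for base, entry in partial.items():
        if "forward" in entry and "reverse" in entry:
            paired[base] = (entry["forward"], entry["reverse"])
        else:
            unpaired.extend(entry.values())  # half-filled entries are unpaired
    return paired, unpaired


paired, unpaired = auto_pair_exact(
    ["a_R1.fastq", "a_R2.fastq", "b_R1.fastq", "notes.txt"], "_R1", "_R2"
)
```

A single dictionary pass keeps the matching linear in the number of files, which suits programmatic imports where names are expected to be regular.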
Data Structures
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass
class Pair(Generic[T]):
    name: str
    forward: T
    reverse: T

@dataclass
class AutoPairResponse(Generic[T]):
    paired: list[Pair[T]]
    unpaired: list[T]
Collection Types
lib/galaxy/model/dataset_collections/types/paired.py - PairedDatasetCollectionType: Collection with exactly two elements using identifiers "forward" and "reverse".
lib/galaxy/model/dataset_collections/types/paired_or_unpaired.py - PairedOrUnpairedDatasetCollectionType: Collection with 1-2 elements. If 2, uses "forward"/"reverse" identifiers. If 1, uses "unpaired" identifier.
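The identifier rule for these two collection types can be summarized in a few lines. The function below is a hypothetical helper written for illustration; it is not part of either type class and works on plain strings instead of dataset objects.

```python
def element_identifiers(datasets: list[str]) -> dict[str, str]:
    """Map one or two datasets to their element identifiers."""
    if len(datasets) == 2:
        forward, reverse = datasets
        return {"forward": forward, "reverse": reverse}
    if len(datasets) == 1:
        return {"unpaired": datasets[0]}
    raise ValueError("expected one or two datasets")
```

A paired collection only ever takes the two-element branch, while a paired_or_unpaired collection may take either.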
Workbook Integration (lib/galaxy/tools/fetch/workbooks.py)
_split_paired_data_if_needed() detects when workbook imports have two URI columns (paired data). Uses paired_element_list_identifier() to auto-generate pair names from the URLs.
Auto-Identifiers (lib/galaxy/model/dataset_collections/auto_identifiers.py)
filename_to_element_identifier(filename_or_uri) extracts base filenames and strips compression extensions. Used by the backend pairing code to normalize names before matching.
Adding a New Filter Pattern
- Add the pattern to COMMON_FILTERS in both:
  - client/src/components/Collections/pairing.ts
  - lib/galaxy/model/dataset_collections/auto_pairing.py
- Update guessInitialFilterType() in both files to count/detect the new pattern. Consider detection priority (more specific patterns should be checked before less specific ones to avoid false matches).
- Add test cases to both auto_pairing_spec.yml files.
- Run tests:
  - Frontend: npx vitest run client/src/components/Collections/pairing.test.ts
  - Backend: pytest test/unit/data/dataset_collections/test_auto_pairing.py
- The PairingFilterInputGroup.vue dropdown automatically picks up new COMMON_FILTERS entries.
Adding a New Test Case
Add an entry to both YAML spec files with the same content:
- client/src/components/Collections/auto_pairing_spec.yml
- lib/galaxy/model/dataset_collections/auto_pairing_spec.yml
Both test runners iterate over all spec entries, so no test code changes are needed.
Data Flow Summary
User selects datasets in history
  -> ListWizard / SampleSheetWizard calls useAutoPairing().autoPair()
    -> autoPairWithCommonFilters() guesses filter, splits, pairs
      -> guessInitialFilterType() counts pattern occurrences
      -> splitElementsByFilter() divides into forward/reverse lists
      -> autoDetectPairs() runs exact match then LCS match
      -> guessNameForPair() generates pair identifier via LCS
  -> AutoPairing.vue displays results, user adjusts filters
  -> PairedOrUnpairedListCollectionCreator shows AG Grid table
    -> User can manually pair/unpair/swap, edit identifiers
  -> Collection creation payload sent to API
    -> Backend DatasetCollectionManager.create() builds collection
      -> PairedDatasetCollectionType or PairedOrUnpairedDatasetCollectionType
         generates DatasetCollectionElements with forward/reverse/unpaired identifiers
Frontend vs Backend Differences
| Aspect | Frontend (TS) | Backend (Python) |
|---|---|---|
| Matching strategy | Two-pass: exact then LCS fuzzy | Single pass: exact only |
| Name generation | LCS (prefix + suffix concat) | Longest common prefix only |
| Extension handling | stripExtension() with toggle | filename_to_element_identifier() |
| Used by | Collection builder UI | Workbook/fetch imports, API |
The frontend is more sophisticated because it handles interactive use cases where users may have inconsistently named files. The backend handles programmatic cases where filenames tend to be more regular.