Reference for Galaxy’s Apply Rules DSL — the rule grammar consumed by __APPLY_RULES__ (see galaxy-collection-tools for the surrounding tool catalog and galaxy-collection-semantics for collection mapping/reduction semantics).
Key principle: rules transform collection metadata (identifiers, indices, tags) as tabular data; mapping operations turn the resulting columns back into collection structure.
Sources of truth in Galaxy:
- lib/galaxy/util/rules_dsl.py: rule implementation
- lib/galaxy/util/rules_dsl_spec.yml: test spec covering every rule type
- lib/galaxy/managers/collections.py: collection building from rules
- PR #5819: original implementation
This note is the consumer-facing companion to those files. Verify against the spec YAML when in doubt.
Rules DSL Architecture
Core Concepts
Data Model:
data: [[cell values]] # 2D array of strings (tabular data)
sources: [source objects] # Metadata for each row (identifiers, indices, tags)
Initial State Example:
# Input: list:paired with elements [sample1/forward, sample1/reverse, sample2/forward, sample2/reverse]
data = [[], [], [], []] # Empty rows, one per dataset
sources = [
{"identifiers": ["sample1", "forward"], "indices": [0, 0], "dataset": <hda>, "tags": []},
{"identifiers": ["sample1", "reverse"], "indices": [0, 1], "dataset": <hda>, "tags": []},
{"identifiers": ["sample2", "forward"], "indices": [1, 0], "dataset": <hda>, "tags": []},
{"identifiers": ["sample2", "reverse"], "indices": [1, 1], "dataset": <hda>, "tags": []},
]
Execution Flow:
- Collection metadata extracted to tabular format
- Rules applied sequentially to transform data
- Mapping operations convert transformed data to new collection
Example:
Input collection: list [i1, i2]
Initial state:
data: [["value1"], ["value2"]]
sources: [
{"identifiers": ["i1"], "indices": [0]},
{"identifiers": ["i2"], "indices": [1]}
]
After rules:
data: [["value1", "i1"], ["value2", "i2"]] # Added identifier column
After mapping:
Output collection: list [i1, i2]
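To make the data/sources split concrete, the sketch below flattens a nested collection into the two parallel structures described above. The nested-dict input format and the function name collection_to_table are assumptions made for illustration; they are not Galaxy's internal representation.

```python
def collection_to_table(collection, _ids=(), _idx=()):
    """Flatten a nested {identifier: child} dict into (data, sources).

    Hypothetical helper: any value that is not a dict stands in for a dataset.
    """
    data, sources = [], []
    for index, (identifier, child) in enumerate(collection.items()):
        ids = _ids + (identifier,)
        idx = _idx + (index,)
        if isinstance(child, dict):
            # Recurse into the nested level, carrying identifiers and indices along.
            sub_data, sub_sources = collection_to_table(child, ids, idx)
            data.extend(sub_data)
            sources.extend(sub_sources)
        else:
            data.append([])  # one empty row per dataset
            sources.append(
                {"identifiers": list(ids), "indices": list(idx), "dataset": child, "tags": []}
            )
    return data, sources


nested = {
    "sample1": {"forward": "hda1", "reverse": "hda2"},
    "sample2": {"forward": "hda3", "reverse": "hda4"},
}
data, sources = collection_to_table(nested)
print(data)                       # [[], [], [], []]
print(sources[1]["identifiers"])  # ['sample1', 'reverse']
print(sources[3]["indices"])      # [1, 1]
```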
Rule Operations
Rules are applied sequentially in the order specified. Each rule transforms the data table.
1. Column Addition Rules
add_column_basename
Purpose: Extract basename from file paths
Parameters:
target_column(int): Column containing paths
Example:
rules:
- type: add_column_basename
target_column: 0
Transformation:
Input: [["/path/to/moo.txt"], ["moo.txt"]]
Output: [["/path/to/moo.txt", "moo.txt"], ["moo.txt", "moo.txt"]]
Use cases:
- Extract filenames from full paths
- Create identifiers from uploaded file paths
- Normalize identifiers across different upload methods
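As a rough Python equivalent of the transformation above, the following sketch appends the basename of one column to every row. The function signature is an assumption for illustration, and posixpath.basename is used here only because it reproduces the documented output.

```python
import posixpath


def add_column_basename(data, target_column):
    # Append the last path component of the target column to each row.
    return [row + [posixpath.basename(row[target_column])] for row in data]


print(add_column_basename([["/path/to/moo.txt"], ["moo.txt"]], 0))
# [['/path/to/moo.txt', 'moo.txt'], ['moo.txt', 'moo.txt']]
```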
add_column_regex
Purpose: Capture regex groups or perform replacements
Parameters:
target_column(int): Column to process
expression(string): Regular expression pattern
replacement(string, optional): Replacement template with \1, \2 for groups
group_count(int, optional): Number of groups to capture as separate columns
allow_unmatched(bool, default: false): If false, errors on unmatched rows
Mode 1: Simple capture (default)
rules:
- type: add_column_regex
target_column: 0
expression: '(o)+'
Input: [["foo"], ["cow"]]
Output: [["foo", "oo"], ["cow", "o"]]
Mode 2: Replacement
rules:
- type: add_column_regex
target_column: 0
expression: '(o+)'
replacement: 'the os \1'
Input: [["foo"], ["cow"]]
Output: [["foo", "the os oo"], ["cow", "the os o"]]
Mode 3: Multiple groups
rules:
- type: add_column_regex
target_column: 0
expression: '.*(o)(o)'
group_count: 2
Input: [["foo"], ["boo"]]
Output: [["foo", "o", "o"], ["boo", "o", "o"]]
Mode 4: Allow unmatched
rules:
- type: add_column_regex
target_column: 0
expression: '(o)+'
allow_unmatched: true
Input: [["foo"], ["cow"], ["cat"]]
Output: [["foo", "oo"], ["cow", "o"], ["cat", ""]]
Use cases:
- Extract sample names from filenames (e.g., sample_(\w+)_R1.fastq)
- Parse structured identifiers (e.g., TCGA-(\w+)-(\d+))
- Clean up identifiers (remove prefixes/suffixes)
- Extract metadata embedded in filenames
Common patterns:
# Extract sample ID from "sample_123_R1.fastq"
expression: 'sample_(\w+)_R\d'
# Extract prefix before underscore
expression: '([^_]+)_.*'
# Extract everything before last dot
expression: '(.+)\.[^.]+$'
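The four modes can be summarized in one small function. This is a minimal sketch modeled on the documented inputs and outputs above, not on Galaxy's source: the use of re.search, Match.expand for the replacement template, and an empty string for unmatched rows are assumptions chosen because they reproduce those examples. Verify edge cases against the spec YAML.

```python
import re


def add_column_regex(data, target_column, expression, replacement=None,
                     group_count=None, allow_unmatched=False):
    regex = re.compile(expression)
    width = group_count or 1  # number of cells this rule appends per row
    out = []
    for row in data:
        match = regex.search(row[target_column])
        if match is None:
            if not allow_unmatched:
                raise ValueError(f"no match for {expression!r} in {row[target_column]!r}")
            new_cells = [""] * width              # Mode 4: empty cell(s) for unmatched rows
        elif replacement is not None:
            new_cells = [match.expand(replacement)]   # Mode 2: fill the template
        elif group_count:
            new_cells = [match.group(i + 1) for i in range(group_count)]  # Mode 3
        else:
            new_cells = [match.group(0)]          # Mode 1: the whole matched text
        out.append(row + new_cells)
    return out


print(add_column_regex([["foo"], ["cow"]], 0, r"(o)+"))
# [['foo', 'oo'], ['cow', 'o']]
print(add_column_regex([["foo"], ["cow"]], 0, r"(o+)", replacement=r"the os \1"))
# [['foo', 'the os oo'], ['cow', 'the os o']]
```

Note that in this reading Mode 1 returns the whole match rather than the first group, which is why "foo" yields "oo" for the pattern (o)+.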
add_column_substr
Purpose: Extract or remove fixed-length substrings
Parameters:
target_column(int): Column to process
substr_type(enum): Operation type
- keep_prefix: Keep first N characters
- keep_suffix: Keep last N characters
- drop_prefix: Remove first N characters
- drop_suffix: Remove last N characters
length(int): Number of characters
Examples:
# Keep first 2 characters
rules:
- type: add_column_substr
target_column: 0
substr_type: keep_prefix
length: 2
Input: [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "fo"], ["cow", "co"], ["ba", "ba"], ["d", "d"]]
# Drop last 2 characters
rules:
- type: add_column_substr
target_column: 0
substr_type: drop_suffix
length: 2
Input: [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "f"], ["cow", "c"], ["ba", ""], ["d", ""]]
Use cases:
- Remove common prefixes/suffixes
- Extract barcodes from fixed positions
- Truncate long identifiers
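The four substr_type values map naturally onto string slicing. The sketch below is an illustration under that assumption (the function signature is hypothetical); it reproduces both examples above, including the short-string cases.

```python
def add_column_substr(data, target_column, substr_type, length):
    def cut(value):
        if substr_type == "keep_prefix":
            return value[:length]
        if substr_type == "drop_prefix":
            return value[length:]
        # Suffix operations split at a point counted from the end of the string.
        split_at = max(len(value) - length, 0)
        if substr_type == "keep_suffix":
            return value[split_at:]
        if substr_type == "drop_suffix":
            return value[:split_at]
        raise ValueError(f"unknown substr_type: {substr_type}")

    return [row + [cut(row[target_column])] for row in data]


rows = [["foo"], ["cow"], ["ba"], ["d"]]
print(add_column_substr(rows, 0, "keep_prefix", 2))
# [['foo', 'fo'], ['cow', 'co'], ['ba', 'ba'], ['d', 'd']]
print(add_column_substr(rows, 0, "drop_suffix", 2))
# [['foo', 'f'], ['cow', 'c'], ['ba', ''], ['d', '']]
```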
add_column_rownum
Purpose: Add sequential row numbers
Parameters:
start(int): Starting number (0 or 1)
Example:
rules:
- type: add_column_rownum
start: 1
Input: [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "1"], ["cow", "2"], ["ba", "3"], ["d", "4"]]
Use cases:
- Create numerical identifiers
- Track original row order after sorting
- Generate replicate numbers
add_column_value
Purpose: Add constant value to all rows
Parameters:
value(string): Constant value
Example:
rules:
- type: add_column_value
value: "control"
Input: [["foo"], ["cow"]]
Output: [["foo", "control"], ["cow", "control"]]
Use cases:
- Add condition labels (treatment/control)
- Add constant metadata
- Create separator columns for concatenation
add_column_concatenate
Purpose: Combine two columns into one
Parameters:
target_column_0(int): First column
target_column_1(int): Second column
Example:
rules:
- type: add_column_concatenate
target_column_0: 0
target_column_1: 1
Input: [["sample", "001"], ["sample", "002"]]
Output: [["sample", "001", "sample001"], ["sample", "002", "sample002"]]
Use cases:
- Combine sample ID + replicate number
- Build hierarchical identifiers
- Create unique identifiers from multiple parts
Common pattern - add separator:
rules:
- type: add_column_value
value: "_"
- type: add_column_concatenate
target_column_0: 0
target_column_1: 2 # The "_" column
- type: add_column_concatenate
target_column_0: 3
target_column_1: 1 # Result + second original column
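Tracing the column indices by hand is the easiest way to get the separator pattern right. The following sketch replays the three rules on a two-column table; the helper names add_value and concatenate are illustrative stand-ins for add_column_value and add_column_concatenate.

```python
def add_value(data, value):
    # add_column_value: append the same constant to every row.
    return [row + [value] for row in data]


def concatenate(data, col_0, col_1):
    # add_column_concatenate: append col_0 + col_1 as a new column.
    return [row + [row[col_0] + row[col_1]] for row in data]


data = [["sample", "001"], ["sample", "002"]]
data = add_value(data, "_")      # column 2 holds "_"
data = concatenate(data, 0, 2)   # column 3 holds "sample_"
data = concatenate(data, 3, 1)   # column 4 holds "sample_001"
print(data[0])  # ['sample', '001', '_', 'sample_', 'sample_001']
```

Every rule appends its result on the right, so the final identifier always lives in the last column until remove_columns is applied.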
add_column_metadata
Purpose: Extract metadata from source objects
Parameters:
value(enum): Metadata type
- identifier0, identifier1, identifier2, …
- index0, index1, index2, …
- tags
Identifier extraction:
rules:
- type: add_column_metadata
value: identifier0 # Outermost identifier
Input: [["moo"], ["meow"], ["bark"]]
Sources: [{"identifiers": ["cow"]}, {"identifiers": ["cat"]}, {"identifiers": ["dog"]}]
Output: [["moo", "cow"], ["meow", "cat"], ["bark", "dog"]]
Multiple levels:
rules:
- type: add_column_metadata
value: identifier0 # Outer identifier
- type: add_column_metadata
value: identifier1 # Inner identifier
Sources: [
{"identifiers": ["sample1", "forward"]},
{"identifiers": ["sample1", "reverse"]}
]
Output: [["data", "sample1", "forward"], ["data", "sample1", "reverse"]]
Index extraction:
rules:
- type: add_column_metadata
value: index0
- type: add_column_metadata
value: index1
Sources: [
{"indices": [0, 0]}, # First sample, forward
{"indices": [0, 1]}, # First sample, reverse
{"indices": [1, 0]}, # Second sample, forward
{"indices": [1, 1]} # Second sample, reverse
]
Output: [
["samp1for", "0", "0"],
["samp1rev", "0", "1"],
["samp2for", "1", "0"],
["samp2rev", "1", "1"]
]
Tags extraction:
rules:
- type: add_column_metadata
value: tags
Sources: [
{"identifiers": ["cow"], "tags": ["farm"]},
{"identifiers": ["dog"], "tags": ["house", "firestation"]}
]
Output: [["moo", "farm"], ["bark", "firestation,house"]] # Sorted, comma-joined
Use cases:
- Access collection structure metadata
- Build identifiers from nested collections
- Use positional indices for numerical IDs
- Extract tags for grouping/filtering
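The lookup behind add_column_metadata can be sketched as a small dispatch on the value string. This is an illustrative reading of the examples above (indices rendered as strings, tags sorted and comma-joined); the function signature is an assumption.

```python
def add_column_metadata(data, sources, value):
    def cell(source):
        if value == "tags":
            return ",".join(sorted(source["tags"]))
        if value.startswith("identifier"):
            level = int(value[len("identifier"):])
            return source["identifiers"][level]
        if value.startswith("index"):
            level = int(value[len("index"):])
            return str(source["indices"][level])  # table cells are strings
        raise ValueError(f"unknown metadata value: {value}")

    # data and sources are parallel lists: row i is described by sources[i].
    return [row + [cell(source)] for row, source in zip(data, sources)]


sources = [
    {"identifiers": ["sample1", "forward"], "indices": [0, 0], "tags": ["farm"]},
    {"identifiers": ["sample1", "reverse"], "indices": [0, 1], "tags": ["house", "firestation"]},
]
data = [["a"], ["b"]]
print(add_column_metadata(data, sources, "identifier1"))  # [['a', 'forward'], ['b', 'reverse']]
print(add_column_metadata(data, sources, "index1"))       # [['a', '0'], ['b', '1']]
print(add_column_metadata(data, sources, "tags"))         # [['a', 'farm'], ['b', 'firestation,house']]
```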
add_column_group_tag_value
Purpose: Extract specific group tag value
Parameters:
value(string): Group tag name (e.g., “condition”, “type”)
default_value(string): Value if tag not present
Example:
rules:
- type: add_column_group_tag_value
value: condition
default_value: 'control'
Sources: [
{"tags": ["group:condition:treated"]},
{"tags": ["group:condition:control"]},
{"tags": []} # No condition tag
]
Output: [["data", "treated"], ["data", "control"], ["data", "control"]]
Multiple tags - first alphabetically wins:
rules:
- type: add_column_group_tag_value
value: where
default_value: 'barn'
Sources: [
{"tags": ["group:where:house", "group:where:firestation"]}
]
Output: [["data", "firestation"]] # "firestation" < "house" alphabetically
Use cases:
- Group samples by experimental condition
- Extract sample type (single-end/paired-end)
- Use tags for nested collection organization
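The tag lookup, including the "first alphabetically wins" tie-break and the default, can be sketched as follows. The function signature is an assumption; the group:name:value tag format is taken from the examples above.

```python
def add_column_group_tag_value(data, sources, value, default_value):
    prefix = f"group:{value}:"

    def cell(source):
        # Collect the values of all matching group tags, then take the smallest.
        matches = sorted(tag[len(prefix):] for tag in source["tags"] if tag.startswith(prefix))
        return matches[0] if matches else default_value

    return [row + [cell(source)] for row, source in zip(data, sources)]


sources = [
    {"tags": ["group:condition:treated"]},
    {"tags": []},
    {"tags": ["group:condition:house", "group:condition:firestation"]},
]
data = [["d1"], ["d2"], ["d3"]]
print(add_column_group_tag_value(data, sources, "condition", "control"))
# [['d1', 'treated'], ['d2', 'control'], ['d3', 'firestation']]
```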
add_column_from_sample_sheet_index
Purpose: Retrieve values from sample sheet columns
Parameters:
value(int): Sample sheet column index
Example:
rules:
- type: add_column_from_sample_sheet_index
value: 0
- type: add_column_from_sample_sheet_index
value: 1
Sources: [
{"columns": [0, 1]},
{"columns": [2, 3]}
]
Output: [["moo", 0, 1], ["cow", 2, 3]]
Use cases:
- Extract metadata from uploaded sample sheets
- Access additional columns beyond identifiers
- Incorporate external metadata
2. Filter Rules
Filters remove rows from the data table based on conditions.
add_filter_regex
Purpose: Keep/remove rows matching pattern
Parameters:
target_column(int): Column to test
expression(string): Regular expression
invert(bool, default: false): If true, keep non-matching rows
Keep matching:
rules:
- type: add_filter_regex
target_column: 0
expression: '(a+)'
invert: false
Input: [["a", "b", "c"], ["e", "f", "g"]]
Output: [["a", "b", "c"]]
Remove matching:
rules:
- type: add_filter_regex
target_column: 2
expression: '(c+)'
invert: true
Input: [["a", "b", "c"], ["e", "f", "g"]]
Output: [["e", "f", "g"]]
Use cases:
- Filter by sample name pattern
- Remove control samples
- Select specific file types
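The keep/remove behaviour of invert is easy to get backwards, so here is a minimal sketch that reproduces both examples above. It assumes that a filter drops the matching entry from sources together with the row, so the two lists stay parallel; the signature is illustrative.

```python
import re


def add_filter_regex(data, sources, target_column, expression, invert=False):
    regex = re.compile(expression)
    kept = [
        (row, source)
        for row, source in zip(data, sources)
        # Keep the row when "matched" differs from "invert".
        if (regex.search(row[target_column]) is not None) != invert
    ]
    return [row for row, _ in kept], [source for _, source in kept]


data = [["a", "b", "c"], ["e", "f", "g"]]
sources = ["source0", "source1"]  # placeholders for the per-row source objects
print(add_filter_regex(data, sources, 0, r"(a+)", invert=False))
# ([['a', 'b', 'c']], ['source0'])
print(add_filter_regex(data, sources, 2, r"(c+)", invert=True))
# ([['e', 'f', 'g']], ['source1'])
```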
add_filter_count
Purpose: Keep/remove first or last N rows
Parameters:
count(int): Number of rows
which(enum): first or last
invert(bool, default: false): If true, reverse filter
Remove first row:
rules:
- type: add_filter_count
count: 1
which: first
invert: false # Remove first, keep rest
Input: [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["e", "f", "g"], ["h", "i", "j"]]
Keep only last row:
rules:
- type: add_filter_count
count: 1
which: last
invert: true # Remove all but last
Input: [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["h", "i", "j"]]
Use cases:
- Remove header rows
- Skip first N samples
- Select specific replicates
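Unlike the regex filter, add_filter_count removes the selected window by default and keeps it when inverted. A minimal sketch consistent with the two examples above (the signature is an assumption):

```python
def add_filter_count(data, count, which, invert=False):
    total = len(data)

    def in_window(index):
        # True for the first/last `count` rows.
        return index < count if which == "first" else index >= total - count

    # invert=False drops the window; invert=True keeps only the window.
    return [row for index, row in enumerate(data) if in_window(index) == invert]


rows = [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
print(add_filter_count(rows, 1, "first", invert=False))  # [['e', 'f', 'g'], ['h', 'i', 'j']]
print(add_filter_count(rows, 1, "last", invert=True))    # [['h', 'i', 'j']]
```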
add_filter_empty
Purpose: Remove rows with empty cells
Parameters:
target_column(int): Column to check
invert(bool, default: false): If true, keep only empty
Remove empty:
rules:
- type: add_filter_empty
target_column: 0
invert: false
Input: [["", "b", "c"], ["a", "b", "c"]]
Output: [["a", "b", "c"]]
Use cases:
- Remove rows with missing identifiers
- Clean up sparse data
- Filter failed extractions
add_filter_matches
Purpose: Exact value matching (case-sensitive)
Parameters:
value(string): Exact value to match
target_column(int): Column to check
invert(bool, default: false): If true, keep non-matching
Example:
rules:
- type: add_filter_matches
value: "a"
target_column: 0
invert: false
Input: [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["a", "b", "c"]]
Important: Exact match only, no partial matches:
rules:
- type: add_filter_matches
value: "a"
target_column: 0
Input: [["a ", "b", "c"]] # Note space after "a"
Output: [] # No match - "a " != "a"
Use cases:
- Filter by specific sample ID
- Select exact condition matches
- Boolean filtering (match “true”/“false”)
add_filter_compare
Purpose: Numeric comparisons
Parameters:
target_column(int): Column with numeric values
value(number): Comparison value
compare_type(enum): less_than, less_than_equal, greater_than, or greater_than_equal
Example:
rules:
- type: add_filter_compare
target_column: 0
value: 13
compare_type: less_than
Input: [["1", "moo"], ["10", "cow"], ["13", "rat"], ["20", "dog"]]
Output: [["1", "moo"], ["10", "cow"]]
Use cases:
- Filter by quality scores
- Select samples by replicate number
- Threshold-based filtering
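Because table cells are strings, a numeric filter has to convert before comparing. The sketch below shows one way to do that; converting with float and the function signature are assumptions, and the example reproduces the less_than case above.

```python
import operator

COMPARATORS = {
    "less_than": operator.lt,
    "less_than_equal": operator.le,
    "greater_than": operator.gt,
    "greater_than_equal": operator.ge,
}


def add_filter_compare(data, target_column, value, compare_type):
    compare = COMPARATORS[compare_type]
    # Keep rows whose numeric cell satisfies the comparison against `value`.
    return [row for row in data if compare(float(row[target_column]), value)]


rows = [["1", "moo"], ["10", "cow"], ["13", "rat"], ["20", "dog"]]
print(add_filter_compare(rows, 0, 13, "less_than"))
# [['1', 'moo'], ['10', 'cow']]
print(add_filter_compare(rows, 0, 13, "greater_than_equal"))
# [['13', 'rat'], ['20', 'dog']]
```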
3. Structural Rules
remove_columns
Purpose: Delete specified columns
Parameters:
target_columns(list[int]): Column indices to remove
Example:
rules:
- type: remove_columns
target_columns: [0, 1]
Input: [["a", "b", "c"], ["e", "f", "g"]]
Output: [["c"], ["g"]]
Use cases:
- Clean up intermediate columns
- Remove temporary concatenation columns
- Keep only final identifier columns
sort
Purpose: Sort rows by column value
Parameters:
target_column(int): Column to sort by
numeric(bool): If true, numeric sort; if false, alphabetic
Alphabetic sort:
rules:
- type: sort
numeric: false
target_column: 0
Input: [["moo", "cow"], ["meow", "cat"], ["bark", "dog"]]
Output: [["bark", "dog"], ["meow", "cat"], ["moo", "cow"]]
Note: Case-sensitive, uppercase sorts before lowercase
Input: [["Dog"], ["cat"], ["cow"]]
Output: [["Dog"], ["cat"], ["cow"]] # "Dog" < "cat" < "cow"
Use cases:
- Alphabetize samples
- Order by numerical IDs
- Group similar identifiers together
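The difference between the two sort modes matters most for numeric identifiers stored as strings. The following sketch uses Python's built-in ordering as a stand-in (an assumption, though it agrees with the case-sensitivity note above) to show both behaviours.

```python
def sort_rows(data, target_column, numeric):
    if numeric:
        key = lambda row: float(row[target_column])
    else:
        key = lambda row: row[target_column]  # plain string comparison, case-sensitive
    return sorted(data, key=key)


print(sort_rows([["moo", "cow"], ["meow", "cat"], ["bark", "dog"]], 0, numeric=False))
# [['bark', 'dog'], ['meow', 'cat'], ['moo', 'cow']]
print(sort_rows([["cat"], ["Dog"], ["cow"]], 0, numeric=False))
# [['Dog'], ['cat'], ['cow']]
print(sort_rows([["10"], ["9"], ["2"]], 0, numeric=False))  # [['10'], ['2'], ['9']]
print(sort_rows([["10"], ["9"], ["2"]], 0, numeric=True))   # [['2'], ['9'], ['10']]
```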
swap_columns
Purpose: Exchange two column positions
Parameters:
target_column_0(int): First column
target_column_1(int): Second column
Example:
rules:
- type: swap_columns
target_column_0: 0
target_column_1: 1
Input: [["moo", "cow"], ["meow", "cat"]]
Output: [["cow", "moo"], ["cat", "meow"]]
Use cases:
- Reorder identifier columns for mapping
- Fix column order mistakes
- Prepare for specific mapping requirements
split_columns
Purpose: Split each row into two rows, one per column group
Parameters:
target_columns_0(list[int]): First column group
target_columns_1(list[int]): Second column group
Example:
rules:
- type: split_columns
target_columns_0: [0]
target_columns_1: [1]
Input: [["moo", "cow", "A"], ["meow", "cat", "B"]]
Output: [
["moo", "A"],
["cow", "A"],
["meow", "B"],
["cat", "B"]
]
How it works:
- Each input row is replaced by exactly two output rows
- The first row keeps the group 0 columns plus every column that is in neither group
- The second row keeps the group 1 columns plus every column that is in neither group
- Column order within each new row follows the original row
Use cases:
- Split paired-end data into forward/reverse
- Expand multiple samples per row
- Create all combinations for comparisons
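The example output above (two rows out for every row in) can be reproduced with the short sketch below. It is an illustration written to match that example, with an assumed signature, rather than a copy of Galaxy's implementation.

```python
def split_columns(data, target_columns_0, target_columns_1):
    out = []
    for row in data:
        row_0, row_1 = [], []
        for index, cell in enumerate(row):
            if index not in target_columns_1:
                row_0.append(cell)  # group 0 columns and shared columns
            if index not in target_columns_0:
                row_1.append(cell)  # group 1 columns and shared columns
        out.extend([row_0, row_1])
    return out


print(split_columns([["moo", "cow", "A"], ["meow", "cat", "B"]], [0], [1]))
# [['moo', 'A'], ['cow', 'A'], ['meow', 'B'], ['cat', 'B']]
```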
Mapping Operations
Mapping operations define how transformed data columns become collection structure. These are the final step that converts tabular data back to collections.
Available Mapping Types
list_identifiers
Purpose: Create list structure with specified nesting levels
Parameters:
columns(list[int]): Column indices for identifiers
Single column = simple list:
mapping:
- type: list_identifiers
columns: [0]
Data: [["sample1"], ["sample2"]]
Result: list [sample1, sample2]
Two columns = nested list:list:
mapping:
- type: list_identifiers
columns: [0, 1]
Data: [["group1", "s1"], ["group1", "s2"], ["group2", "s3"]]
Result: list:list [
group1 → [s1, s2],
group2 → [s3]
]
Three columns = list:list:list:
mapping:
- type: list_identifiers
columns: [0, 1, 2]
Nesting logic:
- Column 0 = outermost identifier
- Column 1 = next level identifier
- Column 2 = innermost identifier
- Groups rows by matching outer identifiers
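The nesting logic amounts to walking the identifier columns from outermost to innermost and grouping rows that share a prefix. A minimal sketch, using nested dicts as a stand-in for collections (the function name and the leaves argument are assumptions):

```python
def nest_by_columns(data, columns, leaves):
    """Build a nested dict keyed by the identifier columns, outermost first."""
    tree = {}
    for row, leaf in zip(data, leaves):
        node = tree
        for column in columns[:-1]:
            # Rows with the same outer identifier share one sub-dict.
            node = node.setdefault(row[column], {})
        node[row[columns[-1]]] = leaf
    return tree


data = [["group1", "s1"], ["group1", "s2"], ["group2", "s3"]]
print(nest_by_columns(data, [0, 1], ["d1", "d2", "d3"]))
# {'group1': {'s1': 'd1', 's2': 'd2'}, 'group2': {'s3': 'd3'}}
print(nest_by_columns(data, [1], ["d1", "d2", "d3"]))
# {'s1': 'd1', 's2': 'd2', 's3': 'd3'}
```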
paired_identifier
Purpose: Add paired collection level
Parameters:
columns(list[int]): Single column with paired identifier
Valid identifier values:
forward, f, 1, R1 → becomes forward
reverse, r, 2, R2 → becomes reverse
Simple paired:
mapping:
- type: paired_identifier
columns: [0]
Data: [["forward"], ["reverse"]]
Result: paired {forward, reverse}
Combined with list:
mapping:
- type: list_identifiers
columns: [0]
- type: paired_identifier
columns: [1]
Data: [
["sample1", "forward"],
["sample1", "reverse"],
["sample2", "forward"],
["sample2", "reverse"]
]
Result: list:paired [
sample1 → {forward, reverse},
sample2 → {forward, reverse}
]
paired_or_unpaired_identifier
Purpose: Add paired_or_unpaired collection level (allows unpaired single datasets)
Parameters:
columns(list[int]): Single column with paired/unpaired identifier
Valid identifier values:
- All paired values above, plus:
unpaired, u → becomes unpaired
Example:
mapping:
- type: list_identifiers
columns: [0]
- type: paired_or_unpaired_identifier
columns: [1]
Note: If a sample has only forward and no reverse, it becomes unpaired automatically.
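The accepted identifier values can be expressed as a lookup table. The sketch below covers both paired_identifier and paired_or_unpaired_identifier; exact, case-sensitive matching and the error on unknown values are assumptions, since the list above only names the accepted spellings.

```python
PAIRED = {
    "forward": "forward", "f": "forward", "1": "forward", "R1": "forward",
    "reverse": "reverse", "r": "reverse", "2": "reverse", "R2": "reverse",
}
UNPAIRED = {"unpaired": "unpaired", "u": "unpaired"}


def normalize_pair_identifier(value, allow_unpaired=False):
    # allow_unpaired=True models paired_or_unpaired_identifier.
    table = {**PAIRED, **UNPAIRED} if allow_unpaired else PAIRED
    if value not in table:
        raise ValueError(f"not a valid paired identifier: {value!r}")
    return table[value]


print(normalize_pair_identifier("R1"))                      # forward
print(normalize_pair_identifier("2"))                       # reverse
print(normalize_pair_identifier("u", allow_unpaired=True))  # unpaired
```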
tags
Purpose: Apply tags to collection elements
Parameters:
columns(list[int]): Columns containing tag values
Example:
mapping:
- type: list_identifiers
columns: [0]
- type: tags
columns: [1]
Data: [["sample1", "replicate1"], ["sample2", "replicate2"]]
Result: list with tags [
sample1 (tags: ["replicate1"]),
sample2 (tags: ["replicate2"])
]
group_tags
Purpose: Apply group tags (format: group:name:value)
Parameters:
columns(list[int]): Columns containing group tag values
Example:
mapping:
- type: list_identifiers
columns: [1, 0] # Group by column 1, element ID column 0
- type: group_tags
columns: [1] # Apply as group tag
Data: [["s1", "treated"], ["s2", "control"]]
Result: list:list with group tags [
treated → [s1 (tags: ["group:treated"])],
control → [s2 (tags: ["group:control"])]
]
Collection Type Determination
The output collection type is determined solely by the mapping:
# From RuleSet.collection_type property:
list_columns = mapping_as_dict.get("list_identifiers", {"columns": []})["columns"]
collection_type = ":".join("list" for c in list_columns)
if "paired_identifier" in mapping_as_dict:
collection_type += ":paired" if collection_type else "paired"
if "paired_or_unpaired_identifier" in mapping_as_dict:
collection_type += ":paired_or_unpaired" if collection_type else "paired_or_unpaired"
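For experimenting outside Galaxy, the same logic can be wrapped in a standalone function. The wrapper below is a sketch: the function name and the way mapping_as_dict is built from the mapping list (keyed by each entry's type) are assumptions.

```python
def collection_type(mapping):
    # Index the mapping entries by their "type" field.
    mapping_as_dict = {entry["type"]: entry for entry in mapping}
    list_columns = mapping_as_dict.get("list_identifiers", {"columns": []})["columns"]
    result = ":".join("list" for _ in list_columns)
    if "paired_identifier" in mapping_as_dict:
        result += ":paired" if result else "paired"
    if "paired_or_unpaired_identifier" in mapping_as_dict:
        result += ":paired_or_unpaired" if result else "paired_or_unpaired"
    return result


print(collection_type([{"type": "list_identifiers", "columns": [0, 1]}]))  # list:list
print(collection_type([
    {"type": "list_identifiers", "columns": [0]},
    {"type": "paired_identifier", "columns": [1]},
]))  # list:paired
print(collection_type([{"type": "paired_identifier", "columns": [0]}]))    # paired
```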
Examples:
list_identifiers: [0] → list
list_identifiers: [0, 1] → list:list
list_identifiers: [0] + paired_identifier: [1] → list:paired
list_identifiers: [0, 1] + paired_identifier: [2] → list:list:paired
Complete Example: list:record to list:paired
This example demonstrates complex transformation combining multiple rule types:
Goal: Convert a list:record collection whose records have “mother”, “father”, and “child” elements into list:paired, mapping “mother” to “forward” and “child” to “reverse” and dropping “father”.
rules:
- type: add_column_metadata
value: identifier0 # Sample identifier
- type: add_column_metadata
value: identifier1 # Record type (mother/father/child)
- type: add_column_regex
target_column: 2
expression: 'mother'
replacement: 'forward'
allow_unmatched: true # Leaves others as ""
- type: add_column_regex
target_column: 2
expression: 'child'
replacement: 'reverse'
allow_unmatched: true
- type: add_column_concatenate
target_column_0: 3 # Result of first regex
target_column_1: 4 # Result of second regex
- type: add_filter_empty
target_column: 5 # Remove rows that didn't match (father)
invert: false
- type: remove_columns
target_columns: [2, 3, 4] # Clean up intermediate columns
mapping:
- type: list_identifiers
columns: [1] # Sample ID
- type: paired_identifier
columns: [2] # forward/reverse
Transformation steps:
Initial:
data: [["el1"], ["el2"], ["el3"]]
sources: [
{"identifiers": ["samp1", "mother"]},
{"identifiers": ["samp1", "father"]},
{"identifiers": ["samp1", "child"]}
]
After add_column_metadata (identifier0, identifier1):
[["el1", "samp1", "mother"],
["el2", "samp1", "father"],
["el3", "samp1", "child"]]
After first regex (mother → forward):
[["el1", "samp1", "mother", "forward"],
["el2", "samp1", "father", ""],
["el3", "samp1", "child", ""]]
After second regex (child → reverse):
[["el1", "samp1", "mother", "forward", ""],
["el2", "samp1", "father", "", ""],
["el3", "samp1", "child", "", "reverse"]]
After concatenate (cols 3+4):
[["el1", "samp1", "mother", "forward", "", "forward"],
["el2", "samp1", "father", "", "", ""],
["el3", "samp1", "child", "", "reverse", "reverse"]]
After filter empty (col 5):
[["el1", "samp1", "mother", "forward", "", "forward"],
["el3", "samp1", "child", "", "reverse", "reverse"]]
After remove_columns [2, 3, 4]:
[["el1", "samp1", "forward"],
["el3", "samp1", "reverse"]]
Final mapping (sample ID from column 1, forward/reverse from column 2):
Result: list:paired [
samp1 → {forward, reverse}
]
Rule Composition Patterns
Pattern 1: Extract and Flatten
Goal: Flatten list:paired → list with combined identifiers
rules:
- type: add_column_metadata
value: identifier0 # Outer ID
- type: add_column_metadata
value: identifier1 # Pair ID (forward/reverse)
- type: add_column_concatenate
target_column_0: 0
target_column_1: 1 # Combine them (result lands in column 2)
mapping:
- type: list_identifiers
columns: [2] # Use concatenated column
Pattern 2: Group by Tag
Goal: Reorganize by tag value into nested structure
rules:
- type: add_column_metadata
value: identifier0
- type: add_column_group_tag_value
value: condition # Extract "condition" tag
default_value: "unassigned"
mapping:
- type: list_identifiers
columns: [1, 0] # Group by condition, then sample ID
- type: group_tags
columns: [1] # Apply as group tags
Pattern 3: Filter and Sort
Goal: Select subset and alphabetize
rules:
- type: add_column_metadata
value: identifier0
- type: add_filter_regex
target_column: 0
expression: '^control_' # Only controls
invert: false
- type: sort
numeric: false
target_column: 0
mapping:
- type: list_identifiers
columns: [0]
Pattern 4: Parse Filename Structure
Goal: Extract sample info from “sample_123_R1.fastq.gz” format
rules:
- type: add_column_metadata
value: identifier0 # Original filename
- type: add_column_regex
target_column: 0
expression: 'sample_(\w+)_R(\d)'
group_count: 2 # Sample ID and read number
- type: add_column_value
value: "_R"
- type: add_column_concatenate
target_column_0: 3
target_column_1: 2 # "_R" + "1" = "_R1"
- type: add_column_concatenate
target_column_0: 1
target_column_1: 4 # "123" + "_R1" = "123_R1"
- type: remove_columns
target_columns: [0, 1, 2, 3, 4] # Keep only final identifier (column 5)
mapping:
- type: list_identifiers
columns: [0]
Pattern 5: Create Paired from Separate Lists
Goal: Combine separate forward/reverse lists into paired
Assumption: Files named like sample1_R1.fastq, sample1_R2.fastq
rules:
- type: add_column_metadata
value: identifier0
- type: add_column_regex
target_column: 0
expression: '(.+)_R([12])'
group_count: 2 # Sample name and read number
- type: add_column_regex
target_column: 2
expression: '1'
replacement: 'forward'
allow_unmatched: true
- type: add_column_regex
target_column: 2
expression: '2'
replacement: 'reverse'
allow_unmatched: true
- type: add_column_concatenate
target_column_0: 3
target_column_1: 4
- type: sort
numeric: false
target_column: 1 # Ensure pairs adjacent
- type: remove_columns
target_columns: [0, 2, 3, 4]
mapping:
- type: list_identifiers
columns: [0] # Sample ID
- type: paired_identifier
columns: [1] # forward/reverse
Best Practices
1. Plan Column Layout
Before writing rules, sketch the transformations:
Col 0: Original identifier
Col 1: Extracted sample ID (regex)
Col 2: Extracted replicate (regex)
Col 3: Separator "_"
Col 4: Concatenate 1+3
Col 5: Concatenate 4+2 (final identifier; remove cols 0-4 afterwards)
2. Test Incrementally
Add rules one at a time and verify output:
- Start with metadata extraction
- Add one transformation
- Check result
- Continue
3. Use allow_unmatched Carefully
Only use when genuinely optional:
# BAD - silently fails to extract
- type: add_column_regex
expression: 'wrong_pattern'
allow_unmatched: true
# GOOD - errors if pattern doesn't match
- type: add_column_regex
expression: 'expected_pattern'
allow_unmatched: false
4. Remove Intermediate Columns
Clean up before mapping:
rules:
- type: add_column_metadata
value: identifier0
# ... many transformations ...
- type: remove_columns
target_columns: [0, 2, 3] # Remove temp columns
mapping:
- type: list_identifiers
columns: [0] # Only final column remains
5. Validate with Filters
Use filters to ensure data quality:
rules:
- type: add_column_regex
target_column: 0
expression: 'pattern'
allow_unmatched: true # Unmatched rows get an empty cell instead of an error
- type: add_filter_empty
target_column: 1
invert: false # Drop the rows that did not match
6. Document Complex Rules
Add comments explaining logic:
rules:
# Extract sample ID from filename "sample_123_R1.fastq"
- type: add_column_regex
target_column: 0
expression: 'sample_(\w+)_R\d'
# Remove original filename column
- type: remove_columns
target_columns: [0]
Common Pitfalls
Pitfall 1: Column Indices Shift
Problem: After removing columns, indices change
# WRONG
rules:
- type: remove_columns
target_columns: [0]
- type: add_column_regex
target_column: 1 # This is now wrong! Column 1 became 0
Solution: Remove columns last, or recalculate indices
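The shift is easy to see on a single row. The sketch below uses a hypothetical remove_columns helper to show where each value ends up after the removal.

```python
def remove_columns(data, target_columns):
    drop = set(target_columns)
    return [[cell for index, cell in enumerate(row) if index not in drop] for row in data]


before = [["file.txt", "sample1", "R1"]]
after = remove_columns(before, [0])
print(before[0].index("sample1"))  # 1
print(after[0].index("sample1"))   # 0  (every later column moved one to the left)
print(after)                       # [['sample1', 'R1']]
```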
Pitfall 2: Forgetting Invert Logic
Problem: Confusion about filter invert
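# Keep matching rows (remove non-matching)
- type: add_filter_regex
expression: 'sample_'
invert: false # FALSE means "keep matching"
# Remove matching rows (keep non-matching)
- type: add_filter_regex
expression: 'control_'
invert: true # TRUE means "remove matching"
Clearer thinking: for add_filter_regex and add_filter_matches, invert: false = “keep matches”, invert: true = “remove matches”. add_filter_count and add_filter_empty work the other way around: invert: false removes the selected rows (first/last N, or empty), and invert: true keeps only those rows.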
Pitfall 3: Regex Escaping
Problem: Special regex characters not escaped
# WRONG - . matches any character
expression: 'file.fastq'
# RIGHT
expression: 'file\.fastq'
# For literal parentheses
expression: '\(sample\)'
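When a literal string is long or comes from elsewhere, Python's re.escape can produce the escaped pattern for you. This is a convenience sketch; it assumes the rule expression is interpreted with Python-compatible regex syntax, so check the result before pasting it into a rule.

```python
import re

literal = "file.fastq"
unescaped = re.compile(literal)
escaped = re.compile(re.escape(literal))

print(bool(unescaped.search("fileXfastq")))  # True: "." matched "X"
print(bool(escaped.search("fileXfastq")))    # False: only a literal "." matches
print(bool(escaped.search("file.fastq")))    # True
```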
Pitfall 4: Case Sensitivity
Problem: Filters are case-sensitive
# Doesn't match "Sample1"
- type: add_filter_matches
value: "sample1"
target_column: 0
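Solution: The add_filter_matches parameters listed above include no case-insensitive option. Use add_filter_regex with a pattern that accepts both cases, for example '^[Ss]ample1$'.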
Pitfall 5: Empty Sources After Filtering
Problem: All rows filtered out
rules:
- type: add_filter_regex
expression: 'nonexistent'
invert: false
# Result: Empty collection!
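Solution: Test filters on a small collection first and confirm that the expression matches at least one row. Note that allow_unmatched belongs to add_column_regex and has no effect on filters.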
When to Use / When NOT to Use Apply Rules
When to Use Apply Rules
- Complex identifier parsing (multiple regex extractions)
- Tag-based restructuring (group by experimental condition)
- Conditional filtering combined with restructuring
- Structure transformations not covered by simple tools
- Multiple transformations needed in one step
When NOT to Use Apply Rules
| Operation | Use This Instead | Why |
|---|---|---|
| Simple filtering | __FILTER_FROM_FILE__ | Simpler, clearer intent |
| Basic flattening | __FLATTEN__ | One-step operation |
| Sort collection | __SORTLIST__ | Dedicated tool |
| Extract element | __EXTRACT_DATASET__ | Direct operation |
| Zip two lists | __ZIP_COLLECTION__ | Simpler syntax |
| Unzip paired | __UNZIP_COLLECTION__ | Straightforward |
| Relabel identifiers | __RELABEL_FROM_FILE__ | If mapping from file |
Comparison Table
| Operation | Simple Tool | When to use Apply Rules instead |
|---|---|---|
| Filter | Filter Collection | Need to filter on derived metadata, combine with restructuring |
| Flatten | Flatten Collection | Need control over identifier format, filter simultaneously |
| Relabel | Relabel Identifiers | Need regex-based transformation, derive from existing metadata |
| Sort | Sort Collection | Need to sort by derived values, combine with other operations |
| Restructure | N/A | Full control over nesting structure from any metadata |
Key Insight: Apply Rules is the tool of choice when:
- Multiple transformations needed in one step
- Restructuring based on metadata (tags, identifier patterns)
- Complex identifier manipulation required
- Standard tools don’t cover the use case
Use Case Examples
Use Case 1: Standard Paired-End RNA-seq
Files: sample1_R1.fastq.gz, sample1_R2.fastq.gz, sample2_R1.fastq.gz, sample2_R2.fastq.gz
Goal: Create list:paired collection
rules:
- type: add_column_metadata
value: identifier0
- type: add_column_regex
target_column: 0
expression: '(.+)_R([12])\.fastq\.gz'
group_count: 2
- type: add_column_regex
target_column: 2
expression: '1'
replacement: 'forward'
allow_unmatched: true
- type: add_column_regex
target_column: 2
expression: '2'
replacement: 'reverse'
allow_unmatched: true
- type: add_column_concatenate
target_column_0: 3
target_column_1: 4
- type: sort
target_column: 1
numeric: false
- type: remove_columns
target_columns: [0, 2, 3, 4]
mapping:
- type: list_identifiers
columns: [0]
- type: paired_identifier
columns: [1]
Use Case 2: Remove Control Samples
Goal: Filter out samples starting with “control_“
rules:
- type: add_column_metadata
value: identifier0
- type: add_filter_regex
target_column: 0
expression: '^control_'
invert: true # Remove matches = keep non-controls
mapping:
- type: list_identifiers
columns: [0]
Use Case 3: Group by Treatment Condition
Goal: Reorganize by “group:condition:*” tag into nested list
rules:
- type: add_column_metadata
value: identifier0
- type: add_column_group_tag_value
value: condition
default_value: 'unassigned'
mapping:
- type: list_identifiers
columns: [1, 0] # Group by condition, then sample
- type: group_tags
columns: [1]
Use Case 4: Filter by Quality Score Threshold
Assumption: Quality score in sample name like “sample_123_q95”
Goal: Keep only samples with quality >= 90
rules:
- type: add_column_metadata
value: identifier0
- type: add_column_regex
target_column: 0
expression: 'sample_\w+_q(\d+)'
group_count: 1 # Capture only the score; without this the whole match is added
- type: add_filter_compare
target_column: 1
value: 90
compare_type: greater_than_equal
- type: remove_columns
target_columns: [1]
mapping:
- type: list_identifiers
columns: [0]
Use Case 5: Replicate Structure
Files: treatment_rep1, treatment_rep2, control_rep1, control_rep2
Goal: Create list:list [treatment → [rep1, rep2], control → [rep1, rep2]]
rules:
- type: add_column_metadata
value: identifier0
- type: add_column_regex
target_column: 0
expression: '(.+)_(rep\d+)'
group_count: 2
- type: sort
target_column: 1
numeric: false
- type: remove_columns
target_columns: [0]
mapping:
- type: list_identifiers
columns: [0, 1] # Condition, then replicate
API Usage
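import requests

# Placeholders (assumptions): fill in for your own server, history, and collection.
galaxy_url, api_key = "https://galaxy.example.org", "<api key>"
history_id, collection_id = "<history id>", "<hdca id>"
inputs = {
    "input": {"src": "hdca", "id": collection_id},
    "rules": {
        "rules": [...],    # replace with the rule list
        "mapping": [...],  # replace with the mapping list
    },
}
payload = {"tool_id": "__APPLY_RULES__", "history_id": history_id, "inputs": inputs}
response = requests.post(f"{galaxy_url}/api/tools", json=payload, headers={"x-api-key": api_key})
response.raise_for_status()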