Galaxy Apply Rules DSL

Reference for Galaxy’s Apply Rules DSL — the rule grammar consumed by __APPLY_RULES__ (see galaxy-collection-tools for the surrounding tool catalog and galaxy-collection-semantics for collection mapping/reduction semantics).

Key principle: rules transform collection metadata (identifiers, indices, tags) as tabular data; mapping operations turn the resulting columns back into collection structure.

Sources of truth in Galaxy:

lib/galaxy/util/rules_dsl.py — rule implementation
lib/galaxy/util/rules_dsl_spec.yml — test spec covering every rule type
lib/galaxy/managers/collections.py — collection building from rules
PR #5819 — original implementation

This note is the consumer-facing companion to those files. Verify against the spec YAML when in doubt.

Rules DSL Architecture

Core Concepts

Data Model:

data: [[cell values]]      # 2D array of strings (tabular data)
sources: [source objects]   # Metadata for each row (identifiers, indices, tags)

Initial State Example:

# Input: list:paired with elements [sample1/forward, sample1/reverse, sample2/forward, sample2/reverse]

data = [[], [], [], []]  # Empty rows, one per dataset
sources = [
    {"identifiers": ["sample1", "forward"], "indices": [0, 0], "dataset": <hda>, "tags": []},
    {"identifiers": ["sample1", "reverse"], "indices": [0, 1], "dataset": <hda>, "tags": []},
    {"identifiers": ["sample2", "forward"], "indices": [1, 0], "dataset": <hda>, "tags": []},
    {"identifiers": ["sample2", "reverse"], "indices": [1, 1], "dataset": <hda>, "tags": []},
]
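
The initial state shown above can be reproduced with a short Python sketch. This is an illustration only, not Galaxy's code: the function name `initial_state` and the nested-dict input format are hypothetical, and plain strings stand in for the dataset objects.

```python
def initial_state(collection, identifiers=(), indices=()):
    """Flatten a nested {identifier: child} mapping into (data, sources).

    Any non-dict value is treated as a leaf and stands in for a dataset.
    """
    data, sources = [], []
    for index, (identifier, child) in enumerate(collection.items()):
        ids = list(identifiers) + [identifier]
        idx = list(indices) + [index]
        if isinstance(child, dict):
            child_data, child_sources = initial_state(child, ids, idx)
            data.extend(child_data)
            sources.extend(child_sources)
        else:
            data.append([])  # one empty row per dataset
            sources.append(
                {"identifiers": ids, "indices": idx, "dataset": child, "tags": []}
            )
    return data, sources


collection = {
    "sample1": {"forward": "hda1", "reverse": "hda2"},
    "sample2": {"forward": "hda3", "reverse": "hda4"},
}
data, sources = initial_state(collection)
print(sources[1]["identifiers"], sources[1]["indices"])
```

Each leaf contributes one empty row and one source entry, so rows and sources stay aligned by position.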

Execution Flow:

Collection metadata extracted to tabular format
Rules applied sequentially to transform data
Mapping operations convert transformed data to new collection

Example:

Input collection: list [i1, i2]

Initial state:
  data: [["value1"], ["value2"]]
  sources: [
    {"identifiers": ["i1"], "indices": [0]},
    {"identifiers": ["i2"], "indices": [1]}
  ]

After rules:
  data: [["value1", "i1"], ["value2", "i2"]]  # Added identifier column

After mapping:
  Output collection: list [i1, i2]

Rule Operations

Rules are applied sequentially in the order specified. Each rule transforms the data table.

1. Column Addition Rules

add_column_basename

Purpose: Extract basename from file paths

Parameters:

target_column (int): Column containing paths

Example:

rules:
  - type: add_column_basename
    target_column: 0

Transformation:

Input:  [["/path/to/moo.txt"], ["moo.txt"]]
Output: [["/path/to/moo.txt", "moo.txt"], ["moo.txt", "moo.txt"]]

Use cases:

Extract filenames from full paths
Create identifiers from uploaded file paths
Normalize identifiers across different upload methods
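
As a rough model of this rule, the Python sketch below appends the basename of the target cell to every row. It assumes POSIX-style `/` separators, and the function is a hypothetical stand-in rather than Galaxy's implementation.

```python
import posixpath


def add_column_basename(data, target_column):
    """Append the basename of the target cell to each row."""
    return [row + [posixpath.basename(row[target_column])] for row in data]


print(add_column_basename([["/path/to/moo.txt"], ["moo.txt"]], 0))
```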

add_column_regex

Purpose: Capture regex groups or perform replacements

Parameters:

target_column (int): Column to process
expression (string): Regular expression pattern
replacement (string, optional): Replacement template with \1, \2 for groups
group_count (int, optional): Number of groups to capture as separate columns
allow_unmatched (bool, default: false): If false, errors on unmatched rows

Mode 1: Simple capture (default)

rules:
  - type: add_column_regex
    target_column: 0
    expression: '(o)+'

Input:  [["foo"], ["cow"]]
Output: [["foo", "oo"], ["cow", "o"]]

Mode 2: Replacement

rules:
  - type: add_column_regex
    target_column: 0
    expression: '(o+)'
    replacement: 'the os \1'

Input:  [["foo"], ["cow"]]
Output: [["foo", "the os oo"], ["cow", "the os o"]]

Mode 3: Multiple groups

rules:
  - type: add_column_regex
    target_column: 0
    expression: '.*(o)(o)'
    group_count: 2

Input:  [["foo"], ["boo"]]
Output: [["foo", "o", "o"], ["boo", "o", "o"]]

Mode 4: Allow unmatched

rules:
  - type: add_column_regex
    target_column: 0
    expression: '(o)+'
    allow_unmatched: true

Input:  [["foo"], ["cow"], ["cat"]]
Output: [["foo", "oo"], ["cow", "o"], ["cat", ""]]

Use cases:

Extract sample names from filenames (e.g., sample_(\w+)_R1.fastq)
Parse structured identifiers (e.g., TCGA-(\w+)-(\d+))
Clean up identifiers (remove prefixes/suffixes)
Extract metadata embedded in filenames

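Common patterns (as in Mode 1, simple capture returns the whole match; set group_count to get only the captured groups):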

# Extract sample ID from "sample_123_R1.fastq"
expression: 'sample_(\w+)_R\d'

# Extract prefix before underscore
expression: '([^_]+)_.*'

# Extract everything before last dot
expression: '(.+)\.[^.]+$'
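
The four modes can be modelled with Python's `re` module. The sketch below is an approximation under stated assumptions, not Galaxy's implementation: the function is hypothetical, simple capture is taken to return the whole match (inferred from the Mode 1 output), and unmatched rows are padded with empty strings when `allow_unmatched` is set.

```python
import re


def add_column_regex(data, target_column, expression, replacement=None,
                     group_count=None, allow_unmatched=False):
    """Append one or more columns derived from a regex match."""
    regex = re.compile(expression)
    result = []
    for row in data:
        match = regex.search(row[target_column])
        if match is None:
            if not allow_unmatched:
                raise ValueError(f"{expression!r} did not match {row[target_column]!r}")
            new_cells = [""] * (group_count or 1)   # assumed padding
        elif replacement is not None:
            new_cells = [match.expand(replacement)]  # Mode 2
        elif group_count:
            new_cells = list(match.groups())[:group_count]  # Mode 3
        else:
            new_cells = [match.group(0)]             # Mode 1: whole match
        result.append(row + new_cells)
    return result


print(add_column_regex([["foo"], ["cow"]], 0, "(o+)", replacement=r"the os \1"))
```

Running the same function over the Mode 1, 3 and 4 inputs is a quick way to check a pattern before putting it into a rule set.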

add_column_substr

Purpose: Extract or remove fixed-length substrings

Parameters:

target_column (int): Column to process
substr_type (enum): Operation type
- keep_prefix: Keep first N characters
- keep_suffix: Keep last N characters
- drop_prefix: Remove first N characters
- drop_suffix: Remove last N characters
length (int): Number of characters

Examples:

# Keep first 2 characters
rules:
  - type: add_column_substr
    target_column: 0
    substr_type: keep_prefix
    length: 2

Input:  [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "fo"], ["cow", "co"], ["ba", "ba"], ["d", "d"]]

# Drop last 2 characters
rules:
  - type: add_column_substr
    target_column: 0
    substr_type: drop_suffix
    length: 2

Input:  [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "f"], ["cow", "c"], ["ba", ""], ["d", ""]]

Use cases:

Remove common prefixes/suffixes
Extract barcodes from fixed positions
Truncate long identifiers
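
The four `substr_type` values map naturally onto string slicing. The following Python sketch is a hypothetical model of the rule (not Galaxy's code) that reproduces the examples above, including the short-string cases.

```python
def add_column_substr(data, target_column, substr_type, length):
    """Append a prefix/suffix slice of the target cell to each row."""
    def transform(value):
        # Index where the last `length` characters begin (never negative).
        keep = max(len(value) - length, 0)
        if substr_type == "keep_prefix":
            return value[:length]
        if substr_type == "keep_suffix":
            return value[keep:]
        if substr_type == "drop_prefix":
            return value[length:]
        if substr_type == "drop_suffix":
            return value[:keep]
        raise ValueError(f"unknown substr_type: {substr_type}")

    return [row + [transform(row[target_column])] for row in data]


rows = [["foo"], ["cow"], ["ba"], ["d"]]
print(add_column_substr(rows, 0, "drop_suffix", 2))
```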

add_column_rownum

Purpose: Add sequential row numbers

Parameters:

start (int): Starting number (0 or 1)

Example:

rules:
  - type: add_column_rownum
    start: 1

Input:  [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "1"], ["cow", "2"], ["ba", "3"], ["d", "4"]]

Use cases:

Create numerical identifiers
Track original row order after sorting
Generate replicate numbers

add_column_value

Purpose: Add constant value to all rows

Parameters:

value (string): Constant value

Example:

rules:
  - type: add_column_value
    value: "control"

Input:  [["foo"], ["cow"]]
Output: [["foo", "control"], ["cow", "control"]]

Use cases:

Add condition labels (treatment/control)
Add constant metadata
Create separator columns for concatenation

add_column_concatenate

Purpose: Combine two columns into one

Parameters:

target_column_0 (int): First column
target_column_1 (int): Second column

Example:

rules:
  - type: add_column_concatenate
    target_column_0: 0
    target_column_1: 1

Input:  [["sample", "001"], ["sample", "002"]]
Output: [["sample", "001", "sample001"], ["sample", "002", "sample002"]]

Use cases:

Combine sample ID + replicate number
Build hierarchical identifiers
Create unique identifiers from multiple parts

Common pattern - add separator:

rules:
  - type: add_column_value
    value: "_"
  - type: add_column_concatenate
    target_column_0: 0
    target_column_1: 2  # The "_" column
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 1  # Result + second original column
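
Tracing the separator pattern by hand makes the column indices easier to follow. The Python sketch below uses two hypothetical helper functions that mimic `add_column_value` and `add_column_concatenate` on a two-column table.

```python
def add_column_value(data, value):
    """Append a constant cell to every row."""
    return [row + [value] for row in data]


def add_column_concatenate(data, target_column_0, target_column_1):
    """Append the concatenation of two existing cells to every row."""
    return [row + [row[target_column_0] + row[target_column_1]] for row in data]


data = [["sample", "001"], ["sample", "002"]]
data = add_column_value(data, "_")          # column 2 holds "_"
data = add_column_concatenate(data, 0, 2)   # column 3 holds "sample_"
data = add_column_concatenate(data, 3, 1)   # column 4 holds "sample_001"
print(data[0])
```

Every rule appends at the end, so the index of a new column is always the current row length before the rule runs.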

add_column_metadata

Purpose: Extract metadata from source objects

Parameters:

value (enum): Metadata type
- identifier0, identifier1, identifier2, …
- index0, index1, index2, …
- tags

Identifier extraction:

rules:
  - type: add_column_metadata
    value: identifier0  # Outermost identifier

Input:  [["moo"], ["meow"], ["bark"]]
Sources: [{"identifiers": ["cow"]}, {"identifiers": ["cat"]}, {"identifiers": ["dog"]}]
Output:  [["moo", "cow"], ["meow", "cat"], ["bark", "dog"]]

Multiple levels:

rules:
  - type: add_column_metadata
    value: identifier0  # Outer identifier
  - type: add_column_metadata
    value: identifier1  # Inner identifier

Sources: [
  {"identifiers": ["sample1", "forward"]},
  {"identifiers": ["sample1", "reverse"]}
]
Output:  [["data", "sample1", "forward"], ["data", "sample1", "reverse"]]

Index extraction:

rules:
  - type: add_column_metadata
    value: index0
  - type: add_column_metadata
    value: index1

Sources: [
  {"indices": [0, 0]},  # First sample, forward
  {"indices": [0, 1]},  # First sample, reverse
  {"indices": [1, 0]},  # Second sample, forward
  {"indices": [1, 1]}   # Second sample, reverse
]
Output:  [
  ["samp1for", "0", "0"],
  ["samp1rev", "0", "1"],
  ["samp2for", "1", "0"],
  ["samp2rev", "1", "1"]
]

Tags extraction:

rules:
  - type: add_column_metadata
    value: tags

Sources: [
  {"identifiers": ["cow"], "tags": ["farm"]},
  {"identifiers": ["dog"], "tags": ["house", "firestation"]}
]
Output:  [["moo", "farm"], ["bark", "firestation,house"]]  # Sorted, comma-joined

Use cases:

Access collection structure metadata
Build identifiers from nested collections
Use positional indices for numerical IDs
Extract tags for grouping/filtering
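
The lookup behind `identifierN`, `indexN` and `tags` can be sketched in Python as follows. This is a hypothetical model that matches the examples above (indices rendered as strings, tags sorted and comma-joined); it is not taken from Galaxy's source.

```python
def add_column_metadata(data, sources, value):
    """Append one metadata value from each row's source object."""
    def lookup(source):
        if value.startswith("identifier"):
            return source["identifiers"][int(value[len("identifier"):])]
        if value.startswith("index"):
            return str(source["indices"][int(value[len("index"):])])
        if value == "tags":
            return ",".join(sorted(source["tags"]))
        raise ValueError(f"unknown metadata value: {value}")

    return [row + [lookup(source)] for row, source in zip(data, sources)]


sources = [
    {"identifiers": ["cow"], "indices": [0], "tags": ["farm"]},
    {"identifiers": ["dog"], "indices": [1], "tags": ["house", "firestation"]},
]
print(add_column_metadata([["moo"], ["bark"]], sources, "tags"))
```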

add_column_group_tag_value

Purpose: Extract specific group tag value

Parameters:

value (string): Group tag name (e.g., “condition”, “type”)
default_value (string): Value if tag not present

Example:

rules:
  - type: add_column_group_tag_value
    value: condition
    default_value: 'control'

Sources: [
  {"tags": ["group:condition:treated"]},
  {"tags": ["group:condition:control"]},
  {"tags": []}  # No condition tag
]
Output:  [["data", "treated"], ["data", "control"], ["data", "control"]]

Multiple tags - first alphabetically wins:

rules:
  - type: add_column_group_tag_value
    value: where
    default_value: 'barn'

Sources: [
  {"tags": ["group:where:house", "group:where:firestation"]}
]
Output:  [["data", "firestation"]]  # "firestation" < "house" alphabetically

Use cases:

Group samples by experimental condition
Extract sample type (single-end/paired-end)
Use tags for nested collection organization
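
A small Python sketch of the selection logic described above: collect every `group:<name>:<value>` tag for the requested name, take the alphabetically first value, and fall back to the default. The function is a hypothetical illustration, not Galaxy's implementation.

```python
def add_column_group_tag_value(data, sources, value, default_value):
    """Append the value of the group tag `value`, or a default if absent."""
    prefix = f"group:{value}:"
    result = []
    for row, source in zip(data, sources):
        matches = sorted(
            tag[len(prefix):] for tag in source["tags"] if tag.startswith(prefix)
        )
        result.append(row + [matches[0] if matches else default_value])
    return result


sources = [
    {"tags": ["group:where:house", "group:where:firestation"]},
    {"tags": []},
]
print(add_column_group_tag_value([["data"], ["data"]], sources, "where", "barn"))
```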

add_column_from_sample_sheet_index

Purpose: Retrieve values from sample sheet columns

Parameters:

value (int): Sample sheet column index

Example:

rules:
  - type: add_column_from_sample_sheet_index
    value: 0
  - type: add_column_from_sample_sheet_index
    value: 1

Sources: [
  {"columns": [0, 1]},
  {"columns": [2, 3]}
]
Output:  [["moo", 0, 1], ["cow", 2, 3]]

Use cases:

Extract metadata from uploaded sample sheets
Access additional columns beyond identifiers
Incorporate external metadata

2. Filter Rules

Filters remove rows from the data table based on conditions.

add_filter_regex

Purpose: Keep/remove rows matching pattern

Parameters:

target_column (int): Column to test
expression (string): Regular expression
invert (bool, default: false): If true, keep non-matching rows

Keep matching:

rules:
  - type: add_filter_regex
    target_column: 0
    expression: '(a+)'
    invert: false

Input:  [["a", "b", "c"], ["e", "f", "g"]]
Output: [["a", "b", "c"]]

Remove matching:

rules:
  - type: add_filter_regex
    target_column: 2
    expression: '(c+)'
    invert: true

Input:  [["a", "b", "c"], ["e", "f", "g"]]
Output: [["e", "f", "g"]]

Use cases:

Filter by sample name pattern
Remove control samples
Select specific file types
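
Filters drop rows and their source entries together. The Python sketch below models `add_filter_regex` under that assumption; the function name and the use of plain strings as sources are hypothetical.

```python
import re


def add_filter_regex(data, sources, target_column, expression, invert=False):
    """Keep rows whose target cell matches (or, with invert, does not match)."""
    regex = re.compile(expression)
    kept = [
        (row, source)
        for row, source in zip(data, sources)
        if (regex.search(row[target_column]) is not None) != invert
    ]
    return [row for row, _ in kept], [source for _, source in kept]


data = [["a", "b", "c"], ["e", "f", "g"]]
sources = ["source0", "source1"]
print(add_filter_regex(data, sources, 0, "(a+)"))
print(add_filter_regex(data, sources, 2, "(c+)", invert=True))
```

With `invert=False` the matching rows survive; with `invert=True` the non-matching rows survive.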

add_filter_count

Purpose: Keep/remove first or last N rows

Parameters:

count (int): Number of rows
which (enum): first or last
invert (bool, default: false): If true, reverse filter

Remove first row:

rules:
  - type: add_filter_count
    count: 1
    which: first
    invert: false  # Remove first, keep rest

Input:  [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["e", "f", "g"], ["h", "i", "j"]]

Keep only last row:

rules:
  - type: add_filter_count
    count: 1
    which: last
    invert: true  # Remove all but last

Input:  [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["h", "i", "j"]]

Use cases:

Remove header rows
Skip first N samples
Select specific replicates
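
The invert semantics of this filter are easy to misread, so here is a Python sketch that reproduces both examples above. It is a hypothetical model: a row is kept when it lies outside the first or last `count` rows, and `invert` flips that decision.

```python
def add_filter_count(data, count, which, invert=False):
    """Drop the first/last `count` rows, or keep only them when inverted."""
    total = len(data)

    def keep(index):
        if which == "first":
            outside = index >= count
        else:  # which == "last"
            outside = index < total - count
        return outside != invert

    return [row for index, row in enumerate(data) if keep(index)]


rows = [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
print(add_filter_count(rows, 1, "first"))
print(add_filter_count(rows, 1, "last", invert=True))
```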

add_filter_empty

Purpose: Remove rows with empty cells

Parameters:

target_column (int): Column to check
invert (bool, default: false): If true, keep only empty

Remove empty:

rules:
  - type: add_filter_empty
    target_column: 0
    invert: false

Input:  [["", "b", "c"], ["a", "b", "c"]]
Output: [["a", "b", "c"]]

Use cases:

Remove rows with missing identifiers
Clean up sparse data
Filter failed extractions

add_filter_matches

Purpose: Exact value matching (case-sensitive)

Parameters:

value (string): Exact value to match
target_column (int): Column to check
invert (bool, default: false): If true, keep non-matching

Example:

rules:
  - type: add_filter_matches
    value: "a"
    target_column: 0
    invert: false

Input:  [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["a", "b", "c"]]

Important: Exact match only, no partial matches:

rules:
  - type: add_filter_matches
    value: "a"
    target_column: 0

Input:  [["a ", "b", "c"]]  # Note space after "a"
Output: []  # No match - "a " != "a"

Use cases:

Filter by specific sample ID
Select exact condition matches
Boolean filtering (match “true”/“false”)

add_filter_compare

Purpose: Numeric comparisons

Parameters:

target_column (int): Column with numeric values
value (number): Comparison value
compare_type (enum):
- less_than
- less_than_equal
- greater_than
- greater_than_equal

Example:

rules:
  - type: add_filter_compare
    target_column: 0
    value: 13
    compare_type: less_than

Input:  [["1", "moo"], ["10", "cow"], ["13", "rat"], ["20", "dog"]]
Output: [["1", "moo"], ["10", "cow"]]

Use cases:

Filter by quality scores
Select samples by replicate number
Threshold-based filtering
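
Cells are strings, so a numeric comparison has to parse them first. The Python sketch below assumes each target cell parses as a float and keeps the rows for which the comparison holds; the function and the `COMPARATORS` table are hypothetical.

```python
import operator

COMPARATORS = {
    "less_than": operator.lt,
    "less_than_equal": operator.le,
    "greater_than": operator.gt,
    "greater_than_equal": operator.ge,
}


def add_filter_compare(data, target_column, value, compare_type):
    """Keep rows whose numeric target cell satisfies the comparison."""
    compare = COMPARATORS[compare_type]
    return [row for row in data if compare(float(row[target_column]), value)]


rows = [["1", "moo"], ["10", "cow"], ["13", "rat"], ["20", "dog"]]
print(add_filter_compare(rows, 0, 13, "less_than"))
```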

3. Structural Rules

remove_columns

Purpose: Delete specified columns

Parameters:

target_columns (list[int]): Column indices to remove

Example:

rules:
  - type: remove_columns
    target_columns: [0, 1]

Input:  [["a", "b", "c"], ["e", "f", "g"]]
Output: [["c"], ["g"]]

Use cases:

Clean up intermediate columns
Remove temporary concatenation columns
Keep only final identifier columns

sort

Purpose: Sort rows by column value

Parameters:

target_column (int): Column to sort by
numeric (bool): If true, numeric sort; if false, alphabetic

Alphabetic sort:

rules:
  - type: sort
    numeric: false
    target_column: 0

Input:  [["moo", "cow"], ["meow", "cat"], ["bark", "dog"]]
Output: [["bark", "dog"], ["meow", "cat"], ["moo", "cow"]]

Note: Case-sensitive, uppercase sorts before lowercase

Input:  [["Dog"], ["cat"], ["cow"]]
Output: [["Dog"], ["cat"], ["cow"]]  # "Dog" < "cat" < "cow"

Use cases:

Alphabetize samples
Order by numerical IDs
Group similar identifiers together
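
The difference between alphabetic and numeric sorting, and the need to reorder sources along with rows, can be seen in this Python sketch. It is a hypothetical model: `sort_rows` is not a Galaxy function, and numeric mode assumes every cell parses as a float.

```python
def sort_rows(data, sources, target_column, numeric):
    """Sort rows by one column, keeping each source with its row."""
    def key(pair):
        cell = pair[0][target_column]
        return float(cell) if numeric else cell

    pairs = sorted(zip(data, sources), key=key)
    return [row for row, _ in pairs], [source for _, source in pairs]


data = [["10"], ["9"], ["2"]]
sources = ["s10", "s9", "s2"]
print(sort_rows(data, sources, 0, numeric=False)[0])
print(sort_rows(data, sources, 0, numeric=True)[0])
```

Alphabetic sorting compares character by character, so "10" sorts before "2"; numeric sorting compares the parsed values.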

swap_columns

Purpose: Exchange two column positions

Parameters:

target_column_0 (int): First column
target_column_1 (int): Second column

Example:

rules:
  - type: swap_columns
    target_column_0: 0
    target_column_1: 1

Input:  [["moo", "cow"], ["meow", "cat"]]
Output: [["cow", "moo"], ["cat", "meow"]]

Use cases:

Reorder identifier columns for mapping
Fix column order mistakes
Prepare for specific mapping requirements

split_columns

Purpose: Split each row into two rows, one per column group

Parameters:

target_columns_0 (list[int]): First column group
target_columns_1 (list[int]): Second column group

Example:

rules:
  - type: split_columns
    target_columns_0: [0]
    target_columns_1: [1]

Input:  [["moo", "cow", "A"], ["meow", "cat", "B"]]
Output: [
  ["moo", "A"],
  ["cow", "A"],
  ["meow", "B"],
  ["cat", "B"]
]

How it works:

Each input row becomes two output rows:
- The first keeps the columns in group 0 and drops the columns in group 1
- The second keeps the columns in group 1 and drops the columns in group 0
Columns that belong to neither group are copied into both rows (the "A"/"B" column in the example above).

Use cases:

Split paired-end data into forward/reverse
Expand multiple samples per row
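
The example output above (two rows per input row, with the ungrouped column repeated) can be reproduced with the following Python sketch. The function is a hypothetical model written to match that example, not a copy of Galaxy's code.

```python
def split_columns(data, target_columns_0, target_columns_1):
    """Split each row in two: one row per column group, sharing other columns."""
    result = []
    for row in data:
        row_0, row_1 = [], []
        for index, cell in enumerate(row):
            if index in target_columns_0:
                row_0.append(cell)
            elif index in target_columns_1:
                row_1.append(cell)
            else:
                # Ungrouped columns are copied into both output rows.
                row_0.append(cell)
                row_1.append(cell)
        result.extend([row_0, row_1])
    return result


print(split_columns([["moo", "cow", "A"], ["meow", "cat", "B"]], [0], [1]))
```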

Mapping Operations

Mapping operations define how transformed data columns become collection structure. These are the final step that converts tabular data back to collections.

Available Mapping Types

list_identifiers

Purpose: Create list structure with specified nesting levels

Parameters:

columns (list[int]): Column indices for identifiers

Single column = simple list:

mapping:
  - type: list_identifiers
    columns: [0]

Data: [["sample1"], ["sample2"]]
Result: list [sample1, sample2]

Two columns = nested list:list:

mapping:
  - type: list_identifiers
    columns: [0, 1]

Data: [["group1", "s1"], ["group1", "s2"], ["group2", "s3"]]
Result: list:list [
  group1 → [s1, s2],
  group2 → [s3]
]

Three columns = list:list:list:

mapping:
  - type: list_identifiers
    columns: [0, 1, 2]

Nesting logic:

Column 0 = outermost identifier
Column 1 = next level identifier
Column 2 = innermost identifier
Groups rows by matching outer identifiers

paired_identifier

Purpose: Add paired collection level

Parameters:

columns (list[int]): Single column with paired identifier

Valid identifier values:

forward, f, 1, R1 → becomes forward
reverse, r, 2, R2 → becomes reverse

Simple paired:

mapping:
  - type: paired_identifier
    columns: [0]

Data: [["forward"], ["reverse"]]
Result: paired {forward, reverse}

Combined with list:

mapping:
  - type: list_identifiers
    columns: [0]
  - type: paired_identifier
    columns: [1]

Data: [
  ["sample1", "forward"],
  ["sample1", "reverse"],
  ["sample2", "forward"],
  ["sample2", "reverse"]
]
Result: list:paired [
  sample1 → {forward, reverse},
  sample2 → {forward, reverse}
]

paired_or_unpaired_identifier

Purpose: Add paired_or_unpaired collection level (allows unpaired single datasets)

Parameters:

columns (list[int]): Single column with paired/unpaired identifier

Valid identifier values:

All paired values above, plus:
unpaired, u → becomes unpaired

Example:

mapping:
  - type: list_identifiers
    columns: [0]
  - type: paired_or_unpaired_identifier
    columns: [1]

Note: If a sample has only forward and no reverse, it becomes unpaired automatically.

group_tags

Purpose: Apply group tags (format: group:name:value)

Parameters:

columns (list[int]): Columns containing group tag values

Example:

mapping:
  - type: list_identifiers
    columns: [1, 0]  # Group by column 1, element ID column 0
  - type: group_tags
    columns: [1]     # Apply as group tag

Data: [["s1", "treated"], ["s2", "control"]]
Result: list:list with group tags [
  treated → [s1 (tags: ["group:treated"])],
  control → [s2 (tags: ["group:control"])]
]

Collection Type Determination

The output collection type is determined solely by the mapping:

# From RuleSet.collection_type property:
list_columns = mapping_as_dict.get("list_identifiers", {"columns": []})["columns"]
collection_type = ":".join("list" for c in list_columns)
if "paired_identifier" in mapping_as_dict:
    collection_type += ":paired" if collection_type else "paired"
if "paired_or_unpaired_identifier" in mapping_as_dict:
    collection_type += ":paired_or_unpaired" if collection_type else "paired_or_unpaired"

Examples:

list_identifiers: [0] → list
list_identifiers: [0, 1] → list:list
list_identifiers: [0] + paired_identifier: [1] → list:paired
list_identifiers: [0, 1] + paired_identifier: [2] → list:list:paired

Complete Example: list:record to list:paired

This example demonstrates complex transformation combining multiple rule types:

Goal: Convert a list:record collection whose records have “mother”, “father”, and “child” elements into list:paired, mapping “mother” to “forward” and “child” to “reverse” and dropping “father”.

rules:
  - type: add_column_metadata
    value: identifier0  # Sample identifier
  - type: add_column_metadata
    value: identifier1  # Record type (mother/father/child)
  - type: add_column_regex
    target_column: 2
    expression: 'mother'
    replacement: 'forward'
    allow_unmatched: true  # Leaves others as ""
  - type: add_column_regex
    target_column: 2
    expression: 'child'
    replacement: 'reverse'
    allow_unmatched: true
  - type: add_column_concatenate
    target_column_0: 3  # Result of first regex
    target_column_1: 4  # Result of second regex
  - type: add_filter_empty
    target_column: 5  # Remove rows that didn't match (father)
    invert: false
  - type: remove_columns
    target_columns: [2, 3, 4]  # Clean up intermediate columns

mapping:
  - type: list_identifiers
    columns: [1]  # Sample ID
  - type: paired_identifier
    columns: [2]  # forward/reverse

Transformation steps:

Initial:
  data: [["el1"], ["el2"], ["el3"]]
  sources: [
    {"identifiers": ["samp1", "mother"]},
    {"identifiers": ["samp1", "father"]},
    {"identifiers": ["samp1", "child"]}
  ]

After add_column_metadata (identifier0, identifier1):
  [["el1", "samp1", "mother"],
   ["el2", "samp1", "father"],
   ["el3", "samp1", "child"]]

After first regex (mother → forward):
  [["el1", "samp1", "mother", "forward"],
   ["el2", "samp1", "father", ""],
   ["el3", "samp1", "child", ""]]

After second regex (child → reverse):
  [["el1", "samp1", "mother", "forward", ""],
   ["el2", "samp1", "father", "", ""],
   ["el3", "samp1", "child", "", "reverse"]]

After concatenate (cols 3+4):
  [["el1", "samp1", "mother", "forward", "", "forward"],
   ["el2", "samp1", "father", "", "", ""],
   ["el3", "samp1", "child", "", "reverse", "reverse"]]

After filter empty (col 5):
  [["el1", "samp1", "mother", "forward", "", "forward"],
   ["el3", "samp1", "child", "", "reverse", "reverse"]]

After remove_columns [2, 3, 4]:
  [["el1", "samp1", "forward"],
   ["el3", "samp1", "reverse"]]

Final mapping with list_identifiers [1] and paired_identifier [2]:
  Result: list:paired [
    samp1 → {forward, reverse}
  ]
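
The rule steps traced above can be replayed in a few lines of Python. This is a simplified sketch, not Galaxy's implementation: `regex_column` is a hypothetical helper that models `add_column_regex` with a replacement and `allow_unmatched: true`, and the sources list is not filtered alongside the rows.

```python
import re


def regex_column(data, target_column, expression, replacement):
    """Append the replacement for matching rows, or "" for unmatched rows."""
    regex = re.compile(expression)
    rows = []
    for row in data:
        match = regex.search(row[target_column])
        rows.append(row + [match.expand(replacement) if match else ""])
    return rows


data = [["el1"], ["el2"], ["el3"]]
sources = [
    {"identifiers": ["samp1", "mother"]},
    {"identifiers": ["samp1", "father"]},
    {"identifiers": ["samp1", "child"]},
]

# add_column_metadata: identifier0 then identifier1 (columns 1 and 2)
data = [row + source["identifiers"][:2] for row, source in zip(data, sources)]
data = regex_column(data, 2, "mother", "forward")   # column 3
data = regex_column(data, 2, "child", "reverse")    # column 4
data = [row + [row[3] + row[4]] for row in data]    # add_column_concatenate: column 5
data = [row for row in data if row[5] != ""]        # add_filter_empty on column 5
data = [
    [cell for index, cell in enumerate(row) if index not in (2, 3, 4)]
    for row in data
]                                                   # remove_columns [2, 3, 4]
print(data)
```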

Rule Composition Patterns

Pattern 1: Extract and Flatten

Goal: Flatten list:paired → list with combined identifiers

rules:
  - type: add_column_metadata
    value: identifier0  # Outer ID
  - type: add_column_metadata
    value: identifier1  # Pair ID (forward/reverse)
  - type: add_column_concatenate
    target_column_0: 0
    target_column_1: 1  # Combine them

mapping:
  - type: list_identifiers
    columns: [2]  # Use concatenated column

Pattern 2: Group by Tag

Goal: Reorganize by tag value into nested structure

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_group_tag_value
    value: condition  # Extract "condition" tag
    default_value: "unassigned"

mapping:
  - type: list_identifiers
    columns: [1, 0]  # Group by condition, then sample ID
  - type: group_tags
    columns: [1]     # Apply as group tags

Pattern 3: Filter and Sort

Goal: Select subset and alphabetize

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_filter_regex
    target_column: 0
    expression: '^control_'  # Only controls
    invert: false
  - type: sort
    numeric: false
    target_column: 0

mapping:
  - type: list_identifiers
    columns: [0]

Pattern 4: Parse Filename Structure

Goal: Extract sample info from “sample_123_R1.fastq.gz” format

rules:
  - type: add_column_metadata
    value: identifier0  # Original filename
  - type: add_column_regex
    target_column: 0
    expression: 'sample_(\w+)_R(\d)'
    group_count: 2  # Sample ID and read number
  - type: add_column_value
    value: "_R"
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 2  # "_R" + "1" = "_R1"
  - type: add_column_concatenate
    target_column_0: 1
    target_column_1: 4  # "123" + "_R1" = "123_R1"
  - type: remove_columns
    target_columns: [0, 1, 2, 3, 4]  # Keep only final identifier (column 5)

mapping:
  - type: list_identifiers
    columns: [0]

Pattern 5: Create Paired from Separate Lists

Goal: Turn a flat list that contains both forward and reverse files into a list:paired collection

Assumption: Files named like sample1_R1.fastq, sample1_R2.fastq

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: '(.+)_R([12])'
    group_count: 2  # Sample name and read number
  - type: add_column_regex
    target_column: 2
    expression: '1'
    replacement: 'forward'
    allow_unmatched: true
  - type: add_column_regex
    target_column: 2
    expression: '2'
    replacement: 'reverse'
    allow_unmatched: true
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 4
  - type: sort
    numeric: false
    target_column: 1  # Ensure pairs adjacent
  - type: remove_columns
    target_columns: [0, 2, 3, 4]

mapping:
  - type: list_identifiers
    columns: [0]     # Sample ID
  - type: paired_identifier
    columns: [1]     # forward/reverse

Best Practices

1. Plan Column Layout

Before writing rules, sketch the transformations:

Col 0: Original identifier
Col 1: Extracted sample ID (regex)
Col 2: Extracted replicate (regex)
Col 3: Separator "_"
Col 4: Concatenate 1+3
Col 5: Concatenate 4+2 (final identifier; remove columns 0-4 afterwards)

2. Test Incrementally

Add rules one at a time and verify output:

Start with metadata extraction
Add one transformation
Check result
Continue

3. Use allow_unmatched Carefully

Only use when genuinely optional:

# BAD - silently fails to extract
- type: add_column_regex
  expression: 'wrong_pattern'
  allow_unmatched: true

# GOOD - errors if pattern doesn't match
- type: add_column_regex
  expression: 'expected_pattern'
  allow_unmatched: false

4. Remove Intermediate Columns

Clean up before mapping:

rules:
  - type: add_column_metadata
    value: identifier0
  # ... many transformations ...
  - type: remove_columns
    target_columns: [0, 2, 3]  # Remove temp columns

mapping:
  - type: list_identifiers
    columns: [0]  # Only final column remains

5. Validate with Filters

Use filters to ensure data quality:

rules:
  - type: add_column_regex
    expression: 'pattern'
    allow_unmatched: false  # Errors if doesn't match
  - type: add_filter_empty
    target_column: 1
    invert: false  # Remove any that became empty

6. Document Complex Rules

Add comments explaining logic:

rules:
  # Extract sample ID from filename "sample_123_R1.fastq"
  - type: add_column_regex
    target_column: 0
    expression: 'sample_(\w+)_R\d'

  # Remove original filename column
  - type: remove_columns
    target_columns: [0]

Common Pitfalls

Pitfall 1: Column Indices Shift

Problem: After removing columns, indices change

# WRONG
rules:
  - type: remove_columns
    target_columns: [0]
  - type: add_column_regex
    target_column: 1  # This is now wrong! Column 1 became 0

Solution: Remove columns last, or recalculate indices
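
The shift is easy to see on a single row. The Python sketch below uses a hypothetical `remove_columns` helper and made-up cell values to show which value each index refers to before and after the removal.

```python
def remove_columns(data, target_columns):
    """Drop the listed column indices from every row."""
    return [
        [cell for index, cell in enumerate(row) if index not in target_columns]
        for row in data
    ]


before = [["raw.fastq", "sampleA", "rep1"]]
after = remove_columns(before, [0])

print(before[0][1])  # value at column 1 before removal
print(after[0][0])   # the same value now sits at column 0
```

Any rule that runs after the removal has to use the new indices.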

Pitfall 2: Forgetting Invert Logic

Problem: Confusion about filter invert

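# Keep matching rows (remove non-matching)
- type: add_filter_regex
  expression: 'sample_'
  invert: false  # FALSE means "keep matching"

# Remove matching rows (keep non-matching)
- type: add_filter_regex
  expression: 'control_'
  invert: true  # TRUE means "remove matching"

Clearer thinking: for add_filter_regex and add_filter_matches, invert: false = “keep matches”, invert: true = “remove matches”. add_filter_count and add_filter_empty read the other way around: with invert: false they remove the first/last N rows or the empty rows, as shown in their examples.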

Pitfall 3: Regex Escaping

Problem: Special regex characters not escaped

# WRONG - . matches any character
expression: 'file.fastq'

# RIGHT
expression: 'file\.fastq'

# For literal parentheses
expression: '\(sample\)'
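
When a pattern is meant to match literal text, one way to produce the escaped form is Python's `re.escape`. The sketch below only illustrates the difference between the escaped and unescaped pattern in Python's regex engine; it assumes the escaped output is also acceptable to whatever engine evaluates the rule.

```python
import re

unescaped = re.compile("file.fastq")
escaped = re.compile(re.escape("file.fastq"))

# The unescaped "." matches any character, so "fileXfastq" is accepted.
print(bool(unescaped.search("fileXfastq")))
# The escaped pattern only accepts a literal dot.
print(bool(escaped.search("fileXfastq")))
print(bool(escaped.search("file.fastq")))
```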

Pitfall 4: Case Sensitivity

Problem: Filters are case-sensitive

# Doesn't match "Sample1"
- type: add_filter_matches
  value: "sample1"
  target_column: 0

Solution: Use add_filter_regex with a pattern that accepts both cases (e.g., '^[Ss]ample1$'), or normalize case before the collection is built

Pitfall 5: Empty Sources After Filtering

Problem: All rows filtered out

rules:
  - type: add_filter_regex
    expression: 'nonexistent'
    invert: false
# Result: Empty collection!

Solution: Test filters carefully against the real identifiers before relying on the output. Note that allow_unmatched belongs to add_column_regex and has no effect on filters.

When to Use / When NOT to Use Apply Rules

When to Use Apply Rules

Complex identifier parsing (multiple regex extractions)
Tag-based restructuring (group by experimental condition)
Conditional filtering combined with restructuring
Structure transformations not covered by simple tools
Multiple transformations needed in one step

When NOT to Use Apply Rules

Operation	Use This Instead	Why
Simple filtering	`__FILTER_FROM_FILE__`	Simpler, clearer intent
Basic flattening	`__FLATTEN__`	One-step operation
Sort collection	`__SORTLIST__`	Dedicated tool
Extract element	`__EXTRACT_DATASET__`	Direct operation
Zip two lists	`__ZIP_COLLECTION__`	Simpler syntax
Unzip paired	`__UNZIP_COLLECTION__`	Straightforward
Relabel identifiers	`__RELABEL_FROM_FILE__`	If mapping from file

Comparison Table

Operation	Simple Tool	When to use Apply Rules instead
Filter	Filter Collection	Need to filter on derived metadata, combine with restructuring
Flatten	Flatten Collection	Need control over identifier format, filter simultaneously
Relabel	Relabel Identifiers	Need regex-based transformation, derive from existing metadata
Sort	Sort Collection	Need to sort by derived values, combine with other operations
Restructure	N/A	Full control over nesting structure from any metadata

Key Insight: Apply Rules is the tool of choice when:

Multiple transformations needed in one step
Restructuring based on metadata (tags, identifier patterns)
Complex identifier manipulation required
Standard tools don’t cover the use case

Use Case Examples

Use Case 1: Standard Paired-End RNA-seq

Files: sample1_R1.fastq.gz, sample1_R2.fastq.gz, sample2_R1.fastq.gz, sample2_R2.fastq.gz

Goal: Create list:paired collection

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: '(.+)_R([12])\.fastq\.gz'
    group_count: 2
  - type: add_column_regex
    target_column: 2
    expression: '1'
    replacement: 'forward'
    allow_unmatched: true
  - type: add_column_regex
    target_column: 2
    expression: '2'
    replacement: 'reverse'
    allow_unmatched: true
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 4
  - type: sort
    target_column: 1
    numeric: false
  - type: remove_columns
    target_columns: [0, 2, 3, 4]

mapping:
  - type: list_identifiers
    columns: [0]
  - type: paired_identifier
    columns: [1]

Use Case 2: Remove Control Samples

Goal: Filter out samples starting with “control_”

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_filter_regex
    target_column: 0
    expression: '^control_'
    invert: true  # Remove matches = keep non-controls

mapping:
  - type: list_identifiers
    columns: [0]

Use Case 3: Group by Treatment Condition

Goal: Reorganize by “group:condition:*” tag into nested list

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_group_tag_value
    value: condition
    default_value: 'unassigned'

mapping:
  - type: list_identifiers
    columns: [1, 0]  # Group by condition, then sample
  - type: group_tags
    columns: [1]

Use Case 4: Filter by Quality Score Threshold

Assumption: Quality score in sample name like “sample_123_q95”

Goal: Keep only samples with quality >= 90

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: 'sample_\w+_q(\d+)'
    group_count: 1  # Capture only the digits, not the whole match
  - type: add_filter_compare
    target_column: 1
    value: 90
    compare_type: greater_than_equal
  - type: remove_columns
    target_columns: [1]

mapping:
  - type: list_identifiers
    columns: [0]

Use Case 5: Replicate Structure

Files: treatment_rep1, treatment_rep2, control_rep1, control_rep2

Goal: Create list:list [treatment → [rep1, rep2], control → [rep1, rep2]]

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: '(.+)_(rep\d+)'
    group_count: 2
  - type: sort
    target_column: 1
    numeric: false
  - type: remove_columns
    target_columns: [0]

mapping:
  - type: list_identifiers
    columns: [0, 1]  # Condition, then replicate

API Usage

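import requests

# galaxy_url, api_key, history_id and collection_id are placeholders you must supply.
inputs = {
    "input": {"src": "hdca", "id": collection_id},
    "rules": {
        "rules": [...],    # replace with rule objects as described above
        "mapping": [...],  # replace with mapping objects as described above
    },
}
payload = {"tool_id": "__APPLY_RULES__", "history_id": history_id, "inputs": inputs}
response = requests.post(f"{galaxy_url}/api/tools", json=payload, headers={"x-api-key": api_key})
response.raise_for_status()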
