
Galaxy Apply Rules DSL

Reference for Galaxy's Apply Rules DSL: rule operations, mapping operations, composition patterns, pitfalls.

Revised: 2026-05-02 · Rev 2 · component

Reference for Galaxy’s Apply Rules DSL — the rule grammar consumed by __APPLY_RULES__ (see galaxy-collection-tools for the surrounding tool catalog and galaxy-collection-semantics for collection mapping/reduction semantics).

Key principle: rules transform collection metadata (identifiers, indices, tags) as tabular data; mapping operations turn the resulting columns back into collection structure.

Sources of truth in Galaxy:

  • lib/galaxy/util/rules_dsl.py — rule implementation
  • lib/galaxy/util/rules_dsl_spec.yml — test spec covering every rule type
  • lib/galaxy/managers/collections.py — collection building from rules
  • PR #5819 — original implementation

This note is the consumer-facing companion to those files. Verify against the spec YAML when in doubt.

Rules DSL Architecture

Core Concepts

Data Model:

data: [[cell values]]      # 2D array of strings (tabular data)
sources: [source objects]   # Metadata for each row (identifiers, indices, tags)

Initial State Example:

# Input: list:paired with elements [sample1/forward, sample1/reverse, sample2/forward, sample2/reverse]

data = [[], [], [], []]  # Empty rows, one per dataset
sources = [
    {"identifiers": ["sample1", "forward"], "indices": [0, 0], "dataset": <hda>, "tags": []},
    {"identifiers": ["sample1", "reverse"], "indices": [0, 1], "dataset": <hda>, "tags": []},
    {"identifiers": ["sample2", "forward"], "indices": [1, 0], "dataset": <hda>, "tags": []},
    {"identifiers": ["sample2", "reverse"], "indices": [1, 1], "dataset": <hda>, "tags": []},
]

Execution Flow:

  1. Collection metadata extracted to tabular format
  2. Rules applied sequentially to transform data
  3. Mapping operations convert transformed data to new collection

Example:

Input collection: list [i1, i2]

Initial state:
  data: [["value1"], ["value2"]]
  sources: [
    {"identifiers": ["i1"], "indices": [0]},
    {"identifiers": ["i2"], "indices": [1]}
  ]

After rules:
  data: [["value1", "i1"], ["value2", "i2"]]  # Added identifier column

After mapping:
  Output collection: list [i1, i2]
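
The three-step flow above can be modeled in a few lines of plain Python. This is an illustrative sketch only, not Galaxy's implementation: the helper names `extract_sources`, `add_identifier_column`, and `map_to_list` are hypothetical, and datasets are left out of the source objects.

```python
# Illustrative model of the execution flow; all helper names are hypothetical.

def extract_sources(identifiers):
    """Step 1: one source object per element of a flat list collection."""
    return [{"identifiers": [name], "indices": [i]} for i, name in enumerate(identifiers)]


def add_identifier_column(data, sources, level=0):
    """Step 2: a rule that appends each row's identifier at the given level."""
    return [row + [src["identifiers"][level]] for row, src in zip(data, sources)]


def map_to_list(data, column):
    """Step 3: a mapping that reads element identifiers from one column."""
    return [row[column] for row in data]


sources = extract_sources(["i1", "i2"])
data = [["value1"], ["value2"]]          # initial cells, as in the example above
data = add_identifier_column(data, sources)
print(data)                  # [['value1', 'i1'], ['value2', 'i2']]
print(map_to_list(data, 1))  # ['i1', 'i2']
```

Keeping `data` and `sources` as parallel lists is what lets later rules read metadata for a row even after its cells have been transformed.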

Rule Operations

Rules are applied sequentially in the order specified. Each rule transforms the data table.

1. Column Addition Rules

add_column_basename

Purpose: Extract basename from file paths

Parameters:

  • target_column (int): Column containing paths

Example:

rules:
  - type: add_column_basename
    target_column: 0

Transformation:

Input:  [["/path/to/moo.txt"], ["moo.txt"]]
Output: [["/path/to/moo.txt", "moo.txt"], ["moo.txt", "moo.txt"]]

Use cases:

  • Extract filenames from full paths
  • Create identifiers from uploaded file paths
  • Normalize identifiers across different upload methods

add_column_regex

Purpose: Capture regex groups or perform replacements

Parameters:

  • target_column (int): Column to process
  • expression (string): Regular expression pattern
  • replacement (string, optional): Replacement template with \1, \2 for groups
  • group_count (int, optional): Number of groups to capture as separate columns
  • allow_unmatched (bool, default: false): If false, errors on unmatched rows

Mode 1: Simple capture (default)

rules:
  - type: add_column_regex
    target_column: 0
    expression: '(o)+'
Input:  [["foo"], ["cow"]]
Output: [["foo", "oo"], ["cow", "o"]]

Mode 2: Replacement

rules:
  - type: add_column_regex
    target_column: 0
    expression: '(o+)'
    replacement: 'the os \1'
Input:  [["foo"], ["cow"]]
Output: [["foo", "the os oo"], ["cow", "the os o"]]

Mode 3: Multiple groups

rules:
  - type: add_column_regex
    target_column: 0
    expression: '.*(o)(o)'
    group_count: 2
Input:  [["foo"], ["boo"]]
Output: [["foo", "o", "o"], ["boo", "o", "o"]]

Mode 4: Allow unmatched

rules:
  - type: add_column_regex
    target_column: 0
    expression: '(o)+'
    allow_unmatched: true
Input:  [["foo"], ["cow"], ["cat"]]
Output: [["foo", "oo"], ["cow", "o"], ["cat", ""]]
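
The four modes can be reproduced with Python's `re` module. The sketch below is inferred from the example outputs in this section rather than copied from lib/galaxy/util/rules_dsl.py. The function name simply mirrors the rule type, and padding unmatched rows with one empty cell per expected column is an assumption.

```python
import re


def add_column_regex(data, target_column, expression, replacement=None,
                     group_count=None, allow_unmatched=False):
    """Append regex-derived cells to every row (sketch, not Galaxy's code)."""
    pattern = re.compile(expression)
    result = []
    for row in data:
        match = pattern.search(row[target_column])
        if match is None:
            if not allow_unmatched:
                raise ValueError(f"{expression!r} did not match {row[target_column]!r}")
            new_cells = [""] * (group_count or 1)       # assumption: pad with ""
        elif replacement is not None:
            new_cells = [match.expand(replacement)]     # Mode 2: fill the template
        elif group_count:
            new_cells = list(match.groups())            # Mode 3: one cell per group
        else:
            new_cells = [match.group(0)]                # Mode 1: the whole match
        result.append(row + new_cells)
    return result


print(add_column_regex([["foo"], ["cow"]], 0, r"(o)+"))
# [['foo', 'oo'], ['cow', 'o']]
print(add_column_regex([["foo"], ["cow"]], 0, r"(o+)", replacement=r"the os \1"))
# [['foo', 'the os oo'], ['cow', 'the os o']]
```

Note that Mode 1 yields the whole matched text ("oo"), not the last repetition of the group ("o"), which is what the example output above shows.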

Use cases:

  • Extract sample names from filenames (e.g., sample_(\w+)_R1.fastq)
  • Parse structured identifiers (e.g., TCGA-(\w+)-(\d+))
  • Clean up identifiers (remove prefixes/suffixes)
  • Extract metadata embedded in filenames

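Common patterns:

# Without group_count the whole match is added (see Mode 1: '(o)+' on "foo" gives "oo"),
# so set group_count: 1 to add only the captured group. Verify against the spec YAML.

# Extract sample ID from "sample_123_R1.fastq"
expression: 'sample_(\w+)_R\d'
group_count: 1

# Extract prefix before first underscore
expression: '([^_]+)_.*'
group_count: 1

# Extract everything before last dot
expression: '(.+)\.[^.]+$'
group_count: 1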

add_column_substr

Purpose: Extract or remove fixed-length substrings

Parameters:

  • target_column (int): Column to process
  • substr_type (enum): Operation type
    • keep_prefix: Keep first N characters
    • keep_suffix: Keep last N characters
    • drop_prefix: Remove first N characters
    • drop_suffix: Remove last N characters
  • length (int): Number of characters

Examples:

# Keep first 2 characters
rules:
  - type: add_column_substr
    target_column: 0
    substr_type: keep_prefix
    length: 2
Input:  [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "fo"], ["cow", "co"], ["ba", "ba"], ["d", "d"]]
# Drop last 2 characters
rules:
  - type: add_column_substr
    target_column: 0
    substr_type: drop_suffix
    length: 2
Input:  [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "f"], ["cow", "c"], ["ba", ""], ["d", ""]]
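
The four `substr_type` values map onto ordinary string slicing. A minimal sketch, assuming that strings shorter than `length` are handled as in the examples above (kept whole for keep_*, emptied for drop_*); the function is illustrative, not Galaxy's code.

```python
def add_column_substr(data, target_column, substr_type, length):
    """Append a fixed-length slice of the target column to every row (sketch)."""

    def cut(value):
        if substr_type == "keep_prefix":
            return value[:length]
        if substr_type == "drop_prefix":
            return value[length:]
        # For suffix operations, compute the split point explicitly so that
        # short strings and length 0 behave sensibly.
        split_at = max(len(value) - length, 0)
        if substr_type == "keep_suffix":
            return value[split_at:]
        if substr_type == "drop_suffix":
            return value[:split_at]
        raise ValueError(f"unknown substr_type: {substr_type}")

    return [row + [cut(row[target_column])] for row in data]


rows = [["foo"], ["cow"], ["ba"], ["d"]]
print(add_column_substr(rows, 0, "keep_prefix", 2))
# [['foo', 'fo'], ['cow', 'co'], ['ba', 'ba'], ['d', 'd']]
print(add_column_substr(rows, 0, "drop_suffix", 2))
# [['foo', 'f'], ['cow', 'c'], ['ba', ''], ['d', '']]
```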

Use cases:

  • Remove common prefixes/suffixes
  • Extract barcodes from fixed positions
  • Truncate long identifiers

add_column_rownum

Purpose: Add sequential row numbers

Parameters:

  • start (int): Starting number (0 or 1)

Example:

rules:
  - type: add_column_rownum
    start: 1
Input:  [["foo"], ["cow"], ["ba"], ["d"]]
Output: [["foo", "1"], ["cow", "2"], ["ba", "3"], ["d", "4"]]

Use cases:

  • Create numerical identifiers
  • Track original row order after sorting
  • Generate replicate numbers

add_column_value

Purpose: Add constant value to all rows

Parameters:

  • value (string): Constant value

Example:

rules:
  - type: add_column_value
    value: "control"
Input:  [["foo"], ["cow"]]
Output: [["foo", "control"], ["cow", "control"]]

Use cases:

  • Add condition labels (treatment/control)
  • Add constant metadata
  • Create separator columns for concatenation

add_column_concatenate

Purpose: Combine two columns into one

Parameters:

  • target_column_0 (int): First column
  • target_column_1 (int): Second column

Example:

rules:
  - type: add_column_concatenate
    target_column_0: 0
    target_column_1: 1
Input:  [["sample", "001"], ["sample", "002"]]
Output: [["sample", "001", "sample001"], ["sample", "002", "sample002"]]

Use cases:

  • Combine sample ID + replicate number
  • Build hierarchical identifiers
  • Create unique identifiers from multiple parts

Common pattern - add separator:

rules:
  - type: add_column_value
    value: "_"
  - type: add_column_concatenate
    target_column_0: 0
    target_column_1: 2  # The "_" column
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 1  # Result + second original column
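
Because every rule appends a new column, the column numbers in the separator pattern are easiest to follow by tracing them. The sketch below uses two hypothetical helpers that mimic the rules' effect on the table and starts from the two-column input used in the concatenate example.

```python
def add_column_value(data, value):
    """Append the same constant to every row."""
    return [row + [value] for row in data]


def add_column_concatenate(data, col_0, col_1):
    """Append the concatenation of two existing columns to every row."""
    return [row + [row[col_0] + row[col_1]] for row in data]


data = [["sample", "001"], ["sample", "002"]]   # columns 0 and 1
data = add_column_value(data, "_")              # column 2: "_"
data = add_column_concatenate(data, 0, 2)       # column 3: "sample_"
data = add_column_concatenate(data, 3, 1)       # column 4: "sample_001"
print([row[4] for row in data])                 # ['sample_001', 'sample_002']
```

Columns 2 and 3 are intermediates, so a remove_columns rule would normally follow before mapping.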

add_column_metadata

Purpose: Extract metadata from source objects

Parameters:

  • value (enum): Metadata type
    • identifier0, identifier1, identifier2, …
    • index0, index1, index2, …
    • tags

Identifier extraction:

rules:
  - type: add_column_metadata
    value: identifier0  # Outermost identifier
Input:  [["moo"], ["meow"], ["bark"]]
Sources: [{"identifiers": ["cow"]}, {"identifiers": ["cat"]}, {"identifiers": ["dog"]}]
Output:  [["moo", "cow"], ["meow", "cat"], ["bark", "dog"]]

Multiple levels:

rules:
  - type: add_column_metadata
    value: identifier0  # Outer identifier
  - type: add_column_metadata
    value: identifier1  # Inner identifier
Sources: [
  {"identifiers": ["sample1", "forward"]},
  {"identifiers": ["sample1", "reverse"]}
]
Output:  [["data", "sample1", "forward"], ["data", "sample1", "reverse"]]

Index extraction:

rules:
  - type: add_column_metadata
    value: index0
  - type: add_column_metadata
    value: index1
Sources: [
  {"indices": [0, 0]},  # First sample, forward
  {"indices": [0, 1]},  # First sample, reverse
  {"indices": [1, 0]},  # Second sample, forward
  {"indices": [1, 1]}   # Second sample, reverse
]
Output:  [
  ["samp1for", "0", "0"],
  ["samp1rev", "0", "1"],
  ["samp2for", "1", "0"],
  ["samp2rev", "1", "1"]
]

Tags extraction:

rules:
  - type: add_column_metadata
    value: tags
Sources: [
  {"identifiers": ["cow"], "tags": ["farm"]},
  {"identifiers": ["dog"], "tags": ["house", "firestation"]}
]
Output:  [["moo", "farm"], ["bark", "firestation,house"]]  # Sorted, comma-joined
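
The three metadata kinds can be read from a source object as follows. This is a sketch consistent with the examples above, not Galaxy's implementation; rendering indices as strings and an empty tag list as "" are assumptions.

```python
def add_column_metadata(data, sources, value):
    """Append one metadata cell per row, taken from the parallel sources list."""

    def cell(source):
        if value == "tags":
            return ",".join(sorted(source["tags"]))      # sorted, comma-joined
        if value.startswith("identifier"):
            level = int(value[len("identifier"):])
            return source["identifiers"][level]
        if value.startswith("index"):
            level = int(value[len("index"):])
            return str(source["indices"][level])          # cells are strings
        raise ValueError(f"unknown metadata value: {value}")

    return [row + [cell(source)] for row, source in zip(data, sources)]


sources = [
    {"identifiers": ["cow"], "indices": [0], "tags": ["farm"]},
    {"identifiers": ["dog"], "indices": [1], "tags": ["house", "firestation"]},
]
print(add_column_metadata([["moo"], ["bark"]], sources, "tags"))
# [['moo', 'farm'], ['bark', 'firestation,house']]
```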

Use cases:

  • Access collection structure metadata
  • Build identifiers from nested collections
  • Use positional indices for numerical IDs
  • Extract tags for grouping/filtering

add_column_group_tag_value

Purpose: Extract specific group tag value

Parameters:

  • value (string): Group tag name (e.g., “condition”, “type”)
  • default_value (string): Value if tag not present

Example:

rules:
  - type: add_column_group_tag_value
    value: condition
    default_value: 'control'
Sources: [
  {"tags": ["group:condition:treated"]},
  {"tags": ["group:condition:control"]},
  {"tags": []}  # No condition tag
]
Output:  [["data", "treated"], ["data", "control"], ["data", "control"]]

Multiple tags - first alphabetically wins:

rules:
  - type: add_column_group_tag_value
    value: where
    default_value: 'barn'
Sources: [
  {"tags": ["group:where:house", "group:where:firestation"]}
]
Output:  [["data", "firestation"]]  # "firestation" < "house" alphabetically
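
The lookup, including the "first alphabetically wins" tie-break and the default, can be sketched like this. The function is illustrative and only assumes that group tags have the form group:name:value shown above.

```python
def add_column_group_tag_value(data, sources, value, default_value):
    """Append the value of the named group tag, or the default if absent."""
    prefix = f"group:{value}:"

    def cell(source):
        candidates = sorted(
            tag[len(prefix):] for tag in source["tags"] if tag.startswith(prefix)
        )
        return candidates[0] if candidates else default_value

    return [row + [cell(source)] for row, source in zip(data, sources)]


sources = [
    {"tags": ["group:where:house", "group:where:firestation"]},
    {"tags": []},
]
print(add_column_group_tag_value([["data"], ["data"]], sources, "where", "barn"))
# [['data', 'firestation'], ['data', 'barn']]
```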

Use cases:

  • Group samples by experimental condition
  • Extract sample type (single-end/paired-end)
  • Use tags for nested collection organization

add_column_from_sample_sheet_index

Purpose: Retrieve values from sample sheet columns

Parameters:

  • value (int): Sample sheet column index

Example:

rules:
  - type: add_column_from_sample_sheet_index
    value: 0
  - type: add_column_from_sample_sheet_index
    value: 1
Sources: [
  {"columns": [0, 1]},
  {"columns": [2, 3]}
]
Output:  [["moo", 0, 1], ["cow", 2, 3]]

Use cases:

  • Extract metadata from uploaded sample sheets
  • Access additional columns beyond identifiers
  • Incorporate external metadata

2. Filter Rules

Filters remove rows from the data table based on conditions.

add_filter_regex

Purpose: Keep/remove rows matching pattern

Parameters:

  • target_column (int): Column to test
  • expression (string): Regular expression
  • invert (bool, default: false): If true, keep non-matching rows

Keep matching:

rules:
  - type: add_filter_regex
    target_column: 0
    expression: '(a+)'
    invert: false
Input:  [["a", "b", "c"], ["e", "f", "g"]]
Output: [["a", "b", "c"]]

Remove matching:

rules:
  - type: add_filter_regex
    target_column: 2
    expression: '(c+)'
    invert: true
Input:  [["a", "b", "c"], ["e", "f", "g"]]
Output: [["e", "f", "g"]]

Use cases:

  • Filter by sample name pattern
  • Remove control samples
  • Select specific file types

add_filter_count

Purpose: Keep/remove first or last N rows

Parameters:

  • count (int): Number of rows
  • which (enum): first or last
  • invert (bool, default: false): If true, reverse filter

Remove first row:

rules:
  - type: add_filter_count
    count: 1
    which: first
    invert: false  # Remove first, keep rest
Input:  [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["e", "f", "g"], ["h", "i", "j"]]

Keep only last row:

rules:
  - type: add_filter_count
    count: 1
    which: last
    invert: true  # Remove all but last
Input:  [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["h", "i", "j"]]
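
The two examples show that for this filter invert: false drops the counted rows and invert: true keeps only them. A small sketch of that behavior, written to match the examples rather than taken from Galaxy's source:

```python
def add_filter_count(data, count, which, invert=False):
    """Drop (or, with invert, keep only) the first/last `count` rows (sketch)."""
    total = len(data)
    kept = []
    for index, row in enumerate(data):
        if which == "first":
            counted = count > index
        else:
            counted = index >= total - count
        # invert=False keeps the rows that were NOT counted;
        # invert=True keeps exactly the counted rows.
        if counted == invert:
            kept.append(row)
    return kept


rows = [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
print(add_filter_count(rows, 1, "first", invert=False))
# [['e', 'f', 'g'], ['h', 'i', 'j']]
print(add_filter_count(rows, 1, "last", invert=True))
# [['h', 'i', 'j']]
```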

Use cases:

  • Remove header rows
  • Skip first N samples
  • Select specific replicates

add_filter_empty

Purpose: Remove rows with empty cells

Parameters:

  • target_column (int): Column to check
  • invert (bool, default: false): If true, keep only empty

Remove empty:

rules:
  - type: add_filter_empty
    target_column: 0
    invert: false
Input:  [["", "b", "c"], ["a", "b", "c"]]
Output: [["a", "b", "c"]]

Use cases:

  • Remove rows with missing identifiers
  • Clean up sparse data
  • Filter failed extractions

add_filter_matches

Purpose: Exact value matching (case-sensitive)

Parameters:

  • value (string): Exact value to match
  • target_column (int): Column to check
  • invert (bool, default: false): If true, keep non-matching

Example:

rules:
  - type: add_filter_matches
    value: "a"
    target_column: 0
    invert: false
Input:  [["a", "b", "c"], ["e", "f", "g"], ["h", "i", "j"]]
Output: [["a", "b", "c"]]

Important: Exact match only, no partial matches:

rules:
  - type: add_filter_matches
    value: "a"
    target_column: 0
Input:  [["a ", "b", "c"]]  # Note space after "a"
Output: []  # No match - "a " != "a"

Use cases:

  • Filter by specific sample ID
  • Select exact condition matches
  • Boolean filtering (match “true”/“false”)

add_filter_compare

Purpose: Numeric comparisons

Parameters:

  • target_column (int): Column with numeric values
  • value (number): Comparison value
  • compare_type (enum):
    • less_than
    • less_than_equal
    • greater_than
    • greater_than_equal

Example:

rules:
  - type: add_filter_compare
    target_column: 0
    value: 13
    compare_type: less_than
Input:  [["1", "moo"], ["10", "cow"], ["13", "rat"], ["20", "dog"]]
Output: [["1", "moo"], ["10", "cow"]]
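
Cells are strings, so the comparison only works if they are converted to numbers first. The sketch below assumes conversion with `float`; the function and the lookup table are illustrative, not Galaxy's code.

```python
import operator

# Map each compare_type to the matching comparison function.
COMPARATORS = {
    "less_than": operator.lt,
    "less_than_equal": operator.le,
    "greater_than": operator.gt,
    "greater_than_equal": operator.ge,
}


def add_filter_compare(data, target_column, value, compare_type):
    """Keep rows whose numeric cell satisfies the comparison against value."""
    compare = COMPARATORS[compare_type]
    return [row for row in data if compare(float(row[target_column]), value)]


rows = [["1", "moo"], ["10", "cow"], ["13", "rat"], ["20", "dog"]]
print(add_filter_compare(rows, 0, 13, "less_than"))
# [['1', 'moo'], ['10', 'cow']]
```

A plain string comparison would give different results (for example "9" sorts after "13"), which is why the numeric conversion matters.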

Use cases:

  • Filter by quality scores
  • Select samples by replicate number
  • Threshold-based filtering

3. Structural Rules

remove_columns

Purpose: Delete specified columns

Parameters:

  • target_columns (list[int]): Column indices to remove

Example:

rules:
  - type: remove_columns
    target_columns: [0, 1]
Input:  [["a", "b", "c"], ["e", "f", "g"]]
Output: [["c"], ["g"]]

Use cases:

  • Clean up intermediate columns
  • Remove temporary concatenation columns
  • Keep only final identifier columns

sort

Purpose: Sort rows by column value

Parameters:

  • target_column (int): Column to sort by
  • numeric (bool): If true, numeric sort; if false, alphabetic

Alphabetic sort:

rules:
  - type: sort
    numeric: false
    target_column: 0
Input:  [["moo", "cow"], ["meow", "cat"], ["bark", "dog"]]
Output: [["bark", "dog"], ["meow", "cat"], ["moo", "cow"]]

Note: Case-sensitive, uppercase sorts before lowercase

Input:  [["Dog"], ["cat"], ["cow"]]
Output: [["Dog"], ["cat"], ["cow"]]  # "Dog" < "cat" < "cow"
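
The difference between the two sort modes shows up as soon as a column holds numbers as text. A minimal sketch, assuming numeric mode converts cells with `float`; it sorts only the data rows, whereas the real rule would have to reorder the sources in step.

```python
def sort_rows(data, target_column, numeric=False):
    """Return rows sorted by one column, numerically or as plain strings."""
    if numeric:
        return sorted(data, key=lambda row: float(row[target_column]))
    return sorted(data, key=lambda row: row[target_column])


print(sort_rows([["cow"], ["Dog"], ["cat"]], 0))
# [['Dog'], ['cat'], ['cow']]  (uppercase sorts before lowercase)
print(sort_rows([["10"], ["9"], ["100"]], 0, numeric=False))
# [['10'], ['100'], ['9']]
print(sort_rows([["10"], ["9"], ["100"]], 0, numeric=True))
# [['9'], ['10'], ['100']]
```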

Use cases:

  • Alphabetize samples
  • Order by numerical IDs
  • Group similar identifiers together

swap_columns

Purpose: Exchange two column positions

Parameters:

  • target_column_0 (int): First column
  • target_column_1 (int): Second column

Example:

rules:
  - type: swap_columns
    target_column_0: 0
    target_column_1: 1
Input:  [["moo", "cow"], ["meow", "cat"]]
Output: [["cow", "moo"], ["cat", "meow"]]

Use cases:

  • Reorder identifier columns for mapping
  • Fix column order mistakes
  • Prepare for specific mapping requirements

split_columns

Purpose: Split each row into two rows, one built from each column group

Parameters:

  • target_columns_0 (list[int]): First column group
  • target_columns_1 (list[int]): Second column group

Example:

rules:
  - type: split_columns
    target_columns_0: [0]
    target_columns_1: [1]
Input:  [["moo", "cow", "A"], ["meow", "cat", "B"]]
Output: [
  ["moo", "A"],
  ["cow", "A"],
  ["meow", "B"],
  ["cat", "B"]
]

How it works:

  • Each input row produces two output rows:
    • The first keeps the columns in group 0 and drops those in group 1
    • The second keeps the columns in group 1 and drops those in group 0
  • Columns in neither group are copied into both rows (column 2 in the example above)

Use cases:

  • Split paired-end data into forward/reverse
  • Expand multiple samples per row
  • Reshape two-value rows into one value per row before mapping
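
The example output for split_columns (two output rows per input row, with the ungrouped column copied into both) can be reproduced with the sketch below. It is written to match that example and is not taken from Galaxy's source.

```python
def split_columns(data, target_columns_0, target_columns_1):
    """Turn every row into two rows, one per column group (sketch)."""
    result = []
    for row in data:
        row_0, row_1 = [], []
        for index, cell in enumerate(row):
            if index in target_columns_0:
                row_0.append(cell)          # only in the first output row
            elif index in target_columns_1:
                row_1.append(cell)          # only in the second output row
            else:
                row_0.append(cell)          # shared column: copied into both
                row_1.append(cell)
        result.extend([row_0, row_1])
    return result


rows = [["moo", "cow", "A"], ["meow", "cat", "B"]]
print(split_columns(rows, [0], [1]))
# [['moo', 'A'], ['cow', 'A'], ['meow', 'B'], ['cat', 'B']]
```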

Mapping Operations

Mapping operations define how transformed data columns become collection structure. These are the final step that converts tabular data back to collections.

Available Mapping Types

list_identifiers

Purpose: Create list structure with specified nesting levels

Parameters:

  • columns (list[int]): Column indices for identifiers

Single column = simple list:

mapping:
  - type: list_identifiers
    columns: [0]
Data: [["sample1"], ["sample2"]]
Result: list [sample1, sample2]

Two columns = nested list:list:

mapping:
  - type: list_identifiers
    columns: [0, 1]
Data: [["group1", "s1"], ["group1", "s2"], ["group2", "s3"]]
Result: list:list [
  group1 → [s1, s2],
  group2 → [s3]
]

Three columns = list:list:list:

mapping:
  - type: list_identifiers
    columns: [0, 1, 2]

Nesting logic:

  • Column 0 = outermost identifier
  • Column 1 = next level identifier
  • Column 2 = innermost identifier
  • Groups rows by matching outer identifiers
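
The nesting logic amounts to grouping rows by their outer identifiers. The sketch below builds a nested dict whose leaves are row indices (standing in for datasets); `nest_by_columns` is a hypothetical helper, and it does not check for duplicate identifiers.

```python
def nest_by_columns(data, columns):
    """Group rows into nested dicts keyed by the identifier columns, outermost first."""
    tree = {}
    for row_index, row in enumerate(data):
        node = tree
        for column in columns[:-1]:
            # Reuse the existing group for this outer identifier, or start one.
            node = node.setdefault(row[column], {})
        node[row[columns[-1]]] = row_index   # leaf: which row supplies the dataset
    return tree


rows = [["group1", "s1"], ["group1", "s2"], ["group2", "s3"]]
print(nest_by_columns(rows, [0, 1]))
# {'group1': {'s1': 0, 's2': 1}, 'group2': {'s3': 2}}
print(nest_by_columns(rows, [1]))
# {'s1': 0, 's2': 1, 's3': 2}
```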

paired_identifier

Purpose: Add paired collection level

Parameters:

  • columns (list[int]): Single column with paired identifier

Valid identifier values:

  • forward, f, 1, R1 → becomes forward
  • reverse, r, 2, R2 → becomes reverse

Simple paired:

mapping:
  - type: paired_identifier
    columns: [0]
Data: [["forward"], ["reverse"]]
Result: paired {forward, reverse}

Combined with list:

mapping:
  - type: list_identifiers
    columns: [0]
  - type: paired_identifier
    columns: [1]
Data: [
  ["sample1", "forward"],
  ["sample1", "reverse"],
  ["sample2", "forward"],
  ["sample2", "reverse"]
]
Result: list:paired [
  sample1 → {forward, reverse},
  sample2 → {forward, reverse}
]

paired_or_unpaired_identifier

Purpose: Add paired_or_unpaired collection level (allows unpaired single datasets)

Parameters:

  • columns (list[int]): Single column with paired/unpaired identifier

Valid identifier values:

  • All paired values above, plus:
  • unpaired, u → becomes unpaired

Example:

mapping:
  - type: list_identifiers
    columns: [0]
  - type: paired_or_unpaired_identifier
    columns: [1]

Note: If a sample has only forward and no reverse, it becomes unpaired automatically.


tags

Purpose: Apply tags to collection elements

Parameters:

  • columns (list[int]): Columns containing tag values

Example:

mapping:
  - type: list_identifiers
    columns: [0]
  - type: tags
    columns: [1]
Data: [["sample1", "replicate1"], ["sample2", "replicate2"]]
Result: list with tags [
  sample1 (tags: ["replicate1"]),
  sample2 (tags: ["replicate2"])
]

group_tags

Purpose: Apply group tags (format: group:name:value)

Parameters:

  • columns (list[int]): Columns containing group tag values

Example:

mapping:
  - type: list_identifiers
    columns: [1, 0]  # Group by column 1, element ID column 0
  - type: group_tags
    columns: [1]     # Apply as group tag
Data: [["s1", "treated"], ["s2", "control"]]
Result: list:list with group tags [
  treated → [s1 (tags: ["group:treated"])],
  control → [s2 (tags: ["group:control"])]
]

Collection Type Determination

The output collection type is determined solely by the mapping:

# From RuleSet.collection_type property:
list_columns = mapping_as_dict.get("list_identifiers", {"columns": []})["columns"]
collection_type = ":".join("list" for c in list_columns)
if "paired_identifier" in mapping_as_dict:
    collection_type += ":paired" if collection_type else "paired"
if "paired_or_unpaired_identifier" in mapping_as_dict:
    collection_type += ":paired_or_unpaired" if collection_type else "paired_or_unpaired"

Examples:

  • list_identifiers: [0] → list
  • list_identifiers: [0, 1] → list:list
  • list_identifiers: [0] + paired_identifier: [1] → list:paired
  • list_identifiers: [0, 1] + paired_identifier: [2] → list:list:paired
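
To try the rule on concrete mappings, the snippet from RuleSet.collection_type can be wrapped in a standalone function. How `mapping_as_dict` is built from the mapping list is an assumption here (keyed by each entry's type); the rest follows the snippet above.

```python
def collection_type(mapping):
    """Derive the output collection type from a mapping list (sketch)."""
    # Assumption: index the mapping entries by their "type" field.
    mapping_as_dict = {entry["type"]: entry for entry in mapping}
    list_columns = mapping_as_dict.get("list_identifiers", {"columns": []})["columns"]
    result = ":".join("list" for _ in list_columns)
    if "paired_identifier" in mapping_as_dict:
        result += ":paired" if result else "paired"
    if "paired_or_unpaired_identifier" in mapping_as_dict:
        result += ":paired_or_unpaired" if result else "paired_or_unpaired"
    return result


print(collection_type([{"type": "list_identifiers", "columns": [0, 1]}]))
# list:list
print(collection_type([
    {"type": "list_identifiers", "columns": [0]},
    {"type": "paired_identifier", "columns": [1]},
]))
# list:paired
```

Note that tags and group_tags entries do not affect the result; only the identifier mappings do.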

Complete Example: list:record to list:paired

This example demonstrates complex transformation combining multiple rule types:

Goal: Convert a list:record collection whose records have “mother”, “father”, and “child” elements into a list:paired collection, mapping “mother” to “forward” and “child” to “reverse” and dropping “father”.

rules:
  - type: add_column_metadata
    value: identifier0  # Sample identifier
  - type: add_column_metadata
    value: identifier1  # Record type (mother/father/child)
  - type: add_column_regex
    target_column: 2
    expression: 'mother'
    replacement: 'forward'
    allow_unmatched: true  # Leaves others as ""
  - type: add_column_regex
    target_column: 2
    expression: 'child'
    replacement: 'reverse'
    allow_unmatched: true
  - type: add_column_concatenate
    target_column_0: 3  # Result of first regex
    target_column_1: 4  # Result of second regex
  - type: add_filter_empty
    target_column: 5  # Remove rows that didn't match (father)
    invert: false
  - type: remove_columns
    target_columns: [2, 3, 4]  # Clean up intermediate columns

mapping:
  - type: list_identifiers
    columns: [1]  # Sample ID
  - type: paired_identifier
    columns: [2]  # forward/reverse

Transformation steps:

Initial:
  data: [["el1"], ["el2"], ["el3"]]
  sources: [
    {"identifiers": ["samp1", "mother"]},
    {"identifiers": ["samp1", "father"]},
    {"identifiers": ["samp1", "child"]}
  ]

After add_column_metadata (identifier0, identifier1):
  [["el1", "samp1", "mother"],
   ["el2", "samp1", "father"],
   ["el3", "samp1", "child"]]

After first regex (mother → forward):
  [["el1", "samp1", "mother", "forward"],
   ["el2", "samp1", "father", ""],
   ["el3", "samp1", "child", ""]]

After second regex (child → reverse):
  [["el1", "samp1", "mother", "forward", ""],
   ["el2", "samp1", "father", "", ""],
   ["el3", "samp1", "child", "", "reverse"]]

After concatenate (cols 3+4):
  [["el1", "samp1", "mother", "forward", "", "forward"],
   ["el2", "samp1", "father", "", "", ""],
   ["el3", "samp1", "child", "", "reverse", "reverse"]]

After filter empty (col 5):
  [["el1", "samp1", "mother", "forward", "", "forward"],
   ["el3", "samp1", "child", "", "reverse", "reverse"]]

After remove_columns [2, 3, 4]:
  [["el1", "samp1", "forward"],
   ["el3", "samp1", "reverse"]]

Final mapping with list_identifiers [1] and paired_identifier [2]:
  Result: list:paired [
    samp1 → {forward, reverse}
  ]
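
The walk-through can be checked by replaying the rule steps on the same initial table. The helpers below are hypothetical stand-ins for the rules; the regex step assumes the allow_unmatched behavior described above (an empty cell when the pattern does not match), and only the data table is tracked.

```python
import re


def add_identifier(data, sources, level):
    return [row + [src["identifiers"][level]] for row, src in zip(data, sources)]


def regex_replace(data, column, expression, replacement):
    """add_column_regex with a replacement and allow_unmatched: true (sketch)."""
    pattern = re.compile(expression)
    result = []
    for row in data:
        match = pattern.search(row[column])
        result.append(row + [match.expand(replacement) if match else ""])
    return result


def concatenate(data, col_0, col_1):
    return [row + [row[col_0] + row[col_1]] for row in data]


def drop_empty(data, column):
    return [row for row in data if row[column] != ""]


def remove_columns(data, columns):
    return [[c for i, c in enumerate(row) if i not in columns] for row in data]


sources = [{"identifiers": ["samp1", who]} for who in ("mother", "father", "child")]
data = [["el1"], ["el2"], ["el3"]]
data = add_identifier(data, sources, 0)
data = add_identifier(data, sources, 1)
data = regex_replace(data, 2, "mother", "forward")
data = regex_replace(data, 2, "child", "reverse")
data = concatenate(data, 3, 4)
data = drop_empty(data, 5)
data = remove_columns(data, [2, 3, 4])
print(data)  # [['el1', 'samp1', 'forward'], ['el3', 'samp1', 'reverse']]
```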

Rule Composition Patterns

Pattern 1: Extract and Flatten

Goal: Flatten list:paired → list with combined identifiers

rules:
  - type: add_column_metadata
    value: identifier0  # Column 0: outer ID
  - type: add_column_metadata
    value: identifier1  # Column 1: pair ID (forward/reverse)
  - type: add_column_concatenate
    target_column_0: 0
    target_column_1: 1  # Column 2: combined identifier

mapping:
  - type: list_identifiers
    columns: [2]  # Use concatenated column

Pattern 2: Group by Tag

Goal: Reorganize by tag value into nested structure

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_group_tag_value
    value: condition  # Extract "condition" tag
    default_value: "unassigned"

mapping:
  - type: list_identifiers
    columns: [1, 0]  # Group by condition, then sample ID
  - type: group_tags
    columns: [1]     # Apply as group tags

Pattern 3: Filter and Sort

Goal: Select subset and alphabetize

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_filter_regex
    target_column: 0
    expression: '^control_'  # Only controls
    invert: false
  - type: sort
    numeric: false
    target_column: 0

mapping:
  - type: list_identifiers
    columns: [0]

Pattern 4: Parse Filename Structure

Goal: Extract sample info from “sample_123_R1.fastq.gz” format

rules:
  - type: add_column_metadata
    value: identifier0  # Original filename
  - type: add_column_regex
    target_column: 0
    expression: 'sample_(\w+)_R(\d)'
    group_count: 2  # Sample ID and read number
  - type: add_column_value
    value: "_R"
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 2  # "_R" + "1" = "_R1"
  - type: add_column_concatenate
    target_column_0: 1
    target_column_1: 4  # "123" + "_R1" = "123_R1"
  - type: remove_columns
    target_columns: [0, 1, 2, 3, 4]  # Keep only column 5, the final identifier

mapping:
  - type: list_identifiers
    columns: [0]

Pattern 5: Create Paired from Separate Lists

Goal: Combine separate forward/reverse lists into paired

Assumption: Files named like sample1_R1.fastq, sample1_R2.fastq

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: '(.+)_R([12])'
    group_count: 2  # Sample name and read number
  - type: add_column_regex
    target_column: 2
    expression: '1'
    replacement: 'forward'
    allow_unmatched: true
  - type: add_column_regex
    target_column: 2
    expression: '2'
    replacement: 'reverse'
    allow_unmatched: true
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 4
  - type: sort
    numeric: false
    target_column: 1  # Ensure pairs adjacent
  - type: remove_columns
    target_columns: [0, 2, 3, 4]

mapping:
  - type: list_identifiers
    columns: [0]     # Sample ID
  - type: paired_identifier
    columns: [1]     # forward/reverse

Best Practices

1. Plan Column Layout

Before writing rules, sketch the transformations:

Col 0: Original identifier
Col 1: Extracted sample ID (regex)
Col 2: Extracted replicate (regex)
Col 3: Separator "_"
Col 4: Concatenate 1+3
Col 5: Concatenate 4+2 (final identifier; remove cols 0-4 afterwards)

2. Test Incrementally

Add rules one at a time and verify output:

  • Start with metadata extraction
  • Add one transformation
  • Check result
  • Continue

3. Use allow_unmatched Carefully

Set allow_unmatched: true only when a non-matching row is genuinely acceptable:

# BAD - silently fails to extract
- type: add_column_regex
  expression: 'wrong_pattern'
  allow_unmatched: true

# GOOD - errors if pattern doesn't match
- type: add_column_regex
  expression: 'expected_pattern'
  allow_unmatched: false

4. Remove Intermediate Columns

Clean up before mapping:

rules:
  - type: add_column_metadata
    value: identifier0
  # ... many transformations ...
  - type: remove_columns
    target_columns: [0, 2, 3]  # Remove temp columns

mapping:
  - type: list_identifiers
    columns: [0]  # Only final column remains

5. Validate with Filters

Use filters to ensure data quality:

rules:
  - type: add_column_regex
    expression: 'pattern'
    allow_unmatched: false  # Errors if doesn't match
  - type: add_filter_empty
    target_column: 1
    invert: false  # Remove any that became empty

6. Document Complex Rules

Add comments explaining logic:

rules:
  # Extract sample ID from filename "sample_123_R1.fastq"
  - type: add_column_regex
    target_column: 0
    expression: 'sample_(\w+)_R\d'
    group_count: 1  # Add only the captured group, not the whole match

  # Remove original filename column
  - type: remove_columns
    target_columns: [0]

Common Pitfalls

Pitfall 1: Column Indices Shift

Problem: After removing columns, indices change

# WRONG
rules:
  - type: remove_columns
    target_columns: [0]
  - type: add_column_regex
    target_column: 1  # This is now wrong! Column 1 became 0

Solution: Remove columns last, or recalculate indices
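
The shift is easy to see on a single row. The helper below is a hypothetical stand-in for remove_columns.

```python
def remove_columns(data, target_columns):
    """Drop the listed column indices from every row."""
    return [
        [cell for index, cell in enumerate(row) if index not in target_columns]
        for row in data
    ]


before = [["file.fastq", "sample1", "R1"]]
after = remove_columns(before, [0])
print(before[0][1])  # sample1  (column 1 before the removal)
print(after[0][0])   # sample1  (the same value is now column 0)
print(after[0][1])   # R1       (column 1 now holds what used to be column 2)
```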

Pitfall 2: Forgetting Invert Logic

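Problem: The meaning of invert differs between filters and is easy to get backwards

# Keep matching rows (remove non-matching)
- type: add_filter_regex
  expression: 'sample_'
  invert: false  # FALSE means "keep rows that match"

# Remove matching rows (keep non-matching)
- type: add_filter_regex
  expression: 'control_'
  invert: true  # TRUE means "keep rows that do not match"

Clearer thinking: for add_filter_regex and add_filter_matches, invert: false = “keep matches”, invert: true = “remove matches”. For add_filter_empty and add_filter_count, invert: false removes the targeted rows (empty cells, first/last N) and invert: true keeps only them.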

Pitfall 3: Regex Escaping

Problem: Special regex characters not escaped

# WRONG - . matches any character
expression: 'file.fastq'

# RIGHT
expression: 'file\.fastq'

# For literal parentheses
expression: '\(sample\)'
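
When a literal string has to be turned into an expression, Python's `re.escape` produces the escaped form and makes the difference easy to check before pasting the pattern into a rule. This is a helper for authoring rules, not something the DSL does for you.

```python
import re

unescaped = "file.fastq"
# The bare "." matches any character, so an unintended name matches too.
print(bool(re.search(unescaped, "fileXfastq")))   # True

escaped = re.escape(unescaped)
print(escaped)                                    # file\.fastq
print(bool(re.search(escaped, "fileXfastq")))     # False
print(bool(re.search(escaped, "file.fastq")))     # True

print(re.escape("(sample)"))                      # \(sample\)
```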

Pitfall 4: Case Sensitivity

Problem: Filters are case-sensitive

# Doesn't match "Sample1"
- type: add_filter_matches
  value: "sample1"
  target_column: 0

Solution: Use add_filter_regex with a pattern that accepts both cases (e.g., '^[Ss]ample1$'); none of the rules listed above changes the case of a value.

Pitfall 5: Empty Sources After Filtering

Problem: All rows filtered out

rules:
  - type: add_filter_regex
    expression: 'nonexistent'
    invert: false
# Result: Empty collection!

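Solution: Test filters against real identifiers before relying on them. Note that allow_unmatched applies only to add_column_regex and does not stop a filter from removing every row.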


When to Use / When NOT to Use Apply Rules

When to Use Apply Rules

  • Complex identifier parsing (multiple regex extractions)
  • Tag-based restructuring (group by experimental condition)
  • Conditional filtering combined with restructuring
  • Structure transformations not covered by simple tools
  • Multiple transformations needed in one step

When NOT to Use Apply Rules

Operation            | Use This Instead       | Why
Simple filtering     | __FILTER_FROM_FILE__   | Simpler, clearer intent
Basic flattening     | __FLATTEN__            | One-step operation
Sort collection      | __SORTLIST__           | Dedicated tool
Extract element      | __EXTRACT_DATASET__    | Direct operation
Zip two lists        | __ZIP_COLLECTION__     | Simpler syntax
Unzip paired         | __UNZIP_COLLECTION__   | Straightforward
Relabel identifiers  | __RELABEL_FROM_FILE__  | If mapping from file

Comparison Table

Operation    | Simple Tool          | When to use Apply Rules instead
Filter       | Filter Collection    | Need to filter on derived metadata, combine with restructuring
Flatten      | Flatten Collection   | Need control over identifier format, filter simultaneously
Relabel      | Relabel Identifiers  | Need regex-based transformation, derive from existing metadata
Sort         | Sort Collection      | Need to sort by derived values, combine with other operations
Restructure  | N/A                  | Full control over nesting structure from any metadata

Key Insight: Apply Rules is the tool of choice when:

  • Multiple transformations needed in one step
  • Restructuring based on metadata (tags, identifier patterns)
  • Complex identifier manipulation required
  • Standard tools don’t cover the use case

Use Case Examples

Use Case 1: Standard Paired-End RNA-seq

Files: sample1_R1.fastq.gz, sample1_R2.fastq.gz, sample2_R1.fastq.gz, sample2_R2.fastq.gz

Goal: Create list:paired collection

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: '(.+)_R([12])\.fastq\.gz'
    group_count: 2
  - type: add_column_regex
    target_column: 2
    expression: '1'
    replacement: 'forward'
    allow_unmatched: true
  - type: add_column_regex
    target_column: 2
    expression: '2'
    replacement: 'reverse'
    allow_unmatched: true
  - type: add_column_concatenate
    target_column_0: 3
    target_column_1: 4
  - type: sort
    target_column: 1
    numeric: false
  - type: remove_columns
    target_columns: [0, 2, 3, 4]

mapping:
  - type: list_identifiers
    columns: [0]
  - type: paired_identifier
    columns: [1]

Use Case 2: Remove Control Samples

Goal: Filter out samples starting with “control_”

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_filter_regex
    target_column: 0
    expression: '^control_'
    invert: true  # Remove matches = keep non-controls

mapping:
  - type: list_identifiers
    columns: [0]

Use Case 3: Group by Treatment Condition

Goal: Reorganize by “group:condition:*” tag into nested list

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_group_tag_value
    value: condition
    default_value: 'unassigned'

mapping:
  - type: list_identifiers
    columns: [1, 0]  # Group by condition, then sample
  - type: group_tags
    columns: [1]

Use Case 4: Filter by Quality Score Threshold

Assumption: Quality score in sample name like “sample_123_q95”

Goal: Keep only samples with quality >= 90

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: 'sample_\w+_q(\d+)'
    group_count: 1  # Add only the score; without group_count the whole match is added
  - type: add_filter_compare
    target_column: 1
    value: 90
    compare_type: greater_than_equal
  - type: remove_columns
    target_columns: [1]

mapping:
  - type: list_identifiers
    columns: [0]

Use Case 5: Replicate Structure

Files: treatment_rep1, treatment_rep2, control_rep1, control_rep2

Goal: Create list:list [treatment → [rep1, rep2], control → [rep1, rep2]]

rules:
  - type: add_column_metadata
    value: identifier0
  - type: add_column_regex
    target_column: 0
    expression: '(.+)_(rep\d+)'  # Capture "rep1", not just "1", to match the goal
    group_count: 2
  - type: sort
    target_column: 1
    numeric: false
  - type: remove_columns
    target_columns: [0]

mapping:
  - type: list_identifiers
    columns: [0, 1]  # Condition, then replicate

API Usage

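import requests

# galaxy_url, api_key, history_id and collection_id are placeholders for your own values
inputs = {
    "input": {"src": "hdca", "id": collection_id},
    "rules": {
        "rules": [...],
        "mapping": [...]
    }
}
payload = {"tool_id": "__APPLY_RULES__", "history_id": history_id, "inputs": inputs}
response = requests.post(f"{galaxy_url}/api/tools", json=payload, headers={"x-api-key": api_key})
response.raise_for_status()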
