Home Research

Iwc Tabular Operations Survey

Corpus survey of tabular tools and operations across IWC workflows; map for the operation pattern hierarchy on row/column data manipulation.

Raw
Revised
2026-05-02
Rev
2
component

IWC tabular operations survey

Source corpus: /Users/jxc755/projects/repositories/workflow-fixtures/iwc-format2/, 120 .gxwf.yml files across 20 domain directories. Counts below are step-occurrence counts produced by grep -rh "^[[:space:]]*- tool_id:" --include="*.yml" | sort | uniq -c (i.e. one count per workflow step that uses the tool, including subworkflow steps and the trailing unique_tools block — so the magnitudes are roughly 2x the count of distinct invocations a user authored, but the ranking is faithful). All file:line citations are into iwc-format2/.

The corpus is heavily skewed toward sars-cov-2 reporting, microbiome amplicon, VGP assembly QC, and scRNA-seq metadata wrangling — those four pull the tabular tooling. Pure read/align/call workflows (read-preprocessing/, most of variant-calling/) barely touch tabular operations directly; their tabular work happens inside multiqc rollups, which are out of scope.

1. Tool inventory

Ranked by step occurrences. “DT” = devteam, “BG” = bgruening, “IUC” = iuc, “NML” = nml. Display names and short forms shown after first introduction.

1a. Galaxy “core” tabular tools (no toolshed owner; bundled with Galaxy)

Stepstool_idShort nameOperation
127Cut1Cut1 (Cut columns from a table)Column projection
33Filter1Filter1 (Filter data on any column using simple expressions)Row filter
47Grep1Grep1 (Select lines that match an expression)Row filter (regex)
25sort1sort1 (Sort)Sort
21Remove beginning1Remove beginningHeader strip
19Grouping1Grouping1 (Group data by a column)Group/aggregate
7Paste1Paste1 (Paste two files side by side)Column-bind
5addValue/1.0.1 (DT)Add a columnConstant column add
4cat1ConcatenateRow-bind

First citations:

  • Cut1sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782 (columnList: c4,c6,c7,c13,c14,c15,c16,c17,c18,c19,c21,c22,c23,c26,c24,c25,c20, delimiter: T).
  • Filter1sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545 (cond: c4=='PASS' or c4=='.', header_lines: "1").
  • Grep1comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687 (pattern: ^>, invert: "", keep_header: false).
  • sort1VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:559.
  • Grouping1microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:324 (groupcol: "6", operations: [{optype: length, opcol: "6"}]).
  • Remove beginning1microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:186.
  • Paste1genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:733 (two inputs, delimiter: T).

1b. bgruening text_processing suite (“tp_*”)

By far the largest single family. Same upstream tool collection (text_processing repo, owner bgruening), individual tools bucketed by operation. Multiple version pins coexist in the corpus (9.3+galaxy1, 9.5+galaxy0, 9.5+galaxy2, 9.5+galaxy3) — totals below sum across versions.

Steps (all versions)tool stemOperation
195tp_awk_toolFree-form awk
66tp_find_and_replaceRegex/string replace (whole-line)
39tp_replace_in_lineRegex replace in line
43tp_grep_toolRow filter (regex; vs core Grep1)
16tp_sed_toolFree-form sed
15tp_catRow-bind
12tp_text_file_with_recurring_linesHeader/template lines (constant prefix)
11tp_replace_in_columnPer-column substitution
~8tp_sort_header_toolSort while preserving header
~6tp_sorted_uniqDedupe (sort+uniq combined)
~5tp_easyjoin_toolJoin on key
~4tp_head_toolFirst-N row truncate
~3tp_uniq_toolDedupe
~2tp_multijoin_toolMulti-file outer join

Representative full IDs (first occurrence in corpus):

  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_awk_tool/9.5+galaxy3comparative_genomics/hyphy/hyphy-core.gxwf.yml:169. 82 steps at 9.3+galaxy1, 60 at 9.5+galaxy3, 30 at 9.5+galaxy0, 27 at 9.5+galaxy3 (sars-cov-2 cluster), 13 at 9.5+galaxy2 — that’s the version-pin spread.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/9.5+galaxy3read-preprocessing/short-read-quality-control-and-trimming.gxwf.yml and elsewhere; 38 steps total at this version, plus 16 at 9.5+galaxy0 and 12 at 9.3+galaxy1. Example regex spec: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:390-407 (compound find_pattern reflowing seven ,-collapsed datamash columns into seven plain columns; is_regex: true, skip_first_line: true).
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-variant-calling/pe-artic-variation.gxwf.yml:973.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/9.3+galaxy1sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:601 (5 invocations in this single file alone — see §3).
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_text_file_with_recurring_lines/9.5+galaxy3comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:663.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sed_tool/9.5+galaxy3sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:155 (5 sed invocations in this one file).
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_column/9.5+galaxy3microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:942. Older 1.1.3 pin survives at sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation.gxwf.yml:192.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/9.5+galaxy3sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:627.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sorted_uniq/9.5+galaxy3comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:773.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_uniq_tool/9.5+galaxy3VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.gxwf.yml:2476 (inside a subworkflow).
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_head_tool/9.5+galaxy0microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:495.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/9.5+galaxy3 — used 43x across four pins (1.1.1, 9.3+galaxy1, 9.5+galaxy2, 9.5+galaxy3 — last dominates); coexists with core Grep1 (47x). See §3.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_multijoin_tool/9.3+galaxy1microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:796.
  • toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sort_header_tool/9.3+galaxy1sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:923.
  • toolshed.g2.bx.psu.edu/repos/bgruening/split_file_on_column/tp_split_on_column/0.6microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:369 (split a tabular into a collection by a key column).

1c. devteam column tools

Stepstool_idOperation
93toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1Computed column (Python expressions over cN)
56toolshed.g2.bx.psu.edu/repos/devteam/add_value/addValue/1.0.1Constant column add (older idiom; survives heavily in VGP)

column_maker/Add_a_column1/2.1 first occurrence: sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:344. Representative state at sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:294-328 — two expressions: entries in one step, one inserting c7 as AFcaller at position I (insert), one replacing position 7 with round((c18 + c19) / c6, 6) named AF. The error_handling: { auto_col_types: true, fail_on_non_existent_columns: true, non_computable: { action: --fail-on-non-computable } } block is a stable boilerplate across the corpus — pattern page should call it out as the recommended-default tuple.

1d. iuc / nml tabular tools

Stepstool_idOperation
73toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/{1.1.0,1.8+galaxy0,1.9+galaxy0}Group/aggregate; collapse
44toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/5.1.0Concatenate a collection into a single tabular (with add_name/one_header)
32toolshed.g2.bx.psu.edu/repos/iuc/collection_column_join/collection_column_join/0.0.3Wide pivot: outer-join a collection of 2-col tables on identifier → one row-per-id, one col-per-element
16toolshed.g2.bx.psu.edu/repos/iuc/query_tabular/query_tabular/3.3.2Arbitrary SQL (SQLite) over one or more tabulars
11toolshed.g2.bx.psu.edu/repos/iuc/filter_tabular/filter_tabular/3.3.1SQL-or-line-filter pre-processor
~6toolshed.g2.bx.psu.edu/repos/iuc/table_compute/table_compute/1.2.4+galaxy{1,2}Pandas-style row/col reductions and matrix ops
~3toolshed.g2.bx.psu.edu/repos/iuc/biom_convert/biom_convert/...Format conversion (BIOM↔tabular, tangentially)

First citations:

  • datamash_ops/1.8+galaxy0sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333. Representative state (collapse a per-effect duplication): grouping: 1,2,3,4,5,6,7,8,9,10, operations: [{op_name: collapse, op_column: "11"}, ..., {op_name: collapse, op_column: "17"}] — 7 collapses across columns 11-17 in one step. Lines 333-373.
  • datamash_ops/1.9+galaxy0 (newer pin) — VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:746.
  • collapse_collections/collapse_dataset/5.1.0sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414 (filename: { add_name: true, place_name: same_multiple }, one_header: true). This is the Galaxy idiom for “flatten a collection back to a single tabular while injecting the element identifier as a leading column” — one_header strips per-file headers but keeps the first.
  • collection_column_join/0.0.3amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:202 (identifier_column: "1", fill_char: "0", has_header: "0", old_col_in_header: true).
  • query_tabular/3.3.2amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:33. SQL example at lines 57-64: SELECT c1, c2, c3, c3 * 100 / SUM(c3) OVER() AS relative_abundance FROM t1; — window-function relative abundance, in one tool. The tables_0|table input shape and input_opts: { linefilters: [{filter: {filter_type: comment, comment_char: "35"}}, ...] } for skipping #-prefixed lines is the recurring boilerplate.
  • filter_tabular/3.3.1genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:657. Used as a lightweight column-projection / regex-line-filter alternative to query_tabular when no SQL is needed.
  • table_compute/1.2.4+galaxy2epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:265 (mode: matrixapply, matrixapply_func: { vector_op: min }, dimension: "0" — column-wise min reduction). This is the only tool in the corpus that surfaces pandas-style reductions explicitly.

1e. Tabular-adjacent / built-ins worth flagging

  • wc_gnu (72 steps) — line-count, frequently downstream of a Filter/awk to feed param_value_from_file (e.g. epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:301).
  • __APPLY_RULES__ (22), __FLATTEN__ (11), __RELABEL_FROM_FILE__ (39), __FILTER_FROM_FILE__ (20), __FILTER_EMPTY_DATASETS__ (64), __EXTRACT_DATASET__ (46), __MERGE_COLLECTION__ (12) — Galaxy collection ops; out of scope per task (“not row/column ops”) but they bracket the tabular sections heavily.
  • compose_text_param/0.1.1 (99), pick_value/0.2.0 (85), param_value_from_file (154), map_param_value/0.2.0 (92), collection_element_identifiers/0.0.2 (99) — the metadata-wrangling cluster around tabular steps; not tabular themselves.
  • multiqc/1.33+galaxy0 (31) and tooldistillator_summarize/1.0.4+galaxy0 (11) — produce tabular outputs but are reporting tools; out of scope.

1f. Notable absences

Searched --include="*.yml" for these; zero hits in the corpus:

  • csvtk/* — none. Rich CSV/TSV swiss-army wrapper exists in toolshed but no IWC workflow uses it.
  • datamash_transpose/*, datamash_reverse/* — none. Datamash’s transpose/reverse subcommands are not surfaced through any IWC workflow; transpose is done either via table_compute (rare) or implicit in the wide-pivot collection_column_join idiom (§5).
  • No dedicated CSV format converter (no tab_to_csv / csv_to_tab); when conversion is needed it goes through awk.
  • No bedtools-as-tabular intersect/sort outside of genuine BED contexts (excluded by scope anyway).

2. Operation inventory

Each operation, ranked by visible IWC frequency, with 2-3 file:line examples each.

2a. Filter rows (very common; ~80+ instances)

  • Core Filter1 with Python-expression over cN columns: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545 (cond: c4=='PASS' or c4=='.', header_lines: "1").
  • Filter1 driven by a runtime-generated rule string fed from another step: epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:320-336 (cond: { __class__: ConnectedValue } from generate filter rule/out1).
  • Regex/grep — split between core Grep1 and tp_grep_tool (see §3 for the redundancy). Core: comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687 (pattern: ^>).
  • SQL filter with side-effect of column projection: query_tabular at amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:33.
  • Filter via awk pattern: pervasive — see §2g and the code: greps in §5.

2b. Column projection / cut (very common; ~127 Cut1 + N query_tabular)

  • Cut1 with comma-separated cN list: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782 (columnList: c4,c6,c7,c13,...,c20, delimiter: T). Cut1 is also used to reorder columns by listing them out of order (note c20 last after c26,c24,c25 in that example).
  • query_tabular for column projection + computed columns in one shot: amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:57-64.
  • filter_tabular for projection without SQL: amplicon/amplicon-mgnify/mgnify-amplicon-taxonomic-summary-tables/mgnify-amplicon-summary-tables.gxwf.yml:203.

2c. Computed column / per-row arithmetic (~93 column_maker + many awk)

  • Add_a_column1/2.1 with multiple expressions in one step (insert and replace): sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:316-329 (AFcaller at insert pos 8, AF = round((c18 + c19) / c6, 6) at replace pos 7).
  • String-concatenation new column: same file lines 462-472 (c5 + '>' + c6 named change, c3 + ':' + c19 named change_with_pos).
  • Per-row arithmetic via awk: see §2g.

2d. Sort (~25 core sort1, ~8 tp_sort_header_tool)

  • Core sort1 first occurrence: VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:559 (inside subworkflow).
  • Header-preserving sort: tp_sort_header_tool/9.3+galaxy1 at sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:923 and :950 — used immediately before downstream joins.

2e. Group / aggregate (~73 datamash_ops, ~19 Grouping1)

  • Datamash with multi-column grouping: plus several op_names in one step: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-370 (10-col grouping, 7 collapse columns) and :562-596 (single-col group, 5 ops: countunique × 3, min, max, collapse).
  • Grouping1 (older Galaxy tool): microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:324 (groupcol: "6", operations: [{optype: length, opcol: "6"}]) and :1087 (optype: cat, i.e. concatenate group members).

2f. Join (~5 tp_easyjoin_tool instances visible at top-level, ~2 tp_multijoin_tool)

  • Two-file join on key columns: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:601-627 (column1: "20", column2: "1", empty_string_filler: "0", header: true, jointype: " " — note the single space string for default outer join, not --outer). Five easyjoin steps in this one workflow alone (lines 601, 722, 752, 799, 1059).
  • Multi-file join: microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:796 (tp_multijoin_tool/9.3+galaxy1).

2g. awk (free-form) (~195 step instances)

Effectively the swiss-army knife. Several recurring sub-shapes:

  • Header injection (constant prefix): microbiome/mags-building/MAGs-generation.gxwf.yml:1090-1096
    BEGIN {OFS="\t"; print "genome\tcompleteness\tcontamination"}
    NR > 1 {
        if ($1 !~ /\.fasta$/)
            $1 = $1 ".fasta"
        print $1, $2, $3
    }
    Same one-liner shape at VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:855 ('BEGIN{print "Metric\tAlternate"}; {print}; ') and :1631 (Metric\tPrimary).
  • BED triple synthesis from a 3-column input: amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:222 and :268'BEGIN {OFS="\t"} {print $1, $2 - 1, $3, "forward", "1", "+"}' vs :245/:291 "reverse", "1", "-". Replicated verbatim inside the mgnify-amplicon-pipeline-v5-complete.gxwf.yml:2101,2124,2147,2170 subworkflow embeddings.
  • Long-format taxonomy splitter: amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:101-135split($3, taxonomy, ";") then dispatch by ^sk__/^k__/^p__/^c__/^o__/^f__/^g__/^s__ prefix into the 8 taxonomic-rank columns. Re-implemented (similar but not identical) at amplicon/amplicon-mgnify/mgnify-amplicon-taxonomic-summary-tables/mgnify-amplicon-summary-tables.gxwf.yml:241,281,336 and mgnify-amplicon-pipeline-v5-complete.gxwf.yml:5697,5737,5792,6130,6170.
  • FASTQ id sanitization (technically in scope as a row-text op): amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-complete/mgnify-amplicon-pipeline-v5-complete.gxwf.yml:407NR % 4 == 1 { gsub(/[ \/]/, "-", $0) } { print }.
  • Inline relabel (replace whole record with a counter): microbiome/binning-evaluation/MAGs-binning-evaluation.gxwf.yml:433'{gsub( $0 ,"sample_" (NR-1)); print}'.

2h. String/regex replace at the line/column level (~66 tp_find_and_replace, ~39 tp_replace_in_line, ~11 tp_replace_in_column)

  • tp_find_and_replace with multiple sequenced patterns in one step: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:389-407 (compound regex reflow + a (GroupBy|collapse)\(([^)]+)\)$2 strip). Demonstrates the convention skip_first_line: true for the first pass and false for the second.
  • tp_replace_in_column with column_replace: "16", delimiter: tab, pass_comments: "#", skip_lines: "1", unknowns_strategy: skip: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:281-289.

2i. Concatenate / row-bind (~15 tp_cat, ~4 cat1, ~44 collapse_dataset)

  • tp_cat for two-file concat: sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:627.
  • collapse_dataset for collection→tabular concat with header dedup: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414 (one_header: true, add_name: true, place_name: same_multiple). This is the dominant idiom — 44 steps; cf. only 15 tp_cat and 4 cat1.

2j. Dedupe (~6 tp_sorted_uniq, ~3 tp_uniq_tool)

  • tp_sorted_uniq first hit: comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:773 and :1293 (in the workflow’s unique_tools block).
  • tp_uniq_tool: VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.gxwf.yml:2476.

2k. Header strip / take first N (~21 Remove beginning1, ~4 tp_head_tool)

  • Remove beginning1: microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:186 and :200 (used twice in succession).
  • tp_head_tool: microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:495,633.

2l. Pivot / transpose

Sparse — and the corpus shape is itself a finding. No datamash_transpose invocations anywhere. table_compute does column reductions but is not used as a transpose. The dominant wide-pivot idiom is collection_column_join: take a collection of two-column (id, value) tables and outer-join them on the id column to produce one row-per-id, one-column-per-element. Examples at amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:202, microbiome/pathogen-identification/nanopore-pre-processing/Nanopore-Pre-Processing.gxwf.yml:823, microbiome/mags-building/MAGs-generation.gxwf.yml:1483,1544,1656, microbiome/mag-genome-annotation-parallel/MAG-Genome-Annotation-Parallel.gxwf.yml:811. The collection element identifier is the new column header (old_col_in_header: true).

Long-pivot (wide → long) is done with awk; e.g. amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:101-135 reshapes a ;-delimited taxonomy column to 8 parallel columns — itself wide-from-long via awk dispatch, reverse direction.

2m. Format conversion (tabular flavors)

Almost none seen. biom_convert (24 step occurrences across some amplicon workflows) is the only dedicated converter, and it converts between BIOM and TSV — adjacent to scope. Otherwise, conversions between TSV/CSV/BED-like flavors are done with awk + Cut1 ad hoc; no pure converter pattern exists.

2n. Count / summarize (~72 wc_gnu)

wc_gnu is overwhelmingly used to feed param_value_from_file (line-count → integer parameter). E.g. epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:301-313 (include_header: false, options: [lines]).

2o. Sample / random subset

Not visible at the tabular layer in any sampled file; sampling happens upstream at FASTQ level (seqtk_sample), not on tabular outputs.

2p. Side-by-side bind / paste

Paste1 (7 instances): genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:733 and microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:611.

3. Cross-tab (tool × operation; redundancy hotspots)

OperationTools that cover it (corpus-observed)Recommendation lean
Filter rows (column expression)Filter1, query_tabular, filter_tabular, awkFilter1 for one-shot; query_tabular only when SQL semantics needed
Filter rows (regex)Grep1 (47), tp_grep_tool (43), awkReal redundancy — 47 vs 43 split with no semantic distinction visible
Cut/project columnsCut1 (127), query_tabular, filter_tabular, Paste1+Cut1 chainsCut1 dominates; use query_tabular for project+compute fused
Computed columnAdd_a_column1 (93), awk, query_tabularAdd_a_column1 if the expression is short; awk if the row needs a multi-line decision tree (see §2g taxonomy splitter)
Sortsort1, tp_sort_header_tooltp_sort_header_tool whenever the input has a header — sort1’s header handling is implicit
Group/aggregatedatamash_ops (73), Grouping1 (19), query_tabulardatamash_ops is the canonical choice; Grouping1 survives in older microbiome workflows
Join on keytp_easyjoin_tool, tp_multijoin_tool, query_tabular (SQL JOIN), collection_column_join (collection-shape)tp_easyjoin_tool for two-file 1:1; collection_column_join for the collection→wide-table case
Header injectionawk BEGIN-block, tp_text_file_with_recurring_lines+tp_catawk is far more frequent (see §5 idiom #1)
Header stripRemove beginning1, awk NR > 1, tp_head_toolRemove beginning1 is purpose-built and dominant
Concatenate (row-bind)tp_cat, cat1, collapse_dataset (collection→single)collapse_dataset for collections; tp_cat for plain two-file
Dedupetp_sorted_uniq, tp_uniq_tool, datamash_ops collapse + counttp_sorted_uniq for line-level; datamash_ops for key-aware dedupe
Replace texttp_find_and_replace (66), tp_replace_in_line (39), tp_replace_in_column (11), tp_sed_tool (16), awktp_replace_in_column when the substitution is per-column (preserves other columns); tp_find_and_replace for whole-line; the rest are all redundant flavors of “regex replace”
Wide pivotcollection_column_join, awkcollection_column_join for the collection-to-table case (only viable use); no tool for wide↔long generic
Transposetable_compute (rare)No good IWC-attested option; gap

Tool-anchored deep-dive candidates (one tool covers many operations, complex parameterization):

  • awk (tp_awk_tool, 195 steps across all versions) — by far the highest-leverage page. Header injection, BED synthesis, column reshape, regex filter, taxonomy splitting all condense to awk one-liners or short blocks. The code: |- field carries a Bash-like single-quoted string OR a multiline awk program; both shapes are present (compare mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:222 quoted-string form vs mapseq-to-ampvis2.gxwf.yml:101 multiline |- form).
  • datamash_ops (73 steps) — the only group/aggregate path that scales beyond toy Grouping1. The grouping: field is a comma-separated string, operations: is a list of {op_name, op_column} pairs, both header_in/header_out are independent toggles, and need_sort: matters for correctness. Worth a dedicated page.
  • column_maker / Add_a_column1 (93 steps) — Python expressions over cN, plus the error_handling boilerplate, plus the add_column.mode: I/R/"" (insert/replace/append) mini-DSL. Worth a page; reviewers will reach for it constantly.
  • query_tabular (16 steps) — niche but powerful (SQL window functions, multi-table JOINs). Worth a page so users don’t reach for awk when SQL is genuinely cleaner.
  • collection_column_join (32 steps) — the wide-pivot idiom, and unobvious unless you already know about it. Worth a page.

4. Candidate operation-pattern boundaries

Proposed operation pages, each scoped tightly. Where a candidate is weak, I say so.

  1. tabular-filter-by-column-valueFilter1 with cond: cN == 'X' or cN > Y. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545, epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:320, sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:276. Tools: Filter1. Why: highest-frequency single operation; the header_lines: parameter is an easy-to-miss correctness lever; the cond: mini-language is non-obvious (Python with cN columns). Keep.

  2. tabular-filter-by-regex (or tabular-grep) — Cover both Grep1 and tp_grep_tool and resolve the redundancy. Cite comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687. Keep, but the page must take a position on which to recommend (see open question Q1).

  3. tabular-cut-and-reorder-columnsCut1 with columnList: cN,cM,..., including the use of out-of-order lists for column reordering. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782 (reorder), :830, :878. Note delimiter: T constant. Keep.

  4. tabular-compute-new-columncolumn_maker/Add_a_column1 with the error_handling boilerplate and add_column.mode: I/R/"" DSL. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:316-329, :462-472, consensus-from-variation.gxwf.yml:344. Keep — this is one of the most foot-gun-prone tools in IWC due to silent column-type coercion when auto_col_types: false.

  5. tabular-group-and-aggregate-with-datamashdatamash_ops with multi-column grouping:, sequenced operations:. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-373 (collapse-by-many), :562-596 (count/min/max in one step), :632-657 (single-op countunique). Keep — high-density idiom page.

  6. tabular-join-on-keytp_easyjoin_tool with column1/column2, empty_string_filler, jointype (note the leading-space convention for outer). Cite variation-reporting.gxwf.yml:601, :722, :1059. Mention tp_multijoin_tool for >2 inputs and query_tabular for SQL JOIN. Keep.

  7. awk-in-galaxy (deep tool-anchored page) — tp_awk_tool quoted vs multiline code: shapes, BEGIN {OFS="\t"} rituals, NR > 1 header skip, gsub, split. Cite §2g and §5 examples. Keep — must be a deep page; see open question Q2 about depth.

  8. collection-to-wide-table-with-collection_column_join — the wide-pivot idiom. Cite mapseq-to-ampvis2.gxwf.yml:202, mags-building/MAGs-generation.gxwf.yml:1483,1544,1656, mag-genome-annotation-parallel/MAG-Genome-Annotation-Parallel.gxwf.yml:811. Keep — non-obvious without a worked example.

  9. collection-to-single-tabular-with-collapse_dataset — sibling to the above for the row-bind direction. Cite variation-reporting.gxwf.yml:414, mapseq-to-ampvis2.gxwf.yml:178. The add_name/one_header/place_name triad is the foot-gun. Keep.

  10. tabular-sql-with-query_tabular — Cite mapseq-to-ampvis2.gxwf.yml:33 (window function), mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:137. The tables[].input_opts.linefilters and tbl_opts.column_names_from_first_line boilerplate is non-trivial. Keep — this page exists to narrow when query_tabular is the right reach (SQL windows, JOINs), not to evangelize it.

Drop / merge candidates (don’t deserve their own page):

  • tabular-sortsort1 and tp_sort_header_tool are simple enough to roll into a one-line note inside other pages. Possible exception: a one-paragraph stub disambiguating the two.
  • tabular-dedupe — same; tp_sorted_uniq and tp_uniq_tool are thin wrappers, low pitfall density.
  • tabular-row-bind — collapse into the collapse_dataset page; standalone tp_cat doesn’t merit a page.
  • tabular-format-convert — corpus is too sparse. Document as a gap, not a page (see §6 Q5).
  • tabular-pivot-wide-to-long — no concrete IWC pattern; corpus does this case-by-case in awk. Don’t write a page until there’s signal.
  • sed-in-galaxytp_sed_tool (16) is dwarfed by tp_find_and_replace (66) and awk (195). Cover sed as a one-section note inside the awk page rather than a sibling page.
  • tabular-header-stripRemove beginning1 is one-parameter (num_lines); too thin for a standalone.

5. Surprising or recurring idioms

  1. The BEGIN {OFS="\t"; print "header\there\there"} { print rows } ritual. Constant across families: microbiome/mags-building/MAGs-generation.gxwf.yml:1090, VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:855, :1631 (Metric\tPrimary vs Metric\tAlternate — same workflow, two haplotypes, identical rite). Authors reach for awk to attach a header rather than concatenating a constant header file with tp_cattp_text_file_with_recurring_lines exists for the latter but is rarely used (12 step instances corpus-wide vs awk’s 195).

  2. Datamash → Find&Replace round-trip to emulate “collapse with delimiter X”. sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-373 collapses 7 columns; :375-407 then runs tp_find_and_replace with a 7-group regex \t([^\t,]+),[^\t]+\t... to keep only the first comma-delimited member of each collapsed cell. This is “argmax in datamash” implemented as “collapse-then-regex-trim-to-first” because datamash doesn’t have a “first” op accessible through this UI. Pattern page should warn this is the pattern, and that getting the regex group count right is the primary failure mode.

  3. The same workflow uses Filter1 AND query_tabular for distinct cases. sars-cov-2-variation-reporting/variation-reporting.gxwf.yml uses Filter1 (line 545) for the simple c4=='PASS' case but reaches for datamash, easyjoin, find&replace chains for the fan-out/fan-in. There is a real implicit decision boundary: Filter1 for one-shot row filters; switch tools when the operation needs sort-aware or join-aware semantics.

  4. collection_column_join as the only attested wide-pivot. Six workflows use it, all with the same shape: a collection of (id, value) tabulars (one per sample/element) joined on column 1 with fill_char: "0". microbiome/mags-building/MAGs-generation.gxwf.yml:1483,1544,1656 does this three times in one workflow for three different metric families. There’s no pivot_table-style tool in the corpus; this collection-shape join is the workflow author’s substitute, and only works because the upstream produces one (id, value) file per element.

  5. collapse_dataset with add_name: true, place_name: same_multiple, one_header: true is the canonical “merge a per-sample collection into a single annotated tabular”. sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414, microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:553 (with collection_column_join), microbiome/host-contamination-removal/.... Without the one_header: true you get duplicated header rows; without add_name: true you lose the per-row sample identity. Both bugs are silent.

  6. Version pin sprawl in the awk tool is the leading cause of “same idiom shows up four times”. tp_awk_tool exists in the corpus at 9.3+galaxy1 (82 steps), 9.5+galaxy0 (30), 9.5+galaxy2 (13), 9.5+galaxy3 (60+27 in two clusters). Same parameter shape across all of them. Reviewer-facing pattern: pick the highest pin currently in any live workflow and discourage downgrading; do not block PRs for older pins on cleanup-pass grounds.

6. Open questions

  • Q1. Grep1 (47) vs tp_grep_tool (43) — semantic difference real? Both take pattern, invert, keep_header. Suggest the regex page recommend one and demote the other; need your call which.
  • Q2. awk-in-galaxy page depth: one page covering all 195 invocations, or split into 4 sub-pages (awk-header-injection, awk-bed-synthesis, awk-taxonomy-split, awk-relabel)? Lean: one page with idiom sections; split only if frontmatter cross-linking gets noisy.
  • Q3. Should Add_a_column1 page warn against auto_col_types: false? Many corpus uses set it true, some false (variation-reporting.gxwf.yml:454); silent string-vs-numeric coercion is a real bug source. Need your call on prescriptiveness.
  • Q4. Is query_tabular deep-dive in scope for this hierarchy or its own thing? It overlaps Galaxy’s broader “compute over tabular” story (R, Python, csvtk-shaped). Lean: keep it in this hierarchy as the SQL operation.
  • Q5. Tabular format conversion is genuinely sparse (no tab_to_csv, no csv_to_tab ops; biom_convert is the only thing close). Write a one-paragraph “gap note” page or skip entirely?
  • Q6. Older tool IDs (Grouping1, cat1, addValue/1.0.1, Remove beginning1, Paste1, sort1) — do pages mention them as legacy or only the modern equivalent? Lean: mention with a “newer alternative” pointer; corpus still uses them so doc-blindness is wrong.
  • Q7. Out-of-scope in this survey but adjacent: multiqc and tooldistillator_summarize produce tabular outputs that downstream tabular tools then chew on. Worth a “tabular sources” cross-reference page or skip?
  • Q8. Should the wide-pivot page be collection-to-wide-table-with-collection_column_join (tool-anchored) or tabular-pivot-collection-to-wide (operation-anchored with the tool as recommendation)? Lean: operation-anchored title, tool-anchored content.

7. Decisions (2026-04-30)

Resolved via AskUserQuestion after this survey landed. Pinned here so the next subagent inherits them without re-litigation.

  • Naming axis (Q8 + general). Operation-anchored page names. Tool-anchored content is fine inside an operation-named page. Even the awk-split sub-pages get operation names (tabular-prepend-header, tabular-synthesize-bed-from-3col, tabular-split-taxonomy-string, tabular-relabel-by-row-counter); awk is the implementation, not the title.
  • awk page depth (Q2). Split into 4-5 operation-named sub-pages per the bullet above. No single awk-in-galaxy umbrella page; cross-link the awk-as-recipe sub-pages from a §Recipes line in any operation page that uses awk.
  • Grep1 vs tp_grep_tool (Q1). Recommend tp_grep_tool. Demote Grep1 to a “legacy alternative” footnote. Consistency with the rest of the tp_* family wins over slight corpus-frequency edge of Grep1.
  • Format-conversion gap (Q5). Skip. Corpus-first principle: no exemplar = no page. The §2m gap note in this survey stands as the only record.
  • auto_col_types (Q3). The tabular-compute-new-column page prescribes a strict structured rule:
    • Always set fail_on_non_existent_columns: true (51/51 corpus instances).
    • non_computable.action: --fail-on-non-computable is the dominant choice (49/51); the two --skip-non-computable exceptions (consensus-from-variation.gxwf.yml:364, :402) are intentional, for BED-coordinate arithmetic where some rows are legitimately non-numeric.
    • auto_col_types is per-expression-kind:
      Expression kindauto_col_types
      Arithmetic on raw cN ((c18+c19)/c6, round(...))true
      Pure string concat (c5 + '>' + c6)false
      Arithmetic with explicit casts (int(cN), float(cN))false
      Mixedsplit into two expressions: entries with different settings
    • Corpus distribution: 48 true / 3 false. Cite variation-reporting.gxwf.yml:307-329 (true, raw-cN arithmetic), :438-475 (false, string concat), and consensus-from-variation.gxwf.yml:343-378 (false, explicit-cast arithmetic) as the canonical triple.
    • Note on YAML shape: expressions: is nested under tool_state.ops.expressions (with header_lines_select: yes|no as sibling). error_handling is a top-level sibling of ops, not nested inside it. The pattern page must show this shape; flat expressions: does not roundtrip.
  • Legacy tool IDs (Q6). Pages name the modern tool primarily; include a short “Legacy alternative” footnote pointing to the old ID (Grouping1, cat1, addValue/1.0.1, Remove beginning1, Paste1, sort1). Reading old IWC workflows must remain possible.
  • query_tabular (Q4). Leaf in this tabular hierarchy as tabular-sql-query. Scope narrowly to “when SQL is the right reach” — window functions, multi-table JOINs, project+compute fused. Cross-link from filter / join / compute leaves; do not evangelize.
  • Tabular-source cross-ref (Q7). Deferred. If a Mold (e.g. summarize-galaxy-tool) later needs to point to multiqc/tooldistillator-as-tabular-source context, write the page then.

Incoming References (24)

  • Galaxy: tabular patternsrelated note— Use this MOC to choose corpus-grounded Galaxy tabular transformation patterns.
  • Collection: harmonize by sortlist from identifiersrelated note— Use SORTLIST with sort_type:file to reorder one collection by another collection's identifiers.
  • Manifest to mapped collection lifecyclerelated note— Use a manifest or table to build a collection, map a tool per row, then relabel or reshape outputs.
  • Tabular: compute a new columnrelated note— Use column_maker (Add_a_column1) with strict error_handling to insert/replace a computed column. Per-expression-kind auto_col_types rule.
  • Tabular: concatenate collection to tablerelated note— Use collapse_dataset to row-bind a collection of tabulars into one table, with optional element IDs and header dedupe.
  • Tabular: cut and reorder columnsrelated note— Use Cut1 with a comma-separated cN list to project — and reorder — columns. Listing out of order is the canonical reorder idiom.
  • Tabular: filter rows by column valuerelated note— Use Filter1 with a Python expression over cN columns to drop rows. Highest-frequency tabular row filter in IWC.
  • Tabular: filter rows by regexrelated note— Use tp_grep_tool for whole-line regex row filters on tabular input. Grep1 is the legacy alternative.
  • Tabular: group and aggregaterelated note— Use datamash_ops for grouped tabular aggregation: multi-column grouping, collapse, countunique, min/max, and reductions.
  • Tabular: join on keyrelated note— Use tp_easyjoin_tool for two-tabular key joins; use tp_multijoin_tool for many files and query_tabular for SQL joins.
  • Tabular: pivot collection to widerelated note— Use collection_column_join to outer-join a collection of 2-column id/value tables into one wide table.
  • Tabular: prepend headerrelated note— Use tp_awk_tool to prepend a constant header line, optionally skipping or reformatting an existing first row.
  • Tabular: relabel by row counterrelated note— Use tp_awk_tool to replace each row or label with deterministic sample_N values from awk NR.
  • Tabular: split taxonomy stringrelated note— Use tp_awk_tool to split semicolon-delimited taxonomy strings into explicit rank columns with missing-rank handling.
  • Tabular: SQL queryrelated note— Use query_tabular when SQL semantics justify it: windows, joins, anti-joins, or fused project+compute over tabulars.
  • Tabular: synthesize BED from 3-column inputrelated note— Use tp_awk_tool to convert chrom/start/end rows into 6-column BED, subtracting 1 from start and setting constants.
  • Iwc Map Over Lifecycle Surveyrelated note— Survey of IWC map-over lifecycle recipes, with a Nextflow-to-Galaxy crosswalk for collection construction, cleanup, reshape, reduce, and publish phases.
  • Iwc Parameter Derivation Surveyrelated note— Corpus survey of Galaxy workflow recipes that turn upstream data, metadata, or small files into runtime parameters.
  • Iwc Runtime Parameter Shims Surveyrelated note— Focused survey of tiny IWC runtime parameter shims for flags, enums, counts, booleans, and composed text.
  • Iwc Shortcuts Anti Patternsrelated note— What IWC test suites cut corners on (accepted) vs what's a code smell — existence-only probes, sim_size deltas, image dim checks, label coupling.
  • Iwc Test Data Conventionsrelated note— How IWC workflows organize and reference test data — Zenodo-first, SHA-1 integrity, collection shapes, CVMFS gotchas.
  • Iwc Transformations Surveyrelated note— Corpus survey of collection-shape transformations across IWC: built-in collection ops, toolshed transformers, and the multi-step recipes that bracket map-over.
  • Nextflow operators to Galaxy collection recipesrelated note— Classifies common Nextflow operators as Galaxy wiring, collection semantics, explicit steps, or review triggers.
  • Planemo Asserts Idiomsrelated note— Decision and idiom guide for picking planemo workflow-test assertions: which family per output type, how to size tolerances, when to validate.