IWC tabular operations survey

Source corpus: /Users/jxc755/projects/repositories/workflow-fixtures/iwc-format2/, 120 .gxwf.yml files across 20 domain directories. Counts below are step-occurrence counts produced by grep -rh "^[[:space:]]*- tool_id:" --include="*.yml" | sort | uniq -c (i.e. one count per workflow step that uses the tool, including subworkflow steps and the trailing unique_tools block — so the magnitudes are roughly 2x the count of distinct invocations a user authored, but the ranking is faithful). All file:line citations are into iwc-format2/.

The corpus is heavily skewed toward sars-cov-2 reporting, microbiome amplicon, VGP assembly QC, and scRNA-seq metadata wrangling — those four pull the tabular tooling. Pure read/align/call workflows (read-preprocessing/, most of variant-calling/) barely touch tabular operations directly; their tabular work happens inside multiqc rollups, which are out of scope.

1. Tool inventory

Ranked by step occurrences. “DT” = devteam, “BG” = bgruening, “IUC” = iuc, “NML” = nml. Display names and short forms shown after first introduction.

1a. Galaxy “core” tabular tools (no toolshed owner; bundled with Galaxy)

Steps	tool_id	Short name	Operation
127	`Cut1`	Cut1 (Cut columns from a table)	Column projection
33	`Filter1`	Filter1 (Filter data on any column using simple expressions)	Row filter
47	`Grep1`	Grep1 (Select lines that match an expression)	Row filter (regex)
25	`sort1`	sort1 (Sort)	Sort
21	`Remove beginning1`	Remove beginning	Header strip
19	`Grouping1`	Grouping1 (Group data by a column)	Group/aggregate
7	`Paste1`	Paste1 (Paste two files side by side)	Column-bind
5	`addValue/1.0.1` (DT)	Add a column	Constant column add
4	`cat1`	Concatenate	Row-bind

First citations:

Cut1 — sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782 (columnList: c4,c6,c7,c13,c14,c15,c16,c17,c18,c19,c21,c22,c23,c26,c24,c25,c20, delimiter: T).
Filter1 — sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545 (cond: c4=='PASS' or c4=='.', header_lines: "1").
Grep1 — comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687 (pattern: ^>, invert: "", keep_header: false).
sort1 — VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:559.
Grouping1 — microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:324 (groupcol: "6", operations: [{optype: length, opcol: "6"}]).
Remove beginning1 — microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:186.
Paste1 — genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:733 (two inputs, delimiter: T).

1b. bgruening text_processing suite (“tp_*”)

By far the largest single family. Same upstream tool collection (text_processing repo, owner bgruening), individual tools bucketed by operation. Multiple version pins coexist in the corpus (9.3+galaxy1, 9.5+galaxy0, 9.5+galaxy2, 9.5+galaxy3) — totals below sum across versions.

Steps (all versions)	tool stem	Operation
195	`tp_awk_tool`	Free-form awk
66	`tp_find_and_replace`	Regex/string replace (whole-line)
39	`tp_replace_in_line`	Regex replace in line
43	`tp_grep_tool`	Row filter (regex; vs core `Grep1`)
16	`tp_sed_tool`	Free-form sed
15	`tp_cat`	Row-bind
12	`tp_text_file_with_recurring_lines`	Header/template lines (constant prefix)
11	`tp_replace_in_column`	Per-column substitution
~8	`tp_sort_header_tool`	Sort while preserving header
~6	`tp_sorted_uniq`	Dedupe (sort+uniq combined)
~5	`tp_easyjoin_tool`	Join on key
~4	`tp_head_tool`	First-N row truncate
~3	`tp_uniq_tool`	Dedupe
~2	`tp_multijoin_tool`	Multi-file outer join

Representative full IDs (first occurrence in corpus):

toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_awk_tool/9.5+galaxy3 — comparative_genomics/hyphy/hyphy-core.gxwf.yml:169. 82 steps at 9.3+galaxy1, 60 at 9.5+galaxy3, 30 at 9.5+galaxy0, 27 at 9.5+galaxy3 (sars-cov-2 cluster), 13 at 9.5+galaxy2 — that’s the version-pin spread.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/9.5+galaxy3 — read-preprocessing/short-read-quality-control-and-trimming.gxwf.yml and elsewhere; 38 steps total at this version, plus 16 at 9.5+galaxy0 and 12 at 9.3+galaxy1. Example regex spec: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:390-407 (compound find_pattern reflowing seven ,-collapsed datamash columns into seven plain columns; is_regex: true, skip_first_line: true).
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0 — sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-variant-calling/pe-artic-variation.gxwf.yml:973.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/9.3+galaxy1 — sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:601 (5 invocations in this single file alone — see §3).
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_text_file_with_recurring_lines/9.5+galaxy3 — comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:663.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sed_tool/9.5+galaxy3 — sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:155 (5 sed invocations in this one file).
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_column/9.5+galaxy3 — microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:942. Older 1.1.3 pin survives at sars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation.gxwf.yml:192.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/9.5+galaxy3 — sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:627.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sorted_uniq/9.5+galaxy3 — comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:773.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_uniq_tool/9.5+galaxy3 — VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.gxwf.yml:2476 (inside a subworkflow).
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_head_tool/9.5+galaxy0 — microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:495.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/9.5+galaxy3 — used 43x across four pins (1.1.1, 9.3+galaxy1, 9.5+galaxy2, 9.5+galaxy3 — last dominates); coexists with core Grep1 (47x). See §3.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_multijoin_tool/9.3+galaxy1 — microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:796.
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sort_header_tool/9.3+galaxy1 — sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:923.
toolshed.g2.bx.psu.edu/repos/bgruening/split_file_on_column/tp_split_on_column/0.6 — microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:369 (split a tabular into a collection by a key column).

1c. devteam column tools

Steps	tool_id	Operation
93	`toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1`	Computed column (Python expressions over `cN`)
56	`toolshed.g2.bx.psu.edu/repos/devteam/add_value/addValue/1.0.1`	Constant column add (older idiom; survives heavily in VGP)

column_maker/Add_a_column1/2.1 first occurrence: sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:344. Representative state at sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:294-328 — two expressions: entries in one step, one inserting c7 as AFcaller at position I (insert), one replacing position 7 with round((c18 + c19) / c6, 6) named AF. The error_handling: { auto_col_types: true, fail_on_non_existent_columns: true, non_computable: { action: --fail-on-non-computable } } block is a stable boilerplate across the corpus — pattern page should call it out as the recommended-default tuple.

1d. iuc / nml tabular tools

Steps	tool_id	Operation
73	`toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/{1.1.0,1.8+galaxy0,1.9+galaxy0}`	Group/aggregate; collapse
44	`toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/5.1.0`	Concatenate a collection into a single tabular (with `add_name`/`one_header`)
32	`toolshed.g2.bx.psu.edu/repos/iuc/collection_column_join/collection_column_join/0.0.3`	Wide pivot: outer-join a collection of 2-col tables on identifier → one row-per-id, one col-per-element
16	`toolshed.g2.bx.psu.edu/repos/iuc/query_tabular/query_tabular/3.3.2`	Arbitrary SQL (SQLite) over one or more tabulars
11	`toolshed.g2.bx.psu.edu/repos/iuc/filter_tabular/filter_tabular/3.3.1`	SQL-or-line-filter pre-processor
~6	`toolshed.g2.bx.psu.edu/repos/iuc/table_compute/table_compute/1.2.4+galaxy{1,2}`	Pandas-style row/col reductions and matrix ops
~3	`toolshed.g2.bx.psu.edu/repos/iuc/biom_convert/biom_convert/...`	Format conversion (BIOM↔tabular, tangentially)

First citations:

datamash_ops/1.8+galaxy0 — sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333. Representative state (collapse a per-effect duplication): grouping: 1,2,3,4,5,6,7,8,9,10, operations: [{op_name: collapse, op_column: "11"}, ..., {op_name: collapse, op_column: "17"}] — 7 collapses across columns 11-17 in one step. Lines 333-373.
datamash_ops/1.9+galaxy0 (newer pin) — VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:746.
collapse_collections/collapse_dataset/5.1.0 — sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414 (filename: { add_name: true, place_name: same_multiple }, one_header: true). This is the Galaxy idiom for “flatten a collection back to a single tabular while injecting the element identifier as a leading column” — one_header strips per-file headers but keeps the first.
collection_column_join/0.0.3 — amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:202 (identifier_column: "1", fill_char: "0", has_header: "0", old_col_in_header: true).
query_tabular/3.3.2 — amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:33. SQL example at lines 57-64: SELECT c1, c2, c3, c3 * 100 / SUM(c3) OVER() AS relative_abundance FROM t1; — window-function relative abundance, in one tool. The tables_0|table input shape and input_opts: { linefilters: [{filter: {filter_type: comment, comment_char: "35"}}, ...] } for skipping #-prefixed lines is the recurring boilerplate.
filter_tabular/3.3.1 — genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:657. Used as a lightweight column-projection / regex-line-filter alternative to query_tabular when no SQL is needed.
table_compute/1.2.4+galaxy2 — epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:265 (mode: matrixapply, matrixapply_func: { vector_op: min }, dimension: "0" — column-wise min reduction). This is the only tool in the corpus that surfaces pandas-style reductions explicitly.

1e. Tabular-adjacent / built-ins worth flagging

wc_gnu (72 steps) — line-count, frequently downstream of a Filter/awk to feed param_value_from_file (e.g. epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:301).
__APPLY_RULES__ (22), __FLATTEN__ (11), __RELABEL_FROM_FILE__ (39), __FILTER_FROM_FILE__ (20), __FILTER_EMPTY_DATASETS__ (64), __EXTRACT_DATASET__ (46), __MERGE_COLLECTION__ (12) — Galaxy collection ops; out of scope per task (“not row/column ops”) but they bracket the tabular sections heavily.
compose_text_param/0.1.1 (99), pick_value/0.2.0 (85), param_value_from_file (154), map_param_value/0.2.0 (92), collection_element_identifiers/0.0.2 (99) — the metadata-wrangling cluster around tabular steps; not tabular themselves.
multiqc/1.33+galaxy0 (31) and tooldistillator_summarize/1.0.4+galaxy0 (11) — produce tabular outputs but are reporting tools; out of scope.

1f. Notable absences

Searched --include="*.yml" for these; zero hits in the corpus:

csvtk/* — none. Rich CSV/TSV swiss-army wrapper exists in toolshed but no IWC workflow uses it.
datamash_transpose/*, datamash_reverse/* — none. Datamash’s transpose/reverse subcommands are not surfaced through any IWC workflow; transpose is done either via table_compute (rare) or implicit in the wide-pivot collection_column_join idiom (§5).
No dedicated CSV format converter (no tab_to_csv / csv_to_tab); when conversion is needed it goes through awk.
No bedtools-as-tabular intersect/sort outside of genuine BED contexts (excluded by scope anyway).

2. Operation inventory

Each operation, ranked by visible IWC frequency, with 2-3 file:line examples each.

2a. Filter rows (very common; ~80+ instances)

Core Filter1 with Python-expression over cN columns: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545 (cond: c4=='PASS' or c4=='.', header_lines: "1").
Filter1 driven by a runtime-generated rule string fed from another step: epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:320-336 (cond: { __class__: ConnectedValue } from generate filter rule/out1).
Regex/grep — split between core Grep1 and tp_grep_tool (see §3 for the redundancy). Core: comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687 (pattern: ^>).
SQL filter with side-effect of column projection: query_tabular at amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:33.
Filter via awk pattern: pervasive — see §2g and the code: greps in §5.

2b. Column projection / cut (very common; ~127 `Cut1` + N `query_tabular`)

Cut1 with comma-separated cN list: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782 (columnList: c4,c6,c7,c13,...,c20, delimiter: T). Cut1 is also used to reorder columns by listing them out of order (note c20 last after c26,c24,c25 in that example).
query_tabular for column projection + computed columns in one shot: amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:57-64.
filter_tabular for projection without SQL: amplicon/amplicon-mgnify/mgnify-amplicon-taxonomic-summary-tables/mgnify-amplicon-summary-tables.gxwf.yml:203.

2c. Computed column / per-row arithmetic (~93 `column_maker` + many awk)

Add_a_column1/2.1 with multiple expressions in one step (insert and replace): sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:316-329 (AFcaller at insert pos 8, AF = round((c18 + c19) / c6, 6) at replace pos 7).
String-concatenation new column: same file lines 462-472 (c5 + '>' + c6 named change, c3 + ':' + c19 named change_with_pos).
Per-row arithmetic via awk: see §2g.

2d. Sort (~25 core `sort1`, ~8 `tp_sort_header_tool`)

Core sort1 first occurrence: VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:559 (inside subworkflow).
Header-preserving sort: tp_sort_header_tool/9.3+galaxy1 at sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:923 and :950 — used immediately before downstream joins.

2e. Group / aggregate (~73 `datamash_ops`, ~19 `Grouping1`)

Datamash with multi-column grouping: plus several op_names in one step: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-370 (10-col grouping, 7 collapse columns) and :562-596 (single-col group, 5 ops: countunique × 3, min, max, collapse).
Grouping1 (older Galaxy tool): microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:324 (groupcol: "6", operations: [{optype: length, opcol: "6"}]) and :1087 (optype: cat, i.e. concatenate group members).

2f. Join (~5 `tp_easyjoin_tool` instances visible at top-level, ~2 `tp_multijoin_tool`)

Two-file join on key columns: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:601-627 (column1: "20", column2: "1", empty_string_filler: "0", header: true, jointype: " " — note the single space string for default outer join, not --outer). Five easyjoin steps in this one workflow alone (lines 601, 722, 752, 799, 1059).
Multi-file join: microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:796 (tp_multijoin_tool/9.3+galaxy1).

2g. awk (free-form) (~195 step instances)

Effectively the swiss-army knife. Several recurring sub-shapes:

Header injection (constant prefix): microbiome/mags-building/MAGs-generation.gxwf.yml:1090-1096 —
```
BEGIN {OFS="\t"; print "genome\tcompleteness\tcontamination"}
NR > 1 {
    if ($1 !~ /\.fasta$/)
        $1 = $1 ".fasta"
    print $1, $2, $3
}
```
Same one-liner shape at VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:855 ('BEGIN{print "Metric\tAlternate"}; {print}; ') and :1631 (Metric\tPrimary).
BED triple synthesis from a 3-column input: amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:222 and :268 — 'BEGIN {OFS="\t"} {print $1, $2 - 1, $3, "forward", "1", "+"}' vs :245/:291 "reverse", "1", "-". Replicated verbatim inside the mgnify-amplicon-pipeline-v5-complete.gxwf.yml:2101,2124,2147,2170 subworkflow embeddings.
Long-format taxonomy splitter: amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:101-135 — split($3, taxonomy, ";") then dispatch by ^sk__/^k__/^p__/^c__/^o__/^f__/^g__/^s__ prefix into the 8 taxonomic-rank columns. Re-implemented (similar but not identical) at amplicon/amplicon-mgnify/mgnify-amplicon-taxonomic-summary-tables/mgnify-amplicon-summary-tables.gxwf.yml:241,281,336 and mgnify-amplicon-pipeline-v5-complete.gxwf.yml:5697,5737,5792,6130,6170.
FASTQ id sanitization (technically in scope as a row-text op): amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-complete/mgnify-amplicon-pipeline-v5-complete.gxwf.yml:407 — NR % 4 == 1 { gsub(/[ \/]/, "-", $0) } { print }.
Inline relabel (replace whole record with a counter): microbiome/binning-evaluation/MAGs-binning-evaluation.gxwf.yml:433 — '{gsub( $0 ,"sample_" (NR-1)); print}'.

2h. String/regex replace at the line/column level (~66 `tp_find_and_replace`, ~39 `tp_replace_in_line`, ~11 `tp_replace_in_column`)

tp_find_and_replace with multiple sequenced patterns in one step: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:389-407 (compound regex reflow + a (GroupBy|collapse)$([^)]+)$ → $2 strip). Demonstrates the convention skip_first_line: true for the first pass and false for the second.
tp_replace_in_column with column_replace: "16", delimiter: tab, pass_comments: "#", skip_lines: "1", unknowns_strategy: skip: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:281-289.

2i. Concatenate / row-bind (~15 `tp_cat`, ~4 `cat1`, ~44 `collapse_dataset`)

tp_cat for two-file concat: sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:627.
collapse_dataset for collection→tabular concat with header dedup: sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414 (one_header: true, add_name: true, place_name: same_multiple). This is the dominant idiom — 44 steps; cf. only 15 tp_cat and 4 cat1.

2j. Dedupe (~6 `tp_sorted_uniq`, ~3 `tp_uniq_tool`)

tp_sorted_uniq first hit: comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:773 and :1293 (in the workflow’s unique_tools block).
tp_uniq_tool: VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.gxwf.yml:2476.

2k. Header strip / take first N (~21 `Remove beginning1`, ~4 `tp_head_tool`)

Remove beginning1: microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:186 and :200 (used twice in succession).
tp_head_tool: microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:495,633.

2l. Pivot / transpose

Sparse — and the corpus shape is itself a finding. No datamash_transpose invocations anywhere. table_compute does column reductions but is not used as a transpose. The dominant wide-pivot idiom is collection_column_join: take a collection of two-column (id, value) tables and outer-join them on the id column to produce one row-per-id, one-column-per-element. Examples at amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:202, microbiome/pathogen-identification/nanopore-pre-processing/Nanopore-Pre-Processing.gxwf.yml:823, microbiome/mags-building/MAGs-generation.gxwf.yml:1483,1544,1656, microbiome/mag-genome-annotation-parallel/MAG-Genome-Annotation-Parallel.gxwf.yml:811. The collection element identifier is the new column header (old_col_in_header: true).

Long-pivot (wide → long) is done with awk; e.g. amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:101-135 reshapes a ;-delimited taxonomy column to 8 parallel columns — itself wide-from-long via awk dispatch, reverse direction.

2m. Format conversion (tabular flavors)

Almost none seen. biom_convert (24 step occurrences across some amplicon workflows) is the only dedicated converter, and it converts between BIOM and TSV — adjacent to scope. Otherwise, conversions between TSV/CSV/BED-like flavors are done with awk + Cut1 ad hoc; no pure converter pattern exists.

2n. Count / summarize (~72 `wc_gnu`)

wc_gnu is overwhelmingly used to feed param_value_from_file (line-count → integer parameter). E.g. epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:301-313 (include_header: false, options: [lines]).

2o. Sample / random subset

Not visible at the tabular layer in any sampled file; sampling happens upstream at FASTQ level (seqtk_sample), not on tabular outputs.

2p. Side-by-side bind / paste

Paste1 (7 instances): genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:733 and microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:611.

3. Cross-tab (tool × operation; redundancy hotspots)

Operation	Tools that cover it (corpus-observed)	Recommendation lean
Filter rows (column expression)	`Filter1`, `query_tabular`, `filter_tabular`, awk	`Filter1` for one-shot; `query_tabular` only when SQL semantics needed
Filter rows (regex)	`Grep1` (47), `tp_grep_tool` (43), awk	Real redundancy — 47 vs 43 split with no semantic distinction visible
Cut/project columns	`Cut1` (127), `query_tabular`, `filter_tabular`, `Paste1`+`Cut1` chains	`Cut1` dominates; use `query_tabular` for project+compute fused
Computed column	`Add_a_column1` (93), awk, `query_tabular`	`Add_a_column1` if the expression is short; awk if the row needs a multi-line decision tree (see §2g taxonomy splitter)
Sort	`sort1`, `tp_sort_header_tool`	`tp_sort_header_tool` whenever the input has a header — `sort1`’s header handling is implicit
Group/aggregate	`datamash_ops` (73), `Grouping1` (19), `query_tabular`	`datamash_ops` is the canonical choice; `Grouping1` survives in older microbiome workflows
Join on key	`tp_easyjoin_tool`, `tp_multijoin_tool`, `query_tabular` (SQL `JOIN`), `collection_column_join` (collection-shape)	`tp_easyjoin_tool` for two-file 1:1; `collection_column_join` for the collection→wide-table case
Header injection	awk `BEGIN`-block, `tp_text_file_with_recurring_lines`+`tp_cat`	awk is far more frequent (see §5 idiom #1)
Header strip	`Remove beginning1`, awk `NR > 1`, `tp_head_tool`	`Remove beginning1` is purpose-built and dominant
Concatenate (row-bind)	`tp_cat`, `cat1`, `collapse_dataset` (collection→single)	`collapse_dataset` for collections; `tp_cat` for plain two-file
Dedupe	`tp_sorted_uniq`, `tp_uniq_tool`, `datamash_ops` collapse + count	`tp_sorted_uniq` for line-level; `datamash_ops` for key-aware dedupe
Replace text	`tp_find_and_replace` (66), `tp_replace_in_line` (39), `tp_replace_in_column` (11), `tp_sed_tool` (16), awk	`tp_replace_in_column` when the substitution is per-column (preserves other columns); `tp_find_and_replace` for whole-line; the rest are all redundant flavors of “regex replace”
Wide pivot	`collection_column_join`, awk	`collection_column_join` for the collection-to-table case (only viable use); no tool for wide↔long generic
Transpose	`table_compute` (rare)	No good IWC-attested option; gap

Tool-anchored deep-dive candidates (one tool covers many operations, complex parameterization):

awk (tp_awk_tool, 195 steps across all versions) — by far the highest-leverage page. Header injection, BED synthesis, column reshape, regex filter, taxonomy splitting all condense to awk one-liners or short blocks. The code: |- field carries a Bash-like single-quoted string OR a multiline awk program; both shapes are present (compare mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:222 quoted-string form vs mapseq-to-ampvis2.gxwf.yml:101 multiline |- form).
datamash_ops (73 steps) — the only group/aggregate path that scales beyond toy Grouping1. The grouping: field is a comma-separated string, operations: is a list of {op_name, op_column} pairs, both header_in/header_out are independent toggles, and need_sort: matters for correctness. Worth a dedicated page.
column_maker / Add_a_column1 (93 steps) — Python expressions over cN, plus the error_handling boilerplate, plus the add_column.mode: I/R/"" (insert/replace/append) mini-DSL. Worth a page; reviewers will reach for it constantly.
query_tabular (16 steps) — niche but powerful (SQL window functions, multi-table JOINs). Worth a page so users don’t reach for awk when SQL is genuinely cleaner.
collection_column_join (32 steps) — the wide-pivot idiom, and unobvious unless you already know about it. Worth a page.

4. Candidate operation-pattern boundaries

Proposed operation pages, each scoped tightly. Where a candidate is weak, I say so.

tabular-filter-by-column-value — Filter1 with cond: cN == 'X' or cN > Y. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545, epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:320, sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:276. Tools: Filter1. Why: highest-frequency single operation; the header_lines: parameter is an easy-to-miss correctness lever; the cond: mini-language is non-obvious (Python with cN columns). Keep.
tabular-filter-by-regex (or tabular-grep) — Cover both Grep1 and tp_grep_tool and resolve the redundancy. Cite comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687. Keep, but the page must take a position on which to recommend (see open question Q1).
tabular-cut-and-reorder-columns — Cut1 with columnList: cN,cM,..., including the use of out-of-order lists for column reordering. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782 (reorder), :830, :878. Note delimiter: T constant. Keep.
tabular-compute-new-column — column_maker/Add_a_column1 with the error_handling boilerplate and add_column.mode: I/R/"" DSL. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:316-329, :462-472, consensus-from-variation.gxwf.yml:344. Keep — this is one of the most foot-gun-prone tools in IWC due to silent column-type coercion when auto_col_types: false.
tabular-group-and-aggregate-with-datamash — datamash_ops with multi-column grouping:, sequenced operations:. Cite sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-373 (collapse-by-many), :562-596 (count/min/max in one step), :632-657 (single-op countunique). Keep — high-density idiom page.
tabular-join-on-key — tp_easyjoin_tool with column1/column2, empty_string_filler, jointype (note the leading-space convention for outer). Cite variation-reporting.gxwf.yml:601, :722, :1059. Mention tp_multijoin_tool for >2 inputs and query_tabular for SQL JOIN. Keep.
awk-in-galaxy (deep tool-anchored page) — tp_awk_tool quoted vs multiline code: shapes, BEGIN {OFS="\t"} rituals, NR > 1 header skip, gsub, split. Cite §2g and §5 examples. Keep — must be a deep page; see open question Q2 about depth.
collection-to-wide-table-with-collection_column_join — the wide-pivot idiom. Cite mapseq-to-ampvis2.gxwf.yml:202, mags-building/MAGs-generation.gxwf.yml:1483,1544,1656, mag-genome-annotation-parallel/MAG-Genome-Annotation-Parallel.gxwf.yml:811. Keep — non-obvious without a worked example.
collection-to-single-tabular-with-collapse_dataset — sibling to the above for the row-bind direction. Cite variation-reporting.gxwf.yml:414, mapseq-to-ampvis2.gxwf.yml:178. The add_name/one_header/place_name triad is the foot-gun. Keep.
tabular-sql-with-query_tabular — Cite mapseq-to-ampvis2.gxwf.yml:33 (window function), mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:137. The tables[].input_opts.linefilters and tbl_opts.column_names_from_first_line boilerplate is non-trivial. Keep — this page exists to narrow when query_tabular is the right reach (SQL windows, JOINs), not to evangelize it.

Drop / merge candidates (don’t deserve their own page):

tabular-sort — sort1 and tp_sort_header_tool are simple enough to roll into a one-line note inside other pages. Possible exception: a one-paragraph stub disambiguating the two.
tabular-dedupe — same; tp_sorted_uniq and tp_uniq_tool are thin wrappers, low pitfall density.
tabular-row-bind — collapse into the collapse_dataset page; standalone tp_cat doesn’t merit a page.
tabular-format-convert — corpus is too sparse. Document as a gap, not a page (see §6 Q5).
tabular-pivot-wide-to-long — no concrete IWC pattern; corpus does this case-by-case in awk. Don’t write a page until there’s signal.
sed-in-galaxy — tp_sed_tool (16) is dwarfed by tp_find_and_replace (66) and awk (195). Cover sed as a one-section note inside the awk page rather than a sibling page.
tabular-header-strip — Remove beginning1 is one-parameter (num_lines); too thin for a standalone.

5. Surprising or recurring idioms

The BEGIN {OFS="\t"; print "header\there\there"} { print rows } ritual. Constant across families: microbiome/mags-building/MAGs-generation.gxwf.yml:1090, VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:855, :1631 (Metric\tPrimary vs Metric\tAlternate — same workflow, two haplotypes, identical rite). Authors reach for awk to attach a header rather than concatenating a constant header file with tp_cat — tp_text_file_with_recurring_lines exists for the latter but is rarely used (12 step instances corpus-wide vs awk’s 195).
Datamash → Find&Replace round-trip to emulate “collapse with delimiter X”. sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-373 collapses 7 columns; :375-407 then runs tp_find_and_replace with a 7-group regex \t([^\t,]+),[^\t]+\t... to keep only the first comma-delimited member of each collapsed cell. This is “argmax in datamash” implemented as “collapse-then-regex-trim-to-first” because datamash doesn’t have a “first” op accessible through this UI. Pattern page should warn this is the pattern, and that getting the regex group count right is the primary failure mode.
The same workflow uses Filter1 AND query_tabular for distinct cases. sars-cov-2-variation-reporting/variation-reporting.gxwf.yml uses Filter1 (line 545) for the simple c4=='PASS' case but reaches for datamash, easyjoin, find&replace chains for the fan-out/fan-in. There is a real implicit decision boundary: Filter1 for one-shot row filters; switch tools when the operation needs sort-aware or join-aware semantics.
collection_column_join as the only attested wide-pivot. Six workflows use it, all with the same shape: a collection of (id, value) tabulars (one per sample/element) joined on column 1 with fill_char: "0". microbiome/mags-building/MAGs-generation.gxwf.yml:1483,1544,1656 does this three times in one workflow for three different metric families. There’s no pivot_table-style tool in the corpus; this collection-shape join is the workflow author’s substitute, and only works because the upstream produces one (id, value) file per element.
collapse_dataset with add_name: true, place_name: same_multiple, one_header: true is the canonical “merge a per-sample collection into a single annotated tabular”. sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414, microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:553 (with collection_column_join), microbiome/host-contamination-removal/.... Without the one_header: true you get duplicated header rows; without add_name: true you lose the per-row sample identity. Both bugs are silent.
Version pin sprawl in the awk tool is the leading cause of “same idiom shows up four times”. tp_awk_tool exists in the corpus at 9.3+galaxy1 (82 steps), 9.5+galaxy0 (30), 9.5+galaxy2 (13), 9.5+galaxy3 (60+27 in two clusters). Same parameter shape across all of them. Reviewer-facing pattern: pick the highest pin currently in any live workflow and discourage downgrading; do not block PRs for older pins on cleanup-pass grounds.

6. Open questions

Q1. Grep1 (47) vs tp_grep_tool (43) — semantic difference real? Both take pattern, invert, keep_header. Suggest the regex page recommend one and demote the other; need your call which.
Q2. awk-in-galaxy page depth: one page covering all 195 invocations, or split into 4 sub-pages (awk-header-injection, awk-bed-synthesis, awk-taxonomy-split, awk-relabel)? Lean: one page with idiom sections; split only if frontmatter cross-linking gets noisy.
Q3. Should Add_a_column1 page warn against auto_col_types: false? Many corpus uses set it true, some false (variation-reporting.gxwf.yml:454); silent string-vs-numeric coercion is a real bug source. Need your call on prescriptiveness.
Q4. Is query_tabular deep-dive in scope for this hierarchy or its own thing? It overlaps Galaxy’s broader “compute over tabular” story (R, Python, csvtk-shaped). Lean: keep it in this hierarchy as the SQL operation.
Q5. Tabular format conversion is genuinely sparse (no tab_to_csv, no csv_to_tab ops; biom_convert is the only thing close). Write a one-paragraph “gap note” page or skip entirely?
Q6. Older tool IDs (Grouping1, cat1, addValue/1.0.1, Remove beginning1, Paste1, sort1) — do pages mention them as legacy or only the modern equivalent? Lean: mention with a “newer alternative” pointer; corpus still uses them so doc-blindness is wrong.
Q7. Out-of-scope in this survey but adjacent: multiqc and tooldistillator_summarize produce tabular outputs that downstream tabular tools then chew on. Worth a “tabular sources” cross-reference page or skip?
Q8. Should the wide-pivot page be collection-to-wide-table-with-collection_column_join (tool-anchored) or tabular-pivot-collection-to-wide (operation-anchored with the tool as recommendation)? Lean: operation-anchored title, tool-anchored content.

7. Decisions (2026-04-30)

Resolved via AskUserQuestion after this survey landed. Pinned here so the next subagent inherits them without re-litigation.

Naming axis (Q8 + general). Operation-anchored page names. Tool-anchored content is fine inside an operation-named page. Even the awk-split sub-pages get operation names (tabular-prepend-header, tabular-synthesize-bed-from-3col, tabular-split-taxonomy-string, tabular-relabel-by-row-counter); awk is the implementation, not the title.
awk page depth (Q2). Split into 4-5 operation-named sub-pages per the bullet above. No single awk-in-galaxy umbrella page; cross-link the awk-as-recipe sub-pages from a §Recipes line in any operation page that uses awk.
Grep1 vs tp_grep_tool (Q1). Recommend tp_grep_tool. Demote Grep1 to a “legacy alternative” footnote. Consistency with the rest of the tp_* family wins over slight corpus-frequency edge of Grep1.
Format-conversion gap (Q5). Skip. Corpus-first principle: no exemplar = no page. The §2m gap note in this survey stands as the only record.

auto_col_types (Q3). The tabular-compute-new-column page prescribes a strict structured rule:

Always set fail_on_non_existent_columns: true (51/51 corpus instances).
non_computable.action: --fail-on-non-computable is the dominant choice (49/51); the two --skip-non-computable exceptions (consensus-from-variation.gxwf.yml:364, :402) are intentional, for BED-coordinate arithmetic where some rows are legitimately non-numeric.

auto_col_types is per-expression-kind:

Expression kind	`auto_col_types`
Arithmetic on raw `cN` (`(c18+c19)/c6`, `round(...)`)	`true`
Pure string concat (`c5 + '>' + c6`)	`false`
Arithmetic with explicit casts (`int(cN)`, `float(cN)`)	`false`
Mixed	split into two `expressions:` entries with different settings

Corpus distribution: 48 true / 3 false. Cite variation-reporting.gxwf.yml:307-329 (true, raw-cN arithmetic), :438-475 (false, string concat), and consensus-from-variation.gxwf.yml:343-378 (false, explicit-cast arithmetic) as the canonical triple.
Note on YAML shape: expressions: is nested under tool_state.ops.expressions (with header_lines_select: yes|no as sibling). error_handling is a top-level sibling of ops, not nested inside it. The pattern page must show this shape; flat expressions: does not roundtrip.

Legacy tool IDs (Q6). Pages name the modern tool primarily; include a short “Legacy alternative” footnote pointing to the old ID (Grouping1, cat1, addValue/1.0.1, Remove beginning1, Paste1, sort1). Reading old IWC workflows must remain possible.
query_tabular (Q4). Leaf in this tabular hierarchy as tabular-sql-query. Scope narrowly to “when SQL is the right reach” — window functions, multi-table JOINs, project+compute fused. Cross-link from filter / join / compute leaves; do not evangelize.
Tabular-source cross-ref (Q7). Deferred. If a Mold (e.g. summarize-galaxy-tool) later needs to point to multiqc/tooldistillator-as-tabular-source context, write the page then.

Iwc Tabular Operations Survey

IWC tabular operations survey

1. Tool inventory

1a. Galaxy “core” tabular tools (no toolshed owner; bundled with Galaxy)

1b. bgruening text_processing suite (“tp_*”)

1c. devteam column tools

1d. iuc / nml tabular tools

1e. Tabular-adjacent / built-ins worth flagging

1f. Notable absences

2. Operation inventory

2a. Filter rows (very common; ~80+ instances)

2b. Column projection / cut (very common; ~127 `Cut1` + N `query_tabular`)

2c. Computed column / per-row arithmetic (~93 `column_maker` + many awk)

2d. Sort (~25 core `sort1`, ~8 `tp_sort_header_tool`)

2e. Group / aggregate (~73 `datamash_ops`, ~19 `Grouping1`)

2f. Join (~5 `tp_easyjoin_tool` instances visible at top-level, ~2 `tp_multijoin_tool`)

2g. awk (free-form) (~195 step instances)

2h. String/regex replace at the line/column level (~66 `tp_find_and_replace`, ~39 `tp_replace_in_line`, ~11 `tp_replace_in_column`)

2i. Concatenate / row-bind (~15 `tp_cat`, ~4 `cat1`, ~44 `collapse_dataset`)

2j. Dedupe (~6 `tp_sorted_uniq`, ~3 `tp_uniq_tool`)

2k. Header strip / take first N (~21 `Remove beginning1`, ~4 `tp_head_tool`)

2l. Pivot / transpose

2m. Format conversion (tabular flavors)

2n. Count / summarize (~72 `wc_gnu`)

2o. Sample / random subset

2p. Side-by-side bind / paste

3. Cross-tab (tool × operation; redundancy hotspots)

4. Candidate operation-pattern boundaries

5. Surprising or recurring idioms

6. Open questions

7. Decisions (2026-04-30)

Incoming References (24)

IWC tabular operations survey

1. Tool inventory

1a. Galaxy “core” tabular tools (no toolshed owner; bundled with Galaxy)

1b. bgruening text_processing suite (“tp_*”)

1c. devteam column tools

1d. iuc / nml tabular tools

1e. Tabular-adjacent / built-ins worth flagging

1f. Notable absences

2. Operation inventory

2a. Filter rows (very common; ~80+ instances)

2b. Column projection / cut (very common; ~127 Cut1 + N query_tabular)

2c. Computed column / per-row arithmetic (~93 column_maker + many awk)

2d. Sort (~25 core sort1, ~8 tp_sort_header_tool)

2e. Group / aggregate (~73 datamash_ops, ~19 Grouping1)

2f. Join (~5 tp_easyjoin_tool instances visible at top-level, ~2 tp_multijoin_tool)

2g. awk (free-form) (~195 step instances)

2h. String/regex replace at the line/column level (~66 tp_find_and_replace, ~39 tp_replace_in_line, ~11 tp_replace_in_column)

2i. Concatenate / row-bind (~15 tp_cat, ~4 cat1, ~44 collapse_dataset)

2j. Dedupe (~6 tp_sorted_uniq, ~3 tp_uniq_tool)

2k. Header strip / take first N (~21 Remove beginning1, ~4 tp_head_tool)

2l. Pivot / transpose

2m. Format conversion (tabular flavors)

2n. Count / summarize (~72 wc_gnu)

2o. Sample / random subset

2p. Side-by-side bind / paste

3. Cross-tab (tool × operation; redundancy hotspots)

4. Candidate operation-pattern boundaries

5. Surprising or recurring idioms

6. Open questions

7. Decisions (2026-04-30)

Incoming References (24)

2b. Column projection / cut (very common; ~127 `Cut1` + N `query_tabular`)

2c. Computed column / per-row arithmetic (~93 `column_maker` + many awk)

2d. Sort (~25 core `sort1`, ~8 `tp_sort_header_tool`)

2e. Group / aggregate (~73 `datamash_ops`, ~19 `Grouping1`)

2f. Join (~5 `tp_easyjoin_tool` instances visible at top-level, ~2 `tp_multijoin_tool`)

2h. String/regex replace at the line/column level (~66 `tp_find_and_replace`, ~39 `tp_replace_in_line`, ~11 `tp_replace_in_column`)

2i. Concatenate / row-bind (~15 `tp_cat`, ~4 `cat1`, ~44 `collapse_dataset`)

2j. Dedupe (~6 `tp_sorted_uniq`, ~3 `tp_uniq_tool`)

2k. Header strip / take first N (~21 `Remove beginning1`, ~4 `tp_head_tool`)

2n. Count / summarize (~72 `wc_gnu`)