IWC tabular operations survey
Source corpus: /Users/jxc755/projects/repositories/workflow-fixtures/iwc-format2/, 120 .gxwf.yml files across 20 domain directories. Counts below are step-occurrence counts produced by grep -rh "^[[:space:]]*- tool_id:" --include="*.yml" | sort | uniq -c (i.e. one count per workflow step that uses the tool, including subworkflow steps and the trailing unique_tools block — so the magnitudes are roughly 2x the count of distinct invocations a user authored, but the ranking is faithful). All file:line citations are into iwc-format2/.
The corpus is heavily skewed toward sars-cov-2 reporting, microbiome amplicon, VGP assembly QC, and scRNA-seq metadata wrangling — those four pull the tabular tooling. Pure read/align/call workflows (read-preprocessing/, most of variant-calling/) barely touch tabular operations directly; their tabular work happens inside multiqc rollups, which are out of scope.
1. Tool inventory
Ranked by step occurrences. “DT” = devteam, “BG” = bgruening, “IUC” = iuc, “NML” = nml. Display names and short forms shown after first introduction.
1a. Galaxy “core” tabular tools (no toolshed owner; bundled with Galaxy)
| Steps | tool_id | Short name | Operation |
|---|---|---|---|
| 127 | Cut1 | Cut1 (Cut columns from a table) | Column projection |
| 33 | Filter1 | Filter1 (Filter data on any column using simple expressions) | Row filter |
| 47 | Grep1 | Grep1 (Select lines that match an expression) | Row filter (regex) |
| 25 | sort1 | sort1 (Sort) | Sort |
| 21 | Remove beginning1 | Remove beginning | Header strip |
| 19 | Grouping1 | Grouping1 (Group data by a column) | Group/aggregate |
| 7 | Paste1 | Paste1 (Paste two files side by side) | Column-bind |
| 5 | addValue/1.0.1 (DT) | Add a column | Constant column add |
| 4 | cat1 | Concatenate | Row-bind |
First citations:
Cut1—sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782(columnList: c4,c6,c7,c13,c14,c15,c16,c17,c18,c19,c21,c22,c23,c26,c24,c25,c20,delimiter: T).Filter1—sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545(cond: c4=='PASS' or c4=='.',header_lines: "1").Grep1—comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687(pattern: ^>,invert: "",keep_header: false).sort1—VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:559.Grouping1—microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:324(groupcol: "6",operations: [{optype: length, opcol: "6"}]).Remove beginning1—microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:186.Paste1—genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:733(two inputs,delimiter: T).
1b. bgruening text_processing suite (“tp_*”)
By far the largest single family. Same upstream tool collection (text_processing repo, owner bgruening), individual tools bucketed by operation. Multiple version pins coexist in the corpus (9.3+galaxy1, 9.5+galaxy0, 9.5+galaxy2, 9.5+galaxy3) — totals below sum across versions.
| Steps (all versions) | tool stem | Operation |
|---|---|---|
| 195 | tp_awk_tool | Free-form awk |
| 66 | tp_find_and_replace | Regex/string replace (whole-line) |
| 39 | tp_replace_in_line | Regex replace in line |
| 43 | tp_grep_tool | Row filter (regex; vs core Grep1) |
| 16 | tp_sed_tool | Free-form sed |
| 15 | tp_cat | Row-bind |
| 12 | tp_text_file_with_recurring_lines | Header/template lines (constant prefix) |
| 11 | tp_replace_in_column | Per-column substitution |
| ~8 | tp_sort_header_tool | Sort while preserving header |
| ~6 | tp_sorted_uniq | Dedupe (sort+uniq combined) |
| ~5 | tp_easyjoin_tool | Join on key |
| ~4 | tp_head_tool | First-N row truncate |
| ~3 | tp_uniq_tool | Dedupe |
| ~2 | tp_multijoin_tool | Multi-file outer join |
Representative full IDs (first occurrence in corpus):
toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_awk_tool/9.5+galaxy3—comparative_genomics/hyphy/hyphy-core.gxwf.yml:169. 82 steps at9.3+galaxy1, 60 at9.5+galaxy3, 30 at9.5+galaxy0, 27 at9.5+galaxy3(sars-cov-2 cluster), 13 at9.5+galaxy2— that’s the version-pin spread.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/9.5+galaxy3—read-preprocessing/short-read-quality-control-and-trimming.gxwf.ymland elsewhere; 38 steps total at this version, plus 16 at9.5+galaxy0and 12 at9.3+galaxy1. Example regex spec:sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:390-407(compoundfind_patternreflowing seven,-collapsed datamash columns into seven plain columns;is_regex: true,skip_first_line: true).toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0—sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-variant-calling/pe-artic-variation.gxwf.yml:973.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/9.3+galaxy1—sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:601(5 invocations in this single file alone — see §3).toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_text_file_with_recurring_lines/9.5+galaxy3—comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:663.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sed_tool/9.5+galaxy3—sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:155(5 sed invocations in this one file).toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_column/9.5+galaxy3—microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:942. Older1.1.3pin survives atsars-cov-2-variant-calling/sars-cov-2-ont-artic-variant-calling/ont-artic-variation.gxwf.yml:192.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/9.5+galaxy3—sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:627.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sorted_uniq/9.5+galaxy3—comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:773.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_uniq_tool/9.5+galaxy3—VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.gxwf.yml:2476(inside a subworkflow).toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_head_tool/9.5+galaxy0—microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:495.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/9.5+galaxy3— used 43x across four pins (1.1.1,9.3+galaxy1,9.5+galaxy2,9.5+galaxy3— last dominates); coexists with coreGrep1(47x). See §3.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_multijoin_tool/9.3+galaxy1—microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:796.toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_sort_header_tool/9.3+galaxy1—sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:923.toolshed.g2.bx.psu.edu/repos/bgruening/split_file_on_column/tp_split_on_column/0.6—microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:369(split a tabular into a collection by a key column).
1c. devteam column tools
| Steps | tool_id | Operation |
|---|---|---|
| 93 | toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1 | Computed column (Python expressions over cN) |
| 56 | toolshed.g2.bx.psu.edu/repos/devteam/add_value/addValue/1.0.1 | Constant column add (older idiom; survives heavily in VGP) |
column_maker/Add_a_column1/2.1 first occurrence: sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:344. Representative state at sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:294-328 — two expressions: entries in one step, one inserting c7 as AFcaller at position I (insert), one replacing position 7 with round((c18 + c19) / c6, 6) named AF. The error_handling: { auto_col_types: true, fail_on_non_existent_columns: true, non_computable: { action: --fail-on-non-computable } } block is a stable boilerplate across the corpus — pattern page should call it out as the recommended-default tuple.
1d. iuc / nml tabular tools
| Steps | tool_id | Operation |
|---|---|---|
| 73 | toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/{1.1.0,1.8+galaxy0,1.9+galaxy0} | Group/aggregate; collapse |
| 44 | toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/5.1.0 | Concatenate a collection into a single tabular (with add_name/one_header) |
| 32 | toolshed.g2.bx.psu.edu/repos/iuc/collection_column_join/collection_column_join/0.0.3 | Wide pivot: outer-join a collection of 2-col tables on identifier → one row-per-id, one col-per-element |
| 16 | toolshed.g2.bx.psu.edu/repos/iuc/query_tabular/query_tabular/3.3.2 | Arbitrary SQL (SQLite) over one or more tabulars |
| 11 | toolshed.g2.bx.psu.edu/repos/iuc/filter_tabular/filter_tabular/3.3.1 | SQL-or-line-filter pre-processor |
| ~6 | toolshed.g2.bx.psu.edu/repos/iuc/table_compute/table_compute/1.2.4+galaxy{1,2} | Pandas-style row/col reductions and matrix ops |
| ~3 | toolshed.g2.bx.psu.edu/repos/iuc/biom_convert/biom_convert/... | Format conversion (BIOM↔tabular, tangentially) |
First citations:
datamash_ops/1.8+galaxy0—sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333. Representative state (collapse a per-effect duplication):grouping: 1,2,3,4,5,6,7,8,9,10,operations: [{op_name: collapse, op_column: "11"}, ..., {op_name: collapse, op_column: "17"}]— 7 collapses across columns 11-17 in one step. Lines 333-373.datamash_ops/1.9+galaxy0(newer pin) —VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:746.collapse_collections/collapse_dataset/5.1.0—sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414(filename: { add_name: true, place_name: same_multiple },one_header: true). This is the Galaxy idiom for “flatten a collection back to a single tabular while injecting the element identifier as a leading column” — one_header strips per-file headers but keeps the first.collection_column_join/0.0.3—amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:202(identifier_column: "1",fill_char: "0",has_header: "0",old_col_in_header: true).query_tabular/3.3.2—amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:33. SQL example at lines 57-64:SELECT c1, c2, c3, c3 * 100 / SUM(c3) OVER() AS relative_abundance FROM t1;— window-function relative abundance, in one tool. Thetables_0|tableinput shape andinput_opts: { linefilters: [{filter: {filter_type: comment, comment_char: "35"}}, ...] }for skipping#-prefixed lines is the recurring boilerplate.filter_tabular/3.3.1—genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:657. Used as a lightweight column-projection / regex-line-filter alternative toquery_tabularwhen no SQL is needed.table_compute/1.2.4+galaxy2—epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:265(mode: matrixapply,matrixapply_func: { vector_op: min },dimension: "0"— column-wise min reduction). This is the only tool in the corpus that surfaces pandas-style reductions explicitly.
1e. Tabular-adjacent / built-ins worth flagging
wc_gnu(72 steps) — line-count, frequently downstream of a Filter/awk to feedparam_value_from_file(e.g.epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:301).__APPLY_RULES__(22),__FLATTEN__(11),__RELABEL_FROM_FILE__(39),__FILTER_FROM_FILE__(20),__FILTER_EMPTY_DATASETS__(64),__EXTRACT_DATASET__(46),__MERGE_COLLECTION__(12) — Galaxy collection ops; out of scope per task (“not row/column ops”) but they bracket the tabular sections heavily.compose_text_param/0.1.1(99),pick_value/0.2.0(85),param_value_from_file(154),map_param_value/0.2.0(92),collection_element_identifiers/0.0.2(99) — the metadata-wrangling cluster around tabular steps; not tabular themselves.multiqc/1.33+galaxy0(31) andtooldistillator_summarize/1.0.4+galaxy0(11) — produce tabular outputs but are reporting tools; out of scope.
1f. Notable absences
Searched --include="*.yml" for these; zero hits in the corpus:
csvtk/*— none. Rich CSV/TSV swiss-army wrapper exists in toolshed but no IWC workflow uses it.datamash_transpose/*,datamash_reverse/*— none. Datamash’s transpose/reverse subcommands are not surfaced through any IWC workflow; transpose is done either viatable_compute(rare) or implicit in the wide-pivotcollection_column_joinidiom (§5).- No dedicated CSV format converter (no
tab_to_csv/csv_to_tab); when conversion is needed it goes through awk. - No bedtools-as-tabular intersect/sort outside of genuine BED contexts (excluded by scope anyway).
2. Operation inventory
Each operation, ranked by visible IWC frequency, with 2-3 file:line examples each.
2a. Filter rows (very common; ~80+ instances)
- Core
Filter1with Python-expression overcNcolumns:sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545(cond: c4=='PASS' or c4=='.',header_lines: "1"). Filter1driven by a runtime-generated rule string fed from another step:epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:320-336(cond: { __class__: ConnectedValue }fromgenerate filter rule/out1).- Regex/grep — split between core
Grep1andtp_grep_tool(see §3 for the redundancy). Core:comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687(pattern: ^>). - SQL filter with side-effect of column projection:
query_tabularatamplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:33. - Filter via awk pattern: pervasive — see §2g and the
code:greps in §5.
2b. Column projection / cut (very common; ~127 Cut1 + N query_tabular)
Cut1with comma-separatedcNlist:sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782(columnList: c4,c6,c7,c13,...,c20,delimiter: T).Cut1is also used to reorder columns by listing them out of order (notec20last afterc26,c24,c25in that example).query_tabularfor column projection + computed columns in one shot:amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:57-64.filter_tabularfor projection without SQL:amplicon/amplicon-mgnify/mgnify-amplicon-taxonomic-summary-tables/mgnify-amplicon-summary-tables.gxwf.yml:203.
2c. Computed column / per-row arithmetic (~93 column_maker + many awk)
Add_a_column1/2.1with multiple expressions in one step (insert and replace):sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:316-329(AFcallerat insert pos 8,AF = round((c18 + c19) / c6, 6)at replace pos 7).- String-concatenation new column: same file lines 462-472 (
c5 + '>' + c6namedchange,c3 + ':' + c19namedchange_with_pos). - Per-row arithmetic via awk: see §2g.
2d. Sort (~25 core sort1, ~8 tp_sort_header_tool)
- Core
sort1first occurrence:VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:559(inside subworkflow). - Header-preserving sort:
tp_sort_header_tool/9.3+galaxy1atsars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:923and:950— used immediately before downstream joins.
2e. Group / aggregate (~73 datamash_ops, ~19 Grouping1)
- Datamash with multi-column
grouping:plus severalop_names in one step:sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-370(10-col grouping, 7 collapse columns) and:562-596(single-col group, 5 ops:countunique × 3,min,max,collapse). Grouping1(older Galaxy tool):microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:324(groupcol: "6",operations: [{optype: length, opcol: "6"}]) and:1087(optype: cat, i.e. concatenate group members).
2f. Join (~5 tp_easyjoin_tool instances visible at top-level, ~2 tp_multijoin_tool)
- Two-file join on key columns:
sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:601-627(column1: "20",column2: "1",empty_string_filler: "0",header: true,jointype: " "— note the single space string for default outer join, not--outer). Five easyjoin steps in this one workflow alone (lines 601, 722, 752, 799, 1059). - Multi-file join:
microbiome/pathogen-identification/pathogen-detection-pathogfair-samples-aggregation-and-visualisation/Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:796(tp_multijoin_tool/9.3+galaxy1).
2g. awk (free-form) (~195 step instances)
Effectively the swiss-army knife. Several recurring sub-shapes:
- Header injection (constant prefix):
microbiome/mags-building/MAGs-generation.gxwf.yml:1090-1096—
Same one-liner shape atBEGIN {OFS="\t"; print "genome\tcompleteness\tcontamination"} NR > 1 { if ($1 !~ /\.fasta$/) $1 = $1 ".fasta" print $1, $2, $3 }VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:855('BEGIN{print "Metric\tAlternate"}; {print}; ') and:1631(Metric\tPrimary). - BED triple synthesis from a 3-column input:
amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-rrna-prediction/mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:222and:268—'BEGIN {OFS="\t"} {print $1, $2 - 1, $3, "forward", "1", "+"}'vs:245/:291"reverse", "1", "-". Replicated verbatim inside themgnify-amplicon-pipeline-v5-complete.gxwf.yml:2101,2124,2147,2170subworkflow embeddings. - Long-format taxonomy splitter:
amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:101-135—split($3, taxonomy, ";")then dispatch by^sk__/^k__/^p__/^c__/^o__/^f__/^g__/^s__prefix into the 8 taxonomic-rank columns. Re-implemented (similar but not identical) atamplicon/amplicon-mgnify/mgnify-amplicon-taxonomic-summary-tables/mgnify-amplicon-summary-tables.gxwf.yml:241,281,336andmgnify-amplicon-pipeline-v5-complete.gxwf.yml:5697,5737,5792,6130,6170. - FASTQ id sanitization (technically in scope as a row-text op):
amplicon/amplicon-mgnify/mgnify-amplicon-pipeline-v5-complete/mgnify-amplicon-pipeline-v5-complete.gxwf.yml:407—NR % 4 == 1 { gsub(/[ \/]/, "-", $0) } { print }. - Inline relabel (replace whole record with a counter):
microbiome/binning-evaluation/MAGs-binning-evaluation.gxwf.yml:433—'{gsub( $0 ,"sample_" (NR-1)); print}'.
2h. String/regex replace at the line/column level (~66 tp_find_and_replace, ~39 tp_replace_in_line, ~11 tp_replace_in_column)
tp_find_and_replacewith multiple sequenced patterns in one step:sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:389-407(compound regex reflow + a(GroupBy|collapse)\(([^)]+)\)→$2strip). Demonstrates the conventionskip_first_line: truefor the first pass andfalsefor the second.tp_replace_in_columnwithcolumn_replace: "16",delimiter: tab,pass_comments: "#",skip_lines: "1",unknowns_strategy: skip:sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:281-289.
2i. Concatenate / row-bind (~15 tp_cat, ~4 cat1, ~44 collapse_dataset)
tp_catfor two-file concat:sars-cov-2-variant-calling/sars-cov-2-pe-illumina-artic-ivar-analysis/pe-wgs-ivar-analysis.gxwf.yml:627.collapse_datasetfor collection→tabular concat with header dedup:sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414(one_header: true,add_name: true,place_name: same_multiple). This is the dominant idiom — 44 steps; cf. only 15tp_catand 4cat1.
2j. Dedupe (~6 tp_sorted_uniq, ~3 tp_uniq_tool)
tp_sorted_uniqfirst hit:comparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:773and:1293(in the workflow’sunique_toolsblock).tp_uniq_tool:VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.gxwf.yml:2476.
2k. Header strip / take first N (~21 Remove beginning1, ~4 tp_head_tool)
Remove beginning1:microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:186and:200(used twice in succession).tp_head_tool:microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:495,633.
2l. Pivot / transpose
Sparse — and the corpus shape is itself a finding. No datamash_transpose invocations anywhere. table_compute does column reductions but is not used as a transpose. The dominant wide-pivot idiom is collection_column_join: take a collection of two-column (id, value) tables and outer-join them on the id column to produce one row-per-id, one-column-per-element. Examples at amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:202, microbiome/pathogen-identification/nanopore-pre-processing/Nanopore-Pre-Processing.gxwf.yml:823, microbiome/mags-building/MAGs-generation.gxwf.yml:1483,1544,1656, microbiome/mag-genome-annotation-parallel/MAG-Genome-Annotation-Parallel.gxwf.yml:811. The collection element identifier is the new column header (old_col_in_header: true).
Long-pivot (wide → long) is done with awk; e.g. amplicon/amplicon-mgnify/mapseq-to-ampvis2/mapseq-to-ampvis2.gxwf.yml:101-135 reshapes a ;-delimited taxonomy column to 8 parallel columns — itself wide-from-long via awk dispatch, reverse direction.
2m. Format conversion (tabular flavors)
Almost none seen. biom_convert (24 step occurrences across some amplicon workflows) is the only dedicated converter, and it converts between BIOM and TSV — adjacent to scope. Otherwise, conversions between TSV/CSV/BED-like flavors are done with awk + Cut1 ad hoc; no pure converter pattern exists.
2n. Count / summarize (~72 wc_gnu)
wc_gnu is overwhelmingly used to feed param_value_from_file (line-count → integer parameter). E.g. epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:301-313 (include_header: false, options: [lines]).
2o. Sample / random subset
Not visible at the tabular layer in any sampled file; sampling happens upstream at FASTQ level (seqtk_sample), not on tabular outputs.
2p. Side-by-side bind / paste
Paste1 (7 instances): genome_annotation/functional-annotation/functional-annotation-of-sequences/Functional_annotation_of_sequences.gxwf.yml:733 and microbiome/pathogen-identification/allele-based-pathogen-identification/Allele-based-Pathogen-Identification.gxwf.yml:611.
3. Cross-tab (tool × operation; redundancy hotspots)
| Operation | Tools that cover it (corpus-observed) | Recommendation lean |
|---|---|---|
| Filter rows (column expression) | Filter1, query_tabular, filter_tabular, awk | Filter1 for one-shot; query_tabular only when SQL semantics needed |
| Filter rows (regex) | Grep1 (47), tp_grep_tool (43), awk | Real redundancy — 47 vs 43 split with no semantic distinction visible |
| Cut/project columns | Cut1 (127), query_tabular, filter_tabular, Paste1+Cut1 chains | Cut1 dominates; use query_tabular for project+compute fused |
| Computed column | Add_a_column1 (93), awk, query_tabular | Add_a_column1 if the expression is short; awk if the row needs a multi-line decision tree (see §2g taxonomy splitter) |
| Sort | sort1, tp_sort_header_tool | tp_sort_header_tool whenever the input has a header — sort1’s header handling is implicit |
| Group/aggregate | datamash_ops (73), Grouping1 (19), query_tabular | datamash_ops is the canonical choice; Grouping1 survives in older microbiome workflows |
| Join on key | tp_easyjoin_tool, tp_multijoin_tool, query_tabular (SQL JOIN), collection_column_join (collection-shape) | tp_easyjoin_tool for two-file 1:1; collection_column_join for the collection→wide-table case |
| Header injection | awk BEGIN-block, tp_text_file_with_recurring_lines+tp_cat | awk is far more frequent (see §5 idiom #1) |
| Header strip | Remove beginning1, awk NR > 1, tp_head_tool | Remove beginning1 is purpose-built and dominant |
| Concatenate (row-bind) | tp_cat, cat1, collapse_dataset (collection→single) | collapse_dataset for collections; tp_cat for plain two-file |
| Dedupe | tp_sorted_uniq, tp_uniq_tool, datamash_ops collapse + count | tp_sorted_uniq for line-level; datamash_ops for key-aware dedupe |
| Replace text | tp_find_and_replace (66), tp_replace_in_line (39), tp_replace_in_column (11), tp_sed_tool (16), awk | tp_replace_in_column when the substitution is per-column (preserves other columns); tp_find_and_replace for whole-line; the rest are all redundant flavors of “regex replace” |
| Wide pivot | collection_column_join, awk | collection_column_join for the collection-to-table case (only viable use); no tool for wide↔long generic |
| Transpose | table_compute (rare) | No good IWC-attested option; gap |
Tool-anchored deep-dive candidates (one tool covers many operations, complex parameterization):
- awk (
tp_awk_tool, 195 steps across all versions) — by far the highest-leverage page. Header injection, BED synthesis, column reshape, regex filter, taxonomy splitting all condense to awk one-liners or short blocks. Thecode: |-field carries a Bash-like single-quoted string OR a multiline awk program; both shapes are present (comparemgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:222quoted-string form vsmapseq-to-ampvis2.gxwf.yml:101multiline|-form). - datamash_ops (73 steps) — the only group/aggregate path that scales beyond toy
Grouping1. Thegrouping:field is a comma-separated string,operations:is a list of{op_name, op_column}pairs, bothheader_in/header_outare independent toggles, andneed_sort:matters for correctness. Worth a dedicated page. - column_maker / Add_a_column1 (93 steps) — Python expressions over
cN, plus theerror_handlingboilerplate, plus theadd_column.mode: I/R/""(insert/replace/append) mini-DSL. Worth a page; reviewers will reach for it constantly. - query_tabular (16 steps) — niche but powerful (SQL window functions, multi-table JOINs). Worth a page so users don’t reach for awk when SQL is genuinely cleaner.
- collection_column_join (32 steps) — the wide-pivot idiom, and unobvious unless you already know about it. Worth a page.
4. Candidate operation-pattern boundaries
Proposed operation pages, each scoped tightly. Where a candidate is weak, I say so.
-
tabular-filter-by-column-value—Filter1withcond: cN == 'X' or cN > Y. Citesars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:545,epigenetics/consensus-peaks/consensus-peaks-atac-cutandrun.gxwf.yml:320,sars-cov-2-variant-calling/sars-cov-2-consensus-from-variation/consensus-from-variation.gxwf.yml:276. Tools:Filter1. Why: highest-frequency single operation; theheader_lines:parameter is an easy-to-miss correctness lever; thecond:mini-language is non-obvious (Python withcNcolumns). Keep. -
tabular-filter-by-regex(ortabular-grep) — Cover bothGrep1andtp_grep_tooland resolve the redundancy. Citecomparative_genomics/hyphy/capheine-core-and-compare.gxwf.yml:687. Keep, but the page must take a position on which to recommend (see open question Q1). -
tabular-cut-and-reorder-columns—Cut1withcolumnList: cN,cM,..., including the use of out-of-order lists for column reordering. Citesars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:782(reorder),:830,:878. Notedelimiter: Tconstant. Keep. -
tabular-compute-new-column—column_maker/Add_a_column1with theerror_handlingboilerplate andadd_column.mode: I/R/""DSL. Citesars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:316-329,:462-472,consensus-from-variation.gxwf.yml:344. Keep — this is one of the most foot-gun-prone tools in IWC due to silent column-type coercion whenauto_col_types: false. -
tabular-group-and-aggregate-with-datamash—datamash_opswith multi-columngrouping:, sequencedoperations:. Citesars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-373(collapse-by-many),:562-596(count/min/max in one step),:632-657(single-op countunique). Keep — high-density idiom page. -
tabular-join-on-key—tp_easyjoin_toolwithcolumn1/column2,empty_string_filler,jointype(note the leading-space convention for outer). Citevariation-reporting.gxwf.yml:601,:722,:1059. Mentiontp_multijoin_toolfor >2 inputs andquery_tabularfor SQL JOIN. Keep. -
awk-in-galaxy(deep tool-anchored page) —tp_awk_toolquoted vs multilinecode:shapes,BEGIN {OFS="\t"}rituals,NR > 1header skip,gsub,split. Cite §2g and §5 examples. Keep — must be a deep page; see open question Q2 about depth. -
collection-to-wide-table-with-collection_column_join— the wide-pivot idiom. Citemapseq-to-ampvis2.gxwf.yml:202,mags-building/MAGs-generation.gxwf.yml:1483,1544,1656,mag-genome-annotation-parallel/MAG-Genome-Annotation-Parallel.gxwf.yml:811. Keep — non-obvious without a worked example. -
collection-to-single-tabular-with-collapse_dataset— sibling to the above for the row-bind direction. Citevariation-reporting.gxwf.yml:414,mapseq-to-ampvis2.gxwf.yml:178. Theadd_name/one_header/place_nametriad is the foot-gun. Keep. -
tabular-sql-with-query_tabular— Citemapseq-to-ampvis2.gxwf.yml:33(window function),mgnify-amplicon-pipeline-v5-rrna-prediction.gxwf.yml:137. Thetables[].input_opts.linefiltersandtbl_opts.column_names_from_first_lineboilerplate is non-trivial. Keep — this page exists to narrow when query_tabular is the right reach (SQL windows, JOINs), not to evangelize it.
Drop / merge candidates (don’t deserve their own page):
tabular-sort—sort1andtp_sort_header_toolare simple enough to roll into a one-line note inside other pages. Possible exception: a one-paragraph stub disambiguating the two.tabular-dedupe— same;tp_sorted_uniqandtp_uniq_toolare thin wrappers, low pitfall density.tabular-row-bind— collapse into thecollapse_datasetpage; standalonetp_catdoesn’t merit a page.tabular-format-convert— corpus is too sparse. Document as a gap, not a page (see §6 Q5).tabular-pivot-wide-to-long— no concrete IWC pattern; corpus does this case-by-case in awk. Don’t write a page until there’s signal.sed-in-galaxy—tp_sed_tool(16) is dwarfed bytp_find_and_replace(66) and awk (195). Cover sed as a one-section note inside the awk page rather than a sibling page.tabular-header-strip—Remove beginning1is one-parameter (num_lines); too thin for a standalone.
5. Surprising or recurring idioms
-
The
BEGIN {OFS="\t"; print "header\there\there"} { print rows }ritual. Constant across families:microbiome/mags-building/MAGs-generation.gxwf.yml:1090,VGP-assembly-v2/Purge-duplicates-one-haplotype-VGP6b/Purging-duplicates-one-haplotype-VGP6b.gxwf.yml:855,:1631(Metric\tPrimaryvsMetric\tAlternate— same workflow, two haplotypes, identical rite). Authors reach for awk to attach a header rather than concatenating a constant header file withtp_cat—tp_text_file_with_recurring_linesexists for the latter but is rarely used (12 step instances corpus-wide vs awk’s 195). -
Datamash → Find&Replace round-trip to emulate “collapse with delimiter X”.
sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:333-373collapses 7 columns;:375-407then runstp_find_and_replacewith a 7-group regex\t([^\t,]+),[^\t]+\t...to keep only the first comma-delimited member of each collapsed cell. This is “argmax in datamash” implemented as “collapse-then-regex-trim-to-first” because datamash doesn’t have a “first” op accessible through this UI. Pattern page should warn this is the pattern, and that getting the regex group count right is the primary failure mode. -
The same workflow uses
Filter1ANDquery_tabularfor distinct cases.sars-cov-2-variation-reporting/variation-reporting.gxwf.ymlusesFilter1(line 545) for the simplec4=='PASS'case but reaches for datamash, easyjoin, find&replace chains for the fan-out/fan-in. There is a real implicit decision boundary:Filter1for one-shot row filters; switch tools when the operation needs sort-aware or join-aware semantics. -
collection_column_joinas the only attested wide-pivot. Six workflows use it, all with the same shape: a collection of(id, value)tabulars (one per sample/element) joined on column 1 withfill_char: "0".microbiome/mags-building/MAGs-generation.gxwf.yml:1483,1544,1656does this three times in one workflow for three different metric families. There’s nopivot_table-style tool in the corpus; this collection-shape join is the workflow author’s substitute, and only works because the upstream produces one (id, value) file per element. -
collapse_datasetwithadd_name: true, place_name: same_multiple, one_header: trueis the canonical “merge a per-sample collection into a single annotated tabular”.sars-cov-2-variant-calling/sars-cov-2-variation-reporting/variation-reporting.gxwf.yml:414,microbiome/pathogen-identification/.../Pathogen-Detection-PathoGFAIR-Samples-Aggregation-and-Visualisation.gxwf.yml:553(withcollection_column_join),microbiome/host-contamination-removal/.... Without theone_header: trueyou get duplicated header rows; withoutadd_name: trueyou lose the per-row sample identity. Both bugs are silent. -
Version pin sprawl in the awk tool is the leading cause of “same idiom shows up four times”.
tp_awk_toolexists in the corpus at9.3+galaxy1(82 steps),9.5+galaxy0(30),9.5+galaxy2(13),9.5+galaxy3(60+27 in two clusters). Same parameter shape across all of them. Reviewer-facing pattern: pick the highest pin currently in any live workflow and discourage downgrading; do not block PRs for older pins on cleanup-pass grounds.
6. Open questions
- Q1.
Grep1(47) vstp_grep_tool(43) — semantic difference real? Both takepattern,invert,keep_header. Suggest the regex page recommend one and demote the other; need your call which. - Q2.
awk-in-galaxypage depth: one page covering all 195 invocations, or split into 4 sub-pages (awk-header-injection,awk-bed-synthesis,awk-taxonomy-split,awk-relabel)? Lean: one page with idiom sections; split only if frontmatter cross-linking gets noisy. - Q3. Should
Add_a_column1page warn againstauto_col_types: false? Many corpus uses set it true, some false (variation-reporting.gxwf.yml:454); silent string-vs-numeric coercion is a real bug source. Need your call on prescriptiveness. - Q4. Is
query_tabulardeep-dive in scope for this hierarchy or its own thing? It overlaps Galaxy’s broader “compute over tabular” story (R, Python, csvtk-shaped). Lean: keep it in this hierarchy as the SQL operation. - Q5. Tabular format conversion is genuinely sparse (no
tab_to_csv, nocsv_to_tabops;biom_convertis the only thing close). Write a one-paragraph “gap note” page or skip entirely? - Q6. Older tool IDs (
Grouping1,cat1,addValue/1.0.1,Remove beginning1,Paste1,sort1) — do pages mention them as legacy or only the modern equivalent? Lean: mention with a “newer alternative” pointer; corpus still uses them so doc-blindness is wrong. - Q7. Out-of-scope in this survey but adjacent:
multiqcandtooldistillator_summarizeproduce tabular outputs that downstream tabular tools then chew on. Worth a “tabular sources” cross-reference page or skip? - Q8. Should the wide-pivot page be
collection-to-wide-table-with-collection_column_join(tool-anchored) ortabular-pivot-collection-to-wide(operation-anchored with the tool as recommendation)? Lean: operation-anchored title, tool-anchored content.
7. Decisions (2026-04-30)
Resolved via AskUserQuestion after this survey landed. Pinned here so the next subagent inherits them without re-litigation.
- Naming axis (Q8 + general). Operation-anchored page names. Tool-anchored content is fine inside an operation-named page. Even the awk-split sub-pages get operation names (
tabular-prepend-header,tabular-synthesize-bed-from-3col,tabular-split-taxonomy-string,tabular-relabel-by-row-counter); awk is the implementation, not the title. - awk page depth (Q2). Split into 4-5 operation-named sub-pages per the bullet above. No single
awk-in-galaxyumbrella page; cross-link the awk-as-recipe sub-pages from a §Recipes line in any operation page that uses awk. - Grep1 vs tp_grep_tool (Q1). Recommend
tp_grep_tool. DemoteGrep1to a “legacy alternative” footnote. Consistency with the rest of thetp_*family wins over slight corpus-frequency edge ofGrep1. - Format-conversion gap (Q5). Skip. Corpus-first principle: no exemplar = no page. The §2m gap note in this survey stands as the only record.
auto_col_types(Q3). Thetabular-compute-new-columnpage prescribes a strict structured rule:- Always set
fail_on_non_existent_columns: true(51/51 corpus instances). non_computable.action: --fail-on-non-computableis the dominant choice (49/51); the two--skip-non-computableexceptions (consensus-from-variation.gxwf.yml:364,:402) are intentional, for BED-coordinate arithmetic where some rows are legitimately non-numeric.auto_col_typesis per-expression-kind:Expression kind auto_col_typesArithmetic on raw cN((c18+c19)/c6,round(...))truePure string concat ( c5 + '>' + c6)falseArithmetic with explicit casts ( int(cN),float(cN))falseMixed split into two expressions:entries with different settings- Corpus distribution: 48
true/ 3false. Citevariation-reporting.gxwf.yml:307-329(true, raw-cNarithmetic),:438-475(false, string concat), andconsensus-from-variation.gxwf.yml:343-378(false, explicit-cast arithmetic) as the canonical triple. - Note on YAML shape:
expressions:is nested undertool_state.ops.expressions(withheader_lines_select: yes|noas sibling).error_handlingis a top-level sibling ofops, not nested inside it. The pattern page must show this shape; flatexpressions:does not roundtrip.
- Always set
- Legacy tool IDs (Q6). Pages name the modern tool primarily; include a short “Legacy alternative” footnote pointing to the old ID (
Grouping1,cat1,addValue/1.0.1,Remove beginning1,Paste1,sort1). Reading old IWC workflows must remain possible. query_tabular(Q4). Leaf in this tabular hierarchy astabular-sql-query. Scope narrowly to “when SQL is the right reach” — window functions, multi-table JOINs, project+compute fused. Cross-link from filter / join / compute leaves; do not evangelize.- Tabular-source cross-ref (Q7). Deferred. If a Mold (e.g.
summarize-galaxy-tool) later needs to point to multiqc/tooldistillator-as-tabular-source context, write the page then.