Tabular: join on key

Tool

toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool is the primary corpus path for ordinary two-file key joins. Use tp_multijoin_tool when many files share the same key/value shape. Use tabular-sql-query when the join condition, projection, anti-join, named columns, or indexes are the point.

When to reach for it

Use this pattern when two or more tabular datasets share an identifier column and need row-wise alignment.

Do not use this for collection-to-wide pivots; that is tabular-pivot-collection-to-wide. Do not use it for side-by-side paste with no key, row-bind/concat, or grouping. For aggregation around a join, see tabular-group-and-aggregate-with-datamash.

Parameters

For tp_easyjoin_tool:

column1: 1-indexed key column in infile1, quoted as a string.
column2: 1-indexed key column in infile2, quoted as a string.
empty_string_filler: fill value for missing-side cells. Corpus values include "0" and ".".
header: boolean; whether inputs have headers.
ignore_case: boolean key matching option.
jointype: join mode. Corpus default outer-style shape uses a literal single-space string: " ".
infile1, infile2: connected tabular inputs.

For tp_multijoin_tool:

first_file: first connected tabular input.
files: remaining connected tabular inputs.
key_column: shared key column.
value_columns: value columns to carry through from each file.
filler: fill value for missing cells.
input_header, output_header: independent header toggles.
ignore_dups: duplicate-key behavior.

Idiomatic shapes

Two-file key join, fill missing with zero:

tool_id: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_easyjoin_tool/9.3+galaxy1
tool_state:
  column1: "20"
  column2: "1"
  empty_string_filler: "0"
  header: true
  ignore_case: false
  infile1: { __class__: ConnectedValue }
  infile2: { __class__: ConnectedValue }
  jointype: " "

Anchored by the SARS-CoV-2 variation reporting IWC exemplar.

Many two-column files joined by a shared key:

tool_id: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_multijoin_tool/9.3+galaxy1
tool_state:
  first_file: { __class__: ConnectedValue }
  files: { __class__: ConnectedValue }
  key_column: "1"
  value_columns: "2"
  filler: "0"
  input_header: false
  output_header: true
  ignore_dups: false

Anchored by the PathoGFAIR sample aggregation IWC exemplar.

SQL join when SQL semantics matter:

tool_id: toolshed.g2.bx.psu.edu/repos/iuc/query_tabular/query_tabular/3.3.2
tool_state:
  query_result:
    header: yes
    header_prefix: ""
  sqlquery: |-
    SELECT pep.mpep, prot.prot
    FROM pep
    INNER JOIN prot on pep.mpep=prot.pep
  tables:
    - table: { __class__: ConnectedValue }
      tbl_opts:
        table_name: pep
        col_names: mpep
    - table: { __class__: ConnectedValue }
      tbl_opts:
        table_name: prot
        col_names: pep,prot

Anchored by the clinical metaproteomics verification IWC exemplar.

Pitfalls

jointype: " " is intentional. The surveyed easyjoin shape uses a literal single-space string. Do not clean it up to an empty string or --outer without wrapper verification.
Key columns are 1-indexed strings. Use "1", "20", etc. Do not use c1 here.
Header flag must match input. Wrong header settings shift matching and can duplicate or garble headers.
Fill value is semantic. "0" means absent is zero; "." means absent is unknown/missing. Pick based on downstream interpretation.
tp_easyjoin_tool is not SQL. Compound predicates, anti-joins, named-column projection, or indexed joins belong in tabular-sql-query.
tp_multijoin_tool assumes uniform shape. It fits many same-shaped key/value files, not arbitrary heterogeneous tables.
Do not substitute collection joins. collection_column_join is for the collection-to-wide-table idiom, not ordinary two-file joins.

Legacy alternative

Older core join tools may appear in inherited workflows, but the tabular survey’s recommended path is tp_easyjoin_tool for ordinary two-file joins and tp_multijoin_tool for many same-shaped files. Preserve legacy steps when reading old workflows; do not introduce them in new authoring.

Tabular: join on key

Pattern health

Tabular: join on key

Tool

When to reach for it

Parameters

Idiomatic shapes

Pitfalls

Legacy alternative

See also

IWC Exemplars

Incoming References (7)