SI_Recipe_S1_mobile_resistome

UC1 Setup Recipe — MRSA Mobile AMR Context Across Isolates

Issue: https://github.com/jmchilton/galaxy-brain/issues/12 Science: Comparative ARG↔IS proximity analysis across 4 S. aureus isolates. Headline: aac(6')-aph(2'') is IS6-adjacent on plasmids of KUN1163/KUH140046 but IS256-adjacent on KUH180129 chromosome. Extracted workflow ID on reference server: 33b43b4e7093c91f (/tmp/uc1_workflow.ga) — 14 steps, 1 input, 13 tools, zero dangling, 9 workflow outputs.

Server-agnostic recipe to recreate the UC1 notebook on another Galaxy. Tool versions and parameters were confirmed against the reference server (Galaxy 26.2.dev0). Where a value is instance-specific (encoded dataset ids), the recipe gives the step to regenerate it.


1. Tools to Install

Tool Shed repos (install via ephemeris shed-tools)

# Install ephemeris isolated (do NOT pip-install into Galaxy's .venv):
uv tool install ephemeris

# Install UC1 tool set:
shed-tools install -g http://<SERVER>:8080 -a <ADMIN_KEY> \
  -t mrsa-mobile-amr-tools.yml \
  --skip-install-resolver-dependencies

mrsa-mobile-amr-tools.yml installs: staramr (nml), bakta (iuc), isescan (iuc), integron_finder (iuc), plasmidfinder (iuc), tbl2gff3 (iuc), text_processing (bgruening), bedtools (iuc), jbrowse (iuc). Also install (needed for the on-graph figures and collection operations): iuc/ggplot2_heatmap2, nml/collapse_collections.

Confirmed tool IDs and versions (from running server)

ToolFull tool IDVersion
staramrtoolshed.g2.bx.psu.edu/repos/nml/staramr/staramr_search0.12.3+galaxy0
isescantoolshed.g2.bx.psu.edu/repos/iuc/isescan/isescan1.7.3+galaxy0
integron_findertoolshed.g2.bx.psu.edu/repos/iuc/integron_finder/integron_finder2.0.5+galaxy1
bedtools closesttoolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_closestbed2.31.1+galaxy1
bedtools sortbedtoolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_sortbed2.31.1+galaxy0
tp_awk_tooltoolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_awk_tool9.5+galaxy3
ggplot2_heatmap2toolshed.g2.bx.psu.edu/repos/iuc/ggplot2_heatmap2/ggplot2_heatmap23.3.0+galaxy0
collapse_collectionstoolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset5.1.0

Builtin (no install): __DATA_FETCH__ (collection upload from URL).

Docker / container setup

Enable Docker for containerized jobs (docker_enabled job env + enable_mulled_containers: true, conda_auto_install: false). Pre-pull BioContainer images before running, especially on Apple Silicon (amd64 emulation is correct but slow).

Bakta (OPTIONAL — structural confirmation only, NOT in the extractable core)

Deliberately excluded from the extractable clean pipeline. If desired as non-extractable enrichment:


2. Reference Data / DB Setup

None for the extractable core. The analysis is assembly-based (FASTA→BED); no genome index or .loc registration is needed.


3. Input Data

One list collection of 4 combined chromosome+plasmid FASTA files, fetched directly from ENA via a single POST /api/tools/fetch (set history_id):

{
  "targets": [{
    "destination": {"type": "hdca"},
    "collection_type": "list",
    "name": "MRSA isolate assemblies",
    "elements": [
      {"name": "KUH140013", "ext": "fasta", "src": "url",
       "url": "https://www.ebi.ac.uk/ena/browser/api/fasta/AP020311.1,AP020312.1"},
      {"name": "KUH140046", "ext": "fasta", "src": "url",
       "url": "https://www.ebi.ac.uk/ena/browser/api/fasta/AP020313.1,AP020314.1"},
      {"name": "KUH180129", "ext": "fasta", "src": "url",
       "url": "https://www.ebi.ac.uk/ena/browser/api/fasta/AP020322.1,AP020323.1"},
      {"name": "KUN1163", "ext": "fasta", "src": "url",
       "url": "https://www.ebi.ac.uk/ena/browser/api/fasta/AP020324.1,AP020325.1"}
    ]
  }]
}

Element names (isolate IDs) must be exactly KUH140013, KUH140046, KUH180129, KUN1163 — they become the matrix column labels.

ElementMLSTChrPlasmidGenome bp
KUH140013ST8AP020311.1AP020312.12,837,394
KUH140046ST8AP020313.1AP020314.12,859,431
KUH180129ST6313AP020322.1AP020323.12,897,282
KUN1163ST764AP020324.1AP020325.12,944,787

Source: Hikichi et al. 2019, BioProject PRJDB8599.


4. Pipeline Steps

All steps map over the input collection. Use {"batch":true,"values":[{"src":"hdca","id":<collection_id>}]} for the collection input parameter when calling POST /api/tools directly (the MCP run_tool does not surface map-over; use direct API).

Step 1 — staramr …/staramr_search/0.12.3+galaxy0

{"pointfinder_db": {"use_pointfinder": "disabled"},
 "advanced": {"genome_size_lower_bound": "4000000", "genome_size_upper_bound": "6000000",
  "minimum_N50_value": "10000", "minimum_contig_length": "300", "mlst_scheme": "auto",
  "pid_threshold": "98.0", "plasmidfinder_type": "include_all", "plength_plasmidfinder": "60.0",
  "plength_resfinder": "60.0", "report_all_blast": false}}

The genome-size “quality-failed” warning is expected for S. aureus (~2.8–2.9 Mb vs the 4–6 Mb window); ResFinder/PlasmidFinder calls are unaffected. Outputs used: resfinder.tsv, mlst.tsv, summary.tsv, plasmidfinder.tsv collections.

Step 2 — ISEScan …/isescan/1.7.3+galaxy0

{"remove_short_is": true, "log_activate": false}

CRITICAL GOTCHA: remove_short_is MUST be true. The default (false) keeps partial/unclassified elements (family=new) → spurious distance-0 IS overlaps → corrupted context matrix (KUN1163 shows 25 vs the validated 17 IS). Headline survives either way, but only true reproduces byte-identical results. Output used: results in gff format collection.

Step 3 — Integron Finder …/integron_finder/2.0.5+galaxy1

{"type_replicon": "--circ", "local_max": false, "promoter_attI": false, "gbk": false, "pdf": false,
 "settings": {"attc_settings": {"calin_threshold": "2", "dist_thresh": "4000", "max_attc_size": "200", "min_attc_size": "40"},
  "protein_settings": {"func_annot": false, "no_proteins": false}}}

Expected: zero integrons in all four (expected for S. aureus). Outputs: Integron annotations, Summary collections.

Step 4 — ARG→BED (tp_awk on resfinder.tsv collection)

NR>1 { s=$9; e=$10; if (s>e) { t=s; s=e; e=t } print $8"\t"(s-1)"\t"e"\t"$2 }

Output: contig, start(0-based), end, gene ($8=contig, $9/$10=start/end, $2=gene; swap for reverse-strand hits).

Step 5 — IS→BED (tp_awk on ISEScan gff collection)

/family=/ { fam="IS"; if (match($9, /family=[^;]+/)) fam=substr($9, RSTART+7, RLENGTH-7); print $1"\t"($4-1)"\t"$5"\t"fam }

Output: contig, start(0-based), end, IS_family.

Steps 6 & 7 — SortBED …/bedtools_sortbed/2.31.1+galaxy0

Default params (lexicographic). Map over the ARG BED collection (step 6) and the IS BED collection (step 7).

Step 8 — bedtools closest -d …/bedtools_closestbed/2.31.1+galaxy1

Input A = ARG sorted BED (mapped); Input B = IS sorted BED (mapped; matched element-wise).

{"ties": "all", "strand": "", "addition": true, "addition2": {"addition2_select": ""}, "k": "1", "io": false, "mdb": "each"}

(addition: true = -d; ties: all; mdb: each.) Output (9 cols): ARG_contig, ARG_start, ARG_end, gene, IS_contig, IS_start, IS_end, IS_family, distance.

Step 9 — Collapse Collection …/collapse_dataset/5.1.0

{"one_header": false, "filename": {"add_name": true, "place_name": "same_multiple"}}

Prepends the element identifier as column 1. Output (10 cols): isolate, contig, ARG_start, ARG_end, gene, IS_contig, IS_start, IS_end, IS_family, distance.

Step 10 — Matrix 1 (ARG location, gene × isolate) tp_awk

BEGIN{FS=OFS="\t"}
{ iso[NR]=$1; con[NR]=$2; gene[NR]=$5; n=NR; if($4+0>mx[$2]) mx[$2]=$4+0; isos[$1]=1 }
END{
  for(k=1;k<=n;k++){
    code=(mx[con[k]]<200000)?2:1; g=gene[k]; ix=iso[k];
    if(!((g SUBSEP ix) in L) || code==2) L[g,ix]=code; genes[g]=1;
  }
  ni=0; for(s in isos){ni++; A[ni]=s}
  for(a=1;a<=ni;a++)for(b=a+1;b<=ni;b++)if(A[b]<A[a]){t=A[a];A[a]=A[b];A[b]=t}
  ng=0; for(g in genes){ng++; GN[ng]=g}
  for(a=1;a<=ng;a++)for(b=a+1;b<=ng;b++)if(GN[b]<GN[a]){t=GN[a];GN[a]=GN[b];GN[b]=t}
  printf "Gene"; for(a=1;a<=ni;a++) printf "%s%s",OFS,A[a]; printf "\n";
  for(r=1;r<=ng;r++){ printf "%s",GN[r]; for(a=1;a<=ni;a++) printf "%s%d",OFS,(L[GN[r],A[a]]+0); printf "\n"; }
}

Self-contained plasmid classifier: max(ARG_end) per contig < 200000 bp = plasmid (code 2), else chromosome (code 1); plasmid wins on mixed copies. Output: gene × isolate, 0=absent / 1=chromosome / 2=plasmid.

Step 11 — Matrix 2 (ARG mobile context, context × isolate) tp_awk

BEGIN{FS=OFS="\t"}
{ iso[NR]=$1; con[NR]=$2; di[NR]=$10; n=NR; if($4+0>mx[$2]) mx[$2]=$4+0; isos[$1]=1 }
END{
  for(k=1;k<=n;k++){
    code=(mx[con[k]]<200000)?2:1;
    if(code==2) ctx="Plasmid-borne";
    else if(di[k]+0<1000) ctx="Chromosomal IS-adjacent (<1kb)";
    else ctx="Chromosomal distal";
    M[ctx,iso[k]]++;
  }
  ni=0; for(s in isos){ni++; A[ni]=s}
  for(a=1;a<=ni;a++)for(b=a+1;b<=ni;b++)if(A[b]<A[a]){t=A[a];A[a]=A[b];A[b]=t}
  split("Plasmid-borne|Chromosomal IS-adjacent (<1kb)|Chromosomal distal",CTX,"|");
  printf "Context"; for(a=1;a<=ni;a++) printf "%s%s",OFS,A[a]; printf "\n";
  for(r=1;r<=3;r++){ printf "%s",CTX[r]; for(a=1;a<=ni;a++) printf "%s%d",OFS,(M[CTX[r],A[a]]+0); printf "\n"; }
}

Steps 12 & 13 — Heatmaps …/ggplot2_heatmap2/3.3.0+galaxy0

Fig 1 (Matrix 1) title “ARG genomic location (0=absent, 1=chromosome, 2=plasmid)”; Fig 2 (Matrix 2) title “ARG counts by mobile context”. Both:

{"transform": "none", "zscore_cond": {"scale": "none", "zscore": "none"},
 "cluster_cond": {"cluster": "no"}, "labels": "both",
 "colorchoice": {"name": "BrBG", "type": "palettes"}, "image_file_format": "png"}

5. Notebook directives (on-graph outputs the page displays)

One galaxy-fenced block per directive:

  1. history_dataset_collection_display(history_dataset_collection_id=<staramr mlst collection>)
  2. history_dataset_collection_display(history_dataset_collection_id=<staramr summary collection>)
  3. history_dataset_as_image(history_dataset_id=<Fig1 location heatmap PNG>)
  4. history_dataset_collection_display(history_dataset_collection_id=<staramr plasmidfinder collection>)
  5. history_dataset_collection_display(history_dataset_collection_id=<staramr resfinder collection>)
  6. history_dataset_collection_display(history_dataset_collection_id=<Integron Finder Summary collection>)
  7. history_dataset_collection_display(history_dataset_collection_id=<ISEScan tabular results collection>)
  8. history_dataset_collection_display(history_dataset_collection_id=<bedtools closest collection>)
  9. history_dataset_as_image(history_dataset_id=<Fig2 context heatmap PNG>)

(Titles with spaces in history_dataset_as_table must be quoted.)


6. Verification

CheckExpected
bedtools closest collection (with remove_short_is=true)byte-identical across all 4 isolates
Matrix 2 (context)byte-identical
Matrix 1 cellsidentical (row order may be alphabetical)
KUN1163 IS count17 complete elements
aac(6')-aph(2'') KUN116380 bp, IS6, plasmid
aac(6')-aph(2'') KUH140046467 bp, IS6, plasmid
aac(6')-aph(2'') KUH18012984 bp, IS256, chromosome
Integronszero in all four
Page extraction14 seeded, 8 ICJ map-over, 9 exposed, 0 warnings, 0 dangling
Extracted workflow14 steps (1 input + 13 tools)

Headline (load-bearing science): same gene aac(6')-aph(2''), different mobilizing IS family by location (IS6 plasmids vs IS256 chromosome).


7. Extraction

Page-extraction feature (PR #22860 / extract_next):

GET  /api/pages/<page_id>/workflow_extraction_summary
POST /api/workflows/extract {"from_page_id": "<page_id>", "history_id": "<history_id>"}

Expected: 14 steps, every input_connection resolves (zero dangling), 9 workflow outputs, report rewritten with 0 leftover instance ids. Cleanest extraction of the three — genuine collection map-over, no anti-patterns, no code change required.