UC1_DEBRIEF

UC1 debrief — MRSA mobile-AMR context across isolates (issue #12)

Built interactively via the notebooks MCP against the local Galaxy (worktree history_pages), 2026-06-12. This is the debrief for use case 1; setup story is in SETUP_DEBRIEF.md, operational facts in index.md.

Artifacts produced

ThingID / location
HistoryMRSA mobile AMR context across isolates60559d1a5f859493
Notebook pageb887d74393f85b6d (history-attached; latest revision eca0af6fb47bf90c)
Isolate collectionMRSA isolate assemblies (f597429621d6eb2b), 4 elements
bedtools closest output collectionfb85969571388350
Figure 1 — ARG location heatmap (PNG)60e680a037f41974
Figure 2 — ARG counts by mobile context (PNG)55504e7a2466a2e3
Bakta GFF3 (solo runs, populated)KUN1163 2f4933c7d721d5e8, KUH180129 b1cd55a75be0e1ad, KUH140046 f2f5db583bb871d6
Plotting tool installediuc/ggplot2_heatmap2 (the only plotting tool available)
Bakta light DB v5.1database/bakta_db/db-light (1.48 GB tar → 3.4 GB) + AMRFinder DB database/bakta_db/amrfinderplus-db

What was done

4 complete S. aureus isolates (Hikichi 2019, BioProject PRJDB8599), each imported as a combined chromosome + plasmid FASTA straight from ENA’s comma-separated FASTA API (.../fasta/AP020324.1,AP020325.1):

Pipeline, all mapped over the assemblies collection (so it extracts as one reusable workflow):

assemblies → staramr ──── resfinder.tsv ─ awk→BED ─ SortBED ┐
          → ISEScan ───── results.gff ── awk→BED ─ SortBED ┴─ bedtools closest -d → ARG↔IS distances
          → Integron Finder ── (zero integrons)

staramr (ResFinder+PlasmidFinder+MLST, PointFinder disabled), ISEScan, Integron Finder, then Text reformatting (awk) → BED, bedtools SortBED, bedtools closest -d. All jobs ran in Docker/BioContainers.

Headline finding (survives review)

The same gene aac(6')-aph(2'') sits in a different mobile context depending on location — next to an IS6 element on the plasmids of KUN1163 (80 bp) and KUH140046 (467 bp), but next to an IS256 element on the chromosome of KUH180129 (84 bp). Same resistance gene, different mobilizing-element family. Supporting comparative points: KUH180129 is mecA-negative (genotypically MSSA) and ST-divergent; integrons absent in all four (expected for S. aureus); plasmids are IS6-enriched; KUH180129’s chromosome has an 18-copy IS256 expansion.

Design decisions worth keeping

Snags / environment findings

  1. MCP upload bug (fixed in worktree). upload_file_from_url in lib/galaxy/agents/operations.py pre-decoded history_id to an int and passed it to FetchDataPayload.history_id (a DecodedDatabaseIdField that wants the encoded string) → every upload failed. One-line fix (pass encoded id through). Present in dev and release_26.1 — originating PR #21942, commit 36a5e66089. Full writeup + diff: ../FETCH_MCP_ISSUE.md.
  2. MCP run_tool can’t express collection map-over. Core Galaxy rejects a bare {src:hdca} on a single input; map-over needs the {"batch":true,"values":[{"src":"hdca",...}]} encoding, which run_tool doesn’t surface. Every mapped step here was driven via direct POST /api/tools instead. Noted as a related gap in FETCH_MCP_ISSUE.md.
  3. history_dataset_as_table directive args with spaces must be quoted (title="..."); one directive per ```galaxy fence.
  4. Bakta light DB collides with its wrapper. The bakta light DB v5.1 bundles an amrfinderplus-db/ dir, but the Galaxy wrapper does mkdir amrfinderplus-db && ln -s db-light/* database_pathln: database_path/amrfinderplus-db: File exists, all jobs error. Fix: move the bundled amrfinderplus-db out of db-light, register it as the separate AMRFinder DB, leave db-light without it. The bundled AMRFinder is already indexed (2023-11-15.1), so no separate download/indexing was needed — registered both DBs manually in the .loc files (bakta_version col must = 1.7; amrfinder db_version col must = 3.12), repointed shed_tool_data_table_conf.xml off the .loc.sample files, and restarted (data-table .loc path changes need a restart — there’s no working reload endpoint).
  5. No general plotting tool is in the installed set — only deeptools/pyGenomeTracks (genomic-signal). Installed iuc/ggplot2_heatmap2 for the figures; it does heatmaps only, so the “stacked bar” was rendered as a context-count heatmap. Matrices were created as datasets via the /api/tools/fetch pasted source.
  6. Bakta AMRFinder version skew (silent failure). Bakta ran fine through CDS prediction then crashed at “conduct expert systems” with amrfinder error code 1. Root cause: the amrfinder binary in the bakta 1.9.4 container is 3.12.8, which requires DB ≥ 2023-12-15.2, but the bakta light DB bundles amrfinder DB 2023-11-15.1 (too old). The wrapper has no --skip-amr option, and the bakta ... | tee pipe swallows bakta’s non-zero exit → the job reports ok with 0-byte outputs. Fix: amrfinder_update --force_update --database <dir> inside the bakta container (downloads + indexes 2024-07-22.1), repoint the amrfinder .loc, restart, re-run. Diagnosis required running bakta/amrfinder manually in the container (quay.io/biocontainers/bakta:1.9.4--pyhdfd78af_0).
  7. Bakta OOM under 4-wide emulation. Mapped over 4 isolates, only 1 produced output (again silent-ok-empty); the other 3 ran out of memory under concurrent amd64-emulated diamond/blast. Fix: run the failed isolates one at a time (each then succeeded, ~3.5 MB GFF3). Lesson: on this machine, run Bakta serially, not mapped 4-wide.

Review pass (subagent) — corrections applied

A review subagent checked the notebook against the raw data and issue #12. It caught two factual errors (now fixed) and several overclaims (now caveated):

Closing the paper-worthy gap (in progress)

After the review, the user chose to close the gap: figures + Bakta.

Still-open gaps vs issue #12

Extractability — core pipeline extractable; figures + Bakta are not

Job-graph scan: the MRSA isolate assemblies collection exists and staramr / ISEScan / Integron Finder / awk→BED / SortBED / bedtools closest all run mapped over it (“on collection 25”). So the mobile-context pipeline extracts cleanly as a collection-parameterized workflow — this is the model the other two should follow.

Two things would NOT survive extraction and should be redone if a fully-extractable workflow is wanted:

Page-extraction audit — 2026-06-13 (the cleanest of the three)

Ran the real page-based extraction (extract_next/#22860, now merged) against page b887d74393f85b6d. Best result of the three use cases. Summary: 24 rows, 18 seeded, 9 exposed outputs, 12 map-over (ICJ) steps (size-4: isescan, integron_finder, staramr, awk×2, sortbed×2, closest all mapped over the 4 isolates), 0 warnings, 0 seed_warnings, 0 invalid. Extracted workflow ebfb8f50c6abde6d: 18 steps, zero dangling, 9 workflow outputs, 0 report warnings — the notebook→report rewrite carried over clean. The map-over core is exactly the collection-parameterized shape we want.

Two real weaknesses confirmed at extraction:

Net: UC1 is the strongest notebook-extraction showcase — a genuine map-over pipeline that extracts fully connected with its outputs starred. Its only gap is the pasted-matrix figures (a known, already-solved pattern). No new bug; the A/A2 seeding fix (7e1d6d730f) wasn’t even exercised here (no loose-element consumption).

Next-time recipe — make the figures extractable (data transform VERIFIED 2026-06-13)

Both pasted matrices are exactly reconstructable from one tool output already in the graph: the bedtools closest collection (hid 135, ARG↔IS per isolate). Verified by recomputing both locally from the real closest-collection elements + a contig→location classifier and diffing cell-for-cell against the pasted ARG location matrix (hid 140) and ARG mobile-context counts (hid 142) — exact match on all 9 genes × 4 isolates and all 3 contexts × 4 isolates. The closest output is sufficient (it contains every ARG incl. blaZ/fosD, not just IS-adjacent ones), so no need to go back to resfinder/detailed_summary.

The closest row schema (per isolate element): contig, arg_start, arg_end, gene, IS_contig, IS_start, IS_end, IS_family, distance.

Contig → location (plasmid vs chromosome). Two tool-only options:

Build — VERIFIED LIVE 2026-06-13 (simpler than first planned; only 2 new tools needed):

closest collection (hid 135)
  └─ Collapse Collection (filename|add_name=true, place_name=same_multiple)   → flat table, isolate prepended as col1
       (cols: 1 isolate, 2 contig, 3 arg_start, 4 arg_end, 5 gene, 6-9 IS…, 10 distance)
  ├─ tp_awk  → Matrix 2 (context × isolate)   [ONE job does it all]
  └─ tp_awk  → Matrix 1 (gene × isolate)       [ONE job]
  → ggplot2_heatmap2 (image_file_format=png, cluster_cond|cluster=no) on each → PNG  (on-graph, extractable)

The key simplification: a single tp_awk END-block does classification + count + pivot in one pass — no datamash/easyjoin/tp_multijoin and no separate contig-length tool. It stores each row, tracks max(arg_end) per contig (→ code = max<200000 ? plasmid(2) : chromosome(1), the self-contained classifier), then in END accumulates matrix[label][isolate] and prints the wide matrix with sorted isolate columns. Matrix 2 label = context (code==2→Plasmid-borne; code==1 & dist<1000→Chromosomal IS-adjacent (<1kb); else Chromosomal distal); Matrix 1 label = gene, value = code (plasmid wins on mixed copies). The exact awk programs are in the session; both are short.

Plus: drop the duplicate combined staramr on dataset 1-4 — keep only the mapped staramr (icj4).

Verification status — DONE, end-to-end in Galaxy: ran the chain live (Docker up). Galaxy-built Matrix 2 is byte-identical to the pasted hid 142; Galaxy-built Matrix 1 cells are identical to hid 140 (alphabetical row order vs the curated order — cosmetic). Both ggplot2_heatmap2 PNGs render correctly (context × isolate / gene × isolate). So the figures are now genuine on-graph tool outputs and the heatmaps walk back through awk → Collapse → closest → … → assemblies — i.e. the pasted-matrix anti-pattern is eliminated and the figure steps extract. (Installed: nml/collapse_collections, devteam/fasta_compute_length [the latter not needed in the end — the awk max-coord classifier is self-contained].)

One caveat to record: the END-block awk pivot hardcodes the isolate column set (collected + sorted at runtime, so it adapts to whatever isolates are present — but the figure shape is still one column per isolate in the flattened table). This is clean for the map-over study. The only structural limit, as elsewhere, is that there is no crosstab tool in datamash_ops; the awk END-block is the substitute and works for any isolate count present in the data.

CLEAN REBUILD — 2026-06-13 (fresh history, fully extractable, extraction VERIFIED)

Rebuilt from scratch in a clean history per the next-time recipe, to make the whole notebook extract as one workflow. Result: 14-step workflow, zero dangling, 9 outputs, report rewrite flawless — the cleanest of the three.

Artifacts:

How it was built (every step a clean map-over / on-graph tool):

  1. Clean upload — all 4 combined chr+plasmid FASTAs fetched directly into ONE list collection in a single /api/tools/fetch call (ENA …/api/fasta/<chr>.1,<plasmid>.1), element ids = isolate names. Not 4 separate uploads.
  2. staramr mapped over the collection (PointFinder disabled) → implicit collections. Only the mapped run — the duplicate combined staramr on dataset 1-4 is gone.
  3. ISEScan mappedremove_short_is=true (see gotcha below), Integron Finder mapped (--circ, zero integrons).
  4. ARG→BED awk (NR>1 {min/max → contig,start,end,gene}) on resfinder; IS→BED awk (/family=/ → contig,start,end,family) on ISEScan gff; SortBED each; bedtools closest -d (k=1, ties=all, mdb=each) mapped over both — all map-over.
  5. Figures on-graph: Collapse Collection (add_name=true, same_multiple) → flat table → one tp_awk END-block per matrix (self-contained max(arg_end)<200000 plasmid classifier + pivot) → ggplot2_heatmap2 (png, cluster=no). No pasted matrices.

Gotcha caught + fixed mid-build (important for reproducibility): the first ISEScan run used the tool default remove_short_is=false, which keeps partial/unclassified (family=new) elements and injected spurious distance-0 IS overlaps (KUN1163: 25 IS vs the validated 17), corrupting the chromosomal-IS-adjacent counts. The headline still held (aac(6')-aph(2'')→IS6 plasmids / IS256 chromosome, 80/467/84 bp) but the context matrix was noisy. The original used remove_short_is=true (complete elements only). Re-ran with true, purged the bad IS branch → closest then matched the original byte-for-byte. Record this as a required ISEScan param for this analysis.

Extraction (page-based, #22860) — VERIFIED clean. Summary: 18 rows, 14 seeded (the clean pipeline), 9 exposed, 8 map-over ICJ steps, 0 warnings / 0 seed_warnings / 0 invalid. The 4 unseeded rows are orphan jobs from the purged bad-ISEScan branch (outputs purged) — excluded from extraction. Extracted .ga: 14 steps (1 input + 13 tools), every input_connection resolves (zero dangling), 9 workflow outputs, report rewritten with 0 leftover instance ids. Topology exactly: input collection → {staramr, ISEScan, Integron} → BED→Sort×2 → closest → Collapse → 2 matrix awks → 2 heatmaps.

Bakta deliberately omitted from the clean core — it’s the non-mapped/OOM-prone messy part. The notebook reframes the headline at the IS-family level (fully supported by the extractable closest data) and lists Bakta as optional structural confirmation outside the extractable core. This keeps the graph pristine; the comparative payoff (IS6↔IS256 flip across 3 isolates) survives without it.

Suggested next moves

Build the JBrowse locus view, optionally add Bakta structural confirmation (as a separate enrichment, not in the extractable core), or move on to UC2 (TAL1, issue #13) / UC3. UC1 is now a fully-extractable issue-#12 showcase.