UC1 Next Steps — bring Bakta annotation + JBrowse locus view into the workflow
Issue: https://github.com/jmchilton/galaxy-brain/issues/12 For: the agent that built the clean UC1 history. Handoff to push the MRSA mobile-AMR analysis from “great paper figures” to “complete workflow.”
Why this handoff
You optimized UC1 for the Galaxy Notebooks paper: a clean collection map-over
that extracts to a flawless 14-step, sample-agnostic workflow with two on-graph
heatmaps. That goal was met (history 48916fac0de9a85d, page eafb646da3b7aac5,
extracted workflow 33b43b4e7093c91f). For that goal, Bakta was deliberately
left out as “non-extractable enrichment” and JBrowse was parked as a TODO.
But the issue’s plan (#12 steps 4 and 8) names both, and at the workflow level they are load-bearing, not decoration:
- Bakta = the gene-context layer. Right now the ARGs are located (plasmid vs chromosome) and given a nearest-IS distance, but nothing annotates the genomic neighborhood — what genes flank the mobile-AMR locus. That’s the structural confirmation the issue asks for.
- JBrowse = the auditable locus view. The headline (
aac(6')-aph(2'')is IS6-adjacent on the KUN1163/KUH140046 plasmids but IS256-adjacent on the KUH180129 chromosome) is currently only legible as matrix cells. A JBrowse locus is the issue’s intended way to show it.
Both tools are already installed (mrsa-mobile-amr-tools.yml: iuc/bakta,
iuc/jbrowse). The full Bakta DB-setup gotchas are documented in
UC1_RECIPE.md §1 (db-light layout, the amrfinder-db move, .loc registration,
serial-on-Apple-Silicon). Don’t re-derive them — follow that section.
Step A — Bakta whole-genome annotation (mapped over the collection)
- Tool:
toolshed.g2.bx.psu.edu/repos/iuc/bakta/bakta(Bakta light DB v5.1 per the recipe). - Input: the existing
MRSA isolate assemblieslistcollection (the same input staramr/ISEScan/Integron Finder map over). - Run mapped over all 4 isolates if runtime allows; on Apple Silicon run serially (one isolate at a time) per the recipe’s OOM note, or narrow to the two headline isolates (KUN1163, KUH180129).
- Output to expose: the Bakta GFF3 annotation collection (and optionally the GBK/summary). This is the new annotation deliverable.
- Decide the wiring:
- Minimum: expose Bakta GFF3 as a workflow output (enrichment; doesn’t feed downstream). Lowest risk, still closes the issue’s “annotate” task.
- Better: feed the Bakta GFF3 as a track into Step B (JBrowse) so the locus
view shows annotated CDS features around the ARG/IS, and/or
bedtools closestthe ARG BED against Bakta CDS features to attach a nearest-gene column to the context table.
- Extraction check: Bakta is a normal mapped tool — confirm it seeds and extracts with zero dangling like the rest of the spine. If it genuinely can’t extract cleanly (it should), keep it as a parallel enrichment branch rather than dropping it.
Step B — JBrowse locus view for the headline mobile-AMR region
- Tool:
toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse. - Issue step 8: “one representative mobile-AMR locus and one contrasting isolate.”
Use the headline contrast — KUN1163 (plasmid, IS6-adjacent
aac(6')-aph(2'')) vs KUH180129 (chromosome, IS256-adjacent same gene). - Reference: the per-isolate assembly FASTA (pull the relevant element from the
MRSA isolate assembliescollection). - Tracks — reuse intermediate outputs already in the workflow, so JBrowse is a
connected step, not a fresh upload:
- ARG sorted BED (the staramr→BED→SortBED output) → ARG features.
- IS sorted BED (the ISEScan→BED→SortBED output) → IS features.
- the
bedtools closestoutput → the ARG↔IS pairing. - optionally the Bakta GFF3 from Step A → gene context.
- Output to expose: the JBrowse HTML per isolate (or one combined), centered on
the
aac(6')-aph(2'')locus.
Verification (what “done” looks like)
| Check | Expected |
|---|---|
| Workflow step count | grows from 14 → ~16+ (Bakta + JBrowse, plus any closest/annotate glue) |
| Bakta output | GFF3 annotation collection exposed as a workflow output |
| JBrowse output | HTML locus view(s) exposed; aac(6')-aph(2'') visible beside its IS in both KUN1163 (plasmid/IS6) and KUH180129 (chr/IS256) |
| Re-extraction | new outputs exposed, zero dangling input_connections, report rewritten clean (hold to the UC1 §7 standard) |
| Science unchanged | the existing 14-step headline still reproduces byte-identically (don’t perturb the staramr/ISEScan/closest spine; remove_short_is=true still required) |
Scope / risks
- Bakta DB + Apple-Silicon serial execution is the main cost — budget for it; it’s a one-time DB setup per the recipe.
- Decide map-over-all-4 vs. two-isolate subset for Bakta/JBrowse before building; the headline only needs KUN1163 + KUH180129.
- Don’t regress the clean extractable core to add these — prefer wiring them as additional connected steps/branches that also extract, per the recipe’s “reference real outputs, don’t re-upload” rule.