UC2_DEBRIEF

UC2 debrief — TAL1 peaks to candidate regulated genes (issue #13)

Built interactively via the notebooks MCP against the local Galaxy (worktree history_pages), 2026-06-13. Setup story in SETUP_DEBRIEF.md, operational facts in index.md. Sibling: UC1_DEBRIEF.md.

Artifacts produced

ThingID / location
HistoryTAL1 peaks to candidate regulated genesf30a35c999095ed7
Notebook pagef0f309c56aff0025 (history-attached)
G1E TAL1 narrowPeak (261)e516d7c43b2ce824
Megakaryocyte TAL1 narrowPeak, pooled R1+R2 (150)7f09d52a860db821
common / G1E-only / mega-only peaks5bb18c11f5b70a41 / 38d70b27d179c236 / 603e7db97773b4d1
promoter windows (TSS −1000/+500)b735ed9e5e005602
distance-to-TSS figure (PNG)d33e32db742aed56
bowtie2 + mm10 index installeddevteam/bowtie2; database/bowtie2_index/mm10 (prebuilt, registered in bowtie2_indices.loc)

Data-source decision

The tutorial’s TAL1 narrowPeak files are tutorial outputs, not hosted inputs (Zenodo 197100 has only FASTQ + the RefSeq/ROI BEDs). GSE51338 hosts real TAL1 peaks but in mm9 broadPeak (build mismatch with the mm10 RefSeq). User chose mm10 + regenerate peaks: align the tutorial FASTQ to mm10 and call peaks ourselves. (Rejected: mm9+ENCODE peaks [build switch], liftover, synthetic.)

What was done

G1E (erythroid) and megakaryocyte TAL1 ChIP, 36 bp single-end, R1 per condition for G1E, pooled R1+R2 for megakaryocyte:

FASTQ ─► Bowtie2 (mm10 prebuilt index) ─► BAM ─► MACS2 callpeak (TAL1 vs input, gsize mouse) ─► narrowPeak
RefSeq mm10 BED12 ─► awk TSS±window (promoters) + awk TSS points
bedtools intersect (G1E vs mega) ─► common / G1E-only / mega-only
   ├─ ∩ promoters (-u) ─► promoter-bound candidate genes
   └─ closest -d vs TSS ─► nearest gene + distance

No read trimming (36 bp, good quality — documented simplification).

Headline result (textbook-correct)

The erythroid↔megakaryocyte lineage switch falls out of the differential TAL1 binding:

Peak-set sizes: G1E 261, mega 150 (pooled), common 39, G1E-only 222, mega-only 110.

Snags / findings

  1. Megakaryocyte single-replicate gave only 17 peaks. Cause: deep TAL1 (4.4M reads) vs shallow input (750k) → MACS2 scales the larger library down to the smaller, killing sensitivity. Adding Tal1 R2 alone would NOT fix it (worsens the imbalance); the fix is deeper control. Pooled R1+R2 for both treatment and control → 150 peaks. (The 17 were all real, at Gata1/Gata2/Runx1/Cbfa2t3.)
  2. mm10 bowtie2 index setup (Bakta-style infrastructure): downloaded the genome-idx prebuilt mm10 index (3.4 GB zip → 6 .bt2), registered in tool-data/shed/bowtie2_indices.loc (cols value, dbkey, name, path; path = index basename), restarted. The shed install of bowtie2 auto-registered the bowtie2_indexes data table. Prebuilt index avoids a slow emulated bowtie2-build.
  3. The MCP upload bug (UC1) and run_tool-no-map-over gap (UC1) still apply; UC2 used run_tool for flat/conditional params (worked) and didn’t need map-over.

Review pass (subagent, live-verified) — corrections applied

A review subagent verified every quantitative claim against the live Galaxy data (peak counts, gene lists, promoter-window logic, dedup), checked MACS2 params, and fact-checked the biology. Verdict: clears the issue #13 MVP bar; all six peak counts, all three promoter-bound gene lists, the strand-aware window logic, and the distance bins matched the data exactly. Fixes applied to the notebook (rev 5114a2a207b7caff):

Completeness gaps the review flagged (beyond the open list below): the issue’s bar chart of candidate counts by class and a consolidated peak-set summary table (peaks / promoter-overlapping / nearest-assigned per set) aren’t present — minor, would strengthen paper-worthiness.

Still-open gaps vs issue #13

Extractability — original NOT extractable; extractable rebuild done (a912e9e5d84530d4)

A full extractable rebuild was done in a fresh history TAL1 peaks to candidate genes (extractable) (a912e9e5d84530d4), notebook 72ad249754f05d26. It reproduces the original science exactly (G1E 261 / mega 150 peaks; common 39 / G1E-only 222 / mega-only 110; same promoter-bound genes incl. Gata1 G1E-only, Fli1 mega-only) but as a clean tool/collection DAG:

This history is the one to run extraction against. The original f30a35c999095ed7 remains as the first-pass reference. Original redo instructions retained below for reference.

Extraction test — run 2026-06-13 (feature WORKS; map-then-reduce-by-element seam breaks)

Booted local Galaxy and ran extraction against a912e9e5d84530d4 both ways — the worktree first had only PR #22706 (history-based extraction); jmchilton then merged extract_next/#22860 mid-session, so the page-based flow was re-tested properly.

Run B (the real feature) — page-based extract_next/#22860

GET /api/pages/72ad249754f05d26/workflow_extraction_summaryPOST /api/workflows/extract {…, from_page_id}. Workflow f597429621d6eb2b, .ga at /tmp/uc2_page_workflow.ga.

What works (the headline — the feature is sound):

Two real defects this history exposes — both rooted in map-over then reduce-by-named-element:

(A) Seeding bug — the map’s input collection is dropped when its output is consumed as loose elements. → FIXED. In lib/galaxy/managers/workflow_extraction_summary.py::_backward_job_closure, map-input recovery (reading implicit_input_collections and enqueuing the input collection) fired only when the queue item was the output HDCA. But MACS2 consumed individual BAM elements (src=hda, e.g. G1E_Tal1 BAM hid 10 3ee1d7c9a966c95c). Walking back from a loose element, the code seeded the Bowtie2 ICJ but never enqueued the reads collection. Net effect: the reads collection (hid 1) was not seeded (hdca_ids: 0), so Bowtie2 extracted dangling, and a spurious hidden fastq element (G1E_Tal1 hid 3) was surfaced as a loose input.

(A2) Sibling elements of a shared fetch job silently dropped → resolved as a side effect of A. The seen_jobs guard meant all 6 fastq elements sharing one __DATA_FETCH__ job → only 1 of 6 surfaced as a boundary input. Now moot: with A fixed, the per-element fastqs are skipped (their input name matches the recovered mapped_input_names) and the collection is seeded instead, so no loose element surfaces at all. (The underlying seen_jobs-drops-siblings behavior remains latent for any non-map “one upload, N siblings” shape, but no longer affects this pipeline.)

(B) Fundamental topology limit — “pick named element out of a collection → single-dataset input” has no workflow-connection representation. MACS2 reads specific BAM elements (treatment vs control, pooled replicates) out of the alignment collection. Even with A fixed (collection seeded), the two MACS2 steps come out disconnected (input_treatment_file: null), because a collection→single-dataset element edge can’t be a workflow connection. Confirmed independently in Run A (classic full-history extract, workflow f2db41e1fa331b3e, /tmp/uc2_workflow.ga): there the collection was seeded and Bowtie2 wired, but MACS2 still dangled — isolating B from A. Not a bug; a real limit.

Net verdict & fix

The extraction feature itself is solid — summary, subgraph seeding, output exposure, and the notebook→report rewrite all work cleanly at this scale. What a912e9e5d84530d4 exposed in a map-over-then-reduce-by-element pipeline was one real seeding bug (A/A2 — now fixed, Bowtie2 wires) and one inherent topology limit (B — the MACS2 element-addressing seam, history-side fix only). So the earlier “the whole UC2 now extracts” claim was over-stated — the figure tail extracts and (post-fix) the alignment map extracts, but the map→MACS2 reduce-by-element seam still needs the Extract Dataset bridge below.

Redo recipe for B — structure the reads so MACS2 reduces/maps instead of element-addressing

The decisive tool fact: MACS2 callpeak’s treatment and control pooling inputs are multiple="true" data params (not <repeat>) — iuc/macs2 .../macs2_callpeak.xml, treatment conditional “Are you pooling?” → input_treatment_file multiple="true", same for control|c_multiple. multiple="true" data inputs participate in collection reduction + map-over (LIST_REDUCTION / NESTED_LIST_REDUCTION). So the pooling we needed isn’t an obstacle — it’s what a list:list reduces into.

Build the reads as two list:list inputs (outer = condition, inner = replicate):

treatment_reads : list:list  { G1E:{r1:G1E_Tal1},  mega:{r1:Mega_Tal1_R1, r2:Mega_Tal1_R2} }
control_reads   : list:list  { G1E:{r1:G1E_input}, mega:{r1:Mega_input_R1, r2:Mega_input_R2} }

Bowtie2  map-over each            → treatment_bams / control_bams (list:list, structure preserved)  [NESTED_LIST_MAPPING]
MACS2    map treatment_bams over treatment(multiple) + control_bams over control(multiple),
         linked on the outer (condition) key                                           [NESTED_LIST_REDUCTION ×2, linked]
   → peaks : list { G1E:…narrowPeak, mega:…narrowPeak }

Inner list reduces (pools replicates — handles the G1E-1-rep vs mega-2-reps asymmetry for free, since multiple="true" takes 1+); outer list maps (one MACS2 job per condition); treatment/control link on the shared outer key → matched inputs. One Bowtie2 step per input, one MACS2 step, zero element-addressing — extracts clean, and a new user just supplies their own two list:list (condition→replicates). Strictly better than the Extract Dataset bridge, which hard-codes element="G1E_Tal1" and only works for these sample names.

One assumption to verify before banking on it: that Galaxy maps two list:list over MACS2’s two multiple="true" inputs simultaneously, linked on the outer key. Compositionally supported by the semantics rules but not pinned by a specific test — worth a one-off live prototype (build the two collections, map-align, run MACS2 once).

Why not sample sheets (the semantically perfect model — the sample-sheet backend doc literally motivates it with ChIP-seq condition/replicate/control_sample): (1) it’s an authored input, not a derivable shape — column_definitions live on the workflow input parameter and are user-filled at run time; extraction emits plain data_collection_input steps and has no path to reconstruct the schema. (2) No tools consume the metadata yet — bowtie2/MACS2 see a sample sheet as a list; routing treatment/control from the columns needs a “split/route by sample-sheet column” tool that the doc’s own Limitations list as future work. So sample_sheet = best hand-authored input, not an extraction target. The nested list:list encodes the same grouping structurally and needs no metadata-aware tools — which is why it extracts. (paired_or_unpaired doesn’t fit at all: treatment/control ≠ forward/reverse, and it caps at 2 elements.)

The residual — and why UC2 is a stress-test, not a showcase, for single-notebook extraction

The list:list restructure dissolves the MACS2 addressing, but not the downstream G1E-vs-mega comparison (common / G1E-only / mega-only). That’s an inherent 2-way set comparison of two named conditions — picking two specific elements out of the peaks list, the same addressing problem in a new spot. The ideal engineering answer is to split into two workflows (a map-over peak-caller + a 2-data-input pairwise comparator), but we’re demoing notebook→one-workflow extraction, so a split defeats the point. Within the one-notebook constraint, the comparison tail still needs 2 condition-pinned Extract Dataset steps (element="G1E_Tal1" etc. — workflow-compatible, so it runs, just not reusably).

Honest verdict: a differential two-condition analysis is not the cleanest showcase for “notebook → one reusable workflow.” Its irreducible 2-way comparison means the best single-notebook outcome is a runnable-but-condition-pinned workflow. It is, however, a genuinely good robustness test — it found and drove a real seeding-bug fix (A). For a pristine happy-path demo, prefer a naturally map-over analysis with no cross-sample comparison (per-sample/per-isolate — closer to UC1’s shape); save UC2 as the “extraction survives a realistic messy pipeline” story. (Fixing A/A2 was a code change to #22860’s seeding walk — committed 7e1d6d730f. The list:list rebuild and the prototype are not yet done; documented here as the next-run recipe.)

Original redo instructions (now executed above)

Job-graph scan: 0 collections; the 6 FASTQ were uploaded individually → 6 separate bowtie2 jobs (no map-over), the volcano-style figure used a Python-computed pasted matrix, and the candidate-gene lists were produced in bash (awk/sort -u), not as Galaxy datasets. So PR #22860 extraction would yield a sprawling fixed-sample DAG (6 named FASTQ inputs, parallel hardcoded alignments) plus dead-ends where external computation was injected — not a reusable workflow.

To redo so it extracts cleanly:

  1. Upload reads into a collection, not individually. Build a list (or list:paired) collection of the FASTQ samples (e.g. one collection of TAL1 ChIP samples + one of matched inputs, with element identifiers = sample names). Use the dataset-collection builder, not N separate upload_file_from_url calls. Via API: POST /api/histories/{id}/contents with type=dataset_collection.
  2. Map Bowtie2 over the collection ({"library|input_1": {"batch": true, "values": [{"src":"hdca","id":...}]}}) → one BAM collection, one workflow step instead of six branches.
  3. MACS2 with matched treatment/control. Either call MACS2 per condition (TAL1 BAM vs input BAM) as explicit steps, or pair via collections; keep the pooled-replicate handling as a single MACS2 step (it already pools — that part is fine). Avoid the one-off R1-only run.
  4. Do every downstream transform as a Galaxy tool (these already are, keep them): bedtools intersect for common/G1E-only/mega-only, tp_awk for promoter/TSS, bedtools closest for nearest-gene.
  5. Produce the candidate-gene list as a dataset, not in bash. After the promoter ∩ peaks intersect, extract + dedupe gene symbols with text_processing (Cut column 4 → SortUnique), so the gene list is a workflow output, not a terminal awk in the shell.
  6. Build the figure’s count matrix with Galaxy tools (datamash/awk on the closest output), not Python → pasted, so the figure step is reproducible.

The interpretive bedtools half is already tool-based and would survive extraction; the alignment front-end and the figure/gene-list tails are what break it.

CLEAN REBUILD — 2026-06-13 (list:list MACS2 validated; full extractable workflow with one pinned seam)

Rebuilt from scratch in a clean history with the list:list restructure from the redo recipe above. The prototype assumption is now confirmed live, and the whole analysis extracts into a complete runnable workflow — better than the earlier “stress-test, not showcase” verdict: the MACS2 seam fully dissolves, and even the 2-way comparison extracts (condition-pinned).

Artifacts:

The decisive prototype — MACS2 maps two list:list (the assumption the recipe flagged “verify before banking”). Confirmed the encoding:

Science reproduced exactly from the clean tool graph: G1E 261 / mega 150 peaks; common 39 / G1E-only 222 / mega-only 110; promoter-bound Group gene lists = Cbfa2t3/Pf4 (common), Gata1+11 (G1E-only), Fli1/Tal1/Dock8/Lgi1 (mega-only); figure bin counts common 4/11/22/2, G1E-only 18/66/97/41, mega-only 5/30/45/30 (note: figure closest uses ties=first, one row per peak — ties=all tie-inflates and was caught/fixed mid-build).

Extraction — fully runnable workflow, one pinned seam, plus a real seeding-walk gap. Page summary: 31 seeded, 4 ICJ (2 Bowtie2 + 1 MACS2 — the two MACS2 summary rows share one ICJ, so dedupe implicit_collection_jobs_ids before POSTing /api/workflows/extract). The extracted .ga (with the fix below): 34 steps, every tool step has inputs, 0 dangling, 3 workflow outputs, report 0 leftover ids. MACS2 wires to both Bowtie2 BAM collections — the list:list step survives extraction as a single mapped step.

Net: UC2 is no longer just a “robustness test” — with list:list it’s a genuinely extractable differential-ChIP workflow whose only non-reusable point is the inherent 2-condition comparison (condition-pinned, but runnable). The clean uploads (2 list:list from Zenodo), all-tool graph (no bash, no pasted matrix), and exact science reproduction hold.

Two-workflow split — VERIFIED reusable (2026-06-14)

To make the differential analysis fully sample-agnostic (vs the single condition-pinned workflow), it extracts cleanly as two workflows via extract_by_ids:

Both verified: all tool steps connected, all edges resolve. Chaining WF1→WF2 still requires picking the two condition elements out of WF1’s peaks list (the irreducible 2-way comparison), but each workflow is individually fully reusable. This is the clean answer to “differential A-vs-B doesn’t fit one reusable notebook→workflow”: split the map-over caller from the pairwise comparator.

Suggested next moves

JBrowse locus view, the RNA-seq expression stretch, or move on. UC2 now extracts either as one complete condition-pinned workflow (post _original_hda fix) or as two fully-reusable workflows (caller + comparator).