UC2 debrief 2 — clean extractable rebuild, live rendering, contributions (2026-06-14)
Second debrief for use case 2 (TAL1 peaks → candidate genes, issue #13). The first debrief (UC2_DEBRIEF.md) covers the original build, the extraction-test analysis, and appends the clean rebuild + the two-workflow split; this document is the standalone session-2 writeup focused on extractability, the code fix it forced, live rendering, and paper relevance.
One-line takeaway
UC2 is the robustness case: a differential (two-condition) analysis that (a) validated the list:list map-over-reduce restructure for MACS2, (b) drove a real Galaxy extraction-correctness fix (_original_hda dropping collection-operation steps), and (c) showed the irreducible limit of “one notebook → one reusable workflow” for 2-way comparisons — cleanly resolved by a two-workflow split.
Clean rebuild & extraction (verified)
- History
TAL1 peaks to candidate genes (clean, extractable)96d9e11f37f34b29; notebook page42a2c611109e5ed3. - Clean uploads: 6 FASTQ fetched from Zenodo directly into two
list:listcollections (condition→replicate) + the mm10 RefSeq BED. Bowtie2 map-over → BAMlist:list. - Validated the
list:listMACS2 restructure live (the recipe flagged “verify before banking”): passing twolist:listto MACS2’s twomultiple="true"inputs withmap_over_type:"list"does outer-map (condition) + inner-reduce (pool replicates), linked on the outer key → one MACS2 step, two condition narrowPeaks. Naivebatch:trueinstead flattens to all leaves (replicates not pooled) — the encoding matters. - Science reproduced exactly: G1E 261 / mega 150 peaks; common 39 / G1E-only 222 / mega-only 110; promoter-bound
Groupgene lists Gata1 (G1E-only), Fli1+Tal1 (mega-only), Cbfa2t3 (common); figure bin counts identical (figure closest usesties=first). - Extraction outcomes (both verified):
- Single workflow
03501d7626bd192f(post-fix): 34 steps, zero dangling, with the Extract-Dataset bridge wired through — the comparison seam is condition-pinned (identifier=G1E/mega) but runs. - Two-workflow split (fully reusable): WF1 peak caller
3f5830403180d620(5 steps, twolist:listreads → Bowtie2 ×2 → MACS2 → condition peaks list; sample-agnostic) + WF2 comparatore85a3be143d5905b(29 steps, 4 inputs → 3-way intersect → Group gene lists + datamash figure; no condition-name pinning). Both connected, edges resolve.
- Single workflow
The extraction-correctness fix this UC forced (key contribution)
- Root cause (DB-confirmed): a
__EXTRACT_DATASET__output HDA carries bothcopied_from(its source collection element) and its own creating job.galaxy.workflow.extract._original_hdawalkedcopied_fromunconditionally, sosummarizeand the page-extraction closure normalized the extracted dataset back to its source and dropped the Extract Dataset step, leaving the downstream comparison input-less (dangling). A UI “Extract” produced a broken workflow. - Fix (
lib/galaxy/workflow/extract.py):_original_hda/_original_hdcastop walkingcopied_fromwhen the content has its owncreating_job_associations— a passive copy has none and still normalizes; aDatabaseOperationTooloutput (Extract Dataset, Unzip, …) is kept as a real step. Generalizes to every collection-operation tool. Unit-tested + a Selenium E2E test (test_notebook_keeps_extract_dataset_step) asserting the step survives with its edges.
Live notebook rendering review (Playwright, dev client)
- The distance-to-TSS heatmap (
ggplot2_heatmap2PNG) renders cleanly; both candidate-gene-listhistory_dataset_displaycards render with the markers (Gata1, Fli1) and Download/Import affordances. No warnings, no broken displays.
Paper relevance
- The honest counterpoint to UC1 in the extraction section: a differential two-condition analysis is the one shape where “notebook → one reusable workflow” hits a real topology limit (an irreducible 2-way element comparison). The paper can state this limit precisely and show the clean resolution: split into a map-over caller + a pairwise comparator.
- Evidence that the extraction machinery is robust and improvable: a real use case surfaced a correctness bug (
_original_hda) that was root-caused and fixed with tests — supports the implementation/test-coverage narrative. - Demonstrates structural encoding of experimental design (
list:listcondition→replicate) driving map-over+reduce without metadata-aware tools — useful for the “reference artifacts / structure for reuse” discussion. (Sample sheets remain the hand-authored ideal but are not an extraction target; the nestedlist:listis.)
Open refinements / next
Surface __EXTRACT_DATASET__ in the extraction summary automatically (the fix makes it seed; the summary-walk could also recognize collection-op steps generically); JBrowse locus view; RNA-seq expression stretch.