UC3_ATAC_issue

UC3 / Differential ATAC-seq accessibility — interview input

Pulled-down GitHub issue used as the effective result of the INTERVIEW → GALAXY interview (the live interview mechanics are harness-owned and precede pipeline phase 1).

Source: https://github.com/jmchilton/galaxy-brain/issues/14

Paired aspirational target: UC3_ATAC_extracted.ga (extracted from a Galaxy history; not human-validated).


Purpose

Develop a Galaxy Notebooks paper/demo vignette for bulk differential ATAC-seq accessibility. This should use existing Galaxy Training Network ATAC analysis as scaffolding, but the issue belongs to galaxy-brain because the deliverable is a paper/demo notebook, not a direct training-material change.

Objective

Show the conceptual jump from “call peaks in one ATAC-seq sample” to “count reads/fragments over a shared peak universe and test accessibility differences across biological replicates.”
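
As a rough illustration of that jump, the sketch below merges per-sample peak calls into a shared peak universe and then counts fragments over it for each sample. It is a toy in plain Python, not a description of how MACS2, the Galaxy interval tools, or featureCounts work internally. The function names, sample names, and all coordinates are invented for illustration.

```python
def merge_intervals(peaks):
    """Merge overlapping or adjacent (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(p) for p in merged]


def count_overlaps(union_peaks, fragments):
    """Count fragments overlapping each union peak (half-open intervals)."""
    return [
        sum(1 for fc, fs, fe in fragments if fc == chrom and fs < end and fe > start)
        for chrom, start, end in union_peaks
    ]


# Hypothetical per-sample peak calls (toy coordinates).
sample_peaks = {
    "ctrl_rep1": [("chr1", 100, 200), ("chr1", 500, 600)],
    "treat_rep1": [("chr1", 150, 260), ("chr2", 40, 90)],
}
# Hypothetical fragments per sample (toy coordinates).
sample_fragments = {
    "ctrl_rep1": [("chr1", 120, 180), ("chr1", 510, 560), ("chr1", 590, 650)],
    "treat_rep1": [("chr1", 190, 240), ("chr2", 50, 80), ("chr2", 85, 120)],
}

# One shared peak universe, then one count vector per sample over it.
union_peaks = merge_intervals(p for peaks in sample_peaks.values() for p in peaks)
count_matrix = {s: count_overlaps(union_peaks, f) for s, f in sample_fragments.items()}
```

Because every sample is counted over the same intervals, the per-sample vectors line up row by row, which is what makes a replicate-aware test possible.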

Why this is a useful demo deviation

The existing ATAC-seq tutorial focuses on a single GM12878 sample. It covers QC, trimming, Bowtie2 mapping, filtering, duplicate removal, insert-size QC, MACS2 peak calling, TSS/CTCF heatmaps, and pyGenomeTracks visualization. A differential-accessibility notebook can reuse the same tool ecosystem while adding a more paper-worthy analysis story: replicate-aware peak counts, DESeq2-style testing, volcano plots, and interpretation of condition-specific regulatory regions.

Existing analysis anchors

Public data candidates

Preferred MVP direction: use small prepared files from a public ENCODE-derived two-condition replicate dataset rather than full raw FASTQ/BAM processing.

Candidate datasets to evaluate:

Fallback: publish a tiny Zenodo bundle containing union_peaks.bed, a featureCounts-style count matrix, sample_metadata.tsv, optional bigWigs/narrowPeaks, and selected peak annotations. This still teaches real differential accessibility while keeping runtime workshop-feasible.
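
If a bundle like this is prepared, a small consistency check between the count matrix and sample_metadata.tsv could catch mismatches before anything is uploaded. The sketch below is one possible check, not part of the issue. It assumes a tab-separated count matrix whose first column holds peak IDs and whose remaining header cells are sample names, and a metadata file with `sample` and `condition` columns; these column names are hypothetical.

```python
import csv
import io
from collections import Counter


def check_bundle(counts_tsv, metadata_tsv, sample_col="sample", condition_col="condition"):
    """Return a list of human-readable problems (empty if none were found).

    Both arguments are the text content of tab-separated files.
    Column names are assumptions, not a fixed bundle specification.
    """
    counts_header = next(csv.reader(io.StringIO(counts_tsv), delimiter="\t"))
    count_samples = counts_header[1:]  # first column assumed to hold peak IDs
    meta_rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
    meta_samples = [row[sample_col] for row in meta_rows]

    problems = []
    for name in sorted(set(meta_samples) - set(count_samples)):
        problems.append(f"sample {name} is in the metadata but has no count column")
    for name in sorted(set(count_samples) - set(meta_samples)):
        problems.append(f"count column {name} has no metadata row")

    # The MVP targets a two-condition design with biological replicates.
    per_condition = Counter(row[condition_col] for row in meta_rows)
    if len(per_condition) != 2:
        problems.append(f"expected 2 conditions, found {len(per_condition)}")
    for condition, n in sorted(per_condition.items()):
        if n < 2:
            problems.append(f"condition {condition} has only {n} replicate(s)")
    return problems
```

An empty list means the two files agree on sample names and the design has two conditions with at least two replicates each.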

Notebook workflow plan

  1. Import prepared small datasets: count matrix or per-sample count files, sample_metadata.tsv, union_peaks.bed, and optional bigWigs/narrowPeaks.
  2. Explain the peak universe: merged/union reproducible peaks across conditions.
  3. Optional BAM-subset branch: create the union peak universe with concatenate/sort/merge interval tools, then convert intervals to SAF-like annotation.
  4. Optional BAM-subset branch: count fragments over union peaks with featureCounts.
  5. Run replicate QC using DESeq2 PCA/sample-distance outputs, or multiBamSummary plus plotCorrelation if BAMs are included.
  6. Run DESeq2 with condition as the design factor.
  7. Filter significant peaks, e.g. padj < 0.05 and abs(log2FoldChange) >= 1.
  8. Create a volcano plot using the Galaxy Volcano Plot tool.
  9. Sort top condition-gained and condition-lost peaks.
  10. Visualize selected differential regions with deepTools heatmaps and optional pyGenomeTracks.
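
Steps 3, 7, and 9 above can be sketched in plain Python to make the data shapes concrete. This is a minimal illustration under stated assumptions, not the Galaxy tools themselves. It assumes union peaks are BED-style (0-based, end-exclusive) tuples and that the DESeq2 result table has a header with `log2FoldChange` and `padj` columns; Galaxy's DESeq2 output may label its columns differently. The `peak_N` identifiers, the `.` strand placeholder, and the function names are hypothetical.

```python
def to_saf(union_peaks):
    """Convert BED-style (chrom, start, end) intervals to SAF rows.

    SAF uses 1-based inclusive starts, so the BED start is shifted by one.
    """
    rows = [("GeneID", "Chr", "Start", "End", "Strand")]
    for i, (chrom, start, end) in enumerate(union_peaks, start=1):
        # "." as strand is an assumed placeholder for unstranded peaks.
        rows.append((f"peak_{i}", chrom, start + 1, end, "."))
    return rows


def filter_significant(rows, padj_max=0.05, lfc_min=1.0):
    """Keep rows with padj < padj_max and abs(log2FoldChange) >= lfc_min."""
    kept = []
    for row in rows:
        try:
            padj = float(row["padj"])
            lfc = float(row["log2FoldChange"])
        except ValueError:
            continue  # skip rows whose values are not numeric, e.g. "NA"
        if padj < padj_max and abs(lfc) >= lfc_min:
            kept.append({**row, "padj": padj, "log2FoldChange": lfc})
    return kept


def top_peaks(significant, n=10):
    """Return the n strongest gained and n strongest lost peaks."""
    gained = sorted(
        (r for r in significant if r["log2FoldChange"] > 0),
        key=lambda r: r["log2FoldChange"],
        reverse=True,
    )[:n]
    lost = sorted(
        (r for r in significant if r["log2FoldChange"] < 0),
        key=lambda r: r["log2FoldChange"],
    )[:n]
    return gained, lost
```

The default thresholds mirror the example cutoffs in step 7. Sorting by effect size is one choice among several; ranking by adjusted p-value would also be defensible.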

Expected paper/demo artifacts

Scope and risks

Tasks