SI Recipe S3 — Differential ATAC-seq Accessibility
Science: Genome-wide differential ATAC-seq accessibility, Erythroblast vs B-cell (Corces 2016, hg19). 45,620 significant peaks; GATA1/HBB Erythroblast-gained; PAX5/EBF1/MS4A1/CD19 B-cell-gained.
Extracted workflow: 13 steps — counts collection + sample sheet → DESeq2 → NA-filter → { volcano ∥ significance-filter → rank → top-gained / top-lost }, with each PDF figure (PCA, volcano) converted to PNG by an in-graph tool (graphicsmagick → extract-page). 6 workflow outputs; re-runs to the identical numbers. Importable .ga: SI_Workflow_S5_differential_atac.ga (reference server workflow ba751ee0539fff04).
Server-agnostic. Tool versions confirmed on Galaxy 26.2.dev0. Build everything through
/api/tools(or the MCPrun_tool); the notebook and the extraction are plain API calls — no UI needed. No Galaxy code change required — DESeq2/volcano emit PDF, so each figure is rasterized to PNG by an in-graph tool (§1) and referenced with the stockhistory_dataset_as_imagedirective.
1. Figure conversion — PDF → image via in-graph tool (no code change)
The DESeq2 diagnostics and the volcano are PDF, which the notebook cannot embed as an image. Rather than convert at render time in the server, convert each figure with a tool, so the image becomes a real on-graph dataset referenced by the stock history_dataset_as_image directive. Provenance-clean, extraction-automatic, no server dependency.
Per PDF figure:
graphicsmagick_image_convert(bgruening) on the PDF,output_format=png. For PDF input it splits every page into alistcollectionsplitted_pdf, elementstemp_000,temp_001, … (one per page). Runs in the graphicsmagick biocontainer (ghostscript delegate reads the PDF).__EXTRACT_DATASET__by identifier to pull the wanted page (temp_000= page 1) out of the collection as a standalone PNG HDA.- Rename that HDA to a meaningful figure name (
Sample PCA plot,Volcano plot) — it becomes the workflow output label and theoutput="…"reference in the extracted report.
Volcano PDF = 1 page → temp_000. DESeq2 diagnostics = 5 pages → page 1 (temp_000) is the PCA. (Cost: graphicsmagick converts all pages, so the 5-page DESeq2 PDF yields 5 PNGs even though only the PCA is used.)
2. Tools
| Tool | Full tool ID | Version |
|---|---|---|
| DESeq2 | toolshed.g2.bx.psu.edu/repos/iuc/deseq2/deseq2 | 2.11.40.8+galaxy2 |
| Volcano Plot | toolshed.g2.bx.psu.edu/repos/iuc/volcanoplot/volcanoplot | 4.0.3+galaxy0 |
| Text reformatting (awk) | …/bgruening/text_processing/tp_awk_tool | 9.5+galaxy3 |
| Sort | …/bgruening/text_processing/tp_sort_header_tool | 9.5+galaxy3 |
| Select first (head) | …/bgruening/text_processing/tp_head_tool | 9.5+galaxy3 |
| Select last (tail) | …/bgruening/text_processing/tp_tail_tool | 9.5+galaxy3 |
| Convert image format | …/bgruening/graphicsmagick_image_convert/graphicsmagick_image_convert | 1.3.46+galaxy0 |
| Extract dataset | __EXTRACT_DATASET__ (built-in) | 1.0.2 |
3. Input data
Source: Corces 2016 hematopoiesis ATAC count matrix — GEO GSE74912, GSE74912_ATACseq_All_Counts.txt.gz (590,650 peaks × 132 samples, hg19).
Sample selection (8 samples, donors as replicates):
| Sample | Condition | Donor |
|---|---|---|
| Bcell_1 | Bcell | Donor1022 |
| Bcell_2 | Bcell | Donor4983 |
| Bcell_3 | Bcell | Donor5483 |
| Bcell_4 | Bcell | Donor5483 |
| Eryth_1 | Erythroblast | Donor2596 |
| Eryth_2 | Erythroblast | Donor5483 |
| Eryth_3 | Erythroblast | Donor6926 |
| Eryth_4 | Erythroblast | Donor6926 |
(Bcell_3+4 and Eryth_3+4 each share a donor — minor pseudoreplication; conclusions robust; a ~ donor + condition design is the cleaner ideal.)
Prep (data-import boundary, outside Galaxy): split the matrix into 8 per-sample two-column files (peak_id <TAB> count, no header, 590,650 rows). Peak IDs are chr_start_end (e.g. chr11_60222940_60223440).
Upload: the 8 count files as a list collection ATAC counts (element identifiers exactly Bcell_1..4, Eryth_1..4 — must match the sample sheet); the sample sheet as one tabular dataset:
sample condition donor
Bcell_1 Bcell Donor1022
…
Eryth_4 Erythroblast Donor6926
4. Build sequence (all /api/tools, flat pipe-keys)
4.0 The one gotcha — DESeq2’s conditional input
DESeq2’s data selector is a conditional (select_data|how). Pass it with flat pipe-delimited keys, not a nested {select_data:{...}} dict. A nested dict mis-routes to the datasets_per_level default and silently feeds the sample sheet as the only counts file (job errors / garbage). Flat form:
select_data|how = sample_sheet_contrasts
select_data|countsFile = {src: hdca, id: <ATAC counts collection>}
select_data|sample_sheet = {src: hda, id: <sample sheet>}
select_data|design_formula_mode|mode = automatic
select_data|design_formula_mode|factor = 2
select_data|design_formula_mode|reference_level = Erythroblast
select_data|design_formula_mode|target_level = Bcell
header = true
tximport|tximport_selector = count
advanced_options|fit_type = 1
advanced_options|auto_mean_filter_off = false
advanced_options|outlier_filter_off = false
advanced_options|outlier_replace_off = false
advanced_options|use_beta_priors = false
output_options|alpha_ma = 0.1
output_options|output_selector = [pdf, normCounts, normVST]
Direction: reference_level=Erythroblast ⇒ LFC > 0 = B-cell-gained, LFC < 0 = Erythroblast-gained. Confirm with a marker — MS4A1/CD20 chr11_60222940_… should be LFC ≈ +7.3.
Step 1 — DESeq2
Outputs used: result table (tabular; cols 1=peak, 2=baseMean, 3=log2FC, 4=lfcSE, 5=stat, 6=pvalue, 7=padj) and the 5-page diagnostics PDF (page 1 = PCA). Genome-wide run ≈ 6–7 min.
Step 2 — NA-filter (tp_awk) on the DESeq2 result
infile = {src: hda, id: <DESeq2 result>}
code = $6!="NA" && $7!="NA"{print}
→ testable peaks (≈ 247,038). Prerequisite for volcano (it errors on NA-padj rows).
Step 3 — Volcano Plot on the NA-filtered table
Column params live inside the with_header conditional (top-level ones are silently ignored):
input_file = {src: hda, id: <NA-filtered>}
with_header|header = no
with_header|label_col = 1
with_header|lfc_col = 3
with_header|pval_col = 6
with_header|fdr_col = 7
signif_thresh = 0.05 lfc_thresh = 1.0 labels|label_select = none
plot_options|legend_labs = "Down,Not Sig,Up"
→ volcano PDF.
Step 4 — Significance filter (tp_awk) on the NA-filtered table
Branch in parallel with the volcano — do not replace the volcano’s input (it needs the full non-significant cloud).
infile = {src: hda, id: <NA-filtered>}
code = $7+0 < 0.05 && ($3+0 >= 1 || $3+0 <= -1){print}
→ Significant differential peaks (45,620).
Step 5 — Rank (Sort) the significant peaks by log2FC, descending
infile = {src: hda, id: <significant peaks>}
header = 0
sortkeys_0|column = 3 sortkeys_0|order = r sortkeys_0|style = g
(g = general-numeric, handles scientific notation.) Top = largest +LFC (B-cell), bottom = most −LFC (Eryth).
Step 6 — Top B-cell-gained = Select first 50 (tp_head_tool)
infile = {src: hda, id: <ranked>} complement = "" count = 50
Step 7 — Top Erythroblast-gained = Select last 50 (tp_tail_tool)
infile = {src: hda, id: <ranked>} complement = "" num_lines = 50
Rename the new outputs to meaningful names (Significant differential peaks…, Top 50 B-cell-gained peaks, Top 50 Erythroblast-gained peaks) — these become the workflow output labels.
Steps 8–11 — Figures: PDF → PNG (see §1)
Per figure (DESeq2 plots PDF, volcano PDF):
# convert
graphicsmagick_image_convert:
input = {src: hda, id: <figure PDF>}
output_format = png
palette|palette_select = no
resize = 100
→ list collection splitted_pdf (temp_000 … per page)
# extract page 1
__EXTRACT_DATASET__:
input = {src: hdca, id: <splitted_pdf collection>}
which|which_dataset = by_identifier
which|identifier = temp_000
→ standalone PNG HDA
Rename the two extracted PNGs → Sample PCA plot (from DESeq2 plots page 1) and Volcano plot (from the volcano PDF).
5. Notebook page
Each directive in its own ```galaxy block, every one referencing a real on-graph output:
history_dataset_as_image(history_dataset_id=<Sample PCA plot PNG>)— PCA (extracted page 1 of DESeq2 plots)history_dataset_as_image(history_dataset_id=<Volcano plot PNG>)history_dataset_as_table(history_dataset_id=<DESeq2 result>, title="…", compact=true)history_dataset_as_table(history_dataset_id=<significant peaks>, title="…", compact=true)history_dataset_as_table(history_dataset_id=<top B-cell>, title="…", compact=true)history_dataset_as_table(history_dataset_id=<top Eryth>, title="…", compact=true)
6. Extraction (page → workflow)
GET /api/pages/{page}/workflow_extraction_summary— expect 0 warnings, every tool step seeded, the 6 referenced outputs exposed.POST /api/workflows/extract:
{
"workflow_name": "Differential ATAC-seq accessibility (Eryth vs B-cell)",
"job_ids": ["<DESeq2>","<NA-filter>","<volcano>","<sig-filter>","<sort>","<head>","<tail>","<convert PCA>","<extract PCA>","<convert volcano>","<extract volcano>"],
"hda_ids": ["<sample sheet>"],
"hdca_ids": ["<ATAC counts collection>"],
"from_page_id": "<page>",
"report_title": "Differential ATAC-seq Accessibility — Erythroblast vs B-cell"
}
Easiest source of job_ids: the jobs[] from the summary where step_type=="tool" and checked. Pass only the collection as the counts input — NOT its 8 element HDAs (those become orphan input_dataset steps). The extractor maps DESeq2’s expanded element-inputs back to the collection.
Result: 13-step workflow, 0 dangling, 6 outputs, report markdown carried with each directive rewritten from history_dataset_id=… to workflow-relative output="<label>" (e.g. history_dataset_as_image(output="Sample PCA plot")). Each figure branch — DESeq2 → graphicsmagick_image_convert → Extract dataset and volcanoplot → graphicsmagick_image_convert → Extract dataset — is reproduced when the workflow runs.
7. Verification
| Check | Expected |
|---|---|
| Peak universe | 590,650 |
| Testable (non-NA) | 247,038 |
| Significant (padj<0.05, |LFC|≥1) | 45,620 |
| B-cell-gained (LFC>0) / Eryth-gained (LFC<0) | 34,873 / 10,747 |
| MS4A1/CD20 (chr11 ~60.22 Mb) | LFC ≈ +7.3 |
| GATA1 (chrX ~48.64 Mb) / HBB (chr11 ~5.25 Mb) | LFC ≈ −5.9 / −6.6 |
| Extraction summary | 0 warnings, all tool steps seeded, 6 exposed |
| Extracted workflow | 13 steps, 0 dangling, 6 outputs (incl. Sample PCA plot, Volcano plot), report clean |
| Re-run | identical 590,650 / 45,620 / 34,873 / 10,747 |
8. Notes
- DESeq2 result column indices can shift with
output_selector/version — confirm 3=log2FC, 6=pvalue, 7=padj from the header before wiring the awk/volcano column args. - Figures are tool outputs (§1), so no
pymupdf/ core change is needed.graphicsmagick_image_convertsplits a PDF into a one-page-per-element collection;temp_000is page 1. Rename the extracted PNG before extraction so the workflow output label is meaningful (else it inheritstemp_000). - The 8 count files are prepared input data (data-import boundary), not an analysis step — legitimate workflow inputs.