summarize-cwl

Read a CWL Workflow entrypoint, resolve referenced Workflow, CommandLineTool, ExpressionTool, and Operation documents, and emit summary-cwl.json. This Mold is source-specific and target-agnostic: it records what the CWL says, validates and normalizes references, and leaves Galaxy interface/data-flow choices to downstream molds.

CWL is already a structured workflow language. Do not imitate summarize-nextflow's heavy inference machinery unless a real CWL fixture proves the need.

Inputs

The Mold expects:

A local CWL entrypoint path or an HTTP(S) URL.
Optional pin/version metadata supplied by the harness or user.
Optional output directory/path for a normalized CWL document.
Optional test/job file hints. If no test files are supplied or discoverable, emit tests: [].
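A minimal sketch of how a harness might handle the first and last inputs above, assuming the entrypoint arrives as a plain string. The helper names classify_entrypoint and collect_tests are hypothetical and are not part of the Mold or of any CWL library.

```python
from urllib.parse import urlparse


def classify_entrypoint(entrypoint):
    """Return "url" for an HTTP(S) entrypoint and "local" for anything else."""
    scheme = urlparse(entrypoint).scheme.lower()
    return "url" if scheme in ("http", "https") else "local"


def collect_tests(supplied=None, discovered=None):
    """Hypothetical helper: prefer supplied hints, then discovered files, else []."""
    return list(supplied or discovered or [])
```

With no supplied or discovered files, collect_tests() returns an empty list, which matches the tests: [] rule above.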

Outputs

A single JSON document conforming to summary-cwl. A sketch of the shape:

{
  "summary_version": "1",
  "source": {
    "ecosystem": "cwl",
    "workflow": "rnaseq-qc",
    "url": "https://example.org/workflows/rnaseq-qc.cwl",
    "version": "abc123",
    "license": null,
    "slug": "rnaseq-qc",
    "cwl_version": "v1.2",
    "entrypoint": "rnaseq-qc.cwl#main"
  },
  "documents": {
    "entrypoint": "rnaseq-qc.cwl",
    "normalized_path": "normalized/rnaseq-qc.cwl.json",
    "validation": {
      "command": "cwltool --validate rnaseq-qc.cwl",
      "status": "valid",
      "diagnostics": []
    }
  },
  "workflow_inputs": [
    {
      "id": "reads",
      "label": "reads",
      "type": "File[]",
      "optional": false,
      "default": null,
      "doc": "Input FASTQ files.",
      "format": "edam:format_1930",
      "secondary_files": []
    }
  ],
  "workflow_outputs": [
    {
      "id": "report",
      "label": "report",
      "type": "File",
      "output_source": "multiqc/report",
      "doc": null,
      "format": "edam:format_2330",
      "secondary_files": []
    }
  ],
  "steps": [
    {
      "id": "fastqc",
      "run": "#fastqc_tool",
      "run_class": "CommandLineTool",
      "label": "FastQC",
      "doc": null,
      "in": [{ "id": "reads", "source": ["reads"], "value_from": null }],
      "out": ["html", "zip"],
      "scatter": ["reads"],
      "scatter_method": "dotproduct",
      "when": null,
      "requirements": [],
      "hints": []
    }
  ],
  "tools": [
    {
      "id": "fastqc_tool",
      "label": "FastQC",
      "base_command": ["fastqc"],
      "arguments": [],
      "inputs": [],
      "outputs": [],
      "requirements": [
        {
          "class": "DockerRequirement",
          "docker_pull": "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0",
          "docker_image_id": null,
          "packages": [],
          "raw": {}
        }
      ],
      "hints": []
    }
  ],
  "graph": {
    "nodes": [{ "id": "fastqc", "kind": "step", "label": "FastQC" }],
    "edges": [{ "from": "reads", "to": "fastqc/reads", "via": ["scatter"] }]
  },
  "tests": [],
  "warnings": []
}
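One cheap consistency check on a document of this shape is that every CommandLineTool step points at a recorded tool. The sketch below assumes that run carries a fragment such as "#fastqc_tool" and that tools[].id holds the bare fragment, as in the example above. The function name unresolved_tool_runs is hypothetical, and the real summary-cwl schema may apply different rules.

```python
def unresolved_tool_runs(summary):
    """Return ids of CommandLineTool steps whose run target is missing from tools."""
    tool_ids = {tool["id"] for tool in summary.get("tools", [])}
    missing = []
    for step in summary.get("steps", []):
        # Nested Workflow steps and other run classes are not expected in tools.
        if step.get("run_class") != "CommandLineTool":
            continue
        run = step.get("run")
        # Assumption: the tool id is whatever follows the last "#" in run.
        fragment = run.rsplit("#", 1)[-1] if isinstance(run, str) else None
        if fragment not in tool_ids:
            missing.append(step["id"])
    return missing
```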

Procedure

Validate the entrypoint with cwltool --validate or equivalent library validation. If the entrypoint is invalid, emit source provenance, validation diagnostics, and warnings[]; do not invent graph structure.
Normalize the workflow with cwl-normalizer from cwl-utils when possible. Use the normalized JSON document as the preferred extraction surface because referenced documents have been gathered, older CWL versions have been upgraded to v1.2 when needed, and the output is regular JSON.
Extract Workflow inputs/outputs, step wiring, scatter, scatterMethod, when, requirements, and hints directly from the normalized CWL object model.
Extract every referenced CommandLineTool command surface: baseCommand, arguments, input/output bindings, output globs, DockerRequirement, and SoftwareRequirement.
Build a simple graph from workflow inputs to step inputs, step outputs to step inputs, and step outputs to workflow outputs. Add via markers for scatter, linkMerge, pickValue, valueFrom, and secondaryFiles.
Record test/job files only when supplied or discoverable by convention. Do not infer expected outputs from command names.
Validate the assembled object with foundry validate-summary-cwl summary-cwl.json before returning it.
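The graph-building step can be sketched as follows, working from the steps and workflow_outputs fields already extracted into the summary shape shown under Outputs. This is an illustration under stated assumptions, not the Mold's implementation: build_graph is a hypothetical name, the via spellings "scatter" and "valueFrom" are assumed, and linkMerge, pickValue, and secondaryFiles markers are left out for brevity.

```python
def build_graph(summary):
    """Derive nodes and edges from the summary's steps and workflow outputs."""
    steps = summary.get("steps", [])
    nodes = [{"id": s["id"], "kind": "step", "label": s.get("label")} for s in steps]
    edges = []
    for step in steps:
        scattered = set(step.get("scatter") or [])
        for port in step.get("in", []):
            via = []
            if port["id"] in scattered:
                via.append("scatter")
            if port.get("value_from") is not None:
                via.append("valueFrom")
            # A source is either a workflow input id or a "step/output" reference.
            for source in port.get("source") or []:
                edges.append(
                    {"from": source, "to": f"{step['id']}/{port['id']}", "via": list(via)}
                )
    for output in summary.get("workflow_outputs", []):
        source = output.get("output_source")
        sources = source if isinstance(source, list) else [source] if source else []
        for src in sources:
            edges.append({"from": src, "to": output["id"], "via": []})
    return {"nodes": nodes, "edges": edges}
```

Applied to the fastqc step from the sketch, this yields the edge from reads to fastqc/reads with a scatter marker, because the reads port appears in the step's scatter list.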

Caveats Baked Into The Procedure

Expressions are preserved, not executed. valueFrom, when, expression-based globs, and JavaScript-heavy tools should surface warnings when they affect data shape.
Directory is a review trigger. Preserve Directory types; downstream Galaxy molds decide whether to use directory-capable wrappers, explicit files, or collections.
Nested workflows stay visible. A nested Workflow in run: is a step target, not a reason to flatten blindly. Summarize its boundary and warn if downstream Galaxy translation needs expansion.
Dependency solving is downstream. Capture DockerRequirement and SoftwareRequirement, but do not resolve them into Tool Shed tools or new wrappers here.
Remote document resolution is bounded. Resolve referenced CWL documents and tool files; do not recursively download arbitrary input data.
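The first caveat, preserving expressions without executing them, can be approximated with plain string inspection. The sketch below assumes that a CWL expression is recognizable by a $( or ${ marker in a string value, and it ignores backslash-escaped markers. The function names and the warning wording are hypothetical, and the real warnings[] entries may be structured objects rather than strings.

```python
import re

# Matches the start of a CWL parameter reference "$(" or a JavaScript body "${".
EXPRESSION = re.compile(r"\$[({]")


def has_expression(value):
    """Return True when value is a string containing a CWL expression marker."""
    return isinstance(value, str) and EXPRESSION.search(value) is not None


def expression_warnings(steps):
    """Collect warnings for expressions found in step 'when' and 'value_from' fields."""
    warnings = []
    for step in steps:
        if has_expression(step.get("when")):
            warnings.append(f"step {step['id']}: 'when' expression preserved, not executed")
        for port in step.get("in", []):
            if has_expression(port.get("value_from")):
                warnings.append(
                    f"step {step['id']}: valueFrom on input {port['id']} preserved, not executed"
                )
    return warnings
```

Nothing here evaluates JavaScript; the expression text stays in the summary unchanged and only a warning is added.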

Reference Dispatch

summary-cwl — always validate output against this schema before emitting.
component-cwl-workflow-anatomy — use for normalization, graph extraction, scatter/conditionals, requirements, and known non-goals.

Non-Goals

Translation to Galaxy. Collection choice, datatype choice, data-flow reshaping, IWC comparison, and gxformat2 authoring belong downstream.
Tool discovery or wrapper authoring. Existing Galaxy wrapper search and new wrapper authoring are handled by the per-step Galaxy loop.
Runtime execution. This Mold summarizes and validates CWL structure; run-workflow-test owns execution.
