# cwl-utils Dependency White Paper

**Package**: `cwl-utils` (v0.41)
**License**: Apache 2.0
**Python**: 3.10 - 3.14
**Repository**: https://github.com/common-workflow-language/cwl-utils
**Docs**: https://cwl-utils.readthedocs.io/

## Overview

`cwl-utils` is the official Python utility library for loading, parsing, manipulating, and transforming [Common Workflow Language](https://www.commonwl.org/) (CWL) documents. It provides autogenerated Python dataclasses for CWL v1.0, v1.1, and v1.2, a version-dispatching parser, static type checking of workflow connections, document packing/splitting, expression handling, and several CLI tools for common CWL operations.

It is **not** a CWL executor. It operates purely at the document and schema level: it reads, validates, and transforms CWL definitions, and never runs them.

## CWL Version Support

| Version | Parser Module | Status |
|---------|---------------|--------|
| v1.0 | `cwl_utils.parser.cwl_v1_0` | Stable |
| v1.1 | `cwl_utils.parser.cwl_v1_1` | Stable |
| v1.2 | `cwl_utils.parser.cwl_v1_2` | Stable (latest, default) |

Each version has its own autogenerated parser (roughly 25,000 to 30,000 lines of Python dataclasses generated from the CWL schema-salad definitions), plus version-specific utilities and version-specific expression refactoring support.
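
The version-to-module mapping in the table can be pictured as a simple lookup keyed on `cwlVersion`. The sketch below is illustrative only: `PARSER_MODULES` and `pick_parser_module` are hypothetical names, and the library's real dispatch logic may differ.

```python
# Illustrative only: a lookup from cwlVersion to the parser modules
# listed in the table. pick_parser_module is a hypothetical helper,
# not part of cwl-utils.
PARSER_MODULES = {
    "v1.0": "cwl_utils.parser.cwl_v1_0",
    "v1.1": "cwl_utils.parser.cwl_v1_1",
    "v1.2": "cwl_utils.parser.cwl_v1_2",
}


def pick_parser_module(doc: dict) -> str:
    """Return the module name that would handle this document's cwlVersion."""
    version = doc.get("cwlVersion")
    if version not in PARSER_MODULES:
        raise ValueError(f"Unsupported or missing cwlVersion: {version!r}")
    return PARSER_MODULES[version]


module_name = pick_parser_module({"cwlVersion": "v1.1", "class": "CommandLineTool"})
```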

## Core Python API

### Document Loading

All loaders live in `cwl_utils.parser` and auto-detect `cwlVersion` to dispatch to the correct version-specific parser.

```python
from cwl_utils.parser import load_document_by_uri, load_document, save

# Load from file path or URL
process = load_document_by_uri("workflow.cwl")

# Load from YAML string or object
process = load_document(yaml_string_or_dict)

# Serialize back to JSON/YAML-compatible dict
saved = save(process)
```

**Key loading functions:**

| Function | Input | Description |
|----------|-------|-------------|
| `load_document_by_uri(path)` | file path, `file://` URI, or `http(s)://` URL | Primary entry point |
| `load_document(doc)` | YAML string or parsed dict | Load from in-memory data |
| `load_document_by_string(string, uri)` | Raw YAML string | Parse string then load |
| `load_document_by_yaml(yaml, uri)` | Pre-parsed YAML object | Load from ruamel.yaml output |

All loaders accept optional `loadingOptions` (for custom fetch behavior) and `load_all=True` to load all entries from `$graph` documents instead of just `#main`.

**Returned objects** are typed Python dataclass instances (e.g. `cwl_v1_2.Workflow`, `cwl_v1_2.CommandLineTool`) with full attribute access to all CWL fields.

### Version-Agnostic Type Aliases

`cwl_utils.parser` exports union type aliases spanning all three CWL versions, enabling version-agnostic code:

```python
from cwl_utils.parser import (
    Process,              # Workflow | CommandLineTool | ExpressionTool | Operation
    Workflow,             # v1.0 | v1.1 | v1.2 Workflow
    CommandLineTool,      # v1.0 | v1.1 | v1.2 CommandLineTool
    ExpressionTool,       # v1.0 | v1.1 | v1.2 ExpressionTool
    WorkflowStep,         # v1.0 | v1.1 | v1.2 WorkflowStep
    WorkflowInputParameter,
    WorkflowOutputParameter,
    CommandInputParameter,
    CommandOutputParameter,
    DockerRequirement,
    SoftwareRequirement,
    InputArraySchema,
    InputEnumSchema,
    InputRecordSchema,
    File,
    Directory,
    SecondaryFileSchema,  # v1.1+ only
)
```

Runtime type-checking tuples are also available (e.g. `WorkflowTypes`, `CommandLineToolTypes`, `DockerRequirementTypes`) for use with `isinstance()`.
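
The pattern behind these tuples is that a single `isinstance()` call can cover every CWL version. The sketch below uses stand-in classes so that it runs without the library installed; `WORKFLOW_TYPES` and `describe` are hypothetical names that only mimic how a tuple such as `WorkflowTypes` would be used.

```python
# Stand-in classes; in real code these would come from the
# version-specific parser modules.
class WorkflowV1_0: ...
class WorkflowV1_1: ...
class WorkflowV1_2: ...
class CommandLineToolV1_2: ...


# Hypothetical equivalent of a runtime tuple such as WorkflowTypes.
WORKFLOW_TYPES = (WorkflowV1_0, WorkflowV1_1, WorkflowV1_2)


def describe(process: object) -> str:
    """Classify an object without caring which CWL version produced it."""
    if isinstance(process, WORKFLOW_TYPES):
        return "workflow"
    return "other"
```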

### Serialization

```python
from cwl_utils.parser import save

# Convert Python objects back to dicts suitable for YAML/JSON output
result = save(process, top=True, base_url="", relative_uris=True)
```

When saving a list of processes, `save()` automatically wraps them in a `$graph` document whose `cwlVersion` is the highest version found among those processes.
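
One plausible reading of this wrapping step, sketched on plain dicts rather than parsed objects: compare the version strings numerically and put the highest one at the top of the `$graph` document. `version_key` and `wrap_in_graph` are hypothetical helpers, and the sketch assumes every entry carries its own `cwlVersion`.

```python
def version_key(version: str) -> list[int]:
    """Turn 'v1.2' into [1, 2] so versions compare numerically."""
    return [int(part) for part in version.lstrip("v").split(".")]


def wrap_in_graph(saved_processes: list[dict]) -> dict:
    """Wrap already-serialized processes in a single $graph document."""
    # Assumption: each saved process has a cwlVersion key.
    latest = max((p["cwlVersion"] for p in saved_processes), key=version_key)
    return {"cwlVersion": latest, "$graph": saved_processes}


document = wrap_in_graph(
    [
        {"cwlVersion": "v1.0", "class": "CommandLineTool", "id": "#tool"},
        {"cwlVersion": "v1.2", "class": "Workflow", "id": "#main"},
    ]
)
```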

### Utility Functions

```python
from cwl_utils.parser import cwl_version, is_process, version_split

cwl_version(yaml_dict)    # -> "v1.2" | None
is_process(obj)           # -> bool
version_split("v1.2")     # -> [1, 2]
```

## Parser Utilities (`cwl_utils.parser.utils`)

Higher-level operations on parsed CWL objects.

### Static Type Checking

```python
from cwl_utils.parser.utils import static_checker

static_checker(workflow)
# Raises ValidationException with detailed source-line errors
# if any workflow step source/sink types are incompatible.
```

Validates all step input sources against their declared types, checks `linkMerge` compatibility (`merge_nested`, `merge_flattened`), and verifies `pickValue` semantics (`first_non_null`, `only_non_null`, `all_non_null`).
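
To make the idea of source/sink compatibility concrete, here is a deliberately simplified rule for types that are either a name or a union (list) of names. It is not the library's algorithm: it ignores arrays, records, `linkMerge`, and `pickValue`, and `source_fits_sink` is a hypothetical helper.

```python
def as_alternatives(cwl_type) -> set[str]:
    """Normalise a type that is either a name or a list of names (a union)."""
    return set(cwl_type) if isinstance(cwl_type, list) else {cwl_type}


def source_fits_sink(source_type, sink_type) -> bool:
    """Strict check: every value the source can produce must be accepted."""
    source = as_alternatives(source_type)
    sink = as_alternatives(sink_type)
    if "Any" in sink:
        # In CWL, Any matches every non-null value.
        return "null" not in source or "null" in sink
    return source <= sink
```

Under this rule a `File` source fits an optional `File` sink, but an optional source does not fit a required sink, which is the kind of mismatch a static check is meant to surface before execution.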

### Type Inference

```python
from cwl_utils.parser.utils import (
    type_for_source,
    type_for_step_input,
    type_for_step_output,
    param_for_source_id,
)
```

These functions resolve the actual CWL type flowing through workflow connections, accounting for scatter, linkMerge, and pickValue modifiers.
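
As an example of why the inferred type differs from the declared one, consider scatter. Per the CWL specification, `dotproduct` and `flat_crossproduct` produce a single array of results, while `nested_crossproduct` nests one array level per scattered input. The sketch below encodes only that rule; `scattered_output_type` is a hypothetical helper, not a cwl-utils function.

```python
def scattered_output_type(item_type, scatter_inputs, scatter_method="dotproduct"):
    """Return the type seen downstream of a (possibly scattered) step output."""
    if not scatter_inputs:
        return item_type  # no scatter: the declared type flows through
    # nested_crossproduct nests once per scattered input; the others once.
    depth = len(scatter_inputs) if scatter_method == "nested_crossproduct" else 1
    result = item_type
    for _ in range(depth):
        result = {"type": "array", "items": result}
    return result
```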

### Step Loading & Conversion

```python
from cwl_utils.parser.utils import load_step, convert_stdstreams_to_files

# Resolve a step's `run` field (handles both inline and URI references)
step_process = load_step(workflow_step)

# Normalize stdin/stdout/stderr shortcuts into File objects
convert_stdstreams_to_files(command_line_tool)
```
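
For intuition, the `stdout` shortcut is defined by the CWL specification as a `File` output whose `outputBinding.glob` is the tool's stdout file name. A minimal sketch of that rewrite on plain dicts follows; `stdout_shortcut_to_file` is a hypothetical helper, it handles only `stdout`, and unlike the library function it returns a copy instead of working on parsed objects.

```python
import copy
import uuid


def stdout_shortcut_to_file(tool: dict) -> dict:
    """Rewrite outputs of type 'stdout' into explicit File outputs."""
    result = copy.deepcopy(tool)
    for output in result.get("outputs", []):
        if output.get("type") == "stdout":
            # Reuse the tool's stdout name, or invent one if none was given.
            filename = result.setdefault("stdout", uuid.uuid4().hex)
            output["type"] = "File"
            output["outputBinding"] = {"glob": filename}
    return result


tool = {
    "class": "CommandLineTool",
    "baseCommand": "date",
    "inputs": [],
    "outputs": [{"id": "now", "type": "stdout"}],
}
converted = stdout_shortcut_to_file(tool)
```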

### Input File Loading

```python
from cwl_utils.parser.utils import load_inputfile_by_uri, load_inputfile

# Load CWL input/job files (the YAML files that provide runtime values)
inputs = load_inputfile_by_uri("v1.2", "inputs.yml")
```

## Document Packing (`cwl_utils.pack`)

Consolidates multi-file CWL workflows (with `$include`, `$import`, `run:` references) into a single self-contained document.

```python
from cwl_utils.pack import pack_process

packed = pack_process(cwl_dict, base_url, cwl_version)
```

Handles:
- Inlining of `run:` references to external tool definitions
- `$include` / `$import` resolution
- `SchemaDefRequirement` user-defined type inlining
- Local and remote (HTTP) file fetching
- GitHub symbolic link detection (a remote file whose content is just the path of its link target)
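
The first item, inlining `run:` references, can be sketched on plain dicts as a recursive replacement. This is a simplified illustration, not the library's implementation: `inline_runs` and the in-memory `FILES` mapping are hypothetical, steps are assumed to be in list form, and `$import`, `$include`, and identifier rewriting are ignored.

```python
import copy
from typing import Callable


def inline_runs(doc: dict, fetch: Callable[[str], dict]) -> dict:
    """Replace string-valued `run` fields with the documents they point to."""
    packed = copy.deepcopy(doc)
    for step in packed.get("steps", []):
        if isinstance(step.get("run"), str):
            # Recurse so that nested workflows are inlined as well.
            step["run"] = inline_runs(fetch(step["run"]), fetch)
    return packed


# Hypothetical stand-in for local or remote file fetching.
FILES = {
    "echo.cwl": {
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": [],
        "outputs": [],
    }
}
workflow = {
    "class": "Workflow",
    "inputs": [],
    "outputs": [],
    "steps": [{"id": "say", "run": "echo.cwl", "in": [], "out": []}],
}
packed = inline_runs(workflow, FILES.__getitem__)
```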

## Expression Handling (`cwl_utils.expression`)

Parses and evaluates CWL expressions (`$(...)` parameter references and `${...}` JavaScript blocks).

```python
from cwl_utils.expression import scanner

# Find the boundaries of the first expression in a string
result = scanner("prefix_$(inputs.name)_suffix")
# Returns a (start, end) tuple for the expression, or None if there is none
```

Supports:
- Parameter reference syntax: `$(inputs.file.path)`
- JavaScript expression syntax: `${return inputs.x + 1}`
- Nested quoting (single/double quotes, backslash escapes)
- Configurable JavaScript engine sandboxing (`cwl_utils.sandboxjs`)
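
To illustrate what finding expression boundaries involves, here is a much-reduced scanner that locates the first `$(...)` parameter reference by counting parentheses. It is a sketch under simplifying assumptions, not the library's `scanner`: it ignores `${...}` blocks, quoting, and backslash escapes, and `find_parameter_reference` is a hypothetical name.

```python
from typing import Optional


def find_parameter_reference(text: str) -> Optional[tuple[int, int]]:
    """Return (start, end) of the first balanced $(...) span, or None."""
    start = text.find("$(")
    if start == -1:
        return None
    depth = 0
    for index in range(start + 1, len(text)):
        if text[index] == "(":
            depth += 1
        elif text[index] == ")":
            depth -= 1
            if depth == 0:
                return start, index + 1  # end is exclusive
    return None  # the parentheses never balanced


text = "prefix_$(inputs.name)_suffix"
span = find_parameter_reference(text)
expression = text[span[0]:span[1]] if span else None
```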

## File Format Validation (`cwl_utils.file_formats`)

Validates CWL file format annotations against ontologies (typically EDAM).

```python
from cwl_utils.file_formats import check_format, formatSubclassOf

# Validate that a file's format matches allowed input formats
check_format(file_dict, allowed_formats, ontology_graph)

# Check ontology subclass relationships
formatSubclassOf(fmt_uri, class_uri, ontology_graph, visited=set())
```

Uses `rdflib` to traverse `rdfs:subClassOf` and `owl:equivalentClass` relationships.

## Container Image Handling (`cwl_utils.image_puller`)

Abstract `ImagePuller` base class with concrete implementations:

| Class | Engine | Output |
|-------|--------|--------|
| `DockerImagePuller` | Docker / Podman | `.tar` tarball |
| `SingularityImagePuller` | Singularity 2.6+ / 3.x+ | `.img` or `.sif` |

Both support force-pull and configurable save directories.
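
The abstract-base-plus-engines design can be sketched as follows. All class and method names here are hypothetical and do not reflect the library's actual interface; the sketch only builds the command lines an engine might run and never executes them.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class PullerSketch(ABC):
    """Hypothetical base class: one image, one save directory."""

    def __init__(self, image: str, save_dir: Path) -> None:
        self.image = image
        self.save_dir = save_dir

    def safe_name(self) -> str:
        # Assumption: a file-system friendly name derived from the image tag.
        return self.image.replace("/", "_").replace(":", "_")

    @abstractmethod
    def commands(self) -> list[list[str]]:
        """Return the command lines needed to fetch and store the image."""


class DockerPullerSketch(PullerSketch):
    def commands(self) -> list[list[str]]:
        target = self.save_dir / f"{self.safe_name()}.tar"
        return [
            ["docker", "pull", self.image],
            ["docker", "save", "-o", str(target), self.image],
        ]


class SingularityPullerSketch(PullerSketch):
    def commands(self) -> list[list[str]]:
        target = self.save_dir / f"{self.safe_name()}.sif"
        return [["singularity", "pull", str(target), f"docker://{self.image}"]]
```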

## CWL Value Types (`cwl_utils.types`)

TypedDict definitions for CWL runtime values:

| Type | Description |
|------|-------------|
| `CWLFileType` | File object (location, basename, checksum, size, secondaryFiles, format, contents) |
| `CWLDirectoryType` | Directory object (location, basename, listing) |
| `CWLOutputType` | Union of all CWL output value types (primitives, File, Directory, arrays, records) |
| `CWLObjectType` | `MutableMapping[str, CWLOutputType]` |
| `CWLParameterContext` | `{inputs, self, runtime}` context for expression evaluation |
| `CWLRuntimeParameterContext` | Runtime context (outdir, tmpdir, cores, ram, exitCode, etc.) |

Type guard functions: `is_file()`, `is_directory()`, `is_file_or_directory()`.

Built-in CWL type names: `null`, `boolean`, `int`, `long`, `float`, `double`, `string`, `File`, `Directory`, `stdin`, `stdout`, `stderr`, `Any`.

## Schema Definition Handling (`cwl_utils.schemadef`)

Resolves `SchemaDefRequirement` user-defined types in CWL documents. Builds a type dictionary from inline and `$import`-ed schema definitions for use during packing and validation.
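
Building the type dictionary from inline definitions amounts to indexing the `types` entries of each `SchemaDefRequirement` by name. The sketch below shows that step on plain dicts; `build_type_dictionary` is a hypothetical helper, and `$import` entries are skipped rather than resolved.

```python
def build_type_dictionary(process: dict) -> dict[str, dict]:
    """Index inline SchemaDefRequirement type definitions by their name."""
    types: dict[str, dict] = {}
    for requirement in process.get("requirements", []):
        if requirement.get("class") != "SchemaDefRequirement":
            continue
        for definition in requirement.get("types", []):
            if "name" in definition:  # skip unresolved $import entries
                types[definition["name"]] = definition
    return types


process = {
    "class": "CommandLineTool",
    "requirements": [
        {
            "class": "SchemaDefRequirement",
            "types": [
                {"name": "Strand", "type": "enum", "symbols": ["plus", "minus"]},
                {"$import": "shared-types.yml"},
            ],
        }
    ],
}
type_dictionary = build_type_dictionary(process)
```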

## General Utilities (`cwl_utils.utils`)

| Function | Description |
|----------|-------------|
| `load_linked_file()` | Fetch and parse imported CWL files (local or remote) |
| `normalize_to_map()` / `normalize_to_list()` | Convert between dict and list representations of CWL fields |
| `resolved_path()` | Resolve relative paths against base URIs |
| `bytes2str_in_dicts()` | Recursively decode byte strings in nested structures |
| `yaml_dumps()` | Serialize to YAML string |
| `sanitise_schema_field()` | Normalize CWL type shorthand (e.g. `File?` -> optional File) |
| `to_pascal_case()` | String case conversion |
| `is_uri()` / `is_local_uri()` / `get_value_from_uri()` | URI detection and parsing |
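
Two of these conversions are easy to picture on plain data. The sketch below expands the `?` and `[]` type shorthands as the CWL specification defines them and converts between the list and map forms of a field. The helper names are hypothetical and the behaviour is an approximation of what the table describes, not the library's code.

```python
def expand_type_shorthand(cwl_type):
    """Expand 'File?' and 'string[]' style shorthands into explicit types."""
    if not isinstance(cwl_type, str):
        return cwl_type  # already in expanded form
    if cwl_type.endswith("?"):
        return ["null", expand_type_shorthand(cwl_type[:-1])]
    if cwl_type.endswith("[]"):
        return {"type": "array", "items": expand_type_shorthand(cwl_type[:-2])}
    return cwl_type


def list_to_map(items: list[dict], key: str = "id") -> dict[str, dict]:
    """[{id: a, ...}] -> {a: {...}}"""
    return {
        item[key]: {k: v for k, v in item.items() if k != key} for item in items
    }


def map_to_list(mapping: dict[str, dict], key: str = "id") -> list[dict]:
    """{a: {...}} -> [{id: a, ...}]"""
    return [{key: name, **body} for name, body in mapping.items()]
```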

## CLI Tools

Six command-line tools are installed as console scripts:

### `cwl-cite-extract`
Extract software citations/requirements from CWL documents. Traverses workflows recursively to find all `SoftwareRequirement` entries with package names and versions.

### `cwl-docker-extract`
Pull and cache all container images referenced in CWL documents via `DockerRequirement`. Supports Docker, Podman, Singularity, and udocker backends.

### `cwl-expression-refactor`
Refactor inline CWL expressions (`$(...)` / `${...}`) into standalone `ExpressionTool` or `CommandLineTool` steps, producing expression-free workflows.

### `cwl-graph-split`
Unpack `$graph`-style CWL documents (multiple processes in one file) into separate files, one per process.
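
Conceptually, the split looks like the sketch below, which turns each `$graph` entry into its own document keyed by a file name derived from its `id`. This is an assumption-laden illustration: `split_graph` is hypothetical, the naming scheme is invented, and rewriting of cross-references between processes is not handled.

```python
def split_graph(doc: dict) -> dict[str, dict]:
    """Map a hypothetical output file name to each process in $graph."""
    files: dict[str, dict] = {}
    for entry in doc["$graph"]:
        name = entry["id"].lstrip("#")
        # Each standalone document needs its own cwlVersion.
        files[f"{name}.cwl"] = {"cwlVersion": doc["cwlVersion"], **entry}
    return files


graph_document = {
    "cwlVersion": "v1.2",
    "$graph": [
        {"id": "#main", "class": "Workflow"},
        {"id": "#echo", "class": "CommandLineTool"},
    ],
}
split = split_graph(graph_document)
```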

### `cwl-normalizer`
Normalize CWL documents: upgrade to v1.2 (via `cwl-upgrader`), pack into a single document, and optionally refactor expressions. Outputs JSON or YAML.

### `cwl-inputs-schema-gen`
Generate a JSON Schema from CWL workflow/tool input definitions. Useful for building input validation forms or generating documentation.
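
To give a feel for the mapping involved, the sketch below converts a list of primitive-typed CWL inputs into a small JSON Schema object. It is not the tool's actual output format: `inputs_to_json_schema` and `JSON_TYPES` are hypothetical, and only primitive types with the `?` shorthand are covered.

```python
# Assumed mapping from CWL primitive types to JSON Schema types.
JSON_TYPES = {
    "string": "string",
    "boolean": "boolean",
    "int": "integer",
    "long": "integer",
    "float": "number",
    "double": "number",
}


def inputs_to_json_schema(inputs: list[dict]) -> dict:
    """Build a JSON Schema object describing primitive CWL inputs."""
    properties: dict[str, dict] = {}
    required: list[str] = []
    for param in inputs:
        cwl_type = param["type"]
        if not isinstance(cwl_type, str):
            raise ValueError(f"Only shorthand primitive types are handled: {cwl_type!r}")
        optional = cwl_type.endswith("?")
        base = cwl_type[:-1] if optional else cwl_type
        if base not in JSON_TYPES:
            raise ValueError(f"Unsupported type in this sketch: {base!r}")
        properties[param["id"]] = {"type": JSON_TYPES[base]}
        # Inputs with a default do not have to be supplied by the user.
        if not optional and "default" not in param:
            required.append(param["id"])
    return {"type": "object", "properties": properties, "required": required}


schema = inputs_to_json_schema(
    [
        {"id": "message", "type": "string"},
        {"id": "count", "type": "int?"},
        {"id": "threshold", "type": "double", "default": 0.5},
    ]
)
```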

## Error Hierarchy

```
BaseException
  ArrayMissingItems
  MissingKeyField
  MissingTypeName
  RecordMissingFields
Exception
  JavascriptException
  SubstitutionError
  WorkflowException
    GraphTargetMissingException
schema_salad.exceptions.ValidationException  (used extensively)
```
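
One practical consequence of this hierarchy is that a plain `except Exception` handler does not catch the classes that derive from `BaseException`. The sketch below demonstrates this with stand-in classes that mirror two of the names above; it does not import the library, and `classify` is a hypothetical helper.

```python
# Stand-ins mirroring the base classes shown in the hierarchy.
class MissingKeyField(BaseException): ...
class WorkflowException(Exception): ...


def classify(error: BaseException) -> str:
    """Report which kind of handler ends up catching the error."""
    try:
        raise error
    except Exception:
        return "caught by 'except Exception'"
    except BaseException:
        return "only caught by 'except BaseException'"
```

Code that needs to handle the `BaseException`-derived errors is usually better off naming them explicitly than catching `BaseException` wholesale, since the latter also intercepts `KeyboardInterrupt` and `SystemExit`.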

## Dependencies

### Required

| Package | Version Constraint | Purpose |
|---------|--------------------|---------|
| `schema-salad` | `>= 8.8, < 9` | Schema validation framework, YAML loading, source-line tracking |
| `ruamel.yaml` | `>= 0.17.6, < 0.20` | YAML parsing with round-trip fidelity |
| `rdflib` | any | RDF graph handling for file format ontology validation |
| `requests` | any | HTTP client for fetching remote CWL documents |
| `cwl-upgrader` | `>= 1.2.3` | CWL version upgrading (v1.0/v1.1 -> v1.2) |
| `packaging` | any | Version string comparison |
| `typing_extensions` | `>= 4.10.0` | Backported typing features (TypeIs, etc.) |

### Optional

| Extra | Packages | Purpose |
|-------|----------|---------|
| `pretty` | `cwlformat` | Pretty-printing CWL output |
| `testing` | `pytest`, `pytest-cov`, `pytest-xdist`, `jsonschema`, `udocker`, `cwltool` | Test suite |

## Integration Patterns

### As a parsing library
Load CWL documents into typed Python objects for analysis, transformation, or code generation.

### As a validation layer
Use `static_checker()` to type-check workflow connections before execution. Use `check_format()` to validate file format ontology compatibility.

### As a transformation pipeline
Chain the steps load -> modify Python objects -> save -> write, or use the CLI tools: `cwl-normalizer` (upgrade + pack), `cwl-expression-refactor` (move expressions into separate steps), and `cwl-graph-split` (decompose `$graph` documents).

### As a metadata extractor
Extract software requirements (`cwl-cite-extract`), container images (`cwl-docker-extract`), or input schemas (`cwl-inputs-schema-gen`) from CWL documents.