CWL Legacy Runtime Deep Dive
How CWL tools execute inside Galaxy’s job infrastructure — from API request to command execution to output collection. Written to inform the migration toward a runtimeify-style approach using the tool request API and validated tool state.
Branch: cwl_on_tool_request_api_2
Table of Contents
- Executive Summary
- YAML Tool Runtime (The Target Pattern)
- CWL Tool Execution: End-to-End
- Phase 1: API Entry and State Handling
- Phase 2: Job Creation and Persistence
- Phase 3: Job Preparation and Evaluation
- Phase 4: exec_before_job — The CWL Core
- Phase 5: JobProxy — cwltool Bridge
- Phase 6: Command Assembly and Job Script
- Phase 7: Output Collection
- The Representation Layer (Legacy Hack)
- Comparison: CWL vs YAML Tool Runtime
- What a CWL Runtimeify Would Look Like
- Unresolved Questions
Executive Summary
CWL tool execution uses a fundamentally different architecture from YAML tools:
-
YAML tools use
UserToolEvaluatorwithparam_dict_style="json". The evaluator callsruntimeify()to convertJobInternalToolStateinto aJobRuntimeToolStatewith CWL-style File objects. Command building usesdo_eval()(JavaScript/CWL expressions) against these inputs. Everything happens inside the evaluator — no special pre-processing hook needed. -
CWL tools use the standard
ToolEvaluatorwithparam_dict_style="regular". The critical work happens inCwlCommandBindingTool.exec_before_job(), which extractsvalidated_tool_state.input_state, creates aJobProxywrapping cwltool, and pre-computes the entire command line, stdin/stdout/stderr, and environment variables. These are stashed inparam_dictas__cwl_commandand__cwl_command_state. The evaluator just uses those pre-computed values verbatim.
The key architectural difference: YAML tools build commands at evaluation time using expressions. CWL tools delegate command building to cwltool via a proxy object and store the result before evaluation even starts.
YAML Tool Runtime (The Target Pattern)
Understanding this is critical because the goal is to make CWL execution follow a similar pattern.
Evaluator Selection
MinimalJobWrapper._get_tool_evaluator() (jobs/__init__.py:1402-1415):
if self.tool.base_command or self.tool.shell_command:
klass = UserToolEvaluator # YAML tools
else:
klass = ToolEvaluator # Galaxy tools, CWL tools
CWL tools have neither base_command nor shell_command, so they get ToolEvaluator.
UserToolEvaluator.build_param_dict() (evaluation.py:1130-1170)
Two paths based on whether validated_tool_state exists:
New path (runtimeify):
hda_references, adapt_datasets, adapt_collections = setup_for_runtimeify(
self.app, compute_environment, input_datasets, input_dataset_collections
)
job_runtime_state = runtimeify(validated_tool_state, self.tool, adapt_datasets, adapt_collections)
cwl_style_inputs = job_runtime_state.input_state
Returns: {"inputs": cwl_style_inputs, "outdir": job_working_directory}
UserToolEvaluator._build_command_line() (evaluation.py:1172-1198)
Evaluates base_command + arguments or shell_command using do_eval() against the CWL-style inputs. No Cheetah templates. No pre-computation.
State Transformation Chain
RequestToolState (API) → decode() → RequestInternalToolState
→ dereference() → RequestInternalDereferencedToolState
→ expand() → JobInternalToolState (persisted to job.tool_state)
→ runtimeify() → JobRuntimeToolState (at evaluation time, File objects with paths)
Each transition is typed, validated, and unit-testable.
CWL Tool Execution: End-to-End
POST /api/jobs (tool_request_raw)
│
▼
Tool.handle_input_async()
│ has_galaxy_inputs=False → bypass expand_meta_parameters, _populate_async
│ creates JobInternalToolState from raw input
▼
execute_async() → _execute()
│ job.tool_state = execution_slice.validated_param_combination.input_state
▼
[Job persisted to DB, queued via Celery]
│ Tool serialized via ToolProxy.to_persistent_representation()
▼
JobWrapper.prepare()
│ _get_tool_evaluator() → ToolEvaluator (not UserToolEvaluator)
▼
ToolEvaluator.set_compute_environment()
│ Reconstructs JobInternalToolState from job.tool_state
│ Calls build_param_dict() → returns early for CWL (no output wrapping)
│ Calls execute_tool_hooks() → exec_before_job(validated_tool_state=...)
▼
CwlCommandBindingTool.exec_before_job()
│ input_json = validated_tool_state.input_state ← direct access
│ Creates JobProxy(input_json, output_dict, job_dir)
│ Extracts: command_line, stdin, stdout, stderr, env
│ Stages files, saves .cwl_job.json
│ Sets param_dict["__cwl_command"] and ["__cwl_command_state"]
▼
ToolEvaluator.build()
│ _build_command_line() → uses param_dict["__cwl_command"] verbatim
│ _build_config_files() → returns empty (CWL tools have no config files)
│ _build_environment_variables() → reads from __cwl_command_state["env"]
▼
build_command() (command_factory.py)
│ Wraps command with container, dependency resolution
│ CWL-specific: writes cwl_params.json, generates relocate_dynamic_outputs.py
▼
Job script executes on compute node
│ 1. Run __cwl_command
│ 2. python relocate_dynamic_outputs.py
▼
handle_outputs()
│ Loads JobProxy from .cwl_job.json
│ Calls job_proxy.collect_outputs()
│ Moves files to Galaxy dataset paths
│ Writes galaxy.json metadata
Phase 1: API Entry and State Handling
Tool.expand_incoming_async() (tools/__init__.py:2157-2223)
The has_galaxy_inputs flag controls whether Galaxy’s parameter machinery runs:
if self.has_galaxy_inputs:
# Galaxy tools: full parameter expansion, validation, population
expanded_incomings, job_tool_states, collection_info = expand_meta_parameters_async(...)
else:
# CWL tools: pass state through as-is
expanded_incomings = [deepcopy(tool_request_internal_state.input_state)]
job_tool_states = [deepcopy(tool_request_internal_state.input_state)]
collection_info = None
After expansion, validation still happens:
if self.has_galaxy_inputs:
params, errors = self._populate_async(request_context, expanded_incoming)
else:
params = expanded_incoming
errors = {}
But a JobInternalToolState is always created and validated against the tool’s parameter model:
internal_tool_state = JobInternalToolState(job_tool_state)
internal_tool_state.validate(self, f"{self.id} (job internal model)")
has_galaxy_inputs Flag (tools/__init__.py:1725,1734)
Set during parse_inputs():
self.has_galaxy_inputs = False # line 1725
if pages.inputs_defined:
self.has_galaxy_inputs = True # line 1734
For CWL tools, CwlPageSource.inputs_style is "cwl", which means inputs_defined behavior depends on the page source implementation. With the new commits, CWL tools have has_galaxy_inputs = False.
Phase 2: Job Creation and Persistence
execute.py (lib/galaxy/tools/execute.py)
execute_async() → _execute() → execute_single_job():
# Line 254-256:
if execution_slice.validated_param_combination:
tool_state = execution_slice.validated_param_combination.input_state
job.tool_state = tool_state
This persists the JobInternalToolState.input_state dict as JSON on the Job model. For CWL tools using the new path, this is the raw CWL-compatible input dict (dataset references as {src: "hda", id: <int>}).
Celery Serialization
Tool request API uses Celery tasks. CWL tools must round-trip through serialization:
ToolProxy.to_persistent_representation()serializes the full CWL tool descriptionQueueJobsschema carriestool_idandtool_uuid- On the worker,
create_tool_from_representation()reconstructs the tool - This was fixed in commit
d4d68d2a9b
Phase 3: Job Preparation and Evaluation
JobWrapper.prepare() (jobs/__init__.py:1247-1314)
Called by the job runner when the job is ready to execute:
tool_evaluator = self._get_tool_evaluator(job) # line 1270
tool_evaluator.set_compute_environment(compute_environment, ...) # line 1272
(self.command_line, self.version_command_line,
self.extra_filenames, self.environment_variables,
self.interactivetools) = tool_evaluator.build() # line 1274
Evaluator Selection (jobs/__init__.py:1402-1415)
if self.tool.base_command or self.tool.shell_command:
klass = UserToolEvaluator # YAML tools have these
else:
klass = ToolEvaluator # CWL tools don't
CWL tools always get ToolEvaluator (not UserToolEvaluator).
ToolEvaluator.set_compute_environment() (evaluation.py:166-243)
Reconstructs validated state from the persisted job:
# Lines 217-220:
internal_tool_state = None
if job.tool_state:
internal_tool_state = JobInternalToolState(job.tool_state)
internal_tool_state.validate(self.tool, f"{self.tool.id} (job internal model)")
Then calls hooks with the validated state:
self.execute_tool_hooks(inp_data=inp_data, out_data=out_data,
incoming=incoming, validated_tool_state=internal_tool_state)
Which calls:
self.tool.exec_before_job(self.app, inp_data, out_data, self.param_dict,
validated_tool_state=validated_tool_state)
ToolEvaluator.build_param_dict() — CWL Branch (evaluation.py:263-285)
CWL tools get a plain dict (not TreeDict) and return early:
if self.tool.tool_type == "cwl":
param_dict: Union[dict[str, Any], TreeDict] = self.param_dict
else:
param_dict = TreeDict(self.param_dict)
# ... populate wrappers, input dataset wrappers ...
if self.tool.tool_type == "cwl":
# don't need the outputs or the sanitization
param_dict["__local_working_directory__"] = self.local_working_directory
return param_dict
Skips: output dataset wrapping, output collection wrapping, non-job params, sanitization.
Phase 4: exec_before_job — The CWL Core
CwlCommandBindingTool.exec_before_job() (tools/__init__.py:3757-3829)
This is where CWL-specific execution setup happens. Full annotated flow:
def exec_before_job(self, app, inp_data, out_data, param_dict=None,
validated_tool_state=None):
super().exec_before_job(...)
local_working_directory = param_dict["__local_working_directory__"]
# 1. GET INPUT STATE — direct from validated_tool_state (new path)
input_json = validated_tool_state.input_state
# 2. BUILD OUTPUT DICT — maps output names to dataset paths
output_dict = {}
for name, dataset in out_data.items():
output_dict[name] = {
"id": str(getattr(dataset.dataset, dataset.dataset.store_by)),
"path": dataset.get_file_name(),
}
# 3. FILTER INPUT JSON — remove unset optional files and empty strings
input_json = {k: v for k, v in input_json.items()
if not (isinstance(v, dict) and v.get("class") == "File"
and v.get("location") == "None")}
input_json = {k: v for k, v in input_json.items() if v != ""}
# 4. CREATE JOB PROXY — wraps cwltool
cwl_job_proxy = self._cwl_tool_proxy.job_proxy(
input_json, output_dict, local_working_directory)
# 5. EXTRACT EXECUTION DETAILS FROM CWLTOOL
cwl_command_line = cwl_job_proxy.command_line # list of args
cwl_stdin = cwl_job_proxy.stdin
cwl_stdout = cwl_job_proxy.stdout
cwl_stderr = cwl_job_proxy.stderr
env = cwl_job_proxy.environment
# 6. ASSEMBLE COMMAND STRING
command_line = " ".join(
shlex.quote(arg) if needs_shell_quoting_hack(arg) else arg
for arg in cwl_command_line
)
if cwl_stdin: command_line += f' < "{cwl_stdin}"'
if cwl_stdout: command_line += f' > "{cwl_stdout}"'
if cwl_stderr: command_line += f' 2> "{cwl_stderr}"'
# 7. STAGE FILES — symlinks for input files + InitialWorkDirRequirement
tool_working_directory = os.path.join(local_working_directory, "working")
safe_makedirs(tool_working_directory)
cwl_job_proxy.stage_files()
cwl_job_proxy.rewrite_inputs_for_staging()
# 8. PERSIST JOB PROXY — for output collection later
cwl_job_proxy.save_job() # writes .cwl_job.json
# 9. STASH IN PARAM_DICT — for evaluator to pick up
param_dict["__cwl_command"] = command_line
param_dict["__cwl_command_state"] = {
"args": cwl_command_line,
"stdin": cwl_stdin,
"stdout": cwl_stdout,
"stderr": cwl_stderr,
"env": env,
}
Critical observation: The input_json at step 1 is now validated_tool_state.input_state (the new path). In the legacy path, this would have been self.param_dict_to_cwl_inputs(param_dict, local_working_directory) which reverse-engineers CWL inputs from Galaxy’s wrapped parameter dict via to_cwl_job() or galactic_flavored_to_cwl_job().
$GALAXY_SLOTS Handling
needs_shell_quoting_hack() exempts $GALAXY_SLOTS from quoting. But there’s a deeper hack: cwltool needs a concrete number for ResourceRequirement.coresMin at job-construction time. JobProxy._select_resources() substitutes a sentinel value (1.480231396), and command_line property replaces it back with $GALAXY_SLOTS:
# parser.py:442-449
@property
def command_line(self):
command_line = self.cwl_job().command_line
return [fragment.replace(str(SENTINEL_GALAXY_SLOTS_VALUE), "$GALAXY_SLOTS")
for fragment in command_line]
Phase 5: JobProxy — cwltool Bridge
Constructor (parser.py:329-344)
class JobProxy:
def __init__(self, tool_proxy, input_dict, output_dict, job_directory):
self._tool_proxy = tool_proxy
self._input_dict = input_dict # CWL job inputs
self._output_dict = output_dict # {name: {id, path}}
self._job_directory = job_directory
self._final_output = None
self._ok = True
self._cwl_job = None
self._normalize_job()
_normalize_job() (parser.py:376-391)
Prepares input dict for cwltool:
- Converts
"path"keys to"location"in File/Directory objects - Calls
process.fill_in_defaults()to inject CWL defaults - Uses cwltool’s
visit_class()for recursive path rewriting
_ensure_cwl_job_initialized() (parser.py:354-374)
Lazily creates the cwltool Job object:
job_args = dict(
basedir=self._job_directory,
select_resources=self._select_resources,
outdir=os.path.join(self._job_directory, "working"),
tmpdir=os.path.join(self._job_directory, "cwltmp"),
stagedir=os.path.join(self._job_directory, "cwlstagedir"),
use_container=False, # Galaxy handles containers
beta_relaxed_fmt_check=True,
)
runtimeContext = RuntimeContext(job_args)
# Defensive copy to prevent mutations
cwl_tool_instance = copy.copy(self._tool_proxy._tool)
cwl_tool_instance.inputs_record_schema = copy.deepcopy(
cwl_tool_instance.inputs_record_schema)
self._cwl_job = next(cwl_tool_instance.job(
self._input_dict, self._output_callback, runtimeContext))
self._is_command_line_job = hasattr(self._cwl_job, "command_line")
Key: use_container=False — Galaxy’s own containerization (Docker/Singularity) wraps the command later in build_command(). cwltool must not try to run containers.
Directory Layout
{job_directory}/
├── .cwl_job.json # Serialized JobProxy (tool + inputs + outputs)
├── cwl_params.json # {job_metadata, job_id_tag} for output collection
├── cwlstagedir/ # cwltool staging area (symlinks)
├── cwltmp/ # cwltool temp directory
├── working/ # Tool output directory (outdir)
├── outputs/
│ └── dataset_{id}_files/ # Galaxy extra files per output
│ └── __secondary_files__/ # Secondary files
├── relocate_dynamic_outputs.py # Generated output collection script
└── tool_script.sh # Galaxy job script
stage_files() (parser.py:541-564)
Uses cwltool’s PathMapper to create symlinks:
if hasattr(cwl_job, "pathmapper"):
process.stage_files(cwl_job.pathmapper, stageFunc, ignore_writable=True)
if hasattr(cwl_job, "generatefiles"):
# InitialWorkDirRequirement
generate_mapper = pathmapper.PathMapper(
cwl_job.generatefiles["listing"], outdir, outdir, separateDirs=False)
process.stage_files(generate_mapper, stageFunc, ignore_writable=inplace_update)
relink_initialworkdir(generate_mapper, outdir, outdir, inplace_update=inplace_update)
save_job() (parser.py:508-516)
Writes .cwl_job.json:
job_objects = {
"tool_representation": self._tool_proxy.to_persistent_representation(),
"job_inputs": self._input_dict,
"output_dict": self._output_dict,
}
json.dump(job_objects, open(job_file, "w"))
This is how the post-execution output collection script can reconstruct the full CWL context.
CommandLineTool vs ExpressionTool
| Property | CommandLineTool | ExpressionTool |
|---|---|---|
is_command_line_job | True | False |
command_line | cwl_job.command_line (list of args) | ["true"] (no-op) |
stdin/stdout/stderr | From cwl_job | None |
environment | From cwl_job (EnvVarRequirement) | {} |
stage_files() | Uses pathmapper + generatefiles | No pathmapper |
collect_outputs() | cwl_job.collect_outputs(workdir, rcode) | cwl_job.run() → JS execution → _output_callback |
Phase 6: Command Assembly and Job Script
ToolEvaluator.build() (evaluation.py)
After exec_before_job has set __cwl_command in param_dict:
_build_command_line() (line 806-809):
if self.tool.tool_type in CWL_TOOL_TYPES and "__cwl_command" in param_dict:
command_line = param_dict["__cwl_command"] # Pre-computed, no Cheetah
_build_config_files() (line 849-851):
if self.tool.tool_type in CWL_TOOL_TYPES:
return config_filenames # Empty — CWL tools have no config files
_build_environment_variables() (line 873-907):
# Extract CWL env vars from __cwl_command_state
for key, value in param_dict.get("__cwl_command_state", {}).get("env", {}).items():
environment_variable = dict(name=key, template=value)
environment_variables_raw.append(environment_variable)
# Later: CWL tools skip Cheetah templating for env vars
if self.tool.tool_type not in CWL_TOOL_TYPES:
template_type = "cheetah"
build_command() (command_factory.py:39-293)
Assembles the final job script. CWL-specific block at lines 141-158:
if job_wrapper.is_cwl_job:
# 1. Write cwl_params.json for output collection
cwl_metadata_params = {
"job_metadata": join("working", job_wrapper.tool.provided_metadata_file),
"job_id_tag": job_wrapper.get_id_tag(),
}
with open(cwl_metadata_params_path, "w") as f:
json.dump(cwl_metadata_params, f)
# 2. Generate relocate script
relocate_contents = (
"from galaxy_ext.cwl.handle_outputs import relocate_dynamic_outputs; "
"relocate_dynamic_outputs()"
)
write_script(relocate_script_file, relocate_contents, ...)
# 3. Append to job script
commands_builder.append_command(SETUP_GALAXY_FOR_METADATA)
commands_builder.append_command(f"python '{relocate_script_file}'")
Also at line 289-293, CWL jobs skip the duplicate SETUP_GALAXY_FOR_METADATA before the metadata command since it’s already added above.
Resulting Job Script Structure
# 1. Dependency setup (conda, etc.)
# 2. Container setup if needed
# 3. The CWL command itself (__cwl_command)
<cwl_tool_command> < stdin > stdout 2> stderr
# 4. Exit code capture
# 5. Galaxy environment setup
SETUP_GALAXY_FOR_METADATA
# 6. Output relocation
python 'relocate_dynamic_outputs.py'
# 7. Standard metadata commands
Phase 7: Output Collection
Entry Point: relocate_dynamic_outputs.py
Generated by command_factory.py, calls:
from galaxy_ext.cwl.handle_outputs import relocate_dynamic_outputs
relocate_dynamic_outputs()
handle_outputs.py → runtime_actions.py
galaxy_ext/cwl/handle_outputs.py is a thin wrapper that adjusts sys.path and calls galaxy.tool_util.cwl.runtime_actions.handle_outputs().
handle_outputs() (runtime_actions.py:69-229)
Step 1: Load context
job_proxy = load_job_proxy(job_directory, strict_cwl_validation=False)
cwl_metadata_params = json.load(open(cwl_metadata_params_path))
exit_code_file = default_exit_code_file(".", cwl_metadata_params["job_id_tag"])
tool_exit_code = read_exit_code_from(exit_code_file, job_id_tag)
load_job_proxy() (parser.py:798-808) reconstructs the full CWL context:
job_objects = json.load(open(os.path.join(job_directory, ".cwl_job.json")))
cwl_tool = tool_proxy_from_persistent_representation(job_objects["tool_representation"])
return cwl_tool.job_proxy(job_objects["job_inputs"], job_objects["output_dict"], job_directory)
Step 2: Collect CWL outputs
outputs = job_proxy.collect_outputs(tool_working_directory, tool_exit_code)
For CommandLineTools: delegates to cwltool’s collect_outputs() which evaluates output glob patterns.
For ExpressionTools: calls cwl_job.run() to execute the JavaScript expression.
Step 3: Process each output
| CWL Output Type | Processing |
|---|---|
File (dict with location) | move_output() — copies file to Galaxy dataset path, handles secondary files |
Directory (dict with location) | move_directory() — copies tree to extra_files_path |
| Record (dict without location) | Splits by |__part__| prefix, processes each field |
| List (array) | Creates indexed elements with filenames |
| Scalar/JSON | handle_known_output_json() — writes to expression.json |
| None/missing | Fills with null JSON for declared-but-absent outputs |
Step 4: Write galaxy.json
job_metadata = os.path.join(job_directory, cwl_metadata_params["job_metadata"])
with open(job_metadata, "w") as f:
json.dump(provided_metadata, f)
This galaxy.json contains per-output metadata: created_from_basename, ext, format, and for collections, elements.
Secondary Files
Stored in dataset_{id}_files/__secondary_files__/ with an index file:
{"order": ["file.idx", "file.bai"]}
The move_output() function handles secondary file naming. CWL uses a ^ prefix convention (each ^ removes one extension from the primary file name), but the code also supports STORE_SECONDARY_FILES_WITH_BASENAME mode.
The Representation Layer (Legacy Hack)
This section documents what the migration aims to eliminate. The legacy path converts Galaxy param_dict back to CWL inputs.
to_cwl_job() (representation.py:386-488)
Called by CwlTool.param_dict_to_cwl_inputs(). Walks tool.inputs (Galaxy’s parsed parameter tree):
- Repeat inputs → CWL arrays (strips
_repeatsuffix) - Conditional inputs → reads
_cwl__type_discriminator, extracts_cwl__value_ - Data inputs →
dataset_wrapper_to_file_json()creates CWL File objects - Collections →
collection_wrapper_to_array()orcollection_wrapper_to_record() - Primitives → type-coerced values
galactic_flavored_to_cwl_job() (representation.py:286-383)
Simpler variant for GalacticCwlTool. Uses map_to paths for nested structures. No repeat/conditional handling.
dataset_wrapper_to_file_json() (representation.py:155-195)
Converts a Galaxy DatasetWrapper to CWL File object:
raw_file_object = {
"class": "File",
"location": path,
"size": int(dataset_wrapper.get_size()),
"format": str(dataset_wrapper.cwl_formats[0]),
"basename": basename,
"nameroot": nameroot,
"nameext": nameext,
"secondaryFiles": [...]
}
Handles secondary files by symlinking into an _inputs directory.
Why This Is a Problem
The round-trip (CWL schema → Galaxy widgets → user input → Galaxy param_dict → CWL job JSON) requires:
- Every CWL type mapped to a Galaxy widget type (
TYPE_REPRESENTATIONS) - Union types become Galaxy conditionals with
_cwl__type_/_cwl__value_keys FieldTypeToolParameterinbasic.pyas a catch-all for CWL’s flexible typing- DatasetWrappers must be reverse-engineered back to CWL File objects
- All of this touches
basic.py, which is core Galaxy infrastructure
With validated tool state, the input JSON goes directly to exec_before_job without this reverse-engineering.
Comparison: CWL vs YAML Tool Runtime
| Aspect | CWL Tool (current) | YAML Tool (runtimeify) |
|---|---|---|
| Evaluator class | ToolEvaluator | UserToolEvaluator |
| param_dict_style | "regular" | "json" |
| Input state source | validated_tool_state.input_state (new) or param_dict_to_cwl_inputs() (legacy) | runtimeify(validated_tool_state) |
| Dataset → File conversion | Done in exec_before_job (input_json already has references) OR dataset_wrapper_to_file_json() (legacy) | Done by setup_for_runtimeify() adapters |
| Command building | Pre-computed by cwltool via JobProxy, stored in __cwl_command | Built at eval time via do_eval() with CWL expressions |
| Where command lives | param_dict["__cwl_command"] | Returned from _build_command_line() |
| Output collection | Post-execution relocate_dynamic_outputs.py script via cwltool | Standard Galaxy metadata |
| Job proxy needed | Yes — wraps cwltool.job.Job | No |
| Container handling | Galaxy wraps (cwltool use_container=False) | Galaxy wraps |
| File staging | cwltool PathMapper (symlinks) | Galaxy’s standard input staging |
| Config files | None | YamlTemplateConfigFile |
| Environment vars | From cwltool (EnvVarRequirement) | From tool definition |
Key Structural Differences
-
Command pre-computation: CWL delegates to cwltool at
exec_before_jobtime. YAML tools evaluate at_build_command_linetime. This is unavoidable — cwltool is the authoritative CWL command builder. -
Two-phase output: CWL uses a post-execution script to collect outputs because cwltool needs to run its own output glob evaluation. YAML tools use Galaxy’s standard metadata.
-
File staging: CWL uses cwltool’s PathMapper. YAML tools use Galaxy’s standard input path rewriting via
compute_environment.input_path_rewrite(). -
No
runtimeify()equivalent: CWL currently getsvalidated_tool_state.input_statedirectly. It does NOT go throughruntimeify()to convert dataset references to File objects with paths. The input_state either already has the right format (new path) or gets reverse-engineered from param_dict (legacy path).
What a CWL Runtimeify Would Look Like
The goal: make CWL execution use typed state transitions similar to YAML tools.
Current New Path (partially done)
validated_tool_state.input_state (has dataset refs as {src: "hda", id: N})
↓
exec_before_job() filters + passes directly to JobProxy
↓
JobProxy._normalize_job() fills CWL defaults
↓
cwltool processes and generates command
What’s Missing for Full Runtimeify
-
Dataset reference → File object conversion: Currently
exec_before_jobreceivesinput_statewith raw references. Someone needs to convert{src: "hda", id: N}to{"class": "File", "location": "/path/to/file", ...}. In the YAML path,runtimeify()+setup_for_runtimeify()does this. For CWL, this conversion could happen:- Option A: Inside
exec_before_job(current approach — it has access toinp_data) - Option B: Via a CWL-specific
runtimeify()beforeexec_before_job - Option C: Use
UserToolEvaluatorfor CWL tools too (would needbase_commandorshell_commandset)
- Option A: Inside
-
Secondary files: YAML’s
runtimeify()doesn’t handle secondary files. CWL needs them.dataset_wrapper_to_file_json()currently handles this in the legacy path. -
Directory inputs: CWL directories are tar archives in Galaxy. Need extraction logic that the YAML path doesn’t have.
-
Collection mapping: CWL arrays/records map to Galaxy collections. The YAML
runtimeify()hasadapt_collectionbut raisesNotImplementedErrorfor some cases.
The Input State Question
In the current new path, what does validated_tool_state.input_state look like for CWL tools? It appears to be the raw API input — dataset references but not yet File objects with paths. The conversion to CWL File objects (with location, size, checksum, secondary files) would need to happen somewhere before JobProxy gets the input dict.
The YAML tool path does this in setup_for_runtimeify() → adapt_dataset() which creates DataInternalJson objects (CWL File-like). A CWL equivalent would need to be richer — adding secondary files, CWL format URIs, checksums, etc.
Unresolved Questions
-
Where should dataset→File conversion happen for CWL? In
exec_before_job(hasinp_datadict), in a CWL-specificruntimeify, or somewhere else? The current code inexec_before_jobjust usesvalidated_tool_state.input_statedirectly — does this already contain resolved paths or just references? -
Can CWL tools use
UserToolEvaluator? They’d needbase_commandorshell_command. Could we set a syntheticshell_commandthat’s the cwltool-generated command? Probably not — the command isn’t known untilJobProxyruns. -
How close can file staging get to Galaxy’s standard path? CWL uses cwltool’s PathMapper for symlinks. YAML uses
compute_environment.input_path_rewrite(). Could we skip PathMapper and use Galaxy’s rewriting? Probably not for InitialWorkDirRequirement files. -
Can output collection move inside Galaxy? Currently it’s a post-execution script. Could
collect_outputs()run inside Galaxy’s job finishing instead of as a script appended to the job? This would avoid needing to serialize the full tool representation to.cwl_job.json. -
ExpressionTool execution: These run JS, not shell commands. The current path returns
["true"]as the command and runs the expression duringcollect_outputs. How does this interact with the tool request API? Is there a simpler path? -
What validated_tool_state.input_state actually contains for CWL right now? Need to trace a concrete test case to see the actual JSON structure at each phase. The filtering in
exec_before_job(removinglocation == "None"files and empty strings) suggests the state may not be fully clean yet. -
Secondary files in the new path: The legacy
dataset_wrapper_to_file_json()reconstructs secondary files from__secondary_files__directories. In the new path usingvalidated_tool_state.input_state, who provides secondary file information?