Add support for picking `from_work_dir` directory #20916

mvdbeek · 2025-09-18T13:40:02Z

The pulsar test will require galaxyproject/pulsar#410

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

jmchilton · 2025-09-18T13:53:01Z

lib/galaxy/job_execution/setup.py

+            false_extra_files_path = os.path.join(
+                os.path.dirname(da_false_path or real_path), da.dataset.dataset.extra_files_path_name
+            )
+            if da.dataset.datatype.is_directory:


This is very subtle so I'm going to give you the approve but I'm going to express my unease and suggestion and then let you decide if it makes any sense. I don't love that we're dispatching on the output type here. The semantics of from_work_dir seem (and correct me if I am wrong maybe this is just a little optimization) like they depend on the output datatype now - I don't think we did this in the past. It makes something like getting this behavior with an output of type auto impossible? Wouldn't we potentially want something like this behavior for a wide variety of composite datatypes? It makes me think it would be better to get this behavior from a different XML attribute that isn't having its behavior altered on a per datatype basis? We could be dispatching on something about the tool output description to get this behavior instead of the datatype. It would improve the design by respecting the encapsulation of the tool framework and the datatype framework better IMO.

Hmm, that's a complicated issue. Right now we don't precreate the extra files dir, if a tool wants to write to the extra files dir it has to mkdir $output.extra_files in the command block, and it can do that independently of whether based on the datatype one would expect this, meaning you can store a whole bunch of extra stuff in let's say a txt output. Which seems vaguely problematic ?

Precreating the directory isn't at all necessary, I could remove the check here and the -d check in the command factory and this would just work. One issues is that the user can purge the output of a running job, and that would remove the target directory. Without the -d check we write into a purged location (a bit of a worst case scenario because our cleanup scripts don't verify that purged datasets are really purged from disk).
So this seems like a conservative way of only creating the extra files dir when we reasonably expect that one should be created. I could of course unconditionally create the extra files dir, but that also seems wasteful?

seem (and correct me if I am wrong maybe this is just a little optimization) like they depend on the output datatype now

that's true, but I'm hoping that we'd eventually use class: Directory instead of datatype: directory, which seems like a reasonable way to parameterize this behavior ?

I guess that aligns with

We could be dispatching on something about the tool output description to get this behavior instead of the datatype

? Should I see if supporting class: Directory (or a more xml-ish equivalent) improves this ?

(and yes, the check is an optimization so we don't need to do this for all datatypes and still benefit from not accidentally writing to a purged location)

meaning you can store a whole bunch of extra stuff in let's say a txt output. Which seems vaguely problematic ?

I mean I don't know - it is consistent with CWL behavior and it does uphold that sort of encapsulation between say the object store and the datatype framework or between the tool framework and datatype framework. A dataset is a primary file and an extra files directory - the semantics come in the display based on the datatype - not enforcement by these other generic components that are independent of datatype.

I could of course unconditionally create the extra files dir, but that also seems wasteful?

I mean the "correct" thing to do would be that but the whole process is already so inefficient and disk bound - we're suffering from repeatedly doing the "correct" thing. So I agree we shouldn't just do it always I think...

but I'm hoping that we'd eventually use class: Directory instead of datatype: directory, which seems like a reasonable way to parameterize this behavior ?

Yeah - I think that makes good sense to me. Implement it as a new flag then and just implicitly turn on the flag with "class: Directory"? Like "{class: Directory, fromWorkDir: moo}" implicitly becomes "{class:File, datatype: Directory, precreateAndRecursivelyGatherFromWorkDir: true, fromWorkDir: moo}" where I guess sub-types of Directory datatype could still be specified and used and we'd probably want another name for the new flag.

Should I see if supporting class: Directory (or a more xml-ish equivalent) improves this ?

I mean it would be my preference for sure but you've got a lot on your plate and I know this is a very esoteric want on my part.

Turns out we'll actually need precreate_directory="false" (i.e. the defaults) to make existing tests pass 😅. I'll implement that ... can we skip RecursivelyGatherFromWorkDir and just assume that's what we want to do if the output is a directory ?

jmchilton · 2025-09-24T16:28:16Z

lib/galaxy/jobs/command_factory.py

    # in that case we don't want to copy the output to a purged path.
    # Static, non work_dir_output files are handled in job_finish code.
-    return f'\nif [ -f "{source_file}" -a -f "{destination}" ] ; then cp "{source_file}" "{destination}" ; fi'
+    return f'\nif [ {test_flag} "{source_file}" -a {test_flag} "{destination}" ] ; then{delete_destination_dir} cp{recursive_flag} "{source_file}" "{destination}" ; fi'


Does this ensure the directories are created for resubmitted jobs? My bash-fu is lacking but I'm guessing yes - I just want to check you've thought through partial directories from previous jobs.

I have definitely not thought about that, but I can add a test and see where we are. I presume we should delete and create those directories on resubmission ?

jmchilton · 2025-09-24T16:29:17Z

Previous approve still applies and I think it is looking better. Please feel free to rebase and merge - thanks for the improvement.

mvdbeek · 2025-09-24T17:47:58Z

We'll need a new pulsar release too, I can get that done tomorrow.

Convenient for XML, required for the yaml version of tools.

mvdbeek requested a review from jmchilton September 18, 2025 13:40

github-actions bot added area/testing area/jobs area/datatypes area/testing/integration labels Sep 18, 2025

github-actions bot added this to the 25.1 milestone Sep 18, 2025

jmchilton reviewed Sep 18, 2025

View reviewed changes

jmchilton approved these changes Sep 18, 2025

View reviewed changes

mvdbeek force-pushed the from_work_dir_directory branch 4 times, most recently from 7bcc8e9 to 76380b8 Compare September 19, 2025 09:38

ahmedhamidawan added the kind/enhancement label Sep 23, 2025

ahmedhamidawan assigned jmchilton Sep 23, 2025

jmchilton reviewed Sep 24, 2025

View reviewed changes

mvdbeek force-pushed the from_work_dir_directory branch from 76380b8 to ce6f8e1 Compare September 30, 2025 12:26

mvdbeek added 5 commits October 1, 2025 09:01

Add support for picking from_work_dir directory

9cdff7b

Convenient for XML, required for the yaml version of tools.

Test from_work_dir="dir" feature

c24e1e2

Add precreate_directory flag

9a69b5a

Bump up pulsar dependency

b0dd306

Update schema

8b8abd1

mvdbeek force-pushed the from_work_dir_directory branch from 8bd0050 to 8b8abd1 Compare October 1, 2025 07:02

mvdbeek merged commit 3e7f220 into galaxyproject:dev Oct 1, 2025
51 of 59 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for picking `from_work_dir` directory #20916

Add support for picking `from_work_dir` directory #20916

Uh oh!

mvdbeek commented Sep 18, 2025

Uh oh!

jmchilton Sep 18, 2025

Uh oh!

mvdbeek Sep 18, 2025

Uh oh!

mvdbeek Sep 18, 2025

Uh oh!

jmchilton Sep 18, 2025

Uh oh!

jmchilton Sep 18, 2025

Uh oh!

jmchilton Sep 18, 2025

Uh oh!

mvdbeek Sep 18, 2025

Uh oh!

jmchilton Sep 24, 2025

Uh oh!

mvdbeek Sep 24, 2025

Uh oh!

jmchilton commented Sep 24, 2025

Uh oh!

mvdbeek commented Sep 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add support for picking from_work_dir directory #20916

Add support for picking from_work_dir directory #20916

Uh oh!

Conversation

mvdbeek commented Sep 18, 2025

How to test the changes?

License

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmchilton commented Sep 24, 2025

Uh oh!

mvdbeek commented Sep 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add support for picking `from_work_dir` directory #20916

Add support for picking `from_work_dir` directory #20916