Skip to content

Conversation

@mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Sep 18, 2025

The pulsar test will require galaxyproject/pulsar#410

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

false_extra_files_path = os.path.join(
os.path.dirname(da_false_path or real_path), da.dataset.dataset.extra_files_path_name
)
if da.dataset.datatype.is_directory:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very subtle so I'm going to give you the approve but I'm going to express my unease and suggestion and then let you decide if it makes any sense. I don't love that we're dispatching on the output type here. The semantics of from_work_dir seem (and correct me if I am wrong maybe this is just a little optimization) like they depend on the output datatype now - I don't think we did this in the past. It makes something like getting this behavior with an output of type auto impossible? Wouldn't we potentially want something like this behavior for a wide variety of composite datatypes? It makes me think it would be better to get this behavior from a different XML attribute that isn't having its behavior altered on a per datatype basis? We could be dispatching on something about the tool output description to get this behavior instead of the datatype. It would improve the design by respecting the encapsulation of the tool framework and the datatype framework better IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a complicated issue. Right now we don't precreate the extra files dir, if a tool wants to write to the extra files dir it has to mkdir $output.extra_files in the command block, and it can do that independently of whether based on the datatype one would expect this, meaning you can store a whole bunch of extra stuff in let's say a txt output. Which seems vaguely problematic ?

Precreating the directory isn't at all necessary, I could remove the check here and the -d check in the command factory and this would just work. One issues is that the user can purge the output of a running job, and that would remove the target directory. Without the -d check we write into a purged location (a bit of a worst case scenario because our cleanup scripts don't verify that purged datasets are really purged from disk).
So this seems like a conservative way of only creating the extra files dir when we reasonably expect that one should be created. I could of course unconditionally create the extra files dir, but that also seems wasteful?

seem (and correct me if I am wrong maybe this is just a little optimization) like they depend on the output datatype now

that's true, but I'm hoping that we'd eventually use class: Directory instead of datatype: directory, which seems like a reasonable way to parameterize this behavior ?

I guess that aligns with

We could be dispatching on something about the tool output description to get this behavior instead of the datatype

? Should I see if supporting class: Directory (or a more xml-ish equivalent) improves this ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and yes, the check is an optimization so we don't need to do this for all datatypes and still benefit from not accidentally writing to a purged location)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meaning you can store a whole bunch of extra stuff in let's say a txt output. Which seems vaguely problematic ?

I mean I don't know - it is consistent with CWL behavior and it does uphold that sort of encapsulation between say the object store and the datatype framework or between the tool framework and datatype framework. A dataset is a primary file and an extra files directory - the semantics come in the display based on the datatype - not enforcement by these other generic components that are independent of datatype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could of course unconditionally create the extra files dir, but that also seems wasteful?

I mean the "correct" thing to do would be that but the whole process is already so inefficient and disk bound - we're suffering from repeatedly doing the "correct" thing. So I agree we shouldn't just do it always I think...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I'm hoping that we'd eventually use class: Directory instead of datatype: directory, which seems like a reasonable way to parameterize this behavior ?

Yeah - I think that makes good sense to me. Implement it as a new flag then and just implicitly turn on the flag with "class: Directory"? Like "{class: Directory, fromWorkDir: moo}" implicitly becomes "{class:File, datatype: Directory, precreateAndRecursivelyGatherFromWorkDir: true, fromWorkDir: moo}" where I guess sub-types of Directory datatype could still be specified and used and we'd probably want another name for the new flag.

Should I see if supporting class: Directory (or a more xml-ish equivalent) improves this ?

I mean it would be my preference for sure but you've got a lot on your plate and I know this is a very esoteric want on my part.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we'll actually need precreate_directory="false" (i.e. the defaults) to make existing tests pass 😅. I'll implement that ... can we skip RecursivelyGatherFromWorkDir and just assume that's what we want to do if the output is a directory ?

@mvdbeek mvdbeek force-pushed the from_work_dir_directory branch 4 times, most recently from 7bcc8e9 to 76380b8 Compare September 19, 2025 09:38
# in that case we don't want to copy the output to a purged path.
# Static, non work_dir_output files are handled in job_finish code.
return f'\nif [ -f "{source_file}" -a -f "{destination}" ] ; then cp "{source_file}" "{destination}" ; fi'
return f'\nif [ {test_flag} "{source_file}" -a {test_flag} "{destination}" ] ; then{delete_destination_dir} cp{recursive_flag} "{source_file}" "{destination}" ; fi'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this ensure the directories are created for resubmitted jobs? My bash-fu is lacking but I'm guessing yes - I just want to check you've thought through partial directories from previous jobs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have definitely not thought about that, but I can add a test and see where we are. I presume we should delete and create those directories on resubmission ?

@jmchilton
Copy link
Member

Previous approve still applies and I think it is looking better. Please feel free to rebase and merge - thanks for the improvement.

@mvdbeek
Copy link
Member Author

mvdbeek commented Sep 24, 2025

We'll need a new pulsar release too, I can get that done tomorrow.

@mvdbeek mvdbeek force-pushed the from_work_dir_directory branch from 76380b8 to ce6f8e1 Compare September 30, 2025 12:26
@mvdbeek mvdbeek force-pushed the from_work_dir_directory branch from 8bd0050 to 8b8abd1 Compare October 1, 2025 07:02
@mvdbeek mvdbeek merged commit 3e7f220 into galaxyproject:dev Oct 1, 2025
51 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants