-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add support for picking from_work_dir directory
#20916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lib/galaxy/job_execution/setup.py
Outdated
| false_extra_files_path = os.path.join( | ||
| os.path.dirname(da_false_path or real_path), da.dataset.dataset.extra_files_path_name | ||
| ) | ||
| if da.dataset.datatype.is_directory: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very subtle so I'm going to give you the approve but I'm going to express my unease and suggestion and then let you decide if it makes any sense. I don't love that we're dispatching on the output type here. The semantics of from_work_dir seem (and correct me if I am wrong maybe this is just a little optimization) like they depend on the output datatype now - I don't think we did this in the past. It makes something like getting this behavior with an output of type auto impossible? Wouldn't we potentially want something like this behavior for a wide variety of composite datatypes? It makes me think it would be better to get this behavior from a different XML attribute that isn't having its behavior altered on a per datatype basis? We could be dispatching on something about the tool output description to get this behavior instead of the datatype. It would improve the design by respecting the encapsulation of the tool framework and the datatype framework better IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's a complicated issue. Right now we don't precreate the extra files dir, if a tool wants to write to the extra files dir it has to mkdir $output.extra_files in the command block, and it can do that independently of whether based on the datatype one would expect this, meaning you can store a whole bunch of extra stuff in let's say a txt output. Which seems vaguely problematic ?
Precreating the directory isn't at all necessary, I could remove the check here and the -d check in the command factory and this would just work. One issues is that the user can purge the output of a running job, and that would remove the target directory. Without the -d check we write into a purged location (a bit of a worst case scenario because our cleanup scripts don't verify that purged datasets are really purged from disk).
So this seems like a conservative way of only creating the extra files dir when we reasonably expect that one should be created. I could of course unconditionally create the extra files dir, but that also seems wasteful?
seem (and correct me if I am wrong maybe this is just a little optimization) like they depend on the output datatype now
that's true, but I'm hoping that we'd eventually use class: Directory instead of datatype: directory, which seems like a reasonable way to parameterize this behavior ?
I guess that aligns with
We could be dispatching on something about the tool output description to get this behavior instead of the datatype
? Should I see if supporting class: Directory (or a more xml-ish equivalent) improves this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and yes, the check is an optimization so we don't need to do this for all datatypes and still benefit from not accidentally writing to a purged location)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meaning you can store a whole bunch of extra stuff in let's say a txt output. Which seems vaguely problematic ?
I mean I don't know - it is consistent with CWL behavior and it does uphold that sort of encapsulation between say the object store and the datatype framework or between the tool framework and datatype framework. A dataset is a primary file and an extra files directory - the semantics come in the display based on the datatype - not enforcement by these other generic components that are independent of datatype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could of course unconditionally create the extra files dir, but that also seems wasteful?
I mean the "correct" thing to do would be that but the whole process is already so inefficient and disk bound - we're suffering from repeatedly doing the "correct" thing. So I agree we shouldn't just do it always I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I'm hoping that we'd eventually use class: Directory instead of datatype: directory, which seems like a reasonable way to parameterize this behavior ?
Yeah - I think that makes good sense to me. Implement it as a new flag then and just implicitly turn on the flag with "class: Directory"? Like "{class: Directory, fromWorkDir: moo}" implicitly becomes "{class:File, datatype: Directory, precreateAndRecursivelyGatherFromWorkDir: true, fromWorkDir: moo}" where I guess sub-types of Directory datatype could still be specified and used and we'd probably want another name for the new flag.
Should I see if supporting class: Directory (or a more xml-ish equivalent) improves this ?
I mean it would be my preference for sure but you've got a lot on your plate and I know this is a very esoteric want on my part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out we'll actually need precreate_directory="false" (i.e. the defaults) to make existing tests pass 😅. I'll implement that ... can we skip RecursivelyGatherFromWorkDir and just assume that's what we want to do if the output is a directory ?
7bcc8e9 to
76380b8
Compare
| # in that case we don't want to copy the output to a purged path. | ||
| # Static, non work_dir_output files are handled in job_finish code. | ||
| return f'\nif [ -f "{source_file}" -a -f "{destination}" ] ; then cp "{source_file}" "{destination}" ; fi' | ||
| return f'\nif [ {test_flag} "{source_file}" -a {test_flag} "{destination}" ] ; then{delete_destination_dir} cp{recursive_flag} "{source_file}" "{destination}" ; fi' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this ensure the directories are created for resubmitted jobs? My bash-fu is lacking but I'm guessing yes - I just want to check you've thought through partial directories from previous jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have definitely not thought about that, but I can add a test and see where we are. I presume we should delete and create those directories on resubmission ?
|
Previous approve still applies and I think it is looking better. Please feel free to rebase and merge - thanks for the improvement. |
|
We'll need a new pulsar release too, I can get that done tomorrow. |
76380b8 to
ce6f8e1
Compare
Convenient for XML, required for the yaml version of tools.
8bd0050 to
8b8abd1
Compare
The pulsar test will require galaxyproject/pulsar#410
How to test the changes?
(Select all options that apply)
License