Resolving output references without querying the full output document #408
Comments
Thanks for bringing this up and proposing a solution. Here are some first thoughts/questions that came to mind.
For your first point, that is also a potential option. However, if the first query fails, you will need to fall back on the method I proposed, leading to a total of 3 queries in the worst-case scenario. Since most queries will be directly resolvable, maybe this is a reasonable thing to do. Either way, I think we will still need to implement the method I proposed, so perhaps this could be added on later.
No sorry, that was poor wording. I didn't mean in the metadata field, this should definitely be stored separately.
My initial thought was to only store the top level, e.g.,
Maybe I am missing something, but my idea was to query both the "metadata" and
Anyway, this could indeed also be added at a later stage.
Ah I missed that. Yes, that should work!
Currently, when resolving output references, the full output of a previous job is queried from the database, even if only a small amount of the document is needed. For example, when executing the following atomate2 code using FireWorks/the local manager, the full VASP task document of job1 is first returned before the structure is accessed.
This is wasteful and puts unnecessary strain on the database, especially when the task document contains large items such as band structures.
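To make the cost concrete, here is a minimal sketch using plain dictionaries rather than a real store. The document contents, `get_path`, and `payload_size` are all made up for illustration; only the dotted field names echo the issue.

```python
import json

# Hypothetical stand-in for a stored job output. The values are invented;
# a real task document would be far larger.
task_doc = {
    "uuid": "job1",
    "output": {
        "structure": {
            "lattice": [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]],
            "sites": ["Si", "Si"],
        },
        "vasp_objects": {
            "bandstructure": {"bands": [[0.0] * 200 for _ in range(50)]},
        },
    },
}


def get_path(doc, dotted_path):
    """Walk a nested dict following a dotted path such as 'output.structure'."""
    for key in dotted_path.split("."):
        doc = doc[key]
    return doc


def payload_size(obj):
    """Approximate the transfer size as the length of the JSON encoding."""
    return len(json.dumps(obj))


full_size = payload_size(task_doc)
projected_size = payload_size(get_path(task_doc, "output.structure"))
print(f"full document: {full_size} chars, structure only: {projected_size} chars")
```

Even in this toy case the band data dominates the payload, although only the structure is needed downstream.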
Problem
Part of the difficulty is that there is no way to know in advance if the full output is actually needed to access the specific field requested. Take this example:
There are two complicating factors here:
[...] `nb_bands`, as this is not something that is stored in the band structure object directly but is instead an attribute of the `BandStructure` class.
Accordingly, simply restricting the query to return `output.vasp_objects.bandstructure.nb_bands` will fail on two accounts.
Another complicating factor is that the output database can sometimes contain references to other job outputs. For example, this can happen for dynamic workflows. Jobflow automatically resolves these references under the hood, but again it requires first getting the full output, finding any references in the output, resolving those references, and finally returning the specific item from the output that was requested.
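The attribute problem can be reproduced without any database. In the sketch below, `BandStructureLike`, its `as_dict` layout, and the `project` helper are hypothetical stand-ins (not pymatgen or jobflow code); the point is only that an attribute computed at construction time is absent from the serialized document, so a projection on it finds nothing.

```python
class BandStructureLike:
    """Hypothetical class whose `nb_bands` is derived on construction."""

    def __init__(self, bands):
        self.bands = bands
        self.nb_bands = len(bands)  # derived, never written to the database

    def as_dict(self):
        # Assumed serialized form: signature keys plus the raw data only.
        return {
            "__class__": "BandStructureLike",
            "__module__": "example",
            "bands": self.bands,
        }


bs = BandStructureLike([[0.1, 0.2], [1.1, 1.2], [2.0, 2.1]])
stored = {"output": {"vasp_objects": {"bandstructure": bs.as_dict()}}}


def project(doc, dotted_path):
    """Return the value at dotted_path, or None if any key is missing."""
    for key in dotted_path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


# Projecting on the attribute finds nothing, because it was never stored.
missing = project(stored, "output.vasp_objects.bandstructure.nb_bands")

# Projecting on the parent object and rebuilding it recovers the attribute.
parent = project(stored, "output.vasp_objects.bandstructure")
rebuilt = BandStructureLike(parent["bands"])
print(missing, rebuilt.nb_bands)
```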
Proposed solution
The best way I can think of solving this is:
[...] `output.vasp_objects.bandstructure`.
[...] `{"__class__", "__module__"}` signature. `output.vasp_objects.bandstructure.nb_bands` would overlap with `output.vasp_objects.bandstructure`.
[...] `output.vasp_objects.bandstructure`. Now resolve any blobs from the data store/output references.
As I see it, this approach has two disadvantages:
However, I can't see a cleaner way of solving this bug, and I imagine this would result in a speedup even with the extra database request.
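One possible reading of the steps above, as a rough sketch: locate the fields that carry the serialized-object signature, pick the longest one that is a prefix of the requested path, query only that field, and apply the remaining attribute accesses after deserialization. The helper names are hypothetical, and how the signature locations would be obtained cheaply from the database is left open; here they are computed from an in-memory dictionary purely for illustration.

```python
SIGNATURE = {"__class__", "__module__"}  # keys as quoted in the issue


def find_object_paths(doc, prefix=""):
    """Yield the dotted path of every nested dict carrying the signature keys."""
    if isinstance(doc, dict):
        if SIGNATURE.issubset(doc):
            yield prefix
        for key, value in doc.items():
            child = f"{prefix}.{key}" if prefix else key
            yield from find_object_paths(value, child)


def split_reference(requested, object_paths):
    """Split a requested path into (field to query, attributes to apply after)."""
    parts = requested.split(".")
    # Try the longest prefix first so the smallest enclosing object is fetched.
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in object_paths:
            return candidate, parts[end:]
    # No serialized object on the way: the path can be queried directly.
    return requested, []


example_doc = {
    "output": {
        "energy": -10.5,
        "vasp_objects": {
            "bandstructure": {
                "__class__": "BandStructure",
                "__module__": "example",
                "bands": [[0.0, 0.1]],
            }
        },
    }
}

object_paths = set(find_object_paths(example_doc))
to_query, remainder = split_reference(
    "output.vasp_objects.bandstructure.nb_bands", object_paths
)
print(to_query, remainder)
```

With this split, only `output.vasp_objects.bandstructure` would need to be fetched, and `nb_bands` would be read from the reconstructed object afterwards.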