Release query result after materialization & transformation #1027
Thanks. This is certainly better than the status quo. This now has conflicts. We also want to coordinate with another planned change -- removal of the I wonder if column counting and object lifetime could be made local to the |
Thanks! Resolved the conflicts and moved the column counting and related logic under |
Thanks!
Whenever a free function uses a pointer or reference (like `rownames_wrapper->`) in almost every statement, it's an indication that it would be better as a member function. Do you want to tackle this refactoring too? The essence of this PR is the `res.reset();`, which could be added in a separate small PR to keep refactorings separate from features.
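To illustrate the member-function point, here is a minimal sketch. All names (`QueryResultStub`, `RownamesWrapperSketch`, `RowCount`, `Release`) are hypothetical stand-ins and not the actual classes in this PR; the sketch only assumes that the wrapper owns the result through a smart pointer and caches the row count.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a materialized query result.
struct QueryResultStub {
    std::size_t rows;
};

// Hypothetical wrapper, not the real AltrepRownamesWrapper.
struct RownamesWrapperSketch {
    std::unique_ptr<QueryResultStub> res;
    std::size_t rowcount = 0;
    bool rowcount_retrieved = false;

    // As a member function, the repeated `rownames_wrapper->` prefix goes
    // away and the caching logic lives next to the data it modifies.
    std::size_t RowCount() {
        if (!rowcount_retrieved) {
            assert(res && "result released before the row count was cached");
            rowcount = res->rows;
            rowcount_retrieved = true;
        }
        return rowcount;
    }

    // Drop the result once it is no longer needed; the cached count survives.
    void Release() {
        res.reset();
    }
};
```

With this shape, a caller would invoke `RowCount()` before `Release()`, and later calls to `RowCount()` return the cached value without touching the released result.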
@@ -404,6 +433,8 @@ size_t DoubleToSize(double d) {
auto relation_wrapper = make_shared_ptr<AltrepRelationWrapper>(rel, allow_materialization, DoubleToSize(n_rows),
                                                               DoubleToSize(n_cells));

relation_wrapper->ncols = drel->Columns().size();
Do we really need to update twice (here and below)?
@@ -306,16 +324,27 @@ const void *RelToAltrep::RownamesDataptrOrNull(SEXP x) {

void *RelToAltrep::DoRownamesDataptrGet(SEXP x) {
    auto rownames_wrapper = AltrepRownamesWrapper::Get(x);

    // the query has been materialized, return the rowcount
    // (and void recomputing the query if it's been reset)
- // (and void recomputing the query if it's been reset)
+ // (and avoid recomputing the query if it's been reset)
R_xlen_t rowcount;
bool rowcount_retrieved;
Is it worth initializing with `-1` to avoid `rowcount_retrieved`? I honestly don't know.
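For comparison, a sentinel-based variant might look like the sketch below. `RowCountCache` and its members are hypothetical names, and `std::ptrdiff_t` is used only as a stand-in for R's signed `R_xlen_t` so that the example compiles without the R headers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical row-count cache: -1 means "not retrieved yet".
// std::ptrdiff_t stands in for the signed R_xlen_t type.
struct RowCountCache {
    std::ptrdiff_t rowcount = -1;

    // The separate boolean flag is replaced by a check against the sentinel.
    bool Retrieved() const {
        return rowcount >= 0;
    }

    void Set(std::ptrdiff_t n) {
        assert(n >= 0);  // valid counts are non-negative, so -1 stays unambiguous
        rowcount = n;
    }
};
```

The trade-off is one field fewer against the risk that some reader of `rowcount` forgets to check for the sentinel; wrapping the check in a method such as `Retrieved()` keeps that risk in one place.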
size_t ncols;
size_t cols_transformed;
By the same token, use a single `n_cols_to_retrieve` that counts down to zero?
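A countdown of this kind could be sketched as follows. The types `ResultStub` and `RelationWrapperSketch` and the method `ColumnTransformed` are hypothetical, and the sketch assumes that each column is transformed exactly once; if a column can be requested repeatedly, a per-column flag would be needed instead.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for the materialized query result.
struct ResultStub {
    int payload;
};

// Hypothetical wrapper, not the real AltrepRelationWrapper.
struct RelationWrapperSketch {
    std::unique_ptr<ResultStub> res;
    std::size_t n_cols_to_retrieve;

    RelationWrapperSketch(std::unique_ptr<ResultStub> r, std::size_t ncols)
        : res(std::move(r)), n_cols_to_retrieve(ncols) {}

    // Call once per column after it has been transformed.
    // Assumption: every column is transformed exactly once.
    void ColumnTransformed() {
        assert(n_cols_to_retrieve > 0);
        if (--n_cols_to_retrieve == 0) {
            // Last column done: release the query result.
            res.reset();
        }
    }
};
```

A single counter replaces the `ncols` / `cols_transformed` pair, and the release condition becomes a comparison against zero.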
I took a closer look at the API. I think there's a way to split a
The resulting elements of All these operations look like they are zero-copy (except for metadata), need to double-check. Would you like to further explore this avenue? This might enable streaming results. |
FWIW, did you manage to create a |
Thanks for the pointers! I'll get back to them. I can work on the refactoring as well, though it might be clearer to do that in another PR.

I'll look further into the outline you proposed, but I had a quick look at `Fetch()`. To me it seems that it makes a copy of the chunk and leaves the original result in place because of the DISALLOW_ZERO_COPY option? Or am I misunderstanding something? I've also looked a bit into the idea of streaming the results. It seems that releasing the memory associated with a chunk is not trivial when iterating over a result set, but I need to study this further.

I don't have IDE support. I've just rebuilt the package from the command line with plain old
Thanks. I have found IDE support very helpful just for browsing the code, even without using it for compiling or debugging (though that's very useful too). VS Code with the clangd extension should get you started.
Perhaps |
Oh, now I realize what you mean. I thought that by "streaming" you meant discarding query results after R-transformation on the fly :) I also haven't looked beyond |
I tried implementing the approach you outlined. Had no luck splitting However, it's possible with The memory consumption pattern is a bit surprising though. It seems that this approach is slower and does not help with peak memory consumption for reasons I don't fully understand? Here's a plot of the memory consumption of different branches. I gathered the data by calling |
Releases the query result after it is no longer needed to reduce memory footprint. See tidyverse/duckplyr#434
TODO: Needs tests that show decreased memory footprint (how to?) & validate logic