Release query result after materialization & transformation #1027


Open
wants to merge 9 commits into main

Conversation

@toppyy (Contributor) commented Jan 25, 2025

Releases the query result once it is no longer needed, reducing the memory footprint. See tidyverse/duckplyr#434.

TODO: needs tests that show the decreased memory footprint (how?) and validate the logic.

@krlmlr (Collaborator) commented Feb 12, 2025

Thanks. This is certainly better than the status quo.

This now has conflicts. We also want to coordinate with another planned change: removing the allow_materialization argument and replacing it with n_cells == 0. Only in this class, not in the API. Could be one PR, or one after another.
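A minimal sketch of how that replacement might look, assuming n_cells keeps its meaning of "maximum number of cells allowed to materialize" (this is my reading, not code from the planned change):

#include <cstddef>

// Sketch only: no separate allow_materialization flag; a zero cell budget
// simply means materialization is not allowed.
struct MaterializationLimitsSketch {
	std::size_t n_cells = 0;
	bool AllowMaterialization() const {
		return n_cells > 0;
	}
};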

I wonder if column counting and object lifetime could be made local to the AltrepRelationWrapper class, so that the vectors only notify that a column has now been computed and the bookkeeping takes place in that class only.
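A hypothetical shape for that bookkeeping (method name and result member are invented; it only illustrates the idea that the wrapper releases its result once every column has been handed over to R):

#include <cstddef>
#include <memory>

// Sketch only; the real AltrepRelationWrapper holds a duckdb query result.
class AltrepRelationWrapperSketch {
public:
	explicit AltrepRelationWrapperSketch(std::size_t ncols) : ncols(ncols) {}

	// Each ALTREP vector calls this once its column has been materialized
	// and converted to an R vector.
	void NotifyColumnTransformed() {
		if (++cols_transformed == ncols) {
			// Every column is now owned by R, so the query result is no
			// longer needed and can be released to cut the memory footprint.
			res.reset();
		}
	}

private:
	std::shared_ptr<void> res;   // stand-in for the materialized query result
	std::size_t ncols = 0;
	std::size_t cols_transformed = 0;
};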

@toppyy (Contributor, Author) commented Feb 15, 2025

Thanks! Resolved the conflicts and moved the column counting and related logic under AltrepRelationWrapper (good idea).

@krlmlr (Collaborator) left a review comment

Thanks!

Whenever a free function uses a pointer or reference (like rownames_wrapper->) in almost every statement, it's an indication that it would be better as a member function. Do you want to tackle this refactoring too? The essence of this PR is the res.reset(); that part could go in a separate small PR, to keep refactorings separate from features.
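To make the point concrete, here is a self-contained toy (names invented, not the duckdb-r code): the free-function version goes through the pointer on every line, while the member-function version keeps the same logic next to the data it touches.

#include <memory>

struct RowcountWrapperToy {
	long long rowcount = -1;   // -1 means "not computed yet"

	// Member function: direct access to the fields; the lifetime and caching
	// logic stays inside the class.
	long long GetRowcount() {
		if (rowcount < 0) {
			rowcount = 42;     // placeholder for actually running the query
		}
		return rowcount;
	}
};

// Free-function style the review argues against: every statement goes through
// the pointer, a hint that the logic belongs on the class itself.
long long GetRowcountFree(const std::shared_ptr<RowcountWrapperToy> &wrapper) {
	if (wrapper->rowcount < 0) {
		wrapper->rowcount = 42;
	}
	return wrapper->rowcount;
}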

@@ -404,6 +433,8 @@ size_t DoubleToSize(double d) {
auto relation_wrapper = make_shared_ptr<AltrepRelationWrapper>(rel, allow_materialization, DoubleToSize(n_rows),
DoubleToSize(n_cells));

relation_wrapper->ncols = drel->Columns().size();

Do we really need to update twice (here and below)?

@@ -306,16 +324,27 @@ const void *RelToAltrep::RownamesDataptrOrNull(SEXP x) {

void *RelToAltrep::DoRownamesDataptrGet(SEXP x) {
auto rownames_wrapper = AltrepRownamesWrapper::Get(x);

// the query has been materialized, return the rowcount
// (and void recomputing the query if it's been reset)

Suggested change:
- // (and void recomputing the query if it's been reset)
+ // (and avoid recomputing the query if it's been reset)

Comment on lines +30 to +31
R_xlen_t rowcount;
bool rowcount_retrieved;

Is it worth initializing with -1 to avoid rowcount_retrieved? I honestly don't know.

Comment on lines +33 to +34
size_t ncols;
size_t cols_transformed;

By the same token, use a single n_cols_to_retrieve that counts down to zero?
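A compact sketch of those two suggestions (simplified; std::ptrdiff_t stands in for R_xlen_t so the snippet stays self-contained):

#include <cstddef>

struct BookkeepingSketch {
	// -1 is the "not retrieved yet" sentinel, so a separate
	// rowcount_retrieved flag is no longer needed.
	std::ptrdiff_t rowcount = -1;

	// A single countdown replaces ncols + cols_transformed: initialize it to
	// the column count and release the query result when it reaches zero.
	std::size_t n_cols_to_retrieve = 0;
};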

@krlmlr (Collaborator) commented Feb 21, 2025

I took a closer look at the API. I think there's a way to split a MaterializedQueryResult (or any QueryResult for that matter) into a vector of columns:

  • Initialize an array of lists of DataChunk pointers, called out
  • Call Fetch() until it returns nullptr; this consumes the data
  • For each resulting DataChunk, call Split() from ncol - 1 to 0 to return a vector of DataChunk objects
  • Append those to out

The resulting elements of out can be freed independently.

All these operations look like they are zero-copy (except for metadata), but this needs double-checking.

Would you like to further explore this avenue? This might enable streaming results.
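The steps above might look roughly like this. It is only a sketch; in particular, the exact signature of DataChunk::Split is an assumption here (taken to move the columns from the given index onward into a second chunk), so double-check the API and the ownership semantics.

#include "duckdb.hpp"

using namespace duckdb;

// Sketch: split a query result into independently freeable single-column pieces.
static vector<vector<unique_ptr<DataChunk>>> SplitByColumn(QueryResult &result, idx_t ncol) {
	vector<vector<unique_ptr<DataChunk>>> out(ncol);   // one list of pieces per column
	while (auto chunk = result.Fetch()) {              // consumes the result chunk by chunk
		// Peel columns off the back so each piece holds exactly one column.
		for (idx_t col = ncol - 1; col > 0; col--) {
			auto piece = make_uniq<DataChunk>();
			chunk->Split(*piece, col);                 // assumed semantics, see lead-in
			out[col].push_back(std::move(piece));
		}
		out[0].push_back(std::move(chunk));            // what remains is column 0
	}
	// Each out[i] can now be transformed to an R vector and freed on its own.
	return out;
}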

@krlmlr (Collaborator) commented Feb 22, 2025

FWIW, did you manage to create a compile_commands.json for IDE support? I use ./configure followed by pkgload:::generate_db().

@toppyy (Contributor, Author) commented Feb 23, 2025

Thanks for the pointers! I'll get back to them. I can also work on the refactoring; perhaps it's clearer to do that in another PR, though.

I'll look further into the outline you proposed, but I had a quick look at Fetch(). To me it seems that it makes a copy of the chunk, leaving the original result in place, because of the DISALLOW_ZERO_COPY option? Or am I misunderstanding something?

https://github.com/duckdb/duckdb/blob/cd0d0da9b1f475632a21e11a7c00cc726bedaacc/src/main/materialized_query_result.cpp#L97C1-L101C3

I've also looked a bit into the idea of streaming the results. It seems that releasing the memory associated with a chunk is not trivial when iterating over a result set, but I need to study this further.

I don't have IDE support. I've just rebuilt the package from the command line with plain old R CMD INSTALL . (with ccache, of course).

@krlmlr (Collaborator) commented Feb 23, 2025

Thanks. I have found IDE support very helpful just for browsing the code, even before using it for compiling or debugging (though that's very useful too). VS Code with the clangd extension should get you started.

@krlmlr (Collaborator) commented Feb 23, 2025

Perhaps DISALLOW_ZERO_COPY is there for a MaterializedQueryResult, but not for streaming queries? I haven't studied this part too closely.

@toppyy (Contributor, Author) commented Feb 23, 2025

Oh, now I realize what you mean. I thought that by "streaming" you meant discarding query results on the fly after the R transformation :) I also hadn't looked beyond MaterializedQueryResult, as for some reason I took it as a given that we're bound to it. StreamQueryResult might be just what is needed, thanks.

@toppyy (Contributor, Author) commented Mar 5, 2025

I tried implementing the approach you outlined, but had no luck splitting a MaterializedQueryResult apart. I tried making a zero-copy version of MaterializedQueryResult::Fetch(), but, as this comment suggests, it's of no use since the chunk is not usable after the result is destroyed. And destroying the chunk does not seem to free the allocation if it's held by the result.

However, it's possible with StreamQueryResult. Here's a branch where I step through the result using Fetch() and split the resulting chunk column-wise. This allows destroying the chunks when transforming a column.
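A simplified sketch of that kind of stream-and-split loop (not the actual branch, which differs in detail; the same caveat about DataChunk::Split applies as in the sketch above, and in duckdb-r the result comes from executing the relation rather than from SQL text):

#include "duckdb.hpp"

using namespace duckdb;

// Sketch: walk a streaming result one chunk at a time, peel off single-column
// pieces, convert each one, and drop it before touching the next.
static void TransformStreaming(QueryResult &result, idx_t ncol) {
	while (auto chunk = result.Fetch()) {
		vector<unique_ptr<DataChunk>> pieces;
		for (idx_t col = ncol - 1; col > 0; col--) {
			auto piece = make_uniq<DataChunk>();
			chunk->Split(*piece, col);        // assumed semantics, see above
			pieces.push_back(std::move(piece));
		}
		pieces.push_back(std::move(chunk));   // column 0 stays in the original chunk
		for (auto &piece : pieces) {
			// ... convert the single-column piece to an R vector here ...
			piece.reset();                    // release the chunk as soon as it is transformed
		}
	}
}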

The memory consumption pattern is a bit surprising, though. This approach seems to be slower and does not help with peak memory consumption, for reasons I don't fully understand.

Here's a plot of the memory consumption of different branches. I gathered the data by calling ps every 0.25 seconds while running this script.

[Plot: memory consumption over time for the different branches]
