Release query result after materialization & transformation #1027


Open
wants to merge 9 commits into main

Conversation

@toppyy (Contributor) commented Jan 25, 2025

Releases the query result once it is no longer needed, reducing the memory footprint. See tidyverse/duckplyr#434.

TODO: needs tests that show the decreased memory footprint (how?) and validate the logic.

@krlmlr (Collaborator) commented Feb 12, 2025

Thanks. This is certainly better than the status quo.

This now has conflicts. We also want to coordinate with another planned change: removing the allow_materialization argument and replacing it with n_cells == 0. Only in this class, not in the API. Could be one PR, or one after another.
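A minimal sketch of how that replacement might look, assuming n_cells keeps its meaning of "maximum number of cells allowed to materialize" (this is my reading, not code from the planned change):

#include <cstddef>

// Sketch only: no separate allow_materialization flag; a zero cell budget
// simply means materialization is not allowed.
struct MaterializationLimitsSketch {
	std::size_t n_cells = 0;
	bool AllowMaterialization() const {
		return n_cells > 0;
	}
};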

I wonder if column counting and object lifetime could be made local to the AltrepRelationWrapper class, so that the vectors only notify that a column has now been computed and the bookkeeping takes place in that class only.
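A hypothetical shape for that bookkeeping (method name and result member are invented; it only illustrates the idea that the wrapper releases its result once every column has been handed over to R):

#include <cstddef>
#include <memory>

// Sketch only; the real AltrepRelationWrapper holds a duckdb query result.
class AltrepRelationWrapperSketch {
public:
	explicit AltrepRelationWrapperSketch(std::size_t ncols) : ncols(ncols) {}

	// Each ALTREP vector calls this once its column has been materialized
	// and converted to an R vector.
	void NotifyColumnTransformed() {
		if (++cols_transformed == ncols) {
			// Every column is now owned by R, so the query result is no
			// longer needed and can be released to cut the memory footprint.
			res.reset();
		}
	}

private:
	std::shared_ptr<void> res;   // stand-in for the materialized query result
	std::size_t ncols = 0;
	std::size_t cols_transformed = 0;
};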

@toppyy (Contributor, Author) commented Feb 15, 2025

Thanks! Resolved the conflicts and moved the column counting and related logic under AltrepRelationWrapper (good idea).

@krlmlr (Collaborator) left a review comment

Thanks!

Whenever a free function uses a pointer or reference (like rownames_wrapper->) in almost every statement, it's an indication that it would be better as a member function. Do you want to tackle this refactoring too? The essence of this PR is the res.reset(); that part could go in a separate small PR, to keep refactorings separate from features.
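To make the point concrete, here is a self-contained toy (names invented, not the duckdb-r code): the free-function version goes through the pointer on every line, while the member-function version keeps the same logic next to the data it touches.

#include <memory>

struct RowcountWrapperToy {
	long long rowcount = -1;   // -1 means "not computed yet"

	// Member function: direct access to the fields; the lifetime and caching
	// logic stays inside the class.
	long long GetRowcount() {
		if (rowcount < 0) {
			rowcount = 42;     // placeholder for actually running the query
		}
		return rowcount;
	}
};

// Free-function style the review argues against: every statement goes through
// the pointer, a hint that the logic belongs on the class itself.
long long GetRowcountFree(const std::shared_ptr<RowcountWrapperToy> &wrapper) {
	if (wrapper->rowcount < 0) {
		wrapper->rowcount = 42;
	}
	return wrapper->rowcount;
}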

@@ -404,6 +433,8 @@ size_t DoubleToSize(double d) {
auto relation_wrapper = make_shared_ptr<AltrepRelationWrapper>(rel, allow_materialization, DoubleToSize(n_rows),
DoubleToSize(n_cells));

relation_wrapper->ncols = drel->Columns().size();

Do we really need to update twice (here and below)?

@@ -306,16 +324,27 @@ const void *RelToAltrep::RownamesDataptrOrNull(SEXP x) {

void *RelToAltrep::DoRownamesDataptrGet(SEXP x) {
auto rownames_wrapper = AltrepRownamesWrapper::Get(x);

// the query has been materialized, return the rowcount
// (and void recomputing the query if it's been reset)

Suggested change:
- // (and void recomputing the query if it's been reset)
+ // (and avoid recomputing the query if it's been reset)

Comment on lines +30 to +31
R_xlen_t rowcount;
bool rowcount_retrieved;

Is it worth initializing with -1 to avoid rowcount_retrieved? I honestly don't know.

Comment on lines +33 to +34
size_t ncols;
size_t cols_transformed;

By the same token, use a single n_cols_to_retrieve that counts down to zero?
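A compact sketch of those two suggestions (simplified; std::ptrdiff_t stands in for R_xlen_t so the snippet stays self-contained):

#include <cstddef>

struct BookkeepingSketch {
	// -1 is the "not retrieved yet" sentinel, so a separate
	// rowcount_retrieved flag is no longer needed.
	std::ptrdiff_t rowcount = -1;

	// A single countdown replaces ncols + cols_transformed: initialize it to
	// the column count and release the query result when it reaches zero.
	std::size_t n_cols_to_retrieve = 0;
};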

@krlmlr (Collaborator) commented Feb 21, 2025

I took a closer look at the API. I think there's a way to split a MaterializedQueryResult (or any QueryResult for that matter) into a vector of columns:

  • Initialize an array of lists of DataChunk pointers, called out
  • Call Fetch() until it returns nullptr; this consumes the data
  • For each resulting DataChunk, call Split() from ncol - 1 to 0 to return a vector of DataChunk objects
  • Append those to out

The resulting elements of out can be freed independently.

All these operations look like they are zero-copy (except for metadata), but this needs double-checking.

Would you like to further explore this avenue? This might enable streaming results.
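The steps above might look roughly like this. It is only a sketch; in particular, the exact signature of DataChunk::Split is an assumption here (taken to move the columns from the given index onward into a second chunk), so double-check the API and the ownership semantics.

#include "duckdb.hpp"

using namespace duckdb;

// Sketch: split a query result into independently freeable single-column pieces.
static vector<vector<unique_ptr<DataChunk>>> SplitByColumn(QueryResult &result, idx_t ncol) {
	vector<vector<unique_ptr<DataChunk>>> out(ncol);   // one list of pieces per column
	while (auto chunk = result.Fetch()) {              // consumes the result chunk by chunk
		// Peel columns off the back so each piece holds exactly one column.
		for (idx_t col = ncol - 1; col > 0; col--) {
			auto piece = make_uniq<DataChunk>();
			chunk->Split(*piece, col);                 // assumed semantics, see lead-in
			out[col].push_back(std::move(piece));
		}
		out[0].push_back(std::move(chunk));            // what remains is column 0
	}
	// Each out[i] can now be transformed to an R vector and freed on its own.
	return out;
}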

@krlmlr (Collaborator) commented Feb 22, 2025

FWIW, did you manage to create a compile_commands.json for IDE support? I use ./configure followed by pkgload:::generate_db().

@toppyy (Contributor, Author) commented Feb 23, 2025

Thanks for the pointers! I'll get back to them. I can also work on the refactoring; perhaps it's clearer to do that in another PR, though.

I'll look further into the outline you proposed, but I had a quick look at Fetch(). To me it seems that it makes a copy of the chunk, leaving the original result in place, because of the DISALLOW_ZERO_COPY option? Or am I misunderstanding something?

https://github.com/duckdb/duckdb/blob/cd0d0da9b1f475632a21e11a7c00cc726bedaacc/src/main/materialized_query_result.cpp#L97C1-L101C3

I've also looked a bit into the idea of streaming the results. It seems that releasing the memory associated with a chunk is not trivial when iterating over a result set, but I need to study this further.

I don't have IDE support. I've just rebuilt the package from the command line with plain old R CMD INSTALL . (with ccache, of course).

@krlmlr (Collaborator) commented Feb 23, 2025

Thanks. I have found IDE support very helpful just for browsing the code, even before using it for compiling or debugging (though that's very useful too). VS Code with the clangd extension should get you started.

@krlmlr (Collaborator) commented Feb 23, 2025

Perhaps DISALLOW_ZERO_COPY is there for a MaterializedQueryResult, but not for streaming queries? I haven't studied this part too closely.

@toppyy (Contributor, Author) commented Feb 23, 2025

Oh, now I realize what you mean. I thought that by "streaming" you meant discarding query results on the fly after the R transformation :) I also hadn't looked beyond MaterializedQueryResult, as for some reason I took it as a given that we're bound to it. StreamQueryResult might be just what is needed, thanks.

@toppyy (Contributor, Author) commented Mar 5, 2025

I tried implementing the approach you outlined, but had no luck splitting a MaterializedQueryResult apart. I tried making a zero-copy version of MaterializedQueryResult::Fetch(), but, as this comment suggests, it's of no use since the chunk is not usable after the result is destroyed. And destroying the chunk does not seem to free the allocation if it's held by the result.

However, it's possible with StreamQueryResult. Here's a branch where I step through the result using Fetch() and split the resulting chunk column-wise. This allows destroying the chunks when transforming a column.
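A simplified sketch of that kind of stream-and-split loop (not the actual branch, which differs in detail; the same caveat about DataChunk::Split applies as in the sketch above, and in duckdb-r the result comes from executing the relation rather than from SQL text):

#include "duckdb.hpp"

using namespace duckdb;

// Sketch: walk a streaming result one chunk at a time, peel off single-column
// pieces, convert each one, and drop it before touching the next.
static void TransformStreaming(QueryResult &result, idx_t ncol) {
	while (auto chunk = result.Fetch()) {
		vector<unique_ptr<DataChunk>> pieces;
		for (idx_t col = ncol - 1; col > 0; col--) {
			auto piece = make_uniq<DataChunk>();
			chunk->Split(*piece, col);        // assumed semantics, see above
			pieces.push_back(std::move(piece));
		}
		pieces.push_back(std::move(chunk));   // column 0 stays in the original chunk
		for (auto &piece : pieces) {
			// ... convert the single-column piece to an R vector here ...
			piece.reset();                    // release the chunk as soon as it is transformed
		}
	}
}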

The memory consumption pattern is a bit surprising, though. This approach seems to be slower and does not help with peak memory consumption, for reasons I don't fully understand.

Here's a plot of the memory consumption of different branches. I gathered the data by calling ps every 0.25 seconds while running this script.

[Plot: memory consumption over time for the different branches]
