Slow Fetch Performance with Incremental Collect #5615
Replies: 2 comments 2 replies
-
Performance tuning is a big topic; to give specific advice we would need more details about your queries, but there are some general ideas. Spark is a parallel data-processing framework, and "incremental collect" effectively demotes the parallel processing model to serial execution for the last stage. Let's start with a simple scan query. Assuming the table data is stored in Parquet format (splittable), in the planning phase Spark will try to make each split fit `spark.sql.files.maxPartitionBytes` (128 MiB by default), so a large table scan is broken into many tasks.
If we boost parallelism by adding executors, full collect mode finishes faster because those tasks run concurrently; under incremental collect, the Driver pulls the result one partition at a time, so the last stage's tasks run strictly one after another and extra executors don't help.
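For concreteness, here is a minimal session sketch; `kyuubi.operation.incremental.collect` is the Kyuubi switch for this behavior (false by default), and the table name is a placeholder:

```sql
-- Run in a beeline/JDBC session connected to Kyuubi.

-- Full collect (the default): the last stage's tasks run in parallel,
-- but the whole result set is materialized on the Spark Driver.
SET kyuubi.operation.incremental.collect=false;

-- Incremental collect: the Driver pulls one partition at a time,
-- which bounds Driver memory but serializes the last stage.
SET kyuubi.operation.incremental.collect=true;

SELECT * FROM my_big_table;  -- placeholder table
```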
So, what's the matter here? The cost of task scheduling and opening files is non-negligible. Suppose each task takes 1s to schedule and 1s to open its files: in full collect mode, batches of tasks are scheduled and run in parallel, so this overhead doesn't increase the overall execution time significantly. But in incremental collect mode, all of it is paid serially, so the accumulated cost of task scheduling and file opening dramatically slows down your query. What about complex queries? Usually, we introduce an additional shuffle at the last stage to address the issue, coalescing many small partitions into fewer, larger ones; the sketch below shows one way to do it.
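A sketch of that trick using Spark SQL's standard `REPARTITION` hint (table and predicate are placeholders). Under the assumed ~2s of per-task overhead above, a scan that planning splits into 1,000 tasks would pay ~2,000s of serial overhead under incremental collect; repartitioning the final result to 32 partitions cuts that to ~64s:

```sql
-- Coalesce the final result into fewer, larger partitions so that
-- incremental collect has fewer serial tasks to schedule.
SELECT /*+ REPARTITION(32) */ *
FROM my_big_table
WHERE event_date >= '2023-01-01';  -- placeholder predicate
```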
Another interesting topic is result serialization. In the default Thrift mode, the Spark Driver collects the result rows and converts them, row by row, into Thrift TRowSet structures before returning them to the client. Obviously, the Spark Driver burns a lot of CPU and memory on this data serialization and format transformation. Since Kyuubi 1.7, you can switch to Arrow-based result serialization, which offloads the bulk of the serialization work from the Spark Driver to the executors, and ArrowBatch is a more efficient format than Thrift. However, it requires the client to be upgraded so it can recognize and deserialize ArrowBatch data.
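A minimal sketch of enabling it, assuming Kyuubi 1.7+; `kyuubi.operation.result.format` is the documented switch (values `thrift`, the default, and `arrow`):

```sql
-- Requires Kyuubi 1.7+ and a client that can deserialize ArrowBatch results.
-- 'arrow' moves most serialization work from the Driver to the executors.
SET kyuubi.operation.result.format=arrow;
```

The same key should also be settable server-wide in `kyuubi-defaults.conf` if you want every session to use Arrow.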
-
Thanks for the response. We are encountering this during fetches from our BI tools, and the issue does seem to be the serial last stage under incremental collect that you describe. I'll see about upgrading to 1.7.3 and trying Arrow serialization. We're stuck with using ODBC due to pulling from Power BI, though. Do you know if any of the ODBC Spark drivers will support Arrow serialization from Kyuubi? Maybe the newer Databricks drivers?
-
I'm using Kyuubi and a Spark Engine to host data for BI tools (as well as ad-hoc queries). Recently I've noticed that fetch operations are really slow - 5+ hours to fetch 40-50 million records. I've tried changing a number of settings on both Spark and Kyuubi (everything from serialization settings to threading), but so far nothing has had a meaningful impact on performance. I don't seem to be hitting limits on either CPU or memory.
Is there any way to increase the performance of fetch operations?