Slow Fetch Performance with Incremental Collect #5615

nousot-cloud-guy · 2023-11-02T20:39:11Z

nousot-cloud-guy
Nov 2, 2023

I'm using Kyuubi and a Spark Engine to host data for BI tools (as well as ad-hoc queries). Recently I've noticed that performance of fetch operations is really slow - 5+ hours to fetch 40-50 million records. I've tried changing a number of settings on both Spark and Kyuubi (everything from serialization settings to threading) but so far nothing has had a meaningful impact on performance. I don't seem to be hitting limits on either CPU or memory.

Is there any way to increase the performance of fetch operations?

pan3793 · 2023-11-03T03:30:17Z

pan3793
Nov 3, 2023
Collaborator

Performance tuning is a big topic, it requires you to provide more details about your queries to give some specific advice.

Also, there are some general ideas for that. Spark is known as a parallel data-processing programming framework, the "incremental collect" actually demotes the parallel processing model to serial for the last stage.

Let's start with a simple scan query SELECT * FROM large_table.

Assuming the table data is stored in Parquet format(splittable), in the planning phase, Spark will try to make each split to fit spark.sql.files.maxPartitionBytes which is 128m in default, so the number of tasks will be total_size / 128m, and each task will scan 128m data.

Scan Table (10000 tasks, each task scans 128m)
    |
 Collect

If we boost spark.sql.files.maxPartitionBytes to 512m, then the task number goes down and each task scans more data

Scan Table (2500 tasks, each task scans 512m)
    |
 Collect

So, what's the matter here? The cost of task scheduling and open files is unneglectable.

Assuming each task takes 1s to schedule and 1s to open files, in full collect mode, bunches of tasks are scheduled and run in parallel, such overhead won't increase the whole execution time significantly. But in incremental collect mode, all procedures are executed in serial, then the total cost of task scheduling and open files would dramatically slow down your query.

What about the complex query? Usually, we choose to introduce an additional shuffle at the last stage to address the issue.

Previous Stages (produce 10000 partitions, each task producing 32m)
    |
 Collect

Previous Stages (produce 10000 partitions, each task writes shuffle 32m)
    |
Shuffle Exchange (625 partitions, each writes 512m)
    |
 Collect

Another interesting topic is the result serialization.

kyuubi.operation.result.format was introduced in Kyuubi 1.7.0 (we recommend the user update both the client and server to 1.7.3 to try this feature), it supports thrift(default) and arrow. This feature is orthogonal to the incremental collect, thus you can combine the usage of two features.

for kyuubi.operation.result.format=thrift

Executor (InternalRow) =>
  Driver (InternalRow) => Driver (Row) => Driver (Thrift) =>
    Server (Thrift) =>
      Client (Thrift) => Client (Java Object)

Obviously, the Spark Driver takes much CPU and memory to do data serialization and format transformation.

for kyuubi.operation.result.format=arrow

Executor (InternalRow) => Executor (ArrowBatch) =>
  Driver (ArrowBatch) =>
    Server (ArrowBatch) =>
      Client (ArrowBatch) => Client (Java Object)

which offloads the bunches of data serialization works from the Spark Driver to Executor, and ArrowBatch is a more efficient format than Thrift. However, it requires that the client upgrade to support recognizing and deserializing ArrowBatch data.

0 replies

nousot-cloud-guy · 2023-11-03T16:08:07Z

nousot-cloud-guy
Nov 3, 2023
Author

Thanks for the response. We are encountering this during the select * from large_table scenario, although the table in question only has 40-50 parquet files at around 95MB each. So I'm not sure larger files would result in a significant performance boost.

The issues seems to be that the hasNext at FetchIeterator.scala:97 job are only running every 5-7 minutes, even though the job duration is only a couple seconds. This leads me to believe that it's the actual fetch/serialization that is taking so long in between job runs, which is what I'm hoping to optimize. Or should it really take 5-7 minutes to serialize 1,000,000 records? That's a little hard for me to believe. I've checked the resource utilization on the driver pod (we're hosting in Kubernetes) during this time and it's utilizing over 2.5 cores during those periods out of an available 4, so I don't think it's a resource issue.

I'll see about upgrading to 1.7.3 and trying Arrow serialization. We're stuck with using ODBC due to pulling from Power BI, though. Do you know if any of the ODBC Spark drivers will support Arrow serialization from Kyuubi? Maybe the newer Databricks drivers?

2 replies

pan3793 Nov 4, 2023
Collaborator

the driver pod (we're hosting in Kubernetes) during this time and it's utilizing over 2.5 cores during those periods out of an available 4

It's expected, in your case, Driver only processes one task's result, so only a few cores will be utilized.

Maybe you can try async-profiler to figure out what Driver doing during the data fetching period.

Do you know if any of the ODBC Spark drivers will support Arrow serialization from Kyuubi?

Sorry, there is no such arrow standard protocol. Kyuubi created it and we only implemented it in Kyuubi JDBC driver now. You could test it using the $KYUUBI_HOME/bin/beeline. Do you know if there are any open-sourced Hive/Spark/Impala ODBC drivers?

pan3793 Nov 4, 2023
Collaborator

job are only running every 5-7 minutes, even though the job duration is only a couple seconds.

Yea, that's what I meant in the previous reply

the "incremental collect" actually demotes the parallel processing model to serial

In "incremental collect", the execution of the result stage is triggered partition by partition, only after the current partition's result is consumed, the task for the next partition will be triggered to compute and fetch the data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow Fetch Performance with Incremental Collect #5615

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Slow Fetch Performance with Incremental Collect #5615

nousot-cloud-guy Nov 2, 2023

Replies: 2 comments · 2 replies

pan3793 Nov 3, 2023 Collaborator

nousot-cloud-guy Nov 3, 2023 Author

pan3793 Nov 4, 2023 Collaborator

pan3793 Nov 4, 2023 Collaborator

nousot-cloud-guy
Nov 2, 2023

Replies: 2 comments 2 replies

pan3793
Nov 3, 2023
Collaborator

nousot-cloud-guy
Nov 3, 2023
Author

pan3793 Nov 4, 2023
Collaborator

pan3793 Nov 4, 2023
Collaborator