Replies: 1 comment
-
How far are you from saturating your available network bandwidth? You should try disabling "use threads" in the scanner and the Parquet reader, since you are managing the scheduling yourself now. These options mean more thread creation and scheduler overhead on top of the parallelism you already have. Another thing to experiment with is increasing the block size (in number of files) of the parallel-for. Spawning too many tasks can create too much competition for I/O and CPU, which might make things worse. A profiler can show where the bottlenecks are.
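To make the block-size suggestion concrete, here is a minimal sketch that uses only the C++ standard library rather than TBB. The names `process_file`, `make_blocks` and `process_in_blocks` are hypothetical, and `process_file` is a stand-in for the real per-file work. With TBB, the comparable knob would be the grain size argument of `tbb::blocked_range`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-file work (open a reader, read the row
// groups, post-process). Here it only returns the length of the path.
std::size_t process_file(const std::string& path) {
    return path.size();
}

// Split the index range [0, n_files) into half-open blocks [begin, end) of at
// most block_size files each.
std::vector<std::pair<std::size_t, std::size_t>> make_blocks(std::size_t n_files,
                                                             std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    for (std::size_t begin = 0; begin < n_files; begin += block_size) {
        blocks.emplace_back(begin, std::min(begin + block_size, n_files));
    }
    return blocks;
}

// Run one asynchronous task per block. A larger block_size means fewer tasks,
// and therefore less competition for I/O and CPU.
std::vector<std::size_t> process_in_blocks(const std::vector<std::string>& files,
                                           std::size_t block_size) {
    std::vector<std::size_t> results(files.size());
    std::vector<std::future<void>> tasks;
    for (const auto& block : make_blocks(files.size(), block_size)) {
        tasks.push_back(std::async(std::launch::async, [&files, &results, block] {
            // Each task writes only to its own slice of `results`.
            for (std::size_t i = block.first; i < block.second; ++i) {
                results[i] = process_file(files[i]);
            }
        }));
    }
    for (auto& task : tasks) {
        task.get();  // wait for completion and rethrow any exception
    }
    return results;
}
```

Timing a few values of `block_size` against the S3 workload, together with a profiler run, should show whether task count is the limiting factor.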
-
Hello. I am having some difficulty getting the S3 filesystem to work in parallel. I need to read several hundred Parquet files, and from each file I need a segment selected by an index column. In a first stage I read some metadata columns, and in a second stage I read a payload from a data column (a LargeList float array).
For reading the metadata I am using a dataset that scans only one file. For the payload I am using a parquet/arrow reader with the readRowGroup API, finding the necessary row groups using the metadata/stats of the given Parquet file.
I have tried some approaches:
Use an outer parallelization loop over files using oneAPI TBB; for each file I create all the Arrow machinery (filesystems, readers, datasets...). With this approach I got the expected parallelism/performance when using the local filesystem, but when using the S3FileSystem I got very low throughput and CPU utilization, even though some of the post-processing of the data is CPU bound.
Sharing the S3FileSystem instance. Outside the parallelization loop I built an S3FileSystem instance with its associated io_context, which I shared among the tasks of the oneAPI TBB parallel-for loop. This approach is exemplified here:
With this approach I still got very low CPU utilization/throughput.
Any pointers or caveats would be useful.
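For reference, the row-group selection step in the payload stage can be sketched roughly as follows. This is a simplified illustration, not the actual code: `RowGroupStats` and `select_row_groups` are hypothetical names, and the min/max values are assumed to have already been extracted from the Parquet file's row-group statistics for the index column.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of one row group's min/max statistics for the index
// column, assumed to be filled in from the file metadata beforehand.
struct RowGroupStats {
    std::int64_t min_index;
    std::int64_t max_index;
};

// Return the positions of the row groups whose [min, max] range overlaps the
// requested closed interval [lo, hi]. Only these need to be read.
std::vector<int> select_row_groups(const std::vector<RowGroupStats>& stats,
                                   std::int64_t lo, std::int64_t hi) {
    assert(lo <= hi);
    std::vector<int> selected;
    for (int i = 0; i < static_cast<int>(stats.size()); ++i) {
        const bool overlaps = stats[i].min_index <= hi && stats[i].max_index >= lo;
        if (overlaps) {
            selected.push_back(i);
        }
    }
    return selected;
}
```

Each selected position is then passed to the row-group read call, so the number of S3 requests per file depends on how many row groups the segment spans.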
thanks =)