Hi,
Over the years, we have become quite happy with mongo-arrow and have also contributed some bug fixes. However, there is still one topic we would be happy to see solved, and that is speed, since mongo-arrow claims to be fast:
We use MongoDB to store large amounts of data and, as a consequence, also often query large amounts of data. The MongoDB server handles the load pretty well, but when it comes to fetching the result with mongo-arrow in Python there is one big bottleneck, and that seems to be BSON decoding, i.e. this line here.
When I comment it out and just print the len() of the batch, the speed is what I expect. Obviously, the BSON has to be decoded, and this takes CPU. Unfortunately, it seems that only one CPU core is used to decode BSON, which is quite frustrating given that modern libraries such as polars run their calculations on multiple cores. Sure, we have worked our way around this limitation (a rough sketch of the idea is below), but the implementations are not clean.
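
For illustration, here is a minimal sketch of the kind of workaround I mean, assuming the data can be partitioned on `_id` (or any other indexed key): each partition is fetched and decoded in its own process, and the resulting Arrow tables are concatenated afterwards. The connection URI, database/collection names, and the partition bounds are placeholders for our setup, not anything provided by mongo-arrow itself:

```python
# Sketch of a multi-process workaround for single-core BSON decoding.
# Assumes partitions cover disjoint _id ranges and that all partitions
# yield compatible Arrow schemas (otherwise concat_tables will fail).
from concurrent.futures import ProcessPoolExecutor

import pyarrow as pa
from pymongo import MongoClient
from pymongoarrow.api import find_arrow_all

MONGO_URI = "mongodb://localhost:27017"  # placeholder
DB_NAME = "mydb"                         # placeholder
COLLECTION = "mycoll"                    # placeholder


def fetch_partition(bounds):
    """Fetch and decode one _id range; each worker process gets its own client."""
    lo, hi = bounds
    client = MongoClient(MONGO_URI)
    coll = client[DB_NAME][COLLECTION]
    return find_arrow_all(coll, {"_id": {"$gte": lo, "$lt": hi}})


def fetch_parallel(partitions, workers=8):
    """Decode the partitions in parallel processes and stitch the tables together."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        tables = list(pool.map(fetch_partition, partitions))
    return pa.concat_tables(tables)
```

This works, but it pushes query partitioning and process management onto every caller, which is exactly the part that feels like it belongs inside the library's batch processing.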
Is there any chance to make the BSON decoding itself multi-core, or at least the batch processing, which looks like the perfect candidate for multi-core decoding?
P.S.: I've also spent quite some time googling for a solution, but other than frustrated users saying MongoDB is slow (which is not true in my opinion), I have not found anything. To give you some numbers: I can easily fetch a large dataset in a few seconds over a very fast connection to the server, while it takes a minute to decode. I fully understand the technology difference between a document store and a column-based relational database, and I accept some lower performance on bulk reading in exchange for gains elsewhere, but there is no sound reason for it to be this slow.
Thanks, Sebastian