INTPYTHON-807: Reading large amounts of data is rather slow (due to single-threaded BSON decoding) #357

@sibbiii

Hi,

Over the years, we have become quite happy with mongo-arrow and have also contributed some bug fixes. However, there is still one topic left that we would be happy to see solved, and that is speed, as mongo-arrow claims to be fast.

We use MongoDB to store large amounts of data and, as a consequence, often query large amounts of data as well. The MongoDB server handles the load pretty well, but when it comes to fetching the result with mongo-arrow in Python, there is one big bottleneck, and this seems to be BSON decoding, so this line here.

Image

When I comment it out and just print the len() of the batch, the speed is what I expect. Obviously, the BSON has to be decoded, and this takes CPU time. Unfortunately, it seems that only one CPU core is used to decode BSON, which is quite frustrating, as modern libraries such as polars run their calculations on multiple cores. Sure, we have worked our way around this limitation, but the implementations are not clean.

Is there any chance of making the BSON decoding itself multi-core, or at least the batch processing, which looks like the perfect candidate for multi-core decoding?
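
To make the batch-level idea concrete, here is a minimal sketch of what is meant. It is not how mongo-arrow decodes internally, and all names in it (`split_documents`, `decode_parallel`, `chunk_size`) are hypothetical. It relies only on the fact that every BSON document starts with a little-endian int32 holding its total length, so a buffer of concatenated documents can be split cheaply and the chunks handed to any `concurrent.futures` executor. The `decode` callable is an assumption left to the caller.

```python
import concurrent.futures as cf
import struct
from functools import partial
from typing import Any, Callable, Iterator, List


def split_documents(buf: bytes) -> Iterator[bytes]:
    """Yield each raw BSON document from a buffer of concatenated documents."""
    view = memoryview(buf)
    pos = 0
    while pos < len(view):
        if len(view) - pos < 4:
            raise ValueError("truncated length prefix")
        # Every BSON document begins with its total size as a little-endian int32.
        (size,) = struct.unpack_from("<i", view, pos)
        # The smallest valid document is 5 bytes: the prefix plus a trailing NUL.
        if size < 5 or pos + size > len(view):
            raise ValueError(f"invalid document size {size} at offset {pos}")
        yield bytes(view[pos:pos + size])
        pos += size


def _decode_chunk(decode: Callable[[bytes], Any], docs: List[bytes]) -> List[Any]:
    # Top-level function so that it can be pickled for worker processes.
    return [decode(doc) for doc in docs]


def decode_parallel(
    buf: bytes,
    decode: Callable[[bytes], Any],
    executor: cf.Executor,
    chunk_size: int = 1000,  # hypothetical default, tune per workload
) -> List[Any]:
    """Decode all documents in buf, one chunk of documents per executor task."""
    docs = list(split_documents(buf))
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    decoded: List[Any] = []
    # Executor.map returns results in submission order, so document order is kept.
    for part in executor.map(partial(_decode_chunk, decode), chunks):
        decoded.extend(part)
    return decoded
```

With PyMongo installed, one would presumably pass `bson.decode` as `decode` and a `cf.ProcessPoolExecutor` as `executor`, since threads alone are unlikely to help if decoding holds the GIL. Sending decoded objects back between processes has its own cost, so whether this pays off depends on the data.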

P.S.: I have also spent quite some time googling for a solution, but other than frustrated users who say MongoDB is slow (which is not true in my opinion), I have not found any solution. To give you some numbers: I can easily fetch a large dataset in a few seconds over a super fast connection to the server, while it takes a minute to decode. I fully understand the technology difference between an object store and a column-based relational database, and I accept somewhat lower performance on bulk reading in exchange for gains elsewhere, but there is no sound reason for it to be this slow.

Thanks, Sebastian
