Degrading performance on very large / billions / of entries #164

@axos88

Description

We're trying to build autocomplete for a big set (low billions) of SHA-256 hashes, so our dataset's entropy is quite high; one could say it's basically evenly distributed.

Building an index for 100m (unsorted) rows with the supplied binary takes about 12 minutes and peaks at about 3-4 GB of RAM.
Building the index for 1 billion rows has now been running for 4 hours, and usually peaks at around 25 GB of RAM usage.

This doesn't seem to match the linear complexity of the algorithm, and I wonder if it can be improved in some way? The process seems to be CPU-bound at times, and IO-bound at other times.

The process has already read 500 GB of data from disk, while the raw dataset (1 billion rows of 64 hex characters + newline) is 65 GB. I guess this is caused by the chunking, sorting and merging of the intermediary fsts, but it still seems excessive. I'm running with 25k batches, 14 threads and 8 max files.
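A rough back-of-envelope model, assuming the intermediaries are combined in a k-way merge tree where each pass re-reads roughly the full dataset (an assumption about the build process, not something confirmed here), lands in the same ballpark as the observed 500 GB:

```rust
// Read amplification in a k-way external merge: each pass combines up to
// `fan_in` chunks, so the number of passes is ceil(log_fan_in(chunks)),
// and each pass re-reads roughly the whole dataset.
fn merge_passes(mut chunks: u64, fan_in: u64) -> u64 {
    let mut passes = 0;
    while chunks > 1 {
        chunks = (chunks + fan_in - 1) / fan_in; // ceiling division
        passes += 1;
    }
    passes
}

fn main() {
    // Figures from above: 1 billion rows in batches of 25k → 40,000 chunks,
    // merged 8 at a time ("8 max files"), over a 65 GB raw dataset.
    let chunks = 1_000_000_000u64 / 25_000;
    let passes = merge_passes(chunks, 8);
    let read_gb = passes * 65;
    println!("{} passes, ~{} GB read", passes, read_gb);
    assert_eq!(passes, 6); // 6 full passes → ~390 GB of re-reads
}
```

With that model, ~6 passes over 65 GB explains most of the 500 GB read, which suggests the fan-in (max files) setting is a big lever here.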

All in all, merging very large fsts doesn't seem very efficient to me, in contrast to the claim in the excellent blog article you wrote about the implementation, caveats and tradeoffs of the data structure.

I wonder if it would help to save the intermediary fsts in mmapped files, or if you have any other suggestions on how to improve performance in this regard?

Our goal is to keep an index for autocompletion of an ever-growing dataset that is currently around 3 billion entries. Since an fst is immutable, we are planning to build multiple fsts, much like Java's generational garbage collection: new entries go into the lowest generation, which can be recalculated on the fly in an acceptable time, and when it reaches a size threshold it is merged into the next generation, thus achieving mutability. Querying will be done against all generations in parallel. This requires efficient merges/unions, but I'm not entirely sure how efficient they really are.
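The generational scheme above can be sketched with stdlib types only; here a `BTreeSet` stands in for an immutable `fst::Set`, and `extend` stands in for a streaming union rebuild (the struct and thresholds are hypothetical, just to illustrate the promotion logic):

```rust
use std::collections::BTreeSet;

// Generational layout sketch: each generation is an immutable sorted set
// (a stand-in for an fst here). New keys land in generation 0; when a
// generation exceeds its size threshold it is union-merged into the next
// one, much like levels in an LSM tree.
struct Generations {
    gens: Vec<BTreeSet<String>>, // gens[0] is the smallest/youngest
    thresholds: Vec<usize>,      // max size before promotion to the next gen
}

impl Generations {
    fn insert(&mut self, key: String) {
        self.gens[0].insert(key);
        // Promote any generation that has grown past its threshold.
        for i in 0..self.gens.len() - 1 {
            if self.gens[i].len() > self.thresholds[i] {
                let old = std::mem::take(&mut self.gens[i]);
                // With real fsts this would be a streaming union into a
                // freshly built set, not an in-place extend.
                self.gens[i + 1].extend(old);
            }
        }
    }

    // Querying hits every generation; in production these lookups could
    // run in parallel since each generation is immutable.
    fn contains(&self, key: &str) -> bool {
        self.gens.iter().any(|g| g.contains(key))
    }
}

fn main() {
    let mut idx = Generations {
        gens: vec![BTreeSet::new(), BTreeSet::new(), BTreeSet::new()],
        thresholds: vec![2, 8, usize::MAX],
    };
    for k in ["a", "b", "c", "d", "e"] {
        idx.insert(k.to_string());
    }
    assert!(idx.contains("a")); // promoted to an older generation
    assert!(!idx.contains("z"));
}
```

The key cost in this design is the promotion step: with real fsts each promotion rebuilds the target generation in full, so the thresholds control how often those large union rebuilds happen.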

Appreciate your input on this!
