Add benchmarks for indexing and searching #109
benches/indexing_benches.rs
Outdated
let _ = writer.add_document(doc).unwrap();
});

let _ = writer.commit();
I think this is a tough thing to benchmark. The I/O really happens in the commit, where any remaining buffered docs are written to disk. I wouldn't try benchmarking I/O; I would stick to, say, searching through docs that are buffered in memory.
Imagine your service going this way:
- Fill the index with juicy molecules
- Cut down size of index based on atom count, etc
- Load the top 1000 matching docs into cheminee's memory
- Rank those 1000 matching docs using search routine $XYZ
- Return the top N hits
I think you want to create the output vec from step 2 in static memory, somehow, and then you just want to benchmark the search routine of step 3 and assert the stable sort going into step 4.
Fake 1/2, benchmark 3, assert 4 is what you expect. If you try to benchmark the actual I/O to get through 1/2 you will find it's highly variable and does not give a reliable picture.
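To make the suggestion concrete, here is a minimal sketch of that shape using only the standard library: the candidate docs are built in memory, only the ranking call is timed, and the output ordering is asserted afterwards. `Hit`, `fake_hits`, and `rank_top_n` are hypothetical names for illustration, not cheminee's actual types or search routine, and the `Instant` loop is just a stand-in for whatever bench harness the project uses.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical stand-in for a matching doc already loaded into memory.
#[derive(Clone, Debug, PartialEq)]
struct Hit {
    id: u32,
    score: u32,
}

/// Fake the index-filling and filtering steps: build candidates in memory, no disk I/O.
fn fake_hits(n: u32) -> Vec<Hit> {
    (0..n).map(|id| Hit { id, score: (id * 7) % 10 }).collect()
}

/// Placeholder for the search routine under test (an assumption, not the real one):
/// stable sort by descending score, then keep the first `top_n` hits.
fn rank_top_n(mut hits: Vec<Hit>, top_n: usize) -> Vec<Hit> {
    // `slice::sort_by` is a stable sort, so equal scores keep their input order.
    hits.sort_by(|a, b| b.score.cmp(&a.score));
    hits.truncate(top_n);
    hits
}

fn main() {
    let hits = fake_hits(1000);

    // Time only the in-memory ranking (each pass also pays for one clone of the input).
    let start = Instant::now();
    for _ in 0..100 {
        black_box(rank_top_n(black_box(hits.clone()), 10));
    }
    println!("100 ranking passes took {:?}", start.elapsed());

    // Assert the ordering handed to the final step: scores never increase,
    // and ties stay in ascending id order because the sort is stable.
    let top = rank_top_n(hits, 10);
    assert_eq!(top.len(), 10);
    assert!(top.windows(2).all(|w| {
        w[0].score > w[1].score || (w[0].score == w[1].score && w[0].id < w[1].id)
    }));
}
```

Because nothing here touches the writer or the commit, run-to-run variance should come from the ranking code itself rather than from disk.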
benches/search_benches.rs
Outdated
use std::collections::{HashMap, HashSet};
use std::ops::Deref;
use tantivy::schema::Field;
use test::Bencher;
This include block is getting unwieldy; I'll do a future PR to reformat our project. Just a note to myself here...
benches/search_benches.rs
Outdated
let searcher = reader.searcher();
let results = basic_search(&searcher, &query, 100).unwrap();
let _final_results = aggregate_query_hits(searcher, results, &query).unwrap();
});
Again, I think benchmarking I/O is going to provide an inaccurate picture.
…ng, and searching
How about we just bench the core functionality? This has already been illuminating for determining the slowest bits of functionality. Standardization of molecules looks to be the slowest step by far, which I guess makes sense.
Description
Resolves #108 by adding benchmarks for indexing and searching.
We have now added benchmarks for most of the core functionality used for indexing and searching:
These benchmarks make it clear, for example, that our molecular standardization step is more computationally intensive than most of the other functionality.