Skip to content

Conversation

@JJ-Pineda
Copy link
Contributor

@JJ-Pineda JJ-Pineda commented Sep 23, 2024

Description
Resolves #108 by adding benchmarks for indexing and searching.

We now have added benchmarks for most of the core functionality used for indexing and searching:

  • Going from a SMILES to a Tantivy doc
  • Compounds processing (e.g. standardization)
  • Structure matching (e.g. exact matching and substruct matching)

These benchmarks make it clear, for example, that our molecular standardization step is comparatively more computationally intensive than most of the other functionality.

@JJ-Pineda JJ-Pineda marked this pull request as ready for review September 23, 2024 17:22
@JJ-Pineda JJ-Pineda requested a review from xrl September 23, 2024 17:22
let _ = writer.add_document(doc).unwrap();
});

let _ = writer.commit();
Copy link
Contributor

@xrl xrl Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a tough thing to benchmark. The I/O really happens in the commit, where any remaining buffered docs are written to disk. I wouldn't try benchmarking I/O, I would stick to, say, searching through docs that are buffered in memory.

Imagine your service going this way:

  1. Fill the index with juicy molecules
  2. Cut down size of index based on atom count, etc
  3. Load the top 1000 matching docs in to cheminee's memory
  4. Rank those 1000 matching docs using search routine $XYZ
  5. Return the top N hits

I think you want to create the output vec from step 2 in static memory, somehow, and then you just want to benchmark the serach routine of step 3 and assert the stable sort going in to step 4.

Fake 1/2, benchmark 3, assert 4 is what you expect. If you try to benchmark the actual I/O to get through 1/2 you will find it's highly variable and does not give a reliable picture.

use std::collections::{HashMap, HashSet};
use std::ops::Deref;
use tantivy::schema::Field;
use test::Bencher;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This include block is getting unwieldly, I'll do a future PR to reformat our project. Just a note to myself here...

let searcher = reader.searcher();
let results = basic_search(&searcher, &query, 100).unwrap();
let _final_results = aggregate_query_hits(searcher, results, &query).unwrap();
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think benchmarking I/O is going to provide an inaccurate picture

@JJ-Pineda
Copy link
Contributor Author

How about we just bench the core functionality? This has already been illuminating for determining the slowest bits of functionality. Standardization of molecules looks to be the slowest step by far, which I guess makes sense.

@JJ-Pineda JJ-Pineda requested a review from xrl September 25, 2024 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

We need more benchmark measurements

3 participants