[BI-2579] Optimize Germplasm Import and Post Endpoint #51

Open
wants to merge 15 commits into
base: germ-search-opts

jloux-brapi
Collaborator

These changes should be merged on top of #49, as they are built on a branch based off those changes.

The throughput of the BrAPI Germplasm POST was very slow at the time I started working on this import.

An import of 30k germs on a clean DB or a program with no germs would take somewhere in the ballpark of 30-60 minutes.

Once the number of germs per program reached around 60-90k, the import would take hours or never finish at all.

After the changes in this MR, the same 30k file can be imported in around 3-6 minutes on a DB holding 550k germs, with about 250k germs per program already loaded.

It should be noted the import is much faster, but issues with the cache remain.

The associated bi-api changes edited the import process so that the cache is only refreshed at the end of the import. But as the number of records grows, the cache takes longer and longer to return, and this can impact the experience of the import.

The performance is still worlds better than it was before, but fetching the cache can sometimes hold things up. Because the refresh happens at the end of the import, users who try to view the germplasm records right away may find that it takes a long time to load while the cache is still fetching.

I'll try to summarize the optimizations made:

  • Optimized the Germplasm save so that it creates entities in batch and also looks them up in batch, rather than one by one. This cut about 2000 queries per batch of 1000 imported records.
  • Removed saveAndFlushAlls everywhere. After some research, I found that flushing can greatly hurt performance: if entities being saved reference entities that were already flushed (as a lot of this code seems to do), Hibernate has to refresh those entities, or sometimes it doesn't know what to do and throws an error.
  • Batched Pedigree creation from Germplasm records. Several optimizations are tied together here, so it's hard to estimate how many queries the batching eliminated; somewhere between 20k and 30k. Without the changes from Germplasm Search Optimizations #49, performance suffered even more, because the pedigree batching code kicked off full germplasm search requests that paginated and fetched in memory, not only slowing the import but also bringing memory allocation to a screeching halt.
    • Batch the creation of pedigree records from germplasm when no pedigree string came in
    • Batch the creation of pedigree records from germplasm when pedigree string came in.
    • Batch the creation of pedigree edges
    • Batch the creation of pedigree edges from parents
    • Batch the creation of pedigree edges from progeny (children)
  • After these optimizations, I found that an extraordinary number of deletions were occurring on the pedigree_node_external_references table, which brought the import to a screeching halt at scale. The cause was the UpdateUtility unnecessarily clearing out existing external references that in fact already matched the incoming ones. I put a check in place to prevent this, but only applied it to the PedigreeNode code. Another solution could have been to skip setting pedigree external references entirely, as pedigree nodes likely do not need the same ones as germplasm. But the issue seemed severe and worrisome enough for other entities that I thought it would be helpful to keep the exref checking code for reuse.
  • After this discovery, I hit another sticking point: thousands of queries were being issued to convert the pedigree database data to DTOs, which were then dropped on the floor in the case of the Germplasm POST. I created some new method signatures to prevent the DTO conversion from happening for this use case. NOTE: This does not affect the data returned, as the response only contains germplasm data, and I kept those conversions untouched.
  • Added batching spring.data properties to the application.properties.template. These must be replicated to the application.properties of the relevant application/deployment, as they are imperative for the speedups relating to batch inserts.
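To illustrate the first bullet, here is a minimal sketch of why batched lookups cut the query count. It does not use the server's real repositories: `CountingStore`, `resolveOneByOne`, and `resolveInBatch` are hypothetical names, and the "database" is an in-memory map that simply counts how many lookups it receives.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not the BrAPI server's actual code.
class BatchLookupSketch {

    /** Stand-in for a repository; counts how many "queries" it receives. */
    static class CountingStore {
        private final Map<String, String> rows = new HashMap<>();
        int queryCount = 0;

        void put(String id, String name) {
            rows.put(id, name);
        }

        // One query per id (the slow pattern).
        String findById(String id) {
            queryCount++;
            return rows.get(id);
        }

        // One query for the whole collection of ids (the batched pattern).
        Map<String, String> findAllById(Collection<String> ids) {
            queryCount++;
            Map<String, String> found = new HashMap<>();
            for (String id : ids) {
                String name = rows.get(id);
                if (name != null) {
                    found.put(id, name);
                }
            }
            return found;
        }
    }

    // Issues ids.size() lookups.
    static List<String> resolveOneByOne(CountingStore store, List<String> ids) {
        List<String> names = new ArrayList<>();
        for (String id : ids) {
            names.add(store.findById(id));
        }
        return names;
    }

    // Issues a single lookup, then resolves each id from the returned map.
    static List<String> resolveInBatch(CountingStore store, List<String> ids) {
        Map<String, String> found = store.findAllById(ids);
        List<String> names = new ArrayList<>();
        for (String id : ids) {
            names.add(found.get(id));
        }
        return names;
    }
}
```

With 1000 ids, `resolveOneByOne` makes 1000 calls to the store while `resolveInBatch` makes one, and both return the same list. The real savings depend on how many related entities each germplasm record touches.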

This may or may not improve things, needs testing
Replace pagination calls with normal queries
This method helps avoid cases during batch updates where external
references were being updated when they didn't need to be.

This created a large number of deletions and insertions, slowing batch updates down
Made a new method putting relevant logic to the POST call.

Not sure if PUT could use this one, would need to test.
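The external-reference check described above can be sketched roughly as follows. This is an assumption about the shape of the check, not the code in this PR: `ExternalReference`, `sameReferences`, and `updateIfChanged` are hypothetical names, and the comparison treats the references as a set, so it ignores order and duplicates.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch; class and method names are not taken from the PR.
class ExternalReferenceCheckSketch {

    /** Minimal value type standing in for an external reference. */
    static final class ExternalReference {
        final String id;
        final String source;

        ExternalReference(String id, String source) {
            this.id = id;
            this.source = source;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ExternalReference)) return false;
            ExternalReference other = (ExternalReference) o;
            return Objects.equals(id, other.id) && Objects.equals(source, other.source);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, source);
        }
    }

    // True when both lists hold the same references (null is treated as empty).
    static boolean sameReferences(List<ExternalReference> existing,
                                  List<ExternalReference> incoming) {
        Set<ExternalReference> a = existing == null ? new HashSet<>() : new HashSet<>(existing);
        Set<ExternalReference> b = incoming == null ? new HashSet<>() : new HashSet<>(incoming);
        return a.equals(b);
    }

    // Replaces the existing references only when they differ from the incoming ones.
    // Returns true if a replacement happened (the case that would cost deletes and inserts).
    // The existing list must be non-null and mutable.
    static boolean updateIfChanged(List<ExternalReference> existing,
                                   List<ExternalReference> incoming) {
        if (sameReferences(existing, incoming)) {
            return false; // nothing to do, so no rows need to be deleted and re-inserted
        }
        existing.clear();
        if (incoming != null) {
            existing.addAll(incoming);
        }
        return true;
    }
}
```

When the incoming references match what is already stored, the collection is left untouched, so an ORM tracking that collection has no reason to issue deletes and re-inserts for it.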