forked from plantbreeding/brapi-Java-TestServer
[BI-2579] Optimize Germplasm Import and Post Endpoint #51
Open: jloux-brapi wants to merge 15 commits into germ-search-opts from germ-importer-opts
This may or may not improve things, needs testing
Replace pagination calls with normal queries
…s unused in response
This method helps with batch updates in which external references were being updated when they didn't need to be. Those needless updates created a large number of deletions and insertions, slowing batch updates down.
Made a new method holding the logic relevant to the POST call. Not sure if PUT could use this one; would need to test.
These changes should be merged on top of #49, as they are built on a branch based off those changes.
The throughput of the BrAPI Germplasm POST was very slow at the time I started working on this import.
An import of 30k germs on a clean DB or a program with no germs would take somewhere in the ballpark of 30-60 minutes.
Once the number of germs per program reached around 60-90k, the import would take hours or never finish at all.
After the host of changes made in this MR, the 30k file can be imported in around 3-6 minutes on a DB that already holds 550k germs, with about 250k germs per program already loaded.
It should be noted the import is much faster, but issues with the cache remain.
The associated bi-api changes edited the import process so that the cache is only refreshed at the end of the import. But as the number of records grows, the cache takes longer and longer to return, and this can impact the experience of the import.
The performance is still worlds better than it was before, but fetching the cache can sometimes hold things up. Because the refresh happens at the end of the import, users may find that germplasm records take a long time to load when they try to view them, because the cache is still being fetched.
I'll try to summarize the optimizations made:
- […] `saveAndFlushAll`s everywhere. After some research, I found that flush-alls can greatly hurt performance, because if the entities being saved reference entities that were flushed (as a lot of this code seems to do), Hibernate has to refresh those entities, or sometimes it doesn't even know what to do and gives an error.
- […] `pedigree_node_external_references` table, and brought the import to a screeching halt at scale. This ended up being due to the `UpdateUtility` unnecessarily clearing out existing external references that in fact already matched the incoming ones. A check was put in place to prevent this from occurring; however, I only applied this check to the PedigreeNode code. Another solution to this problem could have been to prevent setting pedigree external references entirely, as they likely do not need the same ones as germplasm. But this issue seemed so severe and worrisome for other entities that I thought it would be helpful to keep the exref checking code for reuse.
- […] `application.properties.template`. These must be replicated to the `application.properties` of the relevant application/deployment, as they are imperative for the speedups relating to batch inserts.
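
The external reference check described above can be illustrated with a minimal, self-contained sketch. This is not the code from this MR: `ExternalReference`, `ExternalReferenceSync`, `sameReferences`, and `replaceIfChanged` are hypothetical names, and a plain `List` stands in for the entity's reference collection. The idea is only to compare the existing references with the incoming ones and leave the collection untouched when they already match.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical stand-in for an external reference entity (not the server's class).
// Records provide value-based equals/hashCode, which the comparison relies on.
record ExternalReference(String source, String id) {}

final class ExternalReferenceSync {
    private ExternalReferenceSync() {}

    // True when both lists hold the same references, ignoring order and duplicates.
    static boolean sameReferences(List<ExternalReference> existing,
                                  List<ExternalReference> incoming) {
        return new HashSet<>(existing).equals(new HashSet<>(incoming));
    }

    // Replaces the contents of `existing` only when `incoming` differs.
    // Returns true if the collection was modified, false if it was left alone.
    static boolean replaceIfChanged(List<ExternalReference> existing,
                                    List<ExternalReference> incoming) {
        if (sameReferences(existing, incoming)) {
            return false; // already matching: nothing to delete or re-insert
        }
        existing.clear();
        existing.addAll(incoming);
        return true;
    }
}
```

With a persistence-managed collection, skipping the clear-and-re-add step when nothing changed is what would avoid the needless deletions and insertions; how that maps onto `UpdateUtility` and the PedigreeNode code is left to the actual diff.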