forked from plantbreeding/brapi-Java-TestServer
Germplasm Search Optimizations #49
Open
jloux-brapi wants to merge 10 commits into develop from germ-search-opts
Conversation
If not specified, this sort will be used to keep the endpoints idempotent.
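The idempotency point above is that without a stable default sort, repeated paginated requests can see records in different orders and pages can overlap or skip rows. A minimal sketch of that fallback rule, with illustrative names (the real server applies its default inside the query builder, not a helper like this):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: when the caller supplies no sort, fall back to a
// fixed, deterministic ordering (here natural order, standing in for an
// ORDER BY on the DbId) so paginated endpoints return stable pages.
public class DefaultSortSketch {
    static List<String> applySort(List<String> records, Comparator<String> requested) {
        // Default to a deterministic key when no sort was requested.
        Comparator<String> effective =
                (requested != null) ? requested : Comparator.naturalOrder();
        List<String> sorted = new ArrayList<>(records);
        sorted.sort(effective);
        return sorted;
    }
}
```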
…tion

Added utility methods to SearchQueryBuilder and BrAPIRepositoryImpl to allow proper pagination for Hibernate fetch queries without exhausting memory. Also added methods to run queries through the SearchQueryBuilder with no pagination at all, so pagination is skipped when it is not required. This had to be addressed specifically for the BI cache, but the resulting code is reusable for other use cases.

Modified the GermplasmApiController's searchGermplasmPost endpoint to accommodate two code paths:
- When no page and pageSize are supplied, the code grabs all germplasm without pagination. Good for large data grabs, but dangerous with excessively large amounts of data. This exists entirely to meet BI's current use case, which we have strongly advised they move off of.
- When page and/or pageSize are supplied, paginate as requested, with a default page size of 1000 if none is requested.

Additionally, these configurable variables are now consistent and usable across BrAPIController and PagingUtility, which both utilize them.
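The two code paths above can be sketched as follows. This is a simplified stand-in for the real searchGermplasmPost logic: the names (search, DEFAULT_PAGE_SIZE, the in-memory list) are illustrative, with only the default page size of 1000 taken from the description above.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch of the two code paths in the germplasm search endpoint:
// no page/pageSize -> fetch everything unpaginated (BI's current use case);
// otherwise -> paginate, defaulting the page size to 1000.
public class GermplasmSearchSketch {
    static final int DEFAULT_PAGE_SIZE = 1000; // configurable on the real server

    // Stand-in for the germplasm table (2500 fake record IDs).
    static List<Integer> all =
            IntStream.range(0, 2500).boxed().collect(Collectors.toList());

    static List<Integer> search(Integer page, Integer pageSize) {
        if (page == null && pageSize == null) {
            // Path 1: no pagination at all. Dangerous for very large
            // datasets (heap exhaustion), kept only as a stopgap.
            return all;
        }
        // Path 2: paginate as requested, with defaults.
        int size = (pageSize != null) ? pageSize : DEFAULT_PAGE_SIZE;
        int p = (page != null) ? page : 0;
        int from = Math.min(p * size, all.size());
        int to = Math.min(from + size, all.size());
        return all.subList(from, to);
    }
}
```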
@@ -156,7 +156,7 @@ public ResponseEntity<SeedLotTransactionListResponse> seedlotsTransactionsGet(
    validateAcceptHeader(request);
    Metadata metadata = generateMetaDataTemplate(page, pageSize);
    List<SeedLotTransaction> data = seedLotService.findSeedLotTransactions(transactionDbId, seedLotDbId,
        germplasmDbId, germplasmName, crossDbId, crossName, commonCropName, programDbId, externalReferenceId,
externalReferenceId is the newer, correct spelling, introduced in BrAPI v2.1. externalReferenceID is the deprecated one and could be deleted if there is no need for backwards compatibility with v2.0.
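If backwards compatibility with v2.0 is kept for now, one common pattern is to accept both spellings and prefer the newer one. A minimal sketch (the resolve helper is hypothetical, not code from this PR):

```java
// Hypothetical sketch: accept both the BrAPI v2.1 spelling
// (externalReferenceId) and the deprecated v2.0 spelling
// (externalReferenceID), preferring the newer one when both are sent.
public class ExternalRefParam {
    static String resolve(String externalReferenceId, String externalReferenceID) {
        // Prefer the v2.1 parameter; fall back to the deprecated v2.0 one.
        return (externalReferenceId != null) ? externalReferenceId : externalReferenceID;
    }
}
```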
A couple of optimizations, bug fixes, and workarounds have been made to improve the performance and usability of the germplasm search endpoint.
We were seeing the Hibernate warning "firstResult/maxResults specified with collection fetch; applying in memory", which was a big clue to why this search endpoint was performing so poorly. Tim essentially tried to implement the fix as Vlad described, but there was a critical error: we passed all of the query logic through to the search query builder, which previously ran paginated queries no matter what. This means that while we did the grunt work of fetching the IDs separately and giving them to another query, we still ended up with the same applying-in-memory behavior as before. This code has been changed so that not only does the SearchQueryBuilder support non-paginated queries, there is also more support for the kind of double query that paginated fetches require (see GermplasmService.findGermplasmEntities()). The performance improvement is orders of magnitude: on a dataset of 550k germplasm on a program, fetching 100 records went from about 4.5 seconds to about 500ms, and 1000 records from 15 seconds to 1 second.

The germplasm search POST endpoint now accepts page and pageSize, or neither attribute, in the request. When neither is present, new logic uses the SearchQueryBuilder's non-paginated query support to return all data at once. This has a breaking point, however: at about 250k germplasm records per program, Java completely exhausts its heap trying to load all the data into entity objects and convert them to JSON. It should be noted this also isn't particularly fast, as this is a large amount of data to transmit; 125k records takes about 30 seconds on average to come back. But this should work as a stopgap in the meantime.

To fix the "Could not prepare SQL statement" error, we needed a way to completely refuse lookups that could produce more than 65k SQL parameters, as this is the limit. These occurred mostly in my testing when I tried to paginate germplasm sets larger than 65k records, because in order to fetch these records we need to pass the IDs found by the initial query into the later join-fetch queries. I suppose there are other ways around this, like breaking the work up into more queries, but the right solution feels like incentivizing the requester to actually request data from the search endpoint in a meaningful and more performant way. That is, we have configured a way to control the maximum allowable page size for page requests on the server. For now, it is 65k, and this applies to all entities, not just Germplasm. Specifically for BI, this will be a problem for the cache, which we have addressed for the germplasm entity but not for other larger entities they might have, like observations and observation units. I may revert this commit and put it somewhere else separately if it is a problem loading a cache for large datasets, or I might add to this body of code.

Note: This PR and its associated commits should also be merged to the prod server once it is verified on BI's end.
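The double-query pattern and the 65k parameter guard described above can be sketched together. This is an illustrative stand-in, not the real SearchQueryBuilder/GermplasmService code: query one selects only IDs (so Hibernate can apply LIMIT/OFFSET in SQL instead of in memory), and query two join-fetches restricted to those IDs, refusing page sizes that would exceed the SQL parameter limit.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the two-query paginated fetch:
//   1) paginated ID-only query (no collection fetch, so LIMIT/OFFSET run in SQL)
//   2) join-fetch query restricted to those IDs (bounded by the SQL param limit)
public class TwoQuerySketch {
    static final int MAX_SQL_PARAMS = 65_000; // server-configured max page size

    // Stand-in for: select g.id from Germplasm g ... with setFirstResult/setMaxResults.
    static List<Integer> fetchIdsPage(List<Integer> allIds, int page, int pageSize) {
        if (pageSize > MAX_SQL_PARAMS) {
            // Refuse lookups that would exceed the SQL parameter limit
            // in the follow-up "where id in (:ids)" query.
            throw new IllegalArgumentException("pageSize exceeds SQL parameter limit: " + pageSize);
        }
        int from = Math.min(page * pageSize, allIds.size());
        int to = Math.min(from + pageSize, allIds.size());
        return allIds.subList(from, to);
    }

    // Stand-in for: select g from Germplasm g join fetch ... where g.id in (:ids).
    static List<String> joinFetchByIds(List<Integer> ids) {
        return ids.stream().map(id -> "germplasm-" + id).collect(Collectors.toList());
    }
}
```

Because the second query only ever receives one page of IDs, the collection fetch never forces Hibernate into in-memory pagination, which is where the original slowdown came from.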