-
Notifications
You must be signed in to change notification settings - Fork 0
[BI-2579] Optimize Germplasm Import and Post Endpoint #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This may or may not improve things, needs testing
Replace pagination calls with normal queries
…s unused in response
This method will help to alleviate use cases during batch updates where external references were getting updated when they didn't need to. This created a large amount of deletions and insertions slowing batch updates down
Made a new method putting relevant logic to the POST call. Not sure if PUT could use this one, would need to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to test, I wanted to give you my initial feedback.
Thank you for helping us with this.
src/main/java/org/brapi/test/BrAPITestServer/service/germ/GermplasmService.java
Show resolved
Hide resolved
public List<PedigreeNodeEntity> getPedigreeNodes(List<String> germplasmDbIds) { | ||
List<PedigreeNodeEntity> nodes = new ArrayList<>(); | ||
|
||
// TODO: Might have to make a custom query for this that fetches the germ eagerly, bc need to compare the germIds. Have to see if this is a significant performance hit. | ||
List<PedigreeNodeEntity> dbNodeList = pedigreeRepository.findByGermplasm_IdIn(germplasmDbIds); | ||
|
||
Map<String, List<PedigreeNodeEntity>> nodesGroupedByGerm = dbNodeList.stream().collect(Collectors.groupingBy(pn -> pn.getGermplasm().getId())); | ||
|
||
nodesGroupedByGerm.forEach((germId, nodesByGerm) -> { | ||
if (nodesByGerm.size() > 1) { | ||
log.error("multiple pedigree nodes found for a single germplasm"); | ||
} | ||
|
||
Optional<PedigreeNodeEntity> node = nodesByGerm.stream().findFirst(); | ||
|
||
node.ifPresent(nodes::add); | ||
}); | ||
|
||
return nodes; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a unique constraint on the pedigree_node
table for the germplasm_id
column and simplify this method?
public List<PedigreeNodeEntity> getPedigreeNodes(List<String> germplasmDbIds) { | |
List<PedigreeNodeEntity> nodes = new ArrayList<>(); | |
// TODO: Might have to make a custom query for this that fetches the germ eagerly, bc need to compare the germIds. Have to see if this is a significant performance hit. | |
List<PedigreeNodeEntity> dbNodeList = pedigreeRepository.findByGermplasm_IdIn(germplasmDbIds); | |
Map<String, List<PedigreeNodeEntity>> nodesGroupedByGerm = dbNodeList.stream().collect(Collectors.groupingBy(pn -> pn.getGermplasm().getId())); | |
nodesGroupedByGerm.forEach((germId, nodesByGerm) -> { | |
if (nodesByGerm.size() > 1) { | |
log.error("multiple pedigree nodes found for a single germplasm"); | |
} | |
Optional<PedigreeNodeEntity> node = nodesByGerm.stream().findFirst(); | |
node.ifPresent(nodes::add); | |
}); | |
return nodes; | |
} | |
public List<PedigreeNodeEntity> getPedigreeNodes(List<String> germplasmDbIds) { | |
return pedigreeRepository.findByGermplasm_IdIn(germplasmDbIds); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think probably, but this is a question for @BrapiCoordinatorSelby . For what it's worth, the current lookup isn't all that inefficient.
src/main/java/org/brapi/test/BrAPITestServer/service/germ/PedigreeService.java
Outdated
Show resolved
Hide resolved
|
||
// Find out which germIds were not found in the DB. Use a set for improved performance on the contains check. | ||
// TODO: Check if the germEntity is already populated by getPedigreeNodes, and if this block results in more DB transactions. | ||
Set<String> germIdsOfFoundNodes = dbNodes.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's longer, but germDbIdsOfFoundNodes
might be more clear. Germplasm have so many identifiers, it's helpful to be specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree, entityNameId has meant the name of the entities unique id at every place I have worked at.
src/main/java/org/brapi/test/BrAPITestServer/service/germ/PedigreeService.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
public List<PedigreeNode> convertFromGermplasmToPedigreeBatchUsingNames(List<Germplasm> germplasms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there will be issues using germplasm name for lookups, it isn't guaranteed to be unique. This code is only to support our import use case, right? In that case, names coming from DeltaBreed should be unique, but if it's used outside of that, it could be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I have thought about that. That's kind of why the code works like this for now as a stopgap:
public void updateGermplasmPedigreeForPost(List<Germplasm> data) throws BrAPIServerException {
// T = With pedigree, F = Without pedigree
Map<Boolean, List<Germplasm>> germsWithAndWithoutPedigree
= data.stream().collect(Collectors.partitioningBy(g -> g.getPedigree() != null));
if (!germsWithAndWithoutPedigree.get(true).isEmpty()) {
savePedigreeNodes(convertFromGermplasmToPedigreeBatchUsingNames(germsWithAndWithoutPedigree.get(true)), false);
}
if (!germsWithAndWithoutPedigree.get(false).isEmpty()) {
List<PedigreeNode> createPedigreeNodes = new ArrayList<>();
for (Germplasm germplasm : germsWithAndWithoutPedigree.get(false)) {
createPedigreeNodes.add(convertFromGermplasmToPedigree(germplasm));
}
savePedigreeNodes(createPedigreeNodes, false);
}
}
If it comes in with a pedigree, we create by that in batch. If it doesn't we do it the other way, and it's not really too slow if we initialize the other way.
We honestly could have this kept this unchanged (and it is via updateGermplasmPedigree
for the PUT method) and it wouldn't have made much of a performance difference, but really what I created was what I found was happening under the hood: The code would lookup by pedigree name if it existed in the end. I just tried to make it much more obvious and to do it in batch, bc the by pedigree lookup was slower.
src/main/java/org/brapi/test/BrAPITestServer/service/germ/PedigreeService.java
Outdated
Show resolved
Hide resolved
List<String> parentEdgesToDelete = new ArrayList<>(); | ||
|
||
SearchQueryBuilder<PedigreeEdgeEntity> search = new SearchQueryBuilder<>(PedigreeEdgeEntity.class); | ||
search.appendList(germIdsWithParentNodes, "connectedNode.germplasm.id"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this actually working with the correct spelling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was yea, I also noticed that it somehow worked without 🤷. It's possible hibernate might be able to detect or guestimate, idk. Regardless, I will change to the correct spelling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking on this more, it's very possible this method was never called in my testing. I'll breakpoint the next time I do a full test.
src/main/java/org/brapi/test/BrAPITestServer/service/germ/GermplasmService.java
Outdated
Show resolved
Hide resolved
List<GermplasmEntity> motherGerms = germplasmService.findByNames(new ArrayList<>(germsByPedigreeMother.values())); | ||
List<GermplasmEntity> fatherGerms = germplasmService.findByNames(new ArrayList<>(germsByPedigreeFather.values())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding error handling or logging for cases where the mother or father germplasm names are not found in the database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I am doing this in batch, it's not exactly trivial to find each name one by one that didn't match or without adding extra overhead. That said, it's certainly easy to give warnings if none of the names in the map exist in the DB, which would in general indicate a pretty larger scale issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested, working.
These changes should be merged on top of #49, as they are built on a branch based off those changes.
The thruput of the BrAPI Germplasm POST was very slow at the time I started working on this import.
An import of 30k germs on a clean DB or a program with no germs would take somewhere in the ballpark of 30-60 minutes.
Once the number of germs per program reached around 60-90k, the import would take hours or never finish at all.
After the host of changes made in the body of this MR, the 30k file can be imported in around 3-6 minutes on a DB with 550k germs on it with about 250k germs per program already loaded in.
It should be noted the import is much faster, but issues with the cache remain.
The associated bi-api changes edited the import process so that the cache is only refreshed at the end of the import. But as the number of records grows, the cache takes longer and longer to return, and this can impact the experience of the import.
The performance is still worlds better than it was before, but sometimes the cache getting can hold things up, and because it happens at the end of the import, users may find that when they try to view the germplasm records is takes a long time to load because the cache is still fetching.
I'll try to summarize the optimizations made:
saveAndFlushAlls
everywhere. After some research, I found that flushalls can greatly hurt performance time because if entities being saved reference entities that were flushed (as a lot of this code seems to do) hibernate has to refresh the entities, or sometimes it doesn't even know what to do and gives an error.pedigree_node_external_references
table, and brought the import to a screeching halt at scale. This ended up being due to theUpdateUtility
unnecessarily clearing out existing external references that actual in fact already matched the ones that existed. A check was put in place to prevent this from ever occurring, however I only applied this check to the PedigreeNode code. Another solution to this problem could have been to just prevent setting pedigree external references entirely, as they likely do not need the same ones as germplasm. But this issue seemed so severe and worrisome for other entities I thought it would be helpful to keep the exref checking code for reuse.application.properties.template
. These must be replicated to theapplication.properties
of the relevant application/deployment, as they are imperative for speedups relating to batch inserts