[BI-2579] Optimize Germplasm Import and Post Endpoint #51


Merged: 20 commits into develop on Apr 24, 2025

Conversation

jloux-brapi (Collaborator)

These changes should be merged on top of #49, as they are built on a branch based off those changes.

The throughput of the BrAPI Germplasm POST was very low at the time I started working on this import.

An import of 30k germs (germplasm records) on a clean DB, or into a program with no germs, would take somewhere in the ballpark of 30-60 minutes.

Once the number of germs per program reached around 60-90k, the import would take hours or never finish at all.

After the host of changes made in the body of this MR, the 30k file can be imported in around 3-6 minutes on a DB holding 550k germs, with about 250k germs already loaded into the target program.

It should be noted that while the import is much faster, issues with the cache remain.

The associated bi-api changes edited the import process so that the cache is only refreshed at the end of the import. But as the number of records grows, the cache takes longer and longer to return, which can impact the import experience.

The performance is still worlds better than it was before, but fetching the cache can sometimes hold things up, and because it happens at the end of the import, users may find that viewing the germplasm records takes a long time while the cache is still fetching.

I'll try to summarize the optimizations made:

  • Optimized the Germplasm save so that it creates entities in batch, and also looks up these entities in batch rather than one by one. This cut out about 2,000 queries per import chunk of 1,000 records.
  • Removed saveAndFlushAll calls everywhere. After some research, I found that flush-alls can greatly hurt performance, because if entities being saved reference entities that were flushed (as a lot of this code seems to do), Hibernate has to refresh those entities, or sometimes it doesn't know what to do at all and throws an error.
  • Batched Pedigree creation from Germplasm records. Several optimizations were tied together here, so it's hard to estimate how many queries this batching eliminated; anywhere between 20-30k. Without the changes from Germplasm Search Optimizations #49, performance suffered even more, because the pedigree batching code kicked off full germplasm search requests, which tried paginating and fetching in memory, not only slowing performance but bringing memory allocation to a screeching halt.
    • Batch the creation of pedigree records from germplasm when no pedigree string came in.
    • Batch the creation of pedigree records from germplasm when a pedigree string came in.
    • Batch the creation of pedigree edges
    • Batch the creation of pedigree edges from parents
    • Batch the creation of pedigree edges from progeny (children)
  • I found that after these optimizations were made, an extraordinary number of deletions were occurring on the pedigree_node_external_references table, which brought the import to a screeching halt at scale. This turned out to be the UpdateUtility unnecessarily clearing out existing external references that in fact already matched the incoming ones. A check was put in place to prevent this from ever occurring, though I only applied it to the PedigreeNode code. Another solution could have been to stop setting pedigree external references entirely, as they likely do not need the same ones as germplasm. But this issue seemed so severe and worrisome for other entities that I thought it would be helpful to keep the exref-checking code for reuse.
  • After this discovery, I hit another sticking point: thousands of queries were being issued to convert the pedigree database data to DTOs, which were then completely dropped on the floor in the case of the Germplasm POST. I created some new method signatures to prevent the DTO conversion from ever happening for this use case. NOTE: This does not affect the data returned; the response only includes germplasm data, and I kept those conversions untouched.
  • Added batching spring.data properties to application.properties.template. These must be replicated to the application.properties of the relevant application/deployment, as they are imperative for the speedups relating to batch inserts.
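The template's exact entries aren't reproduced in this PR description; for reference, the standard Hibernate batching settings look roughly like the following (values illustrative; check application.properties.template for the actual entries):

```properties
# Illustrative Hibernate batching settings (check application.properties.template
# for the actual property names and values used by the project)
spring.jpa.properties.hibernate.jdbc.batch_size=100
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true
```

Without order_inserts/order_updates, Hibernate interleaves statements across entity types and the JDBC driver cannot coalesce them into batches.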
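The batch save/lookup pattern from the first bullet can be sketched in miniature. The in-memory map below stands in for the database, and each helper call models one SQL round trip; all names here are illustrative, not the actual bi-api code:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class BatchLookupSketch {
    // Stand-in for the database; each helper call below models one SQL query.
    static final Map<String, String> DB = Map.of("g1", "Germ One", "g2", "Germ Two");
    static int queryCount = 0;

    // One-by-one lookup: N records cost N queries.
    static Optional<String> findById(String id) {
        queryCount++;
        return Optional.ofNullable(DB.get(id));
    }

    // Batched lookup: N records cost a single IN (...) query.
    static Map<String, String> findByIdIn(Collection<String> ids) {
        queryCount++;
        return ids.stream()
                .filter(DB::containsKey)
                .collect(Collectors.toMap(Function.identity(), DB::get));
    }

    public static void main(String[] args) {
        List<String> ids = List.of("g1", "g2", "g3");

        queryCount = 0;
        ids.forEach(BatchLookupSketch::findById);
        System.out.println("one-by-one queries: " + queryCount); // 3

        queryCount = 0;
        Map<String, String> found = findByIdIn(ids);
        System.out.println("batched queries: " + queryCount);    // 1
        System.out.println("entities found: " + found.size());   // 2
    }
}
```

Scaled up, this shape is why the change cuts on the order of 2,000 queries per 1,000-record chunk: every per-entity find collapses into one IN-list query per entity type.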
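The external-reference fix described above boils down to a set-equality guard before any delete-and-reinsert. A minimal sketch (ExRef and needsUpdate are hypothetical names, not the actual UpdateUtility code):

```java
import java.util.*;

public class ExternalReferenceGuard {
    // Minimal stand-in for an external reference (source + reference id pair).
    record ExRef(String source, String referenceId) {}

    // True only when the incoming references actually differ from what is stored,
    // i.e. when a delete-and-reinsert is genuinely needed. Comparing as sets
    // treats "same refs in any order" as unchanged.
    static boolean needsUpdate(Collection<ExRef> existing, Collection<ExRef> incoming) {
        return !new HashSet<>(existing).equals(new HashSet<>(incoming));
    }

    public static void main(String[] args) {
        List<ExRef> stored = List.of(new ExRef("breedinginsight.org", "abc-123"));
        List<ExRef> unchanged = List.of(new ExRef("breedinginsight.org", "abc-123"));
        List<ExRef> changed = List.of(new ExRef("breedinginsight.org", "xyz-999"));

        System.out.println(needsUpdate(stored, unchanged)); // false -> skip the writes
        System.out.println(needsUpdate(stored, changed));   // true  -> rewrite the refs
    }
}
```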

Selected commit messages:

  • This may or may not improve things, needs testing
  • Replace pagination calls with normal queries
  • This method will help alleviate use cases during batch updates where external references were getting updated when they didn't need to. This created a large number of deletions and insertions, slowing batch updates down.
  • Made a new method putting relevant logic in the POST call. Not sure if PUT could use this one; would need to test.

@mlm483 mlm483 left a comment


I still need to test, but I wanted to give you my initial feedback first.

Thank you for helping us with this.

Comment on lines +175 to +194
public List<PedigreeNodeEntity> getPedigreeNodes(List<String> germplasmDbIds) {
    List<PedigreeNodeEntity> nodes = new ArrayList<>();

    // TODO: Might have to make a custom query for this that fetches the germ eagerly, bc need to compare the germIds. Have to see if this is a significant performance hit.
    List<PedigreeNodeEntity> dbNodeList = pedigreeRepository.findByGermplasm_IdIn(germplasmDbIds);

    Map<String, List<PedigreeNodeEntity>> nodesGroupedByGerm = dbNodeList.stream()
            .collect(Collectors.groupingBy(pn -> pn.getGermplasm().getId()));

    nodesGroupedByGerm.forEach((germId, nodesByGerm) -> {
        if (nodesByGerm.size() > 1) {
            log.error("multiple pedigree nodes found for a single germplasm");
        }

        Optional<PedigreeNodeEntity> node = nodesByGerm.stream().findFirst();

        node.ifPresent(nodes::add);
    });

    return nodes;
}

Can we add a unique constraint on the pedigree_node table for the germplasm_id column and simplify this method?

Suggested change (collapse the method above to a single repository call):

public List<PedigreeNodeEntity> getPedigreeNodes(List<String> germplasmDbIds) {
    return pedigreeRepository.findByGermplasm_IdIn(germplasmDbIds);
}

jloux-brapi (Collaborator, Author)

I think probably, but this is a question for @BrapiCoordinatorSelby. For what it's worth, the current lookup isn't all that inefficient.


// Find out which germIds were not found in the DB. Use a set for improved performance on the contains check.
// TODO: Check if the germEntity is already populated by getPedigreeNodes, and if this block results in more DB transactions.
Set<String> germIdsOfFoundNodes = dbNodes.stream()

I know it's longer, but germDbIdsOfFoundNodes might be more clear. Germplasm have so many identifiers, it's helpful to be specific.

jloux-brapi (Collaborator, Author)

I'm not sure I agree; entityNameId has meant the name of the entity's unique id at every place I have worked.

}
}

public List<PedigreeNode> convertFromGermplasmToPedigreeBatchUsingNames(List<Germplasm> germplasms)

I wonder if there will be issues using germplasm name for lookups, it isn't guaranteed to be unique. This code is only to support our import use case, right? In that case, names coming from DeltaBreed should be unique, but if it's used outside of that, it could be a problem.

jloux-brapi (Collaborator, Author)

Yeah, I have thought about that. That's kind of why the code works like this for now, as a stopgap:

	public void updateGermplasmPedigreeForPost(List<Germplasm> data) throws BrAPIServerException {
		// T = With pedigree, F = Without pedigree
		Map<Boolean, List<Germplasm>> germsWithAndWithoutPedigree
				= data.stream().collect(Collectors.partitioningBy(g -> g.getPedigree() != null));

		if (!germsWithAndWithoutPedigree.get(true).isEmpty()) {
			savePedigreeNodes(convertFromGermplasmToPedigreeBatchUsingNames(germsWithAndWithoutPedigree.get(true)), false);
		}

		if (!germsWithAndWithoutPedigree.get(false).isEmpty()) {
			List<PedigreeNode> createPedigreeNodes = new ArrayList<>();

			for (Germplasm germplasm : germsWithAndWithoutPedigree.get(false)) {
				createPedigreeNodes.add(convertFromGermplasmToPedigree(germplasm));
			}

			savePedigreeNodes(createPedigreeNodes, false);
		}
	}

If a germplasm comes in with a pedigree string, we create nodes from it in batch. If it doesn't, we do it the other way, and that isn't really too slow.

We honestly could have kept this unchanged (and it is unchanged via updateGermplasmPedigree for the PUT method) and it wouldn't have made much of a performance difference. But what I created makes explicit what was already happening under the hood: the code would ultimately look up by pedigree name if one existed. I just tried to make that much more obvious and to do it in batch, because the by-pedigree lookup was slower.
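The partitioning step in updateGermplasmPedigreeForPost above is just Collectors.partitioningBy; a standalone toy version (Germ is a stand-in for the BrAPI Germplasm class):

```java
import java.util.*;
import java.util.stream.Collectors;

public class PartitionSketch {
    // Toy stand-in for the BrAPI Germplasm model.
    record Germ(String name, String pedigree) {}

    // T = with pedigree string, F = without, mirroring the comment in the PR code.
    static Map<Boolean, List<Germ>> partition(List<Germ> data) {
        return data.stream().collect(Collectors.partitioningBy(g -> g.pedigree() != null));
    }

    public static void main(String[] args) {
        List<Germ> data = List.of(
                new Germ("A", "Mother/Father"),
                new Germ("B", null),
                new Germ("C", "Mother/"));

        Map<Boolean, List<Germ>> split = partition(data);
        System.out.println(split.get(true).size());  // 2 -> batched by-name path
        System.out.println(split.get(false).size()); // 1 -> per-record conversion
    }
}
```

partitioningBy always populates both keys, so the isEmpty() checks in the real method are safe even when every record lands on one side.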

List<String> parentEdgesToDelete = new ArrayList<>();

SearchQueryBuilder<PedigreeEdgeEntity> search = new SearchQueryBuilder<>(PedigreeEdgeEntity.class);
search.appendList(germIdsWithParentNodes, "connectedNode.germplasm.id");

Was this actually working with the correct spelling?

jloux-brapi (Collaborator, Author)

It was, yeah. I also noticed that it somehow worked without it 🤷. It's possible Hibernate might be able to detect or guesstimate, I don't know. Regardless, I will change it to the correct spelling.

jloux-brapi (Collaborator, Author)

Thinking on this more, it's very possible this method was never called in my testing. I'll breakpoint the next time I do a full test.

Comment on lines 830 to 831
List<GermplasmEntity> motherGerms = germplasmService.findByNames(new ArrayList<>(germsByPedigreeMother.values()));
List<GermplasmEntity> fatherGerms = germplasmService.findByNames(new ArrayList<>(germsByPedigreeFather.values()));

Consider adding error handling or logging for cases where the mother or father germplasm names are not found in the database.

jloux-brapi (Collaborator, Author)

Since I am doing this in batch, it's not exactly trivial to find each name that didn't match one by one without adding extra overhead. That said, it's certainly easy to warn if none of the names in the map exist in the DB, which would in general indicate a pretty large-scale issue.
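Such a coarse check might look like the following sketch (hypothetical names, not the actual import code):

```java
import java.util.*;

public class BatchNameLookupWarning {
    // After a batched name lookup, emit one coarse warning when nothing at all
    // matched, instead of checking every name individually (which would undo
    // the benefit of batching).
    static Optional<String> warnIfNoneFound(Collection<String> requested, Collection<String> found) {
        if (!requested.isEmpty() && found.isEmpty()) {
            return Optional.of("none of " + requested.size() + " parent germplasm names were found in the DB");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(warnIfNoneFound(List.of("Mother1", "Mother2"), List.of()));
        System.out.println(warnIfNoneFound(List.of("Mother1"), List.of("Mother1")));
    }
}
```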

@jloux-brapi jloux-brapi requested review from mlm483 and dmeidlin April 4, 2025 14:48
@jloux-brapi jloux-brapi changed the base branch from germ-search-opts to develop April 18, 2025 12:58
@mlm483 mlm483 left a comment


Tested, working.

@nickpalladino nickpalladino merged commit 399da44 into develop Apr 24, 2025