[BI-2579] Optimize Germplasm Import and Post Endpoint #51


Merged: 20 commits into develop on Apr 24, 2025

Conversation

jloux-brapi (Collaborator)

These changes should be merged on top of #49, as they are built on a branch based off those changes.

The throughput of the BrAPI Germplasm POST was very low at the time I started working on this import.

An import of 30k germs (germplasm records) on a clean DB, or into a program with no germs, would take somewhere in the ballpark of 30-60 minutes.

Once the number of germs per program reached around 60-90k, the import would take hours or never finish at all.

After the host of changes made in the body of this MR, the 30k file can be imported in around 3-6 minutes on a DB holding 550k germs, with about 250k germs already loaded into the target program.

It should be noted that while the import is much faster, issues with the cache remain.

The associated bi-api changes edited the import process so that the cache is only refreshed at the end of the import. But as the number of records grows, the cache takes longer and longer to return, which can impact the import experience.

The performance is still worlds better than it was before, but fetching the cache can sometimes hold things up, and because it happens at the end of the import, users may find that viewing the germplasm records takes a long time while the cache is still fetching.

I'll try to summarize the optimizations made:

  • Optimized the Germplasm save so that it creates entities in batch, and also looks up these entities in batch rather than one by one. This cut out about 2,000 queries per import chunk of 1,000 records.
  • Removed saveAndFlushAll calls everywhere. After some research, I found that flush-alls can greatly hurt performance, because if entities being saved reference entities that were flushed (as a lot of this code seems to do), Hibernate has to refresh those entities, or sometimes it doesn't know what to do at all and throws an error.
  • Batched Pedigree creation from Germplasm records. Several optimizations were tied together here, so it's hard to estimate how many queries this batching eliminated; anywhere between 20-30k. Without the changes from Germplasm Search Optimizations #49, performance suffered even more, because the pedigree batching code kicked off full germplasm search requests, which tried paginating and fetching in memory, not only slowing performance but bringing memory allocation to a screeching halt.
    • Batch the creation of pedigree records from germplasm when no pedigree string came in.
    • Batch the creation of pedigree records from germplasm when a pedigree string came in.
    • Batch the creation of pedigree edges
    • Batch the creation of pedigree edges from parents
    • Batch the creation of pedigree edges from progeny (children)
  • I found that after these optimizations were made, an extraordinary number of deletions were occurring on the pedigree_node_external_references table, which brought the import to a screeching halt at scale. This turned out to be the UpdateUtility unnecessarily clearing out existing external references that in fact already matched the incoming ones. A check was put in place to prevent this from ever occurring, though I only applied it to the PedigreeNode code. Another solution could have been to stop setting pedigree external references entirely, as they likely do not need the same ones as germplasm. But this issue seemed so severe and worrisome for other entities that I thought it would be helpful to keep the exref-checking code for reuse.
  • After this discovery, I hit another sticking point: thousands of queries were being issued to convert the pedigree database data to DTOs, which were then completely dropped on the floor in the case of the Germplasm POST. I created some new method signatures to prevent the DTO conversion from ever happening for this use case. NOTE: This does not affect the data returned; the response only includes germplasm data, and I kept those conversions untouched.
  • Added batching spring.data properties to application.properties.template. These must be replicated to the application.properties of the relevant application/deployment, as they are imperative for the speedups relating to batch inserts.
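The template's exact entries aren't reproduced in this PR description; for reference, the standard Hibernate batching settings look roughly like the following (values illustrative; check application.properties.template for the actual entries):

```properties
# Illustrative Hibernate batching settings (check application.properties.template
# for the actual property names and values used by the project)
spring.jpa.properties.hibernate.jdbc.batch_size=100
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true
```

Without order_inserts/order_updates, Hibernate interleaves statements across entity types and the JDBC driver cannot coalesce them into batches.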
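The batch save/lookup pattern from the first bullet can be sketched in miniature. The in-memory map below stands in for the database, and each helper call models one SQL round trip; all names here are illustrative, not the actual bi-api code:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class BatchLookupSketch {
    // Stand-in for the database; each helper call below models one SQL query.
    static final Map<String, String> DB = Map.of("g1", "Germ One", "g2", "Germ Two");
    static int queryCount = 0;

    // One-by-one lookup: N records cost N queries.
    static Optional<String> findById(String id) {
        queryCount++;
        return Optional.ofNullable(DB.get(id));
    }

    // Batched lookup: N records cost a single IN (...) query.
    static Map<String, String> findByIdIn(Collection<String> ids) {
        queryCount++;
        return ids.stream()
                .filter(DB::containsKey)
                .collect(Collectors.toMap(Function.identity(), DB::get));
    }

    public static void main(String[] args) {
        List<String> ids = List.of("g1", "g2", "g3");

        queryCount = 0;
        ids.forEach(BatchLookupSketch::findById);
        System.out.println("one-by-one queries: " + queryCount); // 3

        queryCount = 0;
        Map<String, String> found = findByIdIn(ids);
        System.out.println("batched queries: " + queryCount);    // 1
        System.out.println("entities found: " + found.size());   // 2
    }
}
```

Scaled up, this shape is why the change cuts on the order of 2,000 queries per 1,000-record chunk: every per-entity find collapses into one IN-list query per entity type.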
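The external-reference fix described above boils down to a set-equality guard before any delete-and-reinsert. A minimal sketch (ExRef and needsUpdate are hypothetical names, not the actual UpdateUtility code):

```java
import java.util.*;

public class ExternalReferenceGuard {
    // Minimal stand-in for an external reference (source + reference id pair).
    record ExRef(String source, String referenceId) {}

    // True only when the incoming references actually differ from what is stored,
    // i.e. when a delete-and-reinsert is genuinely needed. Comparing as sets
    // treats "same refs in any order" as unchanged.
    static boolean needsUpdate(Collection<ExRef> existing, Collection<ExRef> incoming) {
        return !new HashSet<>(existing).equals(new HashSet<>(incoming));
    }

    public static void main(String[] args) {
        List<ExRef> stored = List.of(new ExRef("breedinginsight.org", "abc-123"));
        List<ExRef> unchanged = List.of(new ExRef("breedinginsight.org", "abc-123"));
        List<ExRef> changed = List.of(new ExRef("breedinginsight.org", "xyz-999"));

        System.out.println(needsUpdate(stored, unchanged)); // false -> skip the writes
        System.out.println(needsUpdate(stored, changed));   // true  -> rewrite the refs
    }
}
```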

Selected commit messages:

  • This may or may not improve things, needs testing
  • Replace pagination calls with normal queries
  • This method will help alleviate use cases during batch updates where external references were getting updated when they didn't need to. This created a large number of deletions and insertions, slowing batch updates down.
  • Made a new method putting relevant logic in the POST call. Not sure if PUT could use this one; would need to test.

@mlm483 mlm483 left a comment


I still need to test, but I wanted to give you my initial feedback first.

Thank you for helping us with this.

Comment on lines +175 to +194
public List<PedigreeNodeEntity> getPedigreeNodes(List<String> germplasmDbIds) {
    List<PedigreeNodeEntity> nodes = new ArrayList<>();

    // TODO: Might have to make a custom query for this that fetches the germ eagerly, bc need to compare the germIds. Have to see if this is a significant performance hit.
    List<PedigreeNodeEntity> dbNodeList = pedigreeRepository.findByGermplasm_IdIn(germplasmDbIds);

    Map<String, List<PedigreeNodeEntity>> nodesGroupedByGerm = dbNodeList.stream()
            .collect(Collectors.groupingBy(pn -> pn.getGermplasm().getId()));

    nodesGroupedByGerm.forEach((germId, nodesByGerm) -> {
        if (nodesByGerm.size() > 1) {
            log.error("multiple pedigree nodes found for a single germplasm");
        }

        Optional<PedigreeNodeEntity> node = nodesByGerm.stream().findFirst();

        node.ifPresent(nodes::add);
    });

    return nodes;
}

Can we add a unique constraint on the pedigree_node table for the germplasm_id column and simplify this method?

Suggested change (collapse the method above to a single repository call):

public List<PedigreeNodeEntity> getPedigreeNodes(List<String> germplasmDbIds) {
    return pedigreeRepository.findByGermplasm_IdIn(germplasmDbIds);
}

jloux-brapi (Collaborator, Author)

I think probably, but this is a question for @BrapiCoordinatorSelby. For what it's worth, the current lookup isn't all that inefficient.


// Find out which germIds were not found in the DB. Use a set for improved performance on the contains check.
// TODO: Check if the germEntity is already populated by getPedigreeNodes, and if this block results in more DB transactions.
Set<String> germIdsOfFoundNodes = dbNodes.stream()

I know it's longer, but germDbIdsOfFoundNodes might be more clear. Germplasm have so many identifiers, it's helpful to be specific.

jloux-brapi (Collaborator, Author)

I'm not sure I agree; entityNameId has meant the name of the entity's unique id at every place I have worked.

}
}

public List<PedigreeNode> convertFromGermplasmToPedigreeBatchUsingNames(List<Germplasm> germplasms)

I wonder if there will be issues using germplasm name for lookups, it isn't guaranteed to be unique. This code is only to support our import use case, right? In that case, names coming from DeltaBreed should be unique, but if it's used outside of that, it could be a problem.

jloux-brapi (Collaborator, Author)

Yeah, I have thought about that. That's kind of why the code works like this for now, as a stopgap:

	public void updateGermplasmPedigreeForPost(List<Germplasm> data) throws BrAPIServerException {
		// T = With pedigree, F = Without pedigree
		Map<Boolean, List<Germplasm>> germsWithAndWithoutPedigree
				= data.stream().collect(Collectors.partitioningBy(g -> g.getPedigree() != null));

		if (!germsWithAndWithoutPedigree.get(true).isEmpty()) {
			savePedigreeNodes(convertFromGermplasmToPedigreeBatchUsingNames(germsWithAndWithoutPedigree.get(true)), false);
		}

		if (!germsWithAndWithoutPedigree.get(false).isEmpty()) {
			List<PedigreeNode> createPedigreeNodes = new ArrayList<>();

			for (Germplasm germplasm : germsWithAndWithoutPedigree.get(false)) {
				createPedigreeNodes.add(convertFromGermplasmToPedigree(germplasm));
			}

			savePedigreeNodes(createPedigreeNodes, false);
		}
	}

If a germplasm comes in with a pedigree string, we create nodes from it in batch. If it doesn't, we do it the other way, and that isn't really too slow.

We honestly could have kept this unchanged (and it is unchanged via updateGermplasmPedigree for the PUT method) and it wouldn't have made much of a performance difference. But what I created makes explicit what was already happening under the hood: the code would ultimately look up by pedigree name if one existed. I just tried to make that much more obvious and to do it in batch, because the by-pedigree lookup was slower.
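The partitioning step in updateGermplasmPedigreeForPost above is just Collectors.partitioningBy; a standalone toy version (Germ is a stand-in for the BrAPI Germplasm class):

```java
import java.util.*;
import java.util.stream.Collectors;

public class PartitionSketch {
    // Toy stand-in for the BrAPI Germplasm model.
    record Germ(String name, String pedigree) {}

    // T = with pedigree string, F = without, mirroring the comment in the PR code.
    static Map<Boolean, List<Germ>> partition(List<Germ> data) {
        return data.stream().collect(Collectors.partitioningBy(g -> g.pedigree() != null));
    }

    public static void main(String[] args) {
        List<Germ> data = List.of(
                new Germ("A", "Mother/Father"),
                new Germ("B", null),
                new Germ("C", "Mother/"));

        Map<Boolean, List<Germ>> split = partition(data);
        System.out.println(split.get(true).size());  // 2 -> batched by-name path
        System.out.println(split.get(false).size()); // 1 -> per-record conversion
    }
}
```

partitioningBy always populates both keys, so the isEmpty() checks in the real method are safe even when every record lands on one side.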

List<String> parentEdgesToDelete = new ArrayList<>();

SearchQueryBuilder<PedigreeEdgeEntity> search = new SearchQueryBuilder<>(PedigreeEdgeEntity.class);
search.appendList(germIdsWithParentNodes, "connectedNode.germplasm.id");

Was this actually working with the correct spelling?

jloux-brapi (Collaborator, Author)

It was, yeah. I also noticed that it somehow worked without it 🤷. It's possible Hibernate might be able to detect or guesstimate, I don't know. Regardless, I will change it to the correct spelling.

jloux-brapi (Collaborator, Author)

Thinking on this more, it's very possible this method was never called in my testing. I'll breakpoint the next time I do a full test.

Comment on lines 830 to 831
List<GermplasmEntity> motherGerms = germplasmService.findByNames(new ArrayList<>(germsByPedigreeMother.values()));
List<GermplasmEntity> fatherGerms = germplasmService.findByNames(new ArrayList<>(germsByPedigreeFather.values()));

Consider adding error handling or logging for cases where the mother or father germplasm names are not found in the database.

jloux-brapi (Collaborator, Author)

Since I am doing this in batch, it's not exactly trivial to find each name that didn't match one by one without adding extra overhead. That said, it's certainly easy to warn if none of the names in the map exist in the DB, which would in general indicate a pretty large-scale issue.
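Such a coarse check might look like the following sketch (hypothetical names, not the actual import code):

```java
import java.util.*;

public class BatchNameLookupWarning {
    // After a batched name lookup, emit one coarse warning when nothing at all
    // matched, instead of checking every name individually (which would undo
    // the benefit of batching).
    static Optional<String> warnIfNoneFound(Collection<String> requested, Collection<String> found) {
        if (!requested.isEmpty() && found.isEmpty()) {
            return Optional.of("none of " + requested.size() + " parent germplasm names were found in the DB");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(warnIfNoneFound(List.of("Mother1", "Mother2"), List.of()));
        System.out.println(warnIfNoneFound(List.of("Mother1"), List.of("Mother1")));
    }
}
```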

@jloux-brapi jloux-brapi requested review from mlm483 and dmeidlin April 4, 2025 14:48
@jloux-brapi jloux-brapi changed the base branch from germ-search-opts to develop April 18, 2025 12:58
@mlm483 mlm483 left a comment


Tested, working.

@nickpalladino nickpalladino merged commit 399da44 into develop Apr 24, 2025