ALS-7737: Better support for splitting genomic data by chromosome #137

ramari16 · 2025-02-27T19:15:19Z

This PR adds better support for building a genomic dataset split by chromosome. Previously, in BDC, we ran NewVCFLoader once per chromosome, which was inefficient and would become even more so as we move to running jobs in Docker containers.

This new functionality lets us pass a single VCF index file to SplitChromosomeVcfLoader, which processes every VCF listed in it and stores the genomic dataset in the appropriate per-chromosome directories. SplitChromosomeVariantMetadataLoader takes a similar approach.
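As a rough illustration of the idea above, the sketch below groups the entries of a VCF index by chromosome and derives one output directory per chromosome. It is not the code from this PR: `VcfIndexEntry`, `groupByChromosome`, `outputDirectoryFor`, and the directory layout are hypothetical names and assumptions chosen for the example, and the real index file format is not modeled here.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SplitByChromosomeSketch {

    // Hypothetical, simplified view of one index row: a VCF file and the chromosome it covers.
    static final class VcfIndexEntry {
        final String vcfPath;
        final String chromosome;

        VcfIndexEntry(String vcfPath, String chromosome) {
            this.vcfPath = vcfPath;
            this.chromosome = chromosome;
        }
    }

    // Group entries so that each chromosome can be handled within a single run.
    static Map<String, List<VcfIndexEntry>> groupByChromosome(List<VcfIndexEntry> entries) {
        Map<String, List<VcfIndexEntry>> byChromosome = new TreeMap<>();
        for (VcfIndexEntry entry : entries) {
            byChromosome.computeIfAbsent(entry.chromosome, key -> new ArrayList<>()).add(entry);
        }
        return byChromosome;
    }

    // Assumed layout: one subdirectory per chromosome under a shared base directory.
    static Path outputDirectoryFor(Path baseDirectory, String chromosome) {
        return baseDirectory.resolve(chromosome);
    }

    public static void main(String[] args) {
        List<VcfIndexEntry> entries = List.of(
            new VcfIndexEntry("cohortA.chr1.vcf.gz", "chr1"),
            new VcfIndexEntry("cohortB.chr1.vcf.gz", "chr1"),
            new VcfIndexEntry("cohortA.chr2.vcf.gz", "chr2"));
        Path base = Path.of("genomic-data");
        groupByChromosome(entries).forEach((chromosome, files) ->
            System.out.println(outputDirectoryFor(base, chromosome) + " <- " + files.size() + " VCF file(s)"));
    }
}
```

Grouping up front means one process sees the whole index, rather than one process being launched per chromosome.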

Integration tests were also updated to use the GenomicProcessor implementation that works with a split-chromosome dataset. This uncovered several small bugs, which are fixed in this PR.

ramari16 · 2025-02-27T20:17:34Z

data/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/data/genotype/BucketIndexBySample.java

@@ -159,7 +159,11 @@ public Set<String> filterVariantSetForPatientSet(Set<String> variantSet, Collect
 			String bucketKey = variantSpec.split(",")[0] + ":" + (Integer.parseInt(variantSpec.split(",")[1])/1000);

 			//testBit uses inverted indexes include +2 offset for bookends
-			return _bucketMask.testBit(maxIndex - Collections.binarySearch(bucketList, bucketKey)  + 2);


This bug has been here forever. testBit(-1) sometimes did not return false because of overflows.
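For context, `Collections.binarySearch` returns `-(insertionPoint) - 1` when the key is absent, so feeding its result straight into the index arithmetic can select a bit outside the bucket range. The sketch below shows one way to guard against that. It is not the fix from this PR: `bucketBitIsSet` is a hypothetical helper, and it assumes the mask carries two bookend bits on each side of the bucket bits and that `maxIndex` is the last index of the bucket list.

```java
import java.math.BigInteger;
import java.util.Collections;
import java.util.List;

class BucketMaskSketch {

    // Hypothetical helper, not the PR's code. Assumes sortedBucketList is sorted and that
    // the mask is laid out as two bookend bits, one bit per bucket, two bookend bits.
    static boolean bucketBitIsSet(BigInteger bucketMask, List<String> sortedBucketList, String bucketKey) {
        int bucketIndex = Collections.binarySearch(sortedBucketList, bucketKey);
        if (bucketIndex < 0) {
            // Key not present: the negative return value is not a usable list index.
            return false;
        }
        int maxIndex = sortedBucketList.size() - 1;
        // Inverted index with a +2 offset for the bookends, as in the comment above.
        return bucketMask.testBit(maxIndex - bucketIndex + 2);
    }

    public static void main(String[] args) {
        List<String> buckets = List.of("1:10", "1:11", "2:5");
        // Bookends "11", then one bit per bucket ("101"), then bookends "11".
        BigInteger mask = new BigInteger("11" + "101" + "11", 2);

        System.out.println(bucketBitIsSet(mask, buckets, "1:10"));
        System.out.println(bucketBitIsSet(mask, buckets, "1:11"));
        System.out.println(bucketBitIsSet(mask, buckets, "0:1"));

        // Unguarded arithmetic for comparison: a missing key sorted before every bucket
        // makes binarySearch return -1, and the computed position lands on a bookend bit.
        int raw = Collections.binarySearch(buckets, "0:1");
        System.out.println(mask.testBit((buckets.size() - 1) - raw + 2));
    }
}
```

Checking the sign of the search result before doing any arithmetic keeps a missing bucket from ever being reported as present.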

ramari16 · 2025-02-27T20:21:27Z

.../main/java/edu/harvard/hms/dbmi/avillach/hpds/data/genotype/util/ReSplitMergeInfoStores.java

 import java.io.IOException;

-import static edu.harvard.hms.dbmi.avillach.hpds.etl.genotype.NewVCFLoader.convertInfoStoresToByteIndexed;

 public class ReSplitMergeInfoStores {


I don't think this is used. I will verify and delete it.

ramari16 · 2025-02-27T20:22:23Z

etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/genotype/NewVCFLoader.java

@@ -22,57 +22,77 @@

 public class NewVCFLoader {


All changes in this file are to support extending it and removing all the static nonsense

ramari16 · 2025-02-27T20:47:09Z

etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/genotype/VCFIndexBuilder.java

+
+    private static final Joiner COMMA_JOINER = Joiner.on(",");
+
+    public VCFIndexBuilder(File vcfPatientMappingFile, File patientUUIDToIdMappingFile, File vcfIndexOutputDirectory, Set<String> validPatientType) {


This is a BCH-specific utility class. I will add some documentation explaining exactly which inputs are needed to create a VCF file, but I doubt it will be useful to anyone else.

ramari16 added 5 commits February 14, 2025 09:48

Create VCFIndexBulider, add logging to NewVcfLoader

337f7e9

ALS-7737: Refactor NewVCFLoader to better support running in parallel

e585887

ALS-7737: Add VcfLoader that splits by chromosome. todo: variant metadata in a similar fashion

ea8846a

ALS-7737: Add VariantMetadataLoader that loads per chromosome

e6338be

ALS-7737: Fix bugs discovered by splitting genomic data by chromosome

04e47bf

ramari16 added the bug and enhancement labels and removed the bug label Feb 27, 2025

Remove duplicate code

0d6ecc5

ramari16 added 5 commits February 27, 2025 15:58

Refactor genomic config

cb17239

ALS-7737: Cleanup hard to follow loops

ac16836

Update build references for new vcf loader

7cd7900

Fix typo

da65d98

Update genomic config for bch

73f4c63

Luke-Sikina force-pushed the ALS-7737 branch from bf5971d to cfd1102 March 4, 2025 15:57

toggle filesharing

b91771f

Luke-Sikina force-pushed the ALS-7737 branch from cfd1102 to b91771f March 4, 2025 16:21

ramari16 added 2 commits March 4, 2025 13:28

Fix spring config

6a0cbfd

Fix tests

21ef8ae
