ALS-7737: Better support for splitting genomic data by chromosome #137
base: main
…data in a similar fashion
@@ -159,7 +159,11 @@ public Set<String> filterVariantSetForPatientSet(Set<String> variantSet, Collect
String bucketKey = variantSpec.split(",")[0] + ":" + (Integer.parseInt(variantSpec.split(",")[1])/1000);

//testBit uses inverted indexes include +2 offset for bookends
return _bucketMask.testBit(maxIndex - Collections.binarySearch(bucketList, bucketKey) + 2);
This bug has existed since forever. testBit(-1) was sometimes not returning false because of overflows.
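For readers following the comment above, here is a minimal sketch of one plausible guard. It is not necessarily the change made in this PR, since the added lines are not shown in this excerpt. `Collections.binarySearch` returns a negative value, `-(insertionPoint) - 1`, when the key is absent, so feeding that result straight into the index arithmetic produces a bit position that does not correspond to any bucket. The class name `BucketMaskSketch`, the helper `bucketBitSet`, and the assumption that `maxIndex` equals `bucketList.size() - 1` are all hypothetical.

```java
import java.math.BigInteger;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class BucketMaskSketch {

    // Returns true only when bucketKey is present in the sorted bucketList
    // and its corresponding bit is set in bucketMask.
    static boolean bucketBitSet(BigInteger bucketMask, List<String> bucketList, String bucketKey) {
        // Assumption: maxIndex is the last valid index of bucketList.
        int maxIndex = bucketList.size() - 1;

        // binarySearch returns a negative value when the key is not found.
        int found = Collections.binarySearch(bucketList, bucketKey);
        if (found < 0) {
            // Unknown bucket: never consult the mask with a derived index.
            return false;
        }

        // Inverted index with a +2 offset for the bookend bits, as in the diff.
        return bucketMask.testBit(maxIndex - found + 2);
    }
}
```

With a sorted list of `"1:10"`, `"1:11"`, `"2:5"`, a lookup of the missing key `"0:1"` makes `binarySearch` return `-1`. Without the guard the computed bit index would be `2 - (-1) + 2 = 5`, which can land on a bit that happens to be set.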
import java.io.IOException;

import static edu.harvard.hms.dbmi.avillach.hpds.etl.genotype.NewVCFLoader.convertInfoStoresToByteIndexed;

public class ReSplitMergeInfoStores {
I don't think this is used, will verify and delete
@@ -22,57 +22,77 @@

public class NewVCFLoader {
All changes in this file are to support extending it and removing all the static nonsense
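To illustrate why moving from static to instance methods helps here, the sketch below shows the general pattern: an instance method can be overridden by a subclass, whereas a static method cannot. The class names `BaseLoaderSketch` and `SplitByChromosomeLoaderSketch` and the method `storageDirFor` are hypothetical and do not reflect the actual `NewVCFLoader` API.

```java
import java.io.File;

// Hypothetical stand-in for a loader whose storage location is decided
// by an overridable instance method instead of static state.
class BaseLoaderSketch {
    protected final File storageRoot;

    BaseLoaderSketch(File storageRoot) {
        this.storageRoot = storageRoot;
    }

    // Default behavior: every contig is stored under the same root.
    protected File storageDirFor(String contig) {
        return storageRoot;
    }
}

// Hypothetical subclass that routes each contig to its own subdirectory.
class SplitByChromosomeLoaderSketch extends BaseLoaderSketch {
    SplitByChromosomeLoaderSketch(File storageRoot) {
        super(storageRoot);
    }

    @Override
    protected File storageDirFor(String contig) {
        return new File(storageRoot, contig);
    }
}
```

If `storageDirFor` were static, the subclass could only hide it, and any caller inside the base class would keep using the base behavior.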
private static final Joiner COMMA_JOINER = Joiner.on(",");

public VCFIndexBuilder(File vcfPatientMappingFile, File patientUUIDToIdMappingFile, File vcfIndexOutputDirectory, Set<String> validPatientType) {
This is a BCH-specific utility class. I will add some documentation to explain exactly the inputs needed to create a VCF file, but I doubt it will be useful for anyone else.
This PR adds better support for building a genomic dataset split by chromosome. Previously, in BDC, we ran the `NewVCFLoader` for each chromosome, which was not as efficient and would be even less so as we move to running jobs in Docker containers.

This new functionality allows us to pass a single VCF index file to `SplitChromosomeVcfLoader`, which processes all VCFs in it and stores the genomic dataset in the appropriate directories, split by chromosome. A similar approach was taken for `SplitChromosomeVariantMetadataLoader`.

Integration tests were also updated to use the implementation of `GenomicProcessor` that uses a split-chromosome dataset. This uncovered several small bugs, which are fixed in this PR.