Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ALS-7737: Better support for splitting genomic data by chromosome #137

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

ramari16
Copy link
Contributor

@ramari16 ramari16 commented Feb 27, 2025

This PR adds better support for building a genomic dataset split by chromosome. Previously, in BDC, we ran the NewVCFLoader for each chromosome, which was not as efficient and would be even less so as we move to running jobs in docker containers.

This new functionality allows us to pass a single VCF index file to SplitChromosomeVcfLoader and it will process all VCFs in it and store the genomic dataset in appropriate directories split by chromosome. A similar approach was done for SplitChromosomeVariantMetadataLoader.

Integration tests were also updated to use the implementation of GenomicProcessor using a split-chromosome dataset. This uncovered several small bugs which are fixed in this PR

@ramari16 ramari16 added bug Something isn't working enhancement New feature or request and removed bug Something isn't working labels Feb 27, 2025
@@ -159,7 +159,11 @@ public Set<String> filterVariantSetForPatientSet(Set<String> variantSet, Collect
String bucketKey = variantSpec.split(",")[0] + ":" + (Integer.parseInt(variantSpec.split(",")[1])/1000);

//testBit uses inverted indexes include +2 offset for bookends
return _bucketMask.testBit(maxIndex - Collections.binarySearch(bucketList, bucketKey) + 2);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug has existed since forever. testBit(-1) was sometimes returning not false because of overflows

import java.io.IOException;

import static edu.harvard.hms.dbmi.avillach.hpds.etl.genotype.NewVCFLoader.convertInfoStoresToByteIndexed;

public class ReSplitMergeInfoStores {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is used, will verify and delete

@@ -22,57 +22,77 @@

public class NewVCFLoader {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in this file are to support extending it and removing all the static nonsense


private static final Joiner COMMA_JOINER = Joiner.on(",");

public VCFIndexBuilder(File vcfPatientMappingFile, File patientUUIDToIdMappingFile, File vcfIndexOutputDirectory, Set<String> validPatientType) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a BCH specific utility class. I will add some documentation to explain exactly the inputs to create a vcf file but I doubt it will be useful for anyone else

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant