The following assumes that the array data is dbGaP fingerprint files with 10,000 SNPs.
- Create directory for this study under
  . This is where we store results associated with the array data and not with any particular freeze.
- Convert tped to bed with <file>
  . The bed file will be written in the same directory as the tped (under /projects/topmed/downloaded_data/prior_array_data/
  ).
- If necessary, liftover the .bim file to match the sequence build with
  .
- Convert bed to gds with
  qsub -N bed2gds bed2gds.R <config>
  . An AnnotatedDataFrame with sample annotation will be created also.
If we have an alternate array source, adapt the steps above to produce a GDS file in the same build as the sequence data, and an AnnotatedDataFrame with columns "" and "". "" should be unique, and "" should match the "submitted_subject_id" in the TOPMed sample annotation.
- Create directory for this study and the specific freeze to be checked under
  .
- Subset sequencing GDS to overlapping variants and merge chromosomes: <config>

  | config parameter | default value | description |
  |---|---|---|
  | out_prefix | | Prefix for files created by this script. |
  | array_gds_file | | Path to array GDS file. |
  | seq_gds_file | | Path to sequencing GDS file. |
  | seq_annot_file | | Path to sequencing sample annotation file. |
  | subset_gds_file | | Path to output file. |
  | study | | Study name. |
  | maf_threshold | | Minimum MAF for sequence variants to include. |
  | missing_threshold | | Maximum missing call rate for sequence variants to include. |

- If we have an alternate array source, check the overlap between fingerprints and the subset file with
  . If the number of overlapping variants is much less than 10,000, the code will supplement with a randomly selected set of variants that overlap between the sequence and array data, then combine these variants and the fingerprint variants in a GRanges object.
- In the following steps, use a config file that sets
  , or to the alternate include file defined in the previous step.
- Run duplicate discordance with -n N <config>
  where N is the number of sample blocks to run in parallel.

  | config parameter | default value | description |
  |---|---|---|
  | out_prefix | | Prefix for files created by this script. |
  | array_gds_file | | Path to array GDS file. |
  | array_annot_file | | Path to array sample annotation file. |
  | seq_gds_file | | Path to sequencing GDS file (same as subset_gds_file). |
  | seq_annot_file | | Path to sequencing sample annotation file. |
  | study | | Study name. |
  | granges_include_file | | RData file with GRanges defining subset of variants. |
  | sample_include_file | | RData file with samples to include, if different from all samples in study. |
  | | | RData file with variants to include (result will be the intersection with granges_include_file). |

- If any samples were discordant, save an RData file with a vector of those sample ids. Check them against all other array samples with
  qsub -t 1-N -N match_samples -s array_match_sample.R <config>
  where N is the number of samples to match.
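
The maf_threshold and missing_threshold parameters of the subsetting step amount to a per-variant filter. The pipeline itself runs R scripts on GDS files; the Python sketch below only illustrates the filter logic on a small in-memory genotype matrix. The function names, the 0/1/2/None genotype coding, the inclusive comparisons, and the example thresholds are all assumptions made for illustration and are not taken from the pipeline.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # alternate-allele count 0, 1 or 2; None = missing call


def variant_passes(genotypes: Sequence[Genotype],
                   maf_threshold: float,
                   missing_threshold: float) -> bool:
    """Return True if one variant meets both thresholds.

    Assumption: both comparisons are inclusive (MAF >= threshold,
    missing call rate <= threshold). The real scripts may differ.
    """
    n_total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_total == 0 or not called:
        return False  # nothing to estimate a frequency from
    missing_rate = 1.0 - len(called) / n_total
    alt_freq = sum(called) / (2 * len(called))   # two alleles per sample
    maf = min(alt_freq, 1.0 - alt_freq)          # minor allele frequency
    return maf >= maf_threshold and missing_rate <= missing_threshold


def filter_variants(matrix: Sequence[Sequence[Genotype]],
                    maf_threshold: float,
                    missing_threshold: float) -> List[int]:
    """Return the row indices (variants) that pass the filter."""
    return [i for i, row in enumerate(matrix)
            if variant_passes(row, maf_threshold, missing_threshold)]


# Toy data: rows are variants, columns are samples.
example_matrix = [
    [0, 1, 2, 1],           # common variant, no missing calls
    [0, 0, 0, 0],           # monomorphic, MAF = 0
    [1, None, None, None],  # mostly missing
    [0, 0, 1, None],        # one missing call out of four
    [2, 2, 1, 2],           # alternate allele is the major allele
]

# Hypothetical thresholds, chosen only for this example.
kept = filter_variants(example_matrix, maf_threshold=0.05, missing_threshold=0.1)
print(kept)
```

Taking the minimum of the alternate and reference frequencies keeps the filter symmetric, so a variant whose alternate allele is the common one is not discarded by mistake.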
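
The overlap check for an alternate array source tops up the fingerprint variants with randomly chosen variants shared by the sequence and array data. The following is a minimal Python sketch of that idea, not the pipeline's R code. Variants are represented as (chromosome, position) tuples standing in for GRanges entries. The function name, the fixed seed, the choice to keep only fingerprint variants found in the sequencing subset, and the simplification of topping up whenever the overlap is below the target (the text says "much less than 10,000") are assumptions.

```python
import random
from typing import Iterable, List, Tuple

Variant = Tuple[str, int]  # (chromosome, position); stand-in for a GRanges entry


def build_include_set(fingerprint: Iterable[Variant],
                      seq_variants: Iterable[Variant],
                      array_variants: Iterable[Variant],
                      target: int = 10_000,
                      seed: int = 0) -> List[Variant]:
    """Combine overlapping fingerprint variants with random supplements."""
    seq = set(seq_variants)
    # Fingerprint variants that are present in the sequencing subset.
    fp_overlap = set(fingerprint) & seq
    # Remaining variants shared by sequence and array data are candidates.
    candidates = sorted((seq & set(array_variants)) - fp_overlap)
    # Assumption: top up to `target` whenever the overlap falls short.
    n_needed = max(0, target - len(fp_overlap))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    extra = rng.sample(candidates, min(n_needed, len(candidates)))
    return sorted(fp_overlap | set(extra))


# Toy example with a target of 3 instead of 10,000.
fingerprint = [("1", 100), ("1", 200), ("2", 300)]
seq_variants = [("1", 100), ("1", 150), ("2", 400), ("2", 500)]
array_variants = [("1", 100), ("1", 150), ("2", 400), ("2", 500), ("3", 1)]

include = build_include_set(fingerprint, seq_variants, array_variants, target=3)
print(include)
```

Sorting the candidates before sampling makes the result depend only on the seed, not on set iteration order.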
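
The last two steps compare genotypes between sequencing and array data: first for the same subject (duplicate discordance), then, for any discordant sample, against every other array sample to look for a better match. The Python sketch below illustrates that comparison under stated assumptions; it is not the logic of the pipeline's R scripts. The function names, the 0/1/2/None genotype coding, and the decision to skip variants with a missing call in either sample are hypothetical choices.

```python
from typing import Dict, Optional, Sequence, Tuple

Genotype = Optional[int]  # alternate-allele count 0, 1 or 2; None = missing call


def discordance(a: Sequence[Genotype], b: Sequence[Genotype]) -> Optional[float]:
    """Fraction of variants called in both samples whose genotypes differ.

    Both vectors are assumed to list the same variants in the same order.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no variant is called in both samples
    return sum(x != y for x, y in pairs) / len(pairs)


def best_array_match(seq_geno: Sequence[Genotype],
                     array_genos: Dict[str, Sequence[Genotype]]
                     ) -> Tuple[Optional[str], Optional[float]]:
    """Return the array sample id with the lowest discordance, and that rate."""
    best_id, best_rate = None, None
    for sample_id, geno in array_genos.items():
        rate = discordance(seq_geno, geno)
        if rate is not None and (best_rate is None or rate < best_rate):
            best_id, best_rate = sample_id, rate
    return best_id, best_rate


# Toy data: one sequenced sample and three array samples (hypothetical ids).
seq_sample = [0, 1, 2, 1, None, 0]
array_samples = {
    "A1": [0, 1, 2, 1, 1, 0],
    "A2": [2, 1, 0, 0, 0, None],
    "A3": [None, None, None, None, None, None],
}

print(discordance(seq_sample, array_samples["A2"]))
print(best_array_match(seq_sample, array_samples))
```

A sequenced sample whose lowest-discordance array sample is not its own array record is a candidate sample swap, which is the situation the matching step is meant to surface.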