Make sure you have the environment set up (see ../README.md) and that you have extracted the FASTA file:
zcat ../1-predict/input/hg19.chr22.fa.gz > ../1-predict/input/hg19.chr22.fa
Have a look at the file clinvar_20180429.pathogenic.chr22.vcf.gz:
zless -S input/clinvar_20180429.pathogenic.chr22.vcf.gz
This file contains genetic variants from the ClinVar database. We filtered the original ClinVar VCF file to chromosome 22 and kept only pathogenic variants.
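If you prefer to inspect the file from Python rather than with zless, the following is a minimal sketch using only the standard library. The function name summarize_vcf is made up for this illustration; the code only assumes the usual VCF layout (header lines starting with #, tab-separated records with the chromosome in the first column).

```python
import gzip
from collections import Counter


def summarize_vcf(path):
    """Count variant records per chromosome in a gzipped VCF file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header lines start with '#'; blank lines are skipped as well.
            if line.startswith("#") or not line.strip():
                continue
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts
```

Calling summarize_vcf("input/clinvar_20180429.pathogenic.chr22.vcf.gz") should report records for a single chromosome only, since the file was filtered to chromosome 22.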
Let's score the impact of these genetic variants on different molecular phenotypes (e.g. transcription factor binding affinity or DNA accessibility) using models in Kipoi.
First, activate the right environment:
source activate `kipoi env get DeepBind`
mkdir -p output
Next, run model predictions for sequences containing the reference allele and the alternative allele, and also write the difference between the two predictions to a file:
kipoi veff score_variants DeepBind/Homo_sapiens/TF/D00328.018_ChIP-seq_CTCF \
--dataloader_args='{"fasta_file": "input/hg19.chr22.fa"}' \
-i input/clinvar_20180429.pathogenic.chr22.vcf.gz \
-s ref alt diff \
-o /tmp/annotated.vcf
Let's investigate the results:
less -S /tmp/annotated.vcf
As you can see, new entries were added to the INFO field of the vcf:
##INFO=<ID=KV:kipoi:DeepBind/Homo_sapiens/TF/D00328.018_ChIP_seq_CTCF:REF,...
##INFO=<ID=KV:kipoi:DeepBind/Homo_sapiens/TF/D00328.018_ChIP_seq_CTCF:ALT,...
##INFO=<ID=KV:kipoi:DeepBind/Homo_sapiens/TF/D00328.018_ChIP_seq_CTCF:DIFF,...
##INFO=<ID=KV:kipoi:DeepBind/Homo_sapiens/TF/D00328.018_ChIP_seq_CTCF:rID,...
REF, ALT and DIFF correspond to the scoring functions specified with -s ref alt diff.
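To pull these annotations out of a single record programmatically, a sketch along the following lines may help. It relies only on the generic VCF convention that INFO is the eighth tab-separated column and holds key=value entries separated by semicolons. The function name kipoi_info_fields is hypothetical, and the values are returned as raw strings because the exact encoding written by kipoi-veff is not assumed here.

```python
def kipoi_info_fields(vcf_line, prefix="KV:"):
    """Return INFO entries of one VCF record whose key starts with `prefix`."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = columns[7]  # INFO is the 8th column of a VCF record
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key.startswith(prefix):
            # Flag-style entries have no '='; store None for them.
            fields[key] = value if sep else None
    return fields
```

Applied to each non-header line of /tmp/annotated.vcf, this returns a dictionary keyed by the full INFO ID, so the REF, ALT and DIFF entries can be told apart by their suffix.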
Let's write the scores to a tsv file instead of the vcf:
kipoi veff score_variants DeepBind/Homo_sapiens/TF/D00328.018_ChIP-seq_CTCF \
--dataloader_args='{"fasta_file": "input/hg19.chr22.fa"}' \
-i input/clinvar_20180429.pathogenic.chr22.vcf.gz \
-s ref alt diff \
-e /tmp/annotated.tsv
less -S /tmp/annotated.tsv
- Use the Basset model to run model predictions.
- Write only the predictions for A549 of the Basset model. Hint: use kipoi veff score_variants --help and --model_outputs.
- Run variant effect predictions from Python:
import kipoi_veff.snv_predict as sp
sp.score_variants(model='Basset',
                  dl_args={'fasta_file': 'input/hg19.chr22.fa'},
                  input_vcf='input/clinvar_20180429.pathogenic.chr22.vcf.gz',
                  output_vcf='/tmp/py-annotated.vcf')
Now, let's run model predictions in parallel. We'll use Snakemake for this.
First, explore the Snakefile.
Next, run:
snakemake -j 5
This will run variant effect prediction for many different models. The -j 5 option runs up to 5 jobs in parallel.
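To make the idea of "one scoring job per model, several at a time" concrete, here is a rough Python sketch. It is not a description of the Snakefile, which also handles things a plain script does not, such as tracking which outputs already exist. The command line is assembled from the flags used earlier in this tutorial. The function names and the output file naming scheme are hypothetical, and the sketch assumes every listed model can run in the currently active environment.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_score_command(model, fasta_file, input_vcf, output_vcf,
                        scores=("ref", "alt", "diff")):
    """Assemble the `kipoi veff score_variants` call used in this tutorial."""
    return [
        "kipoi", "veff", "score_variants", model,
        "--dataloader_args=" + json.dumps({"fasta_file": fasta_file}),
        "-i", input_vcf,
        "-s", *scores,
        "-o", output_vcf,
    ]


def score_models(models, fasta_file, input_vcf, output_dir, jobs=5):
    """Run one scoring command per model, at most `jobs` at a time."""
    commands = [
        build_score_command(
            model, fasta_file, input_vcf,
            # Hypothetical naming scheme: one VCF per model under output_dir.
            f"{output_dir}/{model.replace('/', '_')}.vcf",
        )
        for model in models
    ]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # check=True raises CalledProcessError if a command fails.
        return list(pool.map(lambda cmd: subprocess.run(cmd, check=True),
                             commands))
```

Threads are sufficient here because each job is an external process; the Python side only waits for it to finish.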
Now that the predictions have been written to output/, let's load them into Python, join them into a table and do a simple analysis. Go through the load-visualize.ipynb notebook.
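As a rough idea of what "joining into a table" can look like, here is a standard-library sketch that merges several tab-separated score files on a shared key column. The notebook may well do this differently (for example with pandas). The function name join_score_tables is made up, and the name of the key column is left as a parameter because the actual column names of the output files are not assumed here.

```python
import csv
from pathlib import Path


def join_score_tables(paths, key_column):
    """Join several TSV files on `key_column`, prefixing columns by file name."""
    joined = {}
    for path in map(Path, paths):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                key = row.pop(key_column)
                target = joined.setdefault(key, {})
                for column, value in row.items():
                    # Prefix with the file stem so columns from different
                    # models do not overwrite each other.
                    target[f"{path.stem}:{column}"] = value
    return joined
```

The result behaves like an outer join: a variant that appears in only some of the files simply has fewer columns in its entry.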
- kipoi-veff repository
- kipoi-veff documentation
- Notebooks:
Next step: 3-interpret