Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Breaking down the PCA user story #4

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
635 changes: 635 additions & 0 deletions stories/pop-structure-pca/000-Data-Structure.ipynb

Large diffs are not rendered by default.

505 changes: 505 additions & 0 deletions stories/pop-structure-pca/001-Exploratory-Statistics-Attributes.ipynb

Large diffs are not rendered by default.

Large diffs are not rendered by default.

329 changes: 329 additions & 0 deletions stories/pop-structure-pca/003-Filter-Variants.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,329 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filter Variants\n",
"\n",
"This comes from the example notebook and the [tour of scikit-allel](http://alimanfoo.github.io/2016/06/10/scikit-allel-tour.html) blog post.\n",
"\n",
"There are many possible approaches to filtering variants. The simplest approach is define thresholds on variant attributes like DP, MQ and QD, and exclude SNPs that fall outside of a defined range (a.k.a. “hard filtering”). This is crude but simple to implement and in many cases may suffice, at least for an initial exploration of the data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import zarr\n",
"import pandas as pd\n",
"import dask.array as da\n",
"import allel\n",
"import scipy\n",
"from pprint import pprint\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"sns.set_style('white')\n",
"sns.set_style('ticks')\n",
"sns.set_context('notebook')\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"distributed.scheduler - INFO - Clear task state\n",
"/opt/conda/lib/python3.7/site-packages/distributed/dashboard/core.py:72: UserWarning: \n",
"Port 8787 is already in use. \n",
"Perhaps you already have a cluster running?\n",
"Hosting the diagnostics dashboard on a random port instead.\n",
" warnings.warn(\"\\n\" + msg)\n",
"distributed.scheduler - INFO - Scheduler at: tcp://10.35.63.92:42375\n",
"distributed.scheduler - INFO - dashboard at: :39099\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "254b2ce3c1b44817979da34c174e4129",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(HTML(value='<h2>KubeCluster</h2>'), HBox(children=(HTML(value='\\n<div>\\n <style scoped>\\n .…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from dask_kubernetes import KubeCluster\n",
"cluster = KubeCluster(n_workers=30)\n",
"cluster"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import the Variant Data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import gcsfs\n",
"\n",
"gcs_bucket_fs = gcsfs.GCSFileSystem(project='malariagen-jupyterhub', token='anon', access='read_only')\n",
"\n",
"storage_path = 'ag1000g-release/phase2.AR1/variation/main/zarr/pass/ag1000g.phase2.ar1.pass'\n",
"store = gcsfs.mapping.GCSMap(storage_path, gcs=gcs_bucket_fs, check=False, create=False)\n",
"callset = zarr.Group(store)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"allel allel-DisplayAsTable\"><span>&lt;VariantChunkedTable shape=(10640388,) dtype=[('POS', '&lt;i4'), ('REF', 'S1'), ('ALT', 'S1', (3,)), ('DP', '&lt;i4'), ('MQ', '&lt;f4'), ('QD', '&lt;f4')]\n",
" nbytes=202.9M cbytes=-6 cratio=-35467960.0\n",
" values=zarr.hierarchy.Group&gt;</span><table><thead><tr><th></th><th style=\"text-align: center\">POS</th><th style=\"text-align: center\">REF</th><th style=\"text-align: center\">ALT</th><th style=\"text-align: center\">DP</th><th style=\"text-align: center\">MQ</th><th style=\"text-align: center\">QD</th></tr></thead><tbody><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">0</th><td style=\"text-align: center\">9790</td><td style=\"text-align: center\">b'C'</td><td style=\"text-align: center\">[b'T' b'' b'']</td><td style=\"text-align: center\">35484</td><td style=\"text-align: center\">54.96</td><td style=\"text-align: center\">14.26</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">1</th><td style=\"text-align: center\">9791</td><td style=\"text-align: center\">b'G'</td><td style=\"text-align: center\">[b'T' b'' b'']</td><td style=\"text-align: center\">35599</td><td style=\"text-align: center\">55.0</td><td style=\"text-align: center\">20.52</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">2</th><td style=\"text-align: center\">9798</td><td style=\"text-align: center\">b'G'</td><td style=\"text-align: center\">[b'A' b'' b'']</td><td style=\"text-align: center\">35561</td><td style=\"text-align: center\">55.01</td><td style=\"text-align: center\">13.74</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">...</th><td style=\"text-align: center\" colspan=\"7\">...</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">10640385</th><td style=\"text-align: center\">41956541</td><td style=\"text-align: center\">b'C'</td><td style=\"text-align: center\">[b'A' b'' b'']</td><td style=\"text-align: center\">40185</td><td style=\"text-align: center\">57.63</td><td style=\"text-align: center\">30.28</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">10640386</th><td style=\"text-align: center\">41956551</td><td style=\"text-align: center\">b'G'</td><td style=\"text-align: center\">[b'A' b'' b'']</td><td style=\"text-align: center\">39819</td><td style=\"text-align: center\">58.01</td><td style=\"text-align: center\">8.53</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">10640387</th><td style=\"text-align: center\">41956556</td><td style=\"text-align: center\">b'T'</td><td style=\"text-align: center\">[b'A' b'C' b'']</td><td style=\"text-align: center\">39174</td><td style=\"text-align: center\">58.37</td><td style=\"text-align: center\">32.66</td></tr></tbody></table></div>"
],
"text/plain": [
"<VariantChunkedTable shape=(10640388,) dtype=[('POS', '<i4'), ('REF', 'S1'), ('ALT', 'S1', (3,)), ('DP', '<i4'), ('MQ', '<f4'), ('QD', '<f4')]\n",
" nbytes=202.9M cbytes=-6 cratio=-35467960.0\n",
" values=zarr.hierarchy.Group>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chrom = '3L'\n",
"variants = allel.VariantChunkedTable(callset[chrom]['variants'], \n",
" names=['POS', 'REF', 'ALT', 'DP', 'MQ', 'QD'],\n",
" index='POS')\n",
"variants"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filter using a Hard Expression\n",
"\n",
"Define the hard filter using an expression. This is just a string of Python code, which we will evaluate in a moment."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"filter_expression = '(QD > 5) & (MQ > 40) & (DP > 15000) & (DP < 30000)'"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([False, False, False, ..., False, False, False])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"variant_selection = variants.eval(filter_expression)[:]\n",
"variant_selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many variants do we want to keep?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"304050"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Number of variants we're keeping based on filter_expression criteria\n",
"np.count_nonzero(variant_selection)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many variants do we filter out?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10336338"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"distributed.scheduler - INFO - Register tcp://10.35.87.2:38715\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.87.2:38715\n",
"distributed.core - INFO - Starting established connection\n",
"distributed.scheduler - INFO - Register tcp://10.35.67.2:36717\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.67.2:36717\n",
"distributed.core - INFO - Starting established connection\n"
]
}
],
"source": [
"# Number of variants we're tossing based on filter_expression criteria\n",
"np.count_nonzero(~variant_selection)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our variant filter, let’s make a new variants table with only rows for variants that pass our filter."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"distributed.scheduler - INFO - Register tcp://10.35.93.2:44567\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.93.2:44567\n",
"distributed.core - INFO - Starting established connection\n",
"distributed.scheduler - INFO - Register tcp://10.35.92.2:35361\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.92.2:35361\n",
"distributed.core - INFO - Starting established connection\n"
]
},
{
"data": {
"text/html": [
"<div class=\"allel allel-DisplayAsTable\"><span>&lt;VariantChunkedTable shape=(304050,) dtype=[('POS', '&lt;i4'), ('REF', 'S1'), ('ALT', 'S1', (3,)), ('DP', '&lt;i4'), ('MQ', '&lt;f4'), ('QD', '&lt;f4')]\n",
" nbytes=5.8M cbytes=3.3M cratio=1.7\n",
" values=allel.chunked.storage_zarr.ZarrTable&gt;</span><table><thead><tr><th></th><th style=\"text-align: center\">POS</th><th style=\"text-align: center\">REF</th><th style=\"text-align: center\">ALT</th><th style=\"text-align: center\">DP</th><th style=\"text-align: center\">MQ</th><th style=\"text-align: center\">QD</th></tr></thead><tbody><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">0</th><td style=\"text-align: center\">65172</td><td style=\"text-align: center\">b'T'</td><td style=\"text-align: center\">[b'A' b'' b'']</td><td style=\"text-align: center\">29076</td><td style=\"text-align: center\">43.68</td><td style=\"text-align: center\">9.71</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">1</th><td style=\"text-align: center\">80433</td><td style=\"text-align: center\">b'G'</td><td style=\"text-align: center\">[b'A' b'' b'']</td><td style=\"text-align: center\">29345</td><td style=\"text-align: center\">59.34</td><td style=\"text-align: center\">17.23</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">2</th><td style=\"text-align: center\">80434</td><td style=\"text-align: center\">b'T'</td><td style=\"text-align: center\">[b'C' b'' b'']</td><td style=\"text-align: center\">29141</td><td style=\"text-align: center\">59.34</td><td style=\"text-align: center\">11.88</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">...</th><td style=\"text-align: center\" colspan=\"7\">...</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">304047</th><td style=\"text-align: center\">41949138</td><td style=\"text-align: center\">b'G'</td><td style=\"text-align: center\">[b'T' b'' b'']</td><td style=\"text-align: center\">28489</td><td style=\"text-align: center\">57.36</td><td style=\"text-align: center\">9.42</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">304048</th><td style=\"text-align: center\">41949139</td><td style=\"text-align: center\">b'A'</td><td style=\"text-align: center\">[b'C' b'' b'']</td><td style=\"text-align: center\">28453</td><td style=\"text-align: center\">57.37</td><td style=\"text-align: center\">17.56</td></tr><tr><th style=\"text-align: center; background-color: white; border-right: 1px solid black; \">304049</th><td style=\"text-align: center\">41949142</td><td style=\"text-align: center\">b'G'</td><td style=\"text-align: center\">[b'A' b'' b'']</td><td style=\"text-align: center\">29176</td><td style=\"text-align: center\">57.39</td><td style=\"text-align: center\">12.18</td></tr></tbody></table></div>"
],
"text/plain": [
"<VariantChunkedTable shape=(304050,) dtype=[('POS', '<i4'), ('REF', 'S1'), ('ALT', 'S1', (3,)), ('DP', '<i4'), ('MQ', '<f4'), ('QD', '<f4')]\n",
" nbytes=5.8M cbytes=3.3M cratio=1.7\n",
" values=allel.chunked.storage_zarr.ZarrTable>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"distributed.scheduler - INFO - Register tcp://10.35.69.2:34855\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.69.2:34855\n",
"distributed.core - INFO - Starting established connection\n",
"distributed.scheduler - INFO - Register tcp://10.35.85.2:36335\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.85.2:36335\n",
"distributed.core - INFO - Starting established connection\n",
"distributed.scheduler - INFO - Register tcp://10.35.68.2:46027\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.68.2:46027\n",
"distributed.core - INFO - Starting established connection\n",
"distributed.scheduler - INFO - Register tcp://10.35.94.2:35005\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.94.2:35005\n",
"distributed.core - INFO - Starting established connection\n",
"distributed.scheduler - INFO - Register tcp://10.35.76.2:40437\n",
"distributed.scheduler - INFO - Starting worker compute stream, tcp://10.35.76.2:40437\n",
"distributed.core - INFO - Starting established connection\n",
"distributed.scheduler - INFO - Remove worker tcp://10.35.85.2:36335\n",
"distributed.core - INFO - Removing comms to tcp://10.35.85.2:36335\n",
"distributed.scheduler - INFO - Remove worker tcp://10.35.89.2:36985\n",
"distributed.core - INFO - Removing comms to tcp://10.35.89.2:36985\n"
]
}
],
"source": [
"variants_pass = variants.compress(variant_selection)\n",
"variants_pass"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:root] *",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Loading