From bae7f284c9d8076fffb1be9d6dc4885beef565fb Mon Sep 17 00:00:00 2001 From: Ivan Pokrovac Date: Mon, 26 Jul 2021 12:33:45 +0200 Subject: [PATCH] Some fixing --- .../Simulating data-checkpoint.ipynb | 672 ++++++++++++++++++ Simulating data.ipynb | 24 +- 2 files changed, 686 insertions(+), 10 deletions(-) create mode 100644 .ipynb_checkpoints/Simulating data-checkpoint.ipynb diff --git a/.ipynb_checkpoints/Simulating data-checkpoint.ipynb b/.ipynb_checkpoints/Simulating data-checkpoint.ipynb new file mode 100644 index 0000000..26a33fa --- /dev/null +++ b/.ipynb_checkpoints/Simulating data-checkpoint.ipynb @@ -0,0 +1,672 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Biological background\n", + "## Cellular biology\n", + "For a cell to enter mitosis, it first needs to undergo a series of changes called **interphase**. Interphase can be divided in three phases - G1, S, G2.\n", + "During the G1 phase, amount of DNA in the cell remains the same. During the S phase, amount of DNA is increasing, only to be doubled in the G2 phase.\n", + "Thus, depending on the amount of DNA, one can classify a single cell as belonging to either of those three phases. \n", + "\n", + "## DNA measuring\n", + "Content of the DNA can be measured indirectly, using intercalating flourescent dyes such as **propidium iodide** (PI) that bond to DNA molecules within the cells. Uniform suspension of cells is then passed through a flourescent cytometer that is capable of detecting single cells (*events*) and measuring the intensity of PI flourescence. Since this is an indirect measurement of DNA through PI signal intensity, repeated sampling of cells in say, G1 phase, won't give the same values, but will give values that are distributed according to normal distribution. In general, measuring N amount of cells whose DNA content is X will be normally distributed with mean of PI(X) and standard deviation $\\sigma$ where PI(X) is a function that translates DNA content into PI flourescence. Since we will be comparing cell content, we are not interested in the *exact amount* of DNA present in the cells, and thus we will make an assumption that PI(X) ~ X.\n", + "\n", + "## Cytostatics and cytotoxicity\n", + "Certain componds can inhibit normal progression of cell mitosis. Such componds are called *cytostatics*. These compounds will lower the percentage of cells in S and/or G2 phase. \n", + "We know that certain compounds do just that. Evaluating cytotoxicity, or just effect on cell cycle, is an important part of evaluating a compound, especially if it's for potential human use.\n", + "\n", + "# Mathematical modelling\n", + "## Modeling DNA distributions\n", + "There are three aforementioned phases where the amount of DNA is different. We can model each of the phases by different function.\n", + "G1 phase can be approximated as $ G_{1} \\sim \\mathcal{N}(\\mu_{1},\\,\\sigma^{2}_{1})$ \n", + "\n", + "G2 phase can be approximated as $ G_{2} \\sim \\mathcal{N}(\\mu_{2},\\,\\sigma^{2}_{2})$\n", + "\n", + "The problematic phase to model is the S phase, because the content of DNA in the phase is anywhere between the amount of DNA in G1 and G2 phase. The selected model for S phase will be an infinite sum of normal distributions according to the following:\n", + "\n", + "\n", + "$\\int_{2n}^{4n} \\!w_{x} \\mathcal{N}(\\mu_{x},\\,\\sigma^{2}_{x}) \\, \\mathrm{d}x$.\n", + "\n", + "In here, 2n and 4n are the amount of DNA in G1 and G2 phase, while w represents a **weight** on the normal distribution, or the fraction of cells that have DNA content of X within the S phase. Making the assumption that all *weights* are equal, and that standard deviation are equal, and that $mu_{x}$ is simply x the above equation transforms into:\n", + "\n", + "$\\!w_{s}\\int_{2n}^{4n} \\mathcal{N}(x,\\,\\sigma^{2}_{x}) \\, \\mathrm{d}x$.\n", + "\n", + "The above infinite sum of Gaussians can be approximated by a uniform distribution:\n", + "\n", + "$\\int_{2n}^{4n} \\mathcal{N}(x,\\,\\sigma^{2}_{x}) \\, \\mathrm{d}x \\sim U(2n,4n)$\n", + "\n", + "## Mixture model\n", + "\n", + "Standard operating procedure of a flow cytometer is to acquire a number of events (ranging in thousands) regardless of parameters of events. Thus, a suspension of cells that's a *mixture* of the cells from G1, S, G2 phases is best approximated as a *mixture* model of DNA distributions of particular phases. We can write:\n", + "\n", + "\n", + "$model= w_{G1} \\mathcal{N}(\\mu_{1},\\,\\sigma^{2}_{1}) + w_{G2} \\mathcal{N}(\\mu_{2},\\,\\sigma^{2}_{2}) + (1-w_{G2}-w_{G1})\\ U(mu_{1},mu_{2})$ \n", + "\n", + "Note that the expression $(1-w_{G2}-w_{G1})$ is equal to $w_{s}$, but since a mixture model requires all weights to be equal, we only need $n-1$ weights to describe a mixture of $n$ components. Standard constraints apply (ie. all weights must be equal or greater than 0, and their sum must be 1). \n", + "\n", + "## Application of cytostatic agent\n", + "\n", + "Treating cells with a cytostatic will lower the amount of cells in S and/or G2 phase. In our model, the *weights* of S and/or G2 will lower by amounts $r_{1}$ and $r_{2}$, but the *weight* of G1 will increase by those same amounts. We represent amounts $r_{1}$ and $r_{2}$ as one factor - *the cytostatic factor of difference* or **${d}$**. If ${d}$ is 100 (or 1, depending on implementation), all cells will be of G1 phase and the model will only have only one relevant normal distribution.\n", + "\n", + "This is also unlikely to happen, even with strong *cytostatics*, as there will be cells in S and/or G2 phase prior to treating them with the cytostatic. Therefore we can add two additional parameters that *modify* the *strength* of acting cytostatic calling them ${gs}$ as \"G strength\" and ${ss}$ as \"S strength\". \n", + "\n", + "Implementing all three factors can be done by a formula:\n", + "\n", + "${reduction}_{(G2)}=w_{(G2)} {d} {gs}$\n", + "\n", + "${reduction}_{(S)}= w_{(S)} {d} {ss}$\n", + "\n", + "$w_{(G2)} = w_{(G2)}-{reduction}_{(G2)} = w_{(G2)} (1-{d} {gs})$\n", + "\n", + "$w_{(S)} = w_{(S)}-{reduction}_{(S)} = w_{(S)}(1-{d} {ss})$\n", + "\n", + "$w_{(G1)} = w_{(G1)} + {reduction}_{(G2)} + {reduction}_{(S)}$\n", + "\n", + "\n", + "In our model, *cytostatic factor d* represents our *certainty* that an applied agent is cytostatic. Respective strengths represent *power* of an applied cytostatic.\n", + "\n", + "# Programing implementation\n", + "\n", + "## Simulation of data\n", + "\n", + "To simulate data, taking into consideration everything written above, we can write a Python code to *simulate* and *draw* the data. Additionally, interactive sliders for *cytostatic factor of difference*, *G strength*, and *S strength* is provided. Run the following code to get the simulated data!" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "55ba60a23d5945b1a083897fd3b24e1f", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "interactive(children=(IntSlider(value=0, description='Cytostatic'), IntSlider(value=75, description='G strengt…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import ipywidgets as widgets\n", + "from IPython.display import display\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "import numpy as np\n", + "import pandas as pd\n", + "plt.style.use('ggplot')\n", + "\n", + "\n", + "class SimulatedData():\n", + " \"\"\"\n", + " .cf - cytostatic factor d that controls how much cells are \"arrested\" in cell cycle\n", + " .jitter - controls how much cfd is applied (higher jitter, higher reduction)\n", + " .sym - controls how symmetric is cf at taking from G2 or S phase, bigger sym favors G2 phase, smaller favors S phase\n", + " .data - data\n", + " .parameters - parameters of distribution\n", + " \n", + " \"\"\"\n", + " def __init__(self,cfd,gs=90,ss=90,seed=0,size=15000):\n", + " \"\"\"\n", + " CFD - cytostatic factor of difference\n", + " Controls how much cells are \"arrested\" in cell cycle\n", + " \n", + " GS - G strength\n", + " Controls how much cfd takes from G2 phase\n", + " \n", + " SS - S strength\n", + " Controls how much cfd takes from S phase\n", + " \n", + " \"\"\"\n", + " if cfd<0 or cfd>100:\n", + " raise ValueError(\"cfd must be between 0 and 100\")\n", + " if gs<0 or gs>100:\n", + " raise ValueError(\"gs must be between 0 and 100\")\n", + " if ss<0 or ss>100:\n", + " raise ValueError(\"ss must be between 0 and 100\")\n", + " \n", + " np.random.seed(seed)\n", + " \n", + " #generate the mean1, mean2\n", + " m1=np.random.randint(100,300)+np.random.sample()\n", + " m2=2*m1+20*np.random.sample()\n", + " \n", + " #generate the sd1, sd2\n", + " s1=0.1*m1+np.random.sample()\n", + " s2=0.1*m2+np.random.sample()\n", + " \n", + " #generate the weights\n", + " w1=np.random.uniform(0.3,0.5)\n", + " w2=np.random.uniform(0.3,0.5)\n", + " w3=1-w1-w2\n", + " \n", + " #apply the factor\n", + " if cfd>0:\n", + " cf=cfd/100\n", + " g=gs/100\n", + " s=ss/100\n", + " a=w2*cf*g\n", + " b=w3*cf*s\n", + " #lowering weights\n", + " w2=w2-a\n", + " w3=w3-b\n", + " #adding weights\n", + " w1=w1+a+b \n", + "\n", + " #G0 phase (2n)\n", + " data1=np.random.normal(loc=m1,scale=s1,size=int(w1*size))\n", + " #G2 phase (4n)\n", + " data2=np.random.normal(loc=m2,scale=s2,size=int(w2*size))\n", + " #S phase (2n-4n)\n", + " data3=np.random.uniform(low=m1,high=m2,size=int(w3*size))\n", + " #concat data\n", + " data=np.concatenate([data1,data2,data3])\n", + " param_dic={\"m1\":m1,\"m2\":m2,\"s1\":s1,\"s2\":s2,\"w1\":w1,\"w2\":w2}\n", + " self.data=data\n", + " self.parameters=param_dic\n", + " self.cf=cfd\n", + " self.gs=gs\n", + " self.ss=ss\n", + " \n", + "\n", + "%matplotlib inline\n", + "\n", + "def update_plot(cfd,gs,ss):\n", + " new_data=SimulatedData(cfd,gs,ss)\n", + " data=new_data.data\n", + " hist_plot=sns.histplot(data,stat=\"density\",kde=True)\n", + "\n", + "cfd=widgets.IntSlider(min=0, max=100, step=1, value=0, description =\"Cytostatic\")\n", + "gs=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"G strength\")\n", + "ss=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"S strength\")\n", + "widgets.interactive(update_plot,cfd=cfd,gs=gs,ss=ss)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parameter estimation\n", + "\n", + "Recalling our model:\n", + "\n", + "$model= w_{G1} \\mathcal{N}(\\mu_{1},\\,\\sigma^{2}_{1}) + w_{G2} \\mathcal{N}(\\mu_{2},\\,\\sigma^{2}_{2}) + (1-w_{G2}-w_{G1})\\ U(\\mu_{1},\\mu_{2})$ \n", + "\n", + "We can see that the model has 6 parameters - 3 for each Gaussian. Question is, if we have the data, can we recover those parameters? \n", + "We will \"store\" our parameters into one vector, **$\\theta$**, and consider the case of one point (one cell, one event).\n", + "\n", + "$m(x_{i} | \\theta)=w_{G1} \\mathcal{N}(x_{i}|\\mu_{1},\\,\\sigma^{2}_{1}) + w_{G2} \\mathcal{N}(x_{i}|\\mu_{2},\\,\\sigma^{2}_{2}|x_{i}) + (1-w_{G2}-w_{G1})\\ U(x_{i}|\\mu_{1},\\mu_{2})$ \n", + "\n", + "Every event is independent from each other, and since the above mentioned model represents a kind of a *probability* distribution, for many measurements $N$ stored in a vector $X$, we can write\n", + "\n", + "$m(X | \\theta)=\\prod_n m(x_{i} | \\theta)$\n", + "$=\\prod_n\\ [w_{G1} \\mathcal{N}(x_{i}|\\mu_{1},\\,\\sigma^{2}_{1}) + w_{G2} \\mathcal{N}(x_{i}|\\mu_{2},\\,\\sigma^{2}_{2}|x_{i}) + (1-w_{G2}-w_{G1})\\ U(x_{i}|\\mu_{1},\\mu_{2})]$ \n", + "\n", + "The above expression is rather unwieldy, as probabilities are very small, and their product will likewise be very small. To ease the calculations, we apply log transform. Log transform will also transform multiplication to summation. We will take the **negative** log transform, because the problem will then be one of minimization, rather than maximization.\n", + "\n", + "$\\mathcal{L}(\\theta)=-log m(X|\\theta)=-\\sum log m(x_{i}|\\theta)$\n", + "\n", + "The above function will be the **objective** function for our model given our data. Thus, problem of finding optimal parameters can is one of minimization of the above function.\n", + "\n", + "There are following *constraints* we need to take in consideration. First, all our parameters must be positive (deviations and means must be strictly positive), especially our weights, and the sum of our weights must be less than or equal to 1. Luckily, those constraints are *linear*, and can be expressed in the form of a dot product of a matrix **A** and parameter vector **$\\theta$** We define the matrix **A** and parameter vector **$\\theta$** below:\n", + "\n", + "$A=\\begin{bmatrix} \n", + "1 & 1 & 0 & 0 & 0 & 0 \\\\\n", + "1 & 0 & 0 & 0 & 0 & 0 \\\\\n", + "0 & 0 & 0 & 0 & 0 & 0 \\\\\n", + "0 & 0 & 1 & 0 & 0 & 0 \\\\\n", + "0 & 0 & 0 & 1 & 0 & 0 \\\\\n", + "0 & 0 & 0 & 0 & 1 & 0 \\\\\n", + "0 & 0 & 0 & 0 & 0 & 1 \\\\\n", + "\\end{bmatrix}$\n", + "\n", + "$\\theta=\\begin{bmatrix}\n", + "w_{1}\\\\\n", + "w_{2}\\\\\n", + "\\mu_{1}\\\\\n", + "\\mu_{2}\\\\\n", + "\\sigma_{1}\\\\\n", + "\\sigma_{2}\\\\\n", + "\\end{bmatrix}$\n", + "\n", + "We can define **lower boundary** to be a zero vector, and **upper boundary** to be a vector with first value 1, and the rest infinity. We'll use @ symbol to denote matrix multiplication for concordance with numpy operator.\n", + "\n", + "**0** $\\leq A @ \\theta \\leq [1 \\infty \\infty \\infty \\infty \\infty \\infty]^{T}$\n", + "\n", + "Lastly, this needs to be implemented in Python. We will make extensive use of autograd library to calculate gradient (and value!) of our objective function and of scipy.optimize library to minimize both gradient and value of our function, as well as using LinearConstraint class so our minimization can converge without issues. \n", + "\n", + "Finally, since this is a simulated experiment all our prior parameters are known! It is then possible to define error as (original-determined)/(original) in percentage, and we can then average the vector. Code below does just that. It simulated the data according to the user provided input (just like in plotting), recovers the original parameters, determines new parameters according to the above equations (functions optimal_parameter_finding and neglog), and then reports the discrepancy between the as average error.\n", + "\n", + "*Note: Since autograd and scipy.minimize are algorithmic, they can ocassionally give errors and warning. Since we're using logarithmic transformation, if we take log of anything lesser than 0, we'll get an error. These errors are harmless, and both autograd and scipy.minimize will gloss over those errors and they won't affect end result. Nevertheless, for aesthetic purposes, we elect to suppress error warnings*\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "97dc6efed57147bbad078df3ab2c4367", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "interactive(children=(IntSlider(value=65, description='Cytostatic'), IntSlider(value=75, description='G streng…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import warnings\n", + "warnings.filterwarnings('ignore') #suppress error warnings\n", + "\n", + "from autograd.scipy.stats import norm\n", + "from autograd import numpy as np\n", + "from autograd import value_and_grad\n", + "from scipy.optimize import LinearConstraint as lincon\n", + "from scipy.optimize import minimize\n", + "from scipy.stats import uniform\n", + "\n", + "np.set_printoptions(formatter={'float': lambda x: \"{0:0.2f}\".format(x)}) #<-we dont want to get bogged down in a million of decimal points\n", + "\n", + "def neglog(params,data):\n", + " \"\"\"\n", + " order is w1,w2,m1,m2,s1,s2\n", + " \"\"\"\n", + " #parameter unpacking\n", + " w1,w2,m1,m2,s1,s2=params \n", + " gaus=w1*norm.pdf(data,m1,s1)+(w2)*norm.pdf(data,m2,s2)\n", + "\n", + " loc=min(m1._value,m2._value)\n", + " scale=abs(m1._value-m2._value)\n", + " \n", + " unif=(1-w1-w2)*uniform.pdf(data,loc,scale)\n", + " \n", + " imn=gaus+unif\n", + " imn=np.log(imn)\n", + " imn=-1*imn.sum()\n", + " return(imn)\n", + "\n", + "def optimal_parameter_finding(data):\n", + " #define our constraints\n", + " A=np.hstack((np.array([1,1,0,0,0,0]).reshape(6,1),np.identity(6))).T\n", + " lb=np.zeros(7)\n", + " up=np.full((7),np.inf)\n", + " up[0]=1\n", + " \n", + " conmat=lincon(A,lb,up)\n", + " \n", + " results = minimize(\n", + " value_and_grad(neglog), # see autograd docs.\n", + " x0 = np.array([0.4, 0.4, 200, 500, 30,50]), # initial value\n", + " args=(data,), #additional argument, namely data, must be passed\n", + " jac=True, #this tells the algorithm that our function returns (value(x),gradient(x))\n", + " constraints=conmat #these are our contraints\n", + " )\n", + " \n", + " return(results.x)\n", + "\n", + "def data_drawer(original_data,det_parameters):\n", + " size=15000\n", + " #unpacking determined parameters so we can redraw our data\n", + " w1=det_parameters[0]\n", + " w2=det_parameters[1]\n", + " w3=1-w1-w2\n", + " m1=det_parameters[2]\n", + " m2=det_parameters[3]\n", + " s1=det_parameters[4]\n", + " s2=det_parameters[5]\n", + " #G0 phase (2n)\n", + " data1=np.random.normal(loc=m1,scale=s1,size=int(w1*size))\n", + " #G2 phase (4n)\n", + " data2=np.random.normal(loc=m2,scale=s2,size=int(w2*size))\n", + " #S phase (2n-4n)\n", + " data3=np.random.uniform(low=m1,high=m2,size=int(w3*size))\n", + " #concat data\n", + " det_data=np.concatenate([data1,data2,data3])\n", + " hist_plot1=sns.histplot(original_data,stat=\"density\",kde=True,color=\"blue\",alpha=0.5)\n", + " hist_plot2=sns.histplot(det_data,stat=\"density\",kde=True,color=\"red\",alpha=0.5)\n", + " \n", + "\n", + "def simulate_new_data(cfd,gs,ss):\n", + " simulated_data=SimulatedData(cfd,gs,ss)\n", + " data=simulated_data.data\n", + " parameters=simulated_data.parameters\n", + " original_parameters=[parameters[\"w1\"],parameters[\"w2\"],parameters[\"m1\"],parameters[\"m2\"],parameters[\"s1\"],parameters[\"s2\"]]\n", + " original_parameters=np.array(original_parameters,dtype=float)\n", + " \n", + " determined_parameters=optimal_parameter_finding(data)\n", + " error_v=(original_parameters-determined_parameters)/original_parameters*100\n", + " print(\"Discrepancy between original and determined is: \",error_v,\" %\")\n", + " print(\"Average error is:\",round(error_v.mean(),2),\" %\")\n", + " #plot the difference\n", + " print(\"Below is the plot of original data in blue, and data reconstructed from determined parameters in red\")\n", + " data_drawer(data,determined_parameters)\n", + "\n", + "cfd=widgets.IntSlider(min=0, max=100, step=1, value=65, description =\"Cytostatic\")\n", + "gs=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"G strength\")\n", + "ss=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"S strength\")\n", + "widgets.interactive(simulate_new_data,cfd=cfd,gs=gs,ss=ss)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our simulated experiment will include one sample where there was no cytostatic (cfd=0), and one sample where a cytostatic is present (cfd=100). Our question is, if we have those two samples, can we determine G and S strength?" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "22a2fe9328074dcca26df311fae8c052", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "interactive(children=(IntSlider(value=75, description='G strength'), IntSlider(value=75, description='S streng…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def double_simulation(gs,ss):\n", + " noCancer=SimulatedData(0,gs,ss)\n", + " Cancer=SimulatedData(100,gs,ss)\n", + " \n", + " ##TAKING STRENGTHS DIRECTLY\n", + " #weights of cfd=0\n", + " w1_0=noCancer.parameters[\"w1\"]\n", + " w2_0=noCancer.parameters[\"w2\"]\n", + " w3_0=1-w1_0-w2_0\n", + " \n", + " #weights of cfd=100\n", + " w1_100=Cancer.parameters[\"w1\"]\n", + " w2_100=Cancer.parameters[\"w2\"]\n", + " w3_100=1-w1_100-w2_100\n", + " \n", + " #storing strengths\n", + " g_str_orig=(w2_0-w2_100)/(w2_0)\n", + " s_str_orig=(w3_0-w3_100)/(w3_0)\n", + " \n", + " ##TAKING STRENGTHS FROM PARAMETER ESTIMATION\n", + " par_0=optimal_parameter_finding(noCancer.data)\n", + " par_100=optimal_parameter_finding(Cancer.data)\n", + " \n", + " #weights of cfd=0\n", + " w1_0=par_0[0]\n", + " w2_0=par_0[1]\n", + " w3_0=1-w1_0-w2_0\n", + " \n", + " #weights of cfd=100\n", + " w1_100=par_100[0]\n", + " w2_100=par_100[1]\n", + " w3_100=1-w1_100-w2_100\n", + " \n", + " #recovering the determined strengths\n", + " g_str_det=(w2_0-w2_100)/w2_0\n", + " s_str_det=(w3_0-w3_100)/w3_0\n", + "\n", + " print(\n", + " \"Recovered strengths are: G=\",\n", + " round(g_str_orig*100,2),\n", + " \" S=\",\n", + " round(s_str_orig*100,2)\n", + " )\n", + " \n", + " print(\n", + " \"Estimated strengths are: G=\",\n", + " round(g_str_det*100,2),\n", + " \" S=\",\n", + " round(s_str_det*100,2)\n", + " )\n", + " print(\n", + " \"Error is G(err)=\",\n", + " round((g_str_orig-g_str_det)/(g_str_orig)*100,2),\n", + " \"%, S(err)=\",\n", + " round((s_str_orig-s_str_det)/(s_str_orig)*100,2),\n", + " \"%\")\n", + " \n", + " \n", + " \n", + " \n", + "gs=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"G strength\")\n", + "ss=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"S strength\")\n", + "widgets.interactive(double_simulation,gs=gs,ss=ss)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above answer is yes, we can! Generally speaking, when doing an experiment, the scientist will have to have one negative control (where there is no cytotoxic substance present) and one positive control (where there is a substance they are positively sure is cytotoxic), as well as the actual experiment. In the actual experiment, cytotoxicity of the substance is not known, and both negative and positive control are here to compare the tested substance against no substance and cytotoxic substance.\n", + "\n", + "## Estimating cytotoxicity of a substance \n", + "In our simulated example, we demonstrated that we can recover G and S strengths when we have negative control (corresponding to ${d=0}$) and positive control (corresponding to ${d=100}$). \n", + "Our next question will be that if we know G and S strengths, and a substance we don't know for sure is cytotoxic (corresponding to ${d=x}$), can we determine $x$?\n", + "\n", + "Before simulating further data, we'll bring up our mathematical model again (we replaced $1-w_{G1}-w_{G2}$ with $w_{S}$ to make calculation easier):\n", + "\n", + "$\\mathcal{L}(\\theta)=-log m(X|\\theta)=-\\sum log m(x_{i}|\\theta)$\n", + "\n", + "$m(x_{i} | \\theta)= [w_{G1} \\mathcal{N}(x_{i}|\\mu_{1},\\,\\sigma^{2}_{1}) + w_{G2} \\mathcal{N}(x_{i}|\\mu_{2},\\,\\sigma^{2}_{2}|x_{i}) + w_{S}\\ U(x_{i}|\\mu_{1},\\mu_{2})]$ \n", + "\n", + "Notice that this model makes no notice of parameters we put up on sliders (G, S strengths, cytostatic factor). Indeed, these parameters are modeled to act upon the weights, and only on weights, in a non-linear fashion that ultimately returns a value that is apt for a weight value - greater than or equal to 0, and sum of all weights ultimately being 1:\n", + "\n", + "$w_{i}=w_{i} (1- {s_{i}} {d}) $ or $w_{j}=w_{j} + \\sum({s_{i}} {d}) $\n", + "\n", + "Our model above models treats one population treated with nothing and another population treated with something as two different things. In order to better glean at the effects of factors that reduce (n-1) weights and increase one weight, we need to find modify our model with additional parameters - namely strengths, and the cytotoxic factor d. This amounts to plugging the above equation in our model, and extending $\\theta$ vector with additional parameters. Ours is a model of 3 components (2 gaussian, 1 uniform), and thus has 2 strengths and 1 factor - we extend our $\\theta$ by three additional parameters!\n", + "\n", + "$m(x_{i} | \\theta_{ext})= [(w_{G1}+w_{G2}{s_{2}}{d}+w_{S}{s_{3}}{d}) \\mathcal{N}(x_{i}|\\mu_{1},\\,\\sigma^{2}_{1}) + w_{G2}(1-{s_{2}} {d}) \\mathcal{N}(x_{i}|\\mu_{2},\\,\\sigma^{2}_{2}|x_{i}) + w_{S}(1-{s_{2}} {d})\\ U(x_{i}|\\mu_{1},\\mu_{2})]$ \n", + "\n", + "$\\theta_{ext}=\\begin{bmatrix}\n", + "w_{1}\\\\\n", + "w_{2}\\\\\n", + "\\mu_{1}\\\\\n", + "\\mu_{2}\\\\\n", + "\\sigma_{1}\\\\\n", + "\\sigma_{2}\\\\\n", + "d\\\\\n", + "s_{1}\\\\\n", + "s_{2}\\\\\n", + "\\end{bmatrix}$\n", + "\n", + "In case where d is equal to 0, strengths are simply irrelevant. What can the strengths represent? They could be a fraction of the cells that are in G2 or S phase right up until a moment a perfect cytostatic (a theoretical agent that stops cell cycle completely) enters the cell mixture - they can represent a threshold of weights of G2 and S phase components that the respective weights cannot go lower than.\n", + "\n", + "Our $\\theta_{ext}$ vector has 9 parameters, and our experiment with positive and negative control enables us to *determine* all 9 parameters. Assuming that the strengths remain constant within different experiments, that they're a property solely of a particular population of cells rather than a function of various other elements, our extended model for an unknown ${d=x}$ will only have one parameter to be deterimened, and that is ${d}$!\n", + "\n", + "Rewriting our model differently:\n", + " \n", + "$m_{unk}(x_{i} | {d})= [(w_{G1}+w_{G2}{s_{2}}{d}+w_{S}{s_{3}}{d}) \\mathcal{N}(x_{i}|\\mu_{1},\\,\\sigma^{2}_{1}) + w_{G2}(1-{s_{2}} {d}) \\mathcal{N}(x_{i}|\\mu_{2},\\,\\sigma^{2}_{2}|x_{i}) + w_{S}(1-{s_{2}} {d})\\ U(x_{i}|\\mu_{1},\\mu_{2})]$ \n", + " \n", + "$\\mathcal{L}({d})=-log m_{unk}(X|{d})=-\\sum log m_{unk}(x_{i}|{d})$\n", + "\n", + "We can consider all other parameters, other than d, as constants. The optimal parameter d will be the one that minimizes the above function.\n", + "\n", + "The following code will do the following things\n", + "\n", + "1. Take user provided G and S parameters and simulate two datasets corresponding to positive (D=100) and negative (D=0) control\n", + "2. Estimate parameters of dataset according to the \"optimal_parameter_finding\" function\n", + "3. Estimate G and S parameters from optimal parameters for positive (D=100) and negative (D=0) control\n", + "4. Take user provided D parameter (corresponding to cytotoxicity of a substance) and simulate one dataset.\n", + "5. Estimate the D parameter according to the above mentioned idea.\n", + "6. Draw all our experiments, positive and negative control, as well as tested substance\n", + "\n", + "G and S parameters are considered to be a function of cell population used in the experiment and are applied to all experiments equally. Cytostatic slider controls \"cytotoxicity\" of the supstance used in experiment. \n", + "\n", + "Our sucess will be if we can recover user inputed value for Cytostatic to a reasonable degree of precision.\n", + " \n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "703bb70da303407e830e13169056727e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "interactive(children=(IntSlider(value=50, description='Cytostatic', max=99, min=1), IntSlider(value=75, descri…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def estimate_strengths(posPar,negPar):\n", + " #weights of cfd=0\n", + " w1_0=negPar[0]\n", + " w2_0=negPar[1]\n", + " w3_0=1-w1_0-w2_0\n", + " \n", + " #weights of cfd=100\n", + " w1_100=posPar[0]\n", + " w2_100=posPar[1]\n", + " w3_100=1-w1_100-w2_100\n", + " \n", + " #recovering the determined strengths\n", + " g_str_det=(w2_0-w2_100)/w2_0\n", + " s_str_det=(w3_0-w3_100)/w3_0\n", + " return(g_str_det,s_str_det)\n", + "\n", + "def neglog_cytostat(d,data):\n", + " \"\"\"\n", + " constants are in order w1,w2,m1,m2,s1,s2,gstr,sstr\n", + " \"\"\"\n", + " #parameter unpacking\n", + " w1,w2,m1,m2,s1,s2,gstr,sstr=constants\n", + " ws=1-w1-w2\n", + " \n", + " #applying strengths\n", + " w2=w2*(1-d*gstr)\n", + " ws=ws*(1-d*sstr)\n", + " w1=1-w2-ws #going about it a roundabout way\n", + " \n", + " \n", + " gaus=w1*norm.pdf(data,m1,s1)+(w2)*norm.pdf(data,m2,s2)\n", + "\n", + " loc=min(m1,m2)\n", + " scale=abs(m1-m2)\n", + " unif=ws*uniform.pdf(data,loc,scale)\n", + " \n", + " imn=gaus+unif\n", + " imn=np.log(imn)\n", + " imn=-1*imn.sum()\n", + " return(imn)\n", + "\n", + "def cytostat_param_finding(exData,constants):\n", + " \n", + " results = minimize(\n", + " value_and_grad(neglog_cytostat), # see autograd docs.\n", + " x0 = 0.5, # initial value\n", + " args=(exData,), #additional argument, namely data, must be passed\n", + " jac=True, #this tells the algorithm that our function returns (value(x),gradient(x))\n", + " bounds= ((0,1),) #d must be between 0 and 1\n", + " )\n", + " return(results)\n", + " \n", + "def drawing_our_experiment(posData,negData,exData,cfd):\n", + " #transfer our data into pandas DataFrame\n", + " \n", + " pos=pd.DataFrame(posData,columns=[\"DNA content\"])\n", + " pos.insert(0,\"Cytostatic\",100)\n", + " \n", + " neg=pd.DataFrame(negData,columns=[\"DNA content\"])\n", + " neg.insert(0,\"Cytostatic\",0)\n", + " \n", + " ex=pd.DataFrame(exData,columns=[\"DNA content\"])\n", + " ex.insert(0,\"Cytostatic\",cfd)\n", + " \n", + " dataset=pd.concat([neg,ex,pos])\n", + " sns.displot(data=dataset,x=\"DNA content\",hue=\"Cytostatic\",col=\"Cytostatic\",stat=\"density\",kde=True)\n", + " \n", + "def main_function(cfd,gs,ss):\n", + " #first we simulate our controls\n", + " posData=SimulatedData(100,gs,ss)\n", + " negData=SimulatedData(0,gs,ss)\n", + " #then we estimate parameters of controls\n", + " posPar=optimal_parameter_finding(posData.data)\n", + " negPar=optimal_parameter_finding(negData.data)\n", + " #then we estimate strengths\n", + " gstr,sstr=estimate_strengths(posPar,negPar)\n", + " #print(gstr,sstr)\n", + " \n", + " #we will pack our estimated parameters (weights, means, stdevs, strengths) into one array called constant\n", + " #the constants are in order (w1,w2,m1,m2,s1,s2,gstr,sstr)\n", + " global constants #we make constants global because Autograd wrapped numpy doesn't behave\n", + " constants=np.append(negPar,[gstr,sstr])\n", + " \n", + " #now we will simulate our \"experimental dataset\"\n", + " exData=SimulatedData(cfd,gs,ss)\n", + " \n", + " #we will now draw our datasets\n", + " drawing_our_experiment(posData.data,negData.data,exData.data,cfd)\n", + " \n", + " #and finally we find try to recover our parameter\n", + " result=cytostat_param_finding(exData.data,constants)\n", + " d=result.x\n", + " print(\"Your Cytostatic parameter is \",d*100)\n", + " \n", + "\n", + "\n", + "cfd=widgets.IntSlider(min=1, max=99, step=1, value=50, description =\"Cytostatic\")\n", + "gs=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"G strength\")\n", + "ss=widgets.IntSlider(min=0, max=100, step=1, value=75, description =\"S strength\")\n", + "widgets.interactive(main_function,cfd=cfd,gs=gs,ss=ss)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.3" + }, + "toc-autonumbering": true, + "toc-showmarkdowntxt": false + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Simulating data.ipynb b/Simulating data.ipynb index 3ecc2d9..26a33fa 100644 --- a/Simulating data.ipynb +++ b/Simulating data.ipynb @@ -13,6 +13,10 @@ "## DNA measuring\n", "Content of the DNA can be measured indirectly, using intercalating flourescent dyes such as **propidium iodide** (PI) that bond to DNA molecules within the cells. Uniform suspension of cells is then passed through a flourescent cytometer that is capable of detecting single cells (*events*) and measuring the intensity of PI flourescence. Since this is an indirect measurement of DNA through PI signal intensity, repeated sampling of cells in say, G1 phase, won't give the same values, but will give values that are distributed according to normal distribution. In general, measuring N amount of cells whose DNA content is X will be normally distributed with mean of PI(X) and standard deviation $\\sigma$ where PI(X) is a function that translates DNA content into PI flourescence. Since we will be comparing cell content, we are not interested in the *exact amount* of DNA present in the cells, and thus we will make an assumption that PI(X) ~ X.\n", "\n", + "## Cytostatics and cytotoxicity\n", + "Certain componds can inhibit normal progression of cell mitosis. Such componds are called *cytostatics*. These compounds will lower the percentage of cells in S and/or G2 phase. \n", + "We know that certain compounds do just that. Evaluating cytotoxicity, or just effect on cell cycle, is an important part of evaluating a compound, especially if it's for potential human use.\n", + "\n", "# Mathematical modelling\n", "## Modeling DNA distributions\n", "There are three aforementioned phases where the amount of DNA is different. We can model each of the phases by different function.\n", @@ -44,7 +48,7 @@ "\n", "## Application of cytostatic agent\n", "\n", - "Certain molecules can inhibit normal progression of cell division by various means. These molecules are called *cytostatics* (meaning they *stop* cells). Treating cells with a cytostatic will lower the amount of cells in S and/or G2 phase. In our model, the *weights* of S and/or G2 will lower by amounts $r_{1}$ and $r_{2}$, but the *weight* of G1 will increase by those same amounts. We represent amounts $r_{1}$ and $r_{2}$ as one factor - *the cytostatic factor of difference* or **${d}$**. If ${d}$ is 100 (or 1, depending on implementation), all cells will be of G1 phase and the model will only have only one relevant normal distribution.\n", + "Treating cells with a cytostatic will lower the amount of cells in S and/or G2 phase. In our model, the *weights* of S and/or G2 will lower by amounts $r_{1}$ and $r_{2}$, but the *weight* of G1 will increase by those same amounts. We represent amounts $r_{1}$ and $r_{2}$ as one factor - *the cytostatic factor of difference* or **${d}$**. If ${d}$ is 100 (or 1, depending on implementation), all cells will be of G1 phase and the model will only have only one relevant normal distribution.\n", "\n", "This is also unlikely to happen, even with strong *cytostatics*, as there will be cells in S and/or G2 phase prior to treating them with the cytostatic. Therefore we can add two additional parameters that *modify* the *strength* of acting cytostatic calling them ${gs}$ as \"G strength\" and ${ss}$ as \"S strength\". \n", "\n", @@ -72,13 +76,13 @@ }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "0c394bfb79f649dda815d195079826c0", + "model_id": "55ba60a23d5945b1a083897fd3b24e1f", "version_major": 2, "version_minor": 0 }, @@ -245,13 +249,13 @@ }, { "cell_type": "code", - "execution_count": 55, + "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "8b3d1dc805364c1cb2249d3000e7a16a", + "model_id": "97dc6efed57147bbad078df3ab2c4367", "version_major": 2, "version_minor": 0 }, @@ -366,13 +370,13 @@ }, { "cell_type": "code", - "execution_count": 57, + "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "e235ee6d1e0641629d80efc0711a54f4", + "model_id": "22a2fe9328074dcca26df311fae8c052", "version_major": 2, "version_minor": 0 }, @@ -515,13 +519,13 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "f51a522be7264d2eb2128a296b366980", + "model_id": "703bb70da303407e830e13169056727e", "version_major": 2, "version_minor": 0 }, @@ -658,7 +662,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.8" + "version": "3.8.3" }, "toc-autonumbering": true, "toc-showmarkdowntxt": false