[WIP] Mp/dispersion smoothing #145

Open · wants to merge 59 commits into base: development from mp/dispersion_smoothing

Changes from all commits
c001fef
added sctransform vst wrapper
picciama May 27, 2022
9f768db
fixed trailing whitespace
picciama May 27, 2022
3fe5524
removed return statements
picciama May 27, 2022
74155f0
cast np.finfo to float explicitly
picciama May 27, 2022
47c8341
resolved comments by Ilan Gold in #145
picciama Jun 19, 2022
6e347f7
bugfix: also check for "no_smoothing"
picciama Jun 19, 2022
235e8a5
added nbdeviance from edgeR
picciama Jul 15, 2022
66b3ed5
added aveLogCPM from edgeR
picciama Jul 15, 2022
17360a0
added calcNormFactors from edgeR
picciama Jul 15, 2022
6e1e530
added residDF from edgeR
picciama Jul 15, 2022
21f6570
added wleb wrapper from edgeR
picciama Jul 15, 2022
63444d1
added maximizeInterpolant from edgeR
picciama Jul 15, 2022
9756c36
added external.py for external imports
picciama Jul 15, 2022
9dd2d38
added squeezeVar from limma
picciama Jul 15, 2022
b09b853
added glm_one_group from edgeR
picciama Jul 15, 2022
0b0dcef
bugfixes in tmmwsp fixed
picciama Jul 19, 2022
1541d7d
added deps for newly included edgeR procedures
picciama Jul 23, 2022
c6dcf99
compute() dask arrays before calculating nb_dev
picciama Jul 23, 2022
7701c78
added levenberg-marquardt estimator as in edgeR
picciama Jul 23, 2022
bb78dab
added qr_decomposition for param init as in edgeR
picciama Jul 23, 2022
ae5a8bd
integrated batchglm model / estimator environment
picciama Aug 6, 2022
5280a08
added documentation, bugfixing and use c_utils
picciama Aug 6, 2022
737e099
added documentation and minor fixes
picciama Aug 6, 2022
ce34637
documentation and cleanup
picciama Aug 6, 2022
d307bd4
cleanup
picciama Aug 6, 2022
4547fdd
documentation / cleanup
picciama Aug 6, 2022
5d54b1f
cleanup / bugfixing /integration in estimator
picciama Aug 6, 2022
8c4e96c
added imports
picciama Aug 6, 2022
56d9220
bugfixing and cleanup
picciama Aug 6, 2022
281336a
added effects function
picciama Aug 6, 2022
291f2f2
added module init
picciama Aug 6, 2022
2829929
added fit_f_dist function
picciama Aug 6, 2022
9b4bb0b
added adjusted profile likelihood
picciama Aug 6, 2022
8d4ac3b
cleanup and integration of one_group_glm
picciama Aug 6, 2022
0e78480
added prior degrees of freedom
picciama Aug 6, 2022
dc4ad4c
added basic estimateDisp function
picciama Aug 6, 2022
f2e0831
added c_utility functions
picciama Aug 6, 2022
26861d7
cleanup
picciama Aug 7, 2022
dd7c053
added support for summed nb_deviance
picciama Aug 7, 2022
2520732
bugfix: leverages not correctly returned
picciama Aug 7, 2022
6abe5c9
removed print statement
picciama Aug 7, 2022
3d2e1f3
ignore irrelevant log division errors
picciama Aug 7, 2022
9595dd9
delayed levenberg theta_loc init + dask support
picciama Aug 7, 2022
ca7e50d
added sizefactor and dask/nodask support
picciama Aug 7, 2022
e1228ee
added newest mypy and pytest versions
picciama Sep 13, 2022
0a21e4a
updated pre-commit yaml
picciama Sep 13, 2022
0594375
bumped version of black to support newer click
picciama Sep 13, 2022
09475f0
typo fixed
picciama Sep 13, 2022
0e53b11
added dask support to typehint
picciama Sep 13, 2022
5cd4676
make constraints a mandatory argument
picciama Sep 13, 2022
8730ba7
fix mypy related problems + black reformatting
picciama Sep 13, 2022
a0636ca
index using np.ndarray for mypy to succeed
picciama Sep 20, 2022
5856631
return all, not just prior_df
picciama Sep 20, 2022
d961166
minor refactoring and bugfixing
picciama Sep 20, 2022
4a230b5
fixed pre-commit hooks and mypy
picciama Sep 22, 2022
fbc5e91
increased the param clipping limits
picciama Sep 22, 2022
16915b7
Merge branch 'development' into mp/dispersion_smoothing
picciama Oct 7, 2022
13a7800
fix mypy / batchglm API after merging development
picciama Oct 7, 2022
dd63af4
added missing import and init file
picciama Oct 8, 2022
77 changes: 77 additions & 0 deletions batchglm/external/edgeR/adjProfileLik.py
@@ -0,0 +1,77 @@
import dask.array
import numpy as np
import scipy.linalg.lapack
import scipy.special

from .estimator import NBEstimator


def adjusted_profile_likelihood(
    estimator: NBEstimator,
    adjust: bool = True,
):
    """
    Featurewise Cox-Reid adjusted profile log-likelihoods for the dispersion.
    The dispersion can be a scalar or a featurewise vector.
    Computationally, the dispersion can also be a matrix, but the APL is still computed tagwise.
    This is a numpy-vectorized python version of edgeR's adjProfileLik function implemented in C++.
    """
    low_value = 1e-10
    log_low_value = np.log(low_value)

    estimator.train(maxit=250, tolerance=1e-10)
    model = estimator._model_container
    poisson_idx = np.where(1 / model.scale < 0)[0]
    if isinstance(poisson_idx, dask.array.core.Array):
        poisson_idx = poisson_idx.compute()

    if len(poisson_idx) == model.num_features:
        # All features are treated as Poisson: use the Poisson log-likelihood directly.
        loglik = model.x * np.log(model.location) - model.location - scipy.special.gammaln(model.x + 1)
    elif len(poisson_idx) == 0:
        loglik = model.ll
    else:
        # Mixed case: Poisson log-likelihood for the Poisson-like features,
        # the NB log-likelihood for the rest.
        loglik = np.zeros_like(model.x)

        poisson_x = model.x[:, poisson_idx]
        poisson_loc = model.location_j(poisson_idx)

        loglik[:, poisson_idx] = poisson_x * np.log(poisson_loc) - poisson_loc - scipy.special.gammaln(poisson_x + 1)

        non_poisson_idx = np.where(model.theta_scale > 0)[0]
        loglik[:, non_poisson_idx] = model.ll_j(non_poisson_idx)

    sum_loglik = np.sum(loglik, axis=0)

    if adjust:
        w = -model.fim_weight_location_location

        adj = np.zeros(model.num_features)
        n_loc_params = model.design_loc.shape[1]
        if n_loc_params == 1:
            # Intercept-only model: the Fisher information is a scalar per feature.
            adj = np.sum(w, axis=0)
            adj = np.log(np.abs(adj))
            if isinstance(adj, dask.array.core.Array):
                adj = adj.compute()
        else:
            # Per-feature Fisher information matrices X^T W X via einsum.
            xh = model.xh_loc
            xhw = np.einsum("ob,of->fob", xh, w)
            fim = np.einsum("fob,oc->fbc", xhw, xh)
            if isinstance(fim, dask.array.core.Array):
                fim = fim.compute()
            for i in range(fim.shape[0]):
                # Log-determinant via LDL^T factorization, as in edgeR's C++ code.
                ldu, _, info = scipy.linalg.lapack.dsytrf(lower=0, a=fim[i])
                if info < 0:
                    adj[i] = 0
                    print(f"LDL factorization failed for feature {i}")
                else:
                    ldu_diag = np.diag(ldu)
                    adj[i] = np.sum(
                        np.where((ldu_diag < low_value) | np.isinf(ldu_diag), log_low_value, np.log(ldu_diag))
                    )

        adj /= 2
        sum_loglik -= adj

    return sum_loglik
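The per-feature loop above computes the Cox-Reid adjustment, 0.5 · log|XᵀWX|, through a LAPACK LDLᵀ factorization. A minimal sketch of the same quantity using `np.linalg.slogdet` (the helper name and the random data are hypothetical, not part of the PR):

```python
import numpy as np

def cox_reid_adjustment(xh: np.ndarray, w: np.ndarray) -> float:
    """Cox-Reid adjustment 0.5 * log|X^T W X| for a single feature.

    xh: (n_obs, n_params) location design matrix
    w:  (n_obs,) positive Fisher-information weights for this feature
    """
    fim = xh.T @ (w[:, None] * xh)         # Fisher information X^T W X
    sign, logdet = np.linalg.slogdet(fim)  # numerically stable log-determinant
    return 0.5 * logdet

rng = np.random.default_rng(0)
xh = np.column_stack([np.ones(10), rng.normal(size=10)])  # intercept + 1 covariate
w = rng.uniform(0.5, 2.0, size=10)                        # positive weights
adj = cox_reid_adjustment(xh, w)
```

For a well-conditioned positive-definite Fisher information this agrees with summing the logs of the LDLᵀ pivots as done in the loop above; the PR's clipping of small or infinite pivots only matters near singularity.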
80 changes: 80 additions & 0 deletions batchglm/external/edgeR/aveLogCPM.py
@@ -0,0 +1,80 @@
from typing import Optional, Union

import dask.array
import numpy as np

from .external import InputDataGLM, ModelContainer, NBModel
from .glm_one_group import fit_single_group, get_single_group_start


def calculate_avg_log_cpm(
    x: np.ndarray,
    size_factors: Optional[np.ndarray] = None,
    dispersion: Union[np.ndarray, float] = 0.05,
    prior_count: int = 2,
    weights: Optional[Union[np.ndarray, float]] = None,
    maxit: int = 50,
    tolerance: float = 1e-10,
    chunk_size_cells=1e6,
    chunk_size_genes=1e6,
):
    """
    Computes the average log2 counts per million per feature over all observations.
    The method is a python derivative of edgeR's aveLogCPM method.

    :param x: the counts data.
    :param size_factors: Optional size factors. This is equivalent to edgeR's offsets.
    :param dispersion: Optional fixed dispersion parameter used during the calculation.
    :param prior_count: The count added to x prior to the calculation.
    :param weights: Optional weights per feature (currently unsupported and ignored).
    :param maxit: The maximum number of iterations during Newton-Raphson approximation.
    :param tolerance: The minimal change used as a stopping criterion during NR approximation.
    :param chunk_size_cells: chunk size over the observation axis when using dask.
    :param chunk_size_genes: chunk size over the feature axis when using dask.
    """

    if weights is None:
        weights = 1.0
    if isinstance(dispersion, float):
        dispersion = np.full((1, x.shape[1]), dispersion, dtype=float)
    if size_factors is None:
        size_factors = np.full((x.shape[0], 1), np.log(1.0))

    adjusted_prior, adjusted_size_factors = add_priors(prior_count, size_factors)
    x = x + adjusted_prior
    avg_cpm_model = NBModel(
        InputDataGLM(
            data=x,
            design_loc=np.ones((x.shape[0], 1)),
            design_loc_names=["Intercept"],
            size_factors=adjusted_size_factors,
            design_scale=np.ones((x.shape[0], 1)),
            design_scale_names=["Intercept"],
            as_dask=isinstance(x, dask.array.core.Array),
            chunk_size_cells=chunk_size_cells,
            chunk_size_genes=chunk_size_genes,
        )
    )
    avg_cpm_model_container = ModelContainer(
        model=avg_cpm_model,
        init_theta_location=get_single_group_start(avg_cpm_model.x, avg_cpm_model.size_factors),
        init_theta_scale=np.log(1 / dispersion),
        chunk_size_genes=chunk_size_genes,
        dtype=x.dtype,
    )

    fit_single_group(avg_cpm_model_container, maxit=maxit, tolerance=tolerance)
    # Convert the natural-log intercept to the log2 counts-per-million scale.
    output = (avg_cpm_model_container.theta_location + np.log(1e6)) / np.log(2)

    return output


def add_priors(prior_count: int, size_factors: np.ndarray):
    """Scale the prior count by each observation's relative size factor and adjust the size factors accordingly."""
    factors = np.exp(size_factors)
    avg_factors = np.mean(factors)
    adjusted_priors = prior_count * factors / avg_factors
    adjusted_size_factors = np.log(factors + 2 * adjusted_priors)

    return adjusted_priors, adjusted_size_factors
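A worked example of the two conversions in this file (all numbers hypothetical): `add_priors` scales the prior count by each observation's relative library size, so the priors average out to `prior_count`, and the fitted natural-log intercept is then shifted to the log2 counts-per-million scale:

```python
import numpy as np

# Hypothetical library sizes; size_factors are their logs, matching the
# offset convention used by calculate_avg_log_cpm above.
lib_sizes = np.array([1e4, 2e4, 5e3, 1e4])
size_factors = np.log(lib_sizes)[:, None]

prior_count = 2
factors = np.exp(size_factors)
adjusted_priors = prior_count * factors / np.mean(factors)    # as in add_priors
adjusted_size_factors = np.log(factors + 2 * adjusted_priors)

# Final conversion used above: natural-log intercept -> log2 CPM.
mu, lib = 250.0, 1e4                    # hypothetical fitted mean and library size
theta_location = np.log(mu / lib)
log2_cpm = (theta_location + np.log(1e6)) / np.log(2)
```

Larger libraries receive a proportionally larger prior, which keeps the prior's influence on log-CPM comparable across observations; the `(theta + ln 1e6) / ln 2` step is just `log2(mu / lib * 1e6)` expressed in natural logs.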