Improve accuracy of summary statistic methods for transformed distributions #73
The Taylor series approximation used to compute the mean of log-normal distributions is appropriate only for small variances. The log-normal distribution is common enough to justify a dedicated distribution, but I'm concerned about how this inaccuracy might affect other (non-log) transformations. One option could be to sample the mean. @robjhyndman - do you have any suggestions for improving this approximation?
The Taylor series approximation up to second derivatives has the nice property of being very general, provided the transformation is differentiable, but it is not necessarily very accurate. I can think of three possible approaches:
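For reference, the second-order approximation discussed here computes E[g(X)] ≈ g(m) + g''(m)·v/2 for a random variable X with mean m and variance v. A minimal sketch of the idea (written in Python purely for illustration; the helper name `taylor_mean` is mine, not part of the package):

```python
import math

def taylor_mean(g, d2g, m, v):
    # Second-order Taylor approximation of E[g(X)]:
    #   E[g(X)] ~ g(m) + g''(m) * v / 2
    # (hypothetical helper, not distributional's API)
    return g(m) + d2g(m) * v / 2

# exp() of a standard normal: the exact mean is exp(m + v/2)
m, v = 0.0, 1.0
approx = taylor_mean(math.exp, math.exp, m, v)  # exp(0) + exp(0) * 0.5 = 1.5
exact = math.exp(m + v / 2)                     # exp(0.5) ~ 1.6487
print(approx, exact)
```

Even for a standard normal the approximation is noticeably off (1.5 vs ~1.65), which is the inaccuracy discussed in this thread.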
Great, thanks for the suggestions.
OK. To start, a more accurate version for exp() has mean exp(m + v/2) and variance [exp(v) - 1]·exp(2m + v), where the original distribution has mean m and variance v. But if we are going to allow special cases, then perhaps identify when exp() is applied to a normal distribution, and save it as a log-normal rather than t(N).
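These closed-form log-normal results are easy to check numerically. A quick sketch (in Python for illustration only) comparing the formulas against a Monte Carlo estimate:

```python
import math
import random

# Underlying normal: mean m, standard deviation 1.5 (so variance v)
m, v = 2.0, 1.5 ** 2

# Closed-form log-normal moments from the formulas above
mean_exact = math.exp(m + v / 2)
var_exact = (math.exp(v) - 1) * math.exp(2 * m + v)

# Monte Carlo check: exponentiate normal draws and average
random.seed(1)
draws = [math.exp(random.gauss(m, math.sqrt(v))) for _ in range(200_000)]
mean_mc = sum(draws) / len(draws)

print(mean_exact, mean_mc)  # the two means should be close
```

The Monte Carlo estimate is noisy (the log-normal tail is heavy), but with 200,000 draws it agrees with exp(m + v/2) to well within sampling error.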
I've added the log-normal distribution, and a shortcut between it and the normal:

```r
library(distributional)
exp(dist_normal())
#> <distribution[1]>
#> [1] lN(0, 1)
log(dist_lognormal())
#> <distribution[1]>
#> [1] N(0, 1)
```

Created on 2021-11-12 by the reprex package (v2.0.0)
Following up on the discussion started in #102
Yes, I did notice they use Taylor series approximations. But I'm not sure that they do work fine in most cases. The examples I posted use fairly typical parameter ranges, and the error is huge. When you say they work fine in most cases, do you have any results on the expected error under different transformations and distributions? The way it is now, I personally wouldn't use this functionality. Maybe the simplest fix is to have an argument to …
Btw, the speed issue is greatly improved by now having the symbolic derivatives. For example, on the master branch:

```r
mean_new <- function(x, rel.tol = .Machine$double.eps^0.5, ...) {
  fun <- function(at, dist) {
    unlist(density(dist, at)) * at
  }
  integrate(fun, -Inf, Inf, dist = x, rel.tol = rel.tol)$value
}

trans_lnorm <- exp(dist_wrap('norm', mean = 2, sd = 1.5))
microbenchmark::microbenchmark(mean(trans_lnorm),
                               mean_new(trans_lnorm))
#> Unit: microseconds
#>                  expr        min          lq        mean      median          uq        max neval
#>     mean(trans_lnorm)    514.468    542.4095    581.0446    575.1685    601.8595    902.205   100
#> mean_new(trans_lnorm) 103720.283 105202.5765 109171.2150 106897.0450 109197.6370 175162.168   100
```

And on the symbolic derivatives branch:

```r
#> Unit: microseconds
#>                  expr       min        lq      mean    median
#>     mean(trans_lnorm)   540.093   553.828   606.679   568.629
#> mean_new(trans_lnorm) 10147.090 10281.221 10995.819 10466.890
```
And using a low `rel.tol` in `integrate()`, such as 0.1, is only ~7 times slower than the Taylor series, but produces ~2000 times smaller error:

```r
#>                                           type     mean microseconds
#> 1                                   analytical 22.75990      0.58548
#> 2                                Taylor series 15.70174    627.20898
#> 3 integrate(rel.tol = .Machine$double.eps^0.5) 22.75990  11238.07048
#> 4                    integrate(rel.tol = 1e-5) 22.75990   8340.58449
#> 5                    integrate(rel.tol = 1e-3) 22.75982   6154.55551
#> 6                    integrate(rel.tol = 1e-1) 22.75512   4015.56993
```
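The "analytical" and "taylor series" rows above can be reproduced without the package at all. A sketch (in Python for illustration; the crude Riemann-sum integration stands in for R's adaptive `integrate()`) of the three estimates of the mean of exp(X) for X ~ N(2, 1.5):

```python
import math

m, s = 2.0, 1.5
v = s ** 2

# Exact log-normal mean
analytic = math.exp(m + v / 2)              # ~22.7599

# Second-order Taylor approximation of E[exp(X)]:
#   exp(m) + exp(m) * v / 2
taylor = math.exp(m) + math.exp(m) * v / 2  # ~15.7017

# Numerical integration of E[exp(X)] = integral of exp(x) * phi(x; m, s) dx,
# here a simple fixed-grid Riemann sum over +/- 10 standard deviations
def normal_pdf(x):
    return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

lo, hi, n = m - 10 * s, m + 10 * s, 200_000
h = (hi - lo) / n
integral = sum(math.exp(lo + i * h) * normal_pdf(lo + i * h)
               for i in range(n + 1)) * h

print(analytic, taylor, integral)
```

This reproduces the gap in the table: the Taylor value (~15.70) is badly off, while numerical integration recovers the analytic mean (~22.76) to high precision.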
I'm open to changing it; there are many options here. For example, I also experimented with … Here are some other examples of the Taylor series approach with different transformations:

```r
library(distributional)
mu = 15
sd = 2

x <- dist_normal(mu, sd)^2
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 229.0715
mean(x)
#> [1] 229

x <- sqrt(dist_normal(mu, sd))
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 3.864325
mean(x)
#> [1] 3.864377

bc <- scales::boxcox_trans(0.5)
x <- dist_transformed(dist_normal(mu, sd), bc$transform, bc$inverse)
mean(x)
#> [1] 5.728753
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 5.728844
```

Created on 2024-04-19 with reprex v2.0.2
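A note on why the first example above is so accurate: for a quadratic transformation the second-order Taylor expansion is exact, since all third- and higher-order derivatives of g(x) = x² vanish, giving E[X²] = m² + v with no approximation error. A tiny sketch (Python for illustration):

```python
mu, sd = 15.0, 2.0

# For g(x) = x^2 the second-order Taylor approximation
#   E[g(X)] ~ g(mu) + g''(mu) * var / 2 = mu^2 + var
# is exact, because g''' and all higher derivatives are zero.
taylor = mu ** 2 + sd ** 2
print(taylor)  # 229.0, matching mean(x) in the reprex above
```

So the Taylor approach is exact for polynomial transformations of degree ≤ 2, nearly exact for locally flat ones like sqrt() at these parameters, and only breaks down badly for strongly convex inverses like exp().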
I guess the issue is much more pronounced when the inverse function is …
Yes, that's correct. This is why we have …
I'm not sure of the exact problem, but it seems that the `mean()` of transformed distributions is not always correct. Here's an example of `exp()` applied to `dist_normal()`, which should yield a log-normal distribution, giving incorrect means:

I'm not sure of the issue. It does appear to work in some cases: