Improve accuracy of summary statistic methods for transformed distributions #73
The Taylor series approximation used to compute the mean of log-normal distributions is appropriate only for small variances. The log-normal distribution is common enough to justify a dedicated distribution, but I'm concerned about how this inaccuracy might affect other (non-log) transformations. One option could be to sample the mean. @robjhyndman - do you have any suggestions for improving this approximation?
The Taylor series approximation up to second derivatives has the nice property of being very general, provided the transformation is differentiable, but it is not necessarily very accurate. I can think of three possible approaches:
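For reference, the second-order approximation discussed here computes E[g(X)] ≈ g(m) + g''(m)·v/2 for a random variable X with mean m and variance v. A minimal sketch of the idea (written in Python purely for illustration; the helper name `taylor_mean` is mine, not part of the package):

```python
import math

def taylor_mean(g, d2g, m, v):
    # Second-order Taylor approximation of E[g(X)]:
    #   E[g(X)] ~ g(m) + g''(m) * v / 2
    # (hypothetical helper, not distributional's API)
    return g(m) + d2g(m) * v / 2

# exp() of a standard normal: the exact mean is exp(m + v/2)
m, v = 0.0, 1.0
approx = taylor_mean(math.exp, math.exp, m, v)  # exp(0) + exp(0) * 0.5 = 1.5
exact = math.exp(m + v / 2)                     # exp(0.5) ~ 1.6487
print(approx, exact)
```

Even for a standard normal the approximation is noticeably off (1.5 vs ~1.65), which is the inaccuracy discussed in this thread.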
Great, thanks for the suggestions.
OK. To start, a more accurate version for exp() has mean exp(m + v/2) and variance [exp(v) - 1]·exp(2m + v), where the original distribution has mean m and variance v. But if we are going to allow special cases, then perhaps identify when exp() is applied to a normal distribution, and save it as a log-normal rather than t(N).
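These closed-form log-normal results are easy to check numerically. A quick sketch (in Python for illustration only) comparing the formulas against a Monte Carlo estimate:

```python
import math
import random

# Underlying normal: mean m, standard deviation 1.5 (so variance v)
m, v = 2.0, 1.5 ** 2

# Closed-form log-normal moments from the formulas above
mean_exact = math.exp(m + v / 2)
var_exact = (math.exp(v) - 1) * math.exp(2 * m + v)

# Monte Carlo check: exponentiate normal draws and average
random.seed(1)
draws = [math.exp(random.gauss(m, math.sqrt(v))) for _ in range(200_000)]
mean_mc = sum(draws) / len(draws)

print(mean_exact, mean_mc)  # the two means should be close
```

The Monte Carlo estimate is noisy (the log-normal tail is heavy), but with 200,000 draws it agrees with exp(m + v/2) to well within sampling error.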
I've added the log-normal distribution, and a shortcut between it and the normal:

```r
library(distributional)
exp(dist_normal())
#> <distribution[1]>
#> [1] lN(0, 1)
log(dist_lognormal())
#> <distribution[1]>
#> [1] N(0, 1)
```

Created on 2021-11-12 by the reprex package (v2.0.0)
Following up on the discussion started in #102
Yes, I did notice they use Taylor series approximations. But I'm not sure that they do work fine in most cases. The examples I posted use fairly typical parameter ranges, and the error is huge. When you say they work fine in most cases, do you have any results on the expected error under different transformations and distributions? The way it is now, I personally wouldn't use this functionality. Maybe the simplest fix is to have an argument to …
Btw, the speed issue is greatly improved by now having the symbolic derivatives. For example, on the master branch:

```r
mean_new <- function(x, rel.tol = .Machine$double.eps^0.5, ...) {
  fun <- function(at, dist) {
    unlist(density(dist, at)) * at
  }
  integrate(fun, -Inf, Inf, dist = x, rel.tol = rel.tol)$value
}

trans_lnorm <- exp(dist_wrap('norm', mean = 2, sd = 1.5))
microbenchmark::microbenchmark(mean(trans_lnorm),
                               mean_new(trans_lnorm))
#> Unit: microseconds
#>                  expr        min          lq        mean      median          uq        max neval
#>     mean(trans_lnorm)    514.468    542.4095    581.0446    575.1685    601.8595    902.205   100
#> mean_new(trans_lnorm) 103720.283 105202.5765 109171.2150 106897.0450 109197.6370 175162.168   100
```

And on the symbolic derivatives branch:

```r
#> Unit: microseconds
#>                  expr       min        lq      mean    median
#>     mean(trans_lnorm)   540.093   553.828   606.679   568.629
#> mean_new(trans_lnorm) 10147.090 10281.221 10995.819 10466.890
```
And using a low `rel.tol` in `integrate()`, such as 0.1, is only ~7 times slower than the Taylor series, but produces ~2000 times smaller error:

```r
#>                                           type     mean microseconds
#> 1                                   analytical 22.75990      0.58548
#> 2                                Taylor series 15.70174    627.20898
#> 3 integrate(rel.tol = .Machine$double.eps^0.5) 22.75990  11238.07048
#> 4                    integrate(rel.tol = 1e-5) 22.75990   8340.58449
#> 5                    integrate(rel.tol = 1e-3) 22.75982   6154.55551
#> 6                    integrate(rel.tol = 1e-1) 22.75512   4015.56993
```
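The "analytical" and "taylor series" rows above can be reproduced without the package at all. A sketch (in Python for illustration; the crude Riemann-sum integration stands in for R's adaptive `integrate()`) of the three estimates of the mean of exp(X) for X ~ N(2, 1.5):

```python
import math

m, s = 2.0, 1.5
v = s ** 2

# Exact log-normal mean
analytic = math.exp(m + v / 2)              # ~22.7599

# Second-order Taylor approximation of E[exp(X)]:
#   exp(m) + exp(m) * v / 2
taylor = math.exp(m) + math.exp(m) * v / 2  # ~15.7017

# Numerical integration of E[exp(X)] = integral of exp(x) * phi(x; m, s) dx,
# here a simple fixed-grid Riemann sum over +/- 10 standard deviations
def normal_pdf(x):
    return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

lo, hi, n = m - 10 * s, m + 10 * s, 200_000
h = (hi - lo) / n
integral = sum(math.exp(lo + i * h) * normal_pdf(lo + i * h)
               for i in range(n + 1)) * h

print(analytic, taylor, integral)
```

This reproduces the gap in the table: the Taylor value (~15.70) is badly off, while numerical integration recovers the analytic mean (~22.76) to high precision.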
I'm open to changing it; there are many options here. For example, I also experimented with … Here are some other examples of the Taylor series approach with different transformations:

```r
library(distributional)
mu = 15
sd = 2

x <- dist_normal(mu, sd)^2
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 229.0715
mean(x)
#> [1] 229

x <- sqrt(dist_normal(mu, sd))
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 3.864325
mean(x)
#> [1] 3.864377

bc <- scales::boxcox_trans(0.5)
x <- dist_transformed(dist_normal(mu, sd), bc$transform, bc$inverse)
mean(x)
#> [1] 5.728753
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 5.728844
```

Created on 2024-04-19 with reprex v2.0.2
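A note on why the first example above is so accurate: for a quadratic transformation the second-order Taylor expansion is exact, since all third- and higher-order derivatives of g(x) = x² vanish, giving E[X²] = m² + v with no approximation error. A tiny sketch (Python for illustration):

```python
mu, sd = 15.0, 2.0

# For g(x) = x^2 the second-order Taylor approximation
#   E[g(X)] ~ g(mu) + g''(mu) * var / 2 = mu^2 + var
# is exact, because g''' and all higher derivatives are zero.
taylor = mu ** 2 + sd ** 2
print(taylor)  # 229.0, matching mean(x) in the reprex above
```

So the Taylor approach is exact for polynomial transformations of degree ≤ 2, nearly exact for locally flat ones like sqrt() at these parameters, and only breaks down badly for strongly convex inverses like exp().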
I guess the issue is much more pronounced when the inverse function is …
Yes, that's correct. This is why we have …
I'm not sure of the exact problem, but it seems that the `mean()` of transformed distributions is not always correct. Here's an example of `exp()` applied to `dist_normal()`, which should yield a log-normal distribution, giving incorrect means:

I'm not sure of the issue. It does appear to work in some cases: