Implement weighted sampling #72
* w/o replacement is currently implemented in R
* w/ replacement uses either probabilistic sampling or the alias method
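For readers unfamiliar with the two techniques named for sampling with replacement, here is a minimal sketch in plain R. It assumes "probabilistic sampling" means stochastic acceptance (propose an index uniformly, accept it with probability `prob[k] / max(prob)`). The helper names `sample_stochastic_acceptance`, `build_alias_table` and `sample_alias` are hypothetical and are not part of dqrng, and base R's RNG stands in for dqrng's generators.

```r
# Stochastic acceptance: propose uniformly, accept with prob[k] / max(prob).
# Hypothetical helper for illustration only, not dqrng's implementation.
sample_stochastic_acceptance <- function(m, n, prob) {
  stopifnot(length(prob) == m, all(prob >= 0), any(prob > 0))
  max_prob <- max(prob)
  result <- integer(n)
  for (i in seq_len(n)) {
    repeat {
      k <- sample.int(m, 1L)
      if (runif(1) * max_prob < prob[k]) break
    }
    result[i] <- k
  }
  result
}

# Alias method: build a table once, then each draw costs O(1).
# Hypothetical helper for illustration only.
build_alias_table <- function(prob) {
  m <- length(prob)
  p <- prob / sum(prob) * m          # scale so that the mean weight is 1
  alias <- seq_len(m)
  small <- which(p < 1)
  large <- which(p >= 1)
  while (length(small) > 0 && length(large) > 0) {
    s <- small[length(small)]; small <- small[-length(small)]
    l <- large[length(large)]; large <- large[-length(large)]
    alias[s] <- l                    # the rest of column s is filled from l
    p[l] <- p[l] - (1 - p[s])
    if (p[l] < 1) small <- c(small, l) else large <- c(large, l)
  }
  p[c(small, large)] <- 1            # leftovers equal 1 up to rounding
  list(accept = p, alias = alias)
}

sample_alias <- function(n, table) {
  m <- length(table$accept)
  k <- sample.int(m, n, replace = TRUE)
  use_alias <- runif(n) >= table$accept[k]
  k[use_alias] <- table$alias[k[use_alias]]
  k
}

set.seed(42)
prob <- c(1, 2, 3, 4)
x_sa <- sample_stochastic_acceptance(length(prob), 1e4, prob)
x_al <- sample_alias(1e4, build_alias_table(prob))
# Both sets of empirical frequencies should be close to prob / sum(prob)
rbind(sa = as.vector(table(x_sa)), alias = as.vector(table(x_al))) / 1e4
```

Stochastic acceptance needs no setup but may reject many proposals when a few weights dominate. The alias method pays for building the table up front and is then insensitive to how the weights are distributed.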
Problematic benchmark from #52 looks much better now:

library(dqrng)
m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 22.42ms 25.5ms 38.3
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 7.96ms 8.78ms 114.
m <- 1e1
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 227µs 245µs 3976.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 113µs 125µs 7508.

Created on 2023-10-07 with reprex v2.0.2

However, there is still some potential for improvement in the case of uneven weight distribution:

library(dqrng)
m <- 1e6
n <- 1e4
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 18.3ms 20.5ms 47.5
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 21.7ms 22.5ms 43.0
m <- 1e1
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 161µs 189µs 4914.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 122µs 135µs 7011.

Created on 2023-10-07 with reprex v2.0.2
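One way to see why uneven weights are harder: if stochastic acceptance proposes indices uniformly and accepts with probability `prob[k] / max(prob)`, the expected number of proposals per accepted draw is `max(prob) / mean(prob)`. The sketch below computes that ratio for weight vectors shaped like the two benchmarks above. Base R's `runif()` and `sample()` stand in for `dqrunif()` and `dqsample()`, and `expected_proposals` is a hypothetical helper. Whether this fully explains the timings is an assumption, since the PR may switch methods internally.

```r
# Expected uniform proposals per accepted draw under stochastic acceptance.
# Hypothetical helper for illustration only.
expected_proposals <- function(prob) max(prob) / mean(prob)

set.seed(1)
m <- 1e6
even <- runif(m)                     # stands in for dqrunif(m)
uneven <- sample(m)                  # stands in for dqsample(m)
uneven[which.max(uneven)] <- m * m   # one dominant weight, as in the benchmark

expected_proposals(even)    # close to 2
expected_proposals(uneven)  # several hundred thousand
```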
Similar to the unweighted case: two variants, one using stochastic acceptance (fast when the weights are evenly distributed) and one using the alias method. Both seem to be interesting for selection ratios < 0.5, again as in the unweighted case.
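Assuming this commit concerns weighted sampling without replacement, the stochastic acceptance variant can be sketched as rejection sampling against a set of already selected indices. A logical vector plays the role of the set here, and `sample_wor_rejection` is a hypothetical name, not a dqrng function.

```r
# Weighted sampling without replacement: rejection against a "taken" set
# combined with stochastic acceptance. Hypothetical helper, illustration only.
sample_wor_rejection <- function(m, n, prob) {
  stopifnot(length(prob) == m, n <= sum(prob > 0))
  max_prob <- max(prob)
  taken <- logical(m)                # marks indices that were already selected
  result <- integer(n)
  i <- 0L
  while (i < n) {
    k <- sample.int(m, 1L)           # uniform proposal
    if (!taken[k] && runif(1) * max_prob < prob[k]) {
      taken[k] <- TRUE
      i <- i + 1L
      result[i] <- k
    }
  }
  result
}

set.seed(7)
m <- 1000
prob <- runif(m)
x <- sample_wor_rejection(m, 100, prob)
anyDuplicated(x)  # 0: no index is selected twice
```

As the selection ratio `n / m` grows, more proposals hit indices that are already taken, which is consistent with the remark that the approach is mainly attractive for small ratios.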
Interestingly the methods doing set-based rejection sampling from the last commit have better performance than the exponential rank. At least when |
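For comparison, a minimal sketch of what "exponential rank" may refer to, assuming it means the common technique of drawing one exponential variate per item, dividing by the item's weight, and keeping the `n` smallest keys. The helper name `sample_wor_exp_rank` is hypothetical.

```r
# Weighted sampling without replacement via exponential keys:
# key_i = Exp(1) / prob_i, keep the indices of the n smallest keys.
# Hypothetical helper for illustration only.
sample_wor_exp_rank <- function(m, n, prob) {
  stopifnot(length(prob) == m, n <= sum(prob > 0))
  keys <- rexp(m) / prob             # zero weights yield infinite keys
  order(keys)[seq_len(n)]
}

set.seed(7)
m <- 1000
prob <- runif(m)
x_rank <- sample_wor_exp_rank(m, 100, prob)
anyDuplicated(x_rank)  # 0: no index is selected twice
```

This variant always generates and ranks `m` keys regardless of `n`, whereas set-based rejection only does work proportional to the number of proposals, which could be one reason for the performance difference noted above.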
For unweighted sampling the |
Recreate RcppExports.cpp with current development version of Rcpp to fix WARN on CRAN
Merge branch 'master' into feature/weighted-sampling-2 # Conflicts: # DESCRIPTION # NEWS.md
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718d is merged into main: |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 128a3cd is merged into main: |
This is how benchmark results would change (along with a 95% confidence interval in relative change) if c5c07e5 is merged into main: |
Something to consider here as well: https://notstatschat.rbind.io/2024/08/26/another-way-to-not-sample-with-replacement/ |
w/o replacement is currently implemented in R
fixes #18
fixes #45
fixes #52
n < 1000 * size
cut-over point between bitset and hashset