-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
References and a memory friendly cccrank method #7
base: main
Are you sure you want to change the base?
Conversation
Welcome to Codecov 🎉Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment. Thanks for integrating Codecov - We've got you covered ☂️ |
Thanks, I apologize for my slow response. Checks pass now. Kudos on finding the Yellott reference. Do you remember which section(s) would be most relevant to understand the application? I switched to testthat 3, this should make the tests locale-invariant. Would you mind adding a few tests for the new implementation? |
Hi @vgherard, hi @krlmlr! BTW, what is the best way to cite "Accelerating weighted random sampling without replacement"? Pre-print with GitHub URL? Or https://ethz.ch/content/dam/ethz/special-interest/baug/ivt/ivt-dam/vpl/reports/1101-1200/ab1141.pdf? |
@krlmlr I completely forgot about this merge request, I'm sorry. In case this is still useful: I rapidly went through the reference, it appears that the relevant part is contained in the proof of Lemma 6, page 123 of the journal. It's an elementary proof of the fact that the max of "Gumbel" variables is equivalent to a (single) weighted sampling, which is the essence of the sampling algorithm. Regarding testing: I see that the current version of Thanks and sorry again for the long silence. @rstub thank you, but from my side I'm quite short of time as you can see from the slow ping-pong here 👀 |
I don't mind moving the code to dqrng, but you could also forward to this package? The link you shared is as good as it gets, it has also made it into a chapter of my thesis: https://www.research-collection.ethz.ch/handle/20.500.11850/183170 . I haven't tried hard enough to publish this. I don't have the capacity to contribute here in the foreseeable future, but I can commit to keeping this alive and on CRAN. |
I force-pushed. @vgherard: can you please also add documentation on how Can we somehow test that the new implementation is correct? I burned several CPU years to get the numbers in my paper... |
The results in the paper were computed with the help of https://github.com/krlmlr/wrswoR.benchmark. |
Now looking into my old disk for wrswoR.correctness ... |
Found them! Let me know if you're interested in the data, it's rather large. The code is smaller. |
Hi @krlmlr , I added a few words on the space complexity of CCCrank in the documentation as you suggested. I didn't mention time complexity, because I realized that you have used If I'm not confused, this makes the time complexity of all I'm looking into validity tests and benchmarks right now - we may also need to fix the statement " |
Concerning testing algorithm correctness: I set up a simple script that simulates hypothesis testing on the equality of sampling distributions (CCCRank vs. CRank algorithms) - I believe this is more or less similar to the approach you used in your thesis. I did some basic exploration and (thankfully) there's no signal of a striking difference between sampling distributions, although in order to be quantitative we should analyse power. If you think it's necessary, we can try to run a systematic analysis, although honestly it seems a bit overkill, since the C++ codes defining these functions are very similar.
|
@krlmlr I cannot just forward to wrswoR because the main feature of dqrng are the faster RNGs it provides. With the current development release it is possible to register it as "user-supplied RNG", so that e.g. wrswoR would also be using them. But this indirection does cost some performance. At least this is the case for other sampling methods. I haven't done any comparisons here yet. Anyway, I will adapt the code and run some benchmarks. |
Thanks. Just performing a small sanity check: set.seed(20231022)
wrswoR::sample_int_rank(10, 10, 3^(10:1))
#> [1] 1 2 3 4 5 6 8 7 10 9
wrswoR::sample_int_crank(10, 10, 3^(10:1))
#> [1] 1 2 3 4 5 6 7 8 9 10
wrswoR::sample_int_ccrank(10, 10, 3^(10:1))
#> [1] 3 1 4 2 5 7 6 8 10 9
wrswoR::sample_int_cccrank(10, 10, 3^(10:1))
#> [1] 10 9 8 6 5 7 4 3 1 2 Created on 2023-10-22 with reprex v2.0.2 Does |
I think the min-heap is filled correctly. But it is then walked from the minimum to the maximum, i.e. the output vector should be filled in reverse order, e.g. // Fill result with 'size' indexes k with largest g[k] - equivalent to
// sampling w/o replacement (Gumbel-Max trick).
// T ~ O(size * log(size))
Rcpp::IntegerVector res(size);
for (size_t k = 0; k < size; ++k) {
res[size - k - 1] = H.top().index + 1;
H.pop();
} |
😱 thanks for spotting this. It's as you say, the min-heap was correctly populated with the sampled element, and the set of returned indexes was correctly distributed - that's why it would have escaped the test above. I have made the correction you suggested above, now the sequences look fine: set.seed(20231023)
wrswoR::sample_int_rank(10, 10, 3^(10:1))
#> [1] 1 2 3 4 5 6 7 8 9 10
wrswoR::sample_int_crank(10, 10, 3^(10:1))
#> [1] 3 2 4 1 7 5 6 8 9 10
wrswoR::sample_int_ccrank(10, 10, 3^(10:1))
#> [1] 2 1 3 4 6 7 5 8 9 10
wrswoR::sample_int_cccrank(10, 10, 3^(10:1))
#> [1] 1 2 5 3 4 6 7 8 9 10 Created on 2023-10-23 with reprex v2.0.2 |
Hi, sorry for the long delay.
I've added a few references for the Gumbel-Max stuff I was mentioning (Yellott appears to be the first one discussing the method).
I also propose a third version of the *rank() algorithms which uses only O(size) extra space by keeping a small heap of the best
size
elements seen so far. Even though this version has worse asymptotic time complexity than the other implementations, it turns out in practice to be faster (at least for the few benchmark points I've tried).Finally, notice that testing on the error message of base
sample.int()
will fail if the locale is not English. Left a small comment.Hope any of this is useful,
Cheers,
Valerio.
EDIT: just noticed that the test-coverage workflow has failed, as in your last commit on the master branch. The error occured during the installation of
curl
, which in turn tries to installmulttest
, which was removed from CRAN.