Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

References and a memory friendly cccrank method #7

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

vgherard
Copy link

@vgherard vgherard commented Feb 9, 2021

Hi, sorry for the long delay.

I've added a few references for the Gumbel-Max stuff I was mentioning (Yellott appears to be the first one discussing the method).

I also propose a third version of the *rank() algorithms which uses only O(size) extra space by keeping a small heap of the best size elements seen so far. Even though this version has worse asymptotic time complexity than the other implementations, it turns out in practice to be faster (at least for the few benchmark points I've tried).

Finally, notice that testing on the error message of base sample.int() will fail if the locale is not English. Left a small comment.

Hope any of this is useful,
Cheers,

Valerio.

EDIT: just noticed that the test-coverage workflow has failed, as in your last commit on the master branch. The error occured during the installation of curl, which in turn tries to install multtest, which was removed from CRAN.

@codecov
Copy link

codecov bot commented Oct 18, 2022

Welcome to Codecov 🎉

Once merged to your default branch, Codecov will compare your coverage reports and display the results in this comment.

Thanks for integrating Codecov - We've got you covered ☂️

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2022

Thanks, I apologize for my slow response.

Checks pass now.

Kudos on finding the Yellott reference. Do you remember which section(s) would be most relevant to understand the application?

I switched to testthat 3, this should make the tests locale-invariant. Would you mind adding a few tests for the new implementation?

@rstub
Copy link

rstub commented Oct 11, 2023

Hi @vgherard, hi @krlmlr!
Over in daqana/dqrng#72 I am implementing weighted sampling for dqrng. For the w/o replacement case I have at the moment implemented a version of crank, but the other algorithms you have implemented here are interesting as well. Before I adapt them and add you as cph: Would you be interested in contributing directly?

BTW, what is the best way to cite "Accelerating weighted random sampling without replacement"? Pre-print with GitHub URL? Or https://ethz.ch/content/dam/ethz/special-interest/baug/ivt/ivt-dam/vpl/reports/1101-1200/ab1141.pdf?

@vgherard
Copy link
Author

vgherard commented Oct 16, 2023

@krlmlr I completely forgot about this merge request, I'm sorry.

In case this is still useful: I rapidly went through the reference, it appears that the relevant part is contained in the proof of Lemma 6, page 123 of the journal. It's an elementary proof of the fact that the max of "Gumbel" variables is equivalent to a (single) weighted sampling, which is the essence of the sampling algorithm.

Regarding testing: I see that the current version of wrswoR automatically tests all the algorithms implemented, while I see no special algorithm-specific test. Did you have something special in mind?

Thanks and sorry again for the long silence.


@rstub thank you, but from my side I'm quite short of time as you can see from the slow ping-pong here 👀

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

Thanks @rstub and @vgherard.

I don't mind moving the code to dqrng, but you could also forward to this package? The link you shared is as good as it gets, it has also made it into a chapter of my thesis: https://www.research-collection.ethz.ch/handle/20.500.11850/183170 . I haven't tried hard enough to publish this.

I don't have the capacity to contribute here in the foreseeable future, but I can commit to keeping this alive and on CRAN.

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

I force-pushed. @vgherard: can you please also add documentation on how cccrank is different from the other options? Something along the lines of what you have written in the OP should be good enough.

Can we somehow test that the new implementation is correct? I burned several CPU years to get the numbers in my paper...

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

The results in the paper were computed with the help of https://github.com/krlmlr/wrswoR.benchmark.

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

Now looking into my old disk for wrswoR.correctness ...

@krlmlr
Copy link
Owner

krlmlr commented Oct 18, 2023

Found them! Let me know if you're interested in the data, it's rather large. The code is smaller.

@vgherard
Copy link
Author

vgherard commented Oct 19, 2023

Hi @krlmlr , I added a few words on the space complexity of CCCrank in the documentation as you suggested.

I didn't mention time complexity, because I realized that you have used std::partial_sort, which according to the documentation has $N \log (\text{size})$ time complexity, in comparison to the $\text{size}\log (N)$ that could be achieved in principle, as you mention in your thesis.

If I'm not confused, this makes the time complexity of all _rank() functions equal (to $N \log (\text{size})$ ), which would explain my observations in the OP.

I'm looking into validity tests and benchmarks right now - we may also need to fix the statement "*_crank() seems to be faster on most inputs" in the documentation.

@vgherard
Copy link
Author

Concerning testing algorithm correctness: I set up a simple script that simulates hypothesis testing on the equality of sampling distributions (CCCRank vs. CRank algorithms) - I believe this is more or less similar to the approach you used in your thesis.

I did some basic exploration and (thankfully) there's no signal of a striking difference between sampling distributions, although in order to be quantitative we should analyse power.

If you think it's necessary, we can try to run a systematic analysis, although honestly it seems a bit overkill, since the C++ codes defining these functions are very similar.

# ------------------------------------------------ Testing algorithm correctness
test_sample_int <- function(
  f1 = wrswoR::sample_int_cccrank,
  f2 = wrswoR::sample_int_crank,
  
  n = 50,
  size = 25,
  reps = 1e3,
  
  prob = pexp(3 * ppoints(n)),
  delta_prob = 0,
  
  test_stat = ks.test,
  simulate.p.value = TRUE,
  B = 1e3
) {
  s1 <- replicate(reps, f1(n, size, prob))
  s2 <- replicate(reps, f2(n, size, prob + delta_prob))
  
  test <- test_stat(s1, s2, simulate.p.value = simulate.p.value, B = B)
  return(test$p.value)
}

n <- 50

replicate(100, test_sample_int(reps = 1e3)) -> dd
mean(dd < 0.05)  # FPR is actually conservative (less than .05)
summary(dd)

delta <- 1:n / n
replicate(100, test_sample_int(reps = 1e3, delta_prob = delta)) -> dd1
mean(dd1 < 0.05)  # FNR is very small with such a big effect
summary(dd1)

@rstub
Copy link

rstub commented Oct 21, 2023

@krlmlr I cannot just forward to wrswoR because the main feature of dqrng are the faster RNGs it provides. With the current development release it is possible to register it as "user-supplied RNG", so that e.g. wrswoR would also be using them. But this indirection does cost some performance. At least this is the case for other sampling methods. I haven't done any comparisons here yet. Anyway, I will adapt the code and run some benchmarks.

@krlmlr
Copy link
Owner

krlmlr commented Oct 22, 2023

Thanks. Just performing a small sanity check:

set.seed(20231022)
wrswoR::sample_int_rank(10, 10, 3^(10:1))
#>  [1]  1  2  3  4  5  6  8  7 10  9
wrswoR::sample_int_crank(10, 10, 3^(10:1))
#>  [1]  1  2  3  4  5  6  7  8  9 10
wrswoR::sample_int_ccrank(10, 10, 3^(10:1))
#>  [1]  3  1  4  2  5  7  6  8 10  9
wrswoR::sample_int_cccrank(10, 10, 3^(10:1))
#>  [1] 10  9  8  6  5  7  4  3  1  2

Created on 2023-10-22 with reprex v2.0.2

Does cccrank have a different interpretation of prob ?

@rstub
Copy link

rstub commented Oct 22, 2023

I think the min-heap is filled correctly. But it is then walked from the minimum to the maximum, i.e. the output vector should be filled in reverse order, e.g.

  // Fill result with 'size' indexes k with largest g[k] - equivalent to
  // sampling w/o replacement (Gumbel-Max trick).
  // T ~ O(size * log(size))
  Rcpp::IntegerVector res(size);
  for (size_t k = 0; k < size; ++k) {
    res[size - k - 1] = H.top().index + 1;
    H.pop();
  }

@vgherard
Copy link
Author

😱 thanks for spotting this. It's as you say, the min-heap was correctly populated with the sampled element, and the set of returned indexes was correctly distributed - that's why it would have escaped the test above.

I have made the correction you suggested above, now the sequences look fine:

set.seed(20231023)
wrswoR::sample_int_rank(10, 10, 3^(10:1))
#>  [1]  1  2  3  4  5  6  7  8  9 10
wrswoR::sample_int_crank(10, 10, 3^(10:1))
#>  [1]  3  2  4  1  7  5  6  8  9 10
wrswoR::sample_int_ccrank(10, 10, 3^(10:1))
#>  [1]  2  1  3  4  6  7  5  8  9 10
wrswoR::sample_int_cccrank(10, 10, 3^(10:1))
#>  [1]  1  2  5  3  4  6  7  8  9 10

Created on 2023-10-23 with reprex v2.0.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants