Ping @ajgates42 because you are the last person committing here
I was looking through the JMLR paper on clustering metrics and was pointed to the repo here.
I noticed in several places the mpmath.stirling2 method is used.
While this method is useful, it is not as performant as the scipy version for some of the use cases.
One reason is the vectorization must be done outside the mpmath.stirling2 function, e.g. the Stirling2 table
needs to be rebuilt/recalculated for each (n,k) argument pair. Also because the function mpmath.stirling2 uses the inclusion exlcusion formula, it can be faster for single (n,k) values as compared to the scipy version.
For example timings of a single call to mpmath.stirling2(9,3) versus scipy.stirling2(9,3) will show that mpmath is faster.
This is also true for (5,3) and for succesive pairs like often happens in the sampling routines.
For example: mpmath.stirling2(9,3), mpmath.stirling2(8,2) is faster than the scipy equivalent.
However, in the call to _random_partition_num_iterator in the recursive DFS for generate_random_partition_num
the computation becomes slower for larger (n,k), values especially when considering that multiple partitions may be requested in a simulation or computational experiment. In these scenarios, the Stirling2 table of values is rebuilt multiple times which slows the method down and preventing the scaling to even bigger (n,k) values (and better partition sample coverage too.
Here is some notes and data to give evidence for claims:
Note 1:
For n=100 elements both methods work but for n=1000 elements the recursion depth is exceeded for the current clusim package implementation. The existing implementation will still execute at higher values of n only if the max recursion limit is bumped up, so we do that the scipy implementation does not require the recursion depth to be increased for n=1000 to execute successfully.
Note 2: 82.3% speedup for simulation/draws of partitions
setup_code_scipy = """
import scipy.special as sps
import numpy as np
def random_partition(n, k, number_samples):
## one time costs
N = np.array([[i] for i in range(1, n+1)])
triangle = sps.stirling2(N, np.array(list(range(1, n+1))), exact=True)
samples = []
original_n_k = (n, k)
for i in range(number_samples):
_n, _k = original_n_k
assignments = [_k]
# handle the offsets correctly
_n -= 1
_k -= 1
while _n > 0:
u = np.random.uniform(0, 1)
if u < triangle[_n-1][_k-1] / triangle[_n][_k]:
assignments.append(_k)
_k -= 1
else:
if _k > 1:
assignments.append(np.random.randint(1, _k))
else:
assignments.append(1)
_n -= 1
assignments.reverse()
# codeword = ','.join(str(val) for val in assignments)
samples.append(assignments)
return samples
"""
setup_code_clusim = """
import mpmath
import numpy as np
def _random_partition_num_iterator(n_elements, n_clusters):
assert n_clusters <= n_elements
if n_elements == 1:
current_partition = [[0]]
else:
stirling_prob = mpmath.stirling2(n_elements - 1, n_clusters - 1) / mpmath.stirling2(n_elements, n_clusters)
if np.random.random() < stirling_prob:
current_partition = _random_partition_num_iterator(n_elements=n_elements - 1, n_clusters=n_clusters - 1)
current_partition.append([n_elements - 1])
else:
current_partition = _random_partition_num_iterator(n_elements=n_elements - 1, n_clusters=n_clusters)
current_clu = np.random.randint(n_clusters)
current_partition[current_clu].append(n_elements - 1)
return current_partition
"""
# this line takes about 25-30 seconds on my machine
nsamp, n, k = 100, 1000, 50
a = random_partition(n, k, nsamp)
b = [_random_partition_num_iterator(n, k) for i in range(nsamp)]
timing_result_scipy = timeit.timeit(stmt='nsamp, n, k = 100, 1000, 50; random_partition(n, k, nsamp)', setup=setup_code_scipy, number=10)
print(f"Total time scipy: {timing_result_scipy} seconds")
# Total time scipy: 26.251820417004637 seconds
timing_result_clusim = timeit.timeit(stmt='nsamp, n, k = 100, 1000, 50; [_random_partition_num_iterator(n,k) for _ in range(nsamp)]', setup=setup_code_clusim, number=10)
print(f"Total time clusim: {timing_result_clusim} seconds")
running the above snippets gives me (on an M4 mac laptop)
Total time clusim: 148.07740566704888 seconds
reduces runtime by >>> (148.077405 - 26.251820417004637) / 148.077405 = 0.822715555982328 or 82.3%
which is pretty good.
To get a better idea of where speedups and breakevens happens I ran some more values
I organized them into the table here for easier inspection and comparison
There is some reduction even at modestly small values.
For the 2019 JOSS paper there is a section with:
compute clusters for 1024 elements on 32 clusters over 100 runs ...
for this scenario my implementation clearly wins over the mpmath implementation
n, k, nsamp = 1024, 32, 100
Running the timeit snippet above with these values is the last entry of the second table beneath.`
All runtimes are given in seconds-this is what timeit measures but defauly on my machine.
| n |
k |
nsamp |
time-scipy |
time-clusim |
%reduction |
| 100 |
50 |
10 |
0.1679379 |
0.2164677 |
22.42 |
| 100 |
50 |
9 |
0.1644772 |
0.188570 |
12.78 |
| 100 |
50 |
8 |
0.15831120 |
0.1672997 |
5.37 |
| 100 |
50 |
5 |
0.1707545 |
0.10551958 |
-61.8 |
| 100 |
50l |
1 |
0.153685 |
0.0211916 |
-625 |
So here the break even point is generating around 8-9 samples for n,k = 100,50 anything more than that and you are
constrained by the recalculation of the Stirling2 triangle.
some other examples:
| n |
k |
nsamp |
time-scipy |
time-clusim |
%reduction |
| 50 |
10 |
50 |
0.101605 |
0.1627458 |
37.57 |
| 50 |
10 |
25 |
0.0711955 |
0.080919 |
12.01 |
| 1024 |
32 |
100 |
27.423443667 |
82.5327272 |
66.77 |
The last entry here corresponds to values from the JOSS 2019 paper you wrote so it seems that the approach I'm proposing would enable those simulations to both scale in (n,k,nsamples) and also for larger values too.
Is there any interest in putting this more performance version into clusim in lieue of the existing DFS?
PS: I'm not sure how the mpmath library calculates Bell numbers but there is also the possibility that a row sum over the k from scipy will be faster for certain n values and call patterns but I haven't looked into that and figured I'd post these gains to gauge interest.
Ping @ajgates42 because you are the last person committing here
I was looking through the JMLR paper on clustering metrics and was pointed to the repo here.
I noticed in several places the
mpmath.stirling2method is used.While this method is useful, it is not as performant as the scipy version for some of the use cases.
One reason is the vectorization must be done outside the
mpmath.stirling2function, e.g. the Stirling2 tableneeds to be rebuilt/recalculated for each
(n,k)argument pair. Also because the functionmpmath.stirling2uses the inclusion exlcusion formula, it can be faster for single(n,k)values as compared to the scipy version.For example timings of a single call to
mpmath.stirling2(9,3)versusscipy.stirling2(9,3)will show that mpmath is faster.This is also true for (5,3) and for succesive pairs like often happens in the sampling routines.
For example:
mpmath.stirling2(9,3), mpmath.stirling2(8,2)is faster than the scipy equivalent.However, in the call to _random_partition_num_iterator in the recursive DFS for generate_random_partition_num
the computation becomes slower for larger
(n,k), values especially when considering that multiple partitions may be requested in a simulation or computational experiment. In these scenarios, the Stirling2 table of values is rebuilt multiple times which slows the method down and preventing the scaling to even bigger(n,k)values (and better partition sample coverage too.Here is some notes and data to give evidence for claims:
Note 1:
For
n=100elements both methods work but forn=1000elements the recursion depth is exceeded for the currentclusimpackage implementation. The existing implementation will still execute at higher values ofnonly if the max recursion limit is bumped up, so we do that the scipy implementation does not require the recursion depth to be increased forn=1000to execute successfully.Note 2: 82.3% speedup for simulation/draws of partitions
running the above snippets gives me (on an M4 mac laptop)
Total time clusim: 148.07740566704888 seconds reduces runtime by >>> (148.077405 - 26.251820417004637) / 148.077405 = 0.822715555982328 or 82.3%which is pretty good.
To get a better idea of where speedups and breakevens happens I ran some more values
I organized them into the table here for easier inspection and comparison
There is some reduction even at modestly small values.
For the 2019 JOSS paper there is a section with:
Running the timeit snippet above with these values is the last entry of the second table beneath.`
All runtimes are given in seconds-this is what timeit measures but defauly on my machine.
So here the break even point is generating around 8-9 samples for
n,k = 100,50anything more than that and you areconstrained by the recalculation of the Stirling2 triangle.
some other examples:
The last entry here corresponds to values from the JOSS 2019 paper you wrote so it seems that the approach I'm proposing would enable those simulations to both scale in
(n,k,nsamples)and also for larger values too.Is there any interest in putting this more performance version into clusim in lieue of the existing DFS?
PS: I'm not sure how the
mpmathlibrary calculates Bell numbers but there is also the possibility that a row sum over thekfrom scipy will be faster for certainnvalues and call patterns but I haven't looked into that and figured I'd post these gains to gauge interest.