Make code using stirling2 perform better at scale

Ping @ajgates42  because you are the last person committing here 

I was looking through the JMLR paper on clustering metrics and was pointed to the repo here. 

I noticed in several places the `mpmath.stirling2` method is used. 
While this method is useful, it is not as performant as the scipy version for some of the use cases. 
One reason is the vectorization must be done outside the `mpmath.stirling2` function, e.g. the Stirling2 table
needs to be rebuilt/recalculated for each `(n,k)` argument pair. Also because the function `mpmath.stirling2` uses the inclusion exlcusion formula, it can be faster for single `(n,k)` values as compared to the scipy version. 

For example timings of a single call to `mpmath.stirling2(9,3)` versus `scipy.stirling2(9,3)` will show that mpmath is faster. 
This is also true for (5,3) and for succesive pairs like often happens in the sampling routines. 

For example:` mpmath.stirling2(9,3), mpmath.stirling2(8,2)` is faster than the scipy equivalent. 

However, in the call to [_random_partition_num_iterator](https://github.com/Hoosier-Clusters/clusim/blob/fa7e880a8411d2197fc4b2dfca6ec0a016490ce8/clusim/clugen.py#L385) in the recursive DFS for [generate_random_partition_num](https://github.com/Hoosier-Clusters/clusim/blob/fa7e880a8411d2197fc4b2dfca6ec0a016490ce8/clusim/clugen.py#L409)

the computation becomes slower for larger `(n,k)`, values especially when considering that multiple partitions may be requested in a simulation or computational experiment. In these scenarios, the Stirling2 table of values is rebuilt *multiple* times which slows the method down and preventing the scaling to even bigger `(n,k)` values (and better partition sample coverage too.

Here is some notes and data to give evidence for claims: 

# Note 1: 
For `n=100` elements both methods work but for `n=1000` elements the recursion depth is exceeded for the current `clusim` package implementation. The existing implementation will still execute at higher values of `n` *only if* the max recursion limit is bumped up, so we do that the scipy implementation *does not* require the recursion depth to be increased for `n=1000` to execute successfully. 

# Note 2: 82.3% speedup for simulation/draws of partitions 

```python
setup_code_scipy = """
import scipy.special as sps
import numpy as np

def random_partition(n, k, number_samples):
    ## one time costs
    N = np.array([[i] for i in range(1, n+1)])
    triangle = sps.stirling2(N, np.array(list(range(1, n+1))), exact=True)
    samples = []
    original_n_k = (n, k)
    for i in range(number_samples):
        _n, _k = original_n_k
        assignments = [_k]
        # handle the offsets correctly
        _n -= 1
        _k -= 1
        while _n > 0:
            u = np.random.uniform(0, 1)
            if u < triangle[_n-1][_k-1] / triangle[_n][_k]:
                assignments.append(_k)
                _k -= 1
            else:
                if _k > 1:
                    assignments.append(np.random.randint(1, _k))
                else:
                    assignments.append(1)
            _n -= 1
        assignments.reverse()
        # codeword = ','.join(str(val) for val in assignments)
        samples.append(assignments)
    return samples
"""

setup_code_clusim = """
import mpmath
import numpy as np

def _random_partition_num_iterator(n_elements, n_clusters):
    assert n_clusters <= n_elements
    if n_elements == 1:
        current_partition = [[0]]
    else:
        stirling_prob = mpmath.stirling2(n_elements - 1, n_clusters - 1) / mpmath.stirling2(n_elements, n_clusters)
        if np.random.random() < stirling_prob:
            current_partition = _random_partition_num_iterator(n_elements=n_elements - 1, n_clusters=n_clusters - 1)
            current_partition.append([n_elements - 1])
        else:
            current_partition = _random_partition_num_iterator(n_elements=n_elements - 1, n_clusters=n_clusters)
            current_clu = np.random.randint(n_clusters)
            current_partition[current_clu].append(n_elements - 1)
    return current_partition
"""

# this line takes about 25-30 seconds on my machine
nsamp, n, k = 100, 1000, 50
a = random_partition(n, k, nsamp)
b = [_random_partition_num_iterator(n, k) for i in range(nsamp)]

timing_result_scipy = timeit.timeit(stmt='nsamp, n, k = 100, 1000, 50; random_partition(n, k, nsamp)', setup=setup_code_scipy, number=10)
print(f"Total time scipy: {timing_result_scipy} seconds")
#  Total time scipy: 26.251820417004637 seconds
timing_result_clusim = timeit.timeit(stmt='nsamp, n, k = 100, 1000, 50; [_random_partition_num_iterator(n,k) for _ in range(nsamp)]', setup=setup_code_clusim, number=10)
print(f"Total time clusim: {timing_result_clusim} seconds")
```

running the above snippets gives me (on an M4 mac laptop)
```sh
Total time clusim: 148.07740566704888 seconds
reduces runtime by >>> (148.077405 - 26.251820417004637) / 148.077405 = 0.822715555982328 or 82.3% 
```
which is pretty good.

To get a better idea of where speedups and breakevens happens I ran some more values
I organized them into the table here for easier inspection and comparison
There is some reduction even at modestly small values. 


For the 2019 JOSS paper there is a section with: 
>compute clusters for 1024 elements on 32 clusters over 100 runs ... 
 >for this scenario my implementation clearly wins over the mpmath implementation 
 >`n, k, nsamp = 1024, 32, 100`

Running the timeit snippet above with these values is the last entry of the second table beneath.`

All runtimes are given in seconds-this is what timeit measures but defauly on my machine. 

| n | k | nsamp | time-scipy | time-clusim | %reduction | 
|--------|--------|--------|--------|--------|--------|
| 100 | 50 | 10 | 0.1679379 | 0.2164677 | 22.42 |
| 100 | 50 | 9 | 0.1644772 | 0.188570 | 12.78 |
| 100 | 50 | 8 | 0.15831120 |  0.1672997 |  5.37 | 
| 100 | 50 | 5 | 0.1707545 | 0.10551958 | -61.8 |
| 100 | 50l | 1 | 0.153685 | 0.0211916 | -625 |


So here the break even point is generating around 8-9 samples for `n,k = 100,50` anything more than that and you are
constrained by the recalculation of the Stirling2 triangle.

some other examples: 

| n | k | nsamp | time-scipy | time-clusim | %reduction | 
|--------|--------|--------|--------|--------|--------|
| 50 | 10 | 50 | 0.101605 | 0.1627458 | 37.57 |
| 50 | 10 | 25 | 0.0711955 | 0.080919 | 12.01 |
| 1024 | 32 | 100 | 27.423443667 | 82.5327272 | 66.77 |

The last entry here corresponds to values from the JOSS 2019 paper you wrote so it seems that the approach I'm proposing would enable those simulations to both scale in `(n,k,nsamples)` and also for larger values too. 

Is there any interest in putting this more performance version into clusim in lieue of the existing DFS?  


PS: I'm not sure how the `mpmath` library calculates Bell numbers but there is also the possibility that a row sum over the `k` from scipy will be faster for certain `n` values and call patterns but I haven't looked into that and figured I'd post these gains to gauge interest.

 



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make code using stirling2 perform better at scale #45

Note 1:

Note 2: 82.3% speedup for simulation/draws of partitions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

n	k	nsamp	time-scipy	time-clusim	%reduction
100	50	10	0.1679379	0.2164677	22.42
100	50	9	0.1644772	0.188570	12.78
100	50	8	0.15831120	0.1672997	5.37
100	50	5	0.1707545	0.10551958	-61.8
100	50l	1	0.153685	0.0211916	-625

n	k	nsamp	time-scipy	time-clusim	%reduction
50	10	50	0.101605	0.1627458	37.57
50	10	25	0.0711955	0.080919	12.01
1024	32	100	27.423443667	82.5327272	66.77

Make code using stirling2 perform better at scale #45

Description

Note 1:

Note 2: 82.3% speedup for simulation/draws of partitions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions