Skip to content

Make code using stirling2 perform better at scale #45

@rlucas7

Description

@rlucas7

Ping @ajgates42 because you are the last person committing here

I was looking through the JMLR paper on clustering metrics and was pointed to the repo here.

I noticed in several places the mpmath.stirling2 method is used.
While this method is useful, it is not as performant as the scipy version for some of the use cases.
One reason is the vectorization must be done outside the mpmath.stirling2 function, e.g. the Stirling2 table
needs to be rebuilt/recalculated for each (n,k) argument pair. Also because the function mpmath.stirling2 uses the inclusion exlcusion formula, it can be faster for single (n,k) values as compared to the scipy version.

For example timings of a single call to mpmath.stirling2(9,3) versus scipy.stirling2(9,3) will show that mpmath is faster.
This is also true for (5,3) and for succesive pairs like often happens in the sampling routines.

For example: mpmath.stirling2(9,3), mpmath.stirling2(8,2) is faster than the scipy equivalent.

However, in the call to _random_partition_num_iterator in the recursive DFS for generate_random_partition_num

the computation becomes slower for larger (n,k), values especially when considering that multiple partitions may be requested in a simulation or computational experiment. In these scenarios, the Stirling2 table of values is rebuilt multiple times which slows the method down and preventing the scaling to even bigger (n,k) values (and better partition sample coverage too.

Here is some notes and data to give evidence for claims:

Note 1:

For n=100 elements both methods work but for n=1000 elements the recursion depth is exceeded for the current clusim package implementation. The existing implementation will still execute at higher values of n only if the max recursion limit is bumped up, so we do that the scipy implementation does not require the recursion depth to be increased for n=1000 to execute successfully.

Note 2: 82.3% speedup for simulation/draws of partitions

setup_code_scipy = """
import scipy.special as sps
import numpy as np

def random_partition(n, k, number_samples):
    ## one time costs
    N = np.array([[i] for i in range(1, n+1)])
    triangle = sps.stirling2(N, np.array(list(range(1, n+1))), exact=True)
    samples = []
    original_n_k = (n, k)
    for i in range(number_samples):
        _n, _k = original_n_k
        assignments = [_k]
        # handle the offsets correctly
        _n -= 1
        _k -= 1
        while _n > 0:
            u = np.random.uniform(0, 1)
            if u < triangle[_n-1][_k-1] / triangle[_n][_k]:
                assignments.append(_k)
                _k -= 1
            else:
                if _k > 1:
                    assignments.append(np.random.randint(1, _k))
                else:
                    assignments.append(1)
            _n -= 1
        assignments.reverse()
        # codeword = ','.join(str(val) for val in assignments)
        samples.append(assignments)
    return samples
"""

setup_code_clusim = """
import mpmath
import numpy as np

def _random_partition_num_iterator(n_elements, n_clusters):
    assert n_clusters <= n_elements
    if n_elements == 1:
        current_partition = [[0]]
    else:
        stirling_prob = mpmath.stirling2(n_elements - 1, n_clusters - 1) / mpmath.stirling2(n_elements, n_clusters)
        if np.random.random() < stirling_prob:
            current_partition = _random_partition_num_iterator(n_elements=n_elements - 1, n_clusters=n_clusters - 1)
            current_partition.append([n_elements - 1])
        else:
            current_partition = _random_partition_num_iterator(n_elements=n_elements - 1, n_clusters=n_clusters)
            current_clu = np.random.randint(n_clusters)
            current_partition[current_clu].append(n_elements - 1)
    return current_partition
"""

# this line takes about 25-30 seconds on my machine
nsamp, n, k = 100, 1000, 50
a = random_partition(n, k, nsamp)
b = [_random_partition_num_iterator(n, k) for i in range(nsamp)]

timing_result_scipy = timeit.timeit(stmt='nsamp, n, k = 100, 1000, 50; random_partition(n, k, nsamp)', setup=setup_code_scipy, number=10)
print(f"Total time scipy: {timing_result_scipy} seconds")
#  Total time scipy: 26.251820417004637 seconds
timing_result_clusim = timeit.timeit(stmt='nsamp, n, k = 100, 1000, 50; [_random_partition_num_iterator(n,k) for _ in range(nsamp)]', setup=setup_code_clusim, number=10)
print(f"Total time clusim: {timing_result_clusim} seconds")

running the above snippets gives me (on an M4 mac laptop)

Total time clusim: 148.07740566704888 seconds
reduces runtime by >>> (148.077405 - 26.251820417004637) / 148.077405 = 0.822715555982328 or 82.3% 

which is pretty good.

To get a better idea of where speedups and breakevens happens I ran some more values
I organized them into the table here for easier inspection and comparison
There is some reduction even at modestly small values.

For the 2019 JOSS paper there is a section with:

compute clusters for 1024 elements on 32 clusters over 100 runs ...
for this scenario my implementation clearly wins over the mpmath implementation
n, k, nsamp = 1024, 32, 100

Running the timeit snippet above with these values is the last entry of the second table beneath.`

All runtimes are given in seconds-this is what timeit measures but defauly on my machine.

n k nsamp time-scipy time-clusim %reduction
100 50 10 0.1679379 0.2164677 22.42
100 50 9 0.1644772 0.188570 12.78
100 50 8 0.15831120 0.1672997 5.37
100 50 5 0.1707545 0.10551958 -61.8
100 50l 1 0.153685 0.0211916 -625

So here the break even point is generating around 8-9 samples for n,k = 100,50 anything more than that and you are
constrained by the recalculation of the Stirling2 triangle.

some other examples:

n k nsamp time-scipy time-clusim %reduction
50 10 50 0.101605 0.1627458 37.57
50 10 25 0.0711955 0.080919 12.01
1024 32 100 27.423443667 82.5327272 66.77

The last entry here corresponds to values from the JOSS 2019 paper you wrote so it seems that the approach I'm proposing would enable those simulations to both scale in (n,k,nsamples) and also for larger values too.

Is there any interest in putting this more performance version into clusim in lieue of the existing DFS?

PS: I'm not sure how the mpmath library calculates Bell numbers but there is also the possibility that a row sum over the k from scipy will be faster for certain n values and call patterns but I haven't looked into that and figured I'd post these gains to gauge interest.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions