I am a bit confused by the current implementation of the `_final_aggregation` function used by `PearsonCorrCoef`, and the reference link in its docstring is broken.
A brief description of a parallel algorithm for aggregating the running statistics needed to compute the Pearson correlation is given on the Wikipedia page for algorithms for calculating variance. More detailed derivations and analysis can be found in the papers by Chan et al. and Schubert et al., which that article cites.
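For context, the pairwise combination those references describe can be sketched as below. The state layout and names here are illustrative only (they are not torchmetrics' actual `_final_aggregation` signature): each partition carries its count, the means of x and y, the sums of squared deviations `m2_x`/`m2_y`, and the sum of cross-deviations `c_xy`.

```python
import math


def combine(a, b):
    """Merge two running-statistics states via the pairwise update of Chan et al.

    Each state is a tuple (n, mean_x, mean_y, m2_x, m2_y, c_xy), where the
    m2_* terms are sums of squared deviations and c_xy is the sum of
    cross-deviations. Layout is illustrative, not torchmetrics' own.
    """
    n_a, mx_a, my_a, m2x_a, m2y_a, cxy_a = a
    n_b, mx_b, my_b, m2x_b, m2y_b, cxy_b = b
    n = n_a + n_b
    dx = mx_b - mx_a  # difference of the partition means
    dy = my_b - my_a
    factor = n_a * n_b / n  # shared correction weight for the deviation terms
    return (
        n,
        mx_a + dx * n_b / n,
        my_a + dy * n_b / n,
        m2x_a + m2x_b + dx * dx * factor,
        m2y_a + m2y_b + dy * dy * factor,
        cxy_a + cxy_b + dx * dy * factor,
    )


def state(xs, ys):
    """Build a state from raw samples with a direct serial pass (for checking)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    m2x = sum((x - mx) ** 2 for x in xs)
    m2y = sum((y - my) ** 2 for y in ys)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (n, mx, my, m2x, m2y, cxy)


def pearson(s):
    """Pearson r from an aggregated state."""
    _, _, _, m2x, m2y, cxy = s
    return cxy / math.sqrt(m2x * m2y)
```

Combining the states of two halves of a dataset with `combine` agrees (up to floating-point error) with a single serial pass over the whole dataset, which is exactly the property a parallel `_final_aggregation` needs.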
While the current implementation does simplify to the equations given in these references, it is significantly more complex (and harder to understand). Is there a reason for this specific implementation that I am overlooking, e.g. numerical precision or overflow avoidance?
Below is a simplified implementation which I believe matches the output of the current one (I have tested it on my own data and it passes the torchmetrics unit tests), but is more closely aligned with the source algorithms. If there is no particular reason for the current implementation, would it be worthwhile to replace it with this simpler one?