Unit Tests #5

Open · mdhaber opened this issue Dec 18, 2024 · 0 comments

mdhaber (Collaborator) commented Dec 18, 2024

I think there are two kinds of unit tests that distributions must pass to be included.

  1. An "identity" test confirms that the distribution is the one it is purported to be. These tests are written by the reviewer based on the primary reference provided by the distribution author.
  2. A generic suite of "consistency" tests provided by scikit-stats confirms that all methods of the distribution are consistent with one another.

For instance:

  1. An identity test of the normal distribution might compare the output of the pdf method against tabulated reference values or a very simple, independently written, and easily-verified implementation included in the test suite.
  2. If the distribution overrides _pdf_formula and _cdf_formula, we test that X.cdf(x, method='formula') is close to X.cdf(x, method='quadrature') (computed from the PDF); if _logcdf_formula is also defined, X.logcdf(x, method='formula') is tested against X.logcdf(x, method='logexp'). (A rough sketch of both kinds of test follows this list.)
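
For concreteness, here is a minimal sketch of both kinds of test. It assumes the prototype's Normal class and the method keyword behave as described above; the names, parameters, and tolerances are illustrative, not what the real suite uses.

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy import stats


def test_normal_identity():
    # "Identity" test: written by the reviewer against the primary reference.
    # Here the reference is a simple, independently written normal PDF.
    mu, sigma = 0.5, 2.0
    X = stats.Normal(mu=mu, sigma=sigma)
    x = np.linspace(-5, 5, 21)
    reference = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    assert_allclose(X.pdf(x), reference)


def test_normal_consistency():
    # "Consistency" test: generic. The formula-based CDF should agree with the
    # CDF obtained by numerically integrating the PDF, and the formula-based
    # log-CDF should agree with the log of the CDF.
    X = stats.Normal(mu=0.5, sigma=2.0)
    x = np.linspace(-5, 5, 21)
    assert_allclose(X.cdf(x, method='formula'), X.cdf(x, method='quadrature'))
    assert_allclose(X.logcdf(x, method='formula'), X.logcdf(x, method='logexp'))
```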

The idea is that if 1) one of the overridden methods is checked against an external reference and 2) all the other overridden methods are checked against one another, there is no need to test the other overridden methods against an external reference. In fact, to save developer and reviewer time, only one of the overridden methods should be tested against an external reference in a distribution-specific test, and the rest of the tests should be generic. If the author or reviewer is concerned that this is not thorough enough, the solution is to improve the generic tests, not to add tests against external references.


scipy/stats/tests/test_continuous.py::TestDistributions implements this strategy, and here are some things I like about how it turned out:

  • The shape parameters are randomly drawn from a domain defined in the code of the distribution, and the arguments are randomly drawn from the support of the distribution. In addition, parameters/arguments are drawn from the boundaries of their domain/support and from outside their domain/support to check edge-case behavior. We get good coverage by drawing values stochastically from an appropriate distribution, not by hand-selecting numbers.
  • The generic tests are very compact. For instance, one test, test_funcs, is parametrized over all distribution families and most of the methods (from entropy to ilogccdf), and it compares results with all the values of the method keyword argument (e.g. 'formula', 'logexp', 'complement', 'quadrature') against one another in a loop (see the sketch after this list).
  • The generic tests are quite thorough. Sometimes they are redundant because leaving them redundant is simpler than the alternative. For instance, I have no problem with testing both allclose(X.cdf(x, method='formula'), X.ccdf(x, method='complement')) and allclose(X.ccdf(x, method='formula'), X.cdf(x, method='complement')) rather than choosing only one. The tests can take as long as they want to run. The important thing is that we trust them to catch inconsistencies (within the tested domain) if there are any.
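
To make the shape of that comparison loop concrete, here is a rough sketch (not the real test_funcs): the helper name, tolerances, set of strategies, and the assumption that unavailable strategies raise NotImplementedError are all illustrative, and the real test also draws parameters/arguments stochastically as described above.

```python
import itertools
import numpy as np
from numpy.testing import assert_allclose
from scipy import stats


def check_method_consistency(X, x, func_name,
                             strategies=('formula', 'logexp', 'complement',
                                         'quadrature')):
    # Compute the same quantity with every available value of the `method`
    # keyword, then compare all pairs of results against one another.
    func = getattr(X, func_name)
    results = {}
    for strategy in strategies:
        try:
            results[strategy] = func(x, method=strategy)
        except NotImplementedError:
            pass  # assumed: strategies that don't apply raise NotImplementedError
    for (s1, r1), (s2, r2) in itertools.combinations(results.items(), 2):
        assert_allclose(r1, r2, rtol=1e-7, atol=1e-12,
                        err_msg=f'{func_name}: {s1} vs {s2}')


# e.g. for one family and one method; the real test is parametrized over all
# families and methods, with stochastically drawn parameters and arguments.
X = stats.Normal(mu=0.5, sigma=2.0)
check_method_consistency(X, np.linspace(-5, 5, 21), 'cdf')
```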

Here are some things I don't like about how it turned out:

  • The relative frequency with which we test values within the domain/support, on its edge, outside it, and NaN needs to be considered more carefully. For instance, when a distribution has several shape parameters and each of them has a fixed probability of being outside the domain, the probability that at least one is outside the domain can get quite high (e.g. with three shape parameters each outside with probability 0.3, at least one is outside with probability 1 - 0.7^3 ≈ 66%), and we could wind up testing mostly that NaN = NaN.
  • Sometimes generic tests fail through no fault of the distribution implementation. For instance, the generic method for computing the mode is numerical optimization of the PDF (method='optimization'). For the Uniform distribution, the _mode_formula returns (a + b)/2, but numerical optimization would not be wrong to pick any other point. As another example, the generic method for computing the CDF is numerical integration of the PDF (method='quadrature'). If the PDF is tricky to integrate, occasionally hypothesis will find a case that fails to meet tolerances. We need to come up with a strategy that makes it easy to allow for occasional failures without every one of them requiring developer attention.
  • Parameterization is nice because it makes the tests relatively concise and easy to maintain, but when a failure does occur, it is hard to reproduce the failure even with the hypothesis reproduce_failure decorator. There should be an easy way of reproducing failures outside of the test suite so we can more easily poke at the situation.

For these reasons, I am leaning toward not using hypothesis and instead writing our own lightweight framework for the property-based tests we need.
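
For illustration, one possible shape for such a lightweight framework (purely a sketch, not an existing API): run a check over many independently seeded trials and report the integer seed of each failure, so a case can be reproduced outside the test suite simply by constructing np.random.default_rng(seed) and calling the check by hand. Allowing a small number of failures then becomes a matter of policy rather than a hypothesis configuration.

```python
import numpy as np


def run_seeded(check, n_trials=100, entropy=20241218, n_allowed_failures=0):
    # Run `check(rng)` once per trial with an independent, reproducible rng.
    seeds = np.random.SeedSequence(entropy).generate_state(n_trials)
    failures = []
    for seed in seeds:
        rng = np.random.default_rng(int(seed))
        try:
            check(rng)
        except AssertionError as e:
            failures.append((int(seed), e))
    # Tolerating a few failures (e.g. hard quadrature cases) is a policy knob.
    if len(failures) > n_allowed_failures:
        msgs = '\n'.join(f'np.random.default_rng({seed}): {err}'
                         for seed, err in failures)
        raise AssertionError(f'{len(failures)}/{n_trials} trials failed:\n{msgs}')
```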

Note that the tests described here are not intended to check the precision of the results or reliability for extreme inputs; those properties will be assessed with benchmarks, which do not "pass" or "fail". Instead, the results of the benchmarks are clearly communicated to the user so they know what sort of accuracy to expect within a given domain. I'll write about that in a separate issue.
