7 out of 14 tests produce incorrect results (verified with os.urandom) #13

@ReinhardJesolowitz24

Bug Report: 7 out of 14 tests produce incorrect results (verified with os.urandom)

Summary

Multiple tests in nistrng v1.2.3 produce incorrect p-values regardless of input, including for cryptographically secure random data (os.urandom). This was discovered by running the test suite against 5 different inputs as a cross-validation. Three of them are summarized here:

Input                    Expected      nistrng Result   Correct Result
os.urandom() (CSPRNG)    ~14/14 PASS   7/14             14/14 PASS
AES-256 encrypted data   ~14/14 PASS   7/14             14/14 PASS
Unencrypted JPEG         ~0/14 PASS    0/14             0/14 PASS

The 7 broken tests fail identically regardless of input data, producing the same wrong p-values for true random data as for encrypted data.

Environment

  • nistrng version: 1.2.3
  • Python: 3.14 (Windows)
  • numpy: 2.x
  • scipy: 1.17.1
  • Test method: 10 samples of 1,000,000 bits each

Bug 1: Approximate Entropy — min and max swapped (CRITICAL)

File: sp800_22r1a/test_approximate_entropy.py, line 52

# Current (WRONG):
blocks_length: int = min(2, max(3, int(math.floor(math.log(bits.size, 2))) - 6))

# This ALWAYS returns 2, because:
#   max(3, anything) >= 3
#   min(2, 3+) = 2

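For 1,000,000 bits: log2(1000000) ≈ 19.93 -> floor = 19 -> 19 - 6 = 13 -> max(3, 13) = 13 -> min(2, 13) = 2

The test always runs with m=2 instead of the intended m=13, which is the largest value that satisfies the NIST recommendation m < floor(log2(n)) - 5 for this input size. With only 4 possible 2-bit patterns (and 8 patterns for m+1), the test is far too coarse for large inputs.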

Suggested fix:

blocks_length: int = max(2, min(int(math.floor(math.log(bits.size, 2))) - 6, 13))
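The clamp logic is easy to check in isolation. The sketch below is a standalone comparison, not code taken from nistrng: both function names are hypothetical, and the fixed variant uses math.log2 instead of math.log(x, 2) purely as a precaution against floating-point rounding near exact powers of two.

```python
import math


def blocks_length_current(n_bits: int) -> int:
    """Expression as quoted above from nistrng 1.2.3 (hypothetical wrapper name)."""
    return min(2, max(3, int(math.floor(math.log(n_bits, 2))) - 6))


def blocks_length_fixed(n_bits: int) -> int:
    """Suggested fix: floor(log2(n)) - 6, clamped to the range 2..13."""
    return max(2, min(int(math.floor(math.log2(n_bits))) - 6, 13))


# Print: input size, current block length, fixed block length
for n in (1_000, 100_000, 1_000_000):
    print(n, blocks_length_current(n), blocks_length_fixed(n))
```

The current expression yields 2 for every input size, while the fixed one grows with the input and reaches 13 at 1,000,000 bits.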

Bug 2: Approximate Entropy — incorrect log divisor

File: sp800_22r1a/test_approximate_entropy.py, line 74

# Current (WRONG):
phi_m.append(numpy.sum(c_i[c_i > 0.0] * numpy.log((c_i[c_i > 0.0] / 10.0))))

# Should be (per NIST SP 800-22):
phi_m.append(numpy.sum(c_i[c_i > 0.0] * numpy.log(c_i[c_i > 0.0])))

The division by 10.0 has no basis in the NIST specification and corrupts the Phi-m statistic.
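One point worth checking when prioritizing this fix: if c_i holds relative frequencies that sum to 1 (which is how SP 800-22 defines them, but is an assumption here, not verified against the nistrng source), then dividing by 10.0 shifts every Phi value by the same constant, -ln(10). That constant cancels in ApEn = Phi(m) - Phi(m+1). The following pure-Python sketch, with hypothetical helper names, illustrates the effect on a short arbitrary bit string:

```python
import math
from collections import Counter


def phi(bits: str, m: int, divisor: float = 1.0) -> float:
    """Sum of c_i * log(c_i / divisor) over overlapping, wrapped m-bit patterns."""
    n = len(bits)
    padded = bits + bits[: m - 1]  # append the first m-1 bits so windows wrap around
    counts = Counter(padded[i : i + m] for i in range(n))
    return sum((c / n) * math.log((c / n) / divisor) for c in counts.values())


def apen(bits: str, m: int, divisor: float = 1.0) -> float:
    """Approximate entropy: Phi(m) - Phi(m + 1)."""
    return phi(bits, m, divisor) - phi(bits, m + 1, divisor)


sample = "0100110101"  # arbitrary short example input
print(phi(sample, 3), phi(sample, 3, 10.0))    # differ by log(10)
print(apen(sample, 3), apen(sample, 3, 10.0))  # agree up to rounding
```

If the assumption holds, the stray divisor distorts each Phi value but may not change the final p-value on its own. Removing it is still the right cleanup, since the cancellation breaks as soon as c_i is not normalized.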

Bug 3: Serial Test — hardcoded pattern length

File: sp800_22r1a/test_serial.py, line 38

self._pattern_length: int = 4  # Hardcoded!

Per NIST SP 800-22, m should be chosen such that m < floor(log2(n)) - 2. For 1,000,000 bits, floor(log2(n)) = 19, so m can be at most 16. With m=4, the test only examines 16 possible patterns instead of tens of thousands (2^16 = 65,536 for m=16), making it far too coarse to detect non-randomness.
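As a minimal sketch of deriving the pattern length from the input size instead of hardcoding it, the helper below returns the largest m allowed by the strict inequality m < floor(log2(n)) - 2. The function name is hypothetical, and whether nistrng should use the maximum or a smaller value is a design choice left open here.

```python
def max_serial_pattern_length(n_bits: int) -> int:
    """Largest m with m < floor(log2(n)) - 2 (hypothetical helper)."""
    floor_log2 = n_bits.bit_length() - 1  # exact floor(log2(n)) for n >= 1
    return floor_log2 - 3


n = 1_000_000
m = max_serial_pattern_length(n)
print(m, 2 ** m)  # pattern length and number of distinct m-bit patterns
```

Using int.bit_length avoids floating-point logarithms entirely, so the result is exact for inputs of any size.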

Bug 4: Cumulative Sums — int8 overflow

File: sp800_22r1a/test_cumulative_sums.py, lines 44-56

bits_copy: numpy.ndarray = bits.copy()      # bits is int8 (-128..127)
bits_copy[bits_copy == 0] = -1
# ...
forward_sum += bits_copy[i]                  # int8 sum wraps once it leaves -128..127

The running sum of +1/-1 values is accumulated as int8 and wraps around as soon as it leaves the range -128..127. That needs at least 128 net steps in one direction and is practically certain to happen somewhere in 1,000,000 random bits, so the resulting forward_max/backward_max values are completely wrong. The test always returns p = 1.0, which is a clear indicator of the bug.

Suggested fix: Convert to int32 or int64 before computation:

bits_copy = bits.copy().astype(numpy.int64)
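The wraparound can be reproduced outside nistrng. This sketch uses numpy.cumsum with an explicit int8 accumulator rather than the loop from test_cumulative_sums.py, so it only illustrates the same int8 arithmetic, not the library's exact code path. The input is a worst-case run of 200 one-bits.

```python
import numpy as np

# Worst case for the accumulator: 200 one-bits, each mapped to +1.
steps = np.ones(200, dtype=np.int8)

wrapped = np.cumsum(steps, dtype=np.int8)    # accumulate in int8, as in the bug
widened = np.cumsum(steps.astype(np.int64))  # accumulate in int64, as in the fix

print(wrapped[126], wrapped[127])  # last valid int8 value, then the wrapped value
print(int(widened.max()))          # true maximum excursion
```

With the int8 accumulator the sum jumps from 127 to -128 at the 128th step, whereas the int64 version reaches the true maximum of 200.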

Bug 5: Maurer's Universal — suspiciously constant p-values

Maurer's Universal test returns p-values of approximately 0.00978 for all inputs, including os.urandom(). Representative p-values from the 10 samples of true random data:

0.00980, 0.00996, 0.00976, 0.00978, 0.00964

This consistency across completely different inputs (random, AES-256, custom ciphers) strongly suggests a computational error in the test implementation.
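For a correctly implemented test fed with random data, p-values should be spread roughly uniformly over [0, 1], which gives a standard deviation of about 0.289. The sketch below compares the quoted values against that reference. The cutoff of 1% of the uniform spread is a hypothetical choice for illustration, not a NIST criterion.

```python
import math
import statistics

# The five p-values quoted above for os.urandom() input
observed = [0.00980, 0.00996, 0.00976, 0.00978, 0.00964]

uniform_std = 1 / math.sqrt(12)  # standard deviation of Uniform(0, 1)
observed_std = statistics.pstdev(observed)

print(f"expected spread ~{uniform_std:.4f}, observed {observed_std:.6f}")

# Hypothetical cutoff: flag results whose spread is under 1% of the uniform spread
suspicious = observed_std < 0.01 * uniform_std
print("suspiciously constant:", suspicious)
```

The observed spread is several orders of magnitude below the uniform reference, which supports the reading that the values are an artifact of the computation rather than a property of the data.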

Bug 6: Random Excursion — incorrect pass/fail evaluation

The test produces reasonable-looking p-values (0.630, 0.683) but marks them as FAIL. A p-value of 0.63 should clearly be a PASS (threshold is 0.01). This appears to be a bug in how the result is evaluated.
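For reference, the decision rule is a plain threshold comparison. The Random Excursion test reports one p-value per state, so the sketch below assumes the overall result passes only if every p-value is at or above the threshold. Whether nistrng aggregates its p-values this way is not confirmed here, and the function name is hypothetical.

```python
ALPHA = 0.01  # significance threshold named above


def passed(p_values: list[float], alpha: float = ALPHA) -> bool:
    """A result passes only if every p-value is at or above alpha."""
    return all(p >= alpha for p in p_values)


print(passed([0.630, 0.683]))  # both well above the threshold
print(passed([0.630, 0.004]))  # one state below the threshold
```

Under this rule the reported values 0.630 and 0.683 pass. A FAIL would require some other p-value below 0.01 that is not shown in the output, or an inverted comparison.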

Bug 7: DFT/Spectral and Linear Complexity

Both tests return p = 0.000000 for all inputs including os.urandom(), indicating fundamental implementation errors.

Verification Method

The bugs were discovered using a 5-way cross-validation approach:

  1. True random (os.urandom) — must pass all tests
  2. AES-256 (7-Zip encrypted JPEG) — must pass all tests
  3. Custom cipher A (Turbine V5) — expected to mostly pass
  4. Custom cipher B (SCHFM2) — expected to mostly pass
  5. Unencrypted JPEG — must fail all tests (control)

When true random data and AES-256 both fail the same 7 tests with identical p-values, the tests themselves are clearly broken.

A separate pure-Python implementation of the same NIST tests (Monobit, Block Frequency, Runs, Longest Run, Cumulative Sums, Approximate Entropy with m=10, Serial with m=16, Maurer's Universal) correctly returns PASS for all encrypted inputs and true random data, and FAIL for the unencrypted JPEG.
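To illustrate the cross-validation idea with a single test, here is a minimal pure-Python sketch of the Frequency (Monobit) test run against a random input and a deliberately biased control. It is not the implementation referred to above. The helper names are hypothetical, and the biased byte string is only a stand-in for the unencrypted JPEG, which is not available here.

```python
import math
import os


def bytes_to_bits(data: bytes) -> list[int]:
    """Unpack bytes into a list of 0/1 values, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def monobit_p_value(bits: list[int]) -> float:
    """Frequency (Monobit) test: p = erfc(|S_n| / sqrt(2 * n))."""
    s_n = sum(2 * b - 1 for b in bits)  # map 0 -> -1, 1 -> +1 and sum
    return math.erfc(abs(s_n) / math.sqrt(2 * len(bits)))


inputs = {
    "os.urandom": bytes_to_bits(os.urandom(125_000)),    # should usually pass
    "biased control": bytes_to_bits(b"\x01" * 125_000),  # stand-in for the JPEG
}
for name, bits in inputs.items():
    p = monobit_p_value(bits)
    print(f"{name}: p={p:.6f} -> {'PASS' if p >= 0.01 else 'FAIL'}")
```

A trustworthy implementation should separate the two inputs: the control must fail, while the random input passes in all but roughly 1% of runs at the 0.01 threshold.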

Impact

Any user relying on nistrng to evaluate encryption quality will get misleading results: 7 out of 14 tests always report FAIL regardless of input quality. This can lead to:

  • Rejecting good algorithms based on false test failures
  • Wasting development time trying to fix non-existent weaknesses
  • False sense of confidence when all "working" tests pass

Recommendation

I would recommend adding a simple validation test to the test suite:

import os

import numpy as np
import nistrng

# Generate 1,000,000 true random bits (125,000 bytes) as an int8 array of 0s and 1s
bits = np.unpackbits(np.frombuffer(os.urandom(125000), dtype=np.uint8)).astype(np.int8)

# All tests should pass for true random data
for name, test in nistrng.SP800_22R1A_BATTERY.items():
    if test.is_eligible(bits):
        result = test.run(bits)[0]
        assert result.passed, f"{name} failed on true random data (p={result.score})"

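This would have caught all 7 bugs immediately. Since each test rejects genuinely random data about 1% of the time at the 0.01 threshold, a single isolated failure should trigger a rerun with a fresh sample rather than be treated as a regression.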

Thank you for maintaining this package. I hope this report helps improve it!

While working on the analysis above, I needed a working test suite for my own project, so I put together a pure-Python implementation of the tests that were giving me trouble.

In case it's useful to anyone running into similar issues, I've published it here:
https://github.com/ReinhardJesolowitz24/py-nist-sp800-22

It covers 12 NIST SP 800-22 tests plus 4 supplementary tests, all validated against os.urandom(), AES-256, and a raw JPEG as a negative control. It is a single file with no dependencies except optional numpy for the DFT test.

It's not meant as a replacement for this project — just a quick workaround until the issues here get addressed. Happy to help with fixes if that's preferred!
