7 out of 14 tests produce incorrect results (verified with os.urandom) #13

# Bug Report: 7 out of 14 tests produce incorrect results (verified with os.urandom)

## Summary

Seven tests in nistrng v1.2.3 produce **incorrect p-values for all inputs**, including cryptographically secure random data (`os.urandom`). This was discovered by running the test suite against five different inputs as a cross-validation. Three of them are summarized here:

| Input | Expected | nistrng Result | Correct Result |
|-------|----------|---------------|----------------|
| `os.urandom()` (CSPRNG) | ~14/14 PASS | **7/14** | 14/14 PASS |
| AES-256 encrypted data | ~14/14 PASS | **7/14** | 14/14 PASS |
| Unencrypted JPEG | ~0/14 PASS | 0/14 | 0/14 PASS |

The 7 broken tests fail identically regardless of input data, producing the same wrong p-values for true random data as for encrypted data.

## Environment

- nistrng version: 1.2.3
- Python: 3.14 (Windows)
- numpy: 2.x
- scipy: 1.17.1
- Test method: 10 samples of 1,000,000 bits each

## Bug 1: Approximate Entropy — `min` and `max` swapped (CRITICAL)

**File:** `sp800_22r1a/test_approximate_entropy.py`, line 52

```python
# Current (WRONG):
blocks_length: int = min(2, max(3, int(math.floor(math.log(bits.size, 2))) - 6))

# This ALWAYS returns 2, because:
#   max(3, anything) >= 3
#   min(2, 3+) = 2
```

For 1,000,000 bits: `log2(1000000) = 19.9 -> 19 - 6 = 13 -> max(3, 13) = 13 -> min(2, 13) = 2`

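The test always runs with `m=2` instead of the intended `m=13`, which is the largest value allowed by NIST's bound `m < floor(log2(n)) - 5`. With only 4 possible 2-bit patterns (and 8 three-bit patterns for `m+1`), the test has almost no sensitivity on large inputs.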

**Suggested fix:**
```python
blocks_length: int = max(2, min(int(math.floor(math.log(bits.size, 2))) - 6, 13))
```
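To see the effect of the swap across input sizes, the two expressions can be evaluated side by side. This is a small standalone sketch: the function names are hypothetical, and `n` stands in for `bits.size`.

```python
import math


def apen_block_length_reported(n: int) -> int:
    """Expression quoted from line 52 (min/max swapped): always 2."""
    return min(2, max(3, int(math.floor(math.log(n, 2))) - 6))


def apen_block_length_fixed(n: int) -> int:
    """Suggested fix: clamp floor(log2 n) - 6 into the range [2, 13]."""
    return max(2, min(int(math.floor(math.log(n, 2))) - 6, 13))


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000, 10_000_000):
        print(f"n={n:>10,}  reported m={apen_block_length_reported(n)}  fixed m={apen_block_length_fixed(n)}")
```

Because `math.log(n, 2)` is computed in floating point, `n.bit_length() - 1` is an exact integer alternative for `floor(log2(n))` if rounding near powers of two is a concern.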

## Bug 2: Approximate Entropy — incorrect log divisor

**File:** `sp800_22r1a/test_approximate_entropy.py`, line 74

```python
# Current (WRONG):
phi_m.append(numpy.sum(c_i[c_i > 0.0] * numpy.log((c_i[c_i > 0.0] / 10.0))))

# Should be (per NIST SP 800-22):
phi_m.append(numpy.sum(c_i[c_i > 0.0] * numpy.log(c_i[c_i > 0.0])))
```

The division by 10.0 has no basis in the NIST specification and corrupts the Phi-m statistic.
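One detail worth checking when fixing this: if the frequencies `C_i` sum to 1 (in the spec they do, because the sequence is augmented with its first `m-1` bits so exactly `n` overlapping windows are counted), the divisor shifts every Phi(m) by the same constant `-ln(10)`, and that constant cancels in `ApEn(m) = Phi(m) - Phi(m+1)`. The sketch below is an independent re-implementation of Phi(m) written from the spec, not nistrng's code, and shows this numerically. If nistrng's `C_i` also sum to 1, this line alone would not move the final ApEn p-value, although the intermediate Phi(m) values are still wrong.

```python
import numpy as np


def phi(bits: np.ndarray, m: int, divisor: float = 1.0) -> float:
    """Phi(m) of the Approximate Entropy test (sketch, not nistrng's code).

    divisor=1.0 follows SP 800-22; divisor=10.0 mimics the reported line 74.
    """
    bits = np.asarray(bits, dtype=np.int64)
    n = bits.size
    # Augment with the first m-1 bits so that exactly n overlapping windows exist.
    padded = np.concatenate([bits, bits[: m - 1]])
    windows = np.lib.stride_tricks.sliding_window_view(padded, m)
    # Interpret each m-bit window as an integer in [0, 2**m).
    weights = 1 << np.arange(m - 1, -1, -1)
    counts = np.bincount(windows @ weights, minlength=2**m)
    c_i = counts / n  # relative frequencies; they sum to 1
    nz = c_i[c_i > 0.0]
    return float(np.sum(nz * np.log(nz / divisor)))


def approximate_entropy(bits: np.ndarray, m: int, divisor: float = 1.0) -> float:
    """ApEn(m) = Phi(m) - Phi(m+1)."""
    return phi(bits, m, divisor) - phi(bits, m + 1, divisor)
```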

## Bug 3: Serial Test — hardcoded pattern length

**File:** `sp800_22r1a/test_serial.py`, line 38

```python
self._pattern_length: int = 4  # Hardcoded!
```

Per NIST SP 800-22, `m` must satisfy `m < floor(log2(n)) - 2`. For 1,000,000 bits, `floor(log2(n)) = 19`, so `m` can be as large as 16. A hardcoded `m=4` does satisfy the bound, but it examines only 16 possible patterns instead of up to 65,536, making the test far too coarse to detect most non-randomness at this input size.
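The bound can be evaluated exactly with integer arithmetic. A minimal sketch with a hypothetical helper name; `n.bit_length() - 1` equals `floor(log2(n))` for any positive integer, so no floating-point logarithm is needed:

```python
def serial_max_pattern_length(n: int) -> int:
    """Largest m satisfying m < floor(log2 n) - 2 (SP 800-22 Serial test bound)."""
    floor_log2 = n.bit_length() - 1  # exact floor(log2 n) for n >= 1
    return floor_log2 - 3


if __name__ == "__main__":
    n = 1_000_000
    m_max = serial_max_pattern_length(n)
    print(f"n={n:,}: m <= {m_max}, i.e. up to {2 ** m_max:,} patterns (vs {2 ** 4} with m=4)")
```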

## Bug 4: Cumulative Sums — int8 overflow

**File:** `sp800_22r1a/test_cumulative_sums.py`, lines 44-56

```python
bits_copy: numpy.ndarray = bits.copy()      # bits is int8 (-128..127)
bits_copy[bits_copy == 0] = -1
# ...
forward_sum += bits_copy[i]                  # int8: wraps once the sum leaves [-128, 127]
```

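The cumulative sum of +1/-1 values is stored as `int8`, so it wraps around as soon as the running sum leaves the range [-128, 127]. If `forward_sum` starts as a Python `int`, NumPy 2's promotion rules (NEP 50) make `forward_sum + bits_copy[i]` an `int8` as well, so the accumulator itself wraps; NumPy 1.x would typically have promoted it to a wider integer type. For 1,000,000 bits the walk typically reaches excursions on the order of ±1,000, so `forward_max`/`backward_max` come out completely wrong. The test always returns `p = 1.0`, which is a clear indicator of the bug.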

**Suggested fix:** Convert to int32 or int64 before computation:
```python
bits_copy = bits.astype(numpy.int64)  # astype() already returns a new array
```
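Alternatively, the loop can be replaced by `numpy.cumsum` with an explicit wide accumulator. The sketch below computes the forward and backward maximum excursions the test needs; the function name is hypothetical and this is not nistrng's code. The last line shows the failure mode: the same cumulative sum forced into `int8` wraps around.

```python
import numpy as np


def cusum_max_excursions(bits: np.ndarray) -> tuple[int, int]:
    """Return (forward_max, backward_max) of the +1/-1 random walk over `bits`."""
    # Map 0 -> -1 and 1 -> +1 in int64 so partial sums cannot overflow.
    steps = 2 * np.asarray(bits, dtype=np.int64) - 1
    forward_max = int(np.max(np.abs(np.cumsum(steps))))
    backward_max = int(np.max(np.abs(np.cumsum(steps[::-1]))))
    return forward_max, backward_max


if __name__ == "__main__":
    ones = np.ones(200, dtype=np.int8)
    print(cusum_max_excursions(ones))          # wide accumulator: (200, 200)
    print(np.cumsum(ones, dtype=np.int8)[-1])  # int8 accumulator wraps: -56
```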

## Bug 5: Maurer's Universal — suspiciously constant p-values

Maurer's Universal test returns p-values of approximately 0.00978 for **all inputs**, including `os.urandom()`. The p-values for five of the 10 samples of true random data, all just below the 0.01 threshold:

```
0.00980, 0.00996, 0.00976, 0.00978, 0.00964
```

This consistency across completely different inputs (random, AES-256, custom ciphers) strongly suggests a computational error in the test implementation.
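A p-value pinned near 0.0098 is informative in itself. SP 800-22 computes it as `P = erfc(|f_n - expectedValue(L)| / (sqrt(2) * sigma))`, with `sigma = c * sqrt(variance(L) / K)` and `c = 0.7 - 0.8/L + (4 + 32/L) * K^(-3/L) / 15`. A p-value of about 0.0098 therefore means the statistic `f_n` lands roughly 2.58 sigma from its expected value on every input, which points at a systematic offset (in the expected value, in sigma, or in `f_n` itself) rather than at the data. The sketch below re-derives this; the L = 7 constants are the SP 800-22 table values for that block length, `L = 7, Q = 1280` are assumed as the parameters for a 1,000,000-bit input, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

# SP 800-22 table values for block length L = 7.
EXPECTED_L7 = 6.1962507
VARIANCE_L7 = 3.125


def maurer_sigma(L: int, K: int, variance: float) -> float:
    """Standard deviation of f_n, including the finite-K correction factor c."""
    c = 0.7 - 0.8 / L + (4 + 32 / L) * K ** (-3 / L) / 15
    return c * math.sqrt(variance / K)


def maurer_p_value(fn: float, L: int, K: int, expected: float, variance: float) -> float:
    """Two-sided p-value of the statistic f_n."""
    return math.erfc(abs(fn - expected) / (math.sqrt(2) * maurer_sigma(L, K, variance)))


def implied_z(p: float) -> float:
    """Distance |f_n - expected| / sigma implied by a two-sided p-value."""
    return NormalDist().inv_cdf(1 - p / 2)


if __name__ == "__main__":
    # Assumed parameters for n = 1,000,000 bits: L = 7, Q = 1280 initialization blocks.
    L, Q = 7, 1280
    K = 1_000_000 // L - Q
    print(f"implied z for p=0.00978: {implied_z(0.00978):.2f}")
    print(f"sigma for L={L}, K={K}: {maurer_sigma(L, K, VARIANCE_L7):.5f}")
```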

## Bug 6: Random Excursion — incorrect pass/fail evaluation

The test produces reasonable-looking p-values (0.630, 0.683) but marks them as FAIL. A p-value of 0.63 should clearly be a PASS (threshold is 0.01). This appears to be a bug in how the result is evaluated.
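For reference, the SP 800-22 decision rule is simply PASS when p ≥ 0.01. Random Excursions reports one p-value per state x ∈ {-4, ..., -1, +1, ..., +4}, so it is worth checking whether the FAIL comes from one of the other six states rather than from the comparison itself. A minimal sketch of an evaluation that keeps per-state verdicts, assuming (not confirmed from nistrng's source) that the overall result requires every state to pass; the function name is hypothetical:

```python
ALPHA = 0.01


def evaluate_excursions(p_values: dict[int, float], alpha: float = ALPHA) -> tuple[dict[int, bool], bool]:
    """Per-state PASS/FAIL plus an overall verdict (every state must pass).

    `p_values` maps a state x (one of -4..-1, 1..4) to its p-value.
    """
    verdicts = {state: p >= alpha for state, p in p_values.items()}
    return verdicts, all(verdicts.values())
```

With this rule, 0.630 and 0.683 are both PASS; an overall FAIL could only come from another state falling below 0.01.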

## Bug 7: DFT/Spectral and Linear Complexity

Both tests return `p = 0.000000` for all inputs including `os.urandom()`, indicating fundamental implementation errors.
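As an independent cross-check for the spectral test, the SP 800-22 rev1a statistic is short enough to reproduce with NumPy's FFT. This is a sketch written from the specification's description (threshold `T = sqrt(ln(1/0.05) * n)`, expected count `N0 = 0.95 * n / 2`, `d = (N1 - N0) / sqrt(n * 0.95 * 0.05 / 4)`), not nistrng's implementation, and the function name is hypothetical. If it gives ordinary p-values on `os.urandom()` data where nistrng reports 0.000000, comparing intermediate values should localize the bug.

```python
import math

import numpy as np


def dft_spectral_p_value(bits: np.ndarray) -> float:
    """Discrete Fourier Transform (Spectral) test p-value, per SP 800-22 rev1a."""
    x = 2.0 * np.asarray(bits, dtype=np.float64) - 1.0  # map 0/1 to -1/+1
    n = x.size
    modulus = np.abs(np.fft.fft(x))[: n // 2]          # first half of the spectrum
    threshold = math.sqrt(math.log(1 / 0.05) * n)       # 95% peak-height threshold T
    n0 = 0.95 * n / 2                                   # expected number of peaks below T
    n1 = int(np.count_nonzero(modulus < threshold))     # observed number of peaks below T
    d = (n1 - n0) / math.sqrt(n * 0.95 * 0.05 / 4)
    return math.erfc(abs(d) / math.sqrt(2))
```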

## Verification Method

The bugs were discovered using a 5-way cross-validation approach:

1. **True random** (`os.urandom`) — must pass all tests
2. **AES-256** (7-Zip encrypted JPEG) — must pass all tests  
3. **Custom cipher A** (Turbine V5) — expected to mostly pass
4. **Custom cipher B** (SCHFM2) — expected to mostly pass
5. **Unencrypted JPEG** — must fail all tests (control)

When true random data and AES-256 both fail the same 7 tests with identical or near-identical p-values, the tests themselves are clearly broken.

A separate pure-Python implementation of eight of these NIST tests (Monobit, Block Frequency, Runs, Longest Run, Cumulative Sums, Approximate Entropy with m=10, Serial with m=16, Maurer's Universal) correctly returns PASS for all encrypted inputs and true random data, and FAIL for the unencrypted JPEG.

## Impact

Any user relying on nistrng to evaluate encryption quality will get **misleading results**: 7 out of 14 tests always report FAIL regardless of input quality. This can lead to:
- Rejecting good algorithms based on false test failures
- Wasting development time trying to fix non-existent weaknesses
- False sense of confidence when all "working" tests pass

## Recommendation

I would recommend adding a simple validation test to the test suite:

```python
import os, numpy as np, nistrng

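# Generate 1,000,000 true random bits (125,000 bytes from the OS CSPRNG)
bits = np.unpackbits(np.frombuffer(os.urandom(125000), dtype=np.uint8)).astype(np.int8)

# All tests should pass for true random data (each can still fail ~1% of runs by chance at alpha = 0.01)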
for name, test in nistrng.SP800_22R1A_BATTERY.items():
    if test.is_eligible(bits):
        result = test.run(bits)[0]
        assert result.passed, f"{name} failed on true random data (p={result.score})"
```

This would have caught all 7 bugs immediately.
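For a CI version of this check, requiring every test to pass on a single sample will occasionally fail by chance, since each test rejects genuinely random data about 1% of the time at α = 0.01. SP 800-22 (Section 4.2.1) instead judges a set of m sequences by the proportion that pass, accepting proportions within `p_hat ± 3 * sqrt(p_hat * (1 - p_hat) / m)` with `p_hat = 1 - alpha`. A sketch of that rule, with hypothetical helper names:

```python
import math


def min_pass_proportion(num_samples: int, alpha: float = 0.01) -> float:
    """Lower end of the SP 800-22 acceptable pass-proportion interval."""
    p_hat = 1.0 - alpha
    return p_hat - 3.0 * math.sqrt(p_hat * alpha / num_samples)


def proportion_ok(passed_flags: list[bool], alpha: float = 0.01) -> bool:
    """True if the fraction of passing samples is inside the acceptable range."""
    proportion = sum(passed_flags) / len(passed_flags)
    return proportion >= min_pass_proportion(len(passed_flags), alpha)


if __name__ == "__main__":
    # With 10 samples per input, at least 9 of 10 must pass (threshold ~0.8956).
    print(f"{min_pass_proportion(10):.4f}")
```

Each flag would come from `result.passed` for one sample, as in the snippet above.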

Thank you for maintaining this package. I hope this report helps improve it!


While working on the analysis above, I needed a working test suite for my own project, so I put together a pure-Python implementation of the tests that were giving me trouble.

In case it's useful to anyone running into similar issues, I've published it here:
https://github.com/ReinhardJesolowitz24/py-nist-sp800-22

It covers 12 NIST SP 800-22 tests plus 4 supplementary tests, all validated against os.urandom(), AES-256, and a raw JPEG as negative control. Single file, no dependencies except optional numpy for the DFT test.

It's not meant as a replacement for this project — just a quick workaround until the issues here get addressed. Happy to help with fixes if that's preferred!
