Skip to content

int_in_range (and others) reports success even when running out of entropy #219

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ejmount opened this issue Apr 10, 2025 · 4 comments
Open

Comments

@ejmount
Copy link

ejmount commented Apr 10, 2025

Firstly, I feel like I'm missing something here because this is such a fundamental piece of functionality, it's implausible to me this problem isn't described anywhere (I couldn't see anything about in the docs - the only reference I've found is in a unit test buried in #184), but I'm raising it because it nonetheless caused problems for me.

In Unstructured, a call int_in_range(n..=m) consumes some bytes of the seed data to construct an index between 0 and the smallest power of 256 larger than (m-n). It does this by simply concatenating the bytes together, before mod-reducing and shifting this value into the desired output range. As mentioned in #184, this leads to biased results and inefficient use of the entropy if m-n is not a power of 256, but it will nonetheless work if the data is available. However, if there aren't enough bytes available, int_in_range does not return an Err - instead, if it can't get a next byte, it stops constructing the large index value and uses it as is in the rest of the calculation, returning an Ok. IOW, if we have less than the required number of bytes left, we interpret whatever's left as a single integer (which will be by definition less than (m-n), possibly as little as 1/256th of that if the range is a very large one) and use that for the reduction and shifting. In the particular case that there's no data left at all, we "successfully" return exactly n.

While I appreciate that this is technically compliant with the documentation, it's extremely counterintuitive and I can't tell if it's intended behavior or not. If you enable clippy::pedantic and configure avoid-breaking-exported-api = false, it will indeed flag int_in_range returning a Result as redundant because it cannot fail. If this isn't intended, can it be fixed to return NotEnoughData appropriately? (It also maybe looks like it should return EmptyChoose for an empty range, instead of panicking as it does currently)

Alternatively, if this is the intended behavior, perhaps I've misunderstood and none of Unstructured should be expected to be checking for this case, because only one method, bytes(), actually does sometimes return NotEnoughData. I assume this is because it returns a reference to the underlying data in order to facilitate &[u8]: Arbitrary, and so can't invent a value out of thin air like the other methods. If this is the idea, I'd appreciate if it could be included in the documentation somewhere that none of the methods are intended to be checking for a lack of data except by necessity.

FWIW I'm happy to submit a PR to help fix this, or do any work needed to push #184 over the line if this is an issue of documentation, I'd just need someone to clarify the intended design.

@ejmount ejmount changed the title int_in_range reports success even when running out of entropy int_in_range (and others) reports success even when running out of entropy Apr 10, 2025
@fitzgen
Copy link
Member

fitzgen commented Apr 15, 2025

This is intended behavior. We have found empirically that it is better (in terms of fuzzing coverage and efficiency) to produce dummy values when we run out of data than to return errors. Intuitively, this helps the fuzzer avoid exploring all error paths in the Arbitrary implementation, rather than the actual system under test's code paths.

That said, improvements towards uniformity (especially when we partial bytes from the underlying data) are very welcome. As are documentation improvements.

The reason it returns an error is to avoid API breaking changes.

If you do not want this behavior, you can check the length of the data.

@ejmount
Copy link
Author

ejmount commented Apr 17, 2025

Hi, thanks for the clarification. I'm currently having a look at adjusting int_into_range to make better use of the seed data, I'll put a PR up if I get somewhere with that.

If API compatibility is the only reason for it to return an error, would it be worth putting a change to remove that under #217 to better reflect the semantics? Alternatively, the various methods in Unstructured have seemingly inconsistent behaviour if given an empty choice, with some panicking and others returning Err - would it be suitable to fix this (for either the new breaking version or a patch update) by having int_in_range et al return EmptyChoose rather than panic? If the current code is the intended behaviour in this aspect too, it would be helpful to understand the overall reasoning for why the behaviour differs.

In terms of documentation, #184 appears to have been ready to go apart from a clippy lint that's since been addressed in the mainline. Does anything still need addressed on that?

@fitzgen
Copy link
Member

fitzgen commented Apr 18, 2025

If API compatibility is the only reason for it to return an error, would it be worth putting a change to remove that under #217 to better reflect the semantics?

I added the last checkbox in that OP with the intention for cleaning this stuff up. Does that item not align with what you are thinking?

@ejmount
Copy link
Author

ejmount commented Apr 18, 2025

Sorry, that was my mistake. That's exactly what I meant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants