Skip to content

Commit b34616c

Browse files
authored
Merge pull request #54 from RalfJung/validity
Area proposal: validity invariants
2 parents 6864424 + 4e12fda commit b34616c

File tree

1 file changed

+131
-0
lines changed

1 file changed

+131
-0
lines changed

active_discussion/validity.md

+131
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
# Data type validity requirements
2+
3+
This discussion is meant to focus on the question: Which invariants derived from
4+
types are there that the compiler expects to be *always* maintained, and
5+
(equivalently) that unsafe code must *always* uphold (or else cause undefined
6+
behavior)? This is what is called "validity invariant" in
7+
[Ralf's blog post](https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html),
8+
but we might also decide to change that name.
9+
10+
### Interactions and constraints
11+
12+
Choices of invariants interact, in particular, with layout optimizations: For
13+
example, the fact that `Option<&T>` is pointer-sized relies on the fact that the
14+
validity invariant for `&T` rules out `0x0`, and hence we can use that value as
15+
signaling the `None` case.
16+
17+
Moreover, the invariants are constrained by attributes that we emit when
18+
generating LLVM IR. For example, we emit `aligned` attributes pretty much any
19+
time we can, which means it is probably a good idea to say that valid references
20+
must be aligned.
21+
22+
Finally, another consideration to take into account is that ruling out certain
23+
behavior can be great for bug finding. For example, if arithmetic overflow is
24+
defined to have two's-complement-behavior, then bug finding tools can no longer
25+
use overflow as an indication of a software bug. (This is a real problem with
26+
unsigned integer arithmetic in C/C++.)
27+
28+
### Possible bit patterns
29+
30+
The validity invariant of a type is, basically, a set of bit patterns that is
31+
allowed to occur at that type. ("Basically" because the invariant may also be
32+
allowed to depend on memory.) To discuss this properly, we need to first agree
33+
on what "bit patterns" even are. It is not enough to just consider sequences of
34+
0 and 1, because we also need to take uninitialized data into account. For the
35+
purpose of this discussion, I think it is sufficient to consider every bit as
36+
being either 0, 1 or uninitialized.
37+
[That is not always sufficient](https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html),
38+
but I think we can mostly ignore the extra complications introduced by pointer
39+
values.
40+
41+
In terms of comparing with C, the "uninitialized" bit corresponds to what C
42+
calls "indeterminate" data. In particular, it is allowed to be a "trap
43+
representation". Also, observing the same indeterminate data multiple times is
44+
allowed to yield different results. That's why I am proposing we treat it as a
45+
third state a bit can be in.
46+
47+
In terms of LLVM, the "uninitialized" bit corresponds to `poison`. It is *not*
48+
the same as `undef`! See
49+
[this paper](https://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf) for some
50+
more material on the topic.
51+
52+
### Extent of "always"
53+
54+
One point we will have to figure out is what exactly "always" means. Thinking
55+
in terms of a semantics for MIR, data most probably needs to be valid any time
56+
it is copied, which primarily happens when executing assignment statements (the
57+
other cases are passing of function arguments and return values). However, it
58+
is less clear whether merely creating a place without accessing the data inside
59+
(such as in `&*x`) should require the data to be valid.
60+
61+
The entire discussion here is only about validity invariants that have to hold
62+
when the compiler considers a variable initialized. For example, `let b: bool;`
63+
is completely okay to not be initialized because the compiler knows about that;
64+
`let b: bool = mem::uninitialized();` however copies uninitialized data at type
65+
`bool` and hence violates `bool`'s validity invariant.
66+
67+
## Goals
68+
69+
* For every primitive type, determine which assumptions (if any) the compiler
70+
makes about values *not* occurring at that type (serving as a lower bound for
71+
what to declare invalid), and determine which popular patterns in unsafe code
72+
might create "interesting" values of this type that safe code cannot create on
73+
its own (serving as an upper bound for how much we want to declare invalid).
74+
Both of these bounds are soft, but informative.
75+
* Based on that, map out a design space of invariants that seem reasonable.
76+
* Determine when exactly the validity invariant is assumed to hold.
77+
78+
## Active threads
79+
80+
To start, we will create threads for each major category of types.
81+
82+
* Integers and floating point types
83+
* Do we allow values that contain uninitialized bits? If yes, what are the
84+
rules for arithmetic and logical operations involving uninitialized bits,
85+
e.g. in cases like `x * 0`? There is also some interaction with bug finding
86+
here: tools can only flag uninitialized data at integer type as a bug if we
87+
do not allow that to happen in unsafe code.
88+
89+
* Raw pointers
90+
* Do we allow values that contain uninitialized bits?
91+
* Are there any requirements on the metadata?
92+
93+
* References
94+
* Presumably, references must be non-NULL.
95+
* They probably also must be aligned, but is that required every time a
96+
reference is taken? Also see the [ongoing discussion in RFC 2582][RFC2582].
97+
* Can there ever be uninitialized bits in a reference?
98+
* Do they have to be dereferencable? What exactly does that even mean?
99+
* Does `&[mut] T` have to point to data that is valid at `T`? This interacts
100+
with the question of whether `&*x` is allowed when `x` is a well-aligned
101+
non-null dereferencable pointer that points to invalid data.
102+
* Out of scope: aliasing rules
103+
104+
* Function pointers
105+
* Presumably, these must be non-NULL. Anything else? Can there ever be
106+
uninitialized bits?
107+
108+
* Booleans
109+
* Is there anything to say besides: A `bool` must be `0x0` or `0x1`? Do we
110+
allow the remaining bits to be uninitialized?
111+
112+
* Unions
113+
* Do we make any restrictions here, or are unions just "bags of bits" that may
114+
contain anything? That would mean we can do no layout optimizations.
115+
116+
* Enums
117+
* Is there anything to say besides: The discriminant must be valid, and all
118+
fields of the active variant must be valid at their respective types?
119+
* The padding between fields can be anything, including uninitialized.
120+
121+
* Structs, tuples, arrays and all other aggregates (closures, ...)
122+
* Is there anything to say besides: All fields must be valid at their
123+
respective types?
124+
* The padding between fields can be anything, including uninitialized. It was
125+
[recently determined][generators-maybe-uninit] that generators behave
126+
different from other aggregates here. Are we okay with that? Should we push
127+
for generator fields to reflect this in their types?
128+
129+
[RFC2582]: https://github.com/rust-lang/rfcs/pull/2582
130+
[generators-maybe-uninit]: https://github.com/rust-lang/rust/pull/56100
131+

0 commit comments

Comments
 (0)