Commit b14035f

committed
add a chapter on enum representation
1 parent b34616c commit b14035f

2 files changed

+377
-0
lines changed

Diff for: reference/src/SUMMARY.md

+1
@@ -15,6 +15,7 @@
1515
- [Uninitialized memory](./active_discussion/uninitialized_memory.md)
1616
- [Data representation](./representation.md)
1717
- [Structs and tuples](./representation/structs-and-tuples.md)
18+
- [Enums](./representation/enums.md)
1819
- [Unions](./representation/unions.md)
1920
- [Vectors](./representation/vectors.md)
2021
- [Optimizations](./optimizations.md)

Diff for: reference/src/representation/enums.md

+376
@@ -0,0 +1,376 @@
1+
# Representation of Rust `enum` types
2+
3+
**Disclaimer:** Some parts of this section were decided in RFCs, but
4+
others represent the consensus from issue [#10]. The text will attempt
5+
to clarify which parts are "guaranteed" (owing to the RFC decision)
6+
and which parts are still in a "preliminary" state.
7+
8+
[#10]: https://github.com/rust-rfcs/unsafe-code-guidelines/issues/10
9+
10+
## Background
11+
12+
**C-like enums.** The simplest form of enum is just a list of
13+
variants:
14+
15+
```rust
16+
enum SomeEnum {
17+
Variant1,
18+
Variant2,
19+
Variant3,
}
20+
```
21+
22+
Such enums are called "C-like" because they correspond quite closely
23+
with enums in the C language (though there are important differences
24+
as well, covered later). Presuming that they have more than one
25+
variant, these sorts of enums are always represented as a simple integer,
26+
though the size will vary.
27+
28+
C-like enums may also specify the value of their discriminants explicitly:
29+
30+
```rust
31+
enum SomeEnum {
32+
Variant22 = 22,
33+
Variant44 = 44,
34+
Variant45,
35+
}
36+
```
37+
38+
As in C, discriminant values that are not specified are defined as
39+
either 0 (for the first variant) or as one more than the prior
40+
variant.
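
The numbering rule can be checked by casting variants to an integer. This is a minimal sketch; `Implicit` is a hypothetical enum added for illustration, and `SomeEnum` repeats the example above.

```rust
// Discriminants follow the C rule: 0 for the first variant, otherwise
// one more than the previous variant unless a value is given explicitly.
#[allow(dead_code)]
enum Implicit {
    First,  // 0
    Second, // 1
    Third,  // 2
}

#[allow(dead_code)]
enum SomeEnum {
    Variant22 = 22,
    Variant44 = 44,
    Variant45, // one more than Variant44
}

fn main() {
    assert_eq!(Implicit::First as i32, 0);
    assert_eq!(Implicit::Third as i32, 2);
    assert_eq!(SomeEnum::Variant44 as i32, 44);
    assert_eq!(SomeEnum::Variant45 as i32, 45);
}
```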
41+
42+
**Data-carrying enums.** Enums whose variants have fields are called
43+
"data-carrying" enums. Note that for the purposes of this definition,
44+
it is not relevant whether those fields are zero-sized. Therefore this
45+
enum is considered "data-carrying":
46+
47+
```rust
48+
enum Foo {
49+
Bar(()),
50+
Baz,
51+
}
52+
```
53+
54+
**Option-like enums.** As a special case of data-carrying enums, we
55+
identify "option-like" enums as enums where all of the variants but
56+
one have no fields, and one variant has a single field. The most
57+
common example is `Option` itself. In some cases, as described below,
58+
the compiler may apply special optimization rules to the layout of
59+
option-like enums. The **payload** of an option-like enum is the value
60+
of that single field.
61+
62+
## Enums with a specified representation
63+
64+
Enums may be annotated using the following `#[repr]` tags:
65+
66+
- A specific integer type (called `Int` as a shorthand below):
67+
- `#[repr(u8)]`
68+
- `#[repr(u16)]`
69+
- `#[repr(u32)]`
70+
- `#[repr(u64)]`
71+
- `#[repr(i8)]`
72+
- `#[repr(i16)]`
73+
- `#[repr(i32)]`
74+
- `#[repr(i64)]`
75+
- C-compatible layout:
76+
- `#[repr(C)]`
77+
- C-compatible layout with a specified discriminant size:
78+
- `#[repr(C, u8)]`
79+
- `#[repr(C, u16)]`
80+
- etc
81+
82+
We cover each of the categories below. The layout rules for enums with
83+
explicit `#[repr]` annotations are specified in [RFC 2195][].
84+
85+
[RFC 2195]: https://rust-lang.github.io/rfcs/2195-really-tagged-unions.html
86+
87+
### Layout of an enum with no variants
88+
89+
An enum with no variants can never be instantiated and is logically
90+
equivalent to the "never type" `!`. Such enums are guaranteed to have
91+
the same layout as `!` (zero size and alignment 1).
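
As a small illustration, the size and alignment of a variant-less enum can be observed with `size_of` and `align_of`. `Void` and `absurd` are hypothetical names chosen for this sketch.

```rust
use std::mem::{align_of, size_of};

// A hypothetical uninhabited enum: it has no variants, so no value of
// this type can ever be constructed.
enum Void {}

// Because a `Void` can never exist, matching on one needs no arms.
#[allow(dead_code)]
fn absurd(v: Void) -> u32 {
    match v {}
}

fn main() {
    assert_eq!(size_of::<Void>(), 0);
    assert_eq!(align_of::<Void>(), 1);
}
```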
92+
93+
### Layout of a C-like enum
94+
95+
If there is no `#[repr]` attached to a C-like enum, it is guaranteed
96+
to be represented as an integer of sufficient size to store the
97+
discriminants for all possible variants. The size is selected by the
98+
compiler but must be at least a `u8`.
99+
100+
When a `#[repr(Int)]`-style annotation is attached to a C-like enum
101+
(one without any data for its variants), it will cause the enum to be
102+
represented as a simple integer of the specified size `Int`. This must
103+
be sufficient to store all the required discriminant values.
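
The effect of a `#[repr(Int)]` annotation on a C-like enum can be sketched as follows. The enum names `Small`, `Medium`, and `Wide` are hypothetical and exist only for this example.

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[repr(u8)]
enum Small { A, B, C }

#[allow(dead_code)]
#[repr(u16)]
enum Medium { A = 1, B = 1000 }

#[allow(dead_code)]
#[repr(i64)]
enum Wide { Min = i64::MIN, Zero = 0, Max = i64::MAX }

fn main() {
    // Each enum has exactly the size of the integer type named in its repr.
    assert_eq!(size_of::<Small>(), size_of::<u8>());
    assert_eq!(size_of::<Medium>(), size_of::<u16>());
    assert_eq!(size_of::<Wide>(), size_of::<i64>());

    // The stored values are the declared discriminants.
    assert_eq!(Medium::B as u16, 1000);
    assert_eq!(Wide::Min as i64, i64::MIN);
}
```

Declaring a discriminant that does not fit the chosen integer type (for example `B = 1000` under `#[repr(u8)]`) is rejected at compile time.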
104+
105+
The `#[repr(C)]` annotation has a similar effect, but it selects the same
106+
size as the C compiler would use for the given target for an
107+
equivalent C-enum declaration.
108+
109+
Combining a `C` and `Int` representation (e.g., `#[repr(C, u8)]`) is
110+
not permitted on a C-like enum.
111+
112+
The values used for the discriminant will match up with what is
113+
specified (or automatically assigned) in the enum definition. For
114+
example, the following enum defines the discriminants for its variants
115+
as 22 and 23 respectively:
116+
117+
```rust
118+
enum Foo {
119+
// Specify discriminant of this variant as 22:
120+
Variant22 = 22,
121+
122+
// Default discriminant is one more than the previous,
123+
// so 23 will be assigned.
124+
Variant23
125+
}
126+
```
127+
128+
**Unresolved question:** What about platforms where `-fshort-enums`
129+
is the default? Do we know/care about that?
130+
131+
### Layout for enums that carry data
132+
133+
For enums that carry data, the layout differs depending on whether
134+
C-compatibility is requested or not.
135+
136+
#### Non-C-compatible layouts
137+
138+
When an enum is tagged with `#[repr(Int)]` for some integral type
139+
`Int` (e.g., `#[repr(u8)]`), it will be represented as a C-union of a
140+
series of `#[repr(C)]` structs, one per variant. Each of these structs
141+
begins with an integral field containing the **discriminant**, which
142+
specifies which variant is active. They then contain the remaining
143+
fields associated with that variant.
144+
145+
**Example.** The following enum uses a `#[repr(u8)]` annotation:
146+
147+
```rust
148+
#[repr(u8)]
149+
enum TwoCases {
150+
A(u8, u16),
151+
B(u16),
152+
}
153+
```
154+
155+
This will be laid out equivalently to the following more
156+
complex Rust types:
157+
158+
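```rust
// The union is `#[repr(C)]`, matching the "C-union" described above.
// The `derive`s are only needed so the structs can be used as union fields.
#[repr(C)]
union TwoCasesRepr {
    A: TwoCasesVariantA,
    B: TwoCasesVariantB,
}

#[derive(Copy, Clone)]
#[repr(u8)]
enum TwoCasesTag { A, B }

#[derive(Copy, Clone)]
#[repr(C)]
struct TwoCasesVariantA(TwoCasesTag, u8, u16);

#[derive(Copy, Clone)]
#[repr(C)]
struct TwoCasesVariantB(TwoCasesTag, u16);
```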
173+
174+
Note that the `TwoCasesVariantA` and `TwoCasesVariantB` structs are
175+
`#[repr(C)]`; this is needed to ensure that the `TwoCasesTag` value
176+
appears at offset 0 in both cases, so that we can read it to determine
177+
the current variant.
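
The following sketch compares `TwoCases` against a hand-written version of the equivalent types and reads the tag byte at offset 0. The helper `read_tag` and the lowercase union field names are assumptions made for this example; the pointer cast relies on the layout described in RFC 2195.

```rust
use std::mem::{align_of, size_of};

#[allow(dead_code)]
#[repr(u8)]
enum TwoCases {
    A(u8, u16),
    B(u16),
}

// Hand-written equivalent of the layout described above.
#[allow(dead_code)]
#[repr(C)]
union TwoCasesRepr {
    a: TwoCasesVariantA,
    b: TwoCasesVariantB,
}

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(u8)]
enum TwoCasesTag { A, B }

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct TwoCasesVariantA(TwoCasesTag, u8, u16);

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct TwoCasesVariantB(TwoCasesTag, u16);

/// Reads the tag stored at offset 0 (hypothetical helper).
fn read_tag(value: &TwoCases) -> u8 {
    // SAFETY: a `#[repr(u8)]` enum begins with its `u8` discriminant,
    // so the first byte of the value is always an initialized tag.
    unsafe { *(value as *const TwoCases as *const u8) }
}

fn main() {
    assert_eq!(size_of::<TwoCases>(), size_of::<TwoCasesRepr>());
    assert_eq!(align_of::<TwoCases>(), align_of::<TwoCasesRepr>());
    assert_eq!(read_tag(&TwoCases::A(1, 2)), 0);
    assert_eq!(read_tag(&TwoCases::B(3)), 1);
}
```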
178+
179+
#### C-compatible layouts
180+
181+
When the `#[repr]` tag includes `C`, e.g., `#[repr(C)]` or `#[repr(C,
182+
u8)]`, the layout of enums is changed to better match C++ enums. In
183+
this mode, the data is laid out as a tuple of `(discriminant, union)`,
184+
where `union` represents a C union of all the possible variants. The
185+
type of the discriminant will be the integral type specified (`u8`,
186+
etc) -- if no type is specified, then the compiler will select one
187+
based on the size that a C-like enum would have with the same number of
188+
variants.
189+
190+
This layout, while more compatible and arguably more obvious, is also
191+
less efficient than the non-C compatible layout in some cases in terms
192+
of total size.
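
To make the size difference concrete, here is a sketch with two hypothetical enums, `Packed` and `Tagged`, that differ only in their `#[repr]`. The byte counts in the comments assume a target where `u16` has alignment 2, which holds on common platforms but is not universal.

```rust
use std::mem::size_of;

// Union of per-variant structs, each starting with the tag:
// A is (tag, u8, u16) and B is (tag, padding, u16).
#[allow(dead_code)]
#[repr(u8)]
enum Packed {
    A(u8, u16),
    B(u16),
}

// Tag first, then a union of the payloads. The union is 2-aligned,
// so a padding byte follows the tag before the payload starts.
#[allow(dead_code)]
#[repr(C, u8)]
enum Tagged {
    A(u8, u16),
    B(u16),
}

fn main() {
    // With a 2-aligned `u16`: Packed is 4 bytes, Tagged is 6 bytes.
    println!("Packed: {} bytes", size_of::<Packed>());
    println!("Tagged: {} bytes", size_of::<Tagged>());
}
```

In `Packed`, the `u8` field of variant `A` can sit directly after the tag, while in `Tagged` the whole payload must start at an offset suitable for every variant.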
193+
194+
**Example.** The following enum:
195+
196+
```rust
197+
#[repr(C, Int)]
198+
enum MyEnum {
199+
A(u32),
200+
B(f32, u64),
201+
C { x: u32, y: u8 },
202+
D,
203+
}
204+
```
205+
206+
is equivalent to the following Rust definition:
207+
208+
```rust
209+
#[repr(C)]
210+
struct MyEnumRepr {
211+
tag: MyEnumTag,
212+
payload: MyEnumPayload,
213+
}
214+
215+
#[repr(Int)]
216+
enum MyEnumTag { A, B, C, D }
217+
218+
#[repr(C)]
219+
union MyEnumPayload {
220+
A: u32,
221+
B: MyEnumPayloadB,
222+
C: MyEnumPayloadC,
223+
D: (),
224+
}
225+
226+
#[repr(C)]
227+
struct MyEnumPayloadB(f32, u64);
228+
229+
#[repr(C)]
230+
struct MyEnumPayloadC { x: u32, y: u8 }
232+
```
233+
234+
This enum can also be represented in C++ as follows:
235+
236+
```c++
237+
#include <stdint.h>
238+
239+
enum class MyEnumTag: CppEquivalentOfInt { A, B, C, D };
240+
struct MyEnumPayloadB { float _0; uint64_t _1; };
241+
struct MyEnumPayloadC { uint32_t x; uint8_t y; };
242+
243+
union MyEnumPayload {
244+
uint32_t A;
245+
MyEnumPayloadB B;
246+
MyEnumPayloadC C;
247+
};
248+
249+
struct MyEnum {
250+
MyEnumTag tag;
251+
MyEnumPayload payload;
252+
};
253+
```
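
The equivalence between `MyEnum` and its hand-written representation can be spot-checked in Rust. This sketch picks `u8` for `Int`; the helper `x_of_c` and the lowercase union field names are assumptions for illustration, and the reference cast relies on the `#[repr(C, Int)]` layout from RFC 2195.

```rust
use std::mem::{align_of, size_of};

#[allow(dead_code)]
#[repr(C, u8)]
enum MyEnum {
    A(u32),
    B(f32, u64),
    C { x: u32, y: u8 },
    D,
}

#[allow(dead_code)]
#[repr(C)]
struct MyEnumRepr {
    tag: MyEnumTag,
    payload: MyEnumPayload,
}

#[allow(dead_code)]
#[repr(u8)]
enum MyEnumTag { A, B, C, D }

#[allow(dead_code)]
#[repr(C)]
union MyEnumPayload {
    a: u32,
    b: MyEnumPayloadB,
    c: MyEnumPayloadC,
    d: (),
}

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct MyEnumPayloadB(f32, u64);

#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct MyEnumPayloadC { x: u32, y: u8 }

/// Returns the `x` field if `value` is `MyEnum::C` (hypothetical helper).
fn x_of_c(value: &MyEnum) -> Option<u32> {
    // SAFETY: `MyEnum` is laid out as `MyEnumRepr`: the tag comes first,
    // followed by a union of the variant payloads.
    let repr = unsafe { &*(value as *const MyEnum as *const MyEnumRepr) };
    match repr.tag {
        // SAFETY: the tag says the `c` payload is the active one.
        MyEnumTag::C => Some(unsafe { repr.payload.c.x }),
        _ => None,
    }
}

fn main() {
    assert_eq!(size_of::<MyEnum>(), size_of::<MyEnumRepr>());
    assert_eq!(align_of::<MyEnum>(), align_of::<MyEnumRepr>());
    assert_eq!(x_of_c(&MyEnum::C { x: 7, y: 1 }), Some(7));
    assert_eq!(x_of_c(&MyEnum::D), None);
}
```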
254+
255+
## Enums without a specified representation
256+
257+
If no explicit `#[repr]` attribute is used, then the layout of most
258+
enums is not specified, with one crucial exception: option-like enums
259+
may in some cases use a compact layout that is identical to their
260+
payload.
261+
262+
(Meta-note: The content in this section is not described by any RFC
263+
and is therefore "non-normative".)
264+
265+
### Discriminant elision on Option-like enums
266+
267+
**Definition.** An **option-like enum** is an enum which has:
268+
269+
- one variant with a single field,
270+
- other variants with no fields ("unit" variants).
271+
272+
The simplest example is `Option<T>` itself, where the `Some` variant
273+
has a single field (of type `T`), and the `None` variant has no
274+
fields. But other enums that fit that same template (and even enums
275+
that include multiple `None`-like variants) also qualify.
276+
277+
**Definition.** The **payload** of an option-like enum is the single
278+
field which it contains; in the case of `Option<T>`, the payload has
279+
type `T`.
280+
281+
**Definition.** In some cases, the payload type may contain illegal
282+
values, which are called **niches**. For example, a value of type `&T`
283+
may never be NULL, and hence defines a niche consisting of the
284+
bitstring `0`. Similarly, the standard library types [`NonZeroU8`]
285+
and friends may never be zero, and hence also define the value of `0`
286+
as a niche. (Types that define niche values will say so as part of the
287+
description of their representation invariant.)
288+
289+
[`NonZeroU8`]: https://doc.rust-lang.org/std/num/struct.NonZeroU8.html
290+
291+
**Option-like enums where the payload defines an adequate number of
292+
niche values are guaranteed to be represented without using any
293+
discriminant at all.** This is called **discriminant elision**. If
294+
discriminant elision is in effect, then the layout of the enum is
295+
equal to the layout of its payload.
296+
297+
The most common example is that `Option<&u8>` can be represented as a
298+
nullable `&u8` reference -- the `None` variant is then represented
299+
using the niche value zero. This is because a valid `&u8` value can
300+
never be zero, so if we see a zero value, we know that this must be the
301+
`None` variant.
302+
303+
In order for the optimization to apply, the payload type must define a
304+
number of niches greater than or equal to the number of unit variants.
305+
In the case of `Option<T>`, this means that any niche at all will
306+
suffice, as there is only one unit variant (`None`).
307+
308+
**Example.** The type `Option<&u32>` will be represented at runtime as
309+
a nullable pointer. FFI interop often depends on this property.
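
A minimal sketch of this property: the hypothetical helper `as_nullable` reinterprets an `Option<&u32>` as the raw pointer a C API would expect, relying on `None` being stored as the null pointer.

```rust
use std::mem::{size_of, transmute};

/// Converts an optional reference into a possibly-null raw pointer
/// (hypothetical helper name).
fn as_nullable(value: Option<&u32>) -> *const u32 {
    // SAFETY: `Option<&u32>` has the same layout as a nullable pointer,
    // with `None` represented by the null pointer.
    unsafe { transmute::<Option<&u32>, *const u32>(value) }
}

fn main() {
    let x = 5u32;
    assert_eq!(size_of::<Option<&u32>>(), size_of::<&u32>());
    assert!(as_nullable(None).is_null());
    assert_eq!(as_nullable(Some(&x)), &x as *const u32);
}
```

In ordinary code the same conversion is usually written without `unsafe`, for example with `Option::map_or` and `std::ptr::null()`; the transmute is used here only to expose the layout.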
310+
311+
**Example.** As `fn` types are non-nullable, the type `Option<extern
312+
"C" fn()>` will be represented at runtime as a nullable function
313+
pointer (which is therefore equivalent to a C function pointer). FFI
314+
interop often depends on this property.
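
As an illustration of how such a type is typically used, the sketch below models an optional C callback. The names `Callback`, `double`, and `call_or_passthrough` are hypothetical.

```rust
use std::mem::size_of;

// An optional C callback: `None` plays the role of a NULL function pointer.
type Callback = Option<extern "C" fn(i32) -> i32>;

extern "C" fn double(x: i32) -> i32 {
    x * 2
}

/// Invokes the callback if one was supplied, otherwise returns `arg` unchanged.
fn call_or_passthrough(callback: Callback, arg: i32) -> i32 {
    match callback {
        Some(f) => f(arg),
        None => arg,
    }
}

fn main() {
    // The optional function pointer is no larger than a plain one.
    assert_eq!(size_of::<Callback>(), size_of::<extern "C" fn(i32) -> i32>());

    let cb: Callback = Some(double as extern "C" fn(i32) -> i32);
    assert_eq!(call_or_passthrough(cb, 21), 42);
    assert_eq!(call_or_passthrough(None, 21), 21);
}
```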
315+
316+
**Example.** Consider the following enum definitions:
317+
318+
```rust
319+
enum Enum1<T> {
320+
Present(T),
321+
Absent1,
322+
Absent2,
323+
}
324+
325+
enum Enum2 {
326+
A, B, C
327+
}
328+
```
329+
330+
`Enum1<&u8>` is not eligible for discriminant elision, since `&u8`
331+
defines a single niche value, but `Enum1` has two unit
332+
variants. However, `Enum2` has only three legal values (0 for `A`, 1
333+
for `B`, and 2 for `C`), and hence defines a plethora of niche values[^caveat].
334+
Therefore, `Enum1<Enum2>` is guaranteed to be laid out the same as
335+
`Enum2` ([consider the results of applying
336+
`size_of`](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eadff247f2c5713b8f3b6c9cda297711)).
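
For readers who prefer not to follow the playground link, a local version of that check might look like the sketch below. The size reported for `Enum1<&u8>` reflects current compiler behavior and is printed rather than asserted.

```rust
use std::mem::size_of;

#[allow(dead_code)]
enum Enum1<T> {
    Present(T),
    Absent1,
    Absent2,
}

#[allow(dead_code)]
enum Enum2 {
    A,
    B,
    C,
}

fn main() {
    // `Enum2` supplies enough niches for both unit variants of `Enum1`,
    // so no separate discriminant is needed.
    assert_eq!(size_of::<Enum1<Enum2>>(), size_of::<Enum2>());

    // `&u8` supplies a single niche (null), which cannot cover two unit
    // variants, so a separate discriminant is expected here.
    println!("&u8:        {} bytes", size_of::<&u8>());
    println!("Enum1<&u8>: {} bytes", size_of::<Enum1<&u8>>());
}
```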
337+
338+
[^caveat]: Strictly speaking, niche values are considered part of the "representation invariant" for an enum and not its type. Therefore, this section is added only as a preview for future unsafe-code-guidelines discussion.
339+
340+
### Other optimizations
341+
342+
The previous section specified a relatively narrow set of layout
343+
optimizations that are **guaranteed** by the compiler. However, the
344+
compiler is always free to perform **more** optimizations than this
345+
minimal set. For example, the compiler presently treats `Result<T,
346+
()>` and `Option<T>` as equivalent, but this behavior is not
347+
guaranteed to continue as `Result<T, ()>` is not considered
348+
"option-like".
349+
350+
As of this writing, the compiler's current behavior is to attempt to
351+
elide discriminants whenever possible. Furthermore, a variant whose
352+
only fields are zero-sized is considered a unit variant for this
353+
purpose. If eliding discriminants is not possible (e.g., because the
354+
payload does not define sufficient niche values), then the compiler
355+
will select an appropriate discriminant size `N` and use a
356+
representation roughly equivalent to `#[repr(N)]`, though without the
357+
strict `#[repr(C)]` guarantees on each struct. However, this behavior
358+
is not guaranteed to remain the same in future versions of the
359+
compiler and should not be relied upon. (While it is not expected that
360+
existing layout optimizations will be removed, it is possible -- it is
361+
also possible for the compiler to introduce new sorts of
362+
optimizations.)
363+
364+
## Niche values
365+
366+
C-like enums with N variants and no specified representation are
367+
guaranteed to supply at least 256 - N niche values (presuming
368+
that is a positive number). This is because a C-like enum must be
369+
represented using an integer and that integer must correspond to a
370+
valid variant: the precise size of C-like enums is not specified but
371+
it must be at least one byte, which means that there are at least 256
372+
possible bitstrings (only N of which are valid).
373+
374+
Other enums -- or enums with a specified representation -- may supply
375+
niches if their representation invariant permits it, but that is not
376+
**guaranteed**.
