# Document dense encoding of invalid pushdata in EOFv0 #98

@@ -1,3 +1,4 @@ | ||
/.idea | ||
__pycache__ | ||
/corpus | ||
/venv |
The same as above except encode the values as 6-bit numbers
(minimum number of bits needed for encoding `32`).
Such encoding lowers the size overhead from 3.1% to 2.3%.

### Encode only invalid jumpdests (dense encoding)

An alternative option is, instead of encoding all valid `JUMPDEST` locations, to encode only the invalid ones.
By an invalid `JUMPDEST` we mean a `0x5b` byte occurring in any pushdata, as illustrated below.

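For example, in the following hypothetical bytecode fragment the `0x5b` byte at offset 3 is the immediate data of a `PUSH2`, so it is an invalid `JUMPDEST`, while the byte at offset 5 is a real one:

```
offset  bytes    disassembly
0       60 01    PUSH1 0x01
2       61 5b00  PUSH2 0x5b00   ; the 0x5b at offset 3 is pushdata
5       5b       JUMPDEST       ; a valid JUMPDEST
```
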
This is beneficial because most contracts contain only a limited number of offending cases.
Our initial analysis of the top 1000 bytecodes used in the last year confirms this:
only 0.07% of bytecode bytes are invalid jumpdests.

Let's create a map `invalid_jumpdests[chunk_index] = first_instruction_offset`. We can densely encode this
map using techniques similar to *run-length encoding*: skip distances and delta-encode indexes.
This map is always fully loaded prior to execution, so it is important to ensure the encoded
version is as dense as possible (without sacrificing complexity).

> **Review comment:** Note to self: see how much of those costs could be covered by the 21000 gas.

We propose an encoding which uses [VLQ](https://en.wikipedia.org/wiki/Variable-length_quantity):

For each entry `index, first_instruction_offset` in `invalid_jumpdests`:

- Compute the chunk index distance to the previously encoded chunk: `delta = index - last_chunk_index - 1`.
- Combine the two numbers into a single unsigned integer: `entry = delta * 33 + first_instruction_offset`.
  This is reversible because `first_instruction_offset < 33`.
- Encode `entry` into a sequence of bytes using VLQ (e.g. LEB128), as worked through below.

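To illustrate these steps, here is the arithmetic for the first entry of the WETH example further below (chunk 37 with `first_instruction_offset = 4` and no previous entry):

```
delta = 37
entry = 37 * 33 + 4 = 1225 = 0b1001_1001001
LEB128: low 7 bits -> 0x49 | 0x80 = 0xc9  (continuation bit set)
        high bits  -> 0x09
encoded bytes: c9 09
```
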
For the worst case, where each chunk contains an invalid `JUMPDEST`, the encoding length is equal
to the number of chunks in the code, i.e. the size overhead is 3.1%.

| code size limit | code chunks | encoding chunks |
|-----------------|-------------|-----------------|
| 24576           | 768         | 24              |
| 32768           | 1024        | 32              |
| 65536           | 2048        | 64              |

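A minimal sketch of how this table is derived, assuming 32-byte chunks and the worst case of one invalid `JUMPDEST` in every chunk (each entry then has `delta = 0`, so `entry < 33` fits in a single LEB128 byte):

```python
CHUNK_SIZE = 32  # bytes per Verkle code chunk

for code_size_limit in (24576, 32768, 65536):
    code_chunks = code_size_limit // CHUNK_SIZE
    # Worst case: one encoding byte per code chunk.
    encoding_chunks = code_chunks // CHUNK_SIZE
    print(code_size_limit, code_chunks, encoding_chunks)
```
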
Our current hunch is that for average contracts this results in ~0.1% overhead, while the worst case is 3.1%.
This is strictly better than the 3.2% overhead of the current Verkle code chunking.

Stats from the "top 1000 bytecodes used in the last year" dataset:

```
total code length: 11785831
total encoding length: 11693 (0.099%)
encoding chunks distribution:
  0: 109 (10.9%)
  1: 838 (83.8%)
  2:  49 ( 4.9%)
  3:   4 ( 0.4%)
```

#### Encoding example

The most used bytecode: [0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2](https://etherscan.io/address/0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2) (WETH).

```
length: 3124
chunks: 98

chunks with invalid jumpdests:
chunk_index  delta  first_instruction_offset  entry  leb128
         37     37                         4   1225  c909
         49     11                        12    375  f702
         50      0                        14     14  0e
         87     36                        13   1201  b109
```

#### Header location

It is possible to place the above as part of the "EOFv0" header, but given that the upper bound on the number of chunks occupied is low (33 vs 21),
it is also possible to make this part of the Verkle account header.

> **Review comment:** Yeah, but if we want to increase the maximum code size to 64k, there won't be enough space left for it in the header.
>
> **Review comment:** With scheme 1 it is still 56 Verkle leafs for 64k code in the worst case. That should still easily fit into the 128 "special" first header leafs.
>
> **Review comment:** I think we definitely need a variadic length for this section because the average case (1–2 chunks) is much different from the worst case (20–30 chunks). I.e. you don't want to reserve ~60 chunks in the tree just to use 2 on average.

This second option allows for the simplification of the `code_size` value, as it does not need to change.

> **Review comment:** By "second option", you mean "adding it to the account header", not "Scheme 2", right? I don't see why there would be a difference with the other case though: in both cases, one needs to use the code size to skip the header.
>
> **Reply:** Yes. And no, because I'd imagine the account header (i.e. not code leafs/keys) would be handled separately, so the actual EVM code remains verbatim.

#### Runtime after Verkle

During the execution of a jump, two checks must be done in this order:

1. Check if the jump destination is the `JUMPDEST` opcode.
2. Check if the jump destination's chunk is in the `invalid_jumpdests` map.
   If so, the jumpdest analysis of the chunk must be performed
   to confirm the jump destination is not pushdata (a sketch of this check follows).

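A minimal sketch of this validation, assuming 32-byte chunks; the fallback re-runs the usual `JUMPDEST` scan within the single chunk, using its `first_instruction_offset` to skip pushdata spilled over from earlier chunks:

```python
JUMPDEST = 0x5B
CHUNK_SIZE = 32

def is_valid_jump(code: bytes, dest: int,
                  invalid_jumpdests: dict[int, int]) -> bool:
    # Check 1: the destination byte must be the JUMPDEST opcode.
    if dest >= len(code) or code[dest] != JUMPDEST:
        return False

    # Check 2: if the destination's chunk has no invalid jumpdests,
    # this 0x5b byte cannot be pushdata, so the jump is valid.
    chunk_index = dest // CHUNK_SIZE
    if chunk_index not in invalid_jumpdests:
        return True

    # Otherwise re-run jumpdest analysis within this single chunk,
    # starting at its first instruction.
    pc = chunk_index * CHUNK_SIZE + invalid_jumpdests[chunk_index]
    end = min((chunk_index + 1) * CHUNK_SIZE, len(code))
    while pc < end:
        if pc == dest:
            return True  # dest starts an instruction: a real JUMPDEST
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32: skip immediate data
            pc += op - 0x5F
    return False  # dest lies inside pushdata
```
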
Alternatively, it is possible to reconstruct the sparse account code prior to execution from all the chunks submitted with the transaction
and perform `JUMPDEST` validation to build a map of the relevant *valid `JUMPDEST` locations* instead.

#### Reference encoding implementation

```python
import io

import leb128


class VLQM33:
    """VLQ encoding of the invalid_jumpdests map, packing values mod 33."""

    VALUE_MOD = 33

    def encode(self, chunks: dict[int, int]) -> tuple[bytes, int]:
        """Encode the map; returns the bytes and their length in bits."""
        ops = b''
        last_chunk_index = 0
        for index, value in chunks.items():
            assert 0 <= value < self.VALUE_MOD
            # Distance from the end of the previously encoded chunk.
            delta = index - last_chunk_index
            # Pack delta and first_instruction_offset into one integer;
            # reversible because value < VALUE_MOD.
            e = delta * self.VALUE_MOD + value
            ops += leb128.u.encode(e)
            last_chunk_index = index + 1
        return ops, 8 * len(ops)

    def decode(self, ops: bytes) -> dict[int, int]:
        """Decode bytes back into the invalid_jumpdests map."""
        stream = io.BytesIO(ops)
        stream.seek(0, 2)
        end = stream.tell()
        stream.seek(0, 0)

        m = {}
        index = 0
        while stream.tell() != end:
            e, _ = leb128.u.decode_reader(stream)
            delta = e // self.VALUE_MOD
            value = e % self.VALUE_MOD
            index += delta
            m[index] = value
            index += 1
        return m
```

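As a usage example (not part of the reference implementation), a round-trip over the WETH entries from the encoding example above:

```python
codec = VLQM33()
weth = {37: 4, 49: 12, 50: 14, 87: 13}
encoded, bits = codec.encode(weth)
assert encoded.hex() == 'c909f7020eb109'
assert codec.decode(encoded) == weth
```
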
## Backwards Compatibility

EOF-packaged code execution is fully compatible with the legacy code execution.