
Always emit constants in bytecode #3976

Open

wants to merge 1 commit into main from dont_update_const_list

Conversation

sjamesr
Contributor

@sjamesr sjamesr commented Dec 19, 2024

No description provided.

Currently, the bytecode generator adds constants to a const_buf. In
doing this, it searches the const_buf for existing consts of the same
value. This ends up being quadratic in the number of constants.

I thought of adding a hashmap to make the lookup quicker, but we could
maybe just emit an instruction to push the constant onto the stack
(which is what already happens if the const_buf is full).

This will increase the generated bytecode size, but it makes the
bytecode generator a lot faster (e.g. I saw get_const_offset taking 25%
of the loading time on a 12kb module).
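
To make the cost concrete, here is a minimal, hypothetical sketch of the lookup pattern being removed (simplified to raw 64-bit values; this is not the actual get_const_offset code): each constant triggers a linear scan of the table, so loading n constants costs O(n^2). The PR sidesteps this by not maintaining the table for lookups and emitting the constant directly into the bytecode, as already happens when the table is full.

#include <stdint.h>

/* Sketch of the current strategy: search the constant table for an existing
   slot and append if missing -- a linear scan per constant, O(n^2) in total. */
static uint32_t
get_const_slot(uint64_t *const_buf, uint32_t *count, uint64_t value)
{
    for (uint32_t i = 0; i < *count; i++) {
        if (const_buf[i] == value)
            return i;              /* reuse the existing slot */
    }
    const_buf[*count] = value;     /* first time this value is seen */
    return (*count)++;
}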
@sjamesr force-pushed the dont_update_const_list branch from 930086d to 440a16b on December 19, 2024 at 21:02
@sjamesr
Contributor Author

sjamesr commented Dec 19, 2024

I just wanted to open this PR to start a discussion about the tradeoffs between having this constant list and emitting loads vs. simply pushing the value onto the stack.

@sjamesr
Contributor Author

sjamesr commented Dec 19, 2024

I haven't looked at how much this increases module size, but it definitely decreases bytecode generation time (previously it was like 25% of the time in get_const_offset)

@lum1n0us
Collaborator

> I haven't looked at how much this increases module size, but it definitely decreases bytecode generation time (previously it was like 25% of the time in get_const_offset)

That data will be helpful, since introducing the constant space was a trade-off in the first place.

@xujuntwt95329 @wenyongh Please join us.

@sjamesr
Contributor Author

sjamesr commented Dec 20, 2024

If we didn't want to pay the price in bytecode size, we could use a hashmap of value->offset to do the searches in ~constant time, at the cost of O(n) memory.

@wenyongh
Contributor

> If we didn't want to pay the price in bytecode size, we could use a hashmap of value->offset to do the searches in ~constant time, at the cost of O(n) memory.

Bytecode size and memory consumption are really important for us, since fast-interp already increases the footprint; the main design goal of constant offsets is to reduce the footprint, because there may be many duplicated constants. Using a hashmap is one choice. Another choice may be to keep loader_ctx->const_buf (which is a struct Const array) sorted by the const value, e.g., binary-search for the right position to insert when there is a new const to add, and binary-search the array to get the slot index. Since there are i32/i64/f32/f64 values, I think we can sort them by treating all the values as f64 type and comparing two values as f64.
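
As a rough illustration of this sorted-array variant (a hypothetical helper over simplified 64-bit keys, not existing loader code): binary-search for the value and, if it is absent, memmove the tail to open the insertion point. In the real proposal each element would also carry its fixed slot_index, so shifting array positions would not invalidate already-emitted bytecode.

#include <stdint.h>
#include <string.h>

/* Returns the array position of `value`, inserting it in sorted order if
   absent. The caller must ensure there is room for one more element. */
static uint32_t
sorted_find_or_insert(uint64_t *consts, uint32_t *count, uint64_t value)
{
    uint32_t lo = 0, hi = *count;
    while (lo < hi) {                       /* lower-bound binary search */
        uint32_t mid = lo + (hi - lo) / 2;
        if (consts[mid] < value)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < *count && consts[lo] == value)
        return lo;                          /* already present */
    memmove(&consts[lo + 1], &consts[lo],   /* shift the tail up by one */
            (size_t)(*count - lo) * sizeof(uint64_t));
    consts[lo] = value;
    (*count)++;
    return lo;
}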

@sjamesr
Contributor Author

sjamesr commented Dec 20, 2024

Would this approach require a second pass over the module data to collect all the constants? If you've already emitted slot indexes, we now cannot move the const from that slot (e.g. to insert a const before it) without rewriting the bytecode that was previously emitted. Also insertion sort like this is going to have average n^2 performance. Am I missing something?

So, I think the choices are either a separate pass to collect all the constants, then sort them and binary search, or a hashmap of value -> index which we maintain alongside the const_buf to speed up searches.

@wenyongh
Contributor

> Would this approach require a second pass over the module data to collect all the constants? If you've already emitted slot indexes, we now cannot move the const from that slot (e.g. to insert a const before it) without rewriting the bytecode that was previously emitted. Also insertion sort like this is going to have average n^2 performance. Am I missing something?

I mean that the sort/search operations are on the array of Const structs, which have value, value_type and slot_index fields, not on the const buffer. Since the Const struct records the slot index, we don't need to rewrite the bytecode; note that the loader just emits slot_index to the bytecode. This improves the search algorithm, but yes, it introduces many insertion operations, so it depends on the efficiency of memmove.

> So, I think the choices are either a separate pass to collect all the constants, then sort them and binary search, or a hashmap of value -> index which we maintain alongside the const_buf to speed up searches.

In fact the prepare_bytecode of fast-interp scans the bytecode twice, and it seems that the two passes currently do the same thing with the Const struct array. I am not sure whether we can handle them differently, e.g., in the first pass the loader doesn't search for the const offset but just always appends the new value to the Const array, and in the second pass the loader sorts the Const array, removes duplicated values, and then quickly searches the array to get the const offset when handling const opcodes.

Using a hashmap is a good way; my understanding is that the Const struct array is kept and you will create a new hashmap?

@xujuntwt95329
Collaborator

> > Would this approach require a second pass over the module data to collect all the constants? If you've already emitted slot indexes, we now cannot move the const from that slot (e.g. to insert a const before it) without rewriting the bytecode that was previously emitted. Also insertion sort like this is going to have average n^2 performance. Am I missing something?
>
> I mean that the sort/search operations are on the array of Const structs, which have value, value_type and slot_index fields, not on the const buffer. Since the Const struct records the slot index, we don't need to rewrite the bytecode; note that the loader just emits slot_index to the bytecode. This improves the search algorithm, but yes, it introduces many insertion operations, so it depends on the efficiency of memmove.
>
> > So, I think the choices are either a separate pass to collect all the constants, then sort them and binary search, or a hashmap of value -> index which we maintain alongside the const_buf to speed up searches.
>
> In fact the prepare_bytecode of fast-interp scans the bytecode twice, and it seems that the two passes currently do the same thing with the Const struct array. I am not sure whether we can handle them differently, e.g., in the first pass the loader doesn't search for the const offset but just always appends the new value to the Const array, and in the second pass the loader sorts the Const array, removes duplicated values, and then quickly searches the array to get the const offset when handling const opcodes.
>
> Using a hashmap is a good way; my understanding is that the Const struct array is kept and you will create a new hashmap?

A different process for the second traverse is a good idea, but it may waste some space when there are many duplicated values. Maybe there can be some special handling for common values (e.g. 0, 1) to reduce the waste.

But anyway this is a tradeoff; it seems there isn't a best way for all scenarios.

@lum1n0us lum1n0us linked an issue Dec 23, 2024 that may be closed by this pull request
@wenyongh
Contributor

> > In fact the prepare_bytecode of fast-interp scans the bytecode twice, and it seems that the two passes currently do the same thing with the Const struct array. I am not sure whether we can handle them differently, e.g., in the first pass the loader doesn't search for the const offset but just always appends the new value to the Const array, and in the second pass the loader sorts the Const array, removes duplicated values, and then quickly searches the array to get the const offset when handling const opcodes.
> >
> > Using a hashmap is a good way; my understanding is that the Const struct array is kept and you will create a new hashmap?
>
> A different process for the second traverse is a good idea, but it may waste some space when there are many duplicated values. Maybe there can be some special handling for common values (e.g. 0, 1) to reduce the waste.
>
> But anyway this is a tradeoff; it seems there isn't a best way for all scenarios.

Agree, there is really not a best way. Since there are already two traverses in the loader of fast-interp, my suggestion is not to introduce an extra traverse, which also takes time and degrades performance; instead we had better leverage the first traverse to do something. One way that I can think of is to record the bytecode positions and calculate the total i64/f64 and i32/f32 const counts in the first traverse. For example, we add two linked lists to the loader context, one for i64/f64 with nodes like struct I64Node { uint8 *bytecode; struct I64Node *next; }, another for i32/f32 with nodes like struct I32Node { uint8 *bytecode; struct I32Node *next; }, and also add uint32 i64_node_count; uint32 i32_node_count to the loader context:

typedef struct I64Node {
    uint8 *bytecode;
    struct I64Node *next;
} I64Node;
typedef struct I32Node {
    uint8 *bytecode;
    struct I32Node *next;
} I32Node;

struct WASMLoaderContext {
    ...
    I64Node *i64_node_list;
    I32Node *i32_node_list;
    uint32 i64_node_count;
    uint32 i32_node_count;
    ...
};

Then we add the i64/f64.const bytecodes one by one to the first list, add the i32/f32.const bytecodes one by one to the second list, and increase i64_node_count and i32_node_count accordingly. Note that we just remember all the nodes; we don't sort the lists. An f64 can be treated as an i64 by reinterpreting its buffer content as i64, like i64 = *(i64 *)&f64, and the same for f32.

Then, after the first traverse, we allocate a big buffer (an int64 array) for the i64 nodes, traverse i64_node_list and copy the i64/f64 consts into the array one by one, qsort the array, and then remove duplicated consts. The same goes for i32_node_list. After that we can allocate a merged buffer for the two arrays, copy the data into the new buffer, and free all the previous buffers (the two lists and the two array buffers). Then in the second traverse we can just use binary search (bsearch) to get the const offset when emitting bytecode.
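
A minimal sketch of the sort/dedup/bsearch step described above, over a plain int64_t array (hypothetical helper names; the real code would also have to map each unique value back to its const slot):

#include <stdint.h>
#include <stdlib.h>

static int
cmp_i64(const void *a, const void *b)
{
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return x < y ? -1 : (x > y ? 1 : 0);
}

/* Sort the collected i64/f64 consts and drop duplicates in place;
   returns the number of unique values. */
static uint32_t
sort_and_dedup(int64_t *consts, uint32_t count)
{
    uint32_t last = 0;
    if (count == 0)
        return 0;
    qsort(consts, count, sizeof(int64_t), cmp_i64);
    for (uint32_t i = 1; i < count; i++) {
        if (consts[i] != consts[last])
            consts[++last] = consts[i];   /* keep one copy of each value */
    }
    return last + 1;
}

/* Second traverse: locate a const value in O(log n). */
static int64_t *
find_const(int64_t *consts, uint32_t unique_count, int64_t value)
{
    return (int64_t *)bsearch(&value, consts, unique_count, sizeof(int64_t), cmp_i64);
}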

@sjamesr
Contributor Author

sjamesr commented Dec 26, 2024

what about

struct WASMLoaderContext {
    ...
    size_t operand_offset;  // points to beginning of operand within const_buf.data
    Vector *const_buf;
    HashMap *const_map;  // with appropriate Const equals and hash functions
    ...
}

bool wasm_loader_get_const_offset(WASMLoaderContext *ctx, uint8 value, void *type, size_t *offset) {
    Const search_key;
    if (!const_from_value(value, type, &search_key)) return false;
    void* entry = bh_hash_map_find(ctx->const_map, &search_key);
    if (entry) {
        *offset = -*(size_t *)entry;
        return true;
    }

    // It's a value we haven't seen before
    ctx->operand_offset += is_32bit_type(type) ? 1 : 2;  // whatever the appropriate value should be
    bh_vector_append(ctx->const_buf, search_key);

    size_t* value = malloc(...);
    *value = *offset = ctx->operand_offset;  // whatever is necessary to point into ctx->const_buf.data
    if (!bh_hash_map_insert(ctx->const_map, pointer_to_operand_offset)) { return false; }
    return true;
}

Basically, augment const_buf with a hashmap to allow ~constant-time lookups. We would need to handle the case where const_buf fills up (more than SHORT_MAX entries); that case would be treated as it is now (emit the param into the bytecode).

On the first traversal, the buf and map would be filled with unique values. On the second traversal, the value is either already in the map, or the buf filled up and we emit bytecodes.

What do you think?
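
For reference, the "appropriate Const equals and hash functions" might look roughly like this (a sketch with a hypothetical ConstKey type; the exact callback signatures would need to match whatever bh_hash_map_create expects):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key type: the constant's value bits plus its value type. */
typedef struct ConstKey {
    uint64_t value_bits;   /* i32/f32 zero-extended, i64/f64 reinterpreted */
    uint8_t value_type;
} ConstKey;

static uint32_t
const_hash(const void *key)
{
    const ConstKey *c = key;
    /* simple multiplicative mix of the value bits and the type */
    uint64_t h = (c->value_bits ^ c->value_type) * 0x9E3779B97F4A7C15ULL;
    return (uint32_t)(h >> 32);
}

static bool
const_equal(void *key1, void *key2)
{
    const ConstKey *a = key1, *b = key2;
    return a->value_type == b->value_type && a->value_bits == b->value_bits;
}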

@wenyongh
Contributor

> what about
>
> struct WASMLoaderContext {
>     ...
>     size_t operand_offset;  // points to beginning of operand within const_buf.data
>     Vector *const_buf;
>     HashMap *const_map;  // with appropriate Const equals and hash functions
>     ...
> }
>
> bool wasm_loader_get_const_offset(WASMLoaderContext *ctx, uint8 value, void *type, size_t *offset) {
>     Const search_key;
>     if (!const_from_value(value, type, &search_key)) return false;
>     void* entry = bh_hash_map_find(ctx->const_map, &search_key);
>     if (entry) {
>         *offset = -*(size_t *)entry;
>         return true;
>     }
>
>     // It's a value we haven't seen before
>     ctx->operand_offset += is_32bit_type(type) ? 1 : 2;  // whatever the appropriate value should be
>     bh_vector_append(ctx->const_buf, search_key);
>
>     size_t* value = malloc(...);
>     *value = *offset = ctx->operand_offset;  // whatever is necessary to point into ctx->const_buf.data
>     if (!bh_hash_map_insert(ctx->const_map, pointer_to_operand_offset)) { return false; }
>     return true;
> }
>
> Basically, augment const_buf with a hashmap to allow ~constant-time lookups. We would need to handle the case where const_buf fills up (more than SHORT_MAX entries); that case would be treated as it is now (emit the param into the bytecode).
>
> On the first traversal, the buf and map would be filled with unique values. On the second traversal, the value is either already in the map, or the buf filled up and we emit bytecodes.
>
> What do you think?

It is basically OK for me; it sounds like the performance is better than my idea, but it will consume more memory. A concern is that there may be many memory re-allocations when the vector length increases. We had better not use size_t but uint32 instead. Also, it seems there are some typos in the sample code, e.g. uint8 value should be WASMValue value? And in size_t* value = malloc(...); *value = *offset = ctx->operand_offset;, the value is duplicated with the function argument uint8 value. I think you can have a try.

@xujuntwt95329 what do you think?

@sjamesr
Contributor Author

sjamesr commented Dec 27, 2024

Thanks for the feedback. And yes, in my sample code it should be uint8 type, void *value to match what already exists in wasm_loader_get_const_offset.

My solution will involve many more allocations: once a key is inserted into the hash, the pointer to it cannot change, which means keys can't go into e.g. a vector that is reallocated on growth. Perhaps to reduce allocations, we can keep a linked list of buffers, each containing a fixed number of keys (e.g. 4 KB key buffers).

The advantage is that my solution should involve fewer memmoves/deletions, which your approach would presumably incur during the "remove duplicates" phase.

I will try working on a PR to see if this will work.
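
A minimal sketch of that "linked list of key buffers" idea (hypothetical names, not existing WAMR code): keys are handed out from fixed-size chunks, so a pointer stored in the hashmap stays valid no matter how many keys are added later.

#include <stdint.h>
#include <stdlib.h>

#define KEYS_PER_CHUNK 512   /* e.g. roughly a 4 KB chunk for 8-byte keys */

typedef struct KeyChunk {
    struct KeyChunk *next;
    uint32_t used;
    uint64_t keys[KEYS_PER_CHUNK];
} KeyChunk;

/* Store one key in the pool and return a stable pointer to it;
   existing keys are never moved, so hashmap entries stay valid. */
static uint64_t *
key_pool_alloc(KeyChunk **head, uint64_t key)
{
    KeyChunk *chunk = *head;
    if (!chunk || chunk->used == KEYS_PER_CHUNK) {
        chunk = calloc(1, sizeof(KeyChunk));
        if (!chunk)
            return NULL;
        chunk->next = *head;   /* push a fresh chunk onto the list */
        *head = chunk;
    }
    chunk->keys[chunk->used] = key;
    return &chunk->keys[chunk->used++];
}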

Successfully merging this pull request may close these issues.

improving wasm_loader_get_const_offset