@huitseeker huitseeker commented Dec 13, 2025

This PR implements delta encoding for BasicBlockNode indptr arrays, building on the padded serialization format from #2466. It reduces batch metadata from 10 bytes to 5 bytes per batch through 4-bit delta encoding.

The delta encoding mentioned in the previous PR is now implemented:

  • Before: 9-byte indptr array (9 × 8 bits = 72 bits)
  • After: 4 bytes of packed deltas (8 × 4 bits = 32 bits, first index implicit)

Clarified the partial validity semantics of OpBatch arrays:

  • indptr[0..num_groups+1] - semantically valid prefix
  • padding[0..num_groups] - semantically valid prefix
  • groups[0..num_groups] - semantically valid prefix
  • Tail entries beyond these prefixes - undefined garbage for semantic purposes, but must be "safe garbage" for serialization

The entire indptr array must be monotonically non-decreasing for delta encoding. OpBatchAccumulator fills the tail with the final ops count to maintain this invariant while keeping the tail semantically meaningless.
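
The tail-filling rule can be sketched as follows. This is a minimal illustration, not the code in this PR: `fill_indptr_tail` and `is_monotonic` are hypothetical helper names, and the `usize` element type and the constant `BATCH_SIZE` are assumptions based on the 9-entry indptr array described above.

```rust
/// Number of op groups per batch; the indptr array has one more entry.
/// (Assumed from the 9-entry indptr array in the description.)
const BATCH_SIZE: usize = 8;

/// Hypothetical helper: overwrite every entry past the valid prefix
/// `indptr[0..num_groups+1]` with the final ops count, so the whole
/// array stays monotonically non-decreasing.
fn fill_indptr_tail(indptr: &mut [usize; BATCH_SIZE + 1], num_groups: usize) {
    assert!(num_groups <= BATCH_SIZE);
    let final_ops_count = indptr[num_groups];
    for entry in indptr[num_groups + 1..].iter_mut() {
        *entry = final_ops_count;
    }
}

/// Hypothetical check: the full array (prefix and tail) starts at 0
/// and never decreases.
fn is_monotonic(indptr: &[usize; BATCH_SIZE + 1]) -> bool {
    indptr[0] == 0 && indptr.windows(2).all(|w| w[0] <= w[1])
}
```

With three groups holding 3, 4, and 2 ops, an array that starts as `[0, 3, 7, 9, 0, 0, 0, 0, 0]` fails the check; after `fill_indptr_tail(&mut indptr, 3)` the tail entries all equal 9, every tail delta is 0, and the check passes.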

Summary

The delta-encoded format adds 2.20% size overhead compared to unpadded baseline:

  • 0.67% from NOOP padding (633 ops)
  • 1.53% from batch metadata (17,536 bytes with delta encoding)

Savings vs padded format: 15,780 bytes (-1.84%)

Most blocks (92.7%) add ≤22 bytes.

Size comparison across serialization formats for MastForest.

Size Comparison

| Format | Size | vs Unpadded | vs Padded |
|---|---|---|---|
| Unpadded (next) | 825,755 bytes (806.40 KB) | baseline | -3.95% |
| Padded (previous PR) | 859,730 bytes (839.58 KB) | +33,975 bytes (+4.11%) | baseline |
| Delta-encoded (this PR) | 843,950 bytes (824.17 KB) | +18,195 bytes (+2.20%) | -15,780 bytes (-1.84%) |

Overhead Sources

NOOP Padding: 633 operations (0.67%)

| Metric | Count |
|---|---|
| Total operations | 94,744 |
| NOOP padding | 633 |
| Real operations | 94,111 |

Batch Metadata: 17,536 bytes (with delta encoding)

| Component | Per Unit | Count | Total | Previous |
|---|---|---|---|---|
| Packed indptr deltas | 4 bytes | 3,156 batches | 12,624 bytes | 28,404 bytes |
| num_groups | 1 byte | 3,156 batches | 3,156 bytes | (included in indptr) |
| Batch count | 4 bytes | 439 blocks | 1,756 bytes | 1,756 bytes |

Total metadata: 17,536 bytes (17.12 KB)

Metadata per block: 4 + (num_batches × 5) bytes
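
As a quick arithmetic check of the totals above, the per-block formula can be summed over the whole forest. The function and constant names below are illustrative only; the byte counts come from the table and from the 10-byte and 5-byte per-batch figures in this description.

```rust
/// Per-batch metadata bytes in this PR, per the formula above.
const DELTA_BYTES_PER_BATCH: usize = 5;
/// Per-batch metadata bytes in the previous padded format.
const PADDED_BYTES_PER_BATCH: usize = 10;
/// Bytes used by the per-block batch count (u32).
const BATCH_COUNT_BYTES: usize = 4;

/// Total batch metadata for a forest: one batch count per block plus
/// a fixed number of bytes per batch.
fn total_metadata_bytes(num_blocks: usize, num_batches: usize, bytes_per_batch: usize) -> usize {
    num_blocks * BATCH_COUNT_BYTES + num_batches * bytes_per_batch
}
```

With 439 blocks and 3,156 batches, `total_metadata_bytes(439, 3156, DELTA_BYTES_PER_BATCH)` gives 17,536 bytes, matching the reported total. The same call with `PADDED_BYTES_PER_BATCH` gives 33,316 bytes, and the difference of 15,780 bytes matches the reported savings.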

Delta Encoding Details

Each indptr array is encoded as:

  1. Elide indptr[0] (always 0)
  2. Compute 8 deltas: indptr[i+1] - indptr[i] for i ∈ [0,7]
  3. Pack the 8 four-bit deltas into 4 bytes (2 deltas per byte)

Valid delta range: [0, 9] (max 9 ops per group)

Tail semantics: Entries beyond the valid prefix indptr[0..num_groups+1] must keep the array monotonically non-decreasing for serialization, but are semantically undefined. The tail is filled with the final ops count during construction.
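
The three encoding steps can be sketched as a pack/unpack pair. This is a sketch under stated assumptions, not the PR's implementation: `pack_deltas` and `unpack_deltas` are hypothetical names (the PR's decoder calls a function named `unpack_indptr_deltas`, whose signature is not shown here), the `usize` element type is assumed, and the nibble order (even-indexed delta in the low nibble) is a guess.

```rust
/// Hypothetical encoder: pack a 9-entry indptr array into 4 bytes of
/// 4-bit deltas. Returns None if indptr[0] is not 0, if the array
/// decreases anywhere, or if a delta exceeds the valid range [0, 9].
fn pack_deltas(indptr: &[usize; 9]) -> Option<[u8; 4]> {
    if indptr[0] != 0 {
        return None; // indptr[0] is elided, so it must be the implicit 0
    }
    let mut packed = [0u8; 4];
    for i in 0..8 {
        // checked_sub yields None when the array is not monotonic
        let delta = indptr[i + 1].checked_sub(indptr[i])?;
        if delta > 9 {
            return None;
        }
        // Assumed layout: even-indexed delta in the low nibble
        let shift = (i % 2) * 4;
        packed[i / 2] |= (delta as u8) << shift;
    }
    Some(packed)
}

/// Hypothetical decoder: rebuild the 9-entry indptr array by prefix-summing
/// the 8 deltas, starting from the implicit indptr[0] = 0.
fn unpack_deltas(packed: &[u8; 4]) -> Option<[usize; 9]> {
    let mut indptr = [0usize; 9];
    for i in 0..8 {
        let shift = (i % 2) * 4;
        let delta = ((packed[i / 2] >> shift) & 0x0f) as usize;
        if delta > 9 {
            return None; // nibble values 10..=15 are outside the valid range
        }
        indptr[i + 1] = indptr[i] + delta;
    }
    Some(indptr)
}
```

Under these assumptions, `[0, 3, 7, 9, 9, 9, 9, 9, 9]` has deltas 3, 4, 2, 0, 0, 0, 0, 0 and packs to `[0x43, 0x02, 0x00, 0x00]`. A tail left at 0 after a non-zero prefix would fail at the `checked_sub` step, which is why the tail must hold the final ops count.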

Wire Format

Each basic block stores:

┌─────────────────────────────────────────────────┐
│ Padded Operations (variable)                    │
├─────────────────────────────────────────────────┤
│ Batch Count (u32, 4 bytes)                      │
├─────────────────────────────────────────────────┤
│ num_groups (1 byte per batch)                   │
├─────────────────────────────────────────────────┤
│ Packed Indptr Deltas (4 bytes per batch)        │
└─────────────────────────────────────────────────┘

Size: ops_size + 4 + (5 × num_batches) bytes

Previous format size: ops_size + 4 + (10 × num_batches) bytes
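
To make the layout concrete, here is a minimal sketch of writing the metadata that follows the padded operations. Everything beyond the diagram is an assumption: the function name `write_batch_metadata` is hypothetical, the batch count is written little-endian, and all num_groups bytes are written before all packed-delta words, following the diagram from top to bottom.

```rust
/// Hypothetical writer for the per-block batch metadata.
/// Each batch is given as (num_groups, packed indptr deltas).
fn write_batch_metadata(out: &mut Vec<u8>, batches: &[(u8, [u8; 4])]) {
    // Batch count as a u32 (little-endian is an assumption)
    assert!(batches.len() <= u32::MAX as usize);
    let count = batches.len() as u32;
    out.extend_from_slice(&count.to_le_bytes());

    // One num_groups byte per batch
    for (num_groups, _) in batches {
        out.push(*num_groups);
    }

    // Four bytes of packed deltas per batch
    for (_, packed) in batches {
        out.extend_from_slice(packed);
    }
}
```

For a block with two batches this appends 4 + 5 × 2 = 14 bytes, in line with the size formula above.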

@bobbinth bobbinth left a comment

Also not a very deep review - but looks good! Thank you!

  //! - Padding flags per batch (1 byte each, bit-packed)
  //!
- //! **Total**: `ops_size + 4 + (10 * num_batches)` bytes
+ //! **Total**: `ops_size + 4 + (5 * num_batches)` bytes

A thought for the future about serialization of operations:

  • Currently, we use 1 byte to serialize each opcode - though, technically, each opcode is only 7 bits.
  • For operations like Assert, MpVerify, and U32assert2, we also serialize the error code (8 bytes), which is not semantically relevant and ideally would somehow go into the debug info section.
  • For the Push operation we always serialize the immediate values as 8 bytes. We could optimize this in the future by using a variable-length encoding, and also maybe building an immediate values table (since there is probably quite a lot of duplication of immediate values - though, with variable length encoding, the benefit of the table is somewhat questionable).
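
For illustration only, one common variable-length scheme is a LEB128-style varint over the immediate's 64-bit integer value. The comment above does not name a specific encoding, so the choice of LEB128 and the names `encode_varint` and `decode_varint` are assumptions, not a proposal from this PR.

```rust
/// Append `value` as an unsigned LEB128 varint: 7 payload bits per byte,
/// with the high bit set on every byte except the last.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Decode a varint from the front of `bytes`. Returns the value and the
/// number of bytes consumed, or None on truncated or oversized input.
fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    // A u64 needs at most 10 bytes (9 × 7 bits + 1 bit)
    for (i, &byte) in bytes.iter().enumerate().take(10) {
        let chunk = (byte & 0x7f) as u64;
        if i == 9 && chunk > 1 {
            return None; // the 10th byte may carry only the top bit
        }
        value |= chunk << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

Under this scheme values below 128 take 1 byte instead of 8, while values that use all 64 bits take 10 bytes, so the benefit depends on how the immediates are distributed.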

@plafer plafer left a comment

LGTM!

@bobbinth bobbinth left a comment

Looks good! Thank you!

@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch from edea9b7 to f0f925e Compare January 7, 2026 18:31
@huitseeker huitseeker force-pushed the delta-encode-indptr branch from aad51b4 to ce2fad2 Compare January 7, 2026 18:31
@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch from f0f925e to b383cbf Compare January 7, 2026 18:42
@huitseeker huitseeker force-pushed the delta-encode-indptr branch from ce2fad2 to 842094c Compare January 7, 2026 18:42
Base automatically changed from serialize-padded-opbatches to next January 7, 2026 18:56

1. Updated decoder to use unpack_indptr_deltas (4 bytes per batch)
2. Fixed OpBatchAccumulator::into_batch to fill indptr tail with final
   value, ensuring monotonicity required by delta encoding
3. Updated snapshot tests to reflect new indptr representation

All serialization tests now pass with full round-trip working.

Add documentation explaining that indptr/padding/groups arrays are partial
structures where only prefixes are semantically valid. Add debug assertions
to validate invariants including full array monotonicity required for
serialization.

Add validation for full indptr array monotonicity and final value in
MastForest validation. Call validate() after deserialization to ensure
reconstructed forests meet all invariants required for serialization.

@huitseeker huitseeker force-pushed the delta-encode-indptr branch from 842094c to 1aa41a1 Compare January 7, 2026 19:01
@huitseeker huitseeker merged commit 988aecf into next Jan 7, 2026
16 checks passed
@huitseeker huitseeker deleted the delta-encode-indptr branch January 7, 2026 19:20