@huitseeker huitseeker commented Dec 12, 2025

As a prelude to #2448, this changes the serialization of BasicBlock to reflect the padded contents, so that blocks no longer need to be re-batched and re-padded on deserialization.

The goal of this PR is twofold:

  • Experiment with and analyze the padded on-disk representation, unlocked by this comment 🎉, and start a conversation about how permissible it is to serialize metadata:
    • One important sub-part of this: in this PR, we're over-consuming bits in the indptr representation. Each index takes one full byte (the value is in [0, 72], which fits in 7 bits), and each group holds only a handful of ops (at most 9, given 8 groups and a maximum index of 72), so every delta between consecutive indices fits in 4 bits. A simple delta-encoding with an implicit start at 0 would therefore need $8 \times 4 = 32$ bits instead of $9 \times 8 = 72$ bits. This is not yet implemented, but would cut the overhead numbers below roughly in half.
  • Groundwork for Simplify MastForest serialization by directly serializing DebugInfo #2448, which needs to change the over-the-wire representation of decorator info anyway, but has a long paragraph in there on the difficulty of un-padding decorators ("The Padding Wrinkle") that disappears in a puff of simplification after this work.
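The delta-encoding idea in the first bullet can be sketched as follows. This is a minimal illustration, not code from this PR: `pack_indptr` and `unpack_indptr` are hypothetical names, and the sketch assumes the indptr array starts at 0, is non-decreasing, and that every delta fits in 4 bits.

```rust
/// Hypothetical helper: packs a 9-entry indptr array into 32 bits.
/// The first entry is implicit (always 0); the remaining eight entries are
/// stored as 4-bit deltas, with the delta for entry `i + 1` in nibble `i`.
/// Returns `None` if the array does not start at 0, is not non-decreasing,
/// or contains a delta larger than 15.
fn pack_indptr(indptr: &[u8; 9]) -> Option<u32> {
    if indptr[0] != 0 {
        return None;
    }
    let mut packed = 0u32;
    for i in 0..8 {
        // `checked_sub` rejects a decreasing sequence.
        let delta = indptr[i + 1].checked_sub(indptr[i])?;
        if delta > 0x0F {
            return None;
        }
        packed |= (delta as u32) << (4 * i);
    }
    Some(packed)
}

/// Hypothetical helper: inverse of `pack_indptr`.
/// Cannot overflow: eight deltas of at most 15 sum to at most 120.
fn unpack_indptr(packed: u32) -> [u8; 9] {
    let mut indptr = [0u8; 9];
    for i in 0..8 {
        let delta = ((packed >> (4 * i)) & 0x0F) as u8;
        indptr[i + 1] = indptr[i] + delta;
    }
    indptr
}
```

With this layout the per-batch indptr cost would drop from 9 bytes to 4 bytes; the nibble order is an arbitrary choice made for the sketch.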

Because this PR bumps the MastForest serialization version, it is intended to stay open until we land on the right over-the-wire format: the version bump lives here, and I don't intend to repeat it in the PRs stacked on top of this one. The stack will be three PRs: this one, one for delta-encoding, and one for #2448.

Test Data: Miden Standard Library

  • 727 total nodes
  • 439 basic block nodes (60.4%)
  • 284 procedures
  • 94,744 operations (with padding)

Summary

The padded format adds 4.05% size overhead:

  • 0.67% from NOOP padding (633 ops)
  • 3.38% from batch metadata (33,316 bytes)

Most blocks (92.7%) add ≤34 bytes.

Size comparison between unpadded and padded serialization formats for MastForest.

Size Comparison

| Format | Size | Overhead |
|---|---|---|
| Unpadded (next) | 838,400 bytes (818.75 KB) | baseline |
| Padded (serialize-padded-opbatches) | 872,375 bytes (851.93 KB) | +33,975 bytes (+4.05%) |

The unpadded format cannot guarantee exact OpBatch reconstruction after deserialization.

Overhead Sources

NOOP Padding: 633 operations (0.67%)

| Metric | Count |
|---|---|
| Total operations | 94,744 |
| NOOP padding | 633 |
| Real operations | 94,111 |

Batch Metadata: 33,316 bytes (98% of overhead)

| Component | Per Unit | Count | Total |
|---|---|---|---|
| Indptr array | 9 bytes | 3,156 batches | 28,404 bytes |
| Padding flags | 1 byte | 3,156 batches | 3,156 bytes |
| Batch count | 4 bytes | 439 blocks | 1,756 bytes |

Total metadata: 33,316 bytes (32.54 KB)

Distribution summary:

  • 92.7% of blocks: ≤34 bytes overhead (1-3 batches)
  • 2.3% of blocks: >100 bytes overhead (9+ batches)
  • Average: 75.89 bytes per block (skewed by outliers)

Metadata per block: 4 + (num_batches × 10) bytes
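As a sanity check on the arithmetic, the per-block formula and the stdlib totals can be reproduced with a few lines. This is only a sketch: the constant and function names are made up for the illustration, and the inputs are the counts reported in the tables of this description.

```rust
// Sizes taken from the metadata table; the names are hypothetical.
const BATCH_COUNT_BYTES: usize = 4; // one u32 per basic block
const INDPTR_BYTES_PER_BATCH: usize = 9; // 9 x u8 per batch
const PADDING_FLAG_BYTES_PER_BATCH: usize = 1; // one bit-packed byte per batch

/// Metadata bytes for a single basic block: 4 + 10 * num_batches.
fn metadata_bytes(num_batches: usize) -> usize {
    BATCH_COUNT_BYTES
        + num_batches * (INDPTR_BYTES_PER_BATCH + PADDING_FLAG_BYTES_PER_BATCH)
}

/// Metadata bytes for a whole forest, given its block and batch counts.
fn total_metadata_bytes(num_blocks: usize, total_batches: usize) -> usize {
    num_blocks * BATCH_COUNT_BYTES
        + total_batches * (INDPTR_BYTES_PER_BATCH + PADDING_FLAG_BYTES_PER_BATCH)
}
```

For the stdlib numbers, `total_metadata_bytes(439, 3_156)` gives 33,316 bytes, and `metadata_bytes(3)` gives the 34-byte bound quoted for blocks with up to 3 batches.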

Wire Format

Each basic block stores:

┌─────────────────────────────────────────────────┐
│ Padded Operations (variable)                    │
├─────────────────────────────────────────────────┤
│ Batch Count (u32, 4 bytes)                      │
├─────────────────────────────────────────────────┤
│ Indptr Arrays (9 × u8 per batch)               │
├─────────────────────────────────────────────────┤
│ Padding Flags (1 byte per batch, bit-packed)   │
└─────────────────────────────────────────────────┘

Size: ops_size + 4 + (10 × num_batches) bytes
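To make the metadata part of the layout concrete, here is a minimal round-trip sketch. It is not the serializer from this PR: `BatchMeta`, `write_metadata` and `read_metadata` are hypothetical names, and the little-endian batch count, the "all indptr arrays first, then all flag bytes" ordering, and the one-bit-per-group flag layout are assumptions read off the diagram above.

```rust
/// Hypothetical per-batch metadata: 9 indptr entries and 8 padding flags.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BatchMeta {
    indptr: [u8; 9],
    padding: [bool; 8],
}

/// Writes batch count, then every indptr array, then every flag byte.
fn write_metadata(batches: &[BatchMeta]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + 10 * batches.len());
    // Assumption: the u32 batch count is little-endian.
    out.extend_from_slice(&(batches.len() as u32).to_le_bytes());
    for batch in batches {
        out.extend_from_slice(&batch.indptr);
    }
    for batch in batches {
        // Assumption: bit `i` of the flag byte corresponds to group `i`.
        let mut flags = 0u8;
        for (i, &is_padded) in batch.padding.iter().enumerate() {
            if is_padded {
                flags |= 1 << i;
            }
        }
        out.push(flags);
    }
    out
}

/// Reads the layout produced by `write_metadata`.
/// Returns `None` if the byte length does not match the batch count.
fn read_metadata(bytes: &[u8]) -> Option<Vec<BatchMeta>> {
    if bytes.len() < 4 {
        return None;
    }
    let count = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    let indptr_end = 4usize.checked_add(count.checked_mul(9)?)?;
    let flags_end = indptr_end.checked_add(count)?;
    if bytes.len() != flags_end {
        return None;
    }
    let mut batches = Vec::with_capacity(count);
    for i in 0..count {
        let mut indptr = [0u8; 9];
        indptr.copy_from_slice(&bytes[4 + 9 * i..4 + 9 * (i + 1)]);
        let flags = bytes[indptr_end + i];
        let mut padding = [false; 8];
        for (bit, slot) in padding.iter_mut().enumerate() {
            *slot = (flags & (1 << bit)) != 0;
        }
        batches.push(BatchMeta { indptr, padding });
    }
    Some(batches)
}
```

The written buffer is always `4 + 10 * num_batches` bytes long, matching the size formula once `ops_size` is added for the padded operations themselves.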

Script: https://gist.github.com/huitseeker/f957014b15b0ed08a36b7f936079b698

@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch 3 times, most recently from 41d5579 to 20b4415 Compare December 12, 2025 22:17
@huitseeker huitseeker changed the title feat: serialize BasicBlocks in padded representation (1/2 or 1/3) feat: serialize BasicBlocks in padded representation (1/3) Dec 12, 2025
@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch from 20b4415 to 99bb3be Compare December 12, 2025 23:07
@huitseeker huitseeker requested review from adr1anh, bobbinth and plafer and removed request for adr1anh December 13, 2025 16:38
@bobbinth bobbinth left a comment

This is a very shallow review from me - but looks great! Thank you!

Comment on lines +1268 to +1272
/// Represents the operation data for a [`BasicBlockNodeBuilder`].
///
/// The decorators are bundled with the operation data to maintain the invariant that
/// decorator indices match the format of the operations:
/// - `Raw`: decorators have raw (unpadded) indices
/// - `Batched`: decorators have padded indices
#[derive(Debug)]
enum OperationData {

Question: do we need this "duality" mostly to support the slow processor? That is, is the slow processor the only reason why we need to keep track of the raw (unpadded) indexes?

@huitseeker huitseeker Dec 15, 2025

No, we still support raw indexes for these reasons:

  • Assembly block merging (raw ops & raw decorator indexes)
  • Node creation (raw → padded conversion in initial batching)
  • Fingerprinting (padded → raw conversion for stability, which I will remove after this stack of PRs)

This stack of PRs also removes the use of raw indexes in serialization.
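For readers unfamiliar with the raw/padded distinction, the index conversion can be pictured like this. It is a generic sketch, not the code in this repository: `raw_to_padded` and `padded_to_raw` are hypothetical helpers, and the input is assumed to be a simple per-position flag saying whether a padded slot holds an inserted NOOP.

```rust
/// Hypothetical helper: for each raw (unpadded) operation index, returns the
/// index of the same operation in the padded sequence.
/// `is_padding[i]` is true when padded position `i` holds an inserted NOOP.
fn raw_to_padded(is_padding: &[bool]) -> Vec<usize> {
    is_padding
        .iter()
        .enumerate()
        .filter_map(|(i, &pad)| if pad { None } else { Some(i) })
        .collect()
}

/// Hypothetical helper: maps a padded index back to its raw index.
/// Returns `None` for out-of-range positions and for padding NOOPs,
/// which have no raw counterpart.
fn padded_to_raw(is_padding: &[bool], padded_idx: usize) -> Option<usize> {
    if *is_padding.get(padded_idx)? {
        return None;
    }
    // The raw index is the number of real operations before this position.
    Some(is_padding[..padded_idx].iter().filter(|&&pad| !pad).count())
}
```

A decorator attached to raw operation `r` would then live at padded index `raw_to_padded(..)[r]`, which is the kind of adjustment that serializing padded indices directly avoids.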

@plafer plafer left a comment

LGTM!

@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch from 99bb3be to 2103969 Compare December 16, 2025 21:07
@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch 2 times, most recently from cd5c728 to db92d2d Compare December 30, 2025 04:04
@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch from db92d2d to 9204e3d Compare January 5, 2026 13:53

plafer commented Jan 7, 2026

I think the stack of 3 PRs is good to merge - any reason to hold off?

@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch from edea9b7 to f0f925e Compare January 7, 2026 18:31
Serialize BasicBlockNode operations in padded form with batch metadata
to enable exact OpBatch reconstruction during deserialization.

Changes:
- Add batch metadata to serialization format (indptr, padding, groups)
- Add OperationData enum to bundle operations with matching decorator indices
- Add from_op_batches constructor to BasicBlockNodeBuilder
- Serialize decorators with padded indices to match padded operations
- Bump serialization version from [0,0,0] to [0,0,1]
- Add comprehensive tests including 7 unit tests and 3 proptests

This preserves the exact OpBatch structure across serialization boundaries,
eliminating the need for re-batching during deserialization.

Add module-level documentation showing the over-the-wire format for
basic blocks with byte consumption formula.

Modified BasicBlockNode::to_builder to directly use pre-batched operations
and padded decorators already stored in the node, eliminating redundant
re-batching and decorator adjustment. The implementation now:

- Uses from_op_batches constructor to preserve existing op_batches
- Extracts padded decorators directly from Owned or Linked storage
- Avoids wasteful extraction of unpadded operations followed by re-batching
- Merged test_to_builder_identity_{owned,linked} into single test
  covering both storage types

- Simplified OpBatch roundtrip tests by using PartialEq instead of
  checking each field individually

- Simplified proptest assertions to compare OpBatch directly instead of
  checking ops, indptr, padding, groups, num_groups separately

Add comprehensive serialization round-trip test using the standard library
to verify multi-batch basic block serialization.
@huitseeker huitseeker force-pushed the serialize-padded-opbatches branch from f0f925e to b383cbf Compare January 7, 2026 18:42
@huitseeker huitseeker merged commit 3dc9dd5 into next Jan 7, 2026
16 checks passed
@huitseeker huitseeker deleted the serialize-padded-opbatches branch January 7, 2026 18:56