Skip to content

Conversation

@huitseeker
Copy link
Contributor

@huitseeker huitseeker commented Jan 4, 2026

This replaces #2524.

This fixes 4 issues in both legacy and fast processor, see individual commits:

  1. when in_debug_mode == false, decorators should not be accessed, never mind executed,
  2. when strip_decorators is false, the DebugInfo provided as replacement should not panic on access,
  3. adds a --strip-decorators CLI option to the prove and run commands which strips decorators from the MastForest before operation,
  4. the err_ctx! macro should respect in_debug_mode, that is it should not perform decorator access in debug mode.

Even without stripping decorators, the combination of 1. and 4. yields a substantive performance improvement:

Baseline (main branch)

  • Total Time: 6906 ms
  • Execution Time: 4015 ms (58.14%)
  • Trace Generation: 4015 ms for 1,048,576 steps (48% padded)

Optimized (bypass-decorator-retrieval-in-release-mode branch)

  • Total Time: 3533 ms
  • Execution Time: 289 ms (8.17%)
  • Trace Generation: 288 ms for 1,048,576 steps (48% padded)

@huitseeker huitseeker changed the title Huitseeker/bypass decorator retrieval in release mode fix: bypass decorator retrieval in release mode Jan 4, 2026
Gate all decorator retrieval calls behind `in_debug_mode` checks,
ensuring zero overhead when debugging is disabled.

Processor changes:
- before_enter/after_exit decorator loops
- decorators_for_op in basic block execution

FastProcessor changes:
- execute_before_enter_decorators early return
- execute_after_exit_decorators early return
- decorators_for_op in basic block execution

Includes spy tests to verify retrieval is bypassed.
Update strip_decorators() to create an empty but valid CSR structure
instead of calling clear(), which removed the structure entirely and
caused panics when accessing decorator information after stripping.

- Add DebugInfo::empty_for_nodes(num_nodes) to create valid empty CSR
- Update from_components to accept empty structures
- Add edge case tests for empty forest, idempotency, multiple node types
@huitseeker huitseeker force-pushed the huitseeker/bypass-decorator-retrieval-in-release-mode branch from 1421357 to b817703 Compare January 4, 2026 08:30
@huitseeker huitseeker marked this pull request as ready for review January 4, 2026 08:36
@huitseeker huitseeker force-pushed the huitseeker/bypass-decorator-retrieval-in-release-mode branch from b817703 to a704496 Compare January 4, 2026 08:59
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I reviewed all non-test code and left a few comments inline (mostly for the future).

One other note, in this branch, we are not really differentiating execution vs. trace generation time. I think the benchmarks in the PR description listing execution around 300 ms also includes trace generation (just execution time should be about 10x faster than that). In next, I believe we already have this broken out - so, we can double-check when we apply these changes there.

Comment on lines +145 to +148
// Get the node ID once since it doesn't change within the loop
let node_id = basic_block
.linked_id()
.expect("basic block node should be linked when executing operations");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but getting node ID from the node in this way feels a bit backwards to me. When we apply these changes to next, we could probably just pass in node_id instead of back_block into this function (and in general, we should think how to simplify the number of parameters passed into this function). This will require refactoring how we build error context - but it should be relatively easy (e.g., instead of node.get_assembly_op(mast_forest, target_op_idx) we should be able to do something like mast.get_assembly_op(node_id, target_op_idx).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but a couple of nuances:

  • obviously, this is just moving a local variable definition out of a hot loop, not changing the definition (in a PR to main),
  • there's upcoming work on removing err_ctx as part of Investigate removing ErrorContext #1978 that may make this obsolete.

This fix resolves a 20x performance degradation in release mode when
decorators were present in the MastForest but not being executed.

Root Cause:
The err_ctx! macro was unconditionally calling node.get_assembly_op(),
which traversed the CSR decorator storage on every operation (522,059
times in blake3 benchmark), even when in_debug_mode was false. This
caused execution time to increase from 191ms to 3,884ms.

Solution:
Modified the error context creation pipeline to accept and respect the
in_debug_mode flag:
- Updated err_ctx! macro to require in_debug_mode parameter
- Updated ErrorContextImpl::new() and new_with_op_idx() to accept
  in_debug_mode
- Modified precalc_label_and_source_file() to return early when
  !in_debug_mode, avoiding expensive decorator traversal
- Updated all err_ctx!() call sites in Process and FastProcessor to
  pass in_debug_mode flag

Performance Impact:
- Before: Op execution time 3,884ms (7,439ns/op)
- After: Op execution time 191ms (366ns/op)
- Improvement: 20.3x faster

When in_debug_mode is false, decorators (including AsmOp decorators
used for error context) are no longer accessed, even if present in
the MastForest.
@huitseeker huitseeker force-pushed the huitseeker/bypass-decorator-retrieval-in-release-mode branch from a704496 to 07031b0 Compare January 5, 2026 08:49
Copy link
Contributor

@adr1anh adr1anh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just left a minor nit.

Comment on lines +114 to +117
let mut node_indptr_for_op_idx = IndexVec::new();
for _ in 0..=num_nodes {
let _ = node_indptr_for_op_idx.push(0);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we initialize the vector with zeros directly, rather than pushing one by one?

Copy link
Contributor Author

@huitseeker huitseeker Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an API that's available on IndexVec. Should be fixed, just not on a PR to main. #2532

@huitseeker
Copy link
Contributor Author

huitseeker commented Jan 5, 2026

@bobbinth
Re: confusion between trace generation and execution time, here are sequential runs rebased on next:

Baseline (origin/next) - 3 Runs

  | Run     | execute_for_trace | build_trace | Total Time |
  |---------|-------------------|-------------|------------|
  | 1       | 3.58s             | 70.9ms      | 34,991ms   |
  | 2       | 3.57s             | 73.8ms      | 35,100ms   |
  | 3       | 3.56s             | 76.2ms      | 35,001ms   |
  | Average | 3.57s             | 73.6ms      | 35,031ms   |

  Decorator Bypass (decorator-bypass-on-next) - 3 Runs

  | Run     | execute_for_trace | build_trace | Total Time |
  |---------|-------------------|-------------|------------|
  | 1       | 10.7ms            | 86.0ms      | 31,428ms   |
  | 2       | 7.34ms            | 78.9ms      | 31,688ms   |
  | 3       | 11.3ms            | 83.4ms      | 31,857ms   |
  | Average | 9.78ms            | 82.8ms      | 31,658ms   |

@huitseeker huitseeker merged commit 097626f into main Jan 5, 2026
16 checks passed
@huitseeker huitseeker deleted the huitseeker/bypass-decorator-retrieval-in-release-mode branch January 5, 2026 10:57
huitseeker added a commit that referenced this pull request Jan 5, 2026
This merge brings the decorator bypass optimization from main (#2529) into next.

The changes adapt the decorator bypass optimization to next's architecture:
- No changes to legacy Process (removed in next)
- All changes apply to FastProcessor only
- Decorator retrieval gated behind in_debug_mode checks
- Error context creation respects in_debug_mode flag

Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time.

Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
huitseeker added a commit that referenced this pull request Jan 5, 2026
This merge brings the decorator bypass optimization from main (#2529) into next.

The changes adapt the decorator bypass optimization to next's architecture:
- No changes to legacy Process (removed in next)
- All changes apply to FastProcessor only
- Decorator retrieval gated behind in_debug_mode checks
- Error context creation respects in_debug_mode flag

Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time.

Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
@huitseeker huitseeker mentioned this pull request Jan 5, 2026
huitseeker added a commit that referenced this pull request Jan 5, 2026
This merge brings the decorator bypass optimization from main (#2529) into next.

The changes adapt the decorator bypass optimization to next's architecture:
- No changes to legacy Process (removed in next)
- All changes apply to FastProcessor only
- Decorator retrieval gated behind in_debug_mode checks
- Error context creation respects in_debug_mode flag

Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time.

Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
huitseeker added a commit that referenced this pull request Jan 5, 2026
This merge brings the decorator bypass optimization from main (#2529) into next.

The changes adapt the decorator bypass optimization to next's architecture:
- No changes to legacy Process (removed in next)
- All changes apply to FastProcessor only
- Decorator retrieval gated behind in_debug_mode checks
- Error context creation respects in_debug_mode flag

Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time.

Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
bobbinth pushed a commit that referenced this pull request Jan 5, 2026
* fix: bypass decorator retrieval in release mode

Gate all decorator retrieval calls behind `in_debug_mode` checks,
ensuring zero overhead when debugging is disabled.

Processor changes:
- before_enter/after_exit decorator loops
- decorators_for_op in basic block execution

FastProcessor changes:
- execute_before_enter_decorators early return
- execute_after_exit_decorators early return
- decorators_for_op in basic block execution

Includes spy tests to verify retrieval is bypassed.

* fix(core): strip decorators while maintaining valid CSR structure

Update strip_decorators() to create an empty but valid CSR structure
instead of calling clear(), which removed the structure entirely and
caused panics when accessing decorator information after stripping.

- Add DebugInfo::empty_for_nodes(num_nodes) to create valid empty CSR
- Update from_components to accept empty structures
- Add edge case tests for empty forest, idempotency, multiple node types

* fix(processor): gate error context decorator access with in_debug_mode

This fix resolves a 20x performance degradation in release mode when
decorators were present in the MastForest but not being executed.

Root Cause:
The err_ctx! macro was unconditionally calling node.get_assembly_op(),
which traversed the CSR decorator storage on every operation (522,059
times in blake3 benchmark), even when in_debug_mode was false. This
caused execution time to increase from 191ms to 3,884ms.

Solution:
Modified the error context creation pipeline to accept and respect the
in_debug_mode flag:
- Updated err_ctx! macro to require in_debug_mode parameter
- Updated ErrorContextImpl::new() and new_with_op_idx() to accept
  in_debug_mode
- Modified precalc_label_and_source_file() to return early when
  !in_debug_mode, avoiding expensive decorator traversal
- Updated all err_ctx!() call sites in Process and FastProcessor to
  pass in_debug_mode flag

Performance Impact:
- Before: Op execution time 3,884ms (7,439ns/op)
- After: Op execution time 191ms (366ns/op)
- Improvement: 20.3x faster

When in_debug_mode is false, decorators (including AsmOp decorators
used for error context) are no longer accessed, even if present in
the MastForest.
@bobbinth
Copy link
Contributor

bobbinth commented Jan 5, 2026

Re: confusion between trace generation and execution time, here are sequential runs rebased on next:

Baseline (origin/next) - 3 Runs

  | Run     | execute_for_trace | build_trace | Total Time |
  |---------|-------------------|-------------|------------|
  | 1       | 3.58s             | 70.9ms      | 34,991ms   |
  | 2       | 3.57s             | 73.8ms      | 35,100ms   |
  | 3       | 3.56s             | 76.2ms      | 35,001ms   |
  | Average | 3.57s             | 73.6ms      | 35,031ms   |

  Decorator Bypass (decorator-bypass-on-next) - 3 Runs

  | Run     | execute_for_trace | build_trace | Total Time |
  |---------|-------------------|-------------|------------|
  | 1       | 10.7ms            | 86.0ms      | 31,428ms   |
  | 2       | 7.34ms            | 78.9ms      | 31,688ms   |
  | 3       | 11.3ms            | 83.4ms      | 31,857ms   |
  | Average | 9.78ms            | 82.8ms      | 31,658ms   |

Nice! Thank you for running these!

So, it seems like our current execution runs roughly at 100 MHz, right? I think that's pretty nice, and in the future we could probably try to optimize it to get closer to 200 MHz.

Re-trace generation, it should be able take advantage of multi-threading - so, if we run with concurrent feature, I'd expect it to be closer to 20ms range.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants