-
Notifications
You must be signed in to change notification settings - Fork 250
fix: bypass decorator retrieval in release mode #2529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: bypass decorator retrieval in release mode #2529
Conversation
Gate all decorator retrieval calls behind `in_debug_mode` checks, ensuring zero overhead when debugging is disabled. Processor changes: - before_enter/after_exit decorator loops - decorators_for_op in basic block execution FastProcessor changes: - execute_before_enter_decorators early return - execute_after_exit_decorators early return - decorators_for_op in basic block execution Includes spy tests to verify retrieval is bypassed.
Update strip_decorators() to create an empty but valid CSR structure instead of calling clear(), which removed the structure entirely and caused panics when accessing decorator information after stripping. - Add DebugInfo::empty_for_nodes(num_nodes) to create valid empty CSR - Update from_components to accept empty structures - Add edge case tests for empty forest, idempotency, multiple node types
1421357 to
b817703
Compare
b817703 to
a704496
Compare
bobbinth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thank you! I reviewed all non-test code and left a few comments inline (mostly for the future).
One other note, in this branch, we are not really differentiating execution vs. trace generation time. I think the benchmarks in the PR description listing execution around 300 ms also includes trace generation (just execution time should be about 10x faster than that). In next, I believe we already have this broken out - so, we can double-check when we apply these changes there.
| // Get the node ID once since it doesn't change within the loop | ||
| let node_id = basic_block | ||
| .linked_id() | ||
| .expect("basic block node should be linked when executing operations"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but getting node ID from the node in this way feels a bit backwards to me. When we apply these changes to next, we could probably just pass in node_id instead of back_block into this function (and in general, we should think how to simplify the number of parameters passed into this function). This will require refactoring how we build error context - but it should be relatively easy (e.g., instead of node.get_assembly_op(mast_forest, target_op_idx) we should be able to do something like mast.get_assembly_op(node_id, target_op_idx).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, but a couple of nuances:
- obviously, this is just moving a local variable definition out of a hot loop, not changing the definition (in a PR to main),
- there's upcoming work on removing err_ctx as part of Investigate removing
ErrorContext#1978 that may make this obsolete.
This fix resolves a 20x performance degradation in release mode when decorators were present in the MastForest but not being executed. Root Cause: The err_ctx! macro was unconditionally calling node.get_assembly_op(), which traversed the CSR decorator storage on every operation (522,059 times in blake3 benchmark), even when in_debug_mode was false. This caused execution time to increase from 191ms to 3,884ms. Solution: Modified the error context creation pipeline to accept and respect the in_debug_mode flag: - Updated err_ctx! macro to require in_debug_mode parameter - Updated ErrorContextImpl::new() and new_with_op_idx() to accept in_debug_mode - Modified precalc_label_and_source_file() to return early when !in_debug_mode, avoiding expensive decorator traversal - Updated all err_ctx!() call sites in Process and FastProcessor to pass in_debug_mode flag Performance Impact: - Before: Op execution time 3,884ms (7,439ns/op) - After: Op execution time 191ms (366ns/op) - Improvement: 20.3x faster When in_debug_mode is false, decorators (including AsmOp decorators used for error context) are no longer accessed, even if present in the MastForest.
a704496 to
07031b0
Compare
adr1anh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just left a minor nit.
| let mut node_indptr_for_op_idx = IndexVec::new(); | ||
| for _ in 0..=num_nodes { | ||
| let _ = node_indptr_for_op_idx.push(0); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we initialize the vector with zeros directly, rather than pushing one by one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an API that's available on IndexVec. Should be fixed, just not on a PR to main. #2532
|
@bobbinth |
This merge brings the decorator bypass optimization from main (#2529) into next. The changes adapt the decorator bypass optimization to next's architecture: - No changes to legacy Process (removed in next) - All changes apply to FastProcessor only - Decorator retrieval gated behind in_debug_mode checks - Error context creation respects in_debug_mode flag Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time. Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
This merge brings the decorator bypass optimization from main (#2529) into next. The changes adapt the decorator bypass optimization to next's architecture: - No changes to legacy Process (removed in next) - All changes apply to FastProcessor only - Decorator retrieval gated behind in_debug_mode checks - Error context creation respects in_debug_mode flag Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time. Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
This merge brings the decorator bypass optimization from main (#2529) into next. The changes adapt the decorator bypass optimization to next's architecture: - No changes to legacy Process (removed in next) - All changes apply to FastProcessor only - Decorator retrieval gated behind in_debug_mode checks - Error context creation respects in_debug_mode flag Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time. Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
This merge brings the decorator bypass optimization from main (#2529) into next. The changes adapt the decorator bypass optimization to next's architecture: - No changes to legacy Process (removed in next) - All changes apply to FastProcessor only - Decorator retrieval gated behind in_debug_mode checks - Error context creation respects in_debug_mode flag Performance impact: ~10% overall speedup, 99.7% reduction in trace execution time. Conflicts resolved using solutions from huitseeker/decorator-bypass-on-next rebase work.
* fix: bypass decorator retrieval in release mode Gate all decorator retrieval calls behind `in_debug_mode` checks, ensuring zero overhead when debugging is disabled. Processor changes: - before_enter/after_exit decorator loops - decorators_for_op in basic block execution FastProcessor changes: - execute_before_enter_decorators early return - execute_after_exit_decorators early return - decorators_for_op in basic block execution Includes spy tests to verify retrieval is bypassed. * fix(core): strip decorators while maintaining valid CSR structure Update strip_decorators() to create an empty but valid CSR structure instead of calling clear(), which removed the structure entirely and caused panics when accessing decorator information after stripping. - Add DebugInfo::empty_for_nodes(num_nodes) to create valid empty CSR - Update from_components to accept empty structures - Add edge case tests for empty forest, idempotency, multiple node types * fix(processor): gate error context decorator access with in_debug_mode This fix resolves a 20x performance degradation in release mode when decorators were present in the MastForest but not being executed. Root Cause: The err_ctx! macro was unconditionally calling node.get_assembly_op(), which traversed the CSR decorator storage on every operation (522,059 times in blake3 benchmark), even when in_debug_mode was false. This caused execution time to increase from 191ms to 3,884ms. Solution: Modified the error context creation pipeline to accept and respect the in_debug_mode flag: - Updated err_ctx! macro to require in_debug_mode parameter - Updated ErrorContextImpl::new() and new_with_op_idx() to accept in_debug_mode - Modified precalc_label_and_source_file() to return early when !in_debug_mode, avoiding expensive decorator traversal - Updated all err_ctx!() call sites in Process and FastProcessor to pass in_debug_mode flag Performance Impact: - Before: Op execution time 3,884ms (7,439ns/op) - After: Op execution time 191ms (366ns/op) - Improvement: 20.3x faster When in_debug_mode is false, decorators (including AsmOp decorators used for error context) are no longer accessed, even if present in the MastForest.
Nice! Thank you for running these! So, it seems like our current execution runs roughly at 100 MHz, right? I think that's pretty nice, and in the future we could probably try to optimize it to get closer to 200 MHz. Re-trace generation, it should be able take advantage of multi-threading - so, if we run with concurrent feature, I'd expect it to be closer to 20ms range. |
This replaces #2524.
This fixes 4 issues in both legacy and fast processor, see individual commits:
in_debug_mode == false, decorators should not be accessed, never mind executed,strip_decoratorsis false, theDebugInfoprovided as replacement should not panic on access,adds a--strip-decoratorsCLI option to theproveandruncommands which strips decorators from theMastForestbefore operation,err_ctx!macro should respectin_debug_mode, that is it should not perform decorator access in debug mode.Even without stripping decorators, the combination of 1. and 4. yields a substantive performance improvement:
Baseline (main branch)
Optimized (bypass-decorator-retrieval-in-release-mode branch)