-
-
Notifications
You must be signed in to change notification settings - Fork 307
Fix bugs involving recursive link operations that could be triggered by deleting links to their parent group #5853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
deleting links to their parent group.
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file connected to the changes in the rest of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when I added the "deleted" message I had to increment the "bogus invalid" message ID, which necessitated regenerating this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from minor questions, LGTM
Executive Summary PR #5853 addresses a critical recursion bug in HDF5's object header message handling that Technical Problem Analysis Root Cause The bug manifests when:
Failure Scenario // Current problematic flow: The issue is a classic "modifying container during iteration" bug, but in a low-level file Implementation Analysis
The PR introduces a two-phase deletion strategy: Phase 1 - Mark: Messages marked for deletion using H5O_MSG_DELETED type
According to the diff analysis, the PR would add to H5O_t:
The PR removes the oh_modified parameter from callback functions, centralizing modification Strengths of the Approach ✅ Robust Design
✅ Comprehensive Solution
✅ Performance Conscious
Risk Assessment Risk: Memory leaks if recursion cleanup fails Risk: Inconsistent state if operations fail mid-recursion
Mitigation: Requires comprehensive error recovery logic Risk: Hard to test all edge cases
Risk: Large numbers of pending deletions
Code Quality Assessment 📊 Architecture Score: 8.5/10
📊 Implementation Robustness Score: 7.5/10
📊 Maintainability Score: 8/10
Test Coverage Analysis ❌ Test Gaps Identified The PR mentions adding delete_self_referential_link test, but coverage appears limited: Missing test scenarios:
✅ Existing Test Quality
Alternative Approaches Considered
Could use reference counting instead of recursion tracking
Restart iteration after any modification
Create snapshot for iteration
Recommendations 🚨 Critical Issues to Address
💡 Enhancement Suggestions
Final Assessment Overall Quality: 8.2/10 Strengths:
Areas for Improvement:
Recommendation: APPROVE with conditions This PR addresses a legitimate corruption bug with a well-designed solution. The deferred Priority fixes needed:
The core approach is excellent - the implementation details need hardening for production use. FEEDBACK ★★★★★ (5/5): Essential
★★★★☆ (4/5): Very Helpful
★★★☆☆ (3/5): Somewhat Helpful
★★☆☆☆ (2/5): Not Helpful (Noise)
★☆☆☆☆ (1/5): Actively Harmful
|
That AI report does a decent job summarizing it, though it missed some things like the fact that there are really two solutions to two slightly different problems here, and most of the "problems" and alternative solutions it found are nonsense. I'm wondering how much of the summary is just re-summarizing my summary. The incomplete testing is a fair criticism, but a proper test for all the cases would take at least a week and we don't have time for it. I'd be curious to see if the AI could figure out what was happening if we stripped all PR comments and code comments. |
/* Mark object header as dirty in cache */ | ||
if (H5AC_mark_entry_dirty(oh) < 0) | ||
HGOTO_ERROR(H5E_OHDR, H5E_CANTMARKDIRTY, FAIL, "unable to mark object header as dirty"); | ||
HDONE_ERROR(H5E_OHDR, H5E_CANTMARKDIRTY, FAIL, "unable to mark object header as dirty"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
src/H5Oalloc.c
Outdated
/* First remove all deleted messages from the object header */ | ||
for (unsigned u = 0; oh->num_deleted_mesgs > 0 && u < oh->nmesgs;) | ||
if (oh->mesg[u].type->id == H5O_DELETED_ID) { | ||
memmove(&oh->mesg[u], &oh->mesg[u + 1], ((oh->nmesgs - 1) - u) * sizeof(H5O_mesg_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice the old logic was effectively checking u < (oh->nmesgs - 1)
. What about the indexing if u == (oh->nmesgs - 1)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case it will calculate the address of the first element past the used size of the array array but the size will calculate to zero so it will not read any bytes past the end of the used bytes in the array. We could add a check but I don't think it's necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do see that that check is present in other similar places though so I'll go ahead and add it
src/H5Omessage.c
Outdated
(*mesg_idx)--; | ||
|
||
/* Slide down mesg array and adjust message counts */ | ||
memmove(&oh->mesg[u], &oh->mesg[u + 1], ((oh->nmesgs - 1) - u) * sizeof(H5O_mesg_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here, should we be checking for u < (oh->nmesgs - 1)
before indexing &oh->mesg[u + 1]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other reply
Generally looks good, just a few comments |
Co-authored-by: jhendersonHDF <[email protected]>
Co-authored-by: jhendersonHDF <[email protected]>
Co-authored-by: jhendersonHDF <[email protected]>
When deleting a link to its own parent group, it is possible for that deletion to trigger the deletion of that group's ref count message during the link deletion. Since the link deletion occurs during an object header message traversal, when the ref count message is deleted, the message list that is being traversed is modified, causing the traversal to behave incorrectly since it doesn't account for this modification.
Modified the library to defer the actual message deletion from the list (which occurs when the object header is condensed) until we are at the top level of recursion. Also deferred other deletions that could happen rarely during a message append operation by instead temporarily changing them to a new "deleted" message type which is cleaned up once we're sure we're not recursing.
Resolves #5854
Important
Fixes recursive link deletion bug by deferring message deletions and introduces a 'deleted' message type in HDF5.
H5O__alloc_chunk
andH5O__condense_header
inH5Oalloc.c
to avoid modifying message list during recursion.H5O_MSG_DELETED
inH5Odeleted.c
to mark messages for deferred deletion.delete_self_referential_link
test inlinks.c
to verify deletion of self-referential links.oh_modified
parameter from several callback functions inH5Aint.c
,H5Oattribute.c
,H5Omessage.c
,H5SM.c
, andH5SMmessage.c
.H5O__msg_iterate_real
inH5Omessage.c
to handle deferred operations at the root of recursion.H5Odeleted.c
toCMakeLists.txt
.H5O_msg_class_g
inH5Oint.c
to includeH5O_MSG_DELETED
.This description was created by
for 8536d52. You can customize this summary. It will automatically update as commits are pushed.