Conversation

fortnern
Member

@fortnern fortnern commented Sep 19, 2025

When deleting a link that points to its own parent group, the deletion can trigger the deletion of that group's ref count message while the link deletion is still in progress. Since the link deletion occurs during an object header message traversal, deleting the ref count message modifies the message list that is being traversed. The traversal doesn't account for this modification, so it behaves incorrectly.

Modified the library to defer the actual message deletion from the list (which occurs when the object header is condensed) until we are at the top level of recursion. Also deferred other deletions that could happen rarely during a message append operation by instead temporarily changing them to a new "deleted" message type which is cleaned up once we're sure we're not recursing.

Resolves #5854


Important

Fixes recursive link deletion bug by deferring message deletions and introduces a 'deleted' message type in HDF5.

  • Behavior:
    • Defers message deletion in H5O__alloc_chunk and H5O__condense_header in H5Oalloc.c to avoid modifying message list during recursion.
    • Introduces H5O_MSG_DELETED in H5Odeleted.c to mark messages for deferred deletion.
    • Adds delete_self_referential_link test in links.c to verify deletion of self-referential links.
  • Refactoring:
    • Removes unused oh_modified parameter from several callback functions in H5Aint.c, H5Oattribute.c, H5Omessage.c, H5SM.c, and H5SMmessage.c.
    • Updates H5O__msg_iterate_real in H5Omessage.c to handle deferred operations at the root of recursion.
  • Misc:
    • Adds H5Odeleted.c to CMakeLists.txt.
    • Adjusts H5O_msg_class_g in H5Oint.c to include H5O_MSG_DELETED.

This description was created by Ellipsis for 8536d52.

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
Contributor

Is this file connected to the changes in the rest of this PR?

Member Author

Yes, when I added the "deleted" message I had to increment the "bogus invalid" message ID, which necessitated regenerating this file.

mattjala
mattjala previously approved these changes Sep 19, 2025
Contributor

@mattjala mattjala left a comment

Aside from minor questions, LGTM

@brtnfld
Collaborator

brtnfld commented Sep 19, 2025

Executive Summary

PR #5853 addresses a critical recursion bug in HDF5's object header message handling that
occurs when deleting self-referential links. The fix introduces a deferred deletion mechanism
using a new H5O_MSG_DELETED message type to prevent unsafe modification of message lists
during traversal.


Technical Problem Analysis

Root Cause

The bug manifests when:

  1. A link points to its own parent group (self-referential link)
  2. During link deletion, the reference count message is decremented
  3. This triggers recursive object header message traversal via H5O__msg_iterate_real()
  4. Critical flaw: Messages are deleted mid-traversal, corrupting the iteration loop

Failure Scenario

// Current problematic flow:
H5Ldelete() →
reference count decrement →
H5O__msg_iterate_real() →
callback deletes message →
list modification during iteration →
iterator corruption/crash

The issue is a classic "modifying container during iteration" bug, but in a low-level file
format context where corruption has severe consequences.
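The failure mode can be reproduced outside HDF5 with a toy array. The sketch below is only an illustration: `toy_list_t`, `toy_remove`, and `toy_iterate_unsafe` are hypothetical names, not library code. It shows how an index-based loop silently skips an entry when its callback removes the current slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy list standing in for an object header's message array. */
typedef struct {
    int    ids[8];
    size_t n;
} toy_list_t;

/* Remove the entry at index u by sliding the tail down one slot. */
static void toy_remove(toy_list_t *l, size_t u)
{
    assert(u < l->n);
    memmove(&l->ids[u], &l->ids[u + 1], (l->n - 1 - u) * sizeof l->ids[0]);
    l->n--;
}

/* Visit every entry; the "callback" removes any entry equal to victim.
 * Returns the number of entries actually visited. The loop does not know
 * that toy_remove() shifted the tail, so the entry that slides into slot u
 * is never visited. */
static size_t toy_iterate_unsafe(toy_list_t *l, int victim)
{
    size_t visited = 0;

    for (size_t u = 0; u < l->n; u++) {
        visited++;
        if (l->ids[u] == victim)
            toy_remove(l, u);
    }
    return visited;
}
```

Starting from `{10, 20, 30, 40}` and removing `20`, the loop visits only three of the four entries: `30` slides into the slot that was just processed and is skipped.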


Implementation Analysis

  1. Core Design Approach

The PR introduces a two-phase deletion strategy:

Phase 1 - Mark: Messages marked for deletion using H5O_MSG_DELETED type
Phase 2 - Sweep: Actual deletion occurs after iteration completes
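A minimal sketch of the mark/sweep idea, under stated assumptions: the types and functions (`toy_header_t`, `toy_mark_deleted`, `toy_sweep`) are hypothetical, and the sweep uses a single-pass compaction for brevity rather than whatever the PR does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message tags; TOY_DELETED plays the role of a tombstone. */
enum { TOY_LIVE = 0, TOY_DELETED = 1 };

typedef struct {
    int kind;
    int payload;
} toy_mesg_t;

typedef struct {
    toy_mesg_t mesg[8];
    size_t     nmesgs;      /* slots in use, including tombstones */
    size_t     num_deleted; /* tombstones waiting for the sweep */
} toy_header_t;

/* Phase 1 (mark): retag the slot in place; the array layout is untouched,
 * so any traversal that is in progress keeps valid indices. */
static void toy_mark_deleted(toy_header_t *oh, size_t u)
{
    assert(u < oh->nmesgs);
    if (oh->mesg[u].kind != TOY_DELETED) {
        oh->mesg[u].kind = TOY_DELETED;
        oh->num_deleted++;
    }
}

/* Phase 2 (sweep): compact the array once no traversal is active. */
static void toy_sweep(toy_header_t *oh)
{
    size_t dst = 0;

    for (size_t src = 0; src < oh->nmesgs; src++)
        if (oh->mesg[src].kind != TOY_DELETED)
            oh->mesg[dst++] = oh->mesg[src];

    oh->nmesgs      = dst;
    oh->num_deleted = 0;
}
```

Marking leaves `nmesgs` unchanged; only the sweep shrinks the array.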

  2. New Data Structures (Proposed)

According to the diff analysis, the PR would add to H5O_t:
struct H5O_t {
    // ... existing fields ...
    unsigned recursion_level;   // Track recursion depth
    unsigned num_deleted_mesgs; // Count of pending deletions
    unsigned mesgs_modified;    // Track if messages changed
};
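How a depth counter like `recursion_level` can gate the cleanup is sketched below. This is a toy model, not the library's code: `demo_oh_t`, `demo_iterate`, and the callbacks are hypothetical, and only the two field names are borrowed from the listing above. The outer callback re-enters the iteration, the inner callback marks an entry as deleted, and compaction runs only when the depth returns to zero.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int      ids[8];
    int      dead[8];           /* 1 when the slot is a pending deletion */
    size_t   nmesgs;
    unsigned recursion_level;   /* depth of nested demo_iterate() calls */
    unsigned num_deleted_mesgs; /* pending deletions */
} demo_oh_t;

typedef void (*demo_cb_t)(demo_oh_t *oh, size_t u);

/* Records what the outer callback observed after the nested call returned. */
static size_t demo_nmesgs_after_inner = 0;

static void demo_mark(demo_oh_t *oh, size_t u)
{
    assert(u < oh->nmesgs);
    if (!oh->dead[u]) {
        oh->dead[u] = 1;
        oh->num_deleted_mesgs++;
    }
}

static void demo_compact(demo_oh_t *oh)
{
    size_t dst = 0;

    for (size_t src = 0; src < oh->nmesgs; src++)
        if (!oh->dead[src]) {
            oh->ids[dst]  = oh->ids[src];
            oh->dead[dst] = 0;
            dst++;
        }
    oh->nmesgs            = dst;
    oh->num_deleted_mesgs = 0;
}

static void demo_iterate(demo_oh_t *oh, demo_cb_t cb)
{
    oh->recursion_level++;
    for (size_t u = 0; u < oh->nmesgs; u++)
        if (!oh->dead[u])
            cb(oh, u);
    oh->recursion_level--;

    /* Only the outermost call is allowed to reshape the array. */
    if (oh->recursion_level == 0 && oh->num_deleted_mesgs > 0)
        demo_compact(oh);
}

/* Inner callback: marks the entry with id 20 for deletion. */
static void demo_inner_cb(demo_oh_t *oh, size_t u)
{
    if (oh->ids[u] == 20)
        demo_mark(oh, u);
}

/* Outer callback: on id 10, re-enters the iteration (the recursive case). */
static void demo_outer_cb(demo_oh_t *oh, size_t u)
{
    if (oh->ids[u] == 10) {
        demo_iterate(oh, demo_inner_cb);
        demo_nmesgs_after_inner = oh->nmesgs;
    }
}
```

In this model the nested call returns with the array still at its original length, so the outer loop's index stays valid; the entry disappears only after the outermost `demo_iterate` finishes.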

  3. New Message Type
  • H5O_MSG_DELETED (0x001a): Placeholder message for deferred deletion
  • Increases H5O_MSG_TYPES from 26 to 27
  • Never actually written to disk - purely runtime construct
  4. Modified Function Signatures

The PR removes the oh_modified parameter from callback functions, centralizing modification
tracking in the object header structure itself.


Strengths of the Approach

✅ Robust Design

  • Safe iteration: Prevents list modification during traversal
  • Minimal intrusion: Uses existing message type infrastructure
  • Backwards compatible: No file format changes

✅ Comprehensive Solution

  • Handles deep recursion: Tracks recursion levels properly
  • Deferred cleanup: Only cleans up at top-level recursion
  • State tracking: Maintains clear state about pending operations

✅ Performance Conscious

  • Rare path optimization: Only activates for uncommon scenarios
  • Minimal overhead: Adds small metadata overhead only when needed
  • Efficient cleanup: Bulk deletion at end vs. incremental

Risk Assessment

⚠️ Memory Management Risks

Risk: Memory leaks if recursion cleanup fails
// Potential issue:
if (recursion_level == 0) {
    cleanup_deleted_messages(); // What if this fails?
}
Mitigation: Need robust error handling in cleanup paths

⚠️ Complex State Management

Risk: Inconsistent state if operations fail mid-recursion

  • Recursion level tracking could become inconsistent
  • Pending deletion count could be wrong
  • Object header state could be corrupted

Mitigation: Requires comprehensive error recovery logic

⚠️ Testing Complexity

Risk: Hard to test all edge cases

  • Self-referential links are uncommon
  • Deep recursion scenarios are difficult to create
  • Error injection testing is critical

⚠️ Performance Edge Cases

Risk: Large numbers of pending deletions

  • Could accumulate significant memory
  • Cleanup could become expensive
  • Need bounds on pending operation count

Code Quality Assessment

📊 Architecture

Score: 8.5/10

  • Clean separation of concerns
  • Follows HDF5 design patterns
  • Minimal API surface changes

📊 Implementation Robustness

Score: 7.5/10

  • Well-structured approach
  • Concern: Error handling complexity not fully visible in PR
  • Concern: Thread safety implications unclear

📊 Maintainability

Score: 8/10

  • Builds on existing message infrastructure
  • Self-contained changes
  • Clear documentation of purpose

Test Coverage Analysis

❌ Test Gaps Identified

The PR mentions adding delete_self_referential_link test, but coverage appears limited:

Missing test scenarios:

  1. Deep recursion chains: A→B→C→A cycles
  2. Multiple self-references: Group with multiple self-referential links
  3. Error injection: What happens if cleanup fails?
  4. Memory pressure: Large numbers of pending deletions
  5. Concurrent access: Thread safety (if applicable)

✅ Existing Test Quality

  • Addresses the specific reported bug case
  • Uses realistic HDF5 usage patterns
  • Tests both creation and deletion paths

Alternative Approaches Considered

  1. Reference Counting

Could use reference counting instead of recursion tracking

  • Pro: More precise lifetime management
  • Con: Higher complexity, potential for leaks
  2. Iteration Restart

Restart iteration after any modification

  • Pro: Simpler state management
  • Con: Poor performance, complexity in restart logic
  3. Copy-on-Write Message Lists

Create snapshot for iteration

  • Pro: Complete isolation from modifications
  • Con: Memory overhead, copying complexity

Recommendations

🚨 Critical Issues to Address

  1. Enhanced Error Handling
    // Need robust cleanup on error
    if (error_during_recursion) {
        rollback_deleted_messages();
        reset_recursion_state();
    }
  2. Bounds Checking
    // Prevent excessive pending operations
    if (num_deleted_mesgs > MAX_PENDING_DELETIONS) {
        force_immediate_cleanup();
    }
  3. Comprehensive Testing
    - Add stress tests for deep recursion
    - Add error injection tests
    - Add memory usage validation

💡 Enhancement Suggestions

  1. Monitoring/Debugging
    #ifdef H5O_DEBUG
    log_recursion_state(oh->recursion_level, oh->num_deleted_mesgs);
    #endif
  2. Performance Optimization
    // Batch operations for efficiency
    if (num_deleted_mesgs > BATCH_THRESHOLD) {
        batch_delete_messages();
    }

Final Assessment

Overall Quality: 8.2/10

Strengths:

  • ✅ Solves a real, critical bug
  • ✅ Clean, well-architected solution
  • ✅ Minimal impact on existing code
  • ✅ Follows HDF5 design patterns

Areas for Improvement:

  • ⚠️ Need more comprehensive error handling
  • ⚠️ Require additional test coverage
  • ⚠️ Performance bounds need definition
  • ⚠️ Documentation of edge cases

Recommendation: APPROVE with conditions

This PR addresses a legitimate corruption bug with a well-designed solution. The deferred
deletion approach is sound and follows good engineering principles. However, it requires
additional work on error handling, testing, and documentation before merge.

Priority fixes needed:

  1. Robust error recovery mechanisms
  2. Comprehensive test suite expansion
  3. Performance bounds and monitoring
  4. Edge case documentation

The core approach is excellent - the implementation details need hardening for production use.


FEEDBACK

★★★★★ (5/5): Essential

  • Meaning: The AI found a critical issue.
  • Examples: A significant bug, a security vulnerability, a major performance regression, or a logical flaw that would break functionality.
  • Action: "This comment is a must-fix. It prevented a serious problem."

★★★★☆ (4/5): Very Helpful

  • Meaning: The AI suggested an evident and valuable improvement.
  • Examples: A non-trivial performance boost, a significant improvement to code readability or maintainability, or a suggestion that aligns perfectly with best practices.
  • Action: "I'm applying this change. The code is demonstrably better because of it."

★★★☆☆ (3/5): Somewhat Helpful

  • Meaning: The comment is correct but addresses a minor or stylistic point.
  • Examples: A minor formatting tweak, renaming a variable for slightly better clarity, or a suggestion that is technically correct but has negligible impact.
  • Action: "This is a good point, but not critical. I might apply it."

★★☆☆☆ (2/5): Not Helpful (Noise)

  • Meaning: The comment is irrelevant, a false positive, or factually incorrect but harmless.
  • Examples: A suggestion on code that wasn't part of the PR, a linting rule that doesn't apply, or a recommendation based on a misunderstanding of the code's intent.
  • Action: "I'm ignoring this. It's incorrect or irrelevant."

★☆☆☆☆ (1/5): Actively Harmful

  • Meaning: The suggested change is wrong and would introduce a bug or degrade the code.
  • Examples: A code suggestion that introduces a syntax error, a logical flaw, or removes necessary functionality.
  • Action: "This suggestion is wrong. Applying it would make the code worse and waste my time."

@fortnern
Member Author

fortnern commented Sep 19, 2025

That AI report does a decent job summarizing it, though it missed some things like the fact that there are really two solutions to two slightly different problems here, and most of the "problems" and alternative solutions it found are nonsense. I'm wondering how much of the summary is just re-summarizing my summary. The incomplete testing is a fair criticism, but a proper test for all the cases would take at least a week and we don't have time for it.

I'd be curious to see if the AI could figure out what was happening if we stripped all PR comments and code comments.

mattjala
mattjala previously approved these changes Sep 23, 2025
mattjala
mattjala previously approved these changes Sep 25, 2025
@jhendersonHDF jhendersonHDF self-assigned this Sep 25, 2025
@nbagha1 nbagha1 moved this from To be triaged to Scheduled/On-Deck in HDF5 - TRIAGE & TRACK Sep 25, 2025
@nbagha1 nbagha1 added this to the Release 2.0.0 milestone Sep 26, 2025
/* Mark object header as dirty in cache */
if (H5AC_mark_entry_dirty(oh) < 0)
HGOTO_ERROR(H5E_OHDR, H5E_CANTMARKDIRTY, FAIL, "unable to mark object header as dirty");
HDONE_ERROR(H5E_OHDR, H5E_CANTMARKDIRTY, FAIL, "unable to mark object header as dirty");
Collaborator

Nice catch!
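
For readers unfamiliar with the two macros in the snippet above: as far as I understand the library's conventions, HGOTO_ERROR records an error and jumps to the function's done: label, while HDONE_ERROR records the error and sets the return value without jumping, which is what cleanup code after done: needs. The sketch below uses simplified stand-in macros (`TOY_GOTO_ERROR`, `TOY_DONE_ERROR`, both hypothetical) and assumes the changed line sits in the cleanup section.

```c
#include <assert.h>

static int toy_errors_recorded = 0;

/* Record an error, set the return value, and jump to cleanup.
 * Only meaningful before the done: label. */
#define TOY_GOTO_ERROR(ret)                                                  \
    do {                                                                     \
        toy_errors_recorded++;                                               \
        ret_value = (ret);                                                   \
        goto done;                                                           \
    } while (0)

/* Record an error and set the return value, but keep executing.
 * Intended for code after the done: label. */
#define TOY_DONE_ERROR(ret)                                                  \
    do {                                                                     \
        toy_errors_recorded++;                                               \
        ret_value = (ret);                                                   \
    } while (0)

/* Stand-in for a cleanup step that may fail. */
static int toy_mark_dirty(int should_fail)
{
    return should_fail ? -1 : 0;
}

static int toy_operation(int body_fails, int cleanup_fails)
{
    int ret_value = 0;

    if (body_fails)
        TOY_GOTO_ERROR(-1);

done:
    /* Using TOY_GOTO_ERROR here would jump back to done: and retry the
     * cleanup forever whenever it keeps failing. */
    if (toy_mark_dirty(cleanup_fails) < 0)
        TOY_DONE_ERROR(-1);

    return ret_value;
}
```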

src/H5Oalloc.c Outdated
/* First remove all deleted messages from the object header */
for (unsigned u = 0; oh->num_deleted_mesgs > 0 && u < oh->nmesgs;)
    if (oh->mesg[u].type->id == H5O_DELETED_ID) {
        memmove(&oh->mesg[u], &oh->mesg[u + 1], ((oh->nmesgs - 1) - u) * sizeof(H5O_mesg_t));
Collaborator

@jhendersonHDF jhendersonHDF Sep 26, 2025

I notice the old logic was effectively checking u < (oh->nmesgs - 1). What about the indexing if u == (oh->nmesgs - 1) here?

Member Author

In that case it will calculate the address of the first element past the used portion of the array, but the size will evaluate to zero, so it will not read any bytes past the end of the used portion. We could add a check, but I don't think it's necessary.

Member Author

I do see that that check is present in other similar places, though, so I'll go ahead and add it.
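
The boundary case discussed in this thread can be checked with a toy array. The sketch below is illustrative only (`toy_oh_t` and both functions are hypothetical): removing the last used slot makes the byte count come out to zero, so the unguarded memmove copies nothing, and the guarded variant simply skips the call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    int id;
} toy_mesg_t;

typedef struct {
    toy_mesg_t mesg[4]; /* allocated slots */
    size_t     nmesgs;  /* used slots */
} toy_oh_t;

/* Unguarded removal: when u is the last used slot, &mesg[u + 1] is at most
 * one past the end of the array and the byte count is zero, so nothing is
 * read or written. Returns the number of bytes moved. */
static size_t toy_remove_unguarded(toy_oh_t *oh, size_t u)
{
    size_t nbytes;

    assert(u < oh->nmesgs);
    nbytes = ((oh->nmesgs - 1) - u) * sizeof(toy_mesg_t);
    memmove(&oh->mesg[u], &oh->mesg[u + 1], nbytes);
    oh->nmesgs--;
    return nbytes;
}

/* Guarded removal: skips the memmove entirely for the last used slot. */
static size_t toy_remove_guarded(toy_oh_t *oh, size_t u)
{
    size_t nbytes = 0;

    assert(u < oh->nmesgs);
    if (u < oh->nmesgs - 1) {
        nbytes = ((oh->nmesgs - 1) - u) * sizeof(toy_mesg_t);
        memmove(&oh->mesg[u], &oh->mesg[u + 1], nbytes);
    }
    oh->nmesgs--;
    return nbytes;
}
```

Both variants leave the array in the same state; the guard mainly makes the intent explicit and matches the pattern used elsewhere.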

src/H5Omessage.c Outdated
(*mesg_idx)--;

/* Slide down mesg array and adjust message counts */
memmove(&oh->mesg[u], &oh->mesg[u + 1], ((oh->nmesgs - 1) - u) * sizeof(H5O_mesg_t));
Collaborator

Same comment here, should we be checking for u < (oh->nmesgs - 1) before indexing &oh->mesg[u + 1]?

Member Author

See other reply

@jhendersonHDF
Collaborator

Generally looks good, just a few comments

@fortnern fortnern merged commit a289461 into HDFGroup:develop Sep 30, 2025
90 checks passed
@github-project-automation github-project-automation bot moved this from Scheduled/On-Deck to Done in HDF5 - TRIAGE & TRACK Sep 30, 2025
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Deleting self referntial links can cause failures
6 participants