TEST/GTEST/UCT: Retry when it cannot allocate MEMIC memory #10393
test/gtest/uct/uct_test.cc
Outdated
@@ -961,8 +961,14 @@ void uct_test::entity::mem_alloc(size_t length, unsigned mem_flags,
        status = uct_mem_alloc(length, alloc_methods,
                               ucs_static_array_size(alloc_methods), &params,
                               mem);
    }

    if (may_fail && (status != UCS_OK)) {
Do we really need to pass the may_fail param? Can we just catch the exception from ASSERT_UCS_OK?
Yes, we need to pass may_fail because of the way ASSERT_UCS_OK handles failures: it does more than just throw an exception; it also collects the failures to mark the test as passed or failed at the end of the test.
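The point of this reply can be illustrated with a small mock. The sketch below is not the UCX macro: failure_log, mock_assert_ok, and run_test_body are hypothetical names, and it assumes only what the reply states, namely that the assertion records a failure in addition to throwing.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the framework's per-test failure list.
static std::vector<std::string> failure_log;

// Mock assertion: records the failure first, then throws.
// This mirrors the behavior described in the reply, not the real macro.
void mock_assert_ok(int status) {
    if (status != 0) {
        failure_log.push_back("status " + std::to_string(status));
        throw std::runtime_error("assertion failed");
    }
}

// Catching the exception lets the caller continue, but the failure
// recorded before the throw stays in the log.
bool run_test_body(int status) {
    failure_log.clear();
    try {
        mock_assert_ok(status);
    } catch (const std::runtime_error&) {
        // Swallowed here; a caller could retry at this point.
    }
    return failure_log.empty(); // true means the test would be reported as passed
}
```

Under that assumption, a caller that catches the exception and retries successfully would still end up with a failed test, which is why the allocation status has to be checked before the assertion runs.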
Maybe instead of may_fail we can pass the number of retries (default: 0) to uct_test::entity::mem_alloc; then other tests that want to allocate MEMIC with mem_buffer could enable retries as well.
Also, IMO we should retry only if the status is UCS_ERR_NO_MEMORY.
@yosefe fixed, please check
eeb37fd to 43c02a9
test/gtest/uct/uct_test.cc
Outdated
        if (status == UCS_OK) {
            break;
        } else {
            UCS_TEST_MESSAGE << "Retry " << i + 1 << "/" << num_retries
                             << ": Buffer allocation failed - %s"
                             << ucs_status_string(status);
            if (status == UCS_ERR_NO_MEMORY) {
                usleep(10000);
            } else {
                break;
            }
        }
Suggested change:
-        if (status == UCS_OK) {
-            break;
-        } else {
-            UCS_TEST_MESSAGE << "Retry " << i + 1 << "/" << num_retries
-                             << ": Buffer allocation failed - %s"
-                             << ucs_status_string(status);
-            if (status == UCS_ERR_NO_MEMORY) {
-                usleep(10000);
-            } else {
-                break;
-            }
-        }
+        if (status != UCS_ERR_NO_MEMORY) {
+            break;
+        }
+        UCS_TEST_MESSAGE << "Retry " << (i + 1) << "/" << num_retries
+                         << ": Allocation failed - " << ucs_status_string(status);
+        usleep(10000);
+    }
+    ASSERT_UCS_OK(status);
IMO if the status is UCS_OK it shouldn't retry
Right, in the suggested code UCS_OK != UCS_ERR_NO_MEMORY, so we exit the loop and don't retry.
done
5fd3f54 to b338e1c
eb81270 to c501d24
What?
Added a retry mechanism for allocating MEMIC memory in test_atomic_key_reg_rdma_mem_type.
Why?
To address test failures caused by memory allocation issues, ensuring tests proceed even when allocation initially fails.
How?
Introduced retries with controlled delays and error logging in mapped_buffer allocation. Updated mem_alloc and related functions to support optional failure handling.
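The retry behavior agreed on in the review (retry only on an out-of-memory status, stop on success or on any other error) can be sketched in isolation. This is a minimal mock, not the code merged in the PR: mock_status, its numeric values, alloc_with_retries, and the callback parameter are hypothetical, and the "one initial attempt plus up to num_retries retries" shape is an assumption.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Mock status codes; the values are arbitrary and only stand in for
// UCS_OK, UCS_ERR_NO_MEMORY, and some other error.
enum mock_status { MOCK_OK = 0, MOCK_ERR_OTHER = -1, MOCK_ERR_NO_MEMORY = -4 };

// Hypothetical helper: call alloc once, then retry up to num_retries
// more times, but only while the result is MOCK_ERR_NO_MEMORY.
mock_status alloc_with_retries(const std::function<mock_status()>& alloc,
                               unsigned num_retries,
                               std::chrono::microseconds delay =
                                       std::chrono::microseconds(10000)) {
    mock_status status = alloc();
    for (unsigned i = 0;
         (i < num_retries) && (status == MOCK_ERR_NO_MEMORY); ++i) {
        // Back off before the next attempt, like usleep(10000) in the diff.
        std::this_thread::sleep_for(delay);
        status = alloc();
    }
    return status; // the caller asserts on the final status
}
```

With num_retries left at 0 the helper makes exactly one attempt, which matches the reviewer's suggestion that retries stay opt-in for tests that need them.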