Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEST/GTEST/UCT: Retry when it cannot allocate MEMIC memory #10393

Merged

Conversation

michal-shalev
Copy link
Contributor

What?

Added a retry mechanism for allocating MEMIC memory in test_atomic_key_reg_rdma_mem_type.

Why?

To address test failures caused by memory allocation issues, ensuring tests proceed even when allocation initially fails.

How?

Introduced retries with controlled delays and error logging in mapped_buffer allocation. Updated mem_alloc and related functions to support optional failure handling.

Before

image

After

image

test/gtest/uct/test_atomic_key_reg_rdma_mem_type.cc Outdated Show resolved Hide resolved
test/gtest/uct/test_atomic_key_reg_rdma_mem_type.cc Outdated Show resolved Hide resolved
test/gtest/uct/test_atomic_key_reg_rdma_mem_type.cc Outdated Show resolved Hide resolved
test/gtest/uct/test_atomic_key_reg_rdma_mem_type.cc Outdated Show resolved Hide resolved
test/gtest/uct/test_atomic_key_reg_rdma_mem_type.cc Outdated Show resolved Hide resolved
test/gtest/uct/uct_test.cc Outdated Show resolved Hide resolved
@@ -961,8 +961,14 @@ void uct_test::entity::mem_alloc(size_t length, unsigned mem_flags,
status = uct_mem_alloc(length, alloc_methods,
ucs_static_array_size(alloc_methods), &params,
mem);
}

if (may_fail && (status != UCS_OK)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to pass may_fail param, can we just catch exception from ASSERT_UCS_OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to pass may_fail because of the way ASSERT_UCS_OK handles failures, it does more than just throwing exception, it collects the failures to mark the test as successful or not at the end of the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of may_fail we can pass number of retries (default: 0) to uct_test::entity::mem_alloc, then other tests that may want to allocate MEMIC with mem_buffer could enable retries as well.
Also, IMO we should retry only if the status is UCS_ERR_NO_MEMORY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosefe fixed, please check

@michal-shalev michal-shalev force-pushed the add-retry-after-memic-alloc-fail branch 2 times, most recently from eeb37fd to 43c02a9 Compare December 30, 2024 17:02
@michal-shalev michal-shalev requested a review from iyastreb January 5, 2025 13:16
test/gtest/uct/test_atomic_key_reg_rdma_mem_type.cc Outdated Show resolved Hide resolved
test/gtest/uct/uct_test.cc Outdated Show resolved Hide resolved
Comment on lines 967 to 969
if (status == UCS_OK) {
break;
} else {
UCS_TEST_MESSAGE << "Retry " << i + 1 << "/" << num_retries
<< ": Buffer allocation failed - %s"
<< ucs_status_string(status);
if (status == UCS_ERR_NO_MEMORY) {
usleep(10000);
} else {
break;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (status == UCS_OK) {
break;
} else {
UCS_TEST_MESSAGE << "Retry " << i + 1 << "/" << num_retries
<< ": Buffer allocation failed - %s"
<< ucs_status_string(status);
if (status == UCS_ERR_NO_MEMORY) {
usleep(10000);
} else {
break;
}
}
if (status != UCS_ERR_NO_MEMORY) {
break;
}
UCS_TEST_MESSAGE << "Retry " << (i + 1) << "/" << num_retries
<< ": Allocation failed - " << ucs_status_string(status);
usleep(10000);
}
ASSERT_UCS_OK(status);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO if the status is UCS_OK it shouldn't retry

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, in the suggested code UCS_OK != UCS_ERR_NO_MEMORY so we exit the loop and don't retry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@michal-shalev michal-shalev requested a review from yosefe January 5, 2025 14:53
yosefe
yosefe previously approved these changes Jan 5, 2025
iyastreb
iyastreb previously approved these changes Jan 6, 2025
@michal-shalev michal-shalev force-pushed the add-retry-after-memic-alloc-fail branch from eb81270 to c501d24 Compare January 7, 2025 11:12
@michal-shalev michal-shalev merged commit 9cee13a into openucx:master Jan 9, 2025
146 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants