Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/GTEST: Workaround KSM ODP failure #10328

Merged
merged 1 commit into from
Jan 10, 2025

Conversation

Artemy-Mellanox
Copy link
Contributor

@Artemy-Mellanox Artemy-Mellanox commented Nov 26, 2024

What?

Temporary disable shifted atomic address in ODP PUT tests.

Why?

Shifted atomic address exposes an issue in the driver (RM 4170682).

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT to disable atomic offset with ODP? so KSM would still be exercised

Comment on lines 337 to 340
if (md->config.enable_atomic_offset) {
return md->flush_rkey >> 8;
}
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems it breaks remote flush functionality UCT_FLUSH_FLAG_REMOTE (e.g. see uct_dc_mlx5_flush_remote_rkey)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this PR it's not a problem since param is on by default, but in future if we make it off or auto will need to handle this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this PR introduces new env var, tweaking which may result in a failure.
Maybe just temprorarily disable the tests instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use the existing "INDIRECT_ATOMIC" env var?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still breaks remote flush functionality. Can you disabled "REMOTE_FLUSH" capability when this env var is set?

yosefe
yosefe previously approved these changes Dec 3, 2024
@yosefe
Copy link
Contributor

yosefe commented Dec 4, 2024

@Artemy-Mellanox pls squash

Copy link
Contributor

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still breaks remote flush functionality. Can you disabled "REMOTE_FLUSH" capability when this env var is set?

@@ -333,7 +333,10 @@ uct_ib_md_is_flush_rkey_valid(uint32_t flush_rkey) {

static UCS_F_ALWAYS_INLINE uint8_t uct_ib_md_get_atomic_mr_id(uct_ib_md_t *md)
{
return md->flush_rkey >> 8;
if (md->config.enable_indirect_atomic) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, why do we need this change? i would expect setting IB_INDIRECT_ATOMIC=n in the test would already disable KSM atomic, w/o additional changes in IB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently ext.enable_indirect_atomic is not used at all in the code (some kind of leftover).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, so maybe we should make this var affect initiator side (send atomic ops to regular MR with offset 0) and not responder side, to avoid breaking remote flush

static UCS_F_ALWAYS_INLINE uint16_t
uct_dc_mlx5_atomic_offset(uct_dc_mlx5_iface_t *iface, uct_dc_mlx5_ep_t *ep)
{
if (uct_ib_iface_md(&iface->super.super.super)->config.enable_indirect_atomic) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems wrong - we can't decide to use offset 0 without also using direct (and not atomic) part of the rkey

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if enable_indirect_atomic is off the atomic part of the rkey won't have an offset but may still be present to enforce strict order

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we detect it from information provided from the target side, rather than local configuration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if iface/ep address can have extra flag to indicate if atomic offset is 0, or it's as defined by atomic_mr_id

@Artemy-Mellanox Artemy-Mellanox force-pushed the topic/war-ksm-odp branch 2 times, most recently from dda4a34 to 861b6a0 Compare December 20, 2024 06:59
src/uct/ib/mlx5/dc/dc_mlx5_ep.h Outdated Show resolved Hide resolved
src/uct/ib/mlx5/rc/rc_mlx5_ep.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/rc/rc_mlx5_ep.c Show resolved Hide resolved
src/uct/ib/mlx5/rc/rc_mlx5_iface.c Outdated Show resolved Hide resolved
test/gtest/uct/ib/test_ib_md.cc Outdated Show resolved Hide resolved
static UCS_F_ALWAYS_INLINE uint16_t
uct_dc_mlx5_atomic_offset(uct_dc_mlx5_ep_t *ep)
{
if (ep->flags & UCT_DC_MLX5_EP_FLAG_ATOMIC_OFFSET) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add TODO (tmp to be removed) here? It's extra branch on fast path

@yosefe yosefe merged commit ea90c3b into openucx:master Jan 10, 2025
146 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants