-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP/GTEST: Workaround KSM ODP failure #10328
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT to disable atomic offset with ODP? so KSM would still be exercised
47529cf
to
6c564cc
Compare
src/uct/ib/base/ib_md.h
Outdated
if (md->config.enable_atomic_offset) { | ||
return md->flush_rkey >> 8; | ||
} | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems it breaks remote flush functionality UCT_FLUSH_FLAG_REMOTE
(e.g. see uct_dc_mlx5_flush_remote_rkey
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this PR it's not a problem since param is on
by default, but in future if we make it off
or auto
will need to handle this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this PR introduces new env var, tweaking which may result in a failure.
Maybe just temprorarily disable the tests instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use the existing "INDIRECT_ATOMIC" env var?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it still breaks remote flush functionality. Can you disabled "REMOTE_FLUSH" capability when this env var is set?
@Artemy-Mellanox pls squash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it still breaks remote flush functionality. Can you disabled "REMOTE_FLUSH" capability when this env var is set?
src/uct/ib/base/ib_md.h
Outdated
@@ -333,7 +333,10 @@ uct_ib_md_is_flush_rkey_valid(uint32_t flush_rkey) { | |||
|
|||
static UCS_F_ALWAYS_INLINE uint8_t uct_ib_md_get_atomic_mr_id(uct_ib_md_t *md) | |||
{ | |||
return md->flush_rkey >> 8; | |||
if (md->config.enable_indirect_atomic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why do we need this change? i would expect setting IB_INDIRECT_ATOMIC=n in the test would already disable KSM atomic, w/o additional changes in IB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently ext.enable_indirect_atomic
is not used at all in the code (some kind of leftover).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see, so maybe we should make this var affect initiator side (send atomic ops to regular MR with offset 0) and not responder side, to avoid breaking remote flush
src/uct/ib/mlx5/dc/dc_mlx5_ep.c
Outdated
static UCS_F_ALWAYS_INLINE uint16_t | ||
uct_dc_mlx5_atomic_offset(uct_dc_mlx5_iface_t *iface, uct_dc_mlx5_ep_t *ep) | ||
{ | ||
if (uct_ib_iface_md(&iface->super.super.super)->config.enable_indirect_atomic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems wrong - we can't decide to use offset 0 without also using direct (and not atomic) part of the rkey
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if enable_indirect_atomic is off the atomic part of the rkey won't have an offset but may still be present to enforce strict order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we detect it from information provided from the target side, rather than local configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if iface/ep address can have extra flag to indicate if atomic offset is 0, or it's as defined by atomic_mr_id
dda4a34
to
861b6a0
Compare
61db705
to
76e00c2
Compare
static UCS_F_ALWAYS_INLINE uint16_t | ||
uct_dc_mlx5_atomic_offset(uct_dc_mlx5_ep_t *ep) | ||
{ | ||
if (ep->flags & UCT_DC_MLX5_EP_FLAG_ATOMIC_OFFSET) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also add TODO (tmp to be removed) here? It's extra branch on fast path
What?
Temporary disable shifted atomic address in ODP PUT tests.
Why?
Shifted atomic address exposes an issue in the driver (RM 4170682).