Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/GDR_COPY: Optimize memory registration in fast-path #10342

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Artemy-Mellanox
Copy link
Contributor

No description provided.

@Artemy-Mellanox Artemy-Mellanox force-pushed the topic/gdrcopy-perf branch 2 times, most recently from fd4c701 to b504c8e Compare December 2, 2024 05:39
sched_yield();
}

x = __sync_fetch_and_add(&lock->l, UCS_RWLOCK_READ);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my curiosity: why we are using deprecated atomic functions, while there are modern family of __atomic_* functions __atomic_fetch_add?
Those __sync functions always make a full barrier, and therefore come with significant overhead when only acquire or release semantics is sufficient

Copy link
Contributor

@iyastreb iyastreb Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From GCC documentation: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
They [family of __sync functions] should not be used for new code which should use the ‘__atomic’ builtins instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider using/extending atomic.h?

static inline void ucs_rwlock_init(ucs_rwlock_t *lock) {
lock->l = 0;
}

Copy link
Contributor

@iyastreb iyastreb Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it must come with a set of unit tests.
And maybe multithread (like 32 threads) litmus test that would run for a while with some simple logic (like incrementing few counters by writer threads, acquring these counters by reader threads etc)

int x;

while (1) {
x = lock->l;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe reuse ucs_rwlock_write_trylock here?

return 0;
}

return -EBUSY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's more logical to return 1 when lock is acquired, 0 otherwise?

/* Writer is waiting or has lock */
#define UCS_RWLOCK_READ 0x4 /* Reader increment */


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also add #define UCS_RWLOCK_STATIC_INITIALIZER {0}

@@ -8,6 +8,7 @@
#define UCS_RCACHE_INL_

#include "rcache_int.h"
#include <ucs/profile/profile.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed?

*
* Readers increment the counter by UCS_RWLOCK_READ (4)
* Writers set the UCS_RWLOCK_WRITE bit when lock is held
* and set the UCS_RWLOCK_WAIT bit while waiting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment

#define UCS_RWLOCK_WAIT 0x1 /* Writer is waiting */
#define UCS_RWLOCK_WRITE 0x2 /* Writer has the lock */
#define UCS_RWLOCK_MASK (UCS_RWLOCK_WAIT | UCS_RWLOCK_WRITE)
/* Writer is waiting or has lock */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, no need for this comment

sched_yield();
}

x = __sync_fetch_and_add(&lock->l, UCS_RWLOCK_READ);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider using/extending atomic.h?

@@ -1187,6 +1187,44 @@ ucs_status_t ucp_mem_unmap(ucp_context_h context, ucp_mem_h memh)
return UCS_OK;
}

static ucs_status_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we always use one md for mtype, maybe we can avoid ucp_memh as well, just use uct objects?

@@ -75,6 +75,7 @@ nobase_dist_libucs_la_HEADERS = \
type/param.h \
type/init_once.h \
type/spinlock.h \
type/rwlock.h \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move "introducing fast rwlock to ucs" to separate PR, including gtest for functionality and performance measurement (print lock overhead using UCS_TEST_MESSAGE + write in the PR desc how fast is new lock compared to glibc)

@@ -80,6 +81,17 @@ ucs_rcache_lookup_unsafe(ucs_rcache_t *rcache, void *address, size_t length,
return region;
}

static UCS_F_ALWAYS_INLINE ucs_rcache_region_t *
ucs_rcache_lookup(ucs_rcache_t *rcache, void *address, size_t length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we integrate this optimizations into ucs_rcache_get?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants