Commit 78a80ab

UCT: Merge with latest master branch
2 parents b193dea + 224b217 commit 78a80ab

139 files changed: +2673 -879 lines changed


AUTHORS

+7
@@ -1,14 +1,18 @@
+Aboorva Devarajan <[email protected]>
 Akshay Venkatesh <[email protected]>
+Alastair McKinstry <[email protected]>
 Aleksey Senin <[email protected]>
 Alex Margolin <[email protected]>
 Alex Mikheev <[email protected]>
 Alexey Rivkin <[email protected]>
 Alina Sklarevich <[email protected]>
+Alma Mastbaum <[email protected]>
 Anatoly Vildemanov <[email protected]>
 Andrey Maslennikov <[email protected]>
 Artem Polyakov <[email protected]>
 Artem Ryabov <[email protected]>
 Artemy Kovalyov <[email protected]>
+Arun Chandran <[email protected]>
 Aurelien Bouteiller <[email protected]>

 Boris Karasev <[email protected]>
@@ -70,6 +74,7 @@ Pavan Balaji <[email protected]>
 Pavel Shamis (Pasha) <[email protected]>
 Peter Andreas Entschev <[email protected]>
 Peter Rudenko <[email protected]>
+Peter-Jan Gootzen <[email protected]>

 Raul Akhmetshin <[email protected]>
 Robert Dietrich <[email protected]>
@@ -96,9 +101,11 @@ Valentin Petrov <[email protected]>
 Wenbin Lu <[email protected]>

 Xu Yifeng <[email protected]>
+Yiltan Hassan Temucin <[email protected]>
 Yossi Itigin <[email protected]>
 Yuriy Shestakov <[email protected]>
 Zhu Yanjun <[email protected]>
+Zihao Zhao <[email protected]>

 In addition we would like to acknowledge the following members of UCX community
 for their participation in annual face-to-face meeting, design discussions, and

NEWS

+151
@@ -11,6 +11,157 @@
 ### Features:
 ### Bugfixes:

+## 1.18.0 (January 17, 2025)
+### Features:
+#### UCP
+* Enabled using CUDA staging buffers for pipeline protocols by default
+* Added endpoint reconfiguration support for non-reused p2p scenarios
+* Enabled non-cacheable memory domains, activated for gdr_copy
+* Added user_data parameter to ucp_ep_query
+* Added support for host memory pipeline through CUDA buffers for rendezvous protocol
+* Added global VA infrastructure and memory region in absence of error handling
+* Made protocol performance node names more informative
+* Enforced always running on the same thread in single thread mode
+* Multiple improvements in protocols selection infrastructure
+* Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
+* Allowed up-to 64 endpoint lanes for systems with many transports or devices
+* Added usage tracker to worker
+* Improved various logging messages
+#### RDMA CORE (IB, ROCE, etc.)
+* Added environment variable to manage DC initiator capacity
+* Added DC dcs_hybrid policy
+* Reduced MLX5/DV stack size consumption
+* Added ODP support for verbs and mlx5dv
+* Added support of CUDA managed memory on IB when ODP is available
+* Added support of Adaptive Routing on RoCE
+* Enabled use of implicit ODP with relaxed ordering
+* Improved GPU-Direct detection in IB transport
+* Increased DC initiator default count to 32 for performance optimization
+* Added ConnectX-8 device support with DDP
+* Added support for subnet filter list for RoCE interfaces
+* Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
+* Added IB MLX5 as a separate UCX module with separate RPM sub-package
+* Added initial support for GGA transport, for fast DPU memory access
+* Set IB DevX atomic mode based on device capabilities
+* Removed DC keepalive mechanism, since the keepalive is done on UCP layer
+* Optimized cross-gVMI memory registration using indirect memory keys cache
+* Improved various logging messages
+#### CUDA
+* Added multi-node NVlink support
+* Added CUDA Fabric memory support with detection and allocation
+* Improved gdr_copy latency estimations on AMD Milan systems
+* Added check for gdr_copy runtime/build version mismatch
+* Added handling missing IPC capability when unpacking keys
+* Added caching for CUDA IPC memory pool import operation
+* Added gdr_copy variables to optimize performance on Grace Hopper systems
+* Improved CUDA IPC concurrency for a larger count of reachable peers
+#### UCS
+* Added support for wildcards in configuration parameter names
+* Added ASAN protection to several internal data structures
+* Reduced stack usage in topology detection code
+* Improved bitmaps configuration parsing with wider bitfield
+* Added options to set topology distance between devices
+* Optimized VFS unix socket watch by using user private folder
+* Added general IP subnet matching infrastructure
+* Extend array data structure to support user-provided array copy routine
+* Improved time units description
+#### UCM
+* Extend CUDA memory hooks to include memory mapping APIs
+#### Tools
+* Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
+* Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
+* Improved ucx_perftest uni-directional test with added fence
+* Detailed ucx_perftest batch section of command-line documentation
+#### Documentation
+* Added a section regarding adaptive routing on RoCE
+#### Architecture
+* Added CPU Model for MI300A
+* Added Fujitsu ARM specific values to ucx.conf
+* Added AMD Turin support
+* Added an optimized non-temporal memory copy implementation for AMD CPU
+#### Build
+* Improved compiler error reporting with added flag
+* Improved coverity script to allow faster turnaround time
+* Improved Intel Compiler detection and support
+#### GO
+* Added multi-send flag and user memh support in request params
+#### Packaging
+* Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments
+### Bugfixes:
+#### UCP
+* Fixed stack overflow in exported rkey unpack
+* Removed extra remote-cpu overhead from protocol estimation for zcopy
+* Fixed performance estimation for rndv pipeline protocols
+* Fixed ATP sending by picking the correct lane
+* Fixed missing reg_id on memh creation
+* Fixed repeated invalidations by retaining existing access flags
+* Fixed abort reason propagation for rendezvous RTR mtype
+* Do not check transport availability if it is disabled by UCX_TLS environment variable
+* Fixed wrong flag being used for checking BCOPY capability
+* Fixed sending too many ATPs for small messages
+* Enforced 16 bits size for Active Messages identifiers
+* Fixed unnecessary status check for emulated AMO
+* Fixed more than one fragment sending in rendezvous pipeline
+* Fixed crash by using biggest max frag across all lanes
+* Fixed missing memory handle flags by copying from parent to child
+* Fixed worker interface activate count
+* Fixed flush requests by replacing ATP/flush lane map with lane indexes
+* Fixed lost uct_flags when merging memory regions
+#### UCT
+* Fixed memory domain UCT flags description
+#### RDMA CORE (IB, ROCE, etc.)
+* Fixed FETCH_ADD remote access error for ODP/KSM case
+* Fixed missing conditional compilation checks for DM
+* Fixed IB MD allocation naming typo
+* Fixed invalid GIDs filter in IB
+* Fixed flags usage in MLX5 zcopy_post
+* Do not limit ODP registration retries
+* Fixed JUCX failures by considering the number of supported completion vectors
+#### CUDA
+* Fixed async memory handling using CUDA memory type on Grace
+* Added rcache overhead in performance estimation
+* Fixed gdr_copy performance regression by providing maximum estimation between get and put
+* Fixed CUDA IPC reachability check
+* Fixed crash in MPI_Finalize when CUDA context is destroyed
+* Always require rcache by default for gdr_copy
+* Fixed crash in gdr_copy cleanup when registration cache is disabled
+* Fixed CUDA copy memory domain allocations
+* Fixed multiple tests for gdr_copy transport
+* Fixed race condition in CUDA IPC peer accessible cache
+#### UCS
+* Fixed a crash by using heap allocation to process expired timers in batch
+* Fixed allocation issue on memtrack dump
+* Fixed deletion of the monitored folder in VFS
+* Fixed unsafe resize for DC initiator array
+* Fixed function macro invocation to match C standard
+* Fixed calling async handler on already released resource
+* Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
+* Fixed undeclared value error in timer conversion routine
+* Fixed uninitialized value access in registration cache
+#### UCM
+* Fixed race condition in parsing proc maps
+* Fixed mremap failure while parsing /proc/self/maps
+#### ROCM
+* Fixed ROCM interface reachability test
+* Fixed memory domain fork test
+#### TCP
+* Always bind endpoint to interface
+#### Tools
+* Fixed buffer size potential overflow in ucx_perftest
+* Fixed missing address when packing memory keys on ucx_perftest
+* Fixed memory leak for endpoint report in ucx_info
+* Fixed build without openmp in ucx_perftest
+* Fixed UCT device override on server side on ucx_perftest
+#### Build
+* Fixed using correct ASAN version for running tests
+#### Configuration
+* Used POSIX bourne syntax to check equality
+* Fixed build failure by using proper flags in compiler.m4
+* Fixed perftest MAD support default guessing
+#### GO
+* Added serialized thread mode to avoid subtle races between threads
+* Fixed make distcheck
+
 ## 1.17.0 (June 13, 2024)
 ### Features:
 #### UCP

buildlib/azure-pipelines-perf.yml

+2
@@ -89,6 +89,8 @@ stages:
       - job: Perf
         displayName: Performance testing
         timeoutInMinutes: 180
+        workspace:
+          clean: outputs
         pool:
           name: MLNX
           demands:

buildlib/jucx/jucx-build.yml

+2 -2
@@ -53,8 +53,8 @@ jobs:
           set -eE
           {
             echo -e "<settings><servers><server>"
-            echo -e "<id>ossrh</id><username>\${env.SONATYPE_USERNAME}</username>"
-            echo -e "<password>\${env.SONATYPE_PASSWORD}</password>"
+            echo -e "<id>ossrh</id><username>$(SONATYPE_USERNAME)</username>"
+            echo -e "<password>$(SONATYPE_PASSWORD)</password>"
             echo -e "</server></servers></settings>"
           } > $(temp_cfg)
         displayName: Generate temporary config
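
The hunk above stops writing a literal `${env.SONATYPE_USERNAME}` placeholder (which Maven would resolve from the environment when it reads the settings file) and instead uses Azure Pipelines macro syntax, `$(SONATYPE_USERNAME)`, which the pipeline substitutes before the script runs. The sketch below shows the resulting file shape in plain POSIX shell. Outside Azure Pipelines `$(NAME)` would be command substitution, so the sketch takes the values as arguments instead. `write_maven_settings` and the example values are hypothetical and not part of the UCX build scripts.

```shell
#!/bin/sh
# Hypothetical helper, not part of the UCX build scripts.
# Prints a Maven settings fragment with the credentials written literally,
# mirroring the structure of the echo lines in the diff.
# Note: values are not XML-escaped in this sketch.
write_maven_settings() {
    printf '<settings><servers><server>\n'
    printf '<id>ossrh</id><username>%s</username>\n' "$1"
    printf '<password>%s</password>\n' "$2"
    printf '</server></servers></settings>\n'
}

# Example values only; a real pipeline would supply secret variables.
SONATYPE_USERNAME=${SONATYPE_USERNAME:-example-user}
SONATYPE_PASSWORD=${SONATYPE_PASSWORD:-example-pass}

write_maven_settings "$SONATYPE_USERNAME" "$SONATYPE_PASSWORD"
```

Passing the values as arguments keeps the function free of any dependence on pipeline-specific substitution, which makes it easy to exercise locally.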

buildlib/pr/main.yml

+27 -27
@@ -18,24 +18,15 @@ resources:
     - container: fedora
       image: rdmz-harbor.rdmz.labs.mlnx/ucx/fedora33:2
       options: $(DOCKER_OPT_ARGS)
-    - container: fedora34
-      image: rdmz-harbor.rdmz.labs.mlnx/ucx/fedora34:2
+    - container: fedora41
+      image: rdmz-harbor.rdmz.labs.mlnx/hpcx/x86_64/fedora41/builder:inbox
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
     - container: coverity_rh7
       image: rdmz-harbor.rdmz.labs.mlnx/ucx/coverity:mofed-5.1-2.3.8.0
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
     - container: rhel76
       image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel7.6/builder:mofed-5.0-1.0.0.0
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
-    - container: rhel76_mofed47
-      image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel7.6/builder:mofed-4.7-1.0.0.1
-      options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
-    - container: rhel74
-      image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel7.4/builder:mofed-5.0-1.0.0.0
-      options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
-    - container: rhel72
-      image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel7.2/builder:mofed-5.0-1.0.0.0
-      options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
     - container: rhel82
       image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel8.2/builder:mofed-5.0-1.0.0.0
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
@@ -63,11 +54,11 @@ resources:
     - container: debian109
       image: rdmz-harbor.rdmz.labs.mlnx/hpcx/x86_64/debian10.9/builder:mofed-5.8-3.0.7.0
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
-    - container: sles15sp2
-      image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/sles15sp2/builder:mofed-5.0-1.0.0.0
+    - container: debian125
+      image: rdmz-harbor.rdmz.labs.mlnx/hpcx/x86_64/debian12.5/builder:doca-2.9.0
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
-    - container: sles12sp5
-      image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/sles12sp5/builder:mofed-5.0-1.0.0.0
+    - container: sles15sp6
+      image: rdmz-harbor.rdmz.labs.mlnx/hpcx/x86_64/sles15sp6/builder:doca-2.9.0
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
     - container: centos7_cuda_11_0
       image: nvidia/cuda:11.0.3-devel-centos7
@@ -180,6 +171,15 @@ resources:
     - container: ubuntu2204_rocm_6_0_0
       image: rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/ubuntu2204:rocm-6.0.0
       options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
+    - container: kylin10sp3
+      image: rdmz-harbor.rdmz.labs.mlnx/hpcx/x86_64/kylin10sp3/builder:doca-2.9.0
+      options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
+    - container: euleros2sp12
+      image: rdmz-harbor.rdmz.labs.mlnx/hpcx/x86_64/euleros2.0sp12/builder:doca-2.9.0
+      options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)
+    - container: centos10stream
+      image: rdmz-harbor.rdmz.labs.mlnx/hpcx/x86_64/centos10stream/builder:inbox
+      options: $(DOCKER_OPT_ARGS) $(DOCKER_OPT_VOLUMES)

 stages:
   - stage: Codestyle
@@ -201,16 +201,9 @@ stages:
             - ucx_docker -equals yes
         strategy:
           matrix:
-            rhel72:
-              CONTAINER: rhel72
-            rhel74:
-              CONTAINER: rhel74
             rhel76:
               CONTAINER: rhel76
               long_test: yes
-            rhel76_mofed47:
-              CONTAINER: rhel76_mofed47
-              long_test: yes
             ubuntu2004:
               CONTAINER: ubuntu2004
               long_test: yes
@@ -226,21 +219,28 @@ stages:
               CONTAINER: debian113
             debian109:
               CONTAINER: debian109
-            sles15sp2:
-              CONTAINER: sles15sp2
+            debian125:
+              CONTAINER: debian125
+            sles15sp6:
+              CONTAINER: sles15sp6
             rhel82:
               CONTAINER: rhel82
             rhel90:
               CONTAINER: rhel90
-            fedora34:
-              CONTAINER: fedora34
-              long_test: yes
+            fedora41:
+              CONTAINER: fedora41
             centos7:
               CONTAINER: centos7_ib
+            centos10stream:
+              CONTAINER: centos10stream
             ubuntu2004_rocm:
               CONTAINER: ubuntu2004_rocm_5_4_0
             ubuntu2204_rocm:
               CONTAINER: ubuntu2204_rocm_6_0_0
+            kylin10sp3:
+              CONTAINER: kylin10sp3
+            euleros2sp12:
+              CONTAINER: euleros2sp12
         container: $[ variables['CONTAINER'] ]
         timeoutInMinutes: 340

buildlib/tools/build_static.sh

-1
@@ -64,7 +64,6 @@ az_init_modules
 prepare_build

 # Don't cross-connect RoCE devices
-export UCX_IB_ROCE_LOCAL_SUBNET=y
 export UCX_IB_ROCE_SUBNET_PREFIX_LEN=inf
 build_static

buildlib/tools/builds.sh

+5
@@ -320,6 +320,10 @@ build_gcc_debug_opt() {
     build_gcc CFLAGS=-Og CXXFLAGS=-Og
 }

+build_gcc_with_dndebug() {
+    build_gcc CFLAGS=-DNDEBUG CXXFLAGS=-DNDEBUG
+}
+
 #
 # Build with armclang compiler
 #
@@ -444,6 +448,7 @@ then
         'build_no_devx' \
         'build_no_openmp' \
         'build_gcc_debug_opt' \
+        'build_gcc_with_dndebug' \
         'build_clang' \
         'build_armclang')
 fi
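
The new `build_gcc_with_dndebug` target forwards `-DNDEBUG` to the compiler, which in standard C disables `assert()` and so exercises the non-assert code paths. The sketch below illustrates the wrapper pattern only. The real `build_gcc` is not shown in this diff, so it is replaced by a stand-in that just reports its arguments, and the dispatch loop (`run_builds`, `tests`) is an assumption about how the quoted names in the second hunk are consumed.

```shell
#!/bin/sh
# Stand-in for the real build_gcc, which is not shown in this diff.
# It only reports the variable assignments it was given.
build_gcc() {
    echo "configure with: $*"
}

# Wrappers as they appear in the diff: each one forwards fixed flags.
build_gcc_debug_opt() {
    build_gcc CFLAGS=-Og CXXFLAGS=-Og
}

build_gcc_with_dndebug() {
    build_gcc CFLAGS=-DNDEBUG CXXFLAGS=-DNDEBUG
}

# Assumed dispatch: every list entry names a function to call in order.
tests="build_gcc_debug_opt build_gcc_with_dndebug"

run_builds() {
    for t in $tests; do
        "$t"
    done
}

run_builds
```

Under this pattern, adding a build variant takes one small wrapper function plus one new entry in the list.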

buildlib/tools/test_wire_compat.sh

-1
@@ -11,7 +11,6 @@ common_opt=$3
 port=$((10000 + 1000 * ${AZP_AGENT_ID} + 100 * ${WIRE_COMPAT_STAGE_ID}))

 export UCX_CM_REUSEADDR=y UCX_LOG_LEVEL=info UCX_WARN_UNUSED_ENV_VARS=n
-export UCX_IB_ROCE_LOCAL_SUBNET=y

 exe_cmd="stdbuf -oL ${exe_name} -p ${port}"
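
The `port=` context line in this hunk derives a TCP port from the agent ID and the stage ID, presumably so that concurrent jobs do not collide. A minimal sketch of that arithmetic follows. `wire_compat_port` is a hypothetical wrapper, and the IDs in the example are made up. In CI they would come from `AZP_AGENT_ID` and `WIRE_COMPAT_STAGE_ID`.

```shell
#!/bin/sh
# Hypothetical wrapper around the formula used in test_wire_compat.sh:
# each agent gets a block of 1000 ports, each stage a block of 100 within it.
wire_compat_port() {
    agent_id=$1
    stage_id=$2
    echo $((10000 + 1000 * $agent_id + 100 * $stage_id))
}

# Example IDs only.
wire_compat_port 2 3   # prints 12300
```

The blocks stay disjoint only while the stage ID is below 10. Larger stage IDs would spill into the next agent's range.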
