11 | 11 | ### Features:
12 | 12 | ### Bugfixes:
13 | 13 |
| 14 | +## 1.18.0 (January 17, 2025)
| 15 | +### Features:
| 16 | +#### UCP
| 17 | + * Enabled using CUDA staging buffers for pipeline protocols by default |
| 18 | + * Added endpoint reconfiguration support for non-reused p2p scenarios |
| 19 | + * Enabled non-cacheable memory domains, activated for gdr_copy |
| 20 | + * Added user_data parameter to ucp_ep_query |
| 21 | + * Added support for host memory pipeline through CUDA buffers for rendezvous protocol |
| 22 | + * Added global VA infrastructure and memory region in absence of error handling |
| 23 | + * Made protocol performance node names more informative |
| 24 | + * Enforced always running on the same thread in single thread mode |
| 25 | + * Made multiple improvements to the protocol selection infrastructure
| 26 | + * Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping |
| 27 | + * Allowed up to 64 endpoint lanes for systems with many transports or devices
| 28 | + * Added usage tracker to worker |
| 29 | + * Improved various logging messages |
| 30 | +#### RDMA CORE (IB, ROCE, etc.) |
| 31 | + * Added environment variable to manage DC initiator capacity |
| 32 | + * Added DC dcs_hybrid policy |
| 33 | + * Reduced MLX5/DV stack size consumption |
| 34 | + * Added ODP support for verbs and mlx5dv |
| 35 | + * Added support of CUDA managed memory on IB when ODP is available |
| 36 | + * Added support of Adaptive Routing on RoCE |
| 37 | + * Enabled use of implicit ODP with relaxed ordering |
| 38 | + * Improved GPU-Direct detection in IB transport |
| 39 | + * Increased DC initiator default count to 32 for performance optimization |
| 40 | + * Added ConnectX-8 device support with DDP |
| 41 | + * Added support for subnet filter list for RoCE interfaces |
| 42 | + * Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports |
| 43 | + * Added IB MLX5 as a separate UCX module with a separate RPM sub-package
| 44 | + * Added initial support for GGA transport, for fast DPU memory access |
| 45 | + * Set IB DevX atomic mode based on device capabilities |
| 46 | + * Removed DC keepalive mechanism, since the keepalive is done on UCP layer |
| 47 | + * Optimized cross-gVMI memory registration using indirect memory keys cache |
| 48 | + * Improved various logging messages |
| 49 | +#### CUDA |
| 50 | + * Added multi-node NVLink support
| 51 | + * Added CUDA Fabric memory support with detection and allocation |
| 52 | + * Improved gdr_copy latency estimations on AMD Milan systems |
| 53 | + * Added check for gdr_copy runtime/build version mismatch |
| 54 | + * Added handling of missing IPC capability when unpacking keys
| 55 | + * Added caching for CUDA IPC memory pool import operation |
| 56 | + * Added gdr_copy variables to optimize performance on Grace Hopper systems |
| 57 | + * Improved CUDA IPC concurrency for a larger count of reachable peers |
| 58 | +#### UCS |
| 59 | + * Added support for wildcards in configuration parameter names |
| 60 | + * Added ASAN protection to several internal data structures |
| 61 | + * Reduced stack usage in topology detection code |
| 62 | + * Improved bitmaps configuration parsing with wider bitfield |
| 63 | + * Added options to set topology distance between devices |
| 64 | + * Optimized VFS unix socket watch by using user private folder |
| 65 | + * Added general IP subnet matching infrastructure |
| 66 | + * Extended array data structure to support user-provided array copy routine
| 67 | + * Improved time units description |
| 68 | +#### UCM |
| 69 | + * Extended CUDA memory hooks to include memory mapping APIs
| 70 | +#### Tools |
| 71 | + * Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest |
| 72 | + * Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest |
| 73 | + * Improved ucx_perftest uni-directional test with added fence |
| 74 | + * Expanded the ucx_perftest batch section of the command-line documentation
| 75 | +#### Documentation |
| 76 | + * Added a section regarding adaptive routing on RoCE |
| 77 | +#### Architecture |
| 78 | + * Added CPU model for MI300A
| 79 | + * Added Fujitsu ARM specific values to ucx.conf |
| 80 | + * Added AMD Turin support |
| 81 | + * Added an optimized non-temporal memory copy implementation for AMD CPU |
| 82 | +#### Build |
| 83 | + * Improved compiler error reporting with added flag |
| 84 | + * Improved coverity script to allow faster turnaround time |
| 85 | + * Improved Intel Compiler detection and support |
| 86 | +#### GO |
| 87 | + * Added multi-send flag and user memh support in request params |
| 88 | +#### Packaging |
| 89 | + * Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments |
| 90 | +### Bugfixes: |
| 91 | +#### UCP |
| 92 | + * Fixed stack overflow in exported rkey unpack |
| 93 | + * Removed extra remote-cpu overhead from protocol estimation for zcopy |
| 94 | + * Fixed performance estimation for rndv pipeline protocols |
| 95 | + * Fixed ATP sending by picking the correct lane |
| 96 | + * Fixed missing reg_id on memh creation |
| 97 | + * Fixed repeated invalidations by retaining existing access flags |
| 98 | + * Fixed abort reason propagation for rendezvous RTR mtype |
| 99 | + * Do not check transport availability if it is disabled by UCX_TLS environment variable |
| 100 | + * Fixed wrong flag being used for checking BCOPY capability |
| 101 | + * Fixed sending too many ATPs for small messages |
| 102 | + * Enforced 16-bit size for Active Messages identifiers
| 103 | + * Fixed unnecessary status check for emulated AMO |
| 104 | + * Fixed sending of more than one fragment in rendezvous pipeline
| 105 | + * Fixed crash by using the largest max frag across all lanes
| 106 | + * Fixed missing memory handle flags by copying from parent to child |
| 107 | + * Fixed worker interface activate count |
| 108 | + * Fixed flush requests by replacing ATP/flush lane map with lane indexes |
| 109 | + * Fixed lost uct_flags when merging memory regions |
| 110 | +#### UCT |
| 111 | + * Fixed memory domain UCT flags description |
| 112 | +#### RDMA CORE (IB, ROCE, etc.) |
| 113 | + * Fixed FETCH_ADD remote access error for ODP/KSM case |
| 114 | + * Fixed missing conditional compilation checks for DM |
| 115 | + * Fixed IB MD allocation naming typo |
| 116 | + * Fixed invalid GIDs filter in IB |
| 117 | + * Fixed flags usage in MLX5 zcopy_post |
| 118 | + * Do not limit ODP registration retries |
| 119 | + * Fixed JUCX failures by considering the number of supported completion vectors |
| 120 | +#### CUDA |
| 121 | + * Fixed async memory handling using CUDA memory type on Grace |
| 122 | + * Added rcache overhead in performance estimation |
| 123 | + * Fixed gdr_copy performance regression by providing maximum estimation between get and put |
| 124 | + * Fixed CUDA IPC reachability check |
| 125 | + * Fixed crash in MPI_Finalize when CUDA context is destroyed |
| 126 | + * Always require rcache by default for gdr_copy |
| 127 | + * Fixed crash in gdr_copy cleanup when registration cache is disabled |
| 128 | + * Fixed CUDA copy memory domain allocations |
| 129 | + * Fixed multiple tests for gdr_copy transport |
| 130 | + * Fixed race condition in CUDA IPC peer accessible cache |
| 131 | +#### UCS |
| 132 | + * Fixed a crash by using heap allocation to process expired timers in batch |
| 133 | + * Fixed allocation issue on memtrack dump |
| 134 | + * Fixed deletion of the monitored folder in VFS |
| 135 | + * Fixed unsafe resize for DC initiator array |
| 136 | + * Fixed function macro invocation to match C standard |
| 137 | + * Fixed calling async handler on already released resource |
| 138 | + * Fixed performance by setting higher bandwidth for different NUMA nodes on Grace |
| 139 | + * Fixed undeclared value error in timer conversion routine |
| 140 | + * Fixed uninitialized value access in registration cache |
| 141 | +#### UCM |
| 142 | + * Fixed race condition in parsing proc maps |
| 143 | + * Fixed mremap failure while parsing /proc/self/maps |
| 144 | +#### ROCM |
| 145 | + * Fixed ROCM interface reachability test |
| 146 | + * Fixed memory domain fork test |
| 147 | +#### TCP |
| 148 | + * Always bind endpoint to interface |
| 149 | +#### Tools |
| 150 | + * Fixed potential buffer size overflow in ucx_perftest
| 151 | + * Fixed missing address when packing memory keys in ucx_perftest
| 152 | + * Fixed memory leak for endpoint report in ucx_info |
| 153 | + * Fixed build without openmp in ucx_perftest |
| 154 | + * Fixed UCT device override on server side in ucx_perftest
| 155 | +#### Build |
| 156 | + * Fixed using correct ASAN version for running tests |
| 157 | +#### Configuration |
| 158 | + * Used POSIX Bourne shell syntax to check equality
| 159 | + * Fixed build failure by using proper flags in compiler.m4 |
| 160 | + * Fixed perftest MAD support default guessing |
| 161 | +#### GO |
| 162 | + * Added serialized thread mode to avoid subtle races between threads |
| 163 | + * Fixed make distcheck |
| 164 | + |
14 | 165 | ## 1.17.0 (June 13, 2024)
15 | 166 | ### Features:
16 | 167 | #### UCP