MPI Support (new attempt) #99
Conversation
Basic multi-node test (of
cc @sloede (because you were interested in this at some point)
I plan to merge this tomorrow. @jagot, are you happy with this PR, or is something missing?
Sorry for the late reply; my vacation started this week, and I obviously decided to tear down my kitchen. I will have a look tomorrow!
I think the API looks nice, but it does not work as I would naïvely expect/hope; I ran the example from the documentation, adding a call to
i.e., there are two MPI ranks, each running on a separate node. The problem is that all threads are pinned to the first socket only (on these machines, socket and NUMA domain are equivalent), leaving the other socket un-utilized, and half of the threads are pinned to hyperthreads. How do I achieve pinning across all sockets? Should there be a
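For reference, a minimal sketch of the kind of script under discussion (the actual documentation example is not reproduced in this thread, and the inspection call at the end is my own addition):

```julia
using MPI
using ThreadPinning

MPI.Init()

# Pin the Julia threads of this MPI rank relative to the NUMA domains
# (the API under discussion in this PR).
mpi_pinthreads(:numa)

# Assumed inspection step (not part of the quoted documentation example):
# report the CPU IDs that this rank's threads ended up on.
println("rank ", MPI.Comm_rank(MPI.COMM_WORLD), ": ", getcpuids())
```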
You're running on two nodes, each hosting one MPI rank with 48 Julia threads. Why are threads pinned to hyperthreads? How can I also occupy the second NUMA domain on each node? What if I don't want to use 2 MPI ranks per node? (We could imagine a more complicated case, though. Say we have 2 MPI ranks per node but a system with 4 NUMA domains per socket. We might want to distribute the MPI ranks among sockets but, within each socket, distribute the Julia threads of each MPI rank among NUMA domains. This is currently not available out of the box, but one could imagine a
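To make the distinction concrete, here is a hedged sketch of the pinning variants discussed above; `pinthreads` and `mpi_pinthreads` are the entry points this PR is about, while the `socket` query helper and the exact behaviour of `mpi_getlocalrank` are assumptions on my part:

```julia
using MPI
using ThreadPinning

MPI.Init()

# Variant A: one MPI rank per node. Spread this rank's threads round-robin
# over all NUMA domains of the node, so both sockets get used.
pinthreads(:numa)

# Variant B: several MPI ranks per node. Distribute the ranks among the NUMA
# domains and pin each rank's threads within "its" domain.
mpi_pinthreads(:numa)

# Variant C (manual composition): derive the CPU IDs for this rank from its
# node-local rank, e.g. one socket per local rank; more elaborate schemes
# (NUMA domains within a socket) could be built the same way.
localrank = mpi_getlocalrank()              # node-local rank (0-based), added in this PR
mycpuids  = socket(localrank + 1)           # CPU IDs of "my" socket (assumed query helper)
pinthreads(mycpuids[1:Threads.nthreads()])  # pin this rank's threads into that socket
```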
Thanks for the detailed writeup! I am relearning MPI, having not used it for a long while. I still think it is a valid use-case to have only one MPI rank per node, if nothing else to be able to compare. AFAIU, architectures vary, and for some it is/should be beneficial to utilize all cores within the same SMP program.
Something like ThreadPinning.jl/ext/MPIExt/mpi_querying.jl (lines 55 to 65 in 345cd1b) could be provided by the package, if we so wish, or it could be up to the user to do something like

```julia
using ThreadPinning
using MPI
using Sockets: gethostname   # gethostname lives in the Sockets stdlib

# Returns true if this MPI rank is the only rank running on its node.
function mpi_alone_on_this_node(; comm = MPI.COMM_WORLD, dest = 0)
    rank = MPI.Comm_rank(comm)
    hostname = gethostname()
    all_hostnames = MPI.gather(hostname, comm; root = dest)
    if rank == dest
        # On the root rank: is each hostname unique among all ranks?
        hostname_unique = Dict(h => count(==(h), all_hostnames) == 1
                               for h in unique(all_hostnames))
        # Tell every other rank whether it is alone on its node.
        for (i, h) in enumerate(all_hostnames)
            i == dest + 1 && continue
            MPI.send(hostname_unique[h], comm; dest = i - 1)
        end
        hostname_unique[hostname]
    else
        MPI.recv(comm; source = dest)
    end
end

if mpi_alone_on_this_node()
    pinthreads(:numa)       # only rank on this node: use all NUMA domains
else
    mpi_pinthreads(:numa)   # sharing the node: split the NUMA domains among ranks
end
```

I would prefer the former, I think.
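Whichever branch is taken, the result can be checked on each node; a small sketch using what I believe are ThreadPinning's existing inspection utilities:

```julia
using ThreadPinning

# CPU IDs that the Julia threads of this process are currently running on.
@show getcpuids()

# Overview of the system's CPU threads, highlighting those running Julia threads.
threadinfo()
```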
I think I don't. Not because 1 MPI rank per node is unreasonable (it isn't) but because it would make the semantics of mpi_pinthreads more complicated. Currently, the semantics of mpi_pinthreads do not depend on how the ranks are distributed across nodes.
Fair point. I will play around, having the above function in my user code for the time being. When/if something solidifies, and it is useful to anyone but me, we can revisit this issue.
TODO:
- mpi_getcpuids
- mpi_gethostnames
- mpi_getlocalrank
- mpi_pinthreads
Changelog:
- pinthreads_mpi renamed to mpi_pinthreads
- mpi_getcpuids, mpi_gethostnames, mpi_getlocalrank (based on explicit communication/exchange between MPI ranks) (cc @jagot)
Closes #61
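For completeness, a hedged sketch of how the functions listed above might be combined; the names are taken from this changelog, but the keyword arguments and return values are assumptions:

```julia
using MPI
using ThreadPinning

MPI.Init()

# Pin each rank's Julia threads (new API from this PR).
mpi_pinthreads(:numa)

# Query information across ranks (gathered on the root rank, as I understand it).
cpuids = mpi_getcpuids()       # CPU IDs of the Julia threads of every rank
hosts  = mpi_gethostnames()    # hostname of every rank
lrank  = mpi_getlocalrank()    # this rank's node-local rank

if MPI.Comm_rank(MPI.COMM_WORLD) == 0
    @show hosts
    @show cpuids
end
```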