MPI Support (new attempt) #99
Conversation
Basic multi-node test (of
cc @sloede (because you were interested in this at some point)
I plan to merge this tomorrow. @jagot, are you happy with this PR, or is something missing?
Sorry for the late reply; my vacation started this week, and I obviously decided to tear down my kitchen. I will have a look tomorrow!
I think the API looks nice, but it does not work as I would naïvely expect/hope; I ran the example from the documentation, adding a call to
i.e., there are two MPI ranks, each running on a separate node. The problem is that all threads are pinned to the first socket only (on these machines, socket and NUMA domain are equivalent), leaving the other socket un-utilized, and half of the threads are pinned to hyperthreads. How do I achieve pinning across all sockets? Should there be a
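For reference, a minimal sketch of the kind of script under discussion (the actual documentation example is not reproduced in this thread, and the inspection call at the end is my own addition):

```julia
using MPI
using ThreadPinning

MPI.Init()

# Pin the Julia threads of this MPI rank relative to the NUMA domains
# (the API under discussion in this PR).
mpi_pinthreads(:numa)

# Assumed inspection step (not part of the quoted documentation example):
# report the CPU IDs that this rank's threads ended up on.
println("rank ", MPI.Comm_rank(MPI.COMM_WORLD), ": ", getcpuids())
```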
You're running on two nodes, each hosting one MPI rank with 48 Julia threads. Why are threads pinned to hyperthreads? How can I also occupy the second NUMA domain on each node? What if I don't want to use 2 MPI ranks per node? (We could imagine a more complicated case, though. Say we have 2 MPI ranks per node but a system with 4 NUMA domains per socket. We might want to distribute the MPI ranks among sockets but, within each socket, distribute the Julia threads of each MPI rank among NUMA domains. This is currently not available out of the box, but one could imagine a
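To make the distinction concrete, here is a hedged sketch of the pinning variants discussed above; `pinthreads` and `mpi_pinthreads` are the entry points this PR is about, while the `socket` query helper and the exact behaviour of `mpi_getlocalrank` are assumptions on my part:

```julia
using MPI
using ThreadPinning

MPI.Init()

# Variant A: one MPI rank per node. Spread this rank's threads round-robin
# over all NUMA domains of the node, so both sockets get used.
pinthreads(:numa)

# Variant B: several MPI ranks per node. Distribute the ranks among the NUMA
# domains and pin each rank's threads within "its" domain.
mpi_pinthreads(:numa)

# Variant C (manual composition): derive the CPU IDs for this rank from its
# node-local rank, e.g. one socket per local rank; more elaborate schemes
# (NUMA domains within a socket) could be built the same way.
localrank = mpi_getlocalrank()              # node-local rank (0-based), added in this PR
mycpuids  = socket(localrank + 1)           # CPU IDs of "my" socket (assumed query helper)
pinthreads(mycpuids[1:Threads.nthreads()])  # pin this rank's threads into that socket
```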
Thanks for the detailed writeup! I am relearning MPI, having not used it for a long while. I still think it is a valid use-case to have only one MPI rank per node, if nothing else to be able to compare. AFAIU, architectures vary, and for some it is/should be beneficial to utilize all cores within the same SMP program.
Something like ThreadPinning.jl/ext/MPIExt/mpi_querying.jl (lines 55 to 65 in 345cd1b) could be provided by the package, if we so wish, or it could be up to the user to do something like

```julia
using ThreadPinning
using MPI
using Sockets: gethostname   # gethostname lives in the Sockets stdlib

# Returns true if this MPI rank is the only rank running on its node.
function mpi_alone_on_this_node(; comm = MPI.COMM_WORLD, dest = 0)
    rank = MPI.Comm_rank(comm)
    hostname = gethostname()
    all_hostnames = MPI.gather(hostname, comm; root = dest)
    if rank == dest
        # On the root rank: is each hostname unique among all ranks?
        hostname_unique = Dict(h => count(==(h), all_hostnames) == 1
                               for h in unique(all_hostnames))
        # Tell every other rank whether it is alone on its node.
        for (i, h) in enumerate(all_hostnames)
            i == dest + 1 && continue
            MPI.send(hostname_unique[h], comm; dest = i - 1)
        end
        hostname_unique[hostname]
    else
        MPI.recv(comm; source = dest)
    end
end

if mpi_alone_on_this_node()
    pinthreads(:numa)       # only rank on this node: use all NUMA domains
else
    mpi_pinthreads(:numa)   # sharing the node: split the NUMA domains among ranks
end
```

I would prefer the former, I think.
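Whichever branch is taken, the result can be checked on each node; a small sketch using what I believe are ThreadPinning's existing inspection utilities:

```julia
using ThreadPinning

# CPU IDs that the Julia threads of this process are currently running on.
@show getcpuids()

# Overview of the system's CPU threads, highlighting those running Julia threads.
threadinfo()
```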
I think I don't. Not because 1 MPI rank per node is unreasonable (it isn't) but because it would make the semantics of mpi_pinthreads more complicated. Currently, the semantics of mpi_pinthreads do not depend on how the ranks are distributed across nodes.
Fair point. I will play around, having the above function in my user code for the time being. When/if something solidifies, and it is useful to anyone but me, we can revisit this issue.
TODO:
- mpi_getcpuids
- mpi_gethostnames
- mpi_getlocalrank
- mpi_pinthreads
Changelog:
- pinthreads_mpi renamed to mpi_pinthreads
- mpi_getcpuids, mpi_gethostnames, mpi_getlocalrank (based on explicit communication/exchange between MPI ranks) (cc @jagot)
Closes #61
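For completeness, a hedged sketch of how the functions listed above might be combined; the names are taken from this changelog, but the keyword arguments and return values are assumptions:

```julia
using MPI
using ThreadPinning

MPI.Init()

# Pin each rank's Julia threads (new API from this PR).
mpi_pinthreads(:numa)

# Query information across ranks (gathered on the root rank, as I understand it).
cpuids = mpi_getcpuids()       # CPU IDs of the Julia threads of every rank
hosts  = mpi_gethostnames()    # hostname of every rank
lrank  = mpi_getlocalrank()    # this rank's node-local rank

if MPI.Comm_rank(MPI.COMM_WORLD) == 0
    @show hosts
    @show cpuids
end
```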