Overall Design #7
@ryanmrichard I only have a few comments. We need to make a distinction between a "Node" and a "Worker" (i.e. an MPI rank). A "Node" is something physical, i.e. it has some number of CPUs/GPUs, while a "Worker" is a software concept that is superimposed over the "Node". There are a few reasons for this distinction as it stands in the current discussion.
A possible hierarchy could be:

The trick here is that we need to figure out how to detail how the Workers need to interoperate (i.e. how many Workers are talking to the same GPU, etc.). This is probably best done at the "Cluster" level, but I'm not entirely sure. This is where PAPI will come in: give me some set of MPI ranks (Workers) and PAPI should be able to tell me all the things I need to know to avoid doing stupid things (e.g. over-allocating device memory if there's a many-to-many MPI-to-GPU mapping).

Your bringing up allocators is a good point; we should flesh this idea out. I think the general consensus is that things like Umpire are/will be the future of hardware-specific memory allocation and pooling. I know this is what TA uses for GPU allocations, and I could easily set up GauXC to do the same. Having a memory pool for these things would be very interesting, and this is more-or-less what TAMM is using TALSH for. I would highly recommend that we don't try to write something ourselves, though; that would be a nightmare to maintain.
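For concreteness, a minimal sketch of the Umpire pattern described above (the allocator and pool strategy names follow Umpire's documented API; the pool name and sizes are placeholders):

```cpp
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"

int main() {
  // Grab Umpire's singleton resource manager, wrap its device allocator in
  // a pool, and allocate GPU memory through the pool.
  auto& rm   = umpire::ResourceManager::getInstance();
  auto  pool = rm.makeAllocator<umpire::strategy::QuickPool>(
      "gpu_pool", rm.getAllocator("DEVICE"));
  double* buf = static_cast<double*>(pool.allocate(1024 * sizeof(double)));
  // ... hand buf to device kernels ...
  pool.deallocate(buf);
}
```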
2 cents: you should avoid replicating what resource managers do for you. Have a look at jsrun, slurm, etc. They already have a ton of complexity (esp. jsrun) that you don't want to replicate. Plus, these tools deal with the OS/hypervisors in ways that you will not be able to, or will not want to, reproduce.

I suggest you pick a couple of simple scenarios and target those: 1 rank (forget trying to define what a rank corresponds to physically ... core? socket? NUMA region? you may not be able to anyway) driving 1 GPU, and 1 rank driving multiple GPUs. David, why do you need to have multiple ranks driving 1 GPU? Can you not saturate the network using multiple comm threads?
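For reference, the usual wiring for that first scenario (assuming CUDA and MPI): split a node-local communicator and bind each local rank to one device.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  // Split a node-local (shared-memory) communicator so each rank learns
  // its index on the node, then bind that rank to one GPU.
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &local);
  int local_rank = 0, ndev = 0;
  MPI_Comm_rank(local, &local_rank);
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(local_rank % ndev);
  MPI_Finalize();
}
```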
On Fri, Jul 2, 2021 at 10:44 AM Ryan Richard wrote:
To expand a bit more on what I'm thinking.
The backend of the runtime certainly needs to know how many MPI ranks have access to a resource, for scheduling purposes as well as for preparing the Node instance a particular process sees. I'd like to keep the actual rank-to-resource mapping as an implementation detail, though, because it adds a significant amount of complexity to the API. If at all possible, I want the user of the runtime to just focus on submitting tasks which fit within the resources provided; the backend of the runtime then focuses on scheduling those tasks intelligently.
I admittedly haven't written any GPU code, so maybe that's why I still don't understand the many-MPI-ranks-to-a-single-GPU case. I'm under the impression that MPI doesn't interact with the GPU; that's all done through CUDA, HIP, and SYCL. As I understand it, and have used it, MPI is all about communication between physical nodes. Once you're in a node, it all goes through threading. I understand that the GPU needs to be kept busy, but I would have assumed this would be handled at the thread level, i.e. you use a bunch of CPU threads to asynchronously push tasks to the GPU. Using a bunch of MPI processes (particularly if there are more processes than sockets) seems heavy-handed given the cost of a process vs. a thread. The only scenario I can possibly see for multiple MPI ranks to a GPU is if some of those ranks are not physically on the same node as the GPU, although that is probably better characterized as a load-imbalance problem, as you're asking the remote rank to do too much work and it's contracting out to other nodes. I understand that we can't control how the user sets up MPI, i.e. if they want to make a hundred MPI processes per physical node that's their right, but we can offer suggestions and bias our code towards those suggestions.
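A minimal sketch of that threads-feeding-one-GPU model, assuming CUDA; the thread count and the enqueued work are placeholders:

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Each CPU thread owns a CUDA stream and enqueues work asynchronously,
// so one process keeps the GPU busy without extra MPI ranks.
void feed_gpu(int task_id) {
  cudaStream_t s;
  cudaStreamCreate(&s);
  // ... cudaMemcpyAsync(..., s) / kernel launches for task_id ...
  cudaStreamSynchronize(s);
  cudaStreamDestroy(s);
}

int main() {
  std::vector<std::thread> pool;
  for (int i = 0; i < 4; ++i) pool.emplace_back(feed_gpu, i);
  for (auto& t : pool) t.join();
}
```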
Maybe the unifying theme here is to adopt a new perspective and think of:

auto new_node = node.split<CPU, GPU, RandomAccessMemory>(1, 1, 10_gb);

as providing a hint to the scheduler of how computationally intensive your tasks are, vs. actually securing those resources for you. Realistically we then need to go a bit more fine-grained on the resources so that the scheduler gets a better feel for just how much of the CPU and/or GPU you're actually going to use, and it can schedule accordingly.
The goal of the runtime library is to abstract MPI and have a common API that allows us to express parallelism and hardware interactions in a manner that is agnostic to the actual backend (original plan is MADNESS on top of MPI, with maybe ExaPAPI for hardware information).
Assumptions

This design assumes:

- A `Node` object could hold multiple MPI communicators.

Classes
Structure-wise, the main class is `Runtime`. `Runtime` is an effective `MPI_COMM_WORLD` on steroids with RAII ("effective" meaning it doesn't need to be the actual top-level communicator). The code will use `Runtime` instances much in the same way that one would use MPI communicators. Any time your code needs to know something about the current running conditions, you inquire with the `Runtime` (or members of it). You should be able to partition a `Runtime` into sub-`Runtime` instances much like you partition an MPI communicator. Each partition has a subset of the resources available to the parent `Runtime`. The `Runtime` class is a container of `Cluster` instances.
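A hypothetical sketch of that partitioning, modeled on `MPI_Comm_split`; the `partition()` member and its color argument are assumptions:

```cpp
// Hypothetical: carve a Runtime into sub-Runtimes the way MPI_Comm_split
// carves a communicator; each sub-Runtime owns a subset of the resources.
Runtime rt;                                      // wraps the top-level communicator
auto sub = rt.partition(/*color=*/my_rank % 2);  // assumed partition() member
```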
The `Cluster` class is similar to an MPI group. The `Cluster` class is a container of `Node`s.

Ranks on an MPI communicator map to `Node` instances. Each `Node` is a collection of `HardwareComponent` instances. In theory, `Node`s are hardware that is spatially located together; in practice, ensuring that the hardware is local is up to whoever assigns the MPI ranks.
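Hypothetical skeletons, just to pin down the containment relationships described above:

```cpp
#include <vector>

// Containment only; the real classes would carry far more state.
struct HardwareComponent { /* a CPU, a GPU, a memory bank, ... */ };
struct Node    { std::vector<HardwareComponent> components; };
struct Cluster { std::vector<Node> nodes; };      // similar to an MPI group
struct Runtime { std::vector<Cluster> clusters; };
```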
Running Tasks

At a first pass we'll only worry about wrapping MPI calls; wrapping threaded tasks (CPU and GPU) is left for later. The goal of the MPI wrappers would be to express the functionality in an object-oriented manner. In MPI, the "lowest"-level calls are point-to-point calls like send. Point-to-point calls involve `Node` instances and the data to send (we use `data` as a generic placeholder for that data; it will need to be serializable to work with this API). As an example, the `send` call should look something like:
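A minimal sketch, inferring the signature from the `receive` call described below; treat the exact form as an assumption:

```cpp
// Every rank executes this line, but it only becomes a real MPI_Send on
// the rank that owns nodes[0]; on every other rank it is a no-op.
runtime.cluster.nodes[0].send(data, runtime.cluster.nodes[1]);
```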
Under the hood, the above call should be a no-op for all nodes besides `node[0]`. For `node[0]` the call should map to the appropriate `MPI_Send` call. `node[1]` would be expected to call the corresponding `runtime.cluster.nodes[1].receive(data, nodes[0])` call at some point.

For collectives involving a single source the suggested API is:
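A sketch, assuming a hypothetical `broadcast_to` member:

```cpp
// Single source: cluster_a.nodes[0] sends `data`, and every node in
// cluster_b receives it.
cluster_a.nodes[0].broadcast_to(data, cluster_b);  // name is hypothetical
```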
Here `cluster_a.nodes[0]` and all nodes in `cluster_b` participate (in MPI parlance, if `cluster_a` is the same as `cluster_b` this is an intra-communicator collective; otherwise it's an inter-communicator collective).

For collectives involving a single destination the proposed API is:
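A sketch, again with an assumed method name:

```cpp
// Single destination: every node in cluster_a sends, and only
// cluster_b.nodes[0] receives.
cluster_b.nodes[0].gather_from(data, cluster_a);  // name is hypothetical
```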
This API works the opposite of the previous single-source collective: all nodes in `cluster_a` participate as sources, and only `cluster_b.nodes[0]` participates as the destination.

Finally, for a collective where everyone is a source and a destination, the proposed API is:
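A sketch with an assumed name:

```cpp
// All-to-all: every node in cluster_a and in cluster_b is both a source
// and a destination.
cluster_a.exchange_with(data, cluster_b);  // name is hypothetical
```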
Here all nodes in `cluster_a` and in `cluster_b` participate.

Printing
Most of the time we either want every node to print some data or we want only one node to print it. To have all nodes print an object `object`, the proposed API is:
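A sketch mirroring the per-node form shown next; a cluster-wide stream is an assumption:

```cpp
// Every node in the cluster writes `object` to its stdout stream.
cluster.stdout << object;  // hypothetical cluster-wide stream
```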
whereas to have only a certain node print, the API is:

cluster.nodes[0].stderr << object;

The Dream
It's really pushing it, but it would be great to get thread-level parallelism by doing something like:
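Every name in the sketch below is invented; it's only meant to make the dream concrete:

```cpp
// Dream API: expose each CPU thread as a schedulable lane of the node
// and submit tasks to it directly.
auto lanes = node.split_threads();       // invented partitioning call
for (auto& lane : lanes)
  lane.submit([] { /* CPU task */ });    // invented task submission
```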
and to get memory-aware allocators via:
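Equally speculative, assuming an allocator factory on `Node`:

```cpp
#include <vector>

// Dream API: ask the node for an allocator bound to a given memory space
// and use it with standard containers.
auto gpu_alloc = node.allocator<double, GPU>();            // invented
std::vector<double, decltype(gpu_alloc)> buf(1024, gpu_alloc);
```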