Overall Design #7
@ryanmrichard I only have a few comments. We need to make a distinction between a "Node" and a "Worker" (i.e. an MPI rank). A "Node" is something physical, i.e. it has some number of CPUs/GPUs, while a "Worker" is a software concept that is superimposed over the "Node". There are a few reasons for this distinction as it stands in the current discussion.
A possible hierarchy could be:

The trick here is that we need to figure out how to detail how the Workers need to interoperate (i.e. how many Workers are talking to the same GPU, etc.). This is probably best done at the "Cluster" level, but I'm not entirely sure. This is where PAPI will come in: give me some set of MPI ranks (Workers) and PAPI should be able to tell me all the things I need to know to avoid doing stupid things (e.g. over-allocating device memory if there's a many-to-many MPI-to-GPU mapping).

Your bringing up allocators is a good point; we should flesh this idea out. I think the general consensus is that things like Umpire are/will be the future of hardware-specific memory allocation and pooling. I know this is what TA uses for GPU allocations, and I could easily set up GauXC to do the same. Having a memory pool for these things would be very interesting, and this is more-or-less what TAMM is using TALSH for. I would highly recommend that we don't try to write something ourselves, though; that would be a nightmare to maintain.
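For concreteness, a minimal sketch of the Umpire pattern described above (the allocator and pool strategy names follow Umpire's documented API; the pool name and sizes are placeholders):

```cpp
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"

int main() {
  // Grab Umpire's singleton resource manager, wrap its device allocator in
  // a pool, and allocate GPU memory through the pool.
  auto& rm   = umpire::ResourceManager::getInstance();
  auto  pool = rm.makeAllocator<umpire::strategy::QuickPool>(
      "gpu_pool", rm.getAllocator("DEVICE"));
  double* buf = static_cast<double*>(pool.allocate(1024 * sizeof(double)));
  // ... hand buf to device kernels ...
  pool.deallocate(buf);
}
```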
2 cents: you should avoid replicating what resource managers do for you. Have a look at jsrun, slurm, etc. They already have a ton of complexity (esp. jsrun) that you don't want to replicate. Plus, these tools deal with the OS/hypervisors in ways that you will not be able to, or will not want to, reproduce.

I suggest you pick a couple of simple scenarios and target those: 1 rank (forget trying to define what a rank corresponds to physically ... core? socket? NUMA region? you may not be able to anyway) driving 1 GPU, and 1 rank driving multiple GPUs. David, why do you need to have multiple ranks driving 1 GPU? Can you not saturate the network using multiple comm threads?
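For reference, the usual wiring for that first scenario (assuming CUDA and MPI): split a node-local communicator and bind each local rank to one device.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  // Split a node-local (shared-memory) communicator so each rank learns
  // its index on the node, then bind that rank to one GPU.
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &local);
  int local_rank = 0, ndev = 0;
  MPI_Comm_rank(local, &local_rank);
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(local_rank % ndev);
  MPI_Finalize();
}
```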
On Fri, Jul 2, 2021 at 10:44 AM Ryan Richard wrote:
To expand a bit more on what I'm thinking.
The backend of the runtime certainly needs to know how many MPI ranks have access to a resource, for scheduling purposes as well as for preparing the Node instance a particular process sees. I'd like to keep the actual rank-to-resource mapping as an implementation detail, though, because it adds a significant amount of complexity to the API. If at all possible, I want the user of the runtime to just focus on submitting tasks which fit within the resources provided; the backend of the runtime then focuses on scheduling those tasks intelligently.
I admittedly haven't written any GPU code, so maybe that's why I still don't understand the many-MPI-ranks-to-a-single-GPU case. I'm under the impression that MPI doesn't interact with the GPU; that's all done through CUDA, HIP, and SYCL. As I understand it, and have used it, MPI is all about communication between physical nodes. Once you're in a node, it all goes through threading. I understand that the GPU needs to be kept busy, but I would have assumed this would be handled at the thread level, i.e. you use a bunch of CPU threads to asynchronously push tasks to the GPU. Using a bunch of MPI processes (particularly if there are more processes than sockets) seems heavy-handed given the cost of a process vs. a thread. The only scenario I can possibly see for multiple MPI ranks to a GPU is if some of those ranks are not physically on the same node as the GPU, although that is probably better characterized as a load-imbalance problem, as you're asking the remote rank to do too much work and it's contracting out to other nodes. I understand that we can't control how the user sets up MPI, i.e. if they want to make a hundred MPI processes per physical node that's their right, but we can offer suggestions and bias our code towards those suggestions.
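A minimal sketch of that threads-feeding-one-GPU model, assuming CUDA; the thread count and the enqueued work are placeholders:

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Each CPU thread owns a CUDA stream and enqueues work asynchronously,
// so one process keeps the GPU busy without extra MPI ranks.
void feed_gpu(int task_id) {
  cudaStream_t s;
  cudaStreamCreate(&s);
  // ... cudaMemcpyAsync(..., s) / kernel launches for task_id ...
  cudaStreamSynchronize(s);
  cudaStreamDestroy(s);
}

int main() {
  std::vector<std::thread> pool;
  for (int i = 0; i < 4; ++i) pool.emplace_back(feed_gpu, i);
  for (auto& t : pool) t.join();
}
```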
Maybe the unifying theme here is to adopt a new perspective and think of:

auto new_node = node.split<CPU, GPU, RandomAccessMemory>(1, 1, 10_gb);

as providing a hint to the scheduler of how computationally intensive your tasks are, vs. actually securing those resources for you. Realistically we then need to go a bit more fine-grained on the resources so that the scheduler gets a better feel for just how much of the CPU and/or GPU you're actually going to use, and it can schedule accordingly.
The goal of the runtime library is to abstract MPI and have a common API that allows us to express parallelism and hardware interactions in a manner that is agnostic to the actual backend (original plan is MADNESS on top of MPI, with maybe ExaPAPI for hardware information).
Assumptions

This design assumes:

- A `Node` object could hold multiple MPI communicators.

Classes
Structure-wise, the main class is `Runtime`. `Runtime` is an effective `MPI_COMM_WORLD` on steroids with RAII ("effective" meaning it doesn't need to be the actual top-level communicator). The code will use `Runtime` instances much in the same way that one would use MPI communicators. Any time your code needs to know something about the current running conditions, you inquire with the `Runtime` (or members of it). You should be able to partition a `Runtime` into sub-`Runtime` instances much like you partition an MPI communicator. Each partition has a subset of the resources available to the parent `Runtime`. The `Runtime` class is a container of `Cluster` instances.
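A hypothetical sketch of that partitioning, modeled on `MPI_Comm_split`; the `partition()` member and its color argument are assumptions:

```cpp
// Hypothetical: carve a Runtime into sub-Runtimes the way MPI_Comm_split
// carves a communicator; each sub-Runtime owns a subset of the resources.
Runtime rt;                                      // wraps the top-level communicator
auto sub = rt.partition(/*color=*/my_rank % 2);  // assumed partition() member
```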
The `Cluster` class is similar to an MPI group. The `Cluster` class is a container of `Node`s.

Ranks on an MPI communicator map to `Node` instances. Each `Node` is a collection of `HardwareComponent` instances. In theory, `Node`s are hardware that is spatially located together; in practice, ensuring that the hardware is local is up to whoever assigns the MPI ranks.
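Hypothetical skeletons, just to pin down the containment relationships described above:

```cpp
#include <vector>

// Containment only; the real classes would carry far more state.
struct HardwareComponent { /* a CPU, a GPU, a memory bank, ... */ };
struct Node    { std::vector<HardwareComponent> components; };
struct Cluster { std::vector<Node> nodes; };      // similar to an MPI group
struct Runtime { std::vector<Cluster> clusters; };
```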
Running Tasks

At a first pass we'll only worry about wrapping MPI calls; wrapping threaded tasks (CPU and GPU) is left for later. The goal of the MPI wrappers would be to express the functionality in an object-oriented manner. In MPI, the "lowest"-level calls are point-to-point calls like send. Point-to-point calls involve `Node` instances and the data to send (we use `data` as a generic placeholder for that data; it will need to be serializable to work with this API). As an example, the `send` call should look something like:
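A minimal sketch, inferring the signature from the `receive` call described below; treat the exact form as an assumption:

```cpp
// Every rank executes this line, but it only becomes a real MPI_Send on
// the rank that owns nodes[0]; on every other rank it is a no-op.
runtime.cluster.nodes[0].send(data, runtime.cluster.nodes[1]);
```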
Under the hood, the above call should be a no-op for all nodes besides `node[0]`. For `node[0]` the call should map to the appropriate `MPI_Send` call. `node[1]` would be expected to call the corresponding `runtime.cluster.nodes[1].receive(data, nodes[0])` call at some point.

For collectives involving a single source the suggested API is:
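A sketch, assuming a hypothetical `broadcast_to` member:

```cpp
// Single source: cluster_a.nodes[0] sends `data`, and every node in
// cluster_b receives it.
cluster_a.nodes[0].broadcast_to(data, cluster_b);  // name is hypothetical
```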
Here `cluster_a.nodes[0]` and all nodes in `cluster_b` participate (in MPI parlance, if `cluster_a` is the same as `cluster_b` this is an intra-communicator collective; otherwise it's an inter-communicator collective).

For collectives involving a single destination the proposed API is:
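A sketch, again with an assumed method name:

```cpp
// Single destination: every node in cluster_a sends, and only
// cluster_b.nodes[0] receives.
cluster_b.nodes[0].gather_from(data, cluster_a);  // name is hypothetical
```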
This API works the opposite of the previous single-source collective: all nodes in `cluster_a` participate as sources, and only `cluster_b.nodes[0]` participates as the destination.

Finally, for a collective where everyone is a source and a destination, the proposed API is:
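A sketch with an assumed name:

```cpp
// All-to-all: every node in cluster_a and in cluster_b is both a source
// and a destination.
cluster_a.exchange_with(data, cluster_b);  // name is hypothetical
```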
Here all nodes in `cluster_a` and in `cluster_b` participate.

Printing
Most of the time we either want every node to print some data or we want only one node to print it. To have all nodes print an object `object`, the proposed API is:
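A sketch mirroring the per-node form shown next; a cluster-wide stream is an assumption:

```cpp
// Every node in the cluster writes `object` to its stdout stream.
cluster.stdout << object;  // hypothetical cluster-wide stream
```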
whereas to have only a certain node print, the API is:

cluster.nodes[0].stderr << object;

The Dream
It's really pushing it, but it would be great to get thread-level parallelism by doing something like:
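Every name in the sketch below is invented; it's only meant to make the dream concrete:

```cpp
// Dream API: expose each CPU thread as a schedulable lane of the node
// and submit tasks to it directly.
auto lanes = node.split_threads();       // invented partitioning call
for (auto& lane : lanes)
  lane.submit([] { /* CPU task */ });    // invented task submission
```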
and to get memory-aware allocators via:
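Equally speculative, assuming an allocator factory on `Node`:

```cpp
#include <vector>

// Dream API: ask the node for an allocator bound to a given memory space
// and use it with standard containers.
auto gpu_alloc = node.allocator<double, GPU>();            // invented
std::vector<double, decltype(gpu_alloc)> buf(1024, gpu_alloc);
```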