
Remove Cuda #3817

Open · wants to merge 22 commits into master
Conversation


@apfitzge commented Nov 27, 2024

Problem

  • Nobody uses the cuda implementation
  • We have to maintain the cuda implementation
  • The interfaces around cuda are making changes we want difficult

Summary of Changes

  • Remove cuda from the code

Fixes #

@LegNeato commented Dec 6, 2024

Any interest in possibly switching to https://github.com/Rust-GPU/rust-gpu/ instead?

@apfitzge (Author) commented Dec 6, 2024

Any interest in possibly switching to Rust-GPU/rust-gpu instead?

No, that doesn't solve anything here.

@shirotech

@apfitzge so the --cuda flag is broken and doesn't work?

Question: is nobody using it because it doesn't work, or is it not needed because the performance gain isn't much?

@apfitzge (Author)

@apfitzge so the --cuda flag is broken and doesn't work?

Question: is nobody using it because it doesn't work, or is it not needed because the performance gain isn't much?

It's my understanding that it currently works, but the benefit of running it is less than what a node with a GPU would cost you, so nobody does.
The way in which we are storing PacketBatch is blocking other changes we want to make (non-inlined packets). If we were to make those changes, all the CUDA stuff would need to be fixed to work with them...which seems like wasted effort if nobody uses it.

@apfitzge marked this pull request as ready for review January 22, 2025 14:20
@apfitzge changed the title from "[Experiment] Remove Cuda" to "Remove Cuda" Jan 22, 2025
@steviez commented Jan 23, 2025

This seems like something that should be called out in the CHANGELOG:
https://github.com/anza-xyz/agave/blob/master/CHANGELOG.md

@apfitzge added the changelog label (Pull request requires an entry in CHANGELOG.md) Jan 24, 2025
@vadorovsky (Member) left a comment

I have yet to test it properly, but so far it looks good; only one nit.

use super::*;

#[test]
fn test_pinned_vec() {
Suggested change:
-fn test_pinned_vec() {
+fn test_recycled_vec() {

@apfitzge This is not addressed yet 🙂

@vadorovsky (Member) left a comment

There are still some leftovers from CUDA in your branch:

@apfitzge (Author)

@vadorovsky Given your feedback, I'm going to talk with @yihau about removing all the gpu and cuda references from scripts and CI before this PR lands. I'm not familiar with those, so I'll work with him to remove them; once they're removed from CI/scripts, we can remove the implementation/support with this PR.

@apfitzge (Author) commented Feb 3, 2025

Rebased on @yihau's CI changes (🙏 ).

Remaining cuda references:

$ rg "cuda"
CHANGELOG.md
31:  * Remove support for `--cuda` from `agave-validator`

net/scripts/gce-provider.sh
182:    # imageName="ubuntu-2004-focal-v20201211-with-cuda-1

@vadorovsky (Member) left a comment

Last nits, otherwise looks great. Thanks!

use super::*;

#[test]
fn test_pinned_vec() {
@apfitzge This is not addressed yet 🙂

perf/src/recycled_vec.rs (two outdated review threads, resolved)
@vadorovsky previously approved these changes Feb 3, 2025

@vadorovsky (Member) left a comment

Looks good, thanks!

The question is when do we want to merge it. I'm still working on the Packet<Bytes> change (it's taking longer than I expected, sorry 😔) - should we wait until I'm done and therefore 100% sure that the zero-copy approach actually works?

Ping @alessandrod

@steviez commented Feb 4, 2025

Started reviewing but didn't finish. It looks like there is a conflict that will require resolution anyway.

The question is when do we want to merge it. I'm still working on the Packet<Bytes> change (it's taking longer than I expected, sorry 😔) - should we wait until I'm done and therefore 100% sure that the zero-copy approach actually works?

I think it would make sense to push this one first. Given that this change is largely removing code, it should mean less stuff you have to account for with your change, Michal. And I think we feel pretty confident in the Bytes approach.

I don't know exactly how the quinn/TPU integration looks, but at least for the TVU path, the Bytes approach should save us an allocation per shred (currently, each shred allocates owned memory). With current MNB load, that is roughly 3k allocations per second, each in excess of ~1 kB. The number of shreds should go up as we increase CU limits.

@behzadnouri

  • Nobody uses the cuda implementation

That is the case "today".
Sigverify is a major bottleneck of the pipeline, more so once there are more transactions, shreds, gossip, etc. packets to sigverify.
What if the load increases so much that our hands are forced to use gpu for that?

  • The interfaces around cuda are making changes we want difficult

What changes specifically?
Are those changes addressing bottlenecks more significant than sigverify?

perf/src/recycled_vec.rs (outdated review thread, resolved)
entry/src/entry.rs (review thread, resolved)
        Self { packets }
    }

    pub fn new_pinned_with_capacity(capacity: usize) -> Self {
Unless you were planning on doing it, consolidating the various constructors would be a nice follow-on PR here. Not sure if leaving those out was intentional or not, but I think it makes sense to do outside of this PR.


impl<T: Default + Clone + Sized> Reset for RecycledVec<T> {
    fn reset(&mut self) {
        self.resize(0, T::default());
PinnedVec had this line too, but Vec::clear() is probably more appropriate here. It avoids the T::default(), and clear does less work than resize (which calls truncate).

@apfitzge (Author) Feb 6, 2025

Planned for cleanup; I'm 90% sure most of this can just be removed and replaced with a Deref and DerefMut implementation on the RecycledVec type.
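For illustration, a minimal sketch of what that cleanup could look like, assuming RecycledVec ends up as a plain Vec wrapper (names and bounds are illustrative, not the actual PR code):

use std::ops::{Deref, DerefMut};

// Illustrative stand-in: with pinned memory gone, the type can wrap a plain Vec.
pub struct RecycledVec<T>(Vec<T>);

impl<T> Deref for RecycledVec<T> {
    type Target = Vec<T>;
    fn deref(&self) -> &Vec<T> {
        &self.0
    }
}

impl<T> DerefMut for RecycledVec<T> {
    fn deref_mut(&mut self) -> &mut Vec<T> {
        &mut self.0
    }
}

// With these impls most hand-written forwarding methods disappear: callers get the full
// Vec/slice API, e.g. v.clear() covers reset() without needing T: Default + Clone.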

Sounds good / follow-on PR works for me

perf/src/sigverify.rs (review thread, resolved)
Comment on lines 576 to 578
let out = RecycledVec::<u8>::from_vec(
    out.into_iter().flatten().flatten().map(u8::from).collect(),
);
The function exercised in this test, copy_return_values(), was only used in the GPU path AFAIK. So, rip that function + this test out too?

@apfitzge (Author)

Nice catch - 17e2f1a

@apfitzge (Author) commented Feb 5, 2025

  • Nobody uses the cuda implementation

That is the case "today". Sigverify is a major bottleneck of the pipeline, more so once there are more transactions, shreds, gossip, etc. packets to sigverify. What if the load increases so much that our hands are forced to use gpu for that?

  • The interfaces around cuda are making changes we want difficult

What changes specifically? Are those changes addressing bottlenecks more significant than sigverify?

Our hands will never be "forced" to use GPU because there are already better solutions than using a gpu for this.
Changes to make Packet not store the bytes inline: right now, every time we move a Packet we're copying 1232 bytes, and we want to stop copying them.
If the bytes are not inline, the current cuda code is broken.
Is the benefit of no-copy more significant than sigverify? No, obviously not! But that's not the alternative; the alternative is just maintaining this code that no one uses.

If we have to re-add support for GPUs (unlikely), we can fix it then. There are other completely inefficient things the gpu impl is doing as well; it's not worth the cost of maintaining a feature no one uses. The code isn't going into the ether never to be seen again...we can always re-use parts of the current implementation if we need to.
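To illustrate the inline-vs-non-inline point, a rough sketch using the bytes crate; the types here are illustrative stand-ins, not agave's actual Packet/Meta:

use bytes::Bytes;

const PACKET_DATA_SIZE: usize = 1232;

// Inline layout: moving or cloning a packet copies the whole 1232-byte buffer.
struct InlinePacket {
    buffer: [u8; PACKET_DATA_SIZE],
    size: usize,
}

// Non-inline layout: the packet is just a cheap handle (pointer + length + refcount)
// into a larger receive buffer, so individual packets can be moved around cheaply.
struct BytesPacket {
    data: Bytes,
}

fn packet_view(recv_buffer: &Bytes, offset: usize, len: usize) -> BytesPacket {
    // slice() is zero-copy: it returns another refcounted view into the same allocation.
    BytesPacket {
        data: recv_buffer.slice(offset..offset + len),
    }
}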

@behzadnouri

Our hands will never be "forced" to use GPU because there are already better solutions than using a gpu for this.

What are the better solutions?

Changes to make Packet not store the bytes inline: right now, every time we move a Packet we're copying 1232 bytes, and we want to stop copying them.

That would also break the recycler, and you may end up doing even more allocations or memcopies.
Shouldn't we first confirm that that is a good idea before committing to it?!

the alternative is just maintaining this code that no one uses.
it's not worth the cost of maintaining a feature no one uses.

I am not sure how much time anyone has spent on maintaining cuda code in the past couple of years.

@apfitzge (Author) commented Feb 6, 2025

What are the better solutions?

fpga or smart nic

That would also break the recycler, and you may end up doing even more allocations or memcopies. Shouldn't we first confirm that that is a good idea before committing to it?!

recycler really only made sense in the context of pinned memory. jemalloc already keeps caches of memory that it will re-use for the packets.

@alessandrod can probably list the many benefits in networking code.
In SV it will enable us to eventually move prioritization earlier, which is not possible when we have to move batches of packets around...which we're currently forced to do for performance because packet data is inline. If the packets are just ptr + meta, then we can very cheaply move them around individually.
In BS, it stops us from having to manage our own memory and frees up a ton of capacity because our already scheduled packets do not need to take up room in the scheduler buffer.

I am not sure how much time anyone has spent on maintaining cuda code in the past couple of years.

Nobody uses an umbrella until it rains. It's getting in the way now, and would require a large rewrite to make work - all of the indexing does not work if the packet memory is not inline. We're trying to make progress in making the chain better. Taking the time to do this properly slows that down for very little, if any, benefit.

edit: I apologize if my responses seem short or rude. This was all discussed on slack previously in a channel you are in.

@behzadnouri

recycler really only made sense in the context of pinned memory. jemalloc already keeps caches of memory that it will re-use for the packets.

There is no pinned memory here: #4381

We're trying to make progress in making the chain better.

and I am not arguing to make the chain worse. My point is:

  • I am more worried about long-term sigverify scalability than an allocation or memcopy.
  • If a change is potentially reducing our alternatives to address a bigger bottleneck (i.e. sigverify), let's at least do some testing first to confirm we get anything out of it before committing to it.

@apfitzge (Author) commented Feb 6, 2025

There is no pinned memory here: #4381

There's also no jemalloc in that benchmark though, which makes it not indicative of a running validator's allocation performance.

That said, I'm unable to replicate the behavior on my devbox and see no difference between master (2974f02), reverting #4381, or adding jemalloc.
All gave me around 140k/s +/- 2k. But again, I'm not sure what options were used for that PR's testing.

and I am not arguing to make the chain worse. My point is:

* I am more worried about long-term sigverify scalability than an allocation or memcopy.

* If a change is potentially reducing our alternatives to address a bigger bottleneck (i.e. sigverify), let's at least do some testing first to confirm we get anything out of it before committing to it.

I know you're not arguing to make it worse, and did not intend to imply that - I think we both want what is best.
In my view it does not reduce our alternatives. If we need cuda we can spend the time to fix it. I have not evaluated it, but jump has claimed to have a significantly better cpu implementation for signature verification, which is a more immediate path to additional capacity if SV is close to becoming the major bottleneck.

@steviez left a comment

Looks like you have another conflict that will prevent merge to master 😢


impl<T: Default + Clone + Sized> Reset for RecycledVec<T> {
    fn reset(&mut self) {
        self.resize(0, T::default());
Sounds good / follow-on PR works for me

@apfitzge (Author) commented Feb 6, 2025

Looks like you have another conflict that will prevent merge to master 😢

Yeah, not surprising. I would like to resolve these larger conversations before fixing the conflict and merging, though.

@sakridge commented Feb 6, 2025

edit: I apologize if my responses seem short or rude. This was all discussed on slack previously in a channel you are in.

We did discuss this earlier in slack, and it is true that no validator that we know of is using it today, and it seems it will simplify reducing copies in the network pipeline and scheduler in the short term. We can still use the code in the future if we'd like to bring it back, so I'm somewhat on board with this. I think we'll be able to create the Bytes view on top of the contiguous packet batch view if necessary. Packet batching should have benefits for CPU performance as well.

That's a bit orthogonal to the Recycler discussion; we should double-check we aren't regressing anything there. Maybe @lijunwangs has the benchmark setup for reproducing that.

What are the better solutions?
fpga or smart nic

This is a bit hand-wavy though, what is the evidence to say it's better? Does a smartnic even exist today that can do ed25519 verify on arbitrary packet data? I had trouble finding one.

@apfitzge (Author) commented Feb 6, 2025

That's a bit orthogonal to the Recycler discussion; we should double-check we aren't regressing anything there. Maybe @lijunwangs has the benchmark setup for reproducing that.

Sure, I would appreciate any checks against regression. Any benchmark should be using jemalloc so that it is similar to the validator's operation.
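For reference, a minimal sketch of opting a benchmark binary into jemalloc, assuming the tikv-jemallocator crate (the validator's actual setup may differ):

// Route all heap allocations through jemalloc so the benchmark's allocator behavior
// resembles the validator's.
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // ...run the benchmark as usual; Vec, Box, etc. now allocate via jemalloc.
}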

edit: but also to be clear, we didn't remove recycler in this PR. We only removed recyclers that were only used for the GPU code paths.

This is a bit hand-wavy though, what is the evidence to say it's better? Does a smartnic even exist today that can do ed25519 verify on arbitrary packet data? I had trouble finding one.

Yeah, I'll admit it is and was hand-wavy. I'll just use jump's numbers from this 2023 talk.

  • ~30ktps / cpu core
  • ~1mtps / gpu (batching adds latency, 300W)
  • ~1mtps / fpga (streaming, 50W)

In terms of smart nic - I'll just retract that. I don't know enough about them and was listing the alternatives discussed on slack. My understanding is that some smart nics have fpgas built in, so it may require significantly reworking fd's implementation, but I believe it would not be too dissimilar.

@apfitzge (Author) commented Feb 6, 2025

We did discuss this earlier in slack, and it is true that no validator that we know of is using it today, and it seems it will simplify reducing copies in the network pipeline and scheduler in the short term.

Will also expand upon this.
It gives us the possibility of moving prioritization earlier so that banking can always ingest the highest priority packets instead of needing to go in network order to find the best.

We can still use the code in the future if we'd like to bring it back, so I'm somewhat on board with this.

An alternative to deleting it entirely is to add a packet copy into some cuda-registered memory before we send it off for gpu verification.
That would allow us to at least isolate all the gpu stuff from the rest of the pipeline, allow CPU code to do zero-copy, and re-use the existing cuda implementation safely. We wouldn't be doing more packet copies than today, just more than the CPU path would after the change. This is significantly easier than fixing the implementation to work with non-inline memory that may or may not be contiguous.

I'm happy with either deleting or isolating; but fixing the impl to work with non-inlined data is a bigger lift.
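A rough sketch of the isolation idea; the staging buffer and GPU entry point below are hypothetical stand-ins, not real agave or CUDA APIs:

use bytes::Bytes;

// Stand-in for a contiguous, cuda-registered (page-locked) staging area.
struct PinnedStaging {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

// Hypothetical GPU entry point, stubbed out so the sketch is self-contained.
fn gpu_verify_batch(_data: &[u8], offsets: &[usize]) -> Vec<bool> {
    vec![true; offsets.len()]
}

fn verify_batch_on_gpu(packets: &[Bytes], staging: &mut PinnedStaging) -> Vec<bool> {
    staging.data.clear();
    staging.offsets.clear();
    // The one extra cost versus today: a copy per packet into contiguous,
    // device-visible memory. Everything upstream of this point stays zero-copy.
    for packet in packets {
        staging.offsets.push(staging.data.len());
        staging.data.extend_from_slice(&packet[..]);
    }
    gpu_verify_batch(&staging.data, &staging.offsets)
}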

@steviez commented Feb 7, 2025

I think we'll be able to create the Bytes view on top of the contiguous packet batch view if necessary. Packet batching should have benefits for CPU performance as well.

Granted I haven't seen the branch, but this is my understanding of how this would work as well. I added a quick comment about this in #4803 (comment), but we shouldn't do much/any worse than we currently do with Vec<Packet>; I guess just one extra pointer deref per packet. But iterating packet payloads will still hit that contiguous buffer.
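A small sketch of what "Bytes views on top of a contiguous batch" could look like with the bytes crate (offsets and lengths here are illustrative):

use bytes::Bytes;

// One allocation for the whole received batch, N cheap views into it.
fn split_batch(batch: Bytes, packet_lens: &[usize]) -> Vec<Bytes> {
    let mut packets = Vec::with_capacity(packet_lens.len());
    let mut start = 0;
    for &len in packet_lens {
        // slice() is zero-copy: each packet shares the batch's refcounted allocation,
        // so iterating payloads still walks one contiguous buffer.
        packets.push(batch.slice(start..start + len));
        start += len;
    }
    packets
}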

In terms of long-term scalability, another idea that has come up is huge pages. Bytes might get us closer to zero-copy, but huge pages would get us closer to zero-copy + zero runtime allocations. Given that huge pages can be done in software, this seems like something that we would want to try before telling 1400 MNB validators to figure out how to get a GPU in their rack. And, I believe huge pages are inherently pinned, so whatever we cook up to work with huge pages should inherently support DMA + hardware offload.
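For reference, a hedged sketch of a software huge-page allocation on Linux using the libc crate; it assumes huge pages have already been reserved (e.g. via vm.nr_hugepages):

use libc::{mmap, MAP_ANONYMOUS, MAP_FAILED, MAP_HUGETLB, MAP_PRIVATE, PROT_READ, PROT_WRITE};
use std::ptr;

// Returns an anonymous, private, huge-page-backed mapping of `len` bytes, or None on failure.
// Hugetlb pages are not swappable, which is what makes them candidates for DMA/offload later.
fn alloc_huge_pages(len: usize) -> Option<*mut u8> {
    let addr = unsafe {
        mmap(
            ptr::null_mut(),
            len,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
            -1,
            0,
        )
    };
    if addr == MAP_FAILED {
        None
    } else {
        Some(addr as *mut u8)
    }
}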

To be clear, I'm NOT suggesting we postpone remove-CUDA + Bytes in favor of the huge pages approach. Rather, I'm pointing out that there is always some better™️ optimization on the horizon, but that shouldn't stop us from pursuing short/medium-term improvements. So, I'm in favor of the (possibly temporary) removal of CUDA support that this PR makes.

@sakridge commented Feb 7, 2025

Yeah I'll admit it is and was hand wavy. I'll just use jump's numbers from this 2023 talk.

  • ~30ktps / cpu core
  • ~1mtps / gpu (batching adds latency, 300W)
  • ~1mtps / fpga (streaming, 50W)

I've benchmarked an Nvidia 3090 (released 2020) at around 3m/s and a 4090 (released in 2022) at 10m/s, so I'm not sure how much I trust these numbers for GPU or what implementation they are using. There's no type of GPU, code, or concrete benchmarks presented. I agree GPU likely has more latency; from my testing it does seem to be in the 5-10ms range for batch sizes which exceed the CPU speed, which is not great but is still workable within our constraints.

FPGAs likely have a power advantage, though it's not clear how much that really matters, but I think there probably needs to be a more rigorous analysis comparing like-for-like since there are many models of FPGAs and GPUs. I don't think one can take some spitballed numbers from a slide to do a good analysis, and the situation changes since new GPUs and FPGAs are released all the time with different software stacks and whatnot that can improve latency and overheads. There are costs and availability concerns, especially with FPGAs, where the expensive ones can cost $20k+ each and aren't common in datacenters. Anyway, I think my position is I don't really know which hardware wins here and it's nice to have options. I also think it is somewhat likely that FPGAs have similar memory constraints to the GPU, in that to get the best performance you would set up DMA copy engines, which can't really deal with CPU page faults well (or at all) without a huge complexity hit or a large list of memory ranges that you would need for a highly fragmented copy to device memory.

I think the extra copy would be fine to introduce for the GPU for now, and keep the path in. I think it will be somewhat harder to add it back in later if we completely remove it.

@behzadnouri

create the Bytes view on top of the contiguous packet batch

I think the starting presumption here is that moving to Bytes is a good thing and improves performance.
But I am not even confident that is true:

  • Bytes does dynamic dispatch, which is pretty slow, particularly so in certain runtime access patterns.
  • We already use Recycler for packets, which Bytes is not compatible with (unless we do memcopies anyway). So I'm not even sure we will do fewer allocations or memcopies with Bytes.
  • Bytes is pushing out the gpu code, apparently. Again, sigverify is a bigger bottleneck than an allocation or memcopy.
  • Bytes does not work with [u8; N], so you are always forced into an extra indirection (Packet is just a simple [u8; N] wrapper). Bytes does not work with Arc<Vec<u8>> either.

Moving to Bytes needs pretty big and widespread changes (including this one), and if they are committed to master, in 2 or 3 months it would be practically impossible to revert them (due to merge conflicts and the code diverging).

So why not develop these changes first on an off-master branch, and get some reliable estimated numbers showing that the performance improvements from Bytes (if any) do indeed justify the downsides?

If anything, an off-master branch would allow us to iterate much faster.

@alexpyattaev

Adding some insight from someone with several years of FPGA code development:

  • Maintaining FPGA code is pretty nightmarish, most tools are closed-source and terrible. Build times are a horror story like no other. CI tooling is also largely proprietary and terrible.
  • Debugging FPGA/CPU binding/driver code is unfun. Very very unfun. Let us not do it.
  • Pretty much all good FPGA code is platform-specific. What works well on Altera/Intel may not work so well on Xilinx. Port mapping (i.e. interfacing with outside world) is always platform specific.
  • Talking to the FPGA accelerator generally requires a kernel driver of some sort (as you need to map some address space to be accessible by the device via PCIe) and/or a proprietary SDK (which is terrible). All of that is nasty and wildly unstable.
  • Contrary to what was mentioned above, FPGA accelerators do not necessarily require any particularly fancy memory layout of the input data, as long as there is sufficient bus bandwidth to copy things from main memory to the FPGA. FPGA-enabled smart NICs work on a per-packet basis and as such do not care how we store packets in agave.

Based on the above, my recommendation is to stay away from FPGA code if at all possible. Given that GPU acceleration can provide far more throughput than CPU with reasonable latencies, my suggestion would be to rely on those rather than FPGAs.

For me it seems that sigverify as such does not really require much context to work (and whatever context is required can be easily provided over RPC). So doing sigverify in a separate process (or even on a separate host) is not so hard. Having a dedicated process that uses e.g. a GPU to bulk sigverify all passing packets and just drop all the invalid ones makes far more sense to me than having to talk to the GPU from within agave:

  • we would get all the same perf benefits but without the added complexity in the agave codebase
  • if the GPU accelerator process segfaults due to some silly driver bug, it can be restarted in seconds rather than 15 minutes
  • one can have several of those GPU boxes with failover set up
  • for the operators it would also add the flexibility of being able to use normal "gaming PCs" as GPU accelerator frontends for the validator rather than fancy, expensive GPU nodes tuned for ML applications

Labels: changelog (Pull request requires an entry in CHANGELOG.md)

8 participants