Remove Cuda #3817
base: master
Conversation
Any interest in possibly switching to https://github.com/Rust-GPU/rust-gpu/ instead?
No, that doesn't solve anything here.
@apfitzge So the question is: is nobody using it because it doesn't work, or because the performance gain isn't worth it?
It's my understanding that it currently works, but the benefit of running it is less than what a node with a GPU would cost you, so nobody does.
This seems like something that should be called out in the CHANGELOG.
I have yet to test it properly, but so far it looks good; only one nit.
perf/src/recycled_vec.rs
Outdated
use super::*;

#[test]
fn test_pinned_vec() {
Suggested change:
-fn test_pinned_vec() {
+fn test_recycled_vec() {
@apfitzge This is not addressed yet 🙂
There are still some leftovers from CUDA in your branch:

- https://github.com/apfitzge/agave/blob/kill_cuda/ci/setup-new-buildkite-agent/setup-cuda.sh - I think this file should go away.
  - Then remove the call of this script from `setup_new_machine.sh` and `setup-partner-node.sh` as well.
- `ci/README` still mentions CUDA:
  - https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/ci/README.md?plain=1#L39-L45
  - https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/ci/README.md?plain=1#L68-L72
  - https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/ci/README.md?plain=1#L110-L113
  - https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/ci/README.md?plain=1#L137-L148
  - https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/ci/README.md?plain=1#L162
  - https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/ci/README.md?plain=1#L168
  - https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/ci/README.md?plain=1#L182-L184
- I think we don't need this script as well? https://github.com/apfitzge/agave/blob/8cd2531fc1b9575272ef9df26a3b096fac2749c0/net/scripts/enable-nvidia-persistence-mode.sh
  - `ci/test-stable.sh` calls the script above and sets CUDA-related variables.
- This comment needs to be updated: https://github.com/apfitzge/agave/blob/kill_cuda/sdk/clock/src/lib.rs#L140
- `net/scripts/gce-provider.sh`
- `net/scripts/ec2-provider.sh`
- `net/README.md`
@vadorovsky Given your feedback, I'm going to talk with @yihau about removing all the GPU and CUDA references from scripts and CI before this PR lands. I'm not familiar with those, so I'll work with him to remove them; once that's out of CI/scripts, we can remove the implementation/support with this PR.
Rebased on @yihau's CI changes (🙏 ). Remaining cuda references:
Last nits, otherwise looks great. Thanks!
perf/src/recycled_vec.rs
Outdated
use super::*;

#[test]
fn test_pinned_vec() {
@apfitzge This is not addressed yet 🙂
Looks good, thanks!
The question is when we want to merge it. I'm still working on the Packet<Bytes> change (it's taking longer than I expected, sorry 😔). Should we wait until I'm done and therefore 100% sure that the zero-copy approach actually works?
Ping @alessandrod
Started reviewing but didn't finish. But it looks like there is a conflict that will require resolution anyway.
I think it would make sense to push this one first. Given that this change is largely removing code, it should mean less stuff you have to account for with your change, Michal. And I think we feel pretty confident with the Bytes approach. I don't know exactly how the quinn/TPU integration looks, but at least for the TVU path, the
That is the case "today".
What changes specifically?
        Self { packets }
    }

    pub fn new_pinned_with_capacity(capacity: usize) -> Self {
Unless you were planning on doing it, consolidating the various constructors would be a nice follow-on PR here. Not sure if leaving those out was intentional or not, but I think it makes sense to do outside of this PR.
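Purely for illustration, a rough sketch of what that consolidation could look like, assuming the snippet above is a PacketBatch constructor and the type keeps a `packets` field; the surviving constructor name and the `RecycledVec::with_capacity` helper are guesses, not the actual API:

```rust
impl PacketBatch {
    /// With the pinned-memory path gone, the pinned/unpinned constructor
    /// variants could collapse into a single one (sketch only; the real
    /// constructor set may differ).
    pub fn with_capacity(capacity: usize) -> Self {
        let packets = RecycledVec::with_capacity(capacity);
        Self { packets }
    }
}
```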
impl<T: Default + Clone + Sized> Reset for RecycledVec<T> {
    fn reset(&mut self) {
        self.resize(0, T::default());
PinnedVec had this line too, but Vec::clear() is probably more appropriate here. It avoids the T::default(), and clear does less work than resize (which will call truncate).
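A minimal sketch of that suggestion, assuming RecycledVec exposes Vec::clear (either as an inherent method or via the Deref discussed below):

```rust
impl<T: Default + Clone + Sized> Reset for RecycledVec<T> {
    fn reset(&mut self) {
        // Drop all elements without constructing a T::default() and without
        // going through resize()'s truncate/extend bookkeeping.
        self.clear();
    }
}
```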
Planned for clean up; I'm 90% sure most of this can just be removed and replaced with a Deref and DerefMut implementation on the RecycledVec type.
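A minimal sketch of that direction, assuming RecycledVec<T> wraps an inner Vec<T> (the `inner` field name is hypothetical):

```rust
use std::ops::{Deref, DerefMut};

impl<T: Default + Clone + Sized> Deref for RecycledVec<T> {
    type Target = Vec<T>;

    fn deref(&self) -> &Vec<T> {
        // Read-only Vec methods (len, iter, indexing, ...) forward to the inner Vec.
        &self.inner
    }
}

impl<T: Default + Clone + Sized> DerefMut for RecycledVec<T> {
    fn deref_mut(&mut self) -> &mut Vec<T> {
        // Mutating Vec methods (push, clear, resize, ...) forward as well.
        &mut self.inner
    }
}
```

With that in place, most of the hand-written forwarding methods on RecycledVec could indeed go away.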
Sounds good / follow-on PR works for me
perf/src/sigverify.rs
Outdated
let out = RecycledVec::<u8>::from_vec(
    out.into_iter().flatten().flatten().map(u8::from).collect(),
);
The function exercised in this test, copy_return_values(), was only used in the GPU path AFAIK. So, rip that function + this test out too?
Nice catch - 17e2f1a
Our hands will never be "forced" to use a GPU because there are already better solutions than using a GPU for this. If we have to re-add support for GPUs (unlikely), we can fix it then. There are other completely inefficient things the GPU impl is doing as well. It's not worth the cost of maintaining a feature no one uses. The code isn't going into the ether, never to be seen again... we can always re-use parts of the current implementation iff we need to.
What are the better solutions?
That would also break the recycler, and you may end up doing even more allocations or memcopies.
I am not sure how much time anyone has spent on maintaining the CUDA code in the past couple of years.
fpga or smart nic
The recycler really only made sense in the context of pinned memory; jemalloc already keeps caches of memory that it will re-use for the packets. @alessandrod can probably list the many benefits in networking code.
Nobody uses an umbrella until it rains. It's getting in the way now, and would require a large rewrite to make it work: all of the indexing does not work if the packet memory is not inline. We're trying to make progress in making the chain better. Taking the time to do this properly slows that down for very little, if any, benefit. edit: I apologize if my responses seem short or rude. This was all discussed on Slack previously, in a channel you are in.
There is no pinned memory here: #4381, and I am not arguing to make the chain worse. My point is:
There's also no jemalloc in that benchmark, though, which makes it not indicative of a running validator's allocation performance. That said, I'm unable to replicate the behavior on my devbox and see no difference between master (2974f02), reverting #4381, or adding jemalloc.
I know you're not arguing to make it worse, and did not intend to imply that - I think we both want what is best.
Looks like you have another conflict that will prevent merge to master 😢
impl<T: Default + Clone + Sized> Reset for RecycledVec<T> {
    fn reset(&mut self) {
        self.resize(0, T::default());
Sounds good / follow-on PR works for me
Yeah, not surprising. Would like to resolve these larger conversations before fixing the conflict and merging, though.
We did discuss this earlier in Slack, and it is true that no validator we know of is using it today, and it seems removing it will simplify reducing copies in the network pipeline and scheduler in the short term. We can still use the code in the future if we'd like to bring it back, so I'm somewhat on board with this. I think we'll be able to create the Bytes view on top of the contiguous packet batch view if necessary. Packet batching should have benefits for CPU performance as well. That's a bit orthogonal to the Recycler discussion; we should double check we aren't regressing anything there, maybe @lijunwangs has the benchmark setup for reproducing that.

This is a bit hand-wavy though; what is the evidence that it's better? Does a smartnic even exist today that can do ed25519 verify on arbitrary packet data? I had trouble finding one.
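On the Bytes view mentioned above, a minimal sketch of how zero-copy views over a contiguous packet batch could look; the buffer layout, the (offset, len) bookkeeping, and the function are hypothetical, not the actual Packet<Bytes> branch:

```rust
use bytes::Bytes;

/// Hypothetical contiguous batch: every packet payload lives in one allocation,
/// tracked by per-packet (offset, len) entries.
fn packet_views(batch: Vec<u8>, offsets: &[(usize, usize)]) -> Vec<Bytes> {
    // Bytes::from takes ownership of the allocation; each slice() below is a
    // cheap reference-counted view into it, not a copy.
    let batch = Bytes::from(batch);
    offsets
        .iter()
        .map(|&(start, len)| batch.slice(start..start + len))
        .collect()
}
```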
Sure, would appreciate any checks against regression. Any benchmark should be using jemalloc so it is similar to the validator's operation. edit: but also to be clear, we didn't remove recycler in this PR. We only removed recyclers that were only used for the GPU code paths.
Yeah, I'll admit it is and was hand-wavy. I'll just use Jump's numbers from this 2023 talk.
In terms of smart NICs, I'll just retract that. I don't know enough about them and was listing the alternatives discussed on Slack. My understanding is that some smart NICs have FPGAs built in, so it may require significantly reworking fd's implementation, but I believe it would not be too dissimilar.
Will also expand upon this.
An alternative to deleting it entirely is to add a packet copy into some cuda-registered memory before we send it off for GPU verification. I'm happy with either deleting or isolating; but fixing the impl to work with non-inlined data is a bigger lift.
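To make that staging-copy idea concrete, a rough sketch in plain Rust; the actual CUDA registration is out of scope here and every name below is hypothetical:

```rust
/// Stand-in for a cuda-registered (page-locked) staging area. In a real GPU
/// path this allocation would be registered with the driver so it can be
/// DMA'd to the device; here it is just a contiguous Vec<u8>.
struct GpuStagingBuffer {
    bytes: Vec<u8>,
    /// (offset, len) of each packet inside `bytes`, so index-based access on
    /// the device keeps working even though the source packets are no longer
    /// stored inline.
    offsets: Vec<(usize, usize)>,
}

impl GpuStagingBuffer {
    /// Copy a batch of (possibly non-contiguous) packets into one contiguous
    /// region before handing it off for GPU signature verification.
    fn stage<'a>(&mut self, packets: impl IntoIterator<Item = &'a [u8]>) {
        self.bytes.clear();
        self.offsets.clear();
        for pkt in packets {
            let start = self.bytes.len();
            self.bytes.extend_from_slice(pkt);
            self.offsets.push((start, pkt.len()));
        }
    }
}
```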
Granted I haven't seen the branch, but this is my understanding of how this would work as well. I added a quick comment about this in #4803 (comment), but we shouldn't do much/any worse than we currently do with

In terms of long term scalability, another idea that has come up is huge pages. To be clear, I'm NOT suggesting we postpone remove-CUDA + Bytes in favor of the huge pages approach. Rather, I'm pointing out that there is always some better™️ optimization on the horizon, but that shouldn't stop us from pursuing short/medium term improvements. So, I'm in favor of the (possibly temporary) removal of CUDA support that this PR makes.
I've benchmarked an Nvidia 3090 (released 2020) at around 3m/s and a 4090 (released 2022) at 10m/s, so I'm not sure how much I trust these numbers for GPU or what implementation they are using. There's no type of GPU, code, or concrete benchmarks presented. I agree GPU likely has more latency; from my testing it does seem to be in the 5-10ms range for batch sizes which exceed the CPU speed, which is not great but is still workable within our constraints. FPGAs likely have a power advantage, though it's not clear how much that really matters, but I think there probably needs to be a more rigorous analysis comparing like-for-like, since there are many models of FPGAs and GPUs. I don't think one can take some spitballed numbers from a slide to do a good analysis, and the situation changes since new GPUs and FPGAs are released all the time with different software stacks and whatnot that can improve latency and overheads. There are also cost and availability concerns, especially with FPGAs, where the expensive ones can cost $20k+ each and aren't common in datacenters.

Anyway, I think my position is that I don't really know which hardware wins here and it's nice to have options. I also think it could be somewhat likely that FPGAs have similar memory constraints to the GPU, in that to get the best performance you would set up DMA copy engines which can't really deal with CPU page-faults well, or at all, without a huge complexity hit or a large list of memory ranges that you would need for a highly fragmented copy to device memory.

I think the extra copy would be fine for me to introduce for the GPU for now and keep the path in. I think it will be somewhat harder to add it back in later if we completely remove it.
I think the starting presumption here is that moving to Packet<Bytes> is going to bring performance improvements.

So why not develop these changes first on an off-master branch, and get some reliable estimated numbers for the performance improvements from Packet<Bytes>? If anything, an off-master branch would allow us to iterate much faster.
Adding some insight from someone with several years of FPGA code development:

Based on the above, my recommendation is to stay away from FPGA code if at all possible. Given that GPU acceleration can provide far more throughput than CPU with reasonable latencies, my suggestion would be to rely on those rather than FPGAs.

For me it seems that sigverify as such does not really require much context to work (and whatever context is required can be easily provided over RPC), so doing sigverify in a separate process (or even on a separate host) is not so hard. Having a dedicated process that uses e.g. a GPU to bulk-sigverify all passing packets and just drop all the invalid ones makes far more sense to me than having to talk to the GPU from within agave:
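As a sketch of the shape such a standalone sigverify filter could take (the trait, types, and function here are hypothetical; the transport and the actual ed25519 backend are left abstract):

```rust
/// A packet as it would arrive at the dedicated sigverify process, e.g. over IPC or RPC.
struct PacketToVerify {
    data: Vec<u8>,
}

/// Backend-agnostic bulk verifier: could be CPU, GPU, or anything else.
trait BulkVerifier {
    /// Returns one bool per packet: true if its signature(s) verified.
    fn verify(&self, packets: &[PacketToVerify]) -> Vec<bool>;
}

/// The filter stage described above: verify everything in bulk and forward
/// only the packets that passed, dropping the invalid ones.
fn filter_valid<V: BulkVerifier>(
    verifier: &V,
    packets: Vec<PacketToVerify>,
) -> Vec<PacketToVerify> {
    let results = verifier.verify(&packets);
    packets
        .into_iter()
        .zip(results)
        .filter_map(|(pkt, ok)| ok.then_some(pkt))
        .collect()
}
```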
Problem
Summary of Changes
Fixes #