-
Notifications
You must be signed in to change notification settings - Fork 120
gdma: save/restore functionality #1240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
// Copyright (c) Microsoft Corporation. | ||
// Licensed under the MIT License. | ||
|
||
//! Mana save/restore module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be in a separate crate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not apparent as part of this change, but a future PR will implement save/restore for Endpoint
s which are part of net_backend
. This crate should likely be renamed, but my understanding is that net_backend
shouldn't take a dep on anything endpoint-specific (like mana_driver
for instance). Happy to figure out an alternative here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't sound like the right design. Endpoint
/net_backend
shouldn't care about save/restore at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(In the same way that the NVMe driver code's save/restore support doesn't infect disk_backend
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on why this isn't just in mana_driver::save_restore
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation I have currently adds a save
method to Endpoint
, which returns an EndpointSavedState
(an enum which currently only has one variant - ManaEndpointSavedState
. In order to import that type for the save
method's return type, I end up importing that new crate into net_backend
. I'm sure there's some design here that I'm missing but I'm not sure what it is. Definitely agree that net_backend
shouldn't care about save/restore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who calls the save
method on Endpoint
?
We shouldn't be saving any backend-specific state as part of the netvsp
device state. The backend state needs to be wired up back into the servicing state structures, the way we did it for NVMe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A rough outline of the flow I have is that UhVmNetworkSettings
has a save
method that is similar to shutdown_vf_devices
, but instead calls a new SaveState
RPC on HclNetworkVfManager
. The save RPC disconnects the endpoints and calls save on them, bubbling the saved state back up to the servicing state returned by LoadedVm::save
as part of handle_servicing_inner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, OK. So then the save method doesn't need to be on the Endpoint
trait, it can be on the specific implementation, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved everything to mana_driver::save_restore
. I'll rework my prototype to not need to touch net_backend
@@ -499,6 +530,182 @@ impl<T: DeviceBacking> GdmaDriver<T> { | |||
Ok(this) | |||
} | |||
|
|||
#[allow(dead_code)] | |||
pub async fn save(mut self) -> anyhow::Result<GdmaDriverSavedState> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what should happen when hwc_failure = true?
let doorbell = self.bar0.save(Some(self.db_id as u64)); | ||
|
||
let mut interrupt_config = Vec::new(); | ||
for (index, interrupt) in self.interrupts.iter().enumerate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let interrupt_config = self.interrupts.iter().enumerate().filter_map(|(index, interrupt)| { ... }.collect();
mut device: T, | ||
dma_buffer: MemoryBlock, | ||
) -> anyhow::Result<Self> { | ||
tracing::info!("restoring gdma driver"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a lot of this code at the beginning should be moved into a helper so new() and restore() can both use it
state: CqEqSavedState, | ||
doorbell: DoorbellPage, | ||
) -> anyhow::Result<Self> { | ||
Ok(Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend the same pattern as new_cq where this just calls a Self::restore(...) function.
#[mesh(9)] | ||
pub hwc_subscribed: bool, | ||
|
||
/// Whether the eq is armed or not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would just be safest to not save armed state and always arm on restore.
/// Saved state of an interrupt for restoration during servicing | ||
#[derive(Protobuf, Clone, Debug)] | ||
#[mesh(package = "mana_driver")] | ||
pub struct InterruptSavedState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct behavior, but it is interesting to note that the netvsp code will immediately retarget all of the queues on restore with these same values, so holistically there is some redundancy here.
First in a series of PRs implementing keepalive for mana devices. This is save/restore functionality for the GdmaDriver.
I've introduced a new crate,
mana_save_restore
for shared types that are needed across different crates (may not be necessary but was while I was prototyping to deal with a cyclic dependency).