Skip to content

gdma: save/restore functionality #1240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

justus-camp-microsoft
Copy link
Contributor

First in a series of PRs implementing keepalive for mana devices. This is save/restore functionality for the GdmaDriver.

I've introduced a new crate, mana_save_restore for shared types that are needed across different crates (may not be necessary but was while I was prototyping to deal with a cyclic dependency).

@justus-camp-microsoft justus-camp-microsoft requested review from a team as code owners April 24, 2025 17:18
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Mana save/restore module
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be in a separate crate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not apparent as part of this change, but a future PR will implement save/restore for Endpoints which are part of net_backend. This crate should likely be renamed, but my understanding is that net_backend shouldn't take a dep on anything endpoint-specific (like mana_driver for instance). Happy to figure out an alternative here

Copy link
Member

@jstarks jstarks Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound like the right design. Endpoint/net_backend shouldn't care about save/restore at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(In the same way that the NVMe driver code's save/restore support doesn't infect disk_backend).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why this isn't just in mana_driver::save_restore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation I have currently adds a save method to Endpoint, which returns an EndpointSavedState (an enum which currently only has one variant - ManaEndpointSavedState. In order to import that type for the save method's return type, I end up importing that new crate into net_backend. I'm sure there's some design here that I'm missing but I'm not sure what it is. Definitely agree that net_backend shouldn't care about save/restore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who calls the save method on Endpoint?

We shouldn't be saving any backend-specific state as part of the netvsp device state. The backend state needs to be wired up back into the servicing state structures, the way we did it for NVMe.

Copy link
Contributor Author

@justus-camp-microsoft justus-camp-microsoft Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A rough outline of the flow I have is that UhVmNetworkSettings has a save method that is similar to shutdown_vf_devices, but instead calls a new SaveState RPC on HclNetworkVfManager. The save RPC disconnects the endpoints and calls save on them, bubbling the saved state back up to the servicing state returned by LoadedVm::save as part of handle_servicing_inner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, OK. So then the save method doesn't need to be on the Endpoint trait, it can be on the specific implementation, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved everything to mana_driver::save_restore. I'll rework my prototype to not need to touch net_backend

@@ -499,6 +530,182 @@ impl<T: DeviceBacking> GdmaDriver<T> {
Ok(this)
}

#[allow(dead_code)]
pub async fn save(mut self) -> anyhow::Result<GdmaDriverSavedState> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should happen when hwc_failure = true?

let doorbell = self.bar0.save(Some(self.db_id as u64));

let mut interrupt_config = Vec::new();
for (index, interrupt) in self.interrupts.iter().enumerate() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let interrupt_config = self.interrupts.iter().enumerate().filter_map(|(index, interrupt)| { ... }.collect();

mut device: T,
dma_buffer: MemoryBlock,
) -> anyhow::Result<Self> {
tracing::info!("restoring gdma driver");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of this code at the beginning should be moved into a helper so new() and restore() can both use it

state: CqEqSavedState,
doorbell: DoorbellPage,
) -> anyhow::Result<Self> {
Ok(Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend the same pattern as new_cq where this just calls a Self::restore(...) function.

#[mesh(9)]
pub hwc_subscribed: bool,

/// Whether the eq is armed or not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would just be safest to not save armed state and always arm on restore.

/// Saved state of an interrupt for restoration during servicing
#[derive(Protobuf, Clone, Debug)]
#[mesh(package = "mana_driver")]
pub struct InterruptSavedState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct behavior, but it is interesting to note that the netvsp code will immediately retarget all of the queues on restore with these same values, so holistically there is some redundancy here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants