Skip to content

feat: track per-device credential rotation convergence#2846

Merged
spydaNVIDIA merged 3 commits into
NVIDIA:mainfrom
spydaNVIDIA:rotate
Jun 26, 2026
Merged

feat: track per-device credential rotation convergence#2846
spydaNVIDIA merged 3 commits into
NVIDIA:mainfrom
spydaNVIDIA:rotate

Conversation

@spydaNVIDIA

Copy link
Copy Markdown
Contributor

Introduce the bookkeeping the future credential-rotation engine needs:
two tables (sitewide_credential_rotation, device_credential_rotation)
plus a one-time backfill recording every already-ingested device at v0,
and a db::credential_rotation writer that idempotently records a device
as converged to the current site-wide target.
Wire convergence recording at the points where NICo actually owns a
credential on the device, each committed atomically with the surrounding
write/transition:

  • bmc -- when the per-device BMC secret is written
    (site-explorer set_bmc_root_credentials)
  • host_uefi -- alongside the bios_password_set_time stamp (API handler
    and machine-controller UEFI setup)
  • dpu_uefi -- after uefi_setup(dpu) succeeds in the DPU init flow
  • lockdown_ikm -- when a SuperNIC card is confirmed locked (dpa-manager
    handle_locking), keyed by NIC MAC
  • nvos -- wired but gated off until REQ-6 (set-from-factory) lands;
    today NICo only copies the operator credential into Vault
    Add a dedicated, versioned site-wide lockdown IKM credential key
    (nic_lockdown_ikm/site/root/v{N}) seeded from the BMC root, decoupling
    NIC lockdown key derivation from the BMC root so the two rotate
    independently. Make a known BMC MAC a hard precondition for UEFI password
    operations, failing up front instead of after driving the device.

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@spydaNVIDIA spydaNVIDIA requested a review from a team as a code owner June 24, 2026 16:43
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features

    • Added credential-rotation convergence tracking for BMC, Host UEFI, DPU UEFI, and SuperNIC lockdown IKM, including safe staging/promoting of in-flight versions.
    • Updated SuperNIC lockdown key handling to expose both the derived key and the IKM derivation version.
    • Introduced database-backed rotation bookkeeping and a one-time backfill for existing devices (excluding NVOS).
  • Bug Fixes

    • Prevented UEFI password setup when required BMC MAC data is missing.
    • Improved cleanup of convergence markers during credential and forced machine deletion (best-effort on cleanup failures).
  • Tests

    • Added SQL integration coverage for bookkeeping and backfill migration idempotency.

Walkthrough

Adds credential-rotation schema and runtime bookkeeping, versioned UEFI credential paths, and convergence recording or cleanup across lockdown, BMC, UEFI, DPA, machine-controller, and NVOS flows. BMC explorer setup now accepts a database pool.

Changes

Credential Rotation Infrastructure

Layer / File(s) Summary
Schema tables and indexes
crates/api-db/migrations/20260623120000_credential_rotation.sql
Adds the credential_rotation_type enum, the sitewide and per-device rotation tables, and supporting indexes.
Backfill baseline rows
crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql, crates/api-db/src/credential_rotation/test_backfill.rs
Seeds baseline sitewide and per-device rotation rows and validates the shipped backfill against representative devices.
Runtime rotation bookkeeping
crates/api-db/src/lib.rs, crates/api-core/src/errors.rs, crates/api-db/src/credential_rotation.rs
Exports the rotation module, adds the missing-sitewide-target error variant, implements record, stage, promote, and delete helpers, and includes integration tests.
Credential path variants
crates/secrets/src/credentials.rs
Replaces the shared UefiSiteTarget abstraction with explicit host and DPU site-versioned credential keys, updated prefixes, and path tests.
Lockdown key versioning
crates/api-core/src/dpa/lockdown.rs, crates/api-core/src/handlers/mlx_admin.rs
Returns the resolved IKM version with the lockdown secret, carries it in SupernicLockdownKey, and updates the mlx admin caller.
Lock convergence handoff
crates/api-core/src/handlers/dpa.rs, crates/dpa-manager/Cargo.toml, crates/dpa-manager/src/lib.rs, crates/dpa-manager/src/card_handler/svpc.rs
Stages LockdownIkm rotation before lock issuance and promotes or records convergence when the DPA manager handles a locked card.
BMC convergence wiring
crates/api-core/src/setup.rs, crates/api-core/src/test_support/builder.rs, crates/site-explorer/src/lib.rs, crates/site-explorer/src/bmc_endpoint_explorer.rs, crates/api-core/src/handlers/credential.rs, crates/bmc-explorer-cli/src/main.rs
Wires the database pool into BMC explorer construction, records BMC convergence after Vault writes, and deletes the marker on root-credential removal.
UEFI convergence flows
crates/api-core/src/handlers/uefi.rs, crates/api-core/src/handlers/machine.rs, crates/machine-controller/src/handler.rs
Requires BMC MACs for host and DPU UEFI setup, records convergence after password setup, and clears the host marker during force delete.
NVOS rotation flow
crates/switch-controller/src/configuring.rs
Splits the switch OS-password path around a shared completion outcome and records NVOS convergence in the disabled branch.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related PRs

  • NVIDIA/infra-controller#2785: Also updates crates/secrets/src/credentials.rs with versioned site-wide credential target paths and prefix changes.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: tracking per-device credential rotation convergence.
Description check ✅ Passed The description is directly aligned with the implemented credential-rotation bookkeeping and convergence wiring.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.23.0)
crates/api-core/src/handlers/credential.rs

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.11][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

crates/api-core/src/handlers/machine.rs

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.16][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

crates/api-core/src/errors.rs

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.17][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

  • 17 others

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (5)
crates/api-core/src/handlers/credential.rs (1)

623-628: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Anchor this TODO to a concrete tracking item.

Please include an issue/REQ reference in this TODO so cleanup of device_credential_rotation cannot be silently forgotten when the rotation-engine MR sequencing changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/credential.rs` around lines 623 - 628, The TODO
in credential rotation cleanup is too vague and can be missed later; update the
comment in credential.rs near the device_credential_rotation/Vault secret
removal logic to include a concrete issue or REQ reference. Anchor it to the
relevant tracking item tied to the rotation-engine MR so the planned deletion of
the matching device_credential_rotation row is explicitly traceable and not
forgotten if sequencing changes.
crates/secrets/src/credentials.rs (1)

1120-1122: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the exact prefix set, not just the count.

Line 1122 can still pass if CredentialPrefix::all() contains a duplicate and omits a real variant. Pinning the expected variants makes this test match its “contains every variant” intent.

Proposed test strengthening
 fn prefix_all_is_complete() {
     let all = CredentialPrefix::all();
-    assert_eq!(all.len(), 16);
+    let expected = [
+        CredentialPrefix::DpuSsh,
+        CredentialPrefix::DpuHbn,
+        CredentialPrefix::DpuRedfish,
+        CredentialPrefix::Bgp,
+        CredentialPrefix::HostRedfish,
+        CredentialPrefix::UfmAuth,
+        CredentialPrefix::DpuUefi,
+        CredentialPrefix::HostUefi,
+        CredentialPrefix::BmcCredentials,
+        CredentialPrefix::NicLockdownIkm,
+        CredentialPrefix::ExtensionService,
+        CredentialPrefix::NmxM,
+        CredentialPrefix::SwitchNvosAdmin,
+        CredentialPrefix::MqttAuth,
+        CredentialPrefix::MachineIdentityEncryptionKey,
+        CredentialPrefix::RackMaintenanceAccessToken,
+    ];
+
+    assert_eq!(all.len(), expected.len());
+    for prefix in expected {
+        assert!(
+            all.contains(&prefix),
+            "{prefix:?} missing from CredentialPrefix::all()"
+        );
+    }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/credentials.rs` around lines 1120 - 1122, The prefix
coverage test in prefix_all_is_complete only checks the length of
CredentialPrefix::all(), so it can miss duplicates or omissions; update the test
to assert the exact set of expected CredentialPrefix variants returned by
CredentialPrefix::all() using the existing enum symbols, so the test verifies
completeness rather than just count.
crates/api-core/src/handlers/uefi.rs (1)

262-262: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured fields for this error log.

Prefer tracing::error!(error=%e, "Failed to update bios_password_set_time"); so the log remains searchable as logfmt. As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/uefi.rs` at line 262, The `tracing::error!` call
in the `uefi` handler is using string interpolation instead of structured
logging. Update the error log at the `bios_password_set_time` failure site to
pass the error as a field (for example, an `error` attribute) and keep the
message text separate, so the log stays searchable and follows the tracing
logging guidelines.

Source: Coding guidelines

crates/switch-controller/src/configuring.rs (2)

90-111: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Make the disabled REQ-6 branch fail closed instead of recording convergence.

Line 97 can persist nvos convergence while the required switch rotation/Vault storage is still only a TODO. Also, line 91 opens the DB transaction before the future device/Vault work described at lines 93-96, which sets up a transaction-across-external-I/O pattern.

Suggested fail-closed shape until REQ-6 lands
     if update_device_password {
-        let mut txn = ctx.services.db_pool.begin().await?;
-
         // TODO(REQ-6): rotate the NVOS password on the switch here (factory ->
         // NICo-owned) and store the resulting credential in Vault before recording
         // convergence below. Only once the password is actually changed on the
         // device has the switch converged to the current nvos target.
-        db::credential_rotation::record_device_converged(
-            &mut txn,
-            bmc_mac_address,
-            db::credential_rotation::CredentialRotationType::Nvos,
-        )
-        .await
-        .map_err(|e| {
-            StateHandlerError::GenericError(eyre::eyre!(
-                "Switch {:?}: failed to record nvos rotation convergence: {}",
-                switch_id,
-                e
-            ))
-        })?;
-
-        Ok(outcome.with_txn(txn))
+        Err(StateHandlerError::GenericError(eyre::eyre!(
+            "REQ-6 NVOS password rotation is not implemented"
+        )))
     } else {

When REQ-6 is implemented, perform switch/Vault I/O first, then open the DB transaction only for record_device_converged plus the state transition. As per coding guidelines, “Remove unused code, temporary logging, skipped assertions, placeholders, and hidden TODOs before asking for review” and “Do not hold a transaction open while doing long-running work.” As per path instructions, controller logic should be reviewed for state-machine transitions and safe recovery from partial failures.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 90 - 111, The
disabled REQ-6 branch in configuring.rs currently records nvos convergence even
though the required switch password rotation and Vault write are still TODO, so
it should fail closed instead of persisting success. Update the
update_device_password path in configuring logic to return an error or otherwise
stop before calling db::credential_rotation::record_device_converged until REQ-6
is fully implemented, and avoid opening ctx.services.db_pool.begin() until after
any future switch/Vault I/O is complete so no transaction spans external work.
Use the existing StateHandlerError and record_device_converged flow to locate
the branch and keep the state transition only in the completed, implemented
path.

Sources: Coding guidelines, Path instructions


138-142: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured tracing fields for switch/NVOS logs.

These service logs should expose switch_id and bmc_mac_address as fields instead of embedding them in the message.

Suggested logfmt-friendly rewrite
                 tracing::info!(
-                    "Switch {:?}: no NVOS credentials in vault or expected switch for BMC MAC {}, skipping",
-                    switch_id,
-                    bmc_mac_address
+                    ?switch_id,
+                    %bmc_mac_address,
+                    "no NVOS credentials in vault or expected switch; skipping"
                 );
@@
         tracing::info!(
-            "Switch {:?}: stored NVOS admin credentials from expected switch into vault for BMC MAC {}",
-            switch_id,
-            bmc_mac_address
+            ?switch_id,
+            %bmc_mac_address,
+            "stored NVOS admin credentials from expected switch into vault"
         );

As per coding guidelines, “All services should emit logs in 'logfmt' syntax” and common fields should be passed as tracing attributes instead of string interpolation.

Also applies to: 161-165

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 138 - 142, The
tracing logs in the NVOS credential lookup flow are embedding switch identifiers
in the message instead of emitting structured fields. Update the
`tracing::info!` calls in `configuring.rs`—including the one in the
`switch_id`/`bmc_mac_address` skip path and the related log at the other
referenced location—to pass `switch_id` and `bmc_mac_address` as tracing
attributes, while keeping the message text generic and logfmt-friendly.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-core/src/handlers/uefi.rs`:
- Around line 235-255: The HostUefi convergence write in uefi_setup is happening
before the BMC job is actually confirmed, and record_device_converged currently
leaves existing backfilled rows stuck on old versions because it does DO NOTHING
on conflict. Update the uefi_setup flow to only call the
convergence/bios_password_set_time updates after confirmed job completion, and
revise db::credential_rotation::record_device_converged to upsert
current_version for existing rows instead of ignoring conflicts. Keep the
convergence write tied to the same transaction path used by
db::machine::update_bios_password_set_time and
db::credential_rotation::CredentialRotationType::HostUefi.

In `@crates/api-db/migrations/20260623120000_credential_rotation.sql`:
- Around line 14-40: Add non-negative CHECK constraints to the rotation-state
tables so invalid version/counter values are rejected at insert/update time.
Update the `sitewide_credential_rotation` and `device_credential_rotation`
definitions to enforce `target_version`, `current_version`,
`rotating_to_version`, and `rotate_attempts` being either NULL where allowed or
>= 0, using the existing column names in the migration so the schema hardening
stays close to the `CREATE TABLE` statements.

In `@crates/api-db/src/credential_rotation.rs`:
- Around line 78-83: `record_device_converged` is masking a missing
`sitewide_credential_rotation` row by defaulting to version 0, which can write
an incorrect convergence state. Update the logic in `credential_rotation.rs` so
the insert/query path fails with an error when the site-wide target row for the
requested credential type is absent instead of using `COALESCE(..., 0)`, and
keep the NVOS branch gated until it has its own real
`sitewide_credential_rotation` entry. Use the `record_device_converged` flow and
the site-wide lookup by `credential_type` as the key places to adjust.

In `@crates/dpa-manager/src/card_handler/svpc.rs`:
- Around line 500-519: The locking flow in `handle_locking` is re-reading
mutable site-wide rotation state when calling `record_device_converged`, which
can persist the wrong version if the global target changes mid-operation. Thread
the exact lockdown version derived when building the lock command through this
path, and pass that explicit version into the convergence write instead of
letting `record_device_converged` resolve it from
`sitewide_credential_rotation`. Update the `handle_locking` transaction sequence
so the NIC MAC is recorded against the same version used by
`build_lock_command`, preserving consistency during partial failures.

In `@crates/machine-controller/src/handler.rs`:
- Around line 4400-4420: The DpuUefi convergence write in the state machine is
happening too early in the transition flow. Move the
`db::credential_rotation::record_device_converged` call in `handler.rs` into the
`Ok(true)` branch of `PollingBiosSetup` where `is_bios_setup` is confirmed, and
commit it together with that successful transition; use the existing
`StateHandlerOutcome::transition`/txn pattern so `record_device_converged` only
runs after BIOS setup verification.

In `@crates/site-explorer/src/bmc_endpoint_explorer.rs`:
- Around line 143-149: The convergence flow in record_device_converged is
treating existing device_credential_rotation rows as insert-only, which leaves
current_version stale on later BMC rotations. Update the helper in
credential_rotation so the existing row is updated when a matching device/type
record already exists instead of using DO NOTHING, and keep the call site in
bmc_endpoint_explorer::record_device_converged unchanged except for benefiting
from the new upsert behavior. Focus on the insert/update logic around
record_device_converged and the underlying device convergence write path in
credential_rotation.rs.

---

Nitpick comments:
In `@crates/api-core/src/handlers/credential.rs`:
- Around line 623-628: The TODO in credential rotation cleanup is too vague and
can be missed later; update the comment in credential.rs near the
device_credential_rotation/Vault secret removal logic to include a concrete
issue or REQ reference. Anchor it to the relevant tracking item tied to the
rotation-engine MR so the planned deletion of the matching
device_credential_rotation row is explicitly traceable and not forgotten if
sequencing changes.

In `@crates/api-core/src/handlers/uefi.rs`:
- Line 262: The `tracing::error!` call in the `uefi` handler is using string
interpolation instead of structured logging. Update the error log at the
`bios_password_set_time` failure site to pass the error as a field (for example,
an `error` attribute) and keep the message text separate, so the log stays
searchable and follows the tracing logging guidelines.

In `@crates/secrets/src/credentials.rs`:
- Around line 1120-1122: The prefix coverage test in prefix_all_is_complete only
checks the length of CredentialPrefix::all(), so it can miss duplicates or
omissions; update the test to assert the exact set of expected CredentialPrefix
variants returned by CredentialPrefix::all() using the existing enum symbols, so
the test verifies completeness rather than just count.

In `@crates/switch-controller/src/configuring.rs`:
- Around line 90-111: The disabled REQ-6 branch in configuring.rs currently
records nvos convergence even though the required switch password rotation and
Vault write are still TODO, so it should fail closed instead of persisting
success. Update the update_device_password path in configuring logic to return
an error or otherwise stop before calling
db::credential_rotation::record_device_converged until REQ-6 is fully
implemented, and avoid opening ctx.services.db_pool.begin() until after any
future switch/Vault I/O is complete so no transaction spans external work. Use
the existing StateHandlerError and record_device_converged flow to locate the
branch and keep the state transition only in the completed, implemented path.
- Around line 138-142: The tracing logs in the NVOS credential lookup flow are
embedding switch identifiers in the message instead of emitting structured
fields. Update the `tracing::info!` calls in `configuring.rs`—including the one
in the `switch_id`/`bmc_mac_address` skip path and the related log at the other
referenced location—to pass `switch_id` and `bmc_mac_address` as tracing
attributes, while keeping the message text generic and logfmt-friendly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0bc64e67-fdfc-43d3-9f12-3603cc5f75e5

📥 Commits

Reviewing files that changed from the base of the PR and between 874659f and d8aa88f.

📒 Files selected for processing (17)
  • crates/api-core/src/dpa/lockdown.rs
  • crates/api-core/src/handlers/credential.rs
  • crates/api-core/src/handlers/uefi.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-db/migrations/20260623120000_credential_rotation.sql
  • crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql
  • crates/api-db/src/credential_rotation.rs
  • crates/api-db/src/credential_rotation_backfill.rs
  • crates/api-db/src/lib.rs
  • crates/bmc-explorer-cli/src/main.rs
  • crates/dpa-manager/src/card_handler/svpc.rs
  • crates/machine-controller/src/handler.rs
  • crates/secrets/src/credentials.rs
  • crates/site-explorer/src/bmc_endpoint_explorer.rs
  • crates/site-explorer/src/lib.rs
  • crates/switch-controller/src/configuring.rs

Comment thread crates/api-core/src/handlers/uefi.rs
Comment thread crates/api-db/migrations/20260623120000_credential_rotation.sql
Comment thread crates/api-db/src/credential_rotation.rs Outdated
Comment thread crates/dpa-manager/src/card_handler/svpc.rs Outdated
Comment thread crates/machine-controller/src/handler.rs
Comment thread crates/site-explorer/src/bmc_endpoint_explorer.rs
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 283 6 24 98 7 148
machine-validation-runner 744 32 188 267 36 221
machine_validation 744 32 188 267 36 221
machine_validation-aarch64 744 32 188 267 36 221
nvmetal-carbide 744 32 188 267 36 221
TOTAL 3265 134 776 1172 151 1032

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Comment thread crates/api-db/src/credential_rotation/test_backfill.rs
Comment thread crates/dpa-manager/src/card_handler/svpc.rs Outdated
@spydaNVIDIA

Copy link
Copy Markdown
Contributor Author

https://github.com/coderabbitai full review

@spydaNVIDIA

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
crates/secrets/src/credentials.rs (1)

1122-1122: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the exact prefix set instead of only its length.

This test claims CredentialPrefix::all() is complete, but a length-only assertion can still pass if one variant is dropped and another is added. Pin the exact list so Vault prefix drift is caught by name.

Suggested test tightening
         let all = CredentialPrefix::all();
-        assert_eq!(all.len(), 16);
+        assert_eq!(
+            all.as_ref(),
+            &[
+                CredentialPrefix::DpuSsh,
+                CredentialPrefix::DpuHbn,
+                CredentialPrefix::DpuRedfish,
+                CredentialPrefix::Bgp,
+                CredentialPrefix::HostRedfish,
+                CredentialPrefix::UfmAuth,
+                CredentialPrefix::DpuUefi,
+                CredentialPrefix::HostUefi,
+                CredentialPrefix::BmcCredentials,
+                CredentialPrefix::NicLockdownIkm,
+                CredentialPrefix::ExtensionService,
+                CredentialPrefix::NmxM,
+                CredentialPrefix::SwitchNvosAdmin,
+                CredentialPrefix::MqttAuth,
+                CredentialPrefix::MachineIdentityEncryptionKey,
+                CredentialPrefix::RackMaintenanceAccessToken,
+            ],
+        );

As per coding guidelines, “Verification should exercise the behavior that changed.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/credentials.rs` at line 1122, The test around
CredentialPrefix::all() is too weak because it only checks the number of
prefixes, so update the assertion to verify the exact set of returned variants
by name. Modify the test that currently uses all.len() to compare against the
full expected list from CredentialPrefix::all(), ensuring any dropped or added
prefix is caught explicitly rather than just matching the count.

Source: Coding guidelines

crates/switch-controller/src/configuring.rs (2)

138-165: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured tracing fields for switch identifiers.

These changed logs interpolate switch_id and bmc_mac_address into the message, which makes logfmt filtering harder. Put them in tracing fields instead.

As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions instead of using string interpolation.”

Proposed fix
                 tracing::info!(
-                    "Switch {:?}: no NVOS credentials in vault or expected switch for BMC MAC {}, skipping",
-                    switch_id,
-                    bmc_mac_address
+                    ?switch_id,
+                    %bmc_mac_address,
+                    "no NVOS credentials in vault or expected switch; skipping"
                 );
@@
         tracing::info!(
-            "Switch {:?}: stored NVOS admin credentials from expected switch into vault for BMC MAC {}",
-            switch_id,
-            bmc_mac_address
+            ?switch_id,
+            %bmc_mac_address,
+            "stored NVOS admin credentials from expected switch into vault"
         );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 138 - 165, The
`tracing::info!` and error mapping in `configuring.rs` are interpolating
`switch_id` and `bmc_mac_address` into message strings instead of emitting them
as structured fields. Update the logging around the credential storage flow in
this block so the `tracing` calls for the NVOS credential skip/store path and
the `set_credentials` error context attach `switch_id` and `bmc_mac_address` as
fields, while keeping the human-readable message text separate.

Source: Coding guidelines


90-98: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Open the DB transaction only after future device/Vault work succeeds.

Line 91 starts a transaction before the REQ-6 TODO. If that TODO is implemented in place, the handler will hold a DB transaction across switch I/O and Vault writes.

As per coding guidelines, “Do not hold a transaction open while doing long-running work.”

Proposed fix
     let update_device_password = false;
     if update_device_password {
-        let mut txn = ctx.services.db_pool.begin().await?;
-
         // TODO(REQ-6): rotate the NVOS password on the switch here (factory ->
         // NICo-owned) and store the resulting credential in Vault before recording
         // convergence below. Only once the password is actually changed on the
         // device has the switch converged to the current nvos target.
+        let mut txn = ctx.services.db_pool.begin().await?;
         db::credential_rotation::record_device_converged(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 90 - 98, The
transaction in the update_device_password path is opened too early in
configuring::record_device_converged, which would keep a DB transaction open
during future switch and Vault work. Move the ctx.services.db_pool.begin() call
to after the REQ-6 device password rotation and Vault credential write succeed,
then use that transaction only for the final
db::credential_rotation::record_device_converged call.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-core/src/handlers/credential.rs`:
- Around line 629-638: The BMC session flush in credential rotation is happening
after the fallible convergence-marker DB cleanup, so a DB failure can skip
flushing even when the Vault secret was already deleted. In the credential
rotation flow around the txn_begin/commit path in the credential handler, move
api.bmc_session_manager.flush_mac(bmc_mac_address).await to run immediately
after the Vault delete succeeds, then keep the
db::credential_rotation::delete_device_converged cleanup and transaction commit
afterward.

In `@crates/api-db/src/credential_rotation.rs`:
- Around line 225-233: Reformat the failing assertion block in the credential
rotation test so it matches rustfmt output; the issue is only formatting, so
adjust the multiline assert_eq! invocation and surrounding lines in the test
that uses delete_device_converged and version_of until cargo fmt --check passes
cleanly.

In `@crates/site-explorer/src/bmc_endpoint_explorer.rs`:
- Around line 135-150: The BMC convergence write in
bmc_endpoint_explorer::BmcEndpointExplorer is currently happening for every
per-device Vault write, including the retain_credentials path that only reuses
the existing password. Update the flow so convergence is recorded only when the
site-wide BMC credential was actually validated or rotated, not when retained
credentials are stored. Add an explicit boolean flag to the relevant helper path
and pass !retain_credentials from set_sitewide_bmc_root_password and true from
try_sitewide_bmc_root_credentials, then guard the
db::credential_rotation::record_device_converged call with that flag.

---

Nitpick comments:
In `@crates/secrets/src/credentials.rs`:
- Line 1122: The test around CredentialPrefix::all() is too weak because it only
checks the number of prefixes, so update the assertion to verify the exact set
of returned variants by name. Modify the test that currently uses all.len() to
compare against the full expected list from CredentialPrefix::all(), ensuring
any dropped or added prefix is caught explicitly rather than just matching the
count.

In `@crates/switch-controller/src/configuring.rs`:
- Around line 138-165: The `tracing::info!` and error mapping in
`configuring.rs` are interpolating `switch_id` and `bmc_mac_address` into
message strings instead of emitting them as structured fields. Update the
logging around the credential storage flow in this block so the `tracing` calls
for the NVOS credential skip/store path and the `set_credentials` error context
attach `switch_id` and `bmc_mac_address` as fields, while keeping the
human-readable message text separate.
- Around line 90-98: The transaction in the update_device_password path is
opened too early in configuring::record_device_converged, which would keep a DB
transaction open during future switch and Vault work. Move the
ctx.services.db_pool.begin() call to after the REQ-6 device password rotation
and Vault credential write succeed, then use that transaction only for the final
db::credential_rotation::record_device_converged call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 41bcc299-b7e9-4b4b-9278-18737148edcb

📥 Commits

Reviewing files that changed from the base of the PR and between ea94336 and b1f51aa.

📒 Files selected for processing (18)
  • crates/api-core/src/dpa/lockdown.rs
  • crates/api-core/src/handlers/credential.rs
  • crates/api-core/src/handlers/machine.rs
  • crates/api-core/src/handlers/uefi.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-db/migrations/20260623120000_credential_rotation.sql
  • crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql
  • crates/api-db/src/credential_rotation.rs
  • crates/api-db/src/credential_rotation_backfill.rs
  • crates/api-db/src/lib.rs
  • crates/bmc-explorer-cli/src/main.rs
  • crates/dpa-manager/src/card_handler/svpc.rs
  • crates/machine-controller/src/handler.rs
  • crates/secrets/src/credentials.rs
  • crates/site-explorer/src/bmc_endpoint_explorer.rs
  • crates/site-explorer/src/lib.rs
  • crates/switch-controller/src/configuring.rs

Comment thread crates/api-core/src/handlers/credential.rs
Comment thread crates/api-db/src/credential_rotation.rs
Comment on lines +135 to +150
if let Some(database_connection) = &self.database_connection {
let record_err = |cause: String| EndpointExplorationError::SetCredentials {
key: format!("device_credential_rotation/bmc/{bmc_mac_address}"),
cause,
};
let mut txn = db::Transaction::begin(database_connection)
.await
.map_err(|e| record_err(e.to_string()))?;
db::credential_rotation::record_device_converged(
&mut txn,
bmc_mac_address,
db::credential_rotation::CredentialRotationType::Bmc,
)
.await
.map_err(|e| record_err(e.to_string()))?;
txn.commit().await.map_err(|e| record_err(e.to_string()))?;

@coderabbitai coderabbitai Bot Jun 24, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not mark retained BMC credentials as converged.

This records BMC convergence for every per-device Vault write, but the retain_credentials path skips rotation and stores the existing password at Lines 268-296. That can mark a device as current for the site-wide BMC target even though the hardware is still using retained credentials, causing rotation bookkeeping to skip required work.

Consider passing an explicit “record convergence” flag from the proven-sitewide paths, or move this write into only the branches that rotate/validate the site-wide credential.

One localized shape for the fix
 pub async fn set_bmc_root_credentials(
     &self,
     bmc_mac_address: MacAddress,
     credentials: &Credentials,
+    record_bmc_convergence: bool,
 ) -> Result<(), EndpointExplorationError> {
@@
-    if let Some(database_connection) = &self.database_connection {
+    if record_bmc_convergence {
+        if let Some(database_connection) = &self.database_connection {
             let record_err = |cause: String| EndpointExplorationError::SetCredentials {
                 key: format!("device_credential_rotation/bmc/{bmc_mac_address}"),
                 cause,
             };
@@
-        txn.commit().await.map_err(|e| record_err(e.to_string()))?;
+            txn.commit().await.map_err(|e| record_err(e.to_string()))?;
+        }
     }

Then call it with !retain_credentials from set_sitewide_bmc_root_password and true from try_sitewide_bmc_root_credentials.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/bmc_endpoint_explorer.rs` around lines 135 - 150,
The BMC convergence write in bmc_endpoint_explorer::BmcEndpointExplorer is
currently happening for every per-device Vault write, including the
retain_credentials path that only reuses the existing password. Update the flow
so convergence is recorded only when the site-wide BMC credential was actually
validated or rotated, not when retained credentials are stored. Add an explicit
boolean flag to the relevant helper path and pass !retain_credentials from
set_sitewide_bmc_root_password and true from try_sitewide_bmc_root_credentials,
then guard the db::credential_rotation::record_device_converged call with that
flag.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be noted that retain_credentials is NOT intended for production environments,
as the expected BMC credentials are not stored securely (yet, but I'm
actually working on it, even though the expected credentials are
generally temporary).

^ the above is from the commit message for the above code block. verify on your end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the commit message on the retain_credentials code bklock:

It should be noted retain_credentials is NOT intended for production environments,
as the expected BMC credentials are not stored securely (yet, but I'm
actually working on it, even though the expected credentials are
generally temporary).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Commit message for f64d4af439329c2ec97ff24558875c007b9128e9 =="
gh api repos/NVIDIA/infra-controller/commits/f64d4af439329c2ec97ff24558875c007b9128e9 --jq '.commit.message'

echo
echo "== retain_credentials usages with context =="
rg -n -C 3 '\bretain_credentials\b' .

echo
echo "== Files mentioning production/non-production near retain_credentials =="
rg -n -C 3 'retain_credentials|non-production|production' crates model

Length of output: 146498



@spydaNVIDIA, I verified it on my side.

For the matched comment commit (f64d4af439329c2ec97ff24558875c007b9128e9), the commit message is just:

fix: mr review

So I could not confirm that note from the current commit metadata.

I also checked the code paths around bmc_retain_credentials:

  • crates/site-explorer/src/bmc_endpoint_explorer.rs Line 268 explicitly skips BMC password rotation when retain_credentials is true.
  • crates/api-model/src/expected_machine.rs, crates/api-model/src/expected_switch.rs, and crates/api-model/src/expected_power_shelf.rs expose it as a normal field.
  • crates/rpc/proto/forge.proto and the admin CLI describe the behavior, but I did not find a production-only guard or other enforcement.

So, operationally, my original concern still stands: even if the intent is “non-production only,” the current code still marks the device converged on the retain path, and nothing I found prevents that path from being used in a production deployment.

If that non-production-only constraint is real, I’d treat it as missing enforcement/documentation rather than sufficient reason to withdraw the comment.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spydaNVIDIA, thanks — I verified the code path, and this clarification helps, but it does not fully invalidate the concern on its own.

The current implementation in crates/site-explorer/src/bmc_endpoint_explorer.rs still records BMC convergence from BmcEndpointExplorer::set_bmc_root_credentials(...) for any successful per-device Vault write. In BmcEndpointExplorer::set_sitewide_bmc_root_password(...), the retain_credentials branch explicitly skips rotating to the site-wide BMC password and then calls that helper anyway. So, as written, the bookkeeping can still mark the device as converged even when the hardware is intentionally left on retained credentials.

If the intended contract is “retain_credentials is non-production only, and its bookkeeping does not matter,” I would recommend making that invariant explicit in code rather than relying on commit-message context alone. The cleanest production-grade fix is still to gate convergence recording behind an explicit flag, or otherwise avoid recording convergence from the retained-credentials branch.

So: noted context, but I would keep the review comment open unless that invariant is enforced in code.


✏️ Learnings added
Learnt from: spydaNVIDIA
Repo: NVIDIA/infra-controller PR: 2846
File: crates/site-explorer/src/bmc_endpoint_explorer.rs:135-150
Timestamp: 2026-06-26T06:46:05.980Z
Learning: In `crates/site-explorer/src/bmc_endpoint_explorer.rs`, the `retain_credentials` path in `BmcEndpointExplorer::set_sitewide_bmc_root_password` is intended only for non-production environments. The expected BMC credentials used there are not yet stored securely, and are generally temporary.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@spydaNVIDIA spydaNVIDIA force-pushed the rotate branch 2 times, most recently from 92612d5 to ccf58cf Compare June 26, 2026 03:30
@github-actions

Copy link
Copy Markdown

@spydaNVIDIA

Copy link
Copy Markdown
Contributor Author

https://github.com/coderabbitai full review

@spydaNVIDIA

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/dpa-manager/src/card_handler/svpc.rs (1)

701-759: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use table-driven coverage for these convergence cases.

These two tests exercise the same operation with different setup and expectations. Please collapse them into a table-driven test using the repository test helpers so staged, fallback, and future edge cases remain one row per contract. As per coding guidelines, “Use table-driven test style when writing tests in Rust” and prefer carbide-test-support scenarios/check helpers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/dpa-manager/src/card_handler/svpc.rs` around lines 701 - 759, The two
`#[crate::sqlx_test]` cases for `record_lock_convergence` duplicate the same
flow with only setup and expected outcome changing, so collapse them into one
table-driven test. Use the repository test helpers from `carbide-test-support`
and structure rows around each contract (staged promotion, fallback to sitewide
target, and future edge cases) so the assertions for `lockdown_version_of` and
`lockdown_rotating_version_of` stay data-driven and easy to extend.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/dpa-manager/src/card_handler/svpc.rs`:
- Around line 701-759: The two `#[crate::sqlx_test]` cases for
`record_lock_convergence` duplicate the same flow with only setup and expected
outcome changing, so collapse them into one table-driven test. Use the
repository test helpers from `carbide-test-support` and structure rows around
each contract (staged promotion, fallback to sitewide target, and future edge
cases) so the assertions for `lockdown_version_of` and
`lockdown_rotating_version_of` stay data-driven and easy to extend.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4be73da0-3179-4959-a9b0-eab1dd1da89a

📥 Commits

Reviewing files that changed from the base of the PR and between b1f51aa and ccf58cf.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • crates/api-core/src/dpa/lockdown.rs
  • crates/api-core/src/errors.rs
  • crates/api-core/src/handlers/credential.rs
  • crates/api-core/src/handlers/dpa.rs
  • crates/api-core/src/handlers/machine.rs
  • crates/api-core/src/handlers/mlx_admin.rs
  • crates/api-core/src/handlers/uefi.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-db/migrations/20260623120000_credential_rotation.sql
  • crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql
  • crates/api-db/src/credential_rotation.rs
  • crates/api-db/src/credential_rotation/test_backfill.rs
  • crates/api-db/src/lib.rs
  • crates/bmc-explorer-cli/src/main.rs
  • crates/dpa-manager/Cargo.toml
  • crates/dpa-manager/src/card_handler/svpc.rs
  • crates/dpa-manager/src/lib.rs
  • crates/machine-controller/src/handler.rs
  • crates/secrets/src/credentials.rs
  • crates/site-explorer/src/bmc_endpoint_explorer.rs
  • crates/site-explorer/src/lib.rs
  • crates/switch-controller/src/configuring.rs
✅ Files skipped from review due to trivial changes (1)
  • crates/api-db/src/credential_rotation/test_backfill.rs
🚧 Files skipped from review as they are similar to previous changes (13)
  • crates/bmc-explorer-cli/src/main.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/handlers/credential.rs
  • crates/api-db/migrations/20260623120000_credential_rotation.sql
  • crates/api-core/src/test_support/builder.rs
  • crates/site-explorer/src/lib.rs
  • crates/api-core/src/handlers/machine.rs
  • crates/api-core/src/handlers/uefi.rs
  • crates/switch-controller/src/configuring.rs
  • crates/site-explorer/src/bmc_endpoint_explorer.rs
  • crates/machine-controller/src/handler.rs
  • crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql
  • crates/secrets/src/credentials.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
crates/switch-controller/src/configuring.rs (2)

145-149: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured tracing fields here.

These tracing::info! calls interpolate the key identifiers into the message body, which makes logfmt search and aggregation harder. Emit switch_id and bmc_mac_address as fields instead.

Proposed change
-                tracing::info!(
-                    "Switch {:?}: no NVOS credentials in vault or expected switch for BMC MAC {}, skipping",
-                    switch_id,
-                    bmc_mac_address
-                );
+                tracing::info!(
+                    switch_id = ?switch_id,
+                    %bmc_mac_address,
+                    "no NVOS credentials in vault or expected switch; skipping"
+                );
@@
-        tracing::info!(
-            "Switch {:?}: stored NVOS admin credentials from expected switch into vault for BMC MAC {}",
-            switch_id,
-            bmc_mac_address
-        );
+        tracing::info!(
+            switch_id = ?switch_id,
+            %bmc_mac_address,
+            "stored NVOS admin credentials from expected switch into vault"
+        );

As per coding guidelines, "All services should emit logs in 'logfmt' syntax" and "prefer placing common fields as attributes passed to tracing function, instead of using string interpolation."

Also applies to: 168-172

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 145 - 149, The
tracing logs in `configuring.rs` are interpolating `switch_id` and
`bmc_mac_address` into the message body instead of emitting them as structured
fields. Update the affected `tracing::info!` calls in the relevant branches of
`configuring.rs` to pass `switch_id` and `bmc_mac_address` as tracing attributes
alongside a stable message string, using the existing log sites around the NVOS
credential skip paths.

Source: Coding guidelines


119-174: 📐 Maintainability & Code Quality | 🔵 Trivial

Add focused tests for the NVOS credential backfill path.
A small table-driven test should cover the three outcomes here: missing expected-switch row, missing NVOS credentials on the expected switch, and successful Vault write. That will lock in the branch behavior and make future changes safer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/switch-controller/src/configuring.rs` around lines 119 - 174, The NVOS
credential backfill branch in the configuring state needs focused test coverage
for all outcomes. Add a table-driven test around the code path in configuring.rs
that exercises the expected-switch lookup and credential write flow in the same
branch that uses find_by_bmc_mac_address, Credentials::UsernamePassword, and
credential_manager.set_credentials. Cover three cases: no expected switch row
returns the Error transition, missing nvos_username or nvos_password logs and
returns the existing outcome, and a successful write stores credentials in Vault
and returns Ok(outcome). Use the existing state handler entrypoint and mock
db/services so each branch is asserted independently.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/switch-controller/src/configuring.rs`:
- Around line 145-149: The tracing logs in `configuring.rs` are interpolating
`switch_id` and `bmc_mac_address` into the message body instead of emitting them
as structured fields. Update the affected `tracing::info!` calls in the relevant
branches of `configuring.rs` to pass `switch_id` and `bmc_mac_address` as
tracing attributes alongside a stable message string, using the existing log
sites around the NVOS credential skip paths.
- Around line 119-174: The NVOS credential backfill branch in the configuring
state needs focused test coverage for all outcomes. Add a table-driven test
around the code path in configuring.rs that exercises the expected-switch lookup
and credential write flow in the same branch that uses find_by_bmc_mac_address,
Credentials::UsernamePassword, and credential_manager.set_credentials. Cover
three cases: no expected switch row returns the Error transition, missing
nvos_username or nvos_password logs and returns the existing outcome, and a
successful write stores credentials in Vault and returns Ok(outcome). Use the
existing state handler entrypoint and mock db/services so each branch is
asserted independently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fd5f26cc-a30f-48ec-8bb6-927da029d497

📥 Commits

Reviewing files that changed from the base of the PR and between 9e904ee and ccf58cf.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • crates/api-core/src/dpa/lockdown.rs
  • crates/api-core/src/errors.rs
  • crates/api-core/src/handlers/credential.rs
  • crates/api-core/src/handlers/dpa.rs
  • crates/api-core/src/handlers/machine.rs
  • crates/api-core/src/handlers/mlx_admin.rs
  • crates/api-core/src/handlers/uefi.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-db/migrations/20260623120000_credential_rotation.sql
  • crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql
  • crates/api-db/src/credential_rotation.rs
  • crates/api-db/src/credential_rotation/test_backfill.rs
  • crates/api-db/src/lib.rs
  • crates/bmc-explorer-cli/src/main.rs
  • crates/dpa-manager/Cargo.toml
  • crates/dpa-manager/src/card_handler/svpc.rs
  • crates/dpa-manager/src/lib.rs
  • crates/machine-controller/src/handler.rs
  • crates/secrets/src/credentials.rs
  • crates/site-explorer/src/bmc_endpoint_explorer.rs
  • crates/site-explorer/src/lib.rs
  • crates/switch-controller/src/configuring.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
crates/secrets/src/credentials.rs (2)

1120-1122: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the prefix completeness test assert the actual variants.

len() == 16 still passes if one prefix is duplicated and another is omitted. Compare CredentialPrefix::all() to the expected slice so this test proves the contract it documents.

Proposed stronger assertion
     #[test]
     fn prefix_all_is_complete() {
         let all = CredentialPrefix::all();
-        assert_eq!(all.len(), 16);
+        assert_eq!(
+            all,
+            &[
+                CredentialPrefix::DpuSsh,
+                CredentialPrefix::DpuHbn,
+                CredentialPrefix::DpuRedfish,
+                CredentialPrefix::Bgp,
+                CredentialPrefix::HostRedfish,
+                CredentialPrefix::UfmAuth,
+                CredentialPrefix::DpuUefi,
+                CredentialPrefix::HostUefi,
+                CredentialPrefix::BmcCredentials,
+                CredentialPrefix::NicLockdownIkm,
+                CredentialPrefix::ExtensionService,
+                CredentialPrefix::NmxM,
+                CredentialPrefix::SwitchNvosAdmin,
+                CredentialPrefix::MqttAuth,
+                CredentialPrefix::MachineIdentityEncryptionKey,
+                CredentialPrefix::RackMaintenanceAccessToken,
+            ]
+        );
     }

As per coding guidelines, prefer table-driven or explicit case coverage for Rust tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/credentials.rs` around lines 1120 - 1122, The prefix
completeness test in CredentialPrefix::all is too weak because it only checks
the count, so update prefix_all_is_complete to assert the actual returned
variants against the expected full set of CredentialPrefix values instead of
just len(). Use the CredentialPrefix::all() helper and compare it to an explicit
expected slice or table-driven case list so the test fails if any variant is
duplicated, missing, or reordered unexpectedly.

Source: Coding guidelines


663-677: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Pin the unversioned UEFI aliases this test claims are unchanged.

The comment says the unversioned alias remains unchanged, but this test only asserts the new versioned paths. Add exact checks for HostUefi { SiteDefault } and DpuUefi { SiteDefault } so future path drift is caught.

Proposed test hardening
         assert_eq!(host_uefi.prefix(), CredentialPrefix::HostUefi);
+
+        let host_uefi_alias = CredentialKey::HostUefi {
+            credential_type: CredentialType::SiteDefault,
+        };
+        assert_eq!(
+            host_uefi_alias.to_key_str(),
+            "machines/all_hosts/site_default/uefi-metadata-items/auth"
+        );
 
         let dpu_uefi = CredentialKey::DpuUefiSiteVersioned { version: 7 };
         assert_eq!(
             dpu_uefi.to_key_str(),
             "machines/all_dpus/site_default/uefi-metadata-items/auth/v7"
         );
         assert_eq!(dpu_uefi.prefix(), CredentialPrefix::DpuUefi);
+
+        let dpu_uefi_alias = CredentialKey::DpuUefi {
+            credential_type: CredentialType::SiteDefault,
+        };
+        assert_eq!(
+            dpu_uefi_alias.to_key_str(),
+            "machines/all_dpus/site_default/uefi-metadata-items/auth"
+        );

As per coding guidelines, “Verification should exercise the behavior that changed.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/secrets/src/credentials.rs` around lines 663 - 677, The test for
CredentialKey UEFI rotation only verifies the new versioned variants, so it does
not actually pin the unchanged unversioned aliases described in the comment. In
the credentials.rs test case, add explicit assertions for
CredentialKey::HostUefi { SiteDefault } and CredentialKey::DpuUefi { SiteDefault
} to check their exact to_key_str and prefix values, alongside the existing
HostUefiSiteVersioned and DpuUefiSiteVersioned checks, so future path drift is
caught.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/secrets/src/credentials.rs`:
- Around line 1120-1122: The prefix completeness test in CredentialPrefix::all
is too weak because it only checks the count, so update prefix_all_is_complete
to assert the actual returned variants against the expected full set of
CredentialPrefix values instead of just len(). Use the CredentialPrefix::all()
helper and compare it to an explicit expected slice or table-driven case list so
the test fails if any variant is duplicated, missing, or reordered unexpectedly.
- Around line 663-677: The test for CredentialKey UEFI rotation only verifies
the new versioned variants, so it does not actually pin the unchanged
unversioned aliases described in the comment. In the credentials.rs test case,
add explicit assertions for CredentialKey::HostUefi { SiteDefault } and
CredentialKey::DpuUefi { SiteDefault } to check their exact to_key_str and
prefix values, alongside the existing HostUefiSiteVersioned and
DpuUefiSiteVersioned checks, so future path drift is caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a02ee2a-9226-4b0f-bd1f-48ddd9702554

📥 Commits

Reviewing files that changed from the base of the PR and between ccf58cf and f64d4af.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • crates/api-core/src/dpa/lockdown.rs
  • crates/api-core/src/errors.rs
  • crates/api-core/src/handlers/credential.rs
  • crates/api-core/src/handlers/dpa.rs
  • crates/api-core/src/handlers/machine.rs
  • crates/api-core/src/handlers/mlx_admin.rs
  • crates/api-core/src/handlers/uefi.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/api-db/migrations/20260623120000_credential_rotation.sql
  • crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql
  • crates/api-db/src/credential_rotation.rs
  • crates/api-db/src/credential_rotation/test_backfill.rs
  • crates/api-db/src/lib.rs
  • crates/bmc-explorer-cli/src/main.rs
  • crates/dpa-manager/Cargo.toml
  • crates/dpa-manager/src/card_handler/svpc.rs
  • crates/dpa-manager/src/lib.rs
  • crates/machine-controller/src/handler.rs
  • crates/secrets/src/credentials.rs
  • crates/site-explorer/src/bmc_endpoint_explorer.rs
  • crates/site-explorer/src/lib.rs
  • crates/switch-controller/src/configuring.rs
🚧 Files skipped from review as they are similar to previous changes (22)
  • crates/dpa-manager/src/lib.rs
  • crates/api-db/migrations/20260623120000_credential_rotation.sql
  • crates/api-core/src/handlers/mlx_admin.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/handlers/uefi.rs
  • crates/api-core/src/handlers/credential.rs
  • crates/api-core/src/test_support/builder.rs
  • crates/bmc-explorer-cli/src/main.rs
  • crates/api-db/src/lib.rs
  • crates/dpa-manager/Cargo.toml
  • crates/site-explorer/src/bmc_endpoint_explorer.rs
  • crates/site-explorer/src/lib.rs
  • crates/api-core/src/handlers/machine.rs
  • crates/api-core/src/errors.rs
  • crates/switch-controller/src/configuring.rs
  • crates/api-db/migrations/20260623130000_credential_rotation_backfill.sql
  • crates/api-db/src/credential_rotation.rs
  • crates/dpa-manager/src/card_handler/svpc.rs
  • crates/api-db/src/credential_rotation/test_backfill.rs
  • crates/machine-controller/src/handler.rs
  • crates/api-core/src/dpa/lockdown.rs
  • crates/api-core/src/handlers/dpa.rs

@spydaNVIDIA spydaNVIDIA enabled auto-merge (squash) June 26, 2026 06:19
@spydaNVIDIA spydaNVIDIA requested a review from wminckler June 26, 2026 06:20
@spydaNVIDIA spydaNVIDIA merged commit 52a0a54 into NVIDIA:main Jun 26, 2026
55 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants