fix(dhcp): temporarily block DHCP lease expiry handling. #2877

Open
abvarshney-nv wants to merge 2 commits into NVIDIA:main from abvarshney-nv:stop_ip_cleanup
Conversation

@abvarshney-nv

Contributor

Expiring a BMC IP lease causes a mismatch between machine_interface and machine_topologies, and DPF does not yet support BMC IP changes. Block all lease expiry processing until we have a proper way to clean up the IPs and DPF releases its fix.

  • Add EXPIRE_DHCP_LEASE_STATUS_NOT_HANDLED proto variant for leases that are found but intentionally not processed
  • Return NotHandled (or NotFound if the interface no longer exists) from the API handler, with an explicit txn.rollback() before the early return
  • Handle NotHandled in the Kea plugin's lease expiration match arm
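
The handler behavior in the bullets above can be sketched as follows. This is a minimal illustration, assuming an in-memory map stands in for the machine_interface table; `InterfaceStore` and `expire_dhcp_lease_blocked` are hypothetical names, and only the status names come from this description.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Status names taken from the PR description; the real enum is generated
/// from the proto, so this local copy is an assumption for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExpireDhcpLeaseStatus {
    Released,
    NotFound,
    NotHandled,
}

/// Hypothetical stand-in for the machine_interface table: IP -> hostname.
pub struct InterfaceStore {
    pub by_ip: HashMap<IpAddr, String>,
}

/// While expiry handling is blocked nothing is deleted: a known lease is
/// reported as NotHandled, an unknown address as NotFound.
pub fn expire_dhcp_lease_blocked(store: &InterfaceStore, ip: IpAddr) -> ExpireDhcpLeaseStatus {
    match store.by_ip.get(&ip) {
        Some(_) => ExpireDhcpLeaseStatus::NotHandled,
        None => ExpireDhcpLeaseStatus::NotFound,
    }
}
```

The function takes the store by shared reference, so the blocked path cannot remove a row even by accident.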

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@abvarshney-nv abvarshney-nv requested a review from a team as a code owner June 25, 2026 08:40
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Contributor

Summary by CodeRabbit

  • New Features
    • Added a configuration toggle to control whether DHCP lease expiry performs IP cleanup.
    • Introduced a new “feature disabled” DHCP lease-expiration response status for accepted-but-not-processed requests.
    • Updated client-side handling to recognize the new status and log accordingly.
  • Bug Fixes
    • Ensured that when expiry handling is disabled, DHCP allocations and interface hostnames are preserved instead of being removed.
  • Tests
    • Updated DHCP lease-expiration test expectations and expanded override-based test setup to cover the disabled-handling path.

Walkthrough

The PR adds a feature-disabled DHCP expiry status, gates expire_dhcp_lease on a runtime config flag, and updates the DHCP caller and tests to cover the new response path.

Changes

DHCP lease expiration feature gate

| Layer | File(s) | Summary |
|---|---|---|
| Expiry status and config shape | crates/rpc/proto/forge.proto, rest-api/flow/internal/nicoapi/nicoproto/nico.proto, crates/api-core/src/cfg/file.rs, crates/api-core/src/test_support/default_config.rs | Adds FEATURE_DISABLED to ExpireDhcpLeaseStatus, and adds dhcp_lease_expiry_handling to CarbideConfig with a defaulted test value. |
| API-core expiry gate | crates/api-core/src/dhcp/expire.rs, crates/api-core/src/tests/common/api_fixtures/mod.rs | expire_dhcp_lease returns FeatureDisabled immediately when lease expiry handling is off, and the test fixture override path can enable that flag. |
| Caller handling and regression coverage | crates/dhcp/src/lease_expiration.rs, crates/api-core/src/tests/dhcp_lease_expiration.rs | expire_lease_at logs the feature-disabled status, and the DHCP lease-expiration tests update expectations, environment setup, and hostname assertions for the new behavior. |
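
The gate summarized above might look roughly like this. It is a sketch under stated assumptions: `Config` and `Allocations` are hypothetical stand-ins (the walkthrough names only the `dhcp_lease_expiry_handling` field and the `FeatureDisabled` status), and the real handler works against a database rather than a map.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Local copy of the status names for illustration; the real enum is generated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExpireDhcpLeaseStatus {
    Released,
    NotFound,
    FeatureDisabled,
}

/// Hypothetical slice of the config; only the field name comes from the walkthrough.
pub struct Config {
    pub dhcp_lease_expiry_handling: bool,
}

/// Hypothetical in-memory allocation table: IP -> hostname.
pub type Allocations = HashMap<IpAddr, String>;

pub fn expire_dhcp_lease(
    cfg: &Config,
    allocs: &mut Allocations,
    ip: IpAddr,
) -> ExpireDhcpLeaseStatus {
    // Gate first: when handling is off, leave every allocation untouched.
    if !cfg.dhcp_lease_expiry_handling {
        return ExpireDhcpLeaseStatus::FeatureDisabled;
    }
    // Handling is on: release the allocation if one exists.
    match allocs.remove(&ip) {
        Some(_) => ExpireDhcpLeaseStatus::Released,
        None => ExpireDhcpLeaseStatus::NotFound,
    }
}
```

With the flag off, the allocation and its hostname survive the call, which is the preserved-state behavior listed under Bug Fixes.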

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 69.23% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title matches the core change: lease expiry handling is being temporarily disabled. |
| Description check | ✅ Passed | The description is about blocking DHCP lease expiry processing, which aligns with the changeset despite some naming mismatches. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
@abvarshney-nv abvarshney-nv linked an issue Jun 25, 2026 that may be closed by this pull request
2 tasks

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/api-core/src/tests/dhcp_lease_expiration.rs (1)

160-169: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert that blocked expiry preserves the same IP.

After switching this path to NotHandled, the important invariant is that rediscover sees the existing allocation still attached. The current follow-up only checks for a non-empty address, so it would still pass if rediscover assigned a different IP. Please pin response2.address == original_ip (or verify the DB row directly) and rename the test to match the blocked behavior. As per coding guidelines, “Verification should exercise the behavior that changed” and “Add or update focused tests for ... API contracts”; as per path instructions, STYLE_GUIDE testing should cover observable input→output behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/tests/dhcp_lease_expiration.rs` around lines 160 - 169,
The blocked lease-expiry test is too weak because it only checks that rediscover
returns an address, not that it preserves the original allocation. Update the
test around expire_dhcp_lease and rediscover to assert response2.address matches
original_ip, or verify the database row directly, so the changed NotHandled
behavior is exercised. Also rename the test to reflect the blocked-expiry
behavior and keep the assertion focused on the observable input→output contract.

Sources: Coding guidelines, Path instructions
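
A sketch of the stronger assertion suggested in this comment, assuming a toy allocator in place of the real test fixtures. `Allocator`, `discover`, `expire_blocked`, and `expire_and_release` are hypothetical names; only `response2.address == original_ip` mirrors the wording above.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Hypothetical allocator: returns the existing address for a known MAC,
/// otherwise hands out the next free one.
pub struct Allocator {
    next_host: u8,
    by_mac: HashMap<String, Ipv4Addr>,
}

impl Allocator {
    pub fn new() -> Self {
        Allocator { next_host: 10, by_mac: HashMap::new() }
    }

    pub fn discover(&mut self, mac: &str) -> Ipv4Addr {
        if let Some(&ip) = self.by_mac.get(mac) {
            return ip;
        }
        let ip = Ipv4Addr::new(192, 0, 2, self.next_host);
        self.next_host += 1;
        self.by_mac.insert(mac.to_string(), ip);
        ip
    }

    /// Blocked expiry: the allocation is intentionally left in place.
    pub fn expire_blocked(&mut self, _ip: Ipv4Addr) {}

    /// What an unblocked expiry would do; included only to show that a
    /// "non-empty address" check cannot tell the two behaviors apart.
    pub fn expire_and_release(&mut self, ip: Ipv4Addr) {
        self.by_mac.retain(|_, v| *v != ip);
    }
}
```

In this toy model, rediscovery after `expire_and_release` still returns an address, just a different one, so only an equality check against the original IP distinguishes the blocked path.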

🧹 Nitpick comments (2)
crates/api-core/src/dhcp/expire.rs (2)

58-58: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Log the blocked lease with structured fields.

This is now the steady-state outcome, so the bare string loses the IP/MAC needed to correlate skipped expirations with later DHCP/DNS state. Prefer something like tracing::info!(%ip_address, ?mac_address, "Expired DHCP lease handling blocked");. As per coding guidelines, “All services should emit logs in 'logfmt' syntax” and “prefer placing common fields as attributes passed to tracing functions”; as per path instructions, crates/**/*.rs should use structured tracing fields.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/dhcp/expire.rs` at line 58, The current `tracing::info!`
in `expire.rs` logs only a bare message, so the blocked DHCP lease event cannot
be correlated with IP/MAC state. Update the logging in the DHCP lease expiration
handling to use structured tracing fields on the relevant variables already in
scope, such as `ip_address` and `mac_address`, and keep the message descriptive
like the existing `tracing::info!` call.

Sources: Coding guidelines, Path instructions
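
To make the target shape concrete, here is a std-only approximation of the logfmt line such a structured call could produce. It does not use the tracing crate; `blocked_lease_logfmt` is a hypothetical helper, and the exact field layout emitted by the project's subscriber is an assumption.

```rust
use std::net::IpAddr;

/// Renders one logfmt-style line: key=value pairs with the message quoted.
/// The field names match the variables mentioned in the comment above.
pub fn blocked_lease_logfmt(ip_address: IpAddr, mac_address: Option<&str>) -> String {
    // A missing MAC is rendered as a fixed placeholder so the key is always present.
    let mac = mac_address.unwrap_or("unknown");
    format!(
        "level=info msg=\"Expired DHCP lease handling blocked\" ip_address={} mac_address={}",
        ip_address, mac
    )
}
```

Keeping the IP and MAC as separate keys is what lets a skipped expiration be matched against later DHCP or DNS state with a simple field filter.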


42-69: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Move the blocked return ahead of txn_begin().

After ip_address.parse(), one of is_ipv4() / is_ipv6() is always true, so Lines 49-69 run for every request. That makes the delete/sync path dead for now and turns each expiry callback into a BEGIN + lookup + ROLLBACK cycle even though the blocked path is read-only. Make the block explicit before opening a write transaction, and use a read-only lookup only to choose NotFound vs NotHandled. As per path instructions, crates/api*/** changes should be reviewed for transaction safety.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/dhcp/expire.rs` around lines 42 - 69, The blocked DHCP
expiry path in expire_dhcp_lease currently runs after txn_begin() and always
returns because the IP is already validated, so move the early return logic
before opening a write transaction. In expire_dhcp_lease, keep only a read-only
lookup of db::machine_interface::find_by_ip to decide between NotFound and
NotHandled, and avoid starting/rolling back txn when the lease handling is
blocked.

Source: Path instructions
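
The reordering proposed here can be sketched with a counter in place of a real database. `Db`, its `write_txns_begun` field, and `interface_exists` are hypothetical; the point is only that the blocked path returns before any write transaction is opened.

```rust
use std::collections::HashSet;
use std::net::IpAddr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExpireDhcpLeaseStatus {
    Released,
    NotFound,
    NotHandled,
}

/// Hypothetical database handle that counts opened write transactions.
pub struct Db {
    pub interfaces: HashSet<IpAddr>,
    pub write_txns_begun: u32,
}

impl Db {
    /// Stand-in for txn_begin(); the real call opens a database transaction.
    pub fn txn_begin(&mut self) {
        self.write_txns_begun += 1;
    }

    /// Read-only lookup that needs no write transaction.
    pub fn interface_exists(&self, ip: &IpAddr) -> bool {
        self.interfaces.contains(ip)
    }
}

pub fn expire_dhcp_lease(db: &mut Db, blocked: bool, ip: IpAddr) -> ExpireDhcpLeaseStatus {
    if blocked {
        // Choose the status from a read-only lookup and return before any BEGIN.
        return if db.interface_exists(&ip) {
            ExpireDhcpLeaseStatus::NotHandled
        } else {
            ExpireDhcpLeaseStatus::NotFound
        };
    }
    // Only the unblocked path pays for a write transaction.
    db.txn_begin();
    if db.interfaces.remove(&ip) {
        ExpireDhcpLeaseStatus::Released
    } else {
        ExpireDhcpLeaseStatus::NotFound
    }
}
```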

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/rpc/proto/forge.proto`:
- Line 3931: The flow proto mirror is out of sync with the source enum
definition, so add the missing EXPIRE_DHCP_LEASE_STATUS_NOT_HANDLED value to the
corresponding enum in nico.proto. Keep the enum ordering and numeric value
aligned with the matching enum in forge.proto so the generated flow clients
remain consistent with the source proto.
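
Why the mirror matters can be illustrated with a toy decoder. The numeric values below are placeholders, not the ones in forge.proto, and `SourceStatus`, `MirrorStatus`, and `decode_with_stale_mirror` are hypothetical; how a particular generated client treats an unknown value depends on its language and proto version.

```rust
/// Hypothetical view of the source enum; discriminants are placeholders.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceStatus {
    Released = 0,
    NotFound = 1,
    NotHandled = 2,
}

/// Hypothetical view of a mirror that has not yet gained the new variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MirrorStatus {
    Released,
    NotFound,
}

/// A client built from the stale mirror has no name for the new wire value.
pub fn decode_with_stale_mirror(wire: i32) -> Option<MirrorStatus> {
    match wire {
        0 => Some(MirrorStatus::Released),
        1 => Some(MirrorStatus::NotFound),
        _ => None,
    }
}
```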

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a4d71c84-707f-4775-afca-04064db703da

📥 Commits

Reviewing files that changed from the base of the PR and between 852c517 and 2196d81.

📒 Files selected for processing (4)
  • crates/api-core/src/dhcp/expire.rs
  • crates/api-core/src/tests/dhcp_lease_expiration.rs
  • crates/dhcp/src/lease_expiration.rs
  • crates/rpc/proto/forge.proto

Comment thread crates/rpc/proto/forge.proto Outdated
@github-actions

github-actions Bot commented Jun 25, 2026


🔍 Container Scan Summary

| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 285 | 6 | 26 | 102 | 7 | 144 |
| machine-validation-runner | 744 | 32 | 188 | 267 | 36 | 221 |
| machine_validation | 744 | 32 | 188 | 267 | 36 | 221 |
| machine_validation-aarch64 | 744 | 32 | 188 | 267 | 36 | 221 |
| nvmetal-carbide | 744 | 32 | 188 | 267 | 36 | 221 |
| TOTAL | 3267 | 134 | 778 | 1176 | 151 | 1028 |

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@github-actions


🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-06-25 10:48:59 UTC | Commit: d21c249

Comment thread crates/api-core/src/dhcp/expire.rs Outdated
// to still exist, so we must do this before the delete.
let interface = db::machine_interface::find_by_ip(&mut txn, ip_address).await?;

if ip_address.is_ipv4() || ip_address.is_ipv6() {

Contributor

Instead of `if ip_address.is_ipv4() || ip_address.is_ipv6() {`, how about we add a config bool `handle_lease_expiration_callback` that defaults to false, so this just becomes:

if !handle_lease_expiration_callback
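
A minimal sketch of the difference, assuming a hypothetical `Config` struct that holds the suggested flag. The original guard is true for every address that parsed successfully, so it blocks unconditionally; the suggested guard makes the block an explicit opt-out that defaults to skipping.

```rust
use std::net::IpAddr;

/// Hypothetical config slice; the field name is the one suggested above.
/// Deriving Default gives `false` for the bool.
#[derive(Debug, Default)]
pub struct Config {
    pub handle_lease_expiration_callback: bool,
}

/// The original guard: every IpAddr is either V4 or V6, so this is always true.
pub fn original_guard(ip: &IpAddr) -> bool {
    ip.is_ipv4() || ip.is_ipv6()
}

/// The suggested guard: skip handling unless the operator opted in.
pub fn should_skip(cfg: &Config) -> bool {
    !cfg.handle_lease_expiration_callback
}
```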

Contributor Author

done.

@abvarshney-nv abvarshney-nv linked an issue Jun 26, 2026 that may be closed by this pull request
2 tasks

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/api-core/src/tests/dhcp_lease_expiration.rs (2)

436-445: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Align comment with the actual assertion.

The comment references an n-<mac> placeholder while the assertion verifies the hostname starts with noip. Update the comment to the noip prefix to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/tests/dhcp_lease_expiration.rs` around lines 436 - 445,
The inline comment above the expiry assertion is inconsistent with the actual
check in dhcp_lease_expiration; update the comment to match the hostname prefix
being asserted in iface_after_expiry so it refers to the noip dormant format
rather than n-<mac>. Keep the comment aligned with the assertion logic in the
same test block to avoid misleading future readers.

34-64: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Rename this test to reflect its inverted assertion.

test_expire_releases_allocation now runs against the default (disabled) environment and asserts the allocation is explicitly not released (FeatureDisabled, with both rows preserved). The name describes the opposite outcome and will mislead the next maintainer — consider test_expire_is_blocked_when_handling_disabled or similar.
[optional_refactor]

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/tests/dhcp_lease_expiration.rs` around lines 34 - 64, The
test name is misleading because the assertions in
test_expire_releases_allocation now verify the disabled-path behavior instead of
a successful lease expiration. Rename the test to reflect that expire_dhcp_lease
is blocked by the default environment and that the allocation remains preserved
with FeatureDisabled, using a name aligned with the current assertions so future
readers understand the intended outcome.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/dhcp/src/lease_expiration.rs`:
- Around line 105-107: The new FeatureDisabled branch in
lease_expiration::expire_dhcp_lease is not reachable from the current tests
because MockAPIServer still returns a hard-coded
ExpireDhcpLeaseStatus::Released. Update the test setup so the mock status is
configurable, or add a focused unit test around the status-to-log/behavior
mapping in MockAPIServer and expire_dhcp_lease, so FeatureDisabled is explicitly
exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8f4a93bc-6a51-4530-95f9-907d665a8395

📥 Commits

Reviewing files that changed from the base of the PR and between d21c249 and ad64861.

📒 Files selected for processing (8)
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/dhcp/expire.rs
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-core/src/tests/common/api_fixtures/mod.rs
  • crates/api-core/src/tests/dhcp_lease_expiration.rs
  • crates/dhcp/src/lease_expiration.rs
  • crates/rpc/proto/forge.proto
  • rest-api/flow/internal/nicoapi/nicoproto/nico.proto

Comment on lines +105 to +107
rpc::ExpireDhcpLeaseStatus::FeatureDisabled => {
log::info!("Feature is disabled at NICo");
}

Contributor

🩺 Stability & Availability | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the mock API server to see whether the expire-lease status is configurable.
fd -t f 'mock_api_server' crates/dhcp -x sed -n '1,200p' {}
rg -nP 'ExpireDhcpLeaseStatus|ENDPOINT_EXPIRE_DHCP_LEASE|expire_dhcp_lease' crates/dhcp -C3

Repository: NVIDIA/infra-controller

Length of output: 12786


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the lease expiration tests and any status configurability in the DHCP mock server.
sed -n '1,240p' crates/dhcp/src/lease_expiration.rs
printf '\n--- SEARCH ---\n'
rg -n 'FeatureDisabled|Released|NotFound|inject_failure|address_overrides|status:' crates/dhcp/src -C 2

Repository: NVIDIA/infra-controller

Length of output: 14971


Add a test hook for FeatureDisabled
MockAPIServer still hard-codes ExpireDhcpLeaseStatus::Released, so the new branch cannot be exercised by the existing lease-expiration tests. Make the mock status configurable or add a focused unit test for the status mapping; otherwise this branch remains uncovered.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/dhcp/src/lease_expiration.rs` around lines 105 - 107, The new
FeatureDisabled branch in lease_expiration::expire_dhcp_lease is not reachable
from the current tests because MockAPIServer still returns a hard-coded
ExpireDhcpLeaseStatus::Released. Update the test setup so the mock status is
configurable, or add a focused unit test around the status-to-log/behavior
mapping in MockAPIServer and expire_dhcp_lease, so FeatureDisabled is explicitly
exercised.
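
One way the mock could be made configurable is sketched below. It assumes a much simpler mock than the real MockAPIServer: `MockApiServer`, its `expire_status` field, and `log_line_for` are hypothetical, and only the FeatureDisabled message is taken from the diff above; the other two messages are placeholders.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExpireDhcpLeaseStatus {
    Released,
    NotFound,
    FeatureDisabled,
}

/// Hypothetical mock: the returned status is a field instead of a constant,
/// so each test can choose which branch of the caller it exercises.
pub struct MockApiServer {
    pub expire_status: ExpireDhcpLeaseStatus,
}

impl MockApiServer {
    pub fn expire_dhcp_lease(&self, _ip: &str) -> ExpireDhcpLeaseStatus {
        self.expire_status
    }
}

/// Status-to-log mapping, returned as a string so a test can assert on it.
/// The exhaustive match makes a newly added variant a compile error here.
pub fn log_line_for(status: ExpireDhcpLeaseStatus) -> &'static str {
    match status {
        ExpireDhcpLeaseStatus::Released => "Lease released",
        ExpireDhcpLeaseStatus::NotFound => "Lease not found",
        ExpireDhcpLeaseStatus::FeatureDisabled => "Feature is disabled at NICo",
    }
}
```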

Development

Successfully merging this pull request may close these issues.

  • bug: IP cleanup on DHCP lease expiry removes active interface IPs
  • feat: keep DPU BMC ip in sync for DPF (DPF Integration)

2 participants