Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DHCP lease maintenance should terminate when interface no longer exists. #1143

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dougbtv
Copy link

@dougbtv dougbtv commented Jan 21, 2025

Due to oberservations that threads can grow and the dhcp daemon uses an increasing amount of memory.

This situation can happen organically when using say, bridge CNI, and the bridge has been removed outside of the bridge CNI lifecycle, and an interface no longer exists on a pod.

Also, for reproduction steps (admittedly, openshift biased), see my steps in https://gist.github.com/dougbtv/803316016c1e96a529725013402d5f0d

@dougbtv dougbtv force-pushed the dhcp-daemon-terminate-iface-nonexist branch from 20e6ba9 to 0834450 Compare January 21, 2025 18:10

ctx, cancel := context.WithTimeout(ctx, linkCheckTotalTimeout)
defer cancel()
_, err := backoffRetry(ctx, linkCheckRetryMax, checkFunc)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only downside here is that backoffRetry() sets a "base delay" using the resendDelay0 const which was from the DHCP RFC stuff, so, it's 4 seconds the first time. This could probably be quicker? Will happily rejig backoffRetry to take a base delay parameter.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops missed a major part of this... fix incoming.

@dougbtv
Copy link
Author

dougbtv commented Jan 21, 2025

Soooo.... yeah what I have currently pushed has some things I totally whiffed on and I'm working on correcting it, but, in an ideal world, the netlink library release v1.3.0 doesn't include the commit to detect the ErrDumpInterrupted

I'm looking into working around it until that can get released, which I also requested in vishvananda/netlink#1019

@dougbtv dougbtv force-pushed the dhcp-daemon-terminate-iface-nonexist branch from 0834450 to 0249707 Compare January 21, 2025 22:06
return exists, err
}

func backoffRetryLinkExists(ctx context.Context, maxDelay time.Duration, f func() (bool, error)) (bool, error) {
Copy link
Author

@dougbtv dougbtv Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wound up creating another backoff method because I needed to have the retry method f func() (bool, error) return a bool and an error, so, I additionally added linkCheckDelay0

if err != nil {
log.Printf("!bang: Error checking network link '%s': %v, type: %v", l.link.Attrs().Name, err, reflect.TypeOf(err))
var linkNotFoundErr *netlink.LinkNotFoundError
if errors.As(err, &linkNotFoundErr) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't evaluating to true even though I was seeing logs as:

2025/01/21 21:36:46 Error checking network link 'net1': Link not found, type: netlink.LinkNotFoundError

I'll remove the reflect package and also the lines marked with !bang once I get it sorted.

@dougbtv dougbtv force-pushed the dhcp-daemon-terminate-iface-nonexist branch 2 times, most recently from ea7c4f0 to 55262dc Compare January 21, 2025 22:09
@dougbtv
Copy link
Author

dougbtv commented Jan 22, 2025

I'm going to change to checking for the string Link Not Found which I'm not particularly stoked on, but, happy to take any input on why this didn't work:

var linkNotFoundErr *netlink.LinkNotFoundError
if errors.As(err, &linkNotFoundErr) {
	log.Printf("!bang Link not found error detected! this is what I want to see.")
	return false, nil
}

That would never evaluate as true, even when I could find the error as netlink.LinkNotFoundError when using a reflect output like this: log.Printf("!bang: Error checking network link '%s': %v, type: %v", l.link.Attrs().Name, err, reflect.TypeOf(err))

@dougbtv dougbtv force-pushed the dhcp-daemon-terminate-iface-nonexist branch 4 times, most recently from c0f73b9 to 78d100b Compare January 22, 2025 00:42
@zeeke
Copy link

zeeke commented Jan 22, 2025

I'm going to change to checking for the string Link Not Found which I'm not particularly stoked on, but, happy to take any input on why this didn't work:

var linkNotFoundErr *netlink.LinkNotFoundError
if errors.As(err, &linkNotFoundErr) {
	log.Printf("!bang Link not found error detected! this is what I want to see.")
	return false, nil
}

That would never evaluate as true, even when I could find the error as netlink.LinkNotFoundError when using a reflect output like this: log.Printf("!bang: Error checking network link '%s': %v, type: %v", l.link.Attrs().Name, err, reflect.TypeOf(err))

You're passing the address of an empty pointer (pointer of pointer) to errors.As(...).

This should do the job

var linkNotFoundErr *netlink.LinkNotFoundError = &netlink.LinkNotFoundError{}
if errors.As(err, linkNotFoundErr) {
	log.Printf("!bang Link not found error detected! this is what I want to see.")
	return false, nil
}

Try it in https://go.dev/play/p/GZJO1RdVR8g

@dougbtv dougbtv force-pushed the dhcp-daemon-terminate-iface-nonexist branch from 78d100b to 7758510 Compare January 22, 2025 14:58
_, err := netlink.LinkByName(l.link.Attrs().Name)
if err != nil {
var linkNotFoundErr *netlink.LinkNotFoundError = &netlink.LinkNotFoundError{}
if errors.As(err, linkNotFoundErr) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Owe you one @zeeke !!! Thanks. Appreciate that, it works a charm and much cleaner than the string check, thanks.

ctx, cancel := context.WithTimeout(ctx, linkCheckTotalTimeout)
defer cancel()

exists, err := backoffRetryLinkExists(ctx, linkCheckRetryMax, checkFunc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linkCheckRetryMax is constant variable, defined as global scope in this file. So we may remove it from function argument. Otherwise, remove constant variable, linkCheckRetryMax and define it in function variable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, simplified so that it isn't passed through the method param and just used directly there, good call

func backoffRetryLinkExists(ctx context.Context, maxDelay time.Duration, f func() (bool, error)) (bool, error) {
baseDelay := linkCheckDelay0
for {
exists, err := f()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this function is pretty for backoffRetryLinkExists, not for common retry, hence we may not required to have f func() (bool, error)) as argument, I guess. Instead of that, having linkName in function argument makes it simple, I feel.

func backoffRetryLinkExists(ctx context.Context, maxDelay time.Duration, linkName string) (bool, error) {
  checkFunc := func() (bool, error) {
        _, err := netlink.LinkByName(linkName)
        if err != nil {
            var linkNotFoundErr *netlink.LinkNotFoundError = &netlink.LinkNotFoundError{}
            if errors.As(err, linkNotFoundErr) {
                return false, nil
            }
            return false, err
        }
        return true, nil
    }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tomo, nice simplification. I did wind up breaking out a method checkLinkByName to use there, and then just check it in a for loop in a condensed version of the checkLinkExistsWithBackoff method.

@@ -344,6 +360,45 @@ func (l *DHCPLease) maintain() {
}
}

func (l *DHCPLease) checkLinkExistsWithBackoff(ctx context.Context) (bool, error) {
checkFunc := func() (bool, error) {
_, err := netlink.LinkByName(l.link.Attrs().Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assume that this link is already removed, then l.link.Attrs() may be invalid. We should store interface name ahead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, I didn't really account for this, even though it seems it was pulling up the correct name. However, just in case this does happen in other situations, I instead opted to store linkName as a property of the DHCPLease object. Thanks!

Due to oberservations that threads can grow and the dhcp daemon uses an increasing amount of memory.

This situation can happen organically when using say, bridge CNI, and the bridge has been removed outside of the bridge CNI lifecycle, and an interface no longer exists on a pod.

Does so on a retry loop using the `backoffRetry()` method.

Signed-off-by: dougbtv <[email protected]>
@dougbtv dougbtv force-pushed the dhcp-daemon-terminate-iface-nonexist branch from 7758510 to ce0c680 Compare January 27, 2025 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants