nsd verification processing hangs, activity stopped for 20-30 minutes #338

Open
ttyS4 opened this issue Jun 16, 2024 · 5 comments


ttyS4 commented Jun 16, 2024

Hi nsd folks,

We have a deployment where nsd is used for zone verification.
(Because of IXFR-related issues it now runs 4.9.1-1 on Debian 12. The package was built in a Debian 12 chroot from the official Debian packaging, so it is basically a backport.)

A new zone is generated every 10 minutes. Knot signs the zone, then nsd verifies it and distributes it (notify-out + XFR).

nsd[32438]: notify for xy. from ::1 serial 1718515802
nsd[22942]: xfrd: zone xy committed "received update to serial 1718515802 at 2024-06-16T07:30:28 from ::1@52"
nsd[22943]: zone xy. received update to serial 1718515802 at 2024-06-16T07:30:28 from ::1@52 of 7204 bytes in 7.9e-05 seconds
nsd[22943]: verify: started verifier for zone xy (pid 35409)
...
nsd[22943]: verify: verifier for zone xy (pid 35409) exited with 0
nsd[22942]: zone xy serial 1718515202 is updated to 1718515802
nsd[35663]: ixfr for xy. from IP1
nsd[35663]: ixfr for xy. from IP2
...
nsd[22942]: xfrd: zone xy: received notify response error .... from IP6

However, today there was no follow-up after the verifier exited with 0.
About 20 minutes after the verification finished we saw nsd[4819]: handle_child_command: read: Connection reset by peer.
Normal activity then resumed, and the following message appeared:

nsd[22942]: zone xy serial 1718516403 is updated to 1718517002

Notify messages were received (and logged) while nsd was in this state, but no progress was made.

Do you think upgrading to 4.10 could help?
Is this a known issue or something that needs further investigation?

Regards,
Tamás

Member

wtoorop commented Jun 17, 2024

Hi Tamas,
I don't think upgrading to 4.10 would make a difference in this case, but the 20-minute timeout (during which NSD stays in reload mode) could perhaps be reduced by setting verifier-timeout: to something reasonable, such as 200% of the time it takes the script to verify the zone.
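
As a rough illustration of the "200% of the verification time" rule of thumb, here is a minimal Python sketch. It assumes you have measured a few verifier run times yourself; the function names and sample durations are hypothetical, and the assumption that verifier-timeout: takes whole seconds should be checked against nsd.conf(5) for your version.

```python
import math


def suggested_verifier_timeout(durations_seconds, factor=2.0, minimum=1):
    """Return a timeout of `factor` times the slowest measured verifier run.

    durations_seconds: run times you measured yourself (hypothetical input).
    factor: 2.0 corresponds to the "200%" rule of thumb from the comment above.
    """
    if not durations_seconds:
        raise ValueError("need at least one measured duration")
    worst = max(durations_seconds)
    # Round up so the timeout is never shorter than factor * worst.
    return max(minimum, math.ceil(worst * factor))


def config_line(timeout_seconds):
    # Assumption: the option takes an integer number of seconds.
    return f"verifier-timeout: {timeout_seconds}"


if __name__ == "__main__":
    measured = [3.2, 4.8, 4.1]  # hypothetical measurements, in seconds
    print(config_line(suggested_verifier_timeout(measured)))
```

Using the slowest observed run rather than the average keeps a normal but slow verification from being cut off.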

Member

wtoorop commented Jun 17, 2024

But I still want to look into the specific case (by manual code inspection) in which the process has already exited, but NSD is still reading what the verifier is writing to stdout and stderr.
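
For readers following along, one general POSIX behaviour can produce this kind of pattern. This is offered only as an illustration, not as a diagnosis of NSD's code. A pipe reader sees end-of-file only after every process holding the write end has closed it. If a script leaves behind a background process that inherited its stdout, the reader keeps waiting even though the script itself has exited. In this minimal Python sketch, the `sh -c` command is a hypothetical stand-in for a verifier script:

```python
import subprocess
import time


def demo_inherited_pipe(hold_seconds=1):
    """Show that reading a child's stdout can outlast the child itself.

    The shell exits immediately, but the background `sleep` inherits the
    shell's stdout, so the write end of the pipe stays open until it ends.
    """
    child = subprocess.Popen(
        ["sh", "-c", f"sleep {int(hold_seconds)} & exit 0"],
        stdout=subprocess.PIPE,
    )
    start = time.monotonic()
    exit_code = child.wait()       # returns as soon as the shell exits
    waited = time.monotonic() - start
    child.stdout.read()            # blocks until all writers closed the pipe
    drained = time.monotonic() - start
    child.stdout.close()
    return exit_code, waited, drained


if __name__ == "__main__":
    code, waited, drained = demo_inherited_pipe(1)
    print(f"exit code {code}: wait took {waited:.2f}s, EOF after {drained:.2f}s")
```

Whether anything like this applies here depends on what the verifier script starts and on how NSD reads the pipes, which is what the code inspection would have to show.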

wtoorop self-assigned this Jun 17, 2024

Author

ttyS4 commented Jun 17, 2024

If you need any info from us, just let us know.
(I can also try to collect data for you as long as it is considered safe.)

Author

ttyS4 commented Oct 22, 2024

This issue happened again.

# grep -E 'handle_child_command|Broken' /var/log/daemon.log
Oct 22 04:10:31 myhost nsd[27260]: handle_child_command: read: Connection reset by peer
Oct 22 05:48:55 myhost nsd[16206]: svrmain: problems sending quit to child 8223 command: Broken pipe
Oct 22 05:48:55 myhost nsd[16206]: handle_child_command: read: Connection reset by peer
Oct 22 05:48:55 myhost nsd[16206]: svrmain: problems sending quit to child 8223 command: Broken pipe
Oct 22 06:11:01 myhost nsd[24647]: handle_child_command: read: Connection reset by peer
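
To quantify stalls like these from the same daemon.log, a small script can pair each "verifier ... exited with 0" line with the next "is updated to" line for that zone and report the gap. The sketch below is only a starting point. It assumes the syslog prefix format shown in the grep output above and the message texts quoted earlier in this issue. The function name, threshold and `year` parameter are hypothetical (classic syslog timestamps carry no year).

```python
import re
from datetime import datetime

# Syslog prefix as in "Oct 22 04:10:31 myhost nsd[27260]: message"
LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ nsd\[\d+\]: (.*)$")
EXITED = re.compile(r"verify: verifier for zone (\S+) \(pid \d+\) exited with 0")
UPDATED = re.compile(r"zone (\S+) serial \d+ is updated to \d+")


def find_stalls(lines, threshold_seconds=60, year=2024):
    """Return (zone, exited_at, updated_at, gap_seconds) for slow reloads."""
    pending = {}  # zone name -> time the verifier exited
    stalls = []
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        message = match.group(2)
        exited = EXITED.search(message)
        if exited:
            pending[exited.group(1).rstrip(".")] = stamp
            continue
        updated = UPDATED.search(message)
        if updated:
            zone = updated.group(1).rstrip(".")
            exited_at = pending.pop(zone, None)
            if exited_at is not None:
                gap = (stamp - exited_at).total_seconds()
                if gap > threshold_seconds:
                    stalls.append((zone, exited_at, stamp, gap))
    return stalls


if __name__ == "__main__":
    with open("/var/log/daemon.log", encoding="utf-8", errors="replace") as log:
        for zone, exited_at, updated_at, gap in find_stalls(log):
            print(f"{zone}: verifier exited {exited_at}, updated {updated_at} ({gap:.0f}s)")
```

Logs that span a year boundary, or that use a different timestamp format, would need the parsing adjusted.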

Member

wtoorop commented Dec 30, 2024

Hi @ttyS4, as you pointed out, this is indeed quite similar to issue #417.
I will report back on both issues when I work on the watchdog processes for the stages of a reload.
