Edgerouter crashing - watchdog reboots with service enabled for some time #516

Open
geekifier opened this issue Oct 31, 2024 · 4 comments

Comments

@geekifier

geekifier commented Oct 31, 2024

First of all, thank you for creating this, it's very useful to have those metrics available, and I appreciate you sharing your work with us.

I am running Edgerouter 4 with firmware v2.0.9-hotfix.7 with edgerouter-exporter 2.9.4 (prebuilt binary from github).

$ cat /config/user-data/edgerouter-exporter.env
LOG_LEVEL=info
PORT=9090

All the metrics coming from the service look great. However, with the service enabled, I am experiencing crashes where the router becomes unresponsive and the watchdog service initiates a reset.

I realize this is very generic information, so I would like to know what I can do to collect useful logs or other information about what might be happening. Due to the nature of the crash, syslog export gets interrupted, and there is not much useful info from just before the crash.

Syslog after the reboot just shows the kernel boot logs.

If you know of any sort of crash dump function I could enable to help with this, I would try to get those captured.

Thanks!

@chitoku-k
Owner

Hi, I am glad to hear that edgerouter-exporter is useful to you!

It seems that one or more of the commands this exporter invokes internally to collect metrics may be causing the problem on your router. I suspect the crash stems from a memory leak, so to find the suspicious command, could you call each of them repeatedly and check whether a leak shows up? If nothing leaks, the available column in the output of the free command should stay mostly the same.

Setting LOG_LEVEL to debug lets you see which commands the exporter invokes internally:

$ sudo PORT=9090 LOG_LEVEL=debug /usr/bin/edgerouter-exporter
[2024-11-01T12:57:43Z DEBUG] executing /opt/vyatta/sbin/ubnt_vtysh with ["-c", "show ip bgp summary"]
[2024-11-01T12:57:43Z DEBUG] executing /opt/vyatta/sbin/ubnt_vtysh with ["-c", "show bgp ipv6 summary"]
[2024-11-01T12:57:43Z DEBUG] executing /opt/vyatta/bin/sudo-users/vyatta-op-dynamic-dns.pl with ["--show-status"]
[2024-11-01T12:57:43Z DEBUG] executing /opt/vyatta/bin/vyatta-op-cmd-wrapper with ["show", "load-balance", "status"]
[2024-11-01T12:57:43Z DEBUG] executing /bin/ip with ["--brief", "addr", "show"]
[2024-11-01T12:57:43Z DEBUG] executing /opt/vyatta/bin/vyatta-op-cmd-wrapper with ["show", "version"]
[2024-11-01T12:57:44Z DEBUG] executing /opt/vyatta/bin/vyatta-op-cmd-wrapper with ["show", "pppoe-client"]
[2024-11-01T12:57:44Z DEBUG] executing /opt/vyatta/bin/vyatta-op-cmd-wrapper with ["show", "load-balance", "watchdog"]
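
The repeated-invocation check described above could be sketched roughly as follows. This is only an illustration: the helper names `available_kib` and `leak_check` are made up for this example, and it assumes the procps version of `free`, where "available" is the seventh field of the `Mem:` line.

```sh
#!/bin/sh
# Print the "available" value (KiB) from `free -k` output read on stdin.
# Assumption: procps layout, where "available" is the 7th field of the Mem: line.
available_kib() {
    awk '/^Mem:/ { print $7 }'
}

# Run a command repeatedly and report the change in available memory.
# Usage: leak_check ITERATIONS COMMAND [ARGS...]
leak_check() {
    lc_iterations=$1
    shift
    lc_before=$(free -k | available_kib)
    lc_i=0
    while [ "$lc_i" -lt "$lc_iterations" ]; do
        # Discard the command's output; only the memory effect matters here.
        "$@" > /dev/null 2>&1
        lc_i=$((lc_i + 1))
    done
    lc_after=$(free -k | available_kib)
    echo "$*: available ${lc_before} KiB -> ${lc_after} KiB (delta $((lc_after - lc_before)) KiB)"
}

# Example: 200 runs of one of the commands from the debug log above.
# leak_check 200 /opt/vyatta/sbin/ubnt_vtysh -c "show ip bgp summary"
```

A delta that keeps getting more negative across runs of the same command would point at that command. Small fluctuations are expected, since other processes and the page cache also affect the available figure.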

Best,

@geekifier
Author

I will try to collect the info you requested.
I have not observed a memory leak so far, unless it happens very rapidly. Based on my monitoring, the exporter service itself consumes very little RAM.

I have not observed a crash yet since re-enabling the exporter. It could be that some combination of factors triggers it; maybe I was polling more often while initially setting it up.

I will update this ticket with any new info.

@geekifier
Author

geekifier commented Dec 2, 2024

I have run many iterations of the commands via a for loop, and did not observe any increase in memory consumption.
The router has actually run fine since my last post, crashing only this morning (2024-12-02).
There was no spike in memory usage based on my monitoring.
[image attachment]

If there is no way to retrieve some sort of a core dump from EdgeOS, then I am not sure what else I could check.
For now, I will disable the service and monitor the uptime.

@chitoku-k
Owner

Thank you for taking the time to gather that information! It was quite helpful, because it shows that memory consumption is not related to this issue.

If these occasional crashes originate in the kernel, it may be necessary to keep the journal log persistent across boots. It seems this can be done by configuring /etc/systemd/journald.conf and restarting systemd-journald (just temporarily), though I haven't tested this before.
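
One way the journald change mentioned above might look is sketched below. Like the suggestion itself, it is untested on EdgeOS. The helper name `set_persistent_storage` is made up for this example. `Storage=persistent` is the standard journald.conf option that keeps logs under /var/log/journal, and `journalctl -b -1` reads the previous boot.

```sh
#!/bin/sh
# Set Storage=persistent in a journald.conf-style file.
# Usage: set_persistent_storage /path/to/journald.conf
# (hypothetical helper; relies on GNU sed for -i and -E)
set_persistent_storage() {
    sps_conf=$1
    if grep -Eq '^[#[:space:]]*Storage=' "$sps_conf"; then
        # Replace an existing (possibly commented-out) Storage= line.
        sed -i -E 's/^[#[:space:]]*Storage=.*/Storage=persistent/' "$sps_conf"
    else
        # No Storage= line yet: append one ([Journal] is the file's only section).
        echo 'Storage=persistent' >> "$sps_conf"
    fi
}

# On the router, as root (assumption: EdgeOS 2.x runs systemd-journald):
#   set_persistent_storage /etc/systemd/journald.conf
#   systemctl restart systemd-journald
# After the next crash and reboot:
#   journalctl --list-boots     # confirm that an earlier boot was retained
#   journalctl -k -b -1         # kernel messages from the previous boot
```

A persistent journal writes to the router's internal storage, so it is probably best reverted once a crash has been captured. A hard lockup may also leave the last few messages unwritten.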
