Fix overdue state doesn't account for timeperiods in HA cluster #10562
base: master
Conversation
So basically, the DB will only be updated in these cases:
- In the non-forced External Command Pipe commands
- Unconditionally via the RescheduleCheck ApiAction
- When scheduling a check that is not forced, is active, and `enable_service_checks` is true

In all other cases we just trigger internal state changes.
I'm not sure `SetNextCheck()` is the right place to make the decision about which event to persist to the DB, but I get that this is a low-impact change that fixes the issue, and so I don't want to restart the bikeshedding from #10082.
lib/checker/checkercomponent.cpp
Outdated
```diff
-			if (host && !service && (!checkable->GetEnableActiveChecks() || !icingaApp->GetEnableHostChecks())) {
+			activeChecksDisabled = !checkable->GetEnableActiveChecks() || !icingaApp->GetEnableServiceChecks();
+			if (activeChecksDisabled && host && !service) {
 				Log(LogNotice, "CheckerComponent")
 					<< "Skipping check for host '" << host->GetName() << "': active host checks are disabled";
 				check = false;
 			}
-			if (host && service && (!checkable->GetEnableActiveChecks() || !icingaApp->GetEnableServiceChecks())) {
+			} else if (activeChecksDisabled) {
```
Can you please explain this diff a bit more?
Why are we now checking icingaApp->GetEnableServiceChecks() in the host case? Or is that a mistake?
Is this change meaningful for what this PR does, or is it just a small refactoring because we need the information for the `activeChecksDisabled` flag?
Also, since we're touching this anyway, maybe it can be simplified to reduce duplication, since all that differs is the word "host"/"service" in the log message.
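For illustration, one way the two branches could collapse into one (a sketch reusing the names from the hunk above; the surrounding control flow is assumed, not taken from this PR):

```cpp
// Sketch: one branch for both object types; only the log wording differs.
bool activeChecksDisabled = !checkable->GetEnableActiveChecks()
	|| (service ? !icingaApp->GetEnableServiceChecks() : !icingaApp->GetEnableHostChecks());

if (activeChecksDisabled) {
	Log(LogNotice, "CheckerComponent")
		<< "Skipping check for " << (service ? "service" : "host")
		<< " '" << checkable->GetName() << "': active " << (service ? "service" : "host")
		<< " checks are disabled";
	check = false;
}
```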
Force-pushed from d8c1d07 to f8a4acb
This commit changes the ordering of CheckableScheduleInfo in the multi-index container to ensure that checkables with running checks are pushed to the end of the ordering. This prevents them from being prioritized for scheduling ahead of others, which could lead to unnecessary CPU load due to repeated scheduling attempts. By using a very large value for the index of checkables with running checks, they are effectively deprioritized until their current check is completed and they can be reinserted with their actual next check time.
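The core trick can be sketched like this (stand-in types only; the real `CheckableScheduleInfo` and the multi-index container live in `checkercomponent.hpp`):

```cpp
#include <limits>

// Stand-in for the real CheckableScheduleInfo, which holds a Checkable::Ptr.
struct CheckableScheduleInfo
{
	bool CheckRunning = false; // mirrors the checkable's running-check flag
	double NextCheck = 0;      // the actual next check timestamp
};

// Key-extractor idea for the scheduling index: entries with a running check
// sort behind everything else, so the scheduler won't pick them up again
// until the result arrives and they are re-inserted with their real time.
inline double EffectiveNextCheck(const CheckableScheduleInfo& csi)
{
	return csi.CheckRunning ? std::numeric_limits<double>::max() : csi.NextCheck;
}
```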
Since the scheduler accounts for already running checks, we only need to update the `next_check` timestamp in `Checkable::ExecuteCheck()` where it actually makes sense to do so, and for local checks it doesn't make sense at all. There are only two cases where we need to update the next check beforehand:

1) The execute command event is sent to a connected remote endpoint, so we need to set the next check to a time in the future until we actually receive the check result back from the remote endpoint. However, it must not be too far in the future, so that the check is still re-run in a reasonable time in case the remote endpoint never responds.

2) The check is a remote check, but either the endpoint is currently syncing replay logs or not connected at all, and we are within the magical 5min cold startup window. In these cases the check is effectively skipped and no check result will come in for it, so we manually update the next check as if the check had been executed.

In all other cases, either the check is executed locally, which means the `m_RunningCheck` flag already prevents the scheduler from re-running the check, or it is a remote check and the endpoint is not connected but we are outside the cold startup window, in which case we also don't do anything, as we've already called `Checkable::ProcessCheckResult()` with an appropriate error state, which in turn calls `Checkable::UpdateNextCheck()`.
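Condensed into a free-standing sketch of that decision (names and parameters are stand-ins, not the actual implementation):

```cpp
// Decide whether next_check must be bumped *before* executing the check.
// The booleans stand in for the real command-endpoint / API-listener checks.
void UpdateNextCheckBeforeExecution(bool remoteCheck, bool endpointConnected,
	bool withinColdStartupWindow, double now, double boundedDelay,
	double& nextCheck, bool& advanceAsIfExecuted)
{
	if (remoteCheck && endpointConnected) {
		// Case 1: wait for the remote result, but only for a bounded time,
		// so the check is retried if the endpoint never answers.
		nextCheck = now + boundedDelay;
	} else if (remoteCheck && withinColdStartupWindow) {
		// Case 2: the check is skipped entirely and no result will arrive,
		// so advance next_check as if the check had been executed.
		advanceAsIfExecuted = true;
	}
	// All other cases: do nothing here. Local checks are guarded by the
	// running-check flag, and a disconnected endpoint outside the window has
	// already gone through ProcessCheckResult() -> UpdateNextCheck().
}
```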
Force-pushed from f8a4acb to c7775eb
I've changed the implementation according to our offline discussion with @jschmidt-icinga yesterday, and with a bunch of comments added the change set is a bit bigger than before, but it's way better, I think. So please have a look! PS: I've also updated the PR description!
Would it be possible to add test cases for all permutations of state changes? The checker component doesn't have any dependencies of its own, and you could make a simple mock object connecting to the same events the databases receive. Then reschedule checks on checkables and verify everything is updated and consistent.

Also note that unit tests act as a kind of documentation too. Well-commented tests sometimes describe what the code does better than the code itself. (Speaking as someone who still doesn't grasp the entirety of check scheduling in its complexity.)

Edit: To clarify: I don't mean a full unit-test suite for the entire checker component, though that might be useful in the future, but for now maybe just a few cases for the specific issue fixed in this PR.
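A rough sketch of the suggested mock, meant to sit in the existing Boost.Test suite (include paths and the signal signature are taken from this repo and the diff above, and are assumed to still hold):

```cpp
#include "icinga/checkable.hpp"
#include <BoostTestTargetConfig.h>
#include <vector>

using namespace icinga;

BOOST_AUTO_TEST_CASE(reschedule_updates_next_check)
{
	std::vector<double> observed;

	// Mock "database": record every next-check change the DBs would receive.
	auto conn = Checkable::OnNextCheckChanged.connect(
		[&observed](const Checkable::Ptr& checkable, const Value&) {
			observed.push_back(checkable->GetNextCheck());
		});

	// ... build a host/service fixture here and reschedule its check ...

	BOOST_CHECK(!observed.empty()); // would fail as-is; the fixture above is elided
	conn.disconnect();
}
```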
This commit introduces a new, somewhat special `OnRescheduleCheck` signal that is emitted whenever we want to tell the checker to reschedule the checkable at a specific timestamp without actually changing the next check time. Previously, we called `SetNextCheck` with some arbitrary timestamp just to force the checker to pick the checkable up either immediately or at a specific time. At some point, subscribing to the `OnNextCheckChanged` signal thus became effectively unusable for any purpose other than informing the checker about a new next check time, which led to the introduction of a new signal that is solely responsible for informing Icinga DB and the IDO about a new next check time in the places where calling `SetNextCheck` did make sense.

This commit does quite the opposite: it replaces all calls to `SetNextCheck` that were only used to inform the checker about a new next check time with `OnRescheduleCheck` calls. Only the places where we actually want to change the next check time still call `SetNextCheck` and thus inform the checker and all other listeners about the new next check time. As a bonus, we also got rid of the two simultaneous object locks for child and parent.
It also removes the extra `SendNextUpdate()` call from the `NewCheckResultHandler` in Icinga DB: since Icinga DB is subscribed to the `NextCheckChanged` event anyway, and that event is always emitted before the `NewCheckResult` event is triggered, the call became redundant.
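To make the signal split concrete, here is a tiny self-contained illustration with a stand-in `Checkable` (the real signal signatures and call sites are those of this PR, not this sketch):

```cpp
#include <boost/signals2.hpp>
#include <iostream>

// Stand-in Checkable: OnRescheduleCheck only nudges the checker, while
// OnNextCheckChanged reports an actual change of the stored next check time.
struct Checkable
{
	double NextCheck = 0;

	// DB/IDO-style listeners subscribe here.
	static boost::signals2::signal<void (Checkable&, double)> OnNextCheckChanged;
	// The checker subscribes here; the persisted next check is untouched.
	static boost::signals2::signal<void (Checkable&, double)> OnRescheduleCheck;

	void SetNextCheck(double value)
	{
		NextCheck = value;
		OnNextCheckChanged(*this, value);
	}
};

boost::signals2::signal<void (Checkable&, double)> Checkable::OnNextCheckChanged;
boost::signals2::signal<void (Checkable&, double)> Checkable::OnRescheduleCheck;

int main()
{
	Checkable::OnNextCheckChanged.connect([](Checkable&, double v) {
		std::cout << "persist next_check=" << v << " to the DB\n";
	});
	Checkable::OnRescheduleCheck.connect([](Checkable&, double v) {
		std::cout << "checker: re-queue at " << v << " (no DB update)\n";
	});

	Checkable c;
	c.SetNextCheck(42);                 // informs checker *and* DB listeners
	Checkable::OnRescheduleCheck(c, 7); // informs only the checker
}
```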
Force-pushed from c7775eb to 0bcee67
I've added basic unit tests for the checker now as requested, so the PR size has drastically increased :)! But I hope the tests give you a little overview of how the checker works, or is actually supposed to work.
First of all, amazing work adding these tests with all that infrastructure in what must only have been a few hours of time.
I'm not at all concerned with the increase in diff size due to the tests, as it is pretty clear that the production code actually gets a reduction in complexity from this PR.
For now I'm only going to comment on some of the test methods used, but I'm also continuing to look at what they do and the rest of the PR, which looks very promising.
If the logger is started with `Activate()` before `SetActive()`, it won't log anything: the "min severity" value is only updated from loggers when they are started, and loggers that aren't active at that point are simply ignored, so the min severity remains at info.
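If I read this right, the fix on the test side is just a matter of ordering, roughly like below (a sketch only; the exact setter and `Activate()` signatures in the test fixture are assumed):

```cpp
// Sketch: mark the logger active *before* starting it, otherwise it is
// skipped when the minimum log severity is recalculated on activation and
// the effective min severity stays at "information".
logger->SetSeverity("debug");
logger->SetActive(true);   // must come first, per the comment above
logger->Activate();        // now the logger's severity is taken into account
```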
Force-pushed from 0bcee67 to 5f3b7d2
I think I've addressed all your requested changes. Plus, our checker should now also have roughly 70-90% code coverage. I've added some more unit tests to cover remotely executed checks aka
Test: redis-cli MONITOR with one check.
c2c9b6b
1759312891.029505 [0 192.168.65.1:60842] "HSET" "icinga:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"affects_children\":false,\"check_attempt\":1,\"check_commandline\":\"'dummy'\",\"check_source\":\"ws-aklimov7777777.local\",\"check_timeout\":60000,\"environment_id\":\"824bc16396c1df319c0a1e50a335de0df3d2ce9d\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":false,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1759312863888,\"last_update\":1759312891027,\"latency\":7,\"next_check\":1759312901019,\"next_update\":1759312911035,\"output\":\"Check was successful.\",\"previous_hard_state\":99,\"previous_soft_state\":99,\"scheduling_source\":\"ws-aklimov7777777.local\",\"severity\":0,\"soft_state\":0,\"state_type\":\"hard\"}"
1759312891.029533 [0 192.168.65.1:60842] "HSET" "icinga:checksum:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"checksum\":\"acb9e73a6beec90b4ed53726d693d85f30085e77\"}"
1759312891.029545 [0 192.168.65.1:60842] "ZADD" "icinga:nextupdate:host" "1759312911.035350" "d7e93a933920b2a35baf2479691400a961d92311"
5f3b7d2
1759313097.299299 [0 192.168.65.1:32736] "HSET" "icinga:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"affects_children\":false,\"check_attempt\":1,\"check_commandline\":\"'dummy'\",\"check_source\":\"ws-aklimov7777777.local\",\"check_timeout\":60000,\"environment_id\":\"824bc16396c1df319c0a1e50a335de0df3d2ce9d\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":false,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1759312863888,\"last_update\":1759313097297,\"latency\":6,\"next_check\":1759313106638,\"next_update\":1759313116651,\"output\":\"Check was successful.\",\"previous_hard_state\":99,\"previous_soft_state\":99,\"scheduling_source\":\"ws-aklimov7777777.local\",\"severity\":0,\"soft_state\":0,\"state_type\":\"hard\"}"
1759313097.299331 [0 192.168.65.1:32736] "HSET" "icinga:checksum:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"checksum\":\"6a8714aa442b2d076193cfa3baf4603a35cf5fec\"}"
1759313097.299346 [0 192.168.65.1:32736] "ZADD" "icinga:nextupdate:host" "1759313116.651438" "d7e93a933920b2a35baf2479691400a961d92311"
1759313097.300295 [0 192.168.65.1:32736] "HSET" "icinga:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"affects_children\":false,\"check_attempt\":1,\"check_commandline\":\"'dummy'\",\"check_source\":\"ws-aklimov7777777.local\",\"check_timeout\":60000,\"environment_id\":\"824bc16396c1df319c0a1e50a335de0df3d2ce9d\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":false,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1759312863888,\"last_update\":1759313097297,\"latency\":6,\"next_check\":1759313106638,\"next_update\":1759313116651,\"output\":\"Check was successful.\",\"previous_hard_state\":99,\"previous_soft_state\":99,\"scheduling_source\":\"ws-aklimov7777777.local\",\"severity\":0,\"soft_state\":0,\"state_type\":\"hard\"}"
1759313097.300320 [0 192.168.65.1:32736] "HSET" "icinga:checksum:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"checksum\":\"6a8714aa442b2d076193cfa3baf4603a35cf5fec\"}"
Conclusion
Smells like #9501, I didn't even test state changes yet.
Consider a dead simple additional OnNextCheckUpdated cluster event instead.
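For reference, such an event could follow the pattern of the existing cluster event handlers in `lib/icinga/clusterevents.cpp`; a rough sketch (handler name, message fields, and the reschedule call are assumptions, not code from this PR):

```cpp
// Hypothetical extra cluster event that only nudges the local scheduler.
REGISTER_APIFUNCTION(NextCheckUpdated, event, &ClusterEvents::NextCheckUpdatedAPIHandler);

Value ClusterEvents::NextCheckUpdatedAPIHandler(const MessageOrigin::Ptr& origin, const Dictionary::Ptr& params)
{
	Host::Ptr host = Host::GetByName(params->Get("host"));
	if (!host)
		return Empty;

	Checkable::Ptr checkable;
	if (params->Contains("service"))
		checkable = host->GetService(params->Get("service"));
	else
		checkable = host;

	if (!checkable)
		return Empty;

	// Only re-queue the checkable at the transmitted timestamp; the
	// persisted next_check is not touched here (signal per this PR, assumed).
	Checkable::OnRescheduleCheck(checkable, params->Get("next_check"));
	return Empty;
}
```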
```cpp
	 * the magical 5min cold startup window. In both cases, we just don't do anything and wait for
	 * the next check interval to re-try the check again. So, this check is effectively skipped.
	 */
	UpdateNextCheck();
```
Too bad if my check runs once a day. (The else if case at least schedules the next check just one retry interval in the future.)
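A sketch of the alternative this hints at, using the existing accessors (whether this is the right trade-off is exactly the open question here):

```cpp
// Sketch: when the remote check is skipped during the cold startup window,
// retry after the retry interval instead of waiting a full check interval
// (which may be a whole day for such checks).
checkable->SetNextCheck(Utility::GetTime() + checkable->GetRetryInterval());
```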
```cpp
	if (child->GetProblem() && child->GetEnableActiveChecks()) {
		auto nextCheck (now + Utility::Random() % 60);

		ObjectLock oLock (child);
```
Sure the lock is no longer needed? Below you still have a get and kind of a set operation.
```cpp
		continue;

	if (parent->GetNextCheck() >= now + parent->GetRetryInterval()) {
		ObjectLock olock(parent);
```
Same here. While we're at it, I guess it makes sense for the lock to cover both operations, get and set, just as in the child case above.
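In other words, something along these lines (a sketch only; whether the update is `SetNextCheck` or the new reschedule signal is left as in the PR, and the signal's exact signature is assumed):

```cpp
// Sketch: let the ObjectLock span both the read and the subsequent update,
// mirroring the child case above.
{
	ObjectLock olock(parent);

	if (parent->GetNextCheck() >= now + parent->GetRetryInterval()) {
		// illustrative jitter, as in the child case
		Checkable::OnRescheduleCheck(parent, now + Utility::Random() % 60);
	}
}
```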
```diff
 	delta *= (double)std::rand() / RAND_MAX;
-	SetNextCheck(now + delta);
+	// We only want to jitter the next check a bit, and inform the scheduler about it, so not setting it directly.
+	SetNextCheck(now + delta, true);
```
To "inform the scheduler about it" you suppress the signal?
```diff
 	});
 
-	Checkable::OnNextCheckUpdated.connect([](const Checkable::Ptr& checkable) { NextCheckUpdatedHandler(checkable); });
+	Checkable::OnNextCheckChanged.connect([](const Checkable::Ptr& checkable, const Value&) { NextCheckUpdatedHandler(checkable); });
```
Seems you missed renaming `NextCheckUpdatedHandler`.
```cpp
	checkable->SetNextCheck(params->Get("next_check"), false, origin);

	bool isOriginClientOld = params->Get("origin_client_old").ToBool();
	if (isOriginClientOld || endpoint->GetIcingaVersion() < 21600) { // TODO: Exact required version??
```
You don't have to worry about exact versions if you introduce a new capability.
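For illustration, the capability-based check could look roughly like this (the enum value and its bit are hypothetical; the pattern follows the existing `ApiCapabilities` handling in the API layer):

```cpp
// Hypothetical new capability bit, announced by peers that understand the
// new reschedule/next-check handling (value chosen arbitrarily here).
enum class ApiCapabilities : uint64_t
{
	// ... existing capabilities ...
	NotifiesNextCheckChanges = 1u << 5,
};

// Sketch: gate on the announced capability instead of a version number.
bool peerUnderstandsReschedule =
	endpoint->GetCapabilities() & (uint64_t)ApiCapabilities::NotifiesNextCheckChanges;

if (isOriginClientOld || !peerUnderstandsReschedule) {
	// fall back to the old behaviour for peers without the new capability
}
```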
```cpp
	 * queues and ensures that checks are not fired multiple times. ProcessCheckResult()
	 * is called too late. See #6421.
	 */
	UpdateNextCheck();
```
Why is this no longer necessary, especially in the local check case? Sure, now you return early if !!m_CheckRunning. But where is this checkable re-indexed inside the scheduler queue?
So, after the long discussions in #10082 that didn't lead to a final conclusion, here's a minimal implementation without too many refactorings and changes to existing logic. This PR changes the checker queueing behaviour a bit to never reschedule checkables with ongoing checks. Thus, we avoid piling up checkables with running checks in the queue that will never be executed anyway. Though, since this change affects a critical part of Icinga 2, we should be very careful not to introduce any regressions or performance degradations. Thus far, I've let it run some checks with 500ms intervals and I didn't notice any issues (see the attached graph), though the graph shows a small hiccup at "10:35" caused by some API requests triggered by me.

Another behaviour change is that previously the checker didn't trigger the now-dropped `Checkable::OnNextCheckUpdated` signal when active checks were disabled. However, since the Icinga DB `SendNextUpdate` function already has logic to send a `ZREM` Redis command when active checks are disabled, the checker now also triggers the `Checkable::OnNextCheckChanged` signal unconditionally.

Now, to the individual changes: each commit explains what it does and why it's needed, and contains some more details about the reasoning behind it. So please have a look at the individual commits.
Tests
Just like before, a simple `/v1/actions/reschedule-check` request triggers two `icinga:nextupdate:host` events... while a `/v1/actions/process-check-result` only triggers a single `icinga:nextupdate:` event.

1758547833.559175 [12 172.18.0.1:57800] "HSET" "icinga:host:state" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "{\"affects_children\":false,\"check_attempt\":1,\"check_source\":\"mbp-yhabteab.fritz.box\",\"check_timeout\":60000,\"environment_id\":\"30e34a7a6c432baf29e051ed428aee3825f25e67\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"4c80e4ad47034480b3c3bb28885970b6bb50983e\",\"id\":\"4c80e4ad47034480b3c3bb28885970b6bb50983e\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":true,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1758547833522,\"last_update\":1758547833522,\"latency\":0,\"next_check\":1758548133522,\"next_update\":1758548193522,\"normalized_performance_data\":\"wrong=85;0;0;0 ok=1;0;0;0 max=215.600000;0;0;0\",\"output\":\" /dev/dm-13: 1.0MiB/s read1, 4.3KiB/s write1, 1.0MiB/s total, 215.6MiB/s max\",\"performance_data\":\"wrong=85;0;0;0; ok=1;0;0;0; max=215.6;0;0;0;\",\"previous_hard_state\":0,\"previous_soft_state\":0,\"scheduling_source\":\"mbp-yhabteab.fritz.box\",\"severity\":2080,\"soft_state\":1,\"state_type\":\"soft\"}"
1758547833.559289 [12 172.18.0.1:57800] "HSET" "icinga:checksum:host:state" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "{\"checksum\":\"0ed63c1311be890a7ad5ffae340a3489db7d57ca\"}"
1758547833.559305 [12 172.18.0.1:57800] "ZADD" "icinga:nextupdate:host" "1758548193.522986" "4c80e4ad47034480b3c3bb28885970b6bb50983e"
1758547833.559430 [12 172.18.0.1:57800] "XADD" "icinga:runtime:state" "MAXLEN" "~" "1000000" "*" "runtime_type" "upsert" "redis_key" "icinga:host:state" "checksum" "0ed63c1311be890a7ad5ffae340a3489db7d57ca" "affects_children" "0" "check_attempt" "1" "check_source" "mbp-yhabteab.fritz.box" "check_timeout" "60000" "environment_id" "30e34a7a6c432baf29e051ed428aee3825f25e67" "execution_time" "0" "hard_state" "0" "host_id" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "id" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "in_downtime" "0" "is_acknowledged" "0" "is_active" "1" "is_flapping" "0" "is_handled" "0" "is_problem" "1" "is_reachable" "1" "is_sticky_acknowledgement" "0" "last_state_change" "1758547833522" "last_update" "1758547833522" "latency" "0" "next_check" "1758548133522" "next_update" "1758548193522" "normalized_performance_data" "wrong=85;0;0;0 ok=1;0;0;0 max=215.600000;0;0;0" "output" " /dev/dm-13: 1.0MiB/s read1, 4.3KiB/s write1, 1.0MiB/s total, 215.6MiB/s max" "performance_data" "wrong=85;0;0;0; ok=1;0;0;0; max=215.6;0;0;0;" "previous_hard_state" "0" "previous_soft_state" "0" "scheduling_source" "mbp-yhabteab.fritz.box" "severity" "2080" "soft_state" "1" "state_type" "soft"

fixes #10082