Skip to content

Conversation

@jschmidt-icinga
Copy link
Contributor

@jschmidt-icinga jschmidt-icinga commented Oct 21, 2025

Bug Description 🐞

When multiple state changes happen outside a TimePeriod, NotificationComponent::FireSuppressedNotifications() will clear notifications that don't apply too soon, leading the state of the suppressed notifications to become inconsistent.

for (auto type : {NotificationProblem, NotificationRecovery, NotificationFlappingStart, NotificationFlappingEnd}) {
if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
subtract |= type;
suppressedTypes &= ~type;
}
}

Ultimately this will cause any suppressed notifications to be discarded when the TimePeriod resumes.

Fix Description 🔧

The trivial fix is to just move the check on NotificationReasonApplies() inside the tp->IsInside() condition, thereby removing them after the TimePeriod resumes and leaving the list of suppressed notifications intact until no further notifications will be added to it.

I'm not entirely certain what the purpose of that check is, but the only scenario I could think of is when a TimePeriod begins and a check result produces a notification before NotificationTimerHandler can run and send out the suppressed notifications. I've added a unit-test to ensure that behavior is the same as before.

@Al2Klimov, since you're the one who wrote the original code, maybe you can give other scenarios where this check might be affecting and that could potentially stop working with this PR. In that case we can add additional test-cases and look for a more complex fix.

Unit-Tests ✅

I've added tests for some common scenarios including the bugged behavior of the issue this PR fixes. This way the behavior of the NotificationComponent can be verified before and after this PR. I'm happy to add additional test-cases for behavior that might be affected by this issue and the fix.

Fixes #10575

@cla-bot cla-bot bot added the cla/signed label Oct 21, 2025
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from a6ba476 to 20560af Compare October 21, 2025 13:19
Al2Klimov
Al2Klimov previously approved these changes Oct 21, 2025

if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
subtract |= type;
suppressedTypes &= ~type;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't read the issue, but the fix shouldn't hurt. After all, this is also the code branch where notifications are sent.

@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch 3 times, most recently from 1fd58ee to 22d7a0d Compare October 22, 2025 08:23
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch 2 times, most recently from c08038d to fea4554 Compare October 23, 2025 09:09
@jschmidt-icinga jschmidt-icinga marked this pull request as ready for review October 23, 2025 09:12
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from fea4554 to 9fa3acc Compare October 28, 2025 07:44
@jschmidt-icinga jschmidt-icinga modified the milestones: 2.16.0, 2.14.8 Oct 28, 2025
This includes a few common scenarios and a reproduction of the current behavior
affected by the underlying bug of issue #10575. This is done both to document
the change in behavior, as well as to ensure the behavior of the other scenarios
stays the same before and after the fix is applied.
Without this commit, every time the NotificationTimerHandler runs it
will discard notifications that don't apply to the reason of the latest
check result. This is probably intended to clear outdated suppressed
notifications immediately when the TimePeriod resumes, but it also clears
out important ones (see the test case).

This commit fixes that by clearing out inapplicable notifications when
the timer runs the first time after the TimePeriod resumes. By that time
we can expect that no new suppressed notifications will be added and all
notifications that don't conflict with the last check-result can still be
run.

Fixes #10575
@jschmidt-icinga jschmidt-icinga force-pushed the clear-suppr-notif-after-tp-resume branch from 9fa3acc to 41b5383 Compare October 28, 2025 14:32
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely certain what the purpose of that check is, but the only scenario I could think of is when a TimePeriod begins and a check result produces a notification before NotificationTimerHandler can run and send out the suppressed notifications.

That's pretty much the reasoning behind it. Its purpose is to avoid sending notifications that are no longer applicable, e.g. a problem notification when the checkable has already recovered. For instance, if a problem notification was suppressed due to being outside the time period, but recovers right after the time period starts and before the NotificationTimerHandler runs, it won't clear the suppressed problem notification, thus sending out the suppressed problem notification would be incorrect.

if (!(suppressedTypes & type) || checkable->NotificationReasonSuppressed(type))
continue;

if ((suppressedTypes & type) && !checkable->NotificationReasonApplies(type)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first check of this condition is superfluous and can be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and the whole if (suppressedTypes) ... surrounding this loop has also become redundant, thus you can either remove the whole if or the other if (!suppressedTypes) ... at the beginning of this function. Your choice.

Comment on lines 121 to 133
notification-notificationcomponent.cpp
remote-certificate-fixture.cpp
remote-filterutility.cpp
remote-configpackageutility.cpp
remote-httpserverconnection.cpp
remote-httpmessage.cpp
remote-url.cpp
${base_OBJS}
$<TARGET_OBJECTS:config>
$<TARGET_OBJECTS:remote>
$<TARGET_OBJECTS:icinga>
$<TARGET_OBJECTS:methods>
$<TARGET_OBJECTS:notification>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add these dependencies conditionally based on the ICINGA2_WITH_NOTIFICATION variable.

Comment on lines +14 to +15
// black magic
template<auto prvPtr>
Copy link
Member

@yhabteab yhabteab Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's indeed a black magic, but describing it a bit more what it does and it's used for would help readers. Also, does the abbreviated prv of the template param name stand for private or previous? Maybe use a bit of descriptive name instead.

template<auto prvPtr>
struct InvokeTimerHandlerImpl
{
friend void InvokeTimerHandler(const NotificationComponent::Ptr& nc) { return ((*nc).*prvPtr)(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
friend void InvokeTimerHandler(const NotificationComponent::Ptr& nc) { return ((*nc).*prvPtr)(); }
friend void InvokeTimerHandler(const NotificationComponent::Ptr& nc) { return (*nc.*prvPtr)(); }

// Store the old periods from the config snippets to reuse them later.
m_AllTimePeriod = m_TimePeriod->GetRanges();

Service::OnNotificationSentToUser.connect(NotificationSentToUserHandler);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using service to register the signal here looks odd! Why not just use Checkable instead?

Comment on lines +216 to +217
// Rerunning the timer before the next interval should not trigger a reminder notification.
NotificationTimerHandler();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the interval set to 0.15 at the beginning of this test case? How can you be sure that it's going to call this function before the 15ms interval has passed? Apart from that, the WaitForExpectedNotificationCount call above can block 5ms in the worst case, so I don't get how this all fit with what this comment says.

If I understand what this test is supposed to achieve, then I would just call NotificationTimerHandler() repeatedly before performing any assertions above instead, so you only get a very tight time window for race conditions and these assertions can be merged with the above ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Recovery notification outside time period still lost

4 participants