Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Affected hosts on Vulnerability does not match #24319

Closed
qa-wolf bot opened this issue Dec 3, 2024 · 18 comments
Closed

Affected hosts on Vulnerability does not match #24319

qa-wolf bot opened this issue Dec 3, 2024 · 18 comments
Assignees
Labels
bug Something isn't working as documented ~frontend Frontend-related issue. #g-software Software product group :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. ~released bug This bug was found in a stable release.
Milestone

Comments

@qa-wolf
Copy link

qa-wolf bot commented Dec 3, 2024

Steps to reproduce:

  1. Navigate to Software > Vulnerabilities
  2. Click a vulnerability
  3. See number of "affected hosts"
  4. Click "View Affected Hosts"
    Ran from 11AM - 11:15AM
    Expected: Expect number of hosts on page to match number of affected hosts in vulnerabilities page.
    Actual: Number of hosts on page do not match number of affected hosts in vulnerabilities page.
    Video:
    https://www.loom.com/share/30f1b0a0b2684dadaf8a46b5b054ee45 (https://www.loom.com/share/30f1b0a0b2684dadaf8a46b5b054ee45)
@sharon-fdm sharon-fdm added bug Something isn't working as documented :reproduce Involves documenting reproduction steps in the issue :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group labels Dec 4, 2024
@lukeheath lukeheath added the ~unreleased bug This bug was found in an unreleased version of Fleet. label Dec 6, 2024
@lukeheath lukeheath added this to the 4.62.0-tentative milestone Dec 11, 2024
@jmwatts
Copy link
Member

jmwatts commented Dec 12, 2024

Found this old bug: #21848

This appears to be the same issue.

This might happen because the count on the /software/vulnerabilities list view is populated from the vulnerability_host_counts table which is updated periodically based on the cron job that runs once per hour. I believe this is because vulnerabilities calculations across many hosts is an expensive process.

It seems the host count on the /hosts/manage page is calculated on page load, so the numbers could be slightly different, depending on if something has changed on a few of the hosts since the last time the cron job has run.

You can test this by finding vulnerable software currently installed on a host on the Software >> Vulnerabilities page, then on the host, deleting the software and refetching host vitals on the "My device" page. Once that has finished, navigate back to the Software >> Vulnerabilities page and filter by the CVE. The host count will have not updated on that page yet, but if you click "view affected hosts" the host for which you resolved the vulnerability will no longer be listed and the total count will be off by 1.

Tagging @noahtalerman to confirm this behavior is expected based on the way the UI retrieves and displays information, or if further investigation into alternative methods are warranted.

@jmwatts jmwatts removed :reproduce Involves documenting reproduction steps in the issue ~unreleased bug This bug was found in an unreleased version of Fleet. labels Dec 12, 2024
@noahtalerman
Copy link
Member

noahtalerman commented Dec 16, 2024

This might happen because the count on the /software/vulnerabilities list view is populated from the vulnerability_host_counts table which is updated periodically based on the cron job that runs once per hour. I believe this is because vulnerabilities calculations across many hosts is an expensive process.

It seems the host count on the /hosts/manage page is calculated on page load, so the numbers could be slightly different, depending on if something has changed on a few of the hosts since the last time the cron job has run.

@jmwatts that's right. It's likely this is confusing UX rather than a bug.

We can confirm that it's confusing UX (not a bug) by manually triggering the cron job and making sure the updated count on Vulnerabilities table matches the count on the Hosts page.

@mostlikelee mostlikelee added the #g-software Software product group label Dec 16, 2024
@sharon-fdm sharon-fdm added the ~frontend Frontend-related issue. label Dec 18, 2024
@sharon-fdm
Copy link
Collaborator

Hey team! Please add your planning poker estimate with Zenhub @jacobshandling @RachelElysia

@iansltx
Copy link
Member

iansltx commented Dec 19, 2024

Looks like running the vulnerabilities cron and confirming that it's complete should cause these numbers to match up. If the numbers don't match after the vulns run, we need to dig a bit more.

@iansltx iansltx self-assigned this Dec 19, 2024
@iansltx
Copy link
Member

iansltx commented Dec 19, 2024

@noahtalerman Maybe it makes sense to add a "host counts updated at" UI element to the individual vuln page...we have this in the API endpoint that already gets pulled for the vuln info...so it's more obvious when problems like this show up because the vulnerabilities cron hasn't run recently?

@iansltx
Copy link
Member

iansltx commented Dec 19, 2024

Took a look at the numbers myself and there's a significant discrepancy even with a recent vulns cron run. Awaiting a DB dump so I can troubleshoot further, as I figure this'll require query introspection etc.

@iansltx
Copy link
Member

iansltx commented Dec 19, 2024

For my own reference, the vuln being looked at is CVE-2024-44187, which is cross-platform and has a mix of OSes and software affected.

@iansltx
Copy link
Member

iansltx commented Dec 19, 2024

So, pulling down this environment does get me differing host counts between the filter and the calculation on the vulnerability page. Catch is, if I rerun the vulns cron using main (or the branch in #24914) I get 845 hosts matching the vulnerability, and host count matches between the vulnerability page and the host filter.

What's weird here is the vulnerabilities cron is showing as completed hourly as expected in the DB snapshot. My gut feeling is that we're dealing with an incomplete vulns run due to DB inconsistencies because this environment has at times run off of main, and that's resulted in DB migrations that later got tweaked or renumbered. Creating a ticket to clear that stuff up.

Once we have a clean snapshot in these environments...and once we're running off of a build including #24914, we should be able to properly analyze this, but until those two items are closed out I think anything we see here will be a red herring, so putting this ticket on the back burner until those two are merged.

@noahtalerman
Copy link
Member

Maybe it makes sense to add a "host counts updated at" UI element to the individual vuln page

@iansltx agreed this makes sense. I think not adding an "updated at" timestamp for the Vulnerability details page might have been a Product Design miss.

It looks like we're also missing the "updated at" timetamp on the OS version page:
Screenshot 2024-12-20 at 4 48 53 PM

And we're missing it on the Software version details page:

Screenshot 2024-12-20 at 4 50 44 PM

We only have it on the Software title details page:
Screenshot 2024-12-20 at 4 49 24 PM

@eugkuo I passed this bug to you. I think it's needs some Product Design help to determine where to put this timestamp. Maybe it shows up on hover over "Hosts"?

Screenshot 2024-12-20 at 4 52 07 PM

@noahtalerman noahtalerman added :product Product Design department (shows up on 🦢 Drafting board) and removed :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group labels Dec 20, 2024
@noahtalerman noahtalerman assigned eugkuo and unassigned iansltx Dec 20, 2024
@iansltx
Copy link
Member

iansltx commented Dec 20, 2024

@noahtalerman So, there may be a genuine underlying issue (beyond #24933) in the counts themselves, rather than just lack of visibility on when they're updated. Mind if I split off the UX changes from your comment into a new issue (still assigned to @eugkuo), while keeping this one for troubleshooting vuln counts in the QAWolf environment? I think that split also helps QAWolf keep track of things, vs. switching the scope of this ticket.

@noahtalerman
Copy link
Member

Mind if I split off the UX changes from your comment into a new issue (still assigned to @eugkuo), while keeping this one for troubleshooting vuln counts in the QAWolf environment?

@iansltx sounds good to me 👍

@iansltx iansltx self-assigned this Dec 20, 2024
@iansltx iansltx removed the :product Product Design department (shows up on 🦢 Drafting board) label Dec 20, 2024
@iansltx iansltx added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group and removed #g-software Software product group labels Dec 20, 2024
@iansltx
Copy link
Member

iansltx commented Dec 20, 2024

UI improvements split :)

@iansltx
Copy link
Member

iansltx commented Dec 23, 2024

So, I triggered the vulnerabilities job on the QAWolf environment and I think there's an assumption baked into the total count where it shouldn't be.

Specifically, the total count seems to assume that for a given CVE a given host will either have an OS-level vulnerability or a software-level vulnerability, but not both. For this particular vuln both macOS and Safari have the vuln, so there are two hosts that are double-counted.

Question is whether we're willing to take the performance hit for correctly counting the intersection of hosts and vulns cross-OS vs. just summing host counts, but I'm guessing the answer is "probably yes because we want these numbers to be accurate."

Resolving that discrepancy gets us down to a difference of one host, which I narrowed down to a stale count issue I think (the off-by-one persists when clicking through on a Linux package, 662 hosts vs. 663, so I know what's causing it). Will validate this assumption shortly, at which point all discrepancies will have been accounted for.

@iansltx
Copy link
Member

iansltx commented Dec 23, 2024

Looks like the issue is with the vulnerability_host_counts calculation, as we have a single table per vuln (vs. trying to sum OS level and software level on-read). I'll figure out the fix there and stack on top of #24914.

@iansltx
Copy link
Member

iansltx commented Dec 30, 2024

My previous diagnosis was incorrect here. The host count mismatch being the same as I'd expect from SELECT * vs. SELECT DISTINCT was coincidental. The query for inserting host counts behaves correctly, as MySQL automatically filters duplicate rows when performing a UNION, and since we do a UNION across OS vulnerabilities and software vulnerabilities host IDs get deduplicated implicitly. I wrote some local tests to simplate the scenario where a CVE affected the OS and software on the same host and couldn't get it to fail, and the above is why.

The actual reason for the off-by-a-few is that we have some number of host OS and host software entries that didn't get cleaned up in the QA environment when those hosts were deleted. This was likely a transient issue, as we do have those "manual cascades" in the deletion method in the hosts part of the data store code, so the fix here is to delete the orphaned rows manually:

DELETE FROM host_software WHERE host_id NOT IN (SELECT id FROM hosts);
DELETE FROM host_operating_system WHERE host_id NOT IN (SELECT id FROM hosts);

@rfairburn I'm assigning this ticket to you to run the above queries. Assign back to me when done and I'll trigger the crons and confirm that things match up (see below for context there).

Another contributor to counts being off, at least locally, is the vulnerability host counts check using stale host OS count data. This is more obvious for host OS data because the mapping table there is also materialized by a cron (cleanups_then_aggregation), so if the vulns job runs with stale data and then the cleanups job runs immediately thereafter then the portion of host counts stemming from OS-level vulnerabilities will be incorrect, so the overall number will be off. This issue seems to be a little less prevalent in the QAWolf environment, but can bite us if the order of crons is wrong. Likely something we want to fix, even if it means re-aggregating OS counts in the vulns job right before pulling together vuln host counts.


With the following steps I got counts to match up across the board, on the vulnerability mentioned above (which includes hosts with both OS- and software-level vulns for the same CVE):

  1. Import the database and catch migrations up
  2. Manually edit app config to turn off host expiry (since I'm running on an older snapshot at this point, and cleanups-then-aggregations would nuke all my hosts otherwsie)
  3. Run the DELETE queries above to get rid of orphaned host data
  4. Run cleanups_then_aggregation cron
  5. Run vulnerabilities cron

At that point I got 1305 hosts with the vulnerability:

  • 655 synthetic macOS hosts with just the OS-level vulnerability
  • 2 real macOS hosts with both OS-level and software-level vulnerabilities
  • 648 synthetic Linux hosts with software vulnerabilities (4 vulnerable packages per host)

All affected host listings (entire CVE, per-OS, per-software) matched the counts on the vulnerability detail page. I also spot-checked no-team and normal-team counts/filters for the same CVE and they match as well, so I'm confident that the above is the extent of this issue.

@iansltx iansltx assigned rfairburn and unassigned iansltx Dec 30, 2024
@rfairburn rfairburn added the :help-customers Customer success issue. label Dec 30, 2024
@rfairburn
Copy link
Contributor

MySQL [fleet]> DELETE FROM host_software WHERE host_id NOT IN (SELECT id FROM hosts);
Query OK, 1699 rows affected (1.648 sec)

MySQL [fleet]> DELETE FROM host_operating_system WHERE host_id NOT IN (SELECT id FROM hosts);
Query OK, 5 rows affected (0.006 sec)

@rfairburn rfairburn assigned iansltx and unassigned rfairburn Jan 2, 2025
@rfairburn rfairburn removed the :help-customers Customer success issue. label Jan 2, 2025
@mostlikelee mostlikelee added #g-software Software product group and removed #g-endpoint-ops Endpoint ops product group labels Jan 2, 2025
@iansltx
Copy link
Member

iansltx commented Jan 2, 2025

Just reran cleanups_then_aggregation followed by vulnerabilities crons on the qawolf-premium env and got host counts matching filtered counts for the aforementioned CVE. Host counts also added up to the correct amount, accounting for hosts that had multiple vulnerabilities, including ones that had an OS-level vuln plus a software-level vuln for the same CVE.

Pinging QAWolf to verify on their end and hopefully close (since follow-on work for showing freshness indicators is tracked separately).

@qa-wolf qa-wolf bot closed this as completed Jan 2, 2025
@fleet-release
Copy link
Contributor

Vulnerabilities,
Count and match, like cloud drops,
Safety in numbers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as documented ~frontend Frontend-related issue. #g-software Software product group :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. ~released bug This bug was found in a stable release.
Projects
None yet
Development

No branches or pull requests

9 participants