Skip to content

SPDX Compare Time Out #209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
sivakinfomagnus opened this issue May 6, 2025 · 5 comments
Open

SPDX Compare Time Out #209

sivakinfomagnus opened this issue May 6, 2025 · 5 comments

Comments

@sivakinfomagnus
Copy link

Hello Team

We are working on comparing SPDX files using Java-Tool by generating one from zephyr and other one from previous release

Looks like compare never finishes and time out

Any one facing similar issue or following any other methods which succeeded the compare ?

@goneall
Copy link
Member

goneall commented May 6, 2025

There have been some performance issues in the past when deleting SPDX elements, but that shouldn't impact the compare.

Compare is very compute intensive, but it shouldn't be timing out. Since dependencies can have cycles, I wonder if there isn't an infinite loop that not being handled properly.

The unit tests run through a number of compare scenarios, so this issue may be related to the size of the SPDX files or something unique about the SPDX files.

If you can point me to two SPDX files that reproduce the problem, I'll take a look.

@sivakinfomagnus
Copy link
Author

Attached for the SPDX files we are trying to compare

spdx_compare_files.zip

@goneall
Copy link
Member

goneall commented May 7, 2025

I did some analysis and debugging. There are 44,595 elements which are in the documentDescribes list.

The document describes should only point to a few elements - the "root of the tree", typically less than 10 elements, so this is unusually large.

To make sure these match, the algorithm iterates through each element and makes sure it finds an equivalent element in the comparison documentDescribes list. On my machine, it is taking about 1 second for each element search, or 0.02 ms for each individual element comparison. The equivalent comparison can be quite involved, so the individual comparison cost seems reasonable to me. The issue is more the size of the lists being compared and possibly the algorithm for the list comparison.

We may be able to improve the algorithm, but I would not expect any of the collections to be larger than 5,000 or so in practice.

@goneall
Copy link
Member

goneall commented May 11, 2025

I did some more analysis and experimentation to see if I could dramatically improve the performance of comparing very large SPDX element collections.

TL;DR - no success yet, but some interesting findings.

Following are some of the constraints when doing the compares:

  1. The collections do not have any guaranteed order
  2. When comparing elements, we use the equivalent methods to state the 2 elements are the same only if all their properties are the same
  3. Access to the element properties are stateless - if another process or thread changes a property value, the next time that property is accessed, the new value will show up - no stale values

Comparing all the property values can take some time, especially if a property value is also a collection of elements. We can get a recursive stack of comparing very large number of elements.

One thing I noticed in my experimentation is we tend to compare the same elements over and over (e.g. the collections being compared may be a property of another element which is being compared).

If we relax the stateless access (3 above) for this use case, we can keep track of the comparisons in a map.

I created prototype using this approach (reference this draft pull request).

Unfortunately, we now have a memory issue - I ran out of heap space running the comparison above.

I'm not quite sure yet what is consuming the memory. The maximum size of the map will be total number of elements to the factor of the number of documents. Since it is a Boolean map, I wouldn't expect it to be large enough to run us out of heap space. More analysis needs to be done.

If anyone is interested in looking into this, let me know.

@bact
Copy link
Collaborator

bact commented May 13, 2025

If we can guarantee the collection (of differences) is sorted, will it help the comparison?

The sorting order is a desired enhancement from this spdx/Spdx-Java-Library#232 too. So looks like a good investment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants