What is currently missing?
Because S3 has a large minimum chunk size (5 megabytes), our tests against S3 often have to compare large amounts of data. What invariably ends up happening is that we compare very large data structures (typically 10 MB or more) with each other to see if they are equal, which uses enough CPU that it can actually create bottlenecks.
How could this be improved?
There are 2 cases we have to deal with:
1. Doing a very fast comparison to check that large data structures are equal
2. If two data structures are not equal, finding a very fast way of figuring out what is wrong
Regarding 1, there isn't much we can do apart from potentially working with raw numbers rather than `String`s, on the suspicion that the `String`s generated by scalacheck cause too many hash collisions, which slows down the `equals` method. The easiest solution here may be to use a custom hash code that works better with the generated data. Alternatively, one can make sure that the data generated in `io.aiven.guardian.kafka.Generators` consists of incrementing numbers, which will generate fewer collisions; however, you then have to deal with converting the strings into numbers when implementing a custom hash code.
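As a rough illustration of the custom hash code idea, here is a minimal sketch that computes a hash once per chunk and uses it to reject unequal chunks cheaply before doing a full comparison. `HashedChunk` is a hypothetical name and not part of Guardian, and modelling records as a `Vector[String]` is an assumption made only for this example.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical wrapper (not part of Guardian): caches an order-sensitive hash
// of the records so that repeated comparisons do not rehash the whole chunk.
final class HashedChunk(val records: Vector[String]) {
  // Computed once at construction time.
  private val cachedHash: Int = MurmurHash3.orderedHash(records)

  override def hashCode: Int = cachedHash

  override def equals(other: Any): Boolean = other match {
    case that: HashedChunk =>
      // Cheap negative answer: different hashes mean the chunks cannot be equal.
      // Equal hashes still require the full comparison, since hashes can collide.
      cachedHash == that.cachedHash && records == that.records
    case _ => false
  }
}
```

This only speeds up the case where the chunks differ. Two equal chunks still need the full element-by-element comparison.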
Regarding 2, we currently use https://github.com/softwaremill/diffx to create nice diffs when the 2 data structures are not equal; however, diffx isn't designed to handle large data structures well. More specifically, it can handle cases where a single value in some data structure is wrong or missing, but what typically happens in our S3 tests is that entire chunks of data are missing (i.e. the backup file for a single chunk) rather than a single value. Using algorithms that quickly detect these large missing chunks and handle overlapping data, and only then falling back to slower, more general methods, could in theory greatly reduce the CPU time used. In other words, we want a fast path for the most common cause of test failures, with a fallback to the slower current algorithm (if possible).
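A minimal sketch of what such a fast path could look like, assuming the expected and actual data can be grouped by the key of the backup file they belong to. `ChunkDiff`, `Result` and the `Map[String, Vector[String]]` model are hypothetical and chosen only for illustration; the sketch also does not attempt to handle overlapping data.

```scala
// Hypothetical fast path (not part of Guardian or diffx): find whole chunks
// that are missing or unexpected via set operations on their keys, and only
// flag chunks present on both sides for the slower, detailed diff.
object ChunkDiff {
  final case class Result(
      missing: Set[String],    // chunks expected but absent from the actual data
      unexpected: Set[String], // chunks in the actual data that were not expected
      differing: Set[String]   // chunks present on both sides with different content
  )

  def compare(
      expected: Map[String, Vector[String]],
      actual: Map[String, Vector[String]]
  ): Result = {
    val missing    = expected.keySet.diff(actual.keySet)
    val unexpected = actual.keySet.diff(expected.keySet)
    val differing  = expected.keySet
      .intersect(actual.keySet)
      .filter(key => expected(key) != actual(key))
    Result(missing, unexpected, differing)
  }
}
```

Only the chunks listed in `differing` would then need to go through the slower, more general diff, so a test that fails because a whole backup file is missing can report that without diffing the full data set.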
Is this a feature you would work on yourself?
I plan to open a pull request for this feature