
Multiple performance regressions in 3.3.1 compared to 3.2.1 #1915

Open
peterurbanec opened this issue Nov 11, 2024 · 4 comments

Comments

@peterurbanec
Contributor

There appear to be multiple performance regressions in OpenEXR 3.3.1 when compared to OpenEXR 3.2.1. These regressions were first observed in real-world use of an application. To facilitate regression testing, I have backported exrmetrics to build against OpenEXR 3.2.1 and created a simple series of tests. Please see https://github.com/peterurbanec/exr_perf

To summarise, the largest real-world performance losses are observed with DWA compression, where the slowdown is up to 58% when writing half images. Results from exrmetrics also show slowdowns with DWA, although only at around 15%.

More details, taken from https://github.com/peterurbanec/exr_perf/README.md:


Change in performance, measured with the backported version of exrmetrics from this repository.
The quantity measured is the time taken for one frame, therefore positive Δ % values indicate
a performance regression. These figures were collected on an AMD Ryzen Threadripper PRO 3975WX
running Linux.

exrmetrics          3.2.1           3.3.1                 3.2.1            3.3.1
              Encode time     Encode time    Δ %    Decode time      Decode time     Δ %
none-half       0.0506104       0.0538534     6%      0.0147626       0.00607923    -59%
none-float      0.1009070       0.1006210    -0%      0.0210363       0.00938430    -55%
rle-half        0.0742712       0.0735408    -1%      0.0315694       0.02286890    -28%
rle-float       0.1433720       0.1428730    -0%      0.0468555       0.03499330    -25%
zips-half       0.1403540       0.1435510     2%      0.0413822       0.03292240    -20%
zips-float      0.2781060       0.2841830     2%      0.0759776       0.06599230    -13%
zip-half        0.1490650       0.1478110    -1%      0.0362129       0.02737780    -24%
zip-float       0.3176970       0.3146880    -1%      0.0757593       0.06454440    -15%
piz-half        0.1664580       0.1329990   -20%      0.0470330       0.04469030     -5%
piz-float       0.5811040       0.5301250    -9%      0.1553520       0.11521600    -26%
pxr24-half      0.1577990       0.1601900     2%      0.0300827       0.02612540    -13%
pxr24-float     0.2518220       0.2486510    -1%      0.0582780       0.03983260    -32%
b44-half        0.0790941       0.0811910     3%      0.0132777       0.01222920     -8%
b44-float       0.1036020       0.1121230     8%      0.0210958       0.00707329    -67%
b44a-half       0.0723179       0.0742420     3%      0.0113973       0.01050320     -8%
b44a-float      0.1024170       0.1032410     0%      0.0214426       0.00713971    -67%
dwaa-half       0.2754550       0.2919960     6%      0.0386784       0.04521370     17%  *
dwaa-float      0.3004660       0.3176720     6%      0.0591789       0.06652650     12%  *
dwab-half       0.2452280       0.2807360    15%      0.0395200       0.04543520     15%  *
dwab-float      0.2713410       0.2937500     8%      0.0637925       0.06695870      5%  *

Change in performance as observed in a real-world application. The quantity measured is frames
per second, therefore a negative Δ % indicates a performance regression. These figures were
collected on an i7-13800H running Windows.

                         3.2.1        3.3.1               3.2.1       3.3.1        
                    Encode fps   Encode fps    Δ %   Decode fps  Decode fps    Δ % 
 HD-half-DWA32(45)        28.2         11.8    -58         71.8        66.2     -8  *
 HD-float-DWA32(45)       16.7          9.0    -46         31.0        35.9     15  *
 HD-half-DWA256(45)       16.4         14.8    -10         43.1        38.7    -10  *
 HD-float-DWA256(45)       9.4          9.2     -2         21.7        24.5    -13  *
 HD-half-piz              31.7         29.7     -6         64.4        58.6     -9  *
 HD-half-Pxr24            19.2         18.9     -2         79.5        73.6     -7
 HD-float-Pxr24           20.4         24.2     19         39.6        43.2      9
 HD-half-B44A             19.3         19.8      3         70.8        68.2     -3 
 HD-float-B44A            17.7         18.1      2         32.4        40.7     26 

Rows marked with * are of particular concern.
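For reference, the Δ % columns in both tables follow the usual relative-change convention. A minimal sketch of that calculation (not the actual spreadsheet formulas used to build the tables), with a worked example taken from the none-half row above:

```cpp
#include <cstdio>

// Relative change in percent: positive means the 3.3.1 figure is larger.
// For the timing table a positive delta is a regression; for the fps table
// a negative delta is a regression.
static double deltaPercent (double v321, double v331)
{
    return (v331 - v321) / v321 * 100.0;
}

int main ()
{
    // none-half row from the exrmetrics timing table above
    std::printf ("encode: %+.0f%%\n", deltaPercent (0.0506104, 0.0538534));  //  +6%
    std::printf ("decode: %+.0f%%\n", deltaPercent (0.0147626, 0.00607923)); // -59%
    return 0;
}
```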

@kdt3rd
Contributor

kdt3rd commented Nov 11, 2024

Thanks for that testing - any way you could contribute the wrapper scripts you used to create the above? (I see the separate repo, just thinking it'd be good to integrate it into the test metrics.)

It would appear dwa (which is an encoding I don't ever use) has been hard hit, so I will take a look at that. As you point out, it is likely that something has inadvertently changed in the compression routines, specifically for piz and dwa, as that would be the only reason for a regression on the encoding side.

@peterurbanec
Contributor Author

The results above were generated by running make test in the above-mentioned repository; the output was then copied and pasted into LibreOffice Calc. A bit of cut-n-paste and a few formulas later I had the data, which I then further formatted in a text editor. I don't have a convenient automated process. 😞

I agree it would be good to have test metrics integrated. I'll look into adding a more comprehensive testing mode to exrmetrics in my spare time (if I can find some). It's unlikely to be ready for a merge request in the next week or two.

@peterurbanec
Contributor Author

I've got some local code that adds a benchmarking mode to exrmetrics. I plan on contributing that code at some stage this week, time permitting.

In the meantime, here are some results from testing on Linux on a Threadripper PRO 3975WX (32C/64T), nominally clocked at 2.2 GHz. In general it seems that reads have improved and benefit from concurrency up to the number of cores, not threads. For writes, the results are all over the place. For example, piz-half sees a 22% improvement single-threaded, but slowdowns everywhere else. DWA is always slower. If one were to average the performance deltas, multi-threaded write performance is down and single-threaded write performance has seen a very small improvement, but only due to the anomalous piz-half gain.

[image: table of exrmetrics benchmark results across thread counts]
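For anyone who wants to reproduce the read-side numbers before the exrmetrics benchmarking mode lands, here is a minimal sketch of a timing loop over thread counts using the OpenEXR RGBA interface. The thread counts and file name are placeholders, and this is not the exrmetrics code itself:

```cpp
#include <ImfRgbaFile.h>
#include <ImfArray.h>
#include <ImfThreading.h>
#include <chrono>
#include <iostream>

int main (int argc, char* argv[])
{
    if (argc < 2) { std::cerr << "usage: readbench file.exr\n"; return 1; }

    for (int threads : {0, 1, 2, 4, 8, 16, 32, 64})
    {
        Imf::setGlobalThreadCount (threads); // 0 = no worker threads

        auto t0 = std::chrono::steady_clock::now ();

        // open, allocate and read the whole data window (timed together)
        Imf::RgbaInputFile in (argv[1]);
        Imath::Box2i dw = in.dataWindow ();
        int w = dw.max.x - dw.min.x + 1;
        int h = dw.max.y - dw.min.y + 1;

        Imf::Array2D<Imf::Rgba> pixels (h, w);
        in.setFrameBuffer (&pixels[0][0] - dw.min.x - dw.min.y * w, 1, w);
        in.readPixels (dw.min.y, dw.max.y);

        auto t1 = std::chrono::steady_clock::now ();
        std::cout << threads << " threads: "
                  << std::chrono::duration<double> (t1 - t0).count () << " s\n";
    }
    return 0;
}
```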

@kdt3rd
Contributor

kdt3rd commented Dec 31, 2024

It took a while, as I went down a rabbit hole making dwa very much better, but the huf and dwa compression routines (encode / decode) were indeed not as fast as they used to be. The Huffman coding regression was because I removed some memory cast / copy undefined behavior which turned out to have been a little faster than the well-defined replacement. I've instead made other changes to make the routine faster while still keeping the original fix for the undefined behavior.
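For readers unfamiliar with that class of fix, the general pattern looks something like the sketch below. This is an illustration of type punning vs memcpy in general, not the actual OpenEXR Huffman code:

```cpp
#include <cstdint>
#include <cstring>

// What the faster-but-undefined style effectively does: read a 64-bit word
// through a casted pointer, which breaks strict-aliasing / alignment rules.
inline std::uint64_t load64_punned (const unsigned char* p)
{
    return *reinterpret_cast<const std::uint64_t*> (p); // undefined behavior
}

// Well-defined replacement: copy into a local variable. Compilers usually
// turn this into the same single load, but not always, which is why the
// safe version can come out slower until the surrounding routine is
// restructured.
inline std::uint64_t load64_safe (const unsigned char* p)
{
    std::uint64_t v;
    std::memcpy (&v, p, sizeof v);
    return v;
}
```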

The dwa one is a bit more mysterious: when profiling, it was dominated by a memory-bound lookup into the giant table used for quantizing values, so I am not sure why that one regressed given the code didn't really change between 3.2 and 3.3 other than the move from the c++ to the c library version. However, I have just put in a pull request to remove that table lookup, so writing dwa should be significantly faster than before (0.23s -> 0.15s per frame for me on an avx2 / f16c enabled compile).

The old version of the c++ library used to use threads (sort of) even when the thread count was set to 0 for reading, resulting in some misleading results on the read side. However, given how the encode and decode pipelines work right now, it doesn't surprise me that performance caps out at the core count rather than the thread count. This could be improved in the future with a slight rework of how the pipelines are evaluated by threads.

Thanks again for all the testing.
