Runtime compressed refs interpreter performance #8878
Comments
Makefiles will also need to be updated - search for
@dnakamura FYI for CMake.
I'm working on this.
To start with, there's no need to actually be in mixed mode - if the individual interpreters are ifdeffed on the full/compressed flags, then they'll work in the normal builds, and won't bloat them. For an example, see:
For the override check, I was thinking something like:
Some initial benchmark results for the split ("mixed") build:
- LibertyStartupDT7: slight improvement in startup time
- Throughput Benchmark: significant reduction in throughput

I'll be running more iterations of these benchmarks to see if the results are consistent.
@sharon-wang those numbers seem really off. The overhead from the compressed checks should mostly be in the GC at this point, which shouldn't account for a 2-3x overhead. Can you collect some profiles from a combo build run? It would really help to see where the time is going.
The probable reason for the awful numbers is that the bench was being run in GC stress mode (which is something we will eventually need to address, but for now, the new numbers will be in normal mode to give us a general idea). |
Hard to speculate without seeing heap sizing, machine config, GC logs, CPU profiles...
Ran a few more iterations of LibertyStartupDT7, which yielded similar results as above. For GC, I've collected a few profiles and logs. What sort of data do we want from the profiles and GC logs? Are there specific measurements that we are interested in?
For the profiles, we want to compare the profiles from the combo build to a non-combo build to see which symbols have become hotter. This helps to identify where we're spending more time.
Please send collected GC verbose log files (both combo and non-combo) to me or @amicic.
GC times are about 10% longer, which is more or less expected at this stage. The CPU profiles are too short to be meaningful; I asked @sharon-wang to generate them again.
Original runs were with `-Xnocompressedrefs`. Maybe there is something specific to it that led to such a big difference.
I find it odd that the lower heap sizes are showing better results. |
Run-to-run variation is easily 3%. We did not pick the right combo of heap sizes to emphasize the difference between low and high GC overhead scenarios, either. When I picked the large heap (10G Nursery) I assumed the small one would be 4G (like in the original runs), so about 2.5x smaller GC overhead. But I should've also picked something even larger than 10G.
I can do some more runs: one set with compressed refs and one without. Are we interested in both COMPLIANT and PRESET runs? What is suggested for the heap settings? I can use
Let's do COMPLIANT runs only for nonCR. If we indeed reproduce the big gap on the GC perf machine, then we will do PRESET runs with CPU profiles. Large: `-Xmx24G -Xms24G -Xmn20G`
New Throughput Benchmark results (COMPOSITE + COMPLIANT): there is not a big gap in performance like initially measured. It might have been a machine issue, since those first runs were done on a different machine than the subsequent runs.

Heap settings:
GC (Scavenge specifically) time slowdown for nonCR
Similar for CR:
Those are average GC times as obtained by:
@amicic Should more perf results be collected or is further profiling needed? |
I don't need more perf results. These slowdowns are more or less expected, knowing the changes that came in... The next step is to try to reduce the gap. Generally, we need to reduce the number of runtime 'if CR' checks, if possible extracting them from tight loops. Very often that won't be possible, so we'll have to resort to versioning the code, similar to what is done for the interpreter (although to avoid maintaining 2 versions of the source code, I'd consider C++ templates). Either way, we need to do it for the minimum amount of code that is most impacted by the runtime checks. Specifically, for the Scavenger, which is the most frequently executed GC in Gencon, we can version:
So where do we go from here? @rwy0717 suggested that an interpreter-style solution (the compressed override define) might work better than templates. Any thoughts on that?

Another obvious option is to simply multiply-compile the majority of the GC and fill in the MM function table appropriately. This would obviously perform well, but may be prohibitively large, and it would also need to address the ODR constraint (i.e. the classes would need to be multiply-named).

I'm not familiar at all with templates, so I have nothing to offer in that regard. Can anyone provide an example of what a piece of code would look like with templates?

One of the major places we probably want to optimize is the scanners/iterators, but I had heard they were being reworked (for VT?), so I've been loath to put a lot of effort into changing those. One thing I had looked at was keeping an internal SlotObject instead of a scan pointer, and using the new instance-based APIs to increment the pointer.
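As a rough illustration of the template idea (this is not OpenJ9 code; all names, the slot widths, and the shift amount are illustrative assumptions): the single source is instantiated twice on a compile-time flag, so inside each instantiation the `compressed` test is a constant and the compiler removes the branch from the tight loop, leaving one runtime check at the dispatch point.

```cpp
// Sketch: versioning a hot GC-style loop on a compile-time
// compressed-references flag via a template parameter.
#include <cstdint>
#include <cstddef>
#include <cassert>

template<bool compressed>
static uintptr_t scanSlots(const void *base, size_t count)
{
	uintptr_t sum = 0;
	for (size_t i = 0; i < count; i++) {
		if (compressed) {
			/* 32-bit compressed slot; the shift of 3 is illustrative. */
			sum += (uintptr_t)((const uint32_t *)base)[i] << 3;
		} else {
			/* full-width slot */
			sum += ((const uintptr_t *)base)[i];
		}
	}
	return sum;
}

/* One runtime 'if CR' check, hoisted out of the loop, selects the
 * fully specialized version. */
uintptr_t scanSlotsDispatch(const void *base, size_t count, bool compressedRefs)
{
	return compressedRefs ? scanSlots<true>(base, count)
	                      : scanSlots<false>(base, count);
}
```

The trade-off versus the interpreter-style override define is that templates keep one copy of the source while still producing two compiled bodies, at the cost of template plumbing through the class hierarchy.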
Sure thing, I'll do 2 runs of a Throughput Benchmark for mixedrefs and baseline for each of those policies. Expecting it'll take a week or so to run everything and put the results together. I'll coordinate with @mpirvu to get him the builds - thanks for doing the DT7 runs!
I compared the baseline (compressedrefs) to the mixedrefs build, both being run with:

================= Detailed results =============
Some more Throughput Benchmark COMPOSITE results.

baseline:
Heap settings:

Measurements below are an average of 3 runs for each. Run-to-run measurements varied slightly for both configurations and there is no major performance difference between baseline and mixedrefs.
`-Xgcpolicy:optthruput`

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | -2.13% | +0.41% | -- | -1.06% |
`-Xgcpolicy:balanced`

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | -2.25% | +4.54% | -2.09% | +2.88% |
`-Xgcpolicy:metronome`

This GC policy appears to be incompatible with the Throughput Benchmark? The benchmark errors out or crashes even when running with the latest openj9 Adopt JDK15 build.
Crashes? That is interesting... Do you have an example around to get an idea of where/how it crashed?
@dmitripivkine Yes, I'll send you the segfault and stack trace.
@sharon-wang were you planning to investigate why max_jOPS regressed by ~2% for both optthruput and balanced?
I'm assuming that's random variance - these GCs should be identical to the normal builds. Perhaps another run is in order. I've manually verified that none of the getters are missing the override check. |
I will do another set of runs to check if the same regression is seen. |
New set of Throughput Benchmark COMPOSITE runs.

baseline:
Heap settings:

Measurements below are an average of 5 runs of each.
`-Xgcpolicy:optthruput`

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +0.16% | -2.68% | -0.19% | +2.53% |
`-Xgcpolicy:balanced`

| | max_jOPS | critical_jOPS | hbIR_max | hbIR_settled |
|---|---|---|---|---|
| diff % | +3.15% | +3.00% | +4.98% | -1.52% |
Seeing similar run-to-run variance as previous results. Seems like the two builds show the same performance.
For this initial set of changes to enable mixed builds with CMake, are we focused on JDK11 specifically, or do we want to enable this feature for all versions? I assume this also depends on which versions/platforms CMake is currently available on. |
The changes to openj9 to support mixed references should not be specific to any version of java, so I would expect it should work for all (with changes similar to ibmruntimes/openj9-openjdk-jdk11#359 made for the other extensions repositories). |
Just an update that all CMake mixed refs changes are now merged. The test story is in progress and can be followed here: #9231. |
In order to retain optimal performance in the interpreter in runtime compressed refs mode, we'll need to multiply-compile it (we already do this for debug mode).

The idea is to define something like `J9_COMPRESSED_REFS_OVERRIDE` to either `TRUE` or `FALSE`, and use the override value if present in the `J9VMTHREAD_COMPRESS_OBJECT_REFERENCES` and `J9JAVAVM_COMPRESS_OBJECT_REFERENCES` macros.

In keeping with the current naming conventions, I'd suggest replacing `BytecodeInterpreter.cpp` with `BytecodeInterpreterCompressed.cpp` and `BytecodeInterpreterFull.cpp`, and renaming the C entry points appropriately. (Almost) the entirety of these files should be ifdeffed on `OMR_GC_COMPRESSED_POINTERS` or `OMR_GC_FULL_POINTERS` (you'll need to include `j9cfg.h` to get those ifdefs).

The new entry points will need to be used in:
https://github.com/eclipse/openj9/blob/b542db86f655d22d1697c64d071bd483c3da3695/runtime/vm/jvminit.c#L2455

For now, let's not bother splitting the debug interpreter or MH interpreter.
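A minimal sketch of what wiring up the split entry points could look like (all names here are illustrative, not the actual OpenJ9 symbols): the two interpreter bodies are compiled separately, and init code picks one loop once at startup, so the bytecode loops themselves carry no runtime compressed-refs checks.

```cpp
// Sketch: selecting between two compiled interpreter entry points once
// at JVM init, mirroring the split described above. Names are assumed.
#include <cstdio>

/* In the real split these would be the C entry points of
 * BytecodeInterpreterFull.cpp and BytecodeInterpreterCompressed.cpp,
 * each ifdeffed on OMR_GC_FULL_POINTERS / OMR_GC_COMPRESSED_POINTERS. */
extern "C" void bytecodeLoopFull(void *vmThread)
{
	(void)vmThread;
	std::puts("full-pointer interpreter");
}

extern "C" void bytecodeLoopCompressed(void *vmThread)
{
	(void)vmThread;
	std::puts("compressed-refs interpreter");
}

typedef void (*BytecodeLoop)(void *);

/* Called once during initialization (cf. the jvminit.c link above),
 * based on the runtime compressed-references mode. */
BytecodeLoop selectBytecodeLoop(bool compressObjectReferences)
{
	return compressObjectReferences ? bytecodeLoopCompressed : bytecodeLoopFull;
}
```

Each interpreter file would pin the override macro (`TRUE` in the compressed file, `FALSE` in the full one) before including the shared interpreter body, so both loops compile from one source.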