I maintain a set of benchmarks to compare the performance of executors. https://github.com/tzcnt/runtime-benchmarks currently it mostly tracks fork-join performance. I have recently added HPX to this benchmark suite in PR tzcnt/runtime-benchmarks#4.
I was shocked to discover that HPX on average is 900x slower than the fastest runtime. See the summary table on the repo README, or the full dataset chart here: https://fleetcode.com/runtime-benchmarks/ Note that I have only run this on one machine so far (with 128GB ram) - because on a different machine, it would take literally hours to complete the benchmark run, or run out of memory... whereas the competing runtimes can complete a benchmark run in a few minutes.
You can read the PR description above for a complete description of all the things that I tried, but the summary is that I wasn't able to achieve reasonable performance despite spending many hours trying different approaches.
I also discovered that the abp-priority-lifo scheduler is broken and doesn't actually use the correct scheduling policy. I had to manually edit this line to reference hpx::threads::policies::lockfree_abp_lifo instead, after which I was able to I verify that the correct scheduler was being used.
I expected that this would solve the problem as the memory blowup that I observed is usually what happens when runtimes use strictly FIFO scheduling, which can result in e.g. for Skynet, 10^8 = 100,000,000 active tasks maximum, rather than 10*8 = 80 active tasks maximum when using LIFO scheduling for the local task queue. I thought that the lockfree_abp_lifo policy would prevent this problem, but it didn't make any impact.
I see this issue #3348 and this PR #4744 but I'm not sure if this would resolve this problem.
I'd also like to note that 3 of the benchmarks I'm running have nearly equivalent implementations in your examples, so I'd expect that it would be possible to make these run in a performant manner.
https://github.com/STEllAR-GROUP/hpx/blob/master/examples/quickstart/fibonacci_one.cpp
https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/skynet.cpp
https://github.com/STEllAR-GROUP/hpx/tree/master/examples/nqueen
I'd like to give you an opportunity to present your library in the best light, so I'm open to any feedback on how I can optimize my implementation / build flags to make HPX perform better.
I maintain a set of benchmarks to compare the performance of executors. https://github.com/tzcnt/runtime-benchmarks currently it mostly tracks fork-join performance. I have recently added HPX to this benchmark suite in PR tzcnt/runtime-benchmarks#4.
I was shocked to discover that HPX on average is 900x slower than the fastest runtime. See the summary table on the repo README, or the full dataset chart here: https://fleetcode.com/runtime-benchmarks/ Note that I have only run this on one machine so far (with 128GB ram) - because on a different machine, it would take literally hours to complete the benchmark run, or run out of memory... whereas the competing runtimes can complete a benchmark run in a few minutes.
You can read the PR description above for a complete description of all the things that I tried, but the summary is that I wasn't able to achieve reasonable performance despite spending many hours trying different approaches.
I also discovered that the
abp-priority-lifoscheduler is broken and doesn't actually use the correct scheduling policy. I had to manually edit this line to referencehpx::threads::policies::lockfree_abp_lifoinstead, after which I was able to I verify that the correct scheduler was being used.I expected that this would solve the problem as the memory blowup that I observed is usually what happens when runtimes use strictly FIFO scheduling, which can result in e.g. for Skynet, 10^8 = 100,000,000 active tasks maximum, rather than 10*8 = 80 active tasks maximum when using LIFO scheduling for the local task queue. I thought that the
lockfree_abp_lifopolicy would prevent this problem, but it didn't make any impact.I see this issue #3348 and this PR #4744 but I'm not sure if this would resolve this problem.
I'd also like to note that 3 of the benchmarks I'm running have nearly equivalent implementations in your examples, so I'd expect that it would be possible to make these run in a performant manner.
https://github.com/STEllAR-GROUP/hpx/blob/master/examples/quickstart/fibonacci_one.cpp
https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/skynet.cpp
https://github.com/STEllAR-GROUP/hpx/tree/master/examples/nqueen
I'd like to give you an opportunity to present your library in the best light, so I'm open to any feedback on how I can optimize my implementation / build flags to make HPX perform better.