
NEDeconvolutionLayer performance degradation in v24.11 #1150

Open

alvoron opened this issue Nov 21, 2024 · 18 comments

@alvoron

alvoron commented Nov 21, 2024

How ACL was built:

scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=1 cppthreads=0 os=macos data_layout_support=all  build=native --jobs=16 os=macos build=native --silent fixed_format_kernels=True

Platform:
Apple M2 Pro

Operating System:
macOS 13.4

Problem description:
NEDeconvolutionLayer performance in 24.09 is better than in 24.11.

Reproducer

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include <chrono>
#include <iostream>
#include <vector>

using namespace arm_compute;

int main(int argc, char *argv[]) {
  TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 200, 200), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 600, 600), 1, DataType::F16, DataLayout::NHWC);

  PadStrideInfo deconvInfo = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);
  bool fastMath = true;
  auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconvInfo, fastMath);
  if(status.error_code() != ErrorCode::OK) {
    std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
    exit(1);
  }
  std::cout << "PASSED VALIDATION" << std::endl;

  Tensor srcTensor;
  Tensor weiTensor;
  Tensor dstTensor;
  srcTensor.allocator()->init(srcTensorInfo);
  weiTensor.allocator()->init(weiTensorInfo);
  dstTensor.allocator()->init(dstTensorInfo);

  NEDeconvolutionLayer deconv;
  deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconvInfo, fastMath);
  std::cout << "PASSED CONFIGURATION" << std::endl;

  srcTensor.allocator()->allocate();
  weiTensor.allocator()->allocate();
  dstTensor.allocator()->allocate();

  //warm-up
  deconv.run();

  std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) deconv.run();
  std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
  uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
  std::cout << "time: " << total_duration / 100 << std::endl;

  return 0;
}

How reproducer was built

g++ -O2 -g -I./ComputeLibrary -I./ComputeLibrary/include acl_deconv.cpp -L./ComputeLibrary/build/ -larm_compute -std=c++17

The reproducer reports an average per-run time of 7038 µs on 24.09 and 10669 µs on 24.11.

Could you please review potential performance issues in NEDeconvolutionLayer?
I also observe a degradation in Convolution; the Deconvolution and Convolution issues probably share the same cause.

It's also worth mentioning that I haven't observed these degradations on Ampere.

@morgolock

Hi @alvoron

For macOS we don't support the option openmp=1 in ACL, because libomp is not part of the OS and users need to install it as a third-party package. Can you double-check on your side? The build command you shared above, scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=1 cppthreads=0 os=macos data_layout_support=all build=native --jobs=16 os=macos build=native --silent fixed_format_kernels=True, does not work for me. How is it that you can build ACL with openmp=1 on macOS?

@alvoron

alvoron commented Nov 26, 2024

@morgolock can you install OpenMP by running brew install libomp?

@morgolock

Hi @alvoron

Will do, but this is not something we normally test or support on macOS.

@alvoron

alvoron commented Nov 27, 2024

@morgolock I've got almost the same results with cppthreads=1 openmp=0:
24.11 - 10612 µs
24.09 - 7144 µs

@morgolock

Hi @alvoron

Thanks for the additional information. I reproduced the issue. We are looking into it.

@alvoron

alvoron commented Dec 3, 2024

Hi @morgolock
I've seen that a new release, v24.11.1, has been published. Does it contain the fix?

@morgolock

Hi @alvoron

No, the regression has not been fixed yet.

Hope this helps

@morgolock

Hi @alvoron

This patch solves the problem and it will be included in the next release.

Hope this helps

@alvoron

alvoron commented Dec 4, 2024

@morgolock
Thank you for the patch.
Let me check it on my side.

morgolock added this to the v25.02 milestone Dec 4, 2024
@alvoron

alvoron commented Dec 17, 2024

@morgolock
I applied the patch on top of v24.11 (f44f09d) and got the same results: 10400-10500 µs.

Could you please double-check the fix? Or are some additional patches required?

@morgolock

Hi @alvoron

I ran the test and I can confirm that it fixes the regression on Apple M2 Pro.

I built the library with the following options:
scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=1 os=macos data_layout_support=all build=native --jobs=16 os=macos build=native validation_tests=0 examples=0 fixed_format_kernels=True logging=0 build_dir=./build/main -j8

See the results below

user@acl-mac-mini deconv % ./deconv_fix | grep time
time: 7067
user@acl-mac-mini deconv % ./deconv_24.09 | grep time
time: 7505

@morgolock

Hi @alvoron

Make sure you explicitly set the memory manager when you create the DeconvLayer, as shown below.

#include "arm_compute/core/Error.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/utils/misc/MMappedFile.h"
#include "arm_compute/runtime/Tensor.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/PoolManager.h"
#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include <chrono>
#include <iostream>
#include <vector>

using namespace arm_compute;

int main(int argc, char *argv[]) {
  Allocator allocator{};                                                                // Allocator used for the backing memory allocation
  auto lifetime_mgr = std::make_shared<BlobLifetimeManager>();                          // Create lifetime manager
  auto pool_mgr     = std::make_shared<PoolManager>();                                  // Create pool manager
  auto mm           = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr);  // Create memory manager
  MemoryGroup memory_group(mm);

  TensorInfo srcTensorInfo = TensorInfo(TensorShape(36, 200, 200), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo weiTensorInfo = TensorInfo(TensorShape(36, 3, 3, 4), 1, DataType::F16, DataLayout::NHWC);
  TensorInfo dstTensorInfo = TensorInfo(TensorShape(4, 600, 600), 1, DataType::F16, DataLayout::NHWC);
  PadStrideInfo deconvInfo = PadStrideInfo(3, 3, 0, 0, 0, 0, DimensionRoundingType::FLOOR);
  bool fastMath = true;
  auto status = NEDeconvolutionLayer::validate(&srcTensorInfo, &weiTensorInfo, nullptr, &dstTensorInfo, deconvInfo, fastMath);
  if(status.error_code() != ErrorCode::OK) {
    std::cout << "ERROR: " << status.error_description().c_str() << std::endl;
    exit(1);
  }
  std::cout << "PASSED VALIDATION" << std::endl;

  Tensor srcTensor;
  Tensor weiTensor;
  Tensor dstTensor;

  memory_group.manage(&srcTensor);  // Start managing srcTensor and start its lifetime
  memory_group.manage(&weiTensor);  // Start managing weiTensor and start its lifetime
  memory_group.manage(&dstTensor);  // Start managing dstTensor and start its lifetime

  srcTensor.allocator()->init(srcTensorInfo);
  weiTensor.allocator()->init(weiTensorInfo);
  dstTensor.allocator()->init(dstTensorInfo);

  NEDeconvolutionLayer deconv(mm);  // Hand the memory manager to the function at construction time
  deconv.configure(&srcTensor, &weiTensor, nullptr, &dstTensor, deconvInfo, fastMath);
  std::cout << "PASSED CONFIGURATION" << std::endl;

  srcTensor.allocator()->allocate();
  weiTensor.allocator()->allocate();
  dstTensor.allocator()->allocate();

  mm->populate(allocator, 1);  // Back the pools with memory from the allocator
  memory_group.acquire();

  deconv.run(); // warm-up

  std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; i++) {
    deconv.run();
  }
  std::chrono::high_resolution_clock::time_point finish = std::chrono::high_resolution_clock::now();
  uint64_t total_duration = std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count();
  std::cout << "time: " << total_duration / 100 << std::endl; // average per-run time in microseconds

  memory_group.release();
  mm->clear();

  return 0;
}

If you apply the patch and initialize NEDeconvolutionLayer with the memory manager, the performance issue is solved.

Hope this helps

@alvoron

alvoron commented Dec 18, 2024

@morgolock
I've observed a performance degradation on f32 convolutions (gemm_acl_f32) on mobilenet-v2-1.0-224 as well.

We're using the ACL Convolution kernel via oneDNN, so the fix needs to be done on the oneDNN side to recover Convolution performance.

Could you please check GEMM as well?

@morgolock

Hi @alvoron

I've observed a performance degradation on f32 convolutions (gemm_acl_f32) on mobilenet-v2-1.0-224 as well.
We're using the ACL Convolution kernel via oneDNN, so the fix needs to be done on the oneDNN side to recover Convolution performance.
Could you please check GEMM as well?

In v24.11 we introduced this patch to improve memory management in ACL. The patch considerably reduces memory usage in some models. A side effect of this change is that the user of the library must explicitly set up and configure the memory manager, as shown in the reproducer above, to get the best performance; otherwise you will see a performance regression.

If you set up the memory manager as in the reproducer, you will see no performance regression in v24.12.
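For reference, the memory-manager wiring above boils down to the condensed sketch below. It is only a sketch: the tensor setup, configure call, and timing loop are elided (see the full reproducer above for those), and it assumes the same ACL runtime headers and the v24.11+ API used there.

#include "arm_compute/runtime/Allocator.h"
#include "arm_compute/runtime/BlobLifetimeManager.h"
#include "arm_compute/runtime/MemoryGroup.h"
#include "arm_compute/runtime/MemoryManagerOnDemand.h"
#include "arm_compute/runtime/PoolManager.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"

#include <memory>

using namespace arm_compute;

int main() {
  // 1. Build the memory manager from a lifetime manager and a pool manager.
  auto lifetime_mgr = std::make_shared<BlobLifetimeManager>();
  auto pool_mgr     = std::make_shared<PoolManager>();
  auto mm           = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr);
  MemoryGroup memory_group(mm);

  // 2. Hand the memory manager to the function at construction time, then
  //    manage/init/configure/allocate the tensors exactly as in the reproducer.
  NEDeconvolutionLayer deconv(mm);

  // 3. Back the pools with real memory before running, and tear down afterwards.
  Allocator allocator{};
  mm->populate(allocator, 1); // one pool suffices for a single execution stream
  memory_group.acquire();     // take the pool memory for this group's tensors
  // deconv.run();
  memory_group.release();     // return the memory to the pool
  mm->clear();
  return 0;
}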

Hope this helps.

@alvoron

alvoron commented Dec 18, 2024

@morgolock how do I configure the memory manager if I'm using oneDNN to call the ACL Convolution kernel? Does oneDNN provide an API to set up the ACL memory manager?

@theComputeKid

Does oneDNN provide an API to setup ACL memory manager?

@alvoron this sounds more like a oneDNN question. Are you asking as a oneDNN user, or a oneDNN contributor? The answer is different in both cases. We can take this discussion to the oneDNN repo if it gets too technical.

cc: @Sqvid

@alvoron

alvoron commented Dec 28, 2024

@morgolock @theComputeKid
The convolution issue I mentioned above is related to the stateless feature; I've created an issue in the oneDNN repo: oneapi-src/oneDNN#2324

@theComputeKid

@alvoron Thanks, we will get back to you on the oneDNN side after our people get back from holidays.

@morgolock it may well be that this is a bug in the way we implemented stateless conv (we have a workaround in oneDNN for winograd, which is probably slowing things down). We might need to apply some fixes to make it thread-safe so that the oneDNN workaround is not required. Would you like to track this as part of this issue? In that case, we should rename the issue to "stateless conv performance worse than NEConv".
