route EthosU input/output memcpy through overridable hook (#19264)
meta-codesync[bot] merged 1 commit into main
@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103455766.
ddea8da to ffc9927
Summary: The EthosU backend's input/output scratch shuffling currently does a plain CPU std::memcpy of every input tensor into the scratch buffer and every output tensor out of it on every inference. On Cortex-M55-based firmware targets that have a DMA engine, this is a significant CPU load: inference time is spent in memcpy that could instead be DMA-offloaded so the M55 sleeps while the transfer runs.

This change introduces a thin extern-C indirection, `arm_ethos_io_memcpy`, that the EthosU backend uses everywhere it currently calls memcpy for input/output scratch shuffling. The default (weak) implementation lives in a separate translation unit (EthosUBackend_IoMemcpy.cpp) and just calls std::memcpy, so behavior is unchanged for any consumer that doesn't override it. Firmware targets can supply a strong-symbol override (e.g. routing through a DMA engine) without touching the upstream backend code.

Implementation notes:
- The weak default lives in its own TU so the compiler in the call-site TUs cannot inline its body and bypass the link-time override. This is the same pattern bolt_arm_memcpy_external uses.
- Three call sites updated: input scratch copy in EthosUBackend.cpp, the layout-adjustment chunk loop in EthosUBackend.cpp, and the output scratch copy in EthosUBackend_Cortex_M.cpp.

Reviewed By: rascani
Differential Revision: D103455766
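For readers unfamiliar with the weak-symbol pattern the summary describes, here is a minimal sketch, not the PR's actual code. It uses the hook name from the summary (`arm_ethos_io_memcpy`) and assumes the `(dst, src, size)` signature visible in the diff excerpt; `copy_input_to_scratch` is a hypothetical call site invented for illustration. The weak default is shown in the same file only to keep the sketch self-contained, whereas the PR places it in its own translation unit.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hook declaration, as it might appear in a header shared by the call sites.
// The signature is an assumption based on the diff excerpt in this PR.
extern "C" void arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size);

// Weak default: plain CPU copy. Per the summary, the real default lives in a
// separate translation unit so call sites cannot inline it; it sits here only
// so the sketch fits in one file. Requires a GCC/Clang-style toolchain.
extern "C" __attribute__((weak)) void
arm_ethos_io_memcpy(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);
}

// Hypothetical call site: copy one input tensor into the scratch buffer
// through the hook instead of calling std::memcpy directly.
static void copy_input_to_scratch(std::uint8_t* scratch,
                                  const std::uint8_t* input,
                                  std::size_t nbytes) {
  arm_ethos_io_memcpy(scratch, input, nbytes);
}
```

With no strong definition linked in, the weak default is used and the copy behaves exactly like std::memcpy.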
ffc9927 to 8eeb57c
8eeb57c to 3fe2220
3fe2220 to 845995e
// unit so the compiler in the call-site TUs cannot inline this body and
// bypass the link-time override (same trick as bolt_arm_memcpy_external).
extern "C" __attribute__((weak)) void
io_memcpy(void* dst, const void* src, size_t size) {
Regular memcpy should already be weak in embedded toolchains, or we may be able to override it through compiler flags, but this is also OK.
Note that we do not want a wide override across subsystems or modules. E.g. we enable this DMA override on specific Zephyr overlay configs for specific app_versions only, i.e. we want only this specific workload (copying tensors back and forth for the NPU) to be offloaded to hardware DMA, since DMA also has its tradeoffs.
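The scoping described in this comment could look roughly like the following on the firmware side. This is a sketch under stated assumptions, not code from the PR: `APP_ETHOS_IO_USE_DMA` is a hypothetical build switch standing in for a per-app overlay option, and `platform_dma_copy_blocking` is a hypothetical stand-in for a platform DMA driver call (simulated with std::memcpy so the sketch runs on a host). Only copies routed through `arm_ethos_io_memcpy` are affected; ordinary memcpy calls elsewhere are untouched.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical build switch; a real project would set this from its own
// per-application configuration.
#ifndef APP_ETHOS_IO_USE_DMA
#define APP_ETHOS_IO_USE_DMA 1
#endif

// Counts simulated DMA transfers so the routing can be observed in a test.
static int g_dma_transfers = 0;

// Hypothetical stand-in for a blocking DMA transfer. A real driver would
// program the DMA engine and wait for completion, and would likely also need
// to deal with cache maintenance and alignment constraints.
static void platform_dma_copy_blocking(void* dst, const void* src,
                                       std::size_t size) {
  ++g_dma_transfers;
  std::memcpy(dst, src, size);
}

#if APP_ETHOS_IO_USE_DMA
// Strong definition: at link time it takes precedence over the backend's
// weak default, so only EthosU input/output copies are rerouted.
extern "C" void arm_ethos_io_memcpy(void* dst, const void* src,
                                    std::size_t size) {
  platform_dma_copy_blocking(dst, src, size);
}
#endif
```

Builds that leave the switch off define no strong symbol and fall back to the weak std::memcpy default.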
845995e to efec58a
Cortex-M testing: This was an interesting side effect. It seems we are building the backend here when we probably should/could avoid it.
efec58a to c4b0f13
00b91bc to b6d333d
Summary:
The EthosU backend's input/output scratch shuffling currently does a plain
CPU std::memcpy of every input tensor into the scratch buffer and every
output tensor out of it on every inference. On Cortex-M55-based firmware
targets that have a DMA engine, this is a significant CPU load: inference
time is spent in memcpy that could instead be DMA-offloaded so the M55
sleeps while the transfer runs.
This change introduces a thin extern-C indirection,
`arm_ethos_io_memcpy`, that the EthosU backend uses everywhere it
currently calls memcpy for input/output scratch shuffling. The default
(weak) implementation lives in a separate translation unit
(EthosUBackend_IoMemcpy.cpp) and just calls std::memcpy, so behavior is
unchanged for any consumer that doesn't override it.
Firmware targets can supply a strong-symbol override (e.g. routing
through a DMA engine) without touching the upstream backend code.
Implementation notes:
- The weak default lives in its own TU so the compiler in the call-site
  TUs cannot inline its body and bypass the link-time override. This is
  the same pattern bolt_arm_memcpy_external uses.
- Three call sites updated: input scratch copy in EthosUBackend.cpp, the
  layout-adjustment chunk loop in EthosUBackend.cpp, and the output
  scratch copy in EthosUBackend_Cortex_M.cpp.
bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks
Reviewed By: rascani
Differential Revision: D103455766