Commit 5fb69ea

[NPU] Model marshalling without weights copies (#31939)
### Details:
* Extends the [OV StreamSerializer & XmlSerializer](#31639) in order to allow passing an `ov::Model` through the driver without copying its weights into a separate buffer.
* The purpose of this PR is to reduce memory consumption by avoiding weight duplication.
* This feature will be disabled by default for the CiD interface (at least for a while). The changes are first meant to be integrated in the upcoming CiP interface.
* The implementation followed and adapted the sample provided in [this PR](#31969).
* Two config options are introduced:
  * `intel_npu::use_base_model_serializer` - a switch between the old & new serialization algorithms.
  * `intel_npu::serialization_weights_size_threshold` - controls which weights are copied into a separate buffer and which ones have only metadata (memory location & size) stored as runtime information. More concretely, weights smaller than this value will be copied.
* `vcl_serializer.hpp` is meant to contain all operations required to prepare an `ov::Model` as input for the VCL interface. This implies using the new/old model serializer, I/O & config serialization. `xml_serializer.hpp` is a more generic, weightless (no weights copies unless `serialization_weights_size_threshold` is used) implementation of the OV serializer.
* Roughly how this works:
  * The plugin passes through all `ov::Constant` nodes and places weights metadata (`intel_npu::WeightsPointerAttribute`) as runtime information on the nodes whose buffers are not smaller than `serialization_weights_size_threshold` (smaller buffers are copied instead).
  * The new `intel_npu::StreamSerialize` is called, which uses the `intel_npu::XmlSerializer` for serializing the model. Note that `StreamSerialize` uses a slightly different format within the buffer (metadata containing offsets & sizes, custom data, weights & the XML graph), see `ov::pass::StreamSerialize` for details.
  * `intel_npu::XmlSerializer` will not write weights into its dedicated buffer if the `WeightsPointerAttribute` is found within the current `ov::Constant` node. Instead, weights metadata will be written as runtime information by calling the visit method corresponding to the attribute.
  * The deserializer will be able to distinguish between the two cases (weights copied vs. weights stored as metadata) by looking for this attribute in the serialized buffer.
* See the ticket for some performance reports.

## Related PRs
* [Sample for extending the serialization algorithm](#31969)
* [The PR that made the OV serializer extensible](#31639)

### Tickets:
- *CVS-173711*
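
The threshold rule described above can be sketched in isolation. This is a minimal illustration, not the plugin's code: `ConstantBuffer`, `WeightsMetadata`, `SerializationPlan` and `plan_serialization` are hypothetical names that stand in for `ov::Constant` buffers and the runtime-information attribute, and the only behavior taken from the PR text is that buffers smaller than the threshold are copied while the rest keep just an address and a size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an ov::Constant's buffer: an address and a size.
struct ConstantBuffer {
    const void* data;
    std::size_t byte_size;
};

// Hypothetical stand-in for the metadata stored as runtime information.
struct WeightsMetadata {
    std::uintptr_t memory_pointer;
    std::size_t byte_size;
};

struct SerializationPlan {
    std::vector<std::uint8_t> copied_weights;    // bytes copied into the separate buffer
    std::vector<WeightsMetadata> metadata_only;  // buffers referenced by address only
};

// Buffers smaller than `threshold` are copied; all others keep only address + size.
inline SerializationPlan plan_serialization(const std::vector<ConstantBuffer>& constants,
                                            std::size_t threshold) {
    SerializationPlan plan;
    for (const auto& constant : constants) {
        assert(constant.data != nullptr || constant.byte_size == 0);
        if (constant.byte_size < threshold) {
            const auto* bytes = static_cast<const std::uint8_t*>(constant.data);
            plan.copied_weights.insert(plan.copied_weights.end(), bytes, bytes + constant.byte_size);
        } else {
            plan.metadata_only.push_back(
                {reinterpret_cast<std::uintptr_t>(constant.data), constant.byte_size});
        }
    }
    return plan;
}
```

With a threshold of 0, which is the default registered in this commit, no buffer satisfies `byte_size < threshold`, so in this sketch every constant is passed by address and nothing is copied.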
1 parent 9b3e405 commit 5fb69ea

23 files changed: +724 -290 lines changed


src/core/xml_util/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ source_group("include" FILES ${PUBLIC_HEADERS})
 add_library(${TARGET_NAME} STATIC ${LIBRARY_SRC} ${PUBLIC_HEADERS})

 add_library(openvino::xml_util ALIAS ${TARGET_NAME})
-set_target_properties(${TARGET_NAME} PROPERTIES EXPORT_NAME openvino_xml_util)
+set_target_properties(${TARGET_NAME} PROPERTIES EXPORT_NAME xml_util)

 target_include_directories(${TARGET_NAME} PUBLIC
     $<BUILD_INTERFACE:${TARGET_INCLUDE_DIR}>

src/plugins/intel_npu/src/al/include/intel_npu/config/options.hpp

Lines changed: 14 additions & 0 deletions
@@ -1426,4 +1426,18 @@ struct USE_BASE_MODEL_SERIALIZER final : OptionBase<USE_BASE_MODEL_SERIALIZER, b
     }
 };

+struct SERIALIZATION_WEIGHTS_SIZE_THRESHOLD final : OptionBase<SERIALIZATION_WEIGHTS_SIZE_THRESHOLD, size_t> {
+    static std::string_view key() {
+        return ov::intel_npu::serialization_weights_size_threshold.name();
+    }
+
+    static size_t defaultValue() {
+        return 0;
+    }
+
+    static OptionMode mode() {
+        return OptionMode::RunTime;
+    }
+};
+
 } // namespace intel_npu

src/plugins/intel_npu/src/al/include/intel_npu/npu_private_properties.hpp

Lines changed: 12 additions & 3 deletions
@@ -357,12 +357,21 @@ static constexpr ov::Property<bool> weightless_blob{"NPU_WEIGHTLESS_BLOB"};
  *
  * The base serializer is the OV implementation of the "XmlSerializer" without any extensions. All weights are copied in
  * a separate buffer. By turning this off, the NPU extension of the serializer is enabled. This allows optimizing the
- * process by avoiding copies into a separate weights buffer. However, this solution may be less reliable.
- *
- * @note This option doesn't actually do anything right now, it has been registered in advance.
+ * process by reducing the amount of weights that will be copied in a separate buffer. However, this solution may be
+ * less reliable.
  */
 static constexpr ov::Property<bool> use_base_model_serializer{"NPU_USE_BASE_MODEL_SERIALIZER"};

+/**
+ * @brief [Only for NPU Plugin]
+ * Type: size_t. Default is 0.
+ *
+ * Effective only if "use_base_model_serializer" is set to false. All "ov::Constant" buffers smaller than this value
+ * (byte size) will be copied in a separate buffer. The rest of the weights will be reconstructed at deserialization
+ * time using buffer pointers.
+ */
+static constexpr ov::Property<size_t> serialization_weights_size_threshold{"NPU_SERIALIZATION_WEIGHTS_SIZE_THRESHOLD"};
+
 /**
  * @brief [Experimental, only for NPU Plugin]
  * Type: integer.
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+// Copyright (C) 2025 Intel Corporation.
+// SPDX-License-Identifier: Apache-2.0
+//
+
+#pragma once
+
+#include <string_view>
+
+#include "openvino/core/runtime_attribute.hpp"
+
+namespace intel_npu {
+
+/**
+ * @brief Attribute containing the memory address of a weights buffer and the size of the buffer in bytes.
+ * @details Used as part of the serialization/deserialization algorithms in order to allow processing models without
+ * copying weights.
+ */
+class WeightsPointerAttribute : public ov::RuntimeAttribute {
+public:
+    OPENVINO_RTTI("WeightsPointerAttribute", "0", RuntimeAttribute);
+
+    WeightsPointerAttribute() = delete;
+
+    WeightsPointerAttribute(const void* pointer, const size_t size)
+        : memory_pointer(reinterpret_cast<size_t>(pointer)),
+          byte_size(size) {}
+
+    /**
+     * @note The names of the attributes have been kept short in order to save some memory (there may be a lot of
+     * "ov::Constant" nodes in a model). While deserializing, the name of the attribute ("WeightsPointerAttribute") is
+     * also used as part of identification in order to avoid collision.
+     */
+    static constexpr const std::string_view POINTER_KEY = "mp";
+    static constexpr const std::string_view BYTE_SIZE_KEY = "ms";
+
+    bool visit_attributes(ov::AttributeVisitor& visitor) override {
+        visitor.on_attribute(POINTER_KEY.data(), memory_pointer);
+        visitor.on_attribute(BYTE_SIZE_KEY.data(), byte_size);
+        return true;
+    }
+
+    size_t memory_pointer;
+    size_t byte_size;
+};
+
+} // namespace intel_npu
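
To make the idea behind this attribute concrete, here is a minimal standalone sketch of round-tripping a buffer address and size through integer fields. It does not use OpenVINO: `SerializedAttribute`, `write_weights_pointer`, `WeightsView` and `read_weights_pointer` are hypothetical names, and only the short keys `"mp"` and `"ms"` are taken from the diff. The sketch uses `std::uintptr_t` for the address, whereas the class above stores it in a `size_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical flat key/value form of the attribute, using the short keys from the diff.
using SerializedAttribute = std::map<std::string, std::uint64_t>;

// Stores the address and size of a buffer as plain integers; no bytes are copied.
inline SerializedAttribute write_weights_pointer(const void* pointer, std::size_t byte_size) {
    return {{"mp", static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(pointer))},
            {"ms", static_cast<std::uint64_t>(byte_size)}};
}

// Non-owning view reconstructed from the stored address and size.
struct WeightsView {
    const std::uint8_t* data;
    std::size_t byte_size;
};

// Only meaningful within the same process and while the original buffer is alive.
inline WeightsView read_weights_pointer(const SerializedAttribute& attribute) {
    assert(attribute.count("mp") == 1 && attribute.count("ms") == 1);
    const auto address = static_cast<std::uintptr_t>(attribute.at("mp"));
    return {reinterpret_cast<const std::uint8_t*>(address),
            static_cast<std::size_t>(attribute.at("ms"))};
}
```

The reconstructed view owns nothing, so any scheme of this kind relies on the producer keeping the original weights alive until the consumer has finished reading them.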

src/plugins/intel_npu/src/common/include/intel_npu/common/icompiler_adapter.hpp

Lines changed: 7 additions & 4 deletions
@@ -4,14 +4,15 @@

 #pragma once

+#include "intel_npu/common/filtered_config.hpp"
 #include "intel_npu/common/igraph.hpp"

 namespace intel_npu {

 class ICompilerAdapter {
 public:
     virtual std::shared_ptr<IGraph> compile(const std::shared_ptr<const ov::Model>& model,
-                                            const Config& config) const = 0;
+                                            const FilteredConfig& config) const = 0;

     /**
      * @brief Compiles the model, weights separation enabled.
@@ -27,7 +28,8 @@ class ICompilerAdapter {
      * "icompiler.hpp".
      * @return A "WeightlessGraph" type of object.
      */
-    virtual std::shared_ptr<IGraph> compileWS(const std::shared_ptr<ov::Model>& model, const Config& config) const = 0;
+    virtual std::shared_ptr<IGraph> compileWS(const std::shared_ptr<ov::Model>& model,
+                                              const FilteredConfig& config) const = 0;

     /**
      * @brief Parses the provided binary objects and returns a wrapper over the resulted L0 handles. The model may also
@@ -44,11 +46,12 @@ class ICompilerAdapter {
      */
     virtual std::shared_ptr<IGraph> parse(
         ov::Tensor mainBlob,
-        const Config& config,
+        const FilteredConfig& config,
         std::optional<std::vector<ov::Tensor>> initBlobs = std::nullopt,
         const std::optional<std::shared_ptr<const ov::Model>>& model = std::nullopt) const = 0;

-    virtual ov::SupportedOpsMap query(const std::shared_ptr<const ov::Model>& model, const Config& config) const = 0;
+    virtual ov::SupportedOpsMap query(const std::shared_ptr<const ov::Model>& model,
+                                      const FilteredConfig& config) const = 0;
     virtual uint32_t get_version() const = 0;
     virtual std::vector<std::string> get_supported_options() const = 0;
     virtual bool is_option_supported(std::string optname) const = 0;

src/plugins/intel_npu/src/compiler_adapter/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ target_link_libraries(${TARGET_NAME}
     PRIVATE
         openvino::npu_al
         openvino::npu_common
+        openvino::xml_util
 )

 #

src/plugins/intel_npu/src/compiler_adapter/include/custom_stream_buffer.hpp

Lines changed: 5 additions & 0 deletions
@@ -75,6 +75,11 @@ class writer_streambuf final : public std::streambuf {
         }
     }

+    pos_type seekpos(pos_type pos, std::ios_base::openmode which) override {
+        writeIt = startIt + pos;
+        return pos;
+    }
+
     OutputIt startIt;
     OutputIt writeIt;
 };
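
Overriding `seekpos` is what lets `std::ostream::seekp(pos)` work on a custom stream buffer, since `seekp(pos)` forwards to `pubseekpos`. The sketch below shows one plausible use, writing a placeholder header and patching it after the payload is written; whether the serializer in this commit uses the override that way is an assumption. `FixedWriterBuf` and `write_with_patched_header` are hypothetical names, and unlike the hunk above this version bounds-checks the target position.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical simplified writer: stores characters in a fixed-size string.
class FixedWriterBuf final : public std::streambuf {
public:
    explicit FixedWriterBuf(std::size_t capacity) : storage_(capacity, '\0') {}

    const std::string& storage() const {
        return storage_;
    }

protected:
    // No put area is configured, so every character written ends up here.
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()) || write_pos_ >= storage_.size()) {
            return traits_type::eof();
        }
        storage_[write_pos_++] = traits_type::to_char_type(ch);
        return ch;
    }

    // Called by std::ostream::seekp(pos); moves the write position to an absolute offset.
    pos_type seekpos(pos_type pos, std::ios_base::openmode) override {
        const auto target = static_cast<std::streamoff>(pos);
        if (target < 0 || static_cast<std::size_t>(target) > storage_.size()) {
            return pos_type(off_type(-1));
        }
        write_pos_ = static_cast<std::size_t>(target);
        return pos;
    }

private:
    std::string storage_;
    std::size_t write_pos_ = 0;
};

// Writes a placeholder header, then the payload, then seeks back to patch the header
// with the payload length (a single decimal digit, to keep the sketch short).
inline std::string write_with_patched_header(const std::string& payload) {
    assert(payload.size() < 10);
    FixedWriterBuf buffer(1 + payload.size());
    std::ostream out(&buffer);
    out.put('?');
    out << payload;
    out.seekp(0);
    out.put(static_cast<char>('0' + payload.size()));
    return buffer.storage();
}
```

Note that only the absolute form `seekp(pos)` reaches `seekpos`; the relative form `seekp(off, dir)` goes through `seekoff`, which neither this sketch nor the hunk above overrides.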

src/plugins/intel_npu/src/compiler_adapter/include/driver_compiler_adapter.hpp

Lines changed: 8 additions & 4 deletions
@@ -10,6 +10,7 @@
 #include "intel_npu/config/config.hpp"
 #include "intel_npu/utils/logger/logger.hpp"
 #include "intel_npu/utils/zero/zero_init.hpp"
+#include "vcl_serializer.hpp"
 #include "ze_graph_ext_wrappers.hpp"

 namespace intel_npu {
@@ -18,17 +19,20 @@ class DriverCompilerAdapter final : public ICompilerAdapter {
 public:
     DriverCompilerAdapter(const std::shared_ptr<ZeroInitStructsHolder>& zeroInitStruct);

-    std::shared_ptr<IGraph> compile(const std::shared_ptr<const ov::Model>& model, const Config& config) const override;
+    std::shared_ptr<IGraph> compile(const std::shared_ptr<const ov::Model>& model,
+                                    const FilteredConfig& config) const override;

-    std::shared_ptr<IGraph> compileWS(const std::shared_ptr<ov::Model>& model, const Config& config) const override;
+    std::shared_ptr<IGraph> compileWS(const std::shared_ptr<ov::Model>& model,
+                                      const FilteredConfig& config) const override;

     std::shared_ptr<IGraph> parse(
         ov::Tensor mainBlob,
-        const Config& config,
+        const FilteredConfig& config,
         std::optional<std::vector<ov::Tensor>> initBlobs = std::nullopt,
         const std::optional<std::shared_ptr<const ov::Model>>& model = std::nullopt) const override;

-    ov::SupportedOpsMap query(const std::shared_ptr<const ov::Model>& model, const Config& config) const override;
+    ov::SupportedOpsMap query(const std::shared_ptr<const ov::Model>& model,
+                              const FilteredConfig& config) const override;

     std::vector<std::string> get_supported_options() const override;

src/plugins/intel_npu/src/compiler_adapter/include/ir_serializer.hpp

Lines changed: 0 additions & 82 deletions
This file was deleted.

src/plugins/intel_npu/src/compiler_adapter/include/plugin_compiler_adapter.hpp

Lines changed: 7 additions & 4 deletions
@@ -19,17 +19,20 @@ class PluginCompilerAdapter final : public ICompilerAdapter {
 public:
     PluginCompilerAdapter(const std::shared_ptr<ZeroInitStructsHolder>& zeroInitStruct);

-    std::shared_ptr<IGraph> compile(const std::shared_ptr<const ov::Model>& model, const Config& config) const override;
+    std::shared_ptr<IGraph> compile(const std::shared_ptr<const ov::Model>& model,
+                                    const FilteredConfig& config) const override;

-    std::shared_ptr<IGraph> compileWS(const std::shared_ptr<ov::Model>& model, const Config& config) const override;
+    std::shared_ptr<IGraph> compileWS(const std::shared_ptr<ov::Model>& model,
+                                      const FilteredConfig& config) const override;

     std::shared_ptr<IGraph> parse(
         ov::Tensor mainBlob,
-        const Config& config,
+        const FilteredConfig& config,
         std::optional<std::vector<ov::Tensor>> initBlobs = std::nullopt,
         const std::optional<std::shared_ptr<const ov::Model>>& model = std::nullopt) const override;

-    ov::SupportedOpsMap query(const std::shared_ptr<const ov::Model>& model, const Config& config) const override;
+    ov::SupportedOpsMap query(const std::shared_ptr<const ov::Model>& model,
+                              const FilteredConfig& config) const override;

     std::vector<std::string> get_supported_options() const override;
