
[Feature] C++ Extension: Introduce DocNode, TextSplitterBase, SentenceSplitter #1022

Open
CompromisedKiwi wants to merge 79 commits into LazyAGI:main from CompromisedKiwi:yzh/migrate_doc_node

Conversation

@CompromisedKiwi CompromisedKiwi (Collaborator) commented Feb 7, 2026

📌 PR Description

This PR updates the lazyllm_cpp extension, covering the RAG-related classes DocNode, _TextSplitterBase, and SentenceSplitter.

Architecture (core / adaptor / binding)

  1. core layer (csrc/core/include, csrc/core/src)

    • Responsibility: pure C++ data structures and algorithm implementations, with no dependence on Python object semantics.
    • Contents: DocNode, TextSplitterBase, SentenceSplitter, the Tokenizer interface, and the split/merge strategies.
    • Goal: easy to optimize for performance (string_view, concurrency).
  2. adaptor layer (csrc/adaptor)

    • Responsibility: bridging callbacks to Python objects (std::any argument encoding/decoding, GIL acquisition, a unified call entry point).
    • Contents: AdaptorBaseWrapper and DocumentStore (caches Python objects and calls back through the wrapper).
    • Goal: pull the cross-language call machinery out of the algorithms and exports, so core/binding contain no duplicated bridging logic.
  3. binding layer (csrc/binding)

    • Responsibility: pybind11 exports, trampolines, and Python-semantics compatibility (naming, arguments, return types, kwargs tolerance).
    • Contents: export_doc_node.cpp, export_text_splitter_base.cpp, export_sentence_splitter.cpp.
    • Goal: concentrate all Python-coupled behavior in binding, so core stays stable, predictable, and optimizable.
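To make the layering concrete, here is a minimal sketch of what "core holds no Python semantics" means in practice. Everything here is illustrative (the field names are not the PR's actual layout); the point is that the core type is plain C++ that compiles and tests without a Python interpreter:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical core-layer node: plain C++ data only. Python-specific
// concerns (kwargs tolerance, the _lock attribute, dynamic attributes)
// belong to the binding layer, not here.
struct DocNode {
    std::string uid;
    std::string content;
    std::map<std::string, std::string> metadata;  // simple value types in core
    std::vector<DocNode*> children;               // tree links stay raw C++
};
```

The binding layer would then export such a type via pybind11 and patch on Python-side attributes after construction, while the adaptor layer translates Python callbacks into plain C++ calls on it.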

Third-party dependencies (declared via CMake)

Dependency declarations live in csrc/cmake/third_party.cmake:

  1. pybind11: implements the C++/Python binding layer.
  2. Python3 (Interpreter + Development).
  3. xxHash: high-performance hashing (e.g. the content-hash paths).
  4. cpp_tiktoken: tokenizer encode/decode (backend for TiktokenTokenizer).
    • Note: it pulls in transitive dependencies such as pcre2.
  5. utf8proc: Unicode text processing.
  6. ThreadPool (header-only, vendored locally)
    • Location: csrc/core/include/thread_pool.hpp
    • Origin: progschj/ThreadPool (header-only)
    • Use: parallel execution of NodeTransform::batch_forward.
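As a rough illustration of what the vendored pool buys in NodeTransform::batch_forward, the sketch below runs a per-document transform in parallel. std::async stands in for the progschj ThreadPool (the real code enqueues tasks on the pool instead), and the transform itself is a dummy made up for the example:

```cpp
#include <future>
#include <string>
#include <vector>

// Sketch of batch_forward-style parallelism: one task per document,
// results gathered in input order. The "<chunk>" prefix is a stand-in
// for whatever the real transform produces.
std::vector<std::string> batch_forward(const std::vector<std::string>& docs) {
    std::vector<std::future<std::string>> futs;
    futs.reserve(docs.size());
    for (const auto& d : docs)
        futs.push_back(std::async(std::launch::async,
                                  [&d] { return "<chunk>" + d; }));
    std::vector<std::string> out;
    out.reserve(futs.size());
    for (auto& f : futs) out.push_back(f.get());  // preserves input order
    return out;
}
```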

Design philosophy: separating core and binding

This PR consistently follows the principle that "core owns the algorithms, binding owns the Python semantics":

  1. core does not care about Python's dynamic behavior, kwargs, naming compatibility, or polymorphic input types.
  2. binding reproduces a consistent Python experience via trampolines plus lambda adapters.
  3. adaptor handles the cross-language callback machinery, so core/binding never repeatedly deal with the GIL or std::any.

Where string_view is (and is not) used

  1. Paths already accelerated with string_view

    • The input view of TextSplitterBase::split_text: csrc/core/src/text_splitter_base.cpp:31
    • Recursive and rule-based splitting: split_recursive, split_by_functions, split_text_while_keeping_separator.
    • The static separator-splitting helpers return vector<string_view>, eliminating intermediate copies.
  2. Paths not yet fully converted to string_view

    • The merge_chunks stage still outputs vector<string> (ownership is required for safe downstream decoding and concatenation).
    • SentenceSplitter maintains Chunk.text while merging (overlap backfill, concatenation, and trimming all need a stable, owned string).
    • TiktokenTokenizer::encode still has to convert to std::string first (a limitation of the underlying interface), so some merge paths still copy strings.
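A minimal sketch of the string_view idea behind a keep-the-separator split (the function shape is illustrative, not the PR's exact API). Every returned chunk is a view into the caller's buffer, so the split itself allocates nothing:

```cpp
#include <string_view>
#include <vector>

// Split `text` on `sep`, keeping the separator attached to the chunk
// before it. Returned views alias `text` and are only valid while the
// original buffer lives.
std::vector<std::string_view> split_keep_sep(std::string_view text,
                                             std::string_view sep) {
    std::vector<std::string_view> out;
    size_t start = 0;
    while (start < text.size()) {
        size_t pos = text.find(sep, start);
        if (pos == std::string_view::npos) {
            out.push_back(text.substr(start));  // trailing chunk, no separator
            break;
        }
        out.push_back(text.substr(start, pos - start + sep.size()));
        start = pos + sep.size();
    }
    return out;
}
```

The lifetime caveat in the comment is exactly why a merge stage whose output must outlive the input materializes owned std::strings rather than views.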

Python-side integration (DocNode vs. TextSplitterBase / SentenceSplitter)

On the Python side, this PR uses two different C++ integration strategies, serving different stability and extensibility needs:

  1. DocNode: @cpp_class (whole-class replacement)
    • DocNode uses @cpp_class; when LAZYLLM_ENABLE_CPP_OVERRIDE is enabled, the Python class is replaced outright by the same-named C++ export.
    • The C++ side exports the complete DocNode type and, after construction, fills in Python-side attributes such as _lock, keeping the existing call chain compatible.
    • Rationale: DocNode is a high-frequency core RAG data structure with a relatively stable interface, so pushing the whole class down yields the most direct performance gains and consistent behavior.
  2. _TextSplitterBase / SentenceSplitter: @cpp_proxy (instance proxy)
    • _TextSplitterBase and SentenceSplitter use @cpp_proxy: the Python class itself is kept, and only designated hot methods (such as split_text and _merge) are proxied to the C++ impl.
    • _merge is aliased to the C++ merge_chunks, decoupling the Python method name from the C++ export interface.
    • Rationale: the splitter side still needs Python extensibility (subclasses overriding _split/_merge, custom tokenizers and rules), so it takes the compromise of C++-accelerated hot paths with Python semantics preserved.
  3. Shared behavior and compatibility strategy
    • Both modes are gated by the single LAZYLLM_ENABLE_CPP_OVERRIDE switch.
    • Proxy mode guards polymorphism and semantic compatibility: when a subclass overrides a key method, it automatically falls back to the Python path; certain exceptions are translated to match the original Python behavior.
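The fallback rule in point 3 can be sketched language-agnostically: dispatch to the fast C++ path unless an override is installed, in which case honor the user path. In the PR the actual check is whether a Python subclass redefined _split/_merge; here a std::function stands in for that, and all names are illustrative:

```cpp
#include <functional>
#include <string>
#include <utility>

// Toy proxy: routes a hot method to the built-in fast path unless the
// user registered an override, mirroring the @cpp_proxy fallback rule.
class SplitterProxy {
public:
    using Fn = std::function<std::string(const std::string&)>;
    void set_override(Fn fn) { override_ = std::move(fn); }
    std::string split(const std::string& text) const {
        if (override_) return override_(text);  // user-supplied path wins
        return "cpp:" + text;                   // stand-in for the C++ fast path
    }
private:
    Fn override_;
};
```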

Progress

  • DocNode
  • TextSplitterBase
  • SentenceSplitter

✅ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Refactor (no functionality change, code structure optimized)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update (changes to docs only)
  • Performance optimization

🧪 How Has This Been Tested?

  1. Enable the environment variable LAZYLLM_ENABLE_CPP_OVERRIDE=1.
  2. Unit tests: test_doc_node.py, test_transform.py
  3. Integration test: rag_online.py

⚡ Usage After Update

With the environment variable LAZYLLM_ENABLE_CPP_OVERRIDE=1 set, lazyllm automatically loads the C++ extension.
(On Linux, if loading the .so fails, point LD_PRELOAD at the C++ standard library: export LD_PRELOAD=/lib/x86_64-linux-gnu/libstdc++.so.6)

⚠️ Additional Notes

@CompromisedKiwi CompromisedKiwi requested review from a team as code owners February 7, 2026 06:15
@gemini-code-assist

Summary of Changes

Hello @CompromisedKiwi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a substantial architectural shift by migrating several core RAG (Retrieval Augmented Generation) components, specifically DocNode, NodeTransform, and TextSplitterBase, from Python to C++. The primary motivation behind this migration is to significantly enhance the performance of compute-intensive operations within the RAG pipeline. The changes establish a robust hybrid Python-C++ framework, leveraging pybind11 for interoperability and a structured C++ project layout, ensuring that performance gains are achieved without sacrificing the flexibility of Python-based logic where appropriate.

Highlights

  • C++ Migration of Core RAG Components: Key components like DocNode, NodeTransform, and TextSplitterBase have been reimplemented in C++ to boost performance for compute-intensive tasks.
  • Hybrid Python-C++ Architecture: A new 'adaptor' layer (lazyllm_adaptor) facilitates seamless interaction, allowing C++ code to call back into Python for specific functionalities, maintaining flexibility.
  • Enhanced Build System: The CMake configuration has been significantly updated to support the new C++ codebase, manage external dependencies (pybind11, xxHash, sentencepiece) efficiently, and enforce modern C++ standards (C++17).
Changelog
  • .gitignore
    • Added '.cache/' to ignored files.
  • csrc/CMakeLists.txt
    • Updated C++ standard to C++17.
    • Added '-Werror' and '-Wshadow' compile options.
    • Refactored third-party dependency inclusion to 'cmake/third_party.cmake'.
    • Adjusted source file globbing for 'lazyllm_core' to 'core/src/*.cpp'.
    • Linked 'lazyllm_core' with 'xxhash' and 'sentencepiece'.
    • Introduced 'lazyllm_adaptor' static library for Python callback mechanisms.
    • Updated 'lazyllm_cpp' binding sources and linked it with 'lazyllm_adaptor'.
  • csrc/README.md
    • Renamed from 'csrc/include/README.md'.
  • csrc/adaptor/adaptor.cpp
    • New file, includes 'adaptor_base_wrapper.hpp' and 'document_store.hpp'.
  • csrc/adaptor/adaptor_base_wrapper.hpp
    • New file, defines 'AdaptorBaseWrapper' for Python object callbacks.
  • csrc/adaptor/document_store.hpp
    • New file, defines 'NodeGroup' and 'DocumentStore' for C++ interaction with Python document stores.
  • csrc/binding/export_add_doc_str.cpp
    • Renamed from 'csrc/binding/doc.cpp'.
    • Function 'exportDoc' renamed to 'exportAddDocStr'.
  • csrc/binding/export_doc_node.cpp
    • New file, contains 'pybind11' bindings for 'lazyllm::DocNode'.
  • csrc/binding/export_node_transform.cpp
    • New file, contains 'pybind11' bindings for 'lazyllm::NodeTransform'.
  • csrc/binding/export_text_splitter_base.cpp
    • New file, contains 'pybind11' bindings for 'lazyllm::TextSplitterBase' and 'lazyllm::_TokenTextSplitter'.
  • csrc/binding/lazyllm.cpp
    • Updated to include new binding export functions and remove old 'DocNode' bindings.
  • csrc/binding/lazyllm.hpp
    • Updated to declare new binding export functions.
  • csrc/cmake/tests.cmake
    • Updated 'googletest' URL to a specific commit hash.
  • csrc/cmake/third_party.cmake
    • New file, centralizes 'FetchContent' for 'pybind11', 'xxHash', and 'sentencepiece'.
  • csrc/core/include/adaptor_base.hpp
    • New file, defines 'AdaptorBase' for C++-Python callback interface.
  • csrc/core/include/doc_node.hpp
    • New file, provides the C++ implementation of 'DocNode'.
  • csrc/core/include/node_transform.hpp
    • New file, provides the C++ implementation of 'NodeTransform' with thread pool support.
  • csrc/core/include/text_splitter_base.hpp
    • New file, provides the C++ implementation of 'TextSplitterBase' for text splitting.
  • csrc/core/include/thread_pool.hpp
    • New file, implements a generic C++ thread pool.
  • csrc/core/include/tokenizer.hpp
    • New file, defines 'Tokenizer' interface and 'SentencePieceTokenizer'.
  • csrc/core/include/utils.hpp
    • New file, contains general utility functions and RAG metadata keys.
  • csrc/core/src/doc_node.cpp
    • New file, includes 'doc_node.hpp'.
  • csrc/core/src/text_splitter_base.cpp
    • New file, includes 'text_splitter_base.hpp'.
  • csrc/core/src/utils.cpp
    • New file, includes 'utils.hpp'.
  • csrc/include/doc_node.h
    • Removed old 'DocNode' header.
  • csrc/scripts/config_cmake.sh
    • New script for CMake configuration.
  • csrc/src/doc_node.cpp
    • Removed old 'DocNode' implementation file.
  • lazyllm/tools/rag/doc_node.py
    • Changed '_embedding_state' to 'embedding_state' (public attribute).
  • lazyllm/tools/rag/utils.py
    • Updated 'parallel_do_embedding' to directly access 'node.embedding_state'.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/main.yml
    • .github/workflows/publish_release.yml
Activity
  • The author 'CompromisedKiwi' initiated this feature branch to migrate Python RAG components to C++.
  • The pull request introduces a new C++ project structure under 'csrc/' with core logic, adaptor layer, and pybind11 bindings.
  • Dependencies like 'pybind11', 'xxHash', and 'sentencepiece' are now managed via CMake's 'FetchContent'.
  • The C++ standard has been upgraded to C++17.
  • The 'DocNode' and 'NodeTransform' components are fully migrated and bound, with 'TextSplitterBase' partially implemented.
  • Minor Python code adjustments were made to align with the new C++ 'DocNode' structure.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR migrates core capabilities such as DocNode / NodeTransform / TextSplitterBase / SentenceSplitter into the C++ extension for performance, which is a good direction. The code is cleanly structured into core, adaptor, and binding layers, and uses modern C++ libraries such as pybind11, xxHash, and sentencepiece. However, I found several serious issues that need to be resolved before merging:

  • Multiple thread-safety (race condition) and memory-safety (dangling pointer) hazards that could cause crashes or undefined behavior.
  • Some pybind11 bindings reference unimplemented C++ methods, which will cause compilation failures.
  • Several bugs in argument and return-value handling in the C++/Python interop logic.

I have left specific review comments in the code; please look them over carefully. With these issues fixed, this will be a very valuable performance contribution.

@wzh1994 wzh1994 (Contributor) left a comment

PR Summary:
Purpose
This PR introduces three core classes into LazyLLM's C++ extension layer (DocNode, TextSplitterBase, and SentenceSplitter) and upgrades the CI job that previously only compiled the C++ extension into a full "compile + Python regression test" pipeline.

Changed files and rationale

  1. .github/workflows/main.yml: the cpp_ext_test job is substantially expanded: it adds environment variables (vendor API keys), installs test dependencies, downloads the test dataset, runs cmake --install after building the C++ extension, then executes basic_tests and advanced_tests, ensuring the C++ extension stays regression-consistent with the Python side on Linux/macOS/Windows.
  2. .github/workflows/publish_release.yml: minor cleanup (removing redundant blank lines/steps), aligned with the new build flow.
  3. New/modified C++ sources under csrc/ (the diff is truncated, but this is inferable from the PR title): pybind11 bindings implementing DocNode (document node data structure), TextSplitterBase (splitter base class), and SentenceSplitter (sentence-level splitter), which the Python side hot-swaps for the original pure-Python implementations via LAZYLLM_ENABLE_CPP_OVERRIDE=1.

Key design decisions

  • The CI job adds if: always(), so it runs even when the upstream clone job fails. This likely keeps the C++ build status always visible, but it also produces meaningless failure logs when the clone fails.
  • cmake --install ... --component lazyllm_cpp installs build artifacts directly into the workspace's lazyllm/ directory, avoiding sys.path modifications and matching the pip-installed layout.
  • The environment variable LAZYLLM_ENABLE_CPP_OVERRIDE=1 acts as a runtime switch, letting the Python regression tests reuse the same cases with and without the C++ extension.
  • Test dependencies are split by platform (requirements_linux.txt / requirements_mac.txt); Windows installs only the common dependencies.

Potential risks

  1. if: always() can cause cascading errors in later steps after a failed checkout; consider if: needs.clone.result == 'success' or at least fault tolerance on the checkout step.
  2. The git clone of the test dataset uses PERSONAL_GITHUB_TOKEN; if the secret is not configured it falls back to github.token, which may fail on PRs from forks due to insufficient permissions.
  3. The CI adds many vendor API key environment variables; PRs from external contributors cannot access these secrets, so confirm the tests skip gracefully when keys are missing.
  4. The diff is truncated, so the C++ implementation itself (memory management, thread safety, GIL interaction) cannot be reviewed here; focus on lifetimes and exception propagation in the pybind11 bindings.
  5. timeout-minutes: 120 is long for a three-platform build plus full regression and may mask hangs.

Findings:

  • total_issues: 33
  • logic: 14
  • safety: 6
  • style: 4
  • exception: 3
  • maintainability: 2
  • type: 2
  • design: 1
  • performance: 1

auto reviewed by BOT (claude-opus-4-6)
