[Feature] C++ Extension: Introduce DocNode, TextSplitterBase, SentenceSplitter
#1022
Open
CompromisedKiwi wants to merge 79 commits into LazyAGI:main from CompromisedKiwi:yzh/migrate_doc_node
Changes from 20 commits
e011555 workflow fix (CompromisedKiwi)
bfb16fa coarce migration (CompromisedKiwi)
54e43c1 underline (CompromisedKiwi)
894bb73 c++17 (CompromisedKiwi)
b0a1ca0 rename (CompromisedKiwi)
8e7e017 save (CompromisedKiwi)
fa1d7d2 Merge branch 'main' into yzh/migrate_doc_node (CompromisedKiwi)
7ea108e undo workflow fix (CompromisedKiwi)
f0f6657 refactor (CompromisedKiwi)
1854448 adaptor (CompromisedKiwi)
1484fc7 finish doc_node init (CompromisedKiwi)
a69f82f children (CompromisedKiwi)
a6cfceb doc_node hpp (CompromisedKiwi)
0170a0e DocNode done (CompromisedKiwi)
459cfd4 pending review (CompromisedKiwi)
5ea167c NodeTransform done (CompromisedKiwi)
e4070f8 rename (CompromisedKiwi)
6017ffa save (CompromisedKiwi)
cc7ab7e Merge branch 'main' into yzh/migrate_doc_node (CompromisedKiwi)
615b7b0 Module (CompromisedKiwi)
0b193c8 map_params (CompromisedKiwi)
0d88ea6 save (CompromisedKiwi)
02cbec4 Integrate utf8proc to split text to readable chars. (CompromisedKiwi)
af7e617 UnicodeProcessor (CompromisedKiwi)
1c7ee82 text splitter base cpp finish (CompromisedKiwi)
9ef9bd8 keys (CompromisedKiwi)
068ca98 export (CompromisedKiwi)
19e00dd sentence_splitter (CompromisedKiwi)
e0c3acc compile_options (CompromisedKiwi)
06aa586 tests in cpp side (CompromisedKiwi)
a214e35 libstdc++.so.6 (CompromisedKiwi)
e865ab6 DocNode manage itself. (CompromisedKiwi)
2fd8583 finish cpp side tests (CompromisedKiwi)
ac9dad3 cpp env switch (CompromisedKiwi)
4ab5a93 no need to test cpp override (CompromisedKiwi)
b38affc cpp tests passed. (CompromisedKiwi)
79218fb merge (CompromisedKiwi)
ee3ecbc install and third parties so. (CompromisedKiwi)
42252a7 Reuse python side tests. (CompromisedKiwi)
06eabd4 LD_PRELOAD (CompromisedKiwi)
fa73e50 feat: add cpp_class decorator for C++ class replacement (CompromisedKiwi)
08f3333 docnode cpp ext repaired (CompromisedKiwi)
2c893df save (CompromisedKiwi)
f850d15 RegisterMap (CompromisedKiwi)
9e709da NodeTransform refactor (CompromisedKiwi)
1024d0e no node_transform (CompromisedKiwi)
81c9aaa simplify (CompromisedKiwi)
25d0c83 new TextSplitterBaseCPPImpl (CompromisedKiwi)
7680fff cpp tests passed (CompromisedKiwi)
980d0ad python tests passed (CompromisedKiwi)
0dec57a change tiktoken cache dir outside (CompromisedKiwi)
b5c4ba3 Merge branch 'main' into yzh/migrate_doc_node (CompromisedKiwi)
1e1087a GIL (CompromisedKiwi)
13a167d linting (CompromisedKiwi)
043fd1b cpp_build_and_python_regression (CompromisedKiwi)
5ccc9a7 fatal: could not read Username for (CompromisedKiwi)
a9417f3 no LAZYLLM_DATA (CompromisedKiwi)
c1d03cd add lazyllm_data (CompromisedKiwi)
2b5393e no rerun (CompromisedKiwi)
3fbe4b0 basic tests regression (CompromisedKiwi)
787cddd basic tests regression done (CompromisedKiwi)
ed4c5d5 purify cpp_proxy (CompromisedKiwi)
8ae6609 no adaptor (CompromisedKiwi)
f25a1d4 docnode simplification (CompromisedKiwi)
9e2cb42 UUID (CompromisedKiwi)
6e32f84 fix (CompromisedKiwi)
369eada fix (CompromisedKiwi)
326c068 cpp_proxy simplification (CompromisedKiwi)
5a9d1b7 no view (CompromisedKiwi)
1b74fa4 save (CompromisedKiwi)
4600b72 save (CompromisedKiwi)
a105fc1 dynamic cpp member signature checking (CompromisedKiwi)
62f2793 save (CompromisedKiwi)
d82edc8 cpp proxy class name could be specified. (CompromisedKiwi)
20b3eb5 include(FetchContent) (CompromisedKiwi)
7608a63 Forbiden R value (CompromisedKiwi)
47a2680 else if (CompromisedKiwi)
754388e inline is implicitly specified (CompromisedKiwi)
7e895b7 new tests (CompromisedKiwi)
@@ -7,6 +7,7 @@ test/
dist/
tmp/
build
.cache/
*.lock
*.db
mkdocs.yml
File renamed without changes.
@@ -0,0 +1,2 @@
#include "adaptor_base_wrapper.hpp"
#include "document_store.hpp"
CompromisedKiwi marked this conversation as resolved. (Outdated)
@@ -0,0 +1,37 @@
#pragma once

#include <any>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

#include <pybind11/pybind11.h>

#include "adaptor_base.hpp"


namespace lazyllm {

// Adaptor backed by a Python object: looks up the named attribute under the
// GIL and hands it to the subclass's call_impl.
class LAZYLLM_HIDDEN AdaptorBaseWrapper : public AdaptorBase {
    pybind11::object _py_obj;
public:
    AdaptorBaseWrapper(const pybind11::object &obj) : _py_obj(obj) {}
    virtual ~AdaptorBaseWrapper() = default;

    std::any call(
        const std::string& func_name,
        const std::unordered_map<std::string, std::any>& args) const override final
    {
        pybind11::gil_scoped_acquire gil;
        // Falls back to None when the attribute is missing; call_impl has to cope with that.
        pybind11::object func = pybind11::getattr(_py_obj, func_name.c_str(), pybind11::none());
        return call_impl(func_name, func, args);
    }

    virtual std::any call_impl(
        const std::string& func_name,
        const pybind11::object& func,
        const std::unordered_map<std::string, std::any>& args) const = 0;
};

} // namespace lazyllm
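The wrapper above funnels every call through a function name plus a map of `std::any` arguments, with a non-overridable `call` delegating to a subclass-provided `call_impl`. A minimal sketch of that shape, using only the standard library so it can be tried without pybind11; `Adaptor`, `EchoStore`, and the `"disabled"` group are hypothetical names for illustration and do not come from the PR:

```cpp
#include <any>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical names throughout; this mirrors only the shape of the pattern.
using Args = std::unordered_map<std::string, std::any>;

class Adaptor {
public:
    virtual ~Adaptor() = default;

    // Public entry point. Shared setup would go here; in the diff above that
    // is acquiring the GIL and looking up the Python attribute.
    std::any call(const std::string& func_name, const Args& args) const {
        return call_impl(func_name, args);
    }

protected:
    virtual std::any call_impl(const std::string& func_name, const Args& args) const = 0;
};

class EchoStore : public Adaptor {
protected:
    std::any call_impl(const std::string& func_name, const Args& args) const override {
        if (func_name == "is_group_active") {
            // Args::at throws std::out_of_range on a missing key, and
            // any_cast throws std::bad_any_cast on a type mismatch.
            const auto& group = std::any_cast<const std::string&>(args.at("group"));
            return group != "disabled";
        }
        throw std::runtime_error("Unknown function: " + func_name);
    }
};
```

One thing the sketch makes visible: the caller and `call_impl` must agree on both the key and the stored type. A string literal placed in the map is stored as `const char*`, not `std::string`, so a caller would write `store.call("is_group_active", {{"group", std::string("chunk")}})` for the cast above to succeed.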
@@ -0,0 +1,119 @@
#pragma once

#include <any>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include "adaptor_base_wrapper.hpp"
#include "doc_node.hpp"

namespace lazyllm {

struct NodeGroup {
    enum class Type {
        ORIGINAL, CHUNK, SUMMARY, IMAGE_INFO, QUESTION_ANSWER, OTHER
    };
    std::string _parent;
    std::string _display_name;
    Type _type;
    NodeGroup(
        const std::string& parent,
        const std::string& display_name,
        const Type& type = Type::ORIGINAL) :
        _parent(parent), _display_name(display_name), _type(type) {}
};

class LAZYLLM_HIDDEN DocumentStore : public AdaptorBaseWrapper {
public:
    DocumentStore() = delete;
    explicit DocumentStore(
        const pybind11::object& store,
        const std::unordered_map<std::string, NodeGroup> &map) :
        AdaptorBaseWrapper(store), _node_groups_map(map) {}

    // Cache-aware factory to avoid rebuilding adaptor for the same Python store.
    // On a cache hit the existing wrapper is returned and `map` is ignored.
    static std::shared_ptr<DocumentStore> from_store(
        const pybind11::object& store, const std::unordered_map<std::string, NodeGroup>& map) {
        if (store.is_none()) return nullptr;

        pybind11::gil_scoped_acquire gil;
        PyObject *key = store.ptr();
        auto &cache = store_cache();
        auto it = cache.find(key);
        if (it != cache.end()) {
            if (auto existing = it->second.lock())
                return existing;
        }
        auto created = std::make_shared<DocumentStore>(store, map);
        cache[key] = created;
        return created;
    }

    // Collects, for each active group whose parent is the node's group,
    // the nodes that have `node` as their parent.
    DocNode::Children get_node_children(const DocNode* node) const {
        DocNode::Children out;
        auto& kb_id = std::any_cast<std::string&>(node->_p_global_metadata->at(std::string(RAG_KEY_KB_ID)));
        auto& doc_id = std::any_cast<std::string&>(node->_p_global_metadata->at(std::string(RAG_KEY_DOC_ID)));
        auto& group_name = node->get_group_name();
        for (auto& [current_group_name, group] : _node_groups_map) {
            if (group._parent != group_name) continue;
            if (!std::any_cast<bool>(call("is_group_active", {{"group", current_group_name}}))) continue;
            // Fix: pass the key call_impl actually reads ("doc_id" as a string);
            // the original passed "doc_ids", which makes args.at("doc_id") throw.
            auto nodes_in_group = std::any_cast<std::vector<DocNode*>>(call("get_nodes", {
                {"group_name", current_group_name},
                {"kb_id", kb_id},
                {"doc_id", doc_id}
            }));

            std::vector<DocNode*> children;
            children.reserve(nodes_in_group.size());
            for (auto* n : nodes_in_group)
                if (n->get_parent_node() == node) children.push_back(n);
            out[current_group_name] = children;
        }
        return out;
    }

private:
    std::unordered_map<std::string, NodeGroup> _node_groups_map;

    std::any call_impl(
        const std::string& func_name,
        const pybind11::object& func,
        const std::unordered_map<std::string, std::any>& args) const override
    {
        if (func_name == "is_group_active") {
            // Fix: unwrap the std::any to std::string before handing it to Python.
            return func(std::any_cast<std::string>(args.at("group"))).cast<bool>();
        }
        else if (func_name == "get_node") {
            return func(
                pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
                pybind11::arg("uids") = std::vector<std::string>({std::any_cast<std::string>(args.at("uid"))}),
                pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
                pybind11::arg("display") = true
            ).cast<pybind11::list>()[0].cast<DocNode*>();
        }
        // Review thread: CompromisedKiwi marked this conversation as resolved. (Outdated)
        else if (func_name == "get_nodes") {
            return func(
                pybind11::arg("group_name") = std::any_cast<std::string>(args.at("group_name")),
                pybind11::arg("kb_id") = std::any_cast<std::string>(args.at("kb_id")),
                pybind11::arg("doc_ids") = std::vector<std::string>({std::any_cast<std::string>(args.at("doc_id"))})
                // Review thread: CompromisedKiwi marked this conversation as resolved. (Outdated)
            ).cast<std::vector<DocNode*>>();
        }
        else if (func_name == "get_node_children") {
            return get_node_children(std::any_cast<DocNode*>(args.at("node")));
        }

        throw std::runtime_error("Unknown DocumentStore function: " + func_name);
    }

    // Cache by Python object identity to ensure one wrapper per store instance.
    static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> &store_cache() {
        static std::unordered_map<PyObject *, std::weak_ptr<DocumentStore>> cache;
        return cache;
    }
    // Review thread: CompromisedKiwi marked this conversation as resolved. (Outdated)
};

} // namespace lazyllm
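The `from_store` factory above caches wrappers by object identity and holds them through `std::weak_ptr`, so the cache never keeps a wrapper alive on its own. A standalone sketch of that idea follows; `Handle` and `Wrapper` are hypothetical stand-ins for the Python store and `DocumentStore`. Erasing an expired entry before re-inserting is an addition of this sketch, not something the diff does:

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>

// Hypothetical stand-ins: Handle plays the role of the Python store object,
// Wrapper the role of the C++ adaptor built around it.
struct Handle { int id; };

class Wrapper {
public:
    explicit Wrapper(const Handle* h) : _handle(h) {}
    const Handle* handle() const { return _handle; }

    // Return the live wrapper for `h` if one exists, otherwise build one.
    static std::shared_ptr<Wrapper> from_handle(const Handle* h) {
        if (h == nullptr) return nullptr;
        auto& cache = wrapper_cache();
        auto it = cache.find(h);
        if (it != cache.end()) {
            if (auto existing = it->second.lock()) return existing;
            cache.erase(it);  // entry expired: drop it before re-inserting
        }
        auto created = std::make_shared<Wrapper>(h);
        cache[h] = created;  // stored as weak_ptr, so no ownership is taken
        return created;
    }

    static std::size_t cache_size() { return wrapper_cache().size(); }

private:
    const Handle* _handle;

    // Function-local static, constructed on first use. Access is not
    // synchronized here; the diff above touches its cache only after
    // acquiring the GIL.
    static std::unordered_map<const Handle*, std::weak_ptr<Wrapper>>& wrapper_cache() {
        static std::unordered_map<const Handle*, std::weak_ptr<Wrapper>> cache;
        return cache;
    }
};
```

Two calls with the same handle return the same `shared_ptr` while any owner is alive; once all owners release it, the next call builds a fresh wrapper. Entries for handles that are never looked up again stay in the map in expired form, which may or may not matter depending on how many distinct stores a process creates.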
This is an AI-generated suggestion; please verify before applying.
[critical] [logic]
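`if: always()` makes the `cpp_ext_test` job run even when `needs: [clone]` has failed, so the subsequent steps may execute without a successful clone and fail unpredictably. Suggestion: if the intent is to run even when other jobs fail, but only when clone has succeeded, change it to: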
auto reviewed by BOT (claude-opus-4-6)