layout | title | short | permalink |
---|---|---|---|
page |
Google Summer of Code 2021 - Enhancing clang-bind |
GSoC 2021 clang-bind |
/gsoc-2021/clang-bind/ |
Student: Divyanshu Madan
Mentors: Kunal Tyagi, Lars Glud
Enhancing the Point Cloud Library’s clang-bind project to fix and extend its existing capabilities: the aim is to fix existing issues and increase the coverage of the existing scope to develop python bindings of PCL.
1.1. Introduce clang_utils.py, compilation_database.py, and related changes #6
- Introduce
clang_utils.py
- We use the inspect module, which provides several useful functions to help get information about live objects such as modules, classes, methods, functions, etc.
- The getmembers function retrieves the members of an object such as a class or module though it triggers execution instead of doing static analysis.
- This leads to errors, particularly on properties of classes in cindex.py, which causes segmentation errors or raises an Exception if a particular condition is not satisfied.
- To curb this, we fetch the members statically. We define a custom function
getmembers_static
based on the one in the inspect module.
- We then use the function to create utility functions for
is_
,get_
, and@property
methods for the following libclang objects:- CursorKind: describes the kind of entity that a cursor points to.
- Cursor: represents a reference to an element within the AST. It acts as a kind of iterator.
- Type: represents the type of an element in the abstract syntax tree.
- Introduce
compilation_database.py
: Loadscompile_commands.json
from a given directory. This is useful to get compilation arguments for a file from the compilation database. - Relevant changes in
parse.py
,generate.py
, and unit tests.
1.2. Introduce and utilise treelib #20
- Earlier, we were creating the tree structure using python dictionaries, in which a node had the following structure:
While this worked fine, we improved this by using a tree-based python package, instead of handling it via dictionaries. It also helps us reduce code during tree traversal by using pre-defined methods.
parsed_node = { "depth": node.get("depth"), "line": cursor.location.line, "column": cursor.location.column, "tokens": [x.spelling for x in cursor.get_tokens()], "cursor_kind": { **cursor_kind_utils.get_check_functions_dict(), **cursor_kind_utils.get_get_functions_dict(), **cursor_kind_utils.get_properties_dict(), }, "cursor": { **cursor_utils.get_check_functions_dict(), **cursor_utils.get_get_functions_dict(), **cursor_utils.get_properties_dict(), }, "type": { **cursor_type_utils.get_check_functions_dict(), **cursor_type_utils.get_get_functions_dict(), **cursor_type_utils.get_properties_dict(), }, "members": [], }
- We used treelib in
parse.py
to generate the tree from the traversal of the file's AST. This included:- Creating a
Node
class (as a data holder class). This contains parsed information about a cursor. - Utilise treelib in
Parse
class to create a similar structure as before.
- Creating a
- In unit tests, we use the path_to_leaves function to check the parsed tree's nodes: imitating the prior tests' structure. We also now use clang objects directly for assertions (earlier, we were using string comparisons).
1.3. Restructure and remove unnecessary files #23
- Earlier, since this project was meant to have resided in the pcl repo and used only for binding PCL, the
clang-bind
directory structure was designed accordingly: the code was present inbindings/python
directory. - Now, that we aim to create a general-purpose tool which is not specific to
pcl
, we updated the project to have the desired directory structure, for easier imports and better segregation:clang-bind ├── clang_bind │ ├── bind.py │ ├── clang_utils.py │ ├── cmake_frontend.py │ ├── generate.py │ ├── interface.py │ ├── parse.py │ ├── templates │ │ └── NAMESPACE │ └── utils.py ├── CMakeLists.txt ├── CMakeLists.txt.in ├── conanfile.txt ├── discussion.md ├── docs │ ├── make.bat │ ├── Makefile │ └── source │ ├── conf.py │ ├── index.rst │ ├── _static │ └── _templates ├── pytest.ini ├── README.md ├── requirements.txt ├── setup.py ├── tests │ ├── test_generate.py │ ├── test_parse.py │ └── test_project │ ├── CMakeLists.txt │ ├── include │ │ └── clang_bind_test │ │ └── function.hpp │ └── src │ └── simple.cpp └── third_party └── conan.cmake
- We performed clean-up: removing unnecessary old files, using
PYTHONPATH
instead ofcontext.py
, etc.
1.4. Develop CMake frontend #24
- CMake File API: CMake provides a file-based API that clients may use to get semantic information about the buildsystems CMake generates. Clients may use the API by writing query files to a specific location in a build tree to request zero or more Object Kinds. When CMake generates the buildsystem in that build tree it will read the query files and write reply files for the client to read.
-
API v1 directory structure:
.cmake └── api └── v1 ├── query │ └── codemodel-v2 └── reply
-
Step-by-step (via https://silizium.io/post/cmake_build_information/):
- Navigate to your cmake project top-level directory.
- Create a build directory
build/
. - Create the API directory
build/.cmake/api/v1/query
. - Create the file
build/.cmake/api/v1/query/codemodel-v2
and leave it empty. - Re-run cmake (the configuration step).
- CMake should have automatically created the directory
build/.cmake/api/v1/reply
containing the replies to our query.
-
We introduce
cmake_frontend.py
, in which we get the CMake build information from thereply/
directory via JSON manipulation.- We parse the Object Kind codemodel via
reply/codemodel-v2-<unspecified>.json
to get a list of build system targets. Such targets are created by calls toadd_executable()
,add_library()
, andadd_custom_target()
, excluding imported targets and interface libraries (which do not generate any build rules). - We then access the target's corresponding
reply/target-<name>-<unspecified>.json
to get various information like sources, dependencies, etc.
- We parse the Object Kind codemodel via
-
- We move the class
CompilationDatabase
tocmake_frontend.py
and removecompilation_database.py
. - We introduce Sphinx-compatible documentation (RTD format) to the project.
1.5. Changes in parse.py #27
- Change identifier type of the AST:
- For a tree node, we were using
clang_bind.parse.Node
(data holder class) objects as identifiers (treelib.Tree.identifier
). - This created a problem while attempting to create subtrees from the tree:
- Subtree creation (and other functions) uses copy functions (from
copy
module) internally, which lead to:ValueError: ctypes objects containing pointers cannot be pickled
- This is caused because the data holder class had clang pointers (cursors,
clang.cindex.Cursor
) which usesctypes
internally. Thus, their copies couldn't be created and subsequently, any tree operations couldn't be performed.
- Subtree creation (and other functions) uses copy functions (from
- This can be solved by a simple logic: instead of using the data holder objects (and thus
ctypes
) as identifiers, we use UUIDs as identifiers and map them to the data holder objects.- This won't create any difference in the tree representation as the tree printing operations rely on the
tag
, not the identifier and we are keeping the tag same (repr
of the data holder). - The UUID creation process is handled internally by
treelib
, so we just let it create UUIDs and we map them to the data holder in ourclang_bind.parse.Parse
class.
- This won't create any difference in the tree representation as the tree printing operations rely on the
- We also refactor
clang_bind.parse.Node
(data holder) class toclang_bind.parse.ParsedInfo
as it created confusion while usingtreelib.Node
objects in our code - Corresponding changes in
parse.py
:- The new tree uses uuids (internal to
treelib
) as node identifiers instead of the Node (now ParsedInfo) objects. - Refactor
Node
class toParsedInfo
. - Introduced
_parsed_info_map
. - Add corresponding new helper functions.
- The new tree uses uuids (internal to
- Corresponding changes in
test_parse.py
:- The new tree has UUIDs as identifiers (internal), as opposed to the
Node
(nowParsedInfo
) objects.
- The new tree has UUIDs as identifiers (internal), as opposed to the
- For a tree node, we were using
- Improve structure of tests in
test_parse.py
:- Use a class, remove the hard-coded filename, remove the dependency of each function on
tmp_path
: usetempfile
module instead, file writing method changes.
- Use a class, remove the hard-coded filename, remove the dependency of each function on
2.1. Binding flow #28
- Introduce
bind.py
- Using the CMake frontend, generate a binding database of all the files for a list of targets:
self.binding_db[target] = { "source_dir": source_dir, # source dir containing C++ targets "output_dir": output_dir, # output dir "inclusion_sources": inclusion_sources, # inclusions for the target "files": [ # list of files' information { "source": cpp_source, "compiler_arguments": CompilationDatabase(build_dir).get_compilation_arguments(cpp_source), } for cpp_source in cpp_sources ], }
- We can then use this database for functions such as parsing each file (for each target):
for file in target.get("files"): parsed_tree = Parse( Path(source_dir, file.get("source")), inclusion_sources, file.get("compiler_arguments"), ).get_tree() file.update({"parsed_tree": parsed_tree}) # update db
- Using the CMake frontend, generate a binding database of all the files for a list of targets:
- Changes in
parse.py
_is_valid_child
function checks if the child is valid:- Old condition:
- The child should be in the same file as the parent.
- New condition:
- Either child should be in the same file as the parent or,
- the child should be in the list of valid inclusion sources.
- Old condition:
- LOGIC
- Earlier model:
- We were not sure as to how different source files will be dealt with i.e. how they will interact with each other:
- Whether we will parse and bind each one separately, which doesn't make sense (ex. to bind definition-only
.h
files) or, - Handle inclusions gracefully:
- The ideal structure of a C++ project for binding is the standard one:
- Implementation in
.cpp
source files. - Definitions in
.h
sources files. .h
included in.cpp
.
- Implementation in
pcl
isn't an ideal project as it contains implementation in.hpp
files (ex. see point_types.hpp)- Thus, we cannot implement the model of creating bindings just for
.cpp
files and including the file'sINCLUSION_DIRECTIVES
directly in generated pybind code (See Open3D's bindings for reference)
- The ideal structure of a C++ project for binding is the standard one:
- Have a look at pcl's parsed output using the old model.
- New model:
- The solution that we found (that should work) is we let the cursor traverse to the target's (or even whole project's, customizable)
INCLUSION_DIRECTIVES
and get the code from there. - This allows us to only parse the
.cpp
source files by getting the validINCLUSION_DIRECTIVES
(because we don't want to include std or 3rd party inclusions' code in our parsed tree) through CMake frontend. - This significantly reduced the number of files we needed to handle during pybind code generation (only
.cpp
source files) while also enabling all the required context to be managed in a single file. - Have a look at pcl's parsed output using the new model.
- The solution that we found (that should work) is we let the cursor traverse to the target's (or even whole project's, customizable)
- Earlier model:
2.2. Use docker #29
- Introduce an ubuntu-based Dockerfile and use it in the
pytest.yml
Github Actions.
- The generation logic is tricky. While we can take away some for-loop complexity by using templates, it still is a difficult thing to generate pybind code generically.
- The issues are context generation and entity state management:
- Given a file, we want to generate correct contexts for each entity i.e. structs, functions, etc. from the entity's cursor's children.
- Ex. for a
STRUCT_DECL
: inheritance, namespace scope, members, functions etc. and in turn their children (recursively), all should be used to create a state for theSTRUCT_DECL
. This should then be plugged in the context of the tree rootTRANSLATION_UNIT
and updated to generate the pybind code for the file.
- Apart from that, we also need to maintain a database of namespace resolved C++ entities because we are now including all the parsed information in the
.cpp
source files. This leads to multiple instances of entities in a single file, and all of them need to be carefully examined and the database and the entity state updated accordingly. - From a practical perspective, have a look at the parsed tree for
point_types.cpp
: point_types.txt. - This couldn't have been possible in the time frame: we need to develop a comprehensive plan to handle the generation logic.
- To be followed after we sort out the generation logic. Due to my lack of expertise in
pcl
, this requires a detailed discussion regarding the test cases.