This project provides a tool for analyzing C/C++ code by extracting, serializing, and counting subtrees from its abstract syntax tree (AST). The tool uses the Clang library to parse the C/C++ code and generate the AST.
- Python 3.8+
- Clang 12+
- LLVM library
- Install Clang and LLVM:
On macOS, you can install both using Homebrew:
brew install llvm
- Set up the Python environment:
python3 -m venv .venv
source .venv/bin/activate
pip install clang
- Set up Clang library path:
Here is an example for macOS:
library_file = '/opt/homebrew/opt/llvm/lib/libclang.dylib'
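A minimal sketch of how this path could be applied, assuming the script configures the bindings through `clang.cindex.Config.set_library_file`. The helper name `first_existing` and the second candidate path are illustrative assumptions and are not taken from parse.py; adjust the list for your system.

```python
import os

# Candidate libclang locations. Only the first comes from this README;
# the second is a common Intel Homebrew default and may not match your setup.
CANDIDATES = [
    '/opt/homebrew/opt/llvm/lib/libclang.dylib',
    '/usr/local/opt/llvm/lib/libclang.dylib',
]


def first_existing(paths):
    """Return the first path that exists on disk, or None (hypothetical helper)."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


library_file = first_existing(CANDIDATES)

try:
    from clang.cindex import Config
except ImportError:
    Config = None  # Python bindings not installed; nothing to configure

if Config is not None and library_file is not None and not Config.loaded:
    # The library file has to be set before libclang is first used.
    Config.set_library_file(library_file)
```

If none of the candidates exist, `library_file` stays `None` and the bindings fall back to their default library lookup.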
- Assign your target:
Assign target_file to the path of the C/C++ file you want to parse. If parsing C++, remember to modify parse_to_ast as guided by the comments.
- Run the script:
python parse.py
- Output:
The script will generate a subtrees.csv file containing the following columns:
- Hash: SHA-256 hash of the serialized subtree.
- Count: Number of occurrences of this subtree.
- Human Readable Expression: A human-readable representation of the subtree.
- Serialized Subtree: The serialized subtree.
- Deserialized Tree: The deserialized tree structure in pretty-printed format.
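As a rough illustration of how the output could be consumed, the sketch below reads rows in the column layout listed above and returns the most frequent subtrees. It assumes the CSV header strings match the column names exactly; the sample rows, the placeholder hashes, and the function name `top_subtrees` are made up for this example.

```python
import csv
import io

# Made-up sample in the documented column layout; real rows come from subtrees.csv.
# The hash values here are placeholders, not real SHA-256 digests.
SAMPLE = (
    "Hash,Count,Human Readable Expression,Serialized Subtree,Deserialized Tree\n"
    "aaa111,3,a + b,BINARY_OPERATOR,BINARY_OPERATOR\n"
    "bbb222,7,return a,RETURN_STMT,RETURN_STMT\n"
)


def top_subtrees(csv_file, n=10):
    """Return the n most frequent rows as (count, expression) pairs."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row['Count']), reverse=True)
    return [(int(row['Count']), row['Human Readable Expression'])
            for row in rows[:n]]


# For a real run, pass a file object instead:
#     with open('subtrees.csv', newline='') as f:
#         print(top_subtrees(f))
print(top_subtrees(io.StringIO(SAMPLE), n=1))  # [(7, 'return a')]
```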
parse.py
This script contains the following main functions:
- serialize_node(node, anon_map=None): Serializes an AST node, anonymizing variable names. Note: each subtree is re-anonymized, so the name mapping starts fresh for every subtree.
- extract_subtrees(node, subtrees=None): Extracts all subtrees starting from a given node.
- hash_subtree(subtree): Returns the SHA-256 hash of a serialized subtree.
- count_subtrees(subtrees): Counts the occurrences of each subtree.
- deserialize_subtree(serialized_subtree): Deserializes a serialized subtree back into its tree structure.
- print_tree(node, depth=0): Pretty prints a deserialized tree.
- parse_to_ast(file_path): Parses a C/C++ file into an AST using Clang.
- tree_to_expression(node): Converts a deserialized subtree into a human-readable expression.
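The serialize, extract, hash, and count steps listed above can be sketched on a toy tree. This is not the code from parse.py: the `Node` class stands in for a Clang cursor, and the serialization format and `VAR_<n>` naming are assumptions chosen for illustration. Only the function names and signatures follow the list above.

```python
import hashlib
from collections import Counter


class Node:
    """Toy AST node (hypothetical stand-in for a Clang cursor)."""

    def __init__(self, kind, name=None, children=()):
        self.kind = kind
        self.name = name
        self.children = list(children)


def serialize_node(node, anon_map=None):
    """Serialize a node, replacing each distinct name with VAR_<n>."""
    if anon_map is None:
        anon_map = {}  # fresh map per top-level call: each subtree is re-anonymized
    label = node.kind
    if node.name is not None:
        alias = anon_map.setdefault(node.name, 'VAR_%d' % len(anon_map))
        label += ':' + alias
    parts = [serialize_node(child, anon_map) for child in node.children]
    return '%s(%s)' % (label, ','.join(parts))


def extract_subtrees(node, subtrees=None):
    """Collect the serialized form of every subtree rooted at or below node."""
    if subtrees is None:
        subtrees = []
    subtrees.append(serialize_node(node))
    for child in node.children:
        extract_subtrees(child, subtrees)
    return subtrees


def hash_subtree(subtree):
    """Return the SHA-256 hex digest of a serialized subtree."""
    return hashlib.sha256(subtree.encode('utf-8')).hexdigest()


def count_subtrees(subtrees):
    """Count occurrences of each subtree, keyed by hash."""
    return Counter(hash_subtree(s) for s in subtrees)


def make_add(lhs, rhs):
    """Build a toy tree for `lhs + rhs`."""
    return Node('BINARY_OPERATOR', children=[
        Node('DECL_REF_EXPR', lhs),
        Node('DECL_REF_EXPR', rhs),
    ])


root = Node('COMPOUND_STMT', children=[make_add('a', 'b'), make_add('x', 'y')])
subtrees = extract_subtrees(root)
counts = count_subtrees(subtrees)
```

Because names are anonymized per subtree, `a + b` and `x + y` serialize identically and are counted as two occurrences of the same subtree.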
add.c
This is an example C file for parsing. The subtrees.csv file contains the output of parse.py run with add.c as the target file.