A blockchain digging framework for constructing vulnerability-tagged smart contract datasets.
The DIVE framework provides a pipeline for blockchain dataset creation, built from six core components:
Fetch smart contract and account data from public blockchains.
- Currently supports Ethereum.
- Uses Etherscan.io as a data source.
- Collects:
  - Contract metadata
  - Account-level information
  - Opcodes
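As a rough illustration of the collection step, the sketch below builds a request URL for Etherscan's `getsourcecode` contract action and unpacks a response of that shape. The helper names `build_source_request` and `parse_source_response` are hypothetical and not taken from DIVE's scripts, and the base URL is an assumption (Etherscan's V1-style endpoint) that may differ from what DIVE is configured to use.

```python
from urllib.parse import urlencode

# Assumption: V1-style Etherscan endpoint; adjust to the API version in use.
ETHERSCAN_API = "https://api.etherscan.io/api"


def build_source_request(address: str, api_key: str) -> str:
    """Return the URL that asks Etherscan for a contract's verified source."""
    params = {
        "module": "contract",
        "action": "getsourcecode",
        "address": address,
        "apikey": api_key,
    }
    return f"{ETHERSCAN_API}?{urlencode(params)}"


def parse_source_response(payload: dict) -> dict:
    """Pick a few contract metadata fields out of a getsourcecode response."""
    if payload.get("status") != "1" or not payload.get("result"):
        raise ValueError(f"Etherscan error: {payload.get('message')}")
    entry = payload["result"][0]
    return {
        "name": entry.get("ContractName", ""),
        "compiler": entry.get("CompilerVersion", ""),
        "abi": entry.get("ABI", ""),
        "source": entry.get("SourceCode", ""),
    }
```

The URL can be fetched with any HTTP client; keeping URL construction and response parsing separate makes both easy to check without network access.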
Retrieve verified contract source code and store it as Solidity (.sol) files.
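A minimal sketch of the storage step, assuming the source text has already been retrieved. The function name `save_source` and the `<address>.sol` naming scheme are illustrative assumptions, not DIVE's actual file layout.

```python
from pathlib import Path


def save_source(address: str, source: str, out_dir: Path) -> Path:
    """Write one contract's Solidity source to <out_dir>/<address>.sol."""
    if not source.strip():
        # Unverified contracts come back with an empty source string.
        raise ValueError(f"no verified source for {address}")
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{address.lower()}.sol"
    target.write_text(source, encoding="utf-8")
    return target
```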
Extract structured features from smart contract attributes, including ABI, Timestamp, Library, TransactionIndex, Code Metrics, Input/Bytecode, and Opcode.
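To give a feel for attribute-based feature extraction, the sketch below derives a few counts from a contract ABI in the standard Solidity JSON format. The feature names (`n_functions`, `n_payable`, and so on) are hypothetical examples and are not claimed to match the features DIVE actually produces.

```python
import json


def abi_features(abi_json: str) -> dict:
    """Count basic structural properties of a contract ABI (JSON string)."""
    entries = json.loads(abi_json)
    functions = [e for e in entries if e.get("type") == "function"]
    return {
        "n_functions": len(functions),
        "n_events": sum(e.get("type") == "event" for e in entries),
        "n_payable": sum(
            f.get("stateMutability") == "payable" for f in functions
        ),
        "n_view_or_pure": sum(
            f.get("stateMutability") in ("view", "pure") for f in functions
        ),
        "has_fallback": any(
            e.get("type") in ("fallback", "receive") for e in entries
        ),
    }
```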
Merge extracted features with ground-truth vulnerability labels to build a structured dataset.
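The merge can be pictured as an inner join on the contract address. The sketch below uses plain dictionaries to stay dependency-free; a DataFrame merge would be a common alternative. The column names (`address`, `label`) and the function name are assumptions made for illustration.

```python
def merge_features_with_labels(features, labels, key="address"):
    """Inner-join feature rows with label rows on a shared key column."""
    # Addresses are compared case-insensitively, since hex casing can vary.
    label_by_key = {row[key].lower(): row for row in labels}
    merged = []
    for row in features:
        label = label_by_key.get(row[key].lower())
        if label is None:
            continue  # drop contracts that have no ground-truth label
        extra = {k: v for k, v in label.items() if k != key}
        merged.append({**row, **extra})
    return merged
```

Contracts without a label are dropped here; whether to keep them as unlabeled rows instead is a design choice that depends on the downstream task.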
Clean, normalize, and transform the data to prepare it for downstream analysis or machine learning tasks.
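One simple form of cleaning and normalization is to drop rows with missing values and rescale each numeric column to the range [0, 1]. This is a generic min-max sketch under that assumption, not a description of the specific transformations DIVE applies.

```python
def min_max_normalize(rows, columns):
    """Drop rows with missing values in `columns`, then scale each to [0, 1]."""
    clean = [r for r in rows if all(r.get(c) is not None for c in columns)]
    scaled = [dict(r) for r in clean]  # copy so the input rows stay untouched
    for col in columns:
        values = [float(r[col]) for r in clean]
        if not values:
            continue
        lo, hi = min(values), max(values)
        span = hi - lo
        for row, value in zip(scaled, values):
            # A constant column carries no information; map it to 0.0.
            row[col] = (value - lo) / span if span else 0.0
    return scaled
```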
Generate statistical summaries and visualizations to better understand the dataset's structure and characteristics.
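A statistical summary of this kind might include per-feature descriptive statistics and the class balance of the labels. The sketch below assumes rows are dictionaries with a hypothetical `label` column; plotting is left out to keep it dependency-free.

```python
from collections import Counter
from statistics import mean, median


def summarize(rows, feature, label="label"):
    """Return descriptive statistics for one feature plus label counts."""
    values = [float(r[feature]) for r in rows]
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "label_counts": dict(Counter(r[label] for r in rows)),
    }
```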
- Python = 3.12.2
- solidity-code-metrics = 0.0.26
  Install using one of the following:
      # Using Yarn
      yarn global add solidity-code-metrics@0.0.26
      # Or using npm
      npm install -g solidity-code-metrics@0.0.26
- Etherscan API Key
  Create an account at Etherscan.io and follow their API key guide.
  ⚠️ Do not share your API key publicly.
- Python dependencies are listed in requirements.txt. You can install them using:
      pip install -r requirements.txt
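Because the Etherscan API key should stay private, one option is to read it from an environment variable first and fall back to the configuration file only when the variable is unset. The variable name `ETHERSCAN_API_KEY`, the `api_key` field, and the helper `load_api_key` are all assumptions for this sketch; the actual key names in DIVE's config.json may differ.

```python
import json
import os
from pathlib import Path


def load_api_key(config_path="config.json",
                 env_var="ETHERSCAN_API_KEY",
                 field="api_key"):
    """Return the API key from the environment, else from a JSON config."""
    key = os.environ.get(env_var)
    if key:
        return key
    path = Path(config_path)
    if path.is_file():
        # `field` is a hypothetical key name inside the config file.
        key = json.loads(path.read_text(encoding="utf-8")).get(field)
    if not key:
        raise RuntimeError(
            f"No API key found in ${env_var} or {config_path}"
        )
    return key
```

Preferring the environment variable means a config file with a placeholder key can be committed without exposing the real one.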
DIVE/
├── Datasets/                                       # Generated datasets
│   ├── InitialCombinedData/                        # Merged raw features before preprocessing
│   └── PreprocessedData/                           # Cleaned, transformed datasets for ML
│
├── Docs/
│   ├── initial-setup.md                            # Step-by-step guide for project installation and configuration
│   └── usage.md                                    # Detailed documentation for using framework functions and scripts
│
├── Features/                                       # Extracted features
│   ├── API-based/                                  # Features collected from Etherscan APIs
│   │   ├── AccountInfo/                            # Account-level features
│   │   ├── BlockInfo/                              # Block transaction counts
│   │   ├── ContractsInfo/                          # Contract metadata from Etherscan
│   │   └── Opcodes/                                # Opcode data from Etherscan
│   └── FE-based/                                   # Feature engineering outputs
│       ├── ABI-based/                              # Features extracted from ABI
│       ├── CodeMetrics/                            # Code metric data
│       │   ├── CodeMetrics/                        # Parsed metric values
│       │   ├── Reports/                            # Raw/edited Markdown metric reports
│       │   │   ├── EditedReports/
│       │   │   └── OriginalReports/
│       │   └── Raw_CodeMetrics/
│       ├── Input-based/                            # Features derived from the Input attribute
│       ├── Library-based/                          # Features derived from the Library attribute
│       ├── Opcode-based/                           # Features derived from opcode-level analysis
│       ├── Timestamp-based/                        # Features derived from the Timestamp attribute
│       └── TransactionIndex/                       # Features derived from the TransactionIndex attribute
│
├── Labels/                                         # Ground-truth labels for contracts
│
├── RawData/                                        # Data collected or downloaded
│   ├── Samples/                                    # Extracted Solidity source code samples
│   ├── SamplesSummary/
│   └── SC_Addresses/                               # CSVs of smart contract addresses
│
├── Scripts/                                        # Main processing and utility scripts
│   │
│   ├── FeatureExtraction/                          # Scripts for extracting low-level features
│   │   ├── EVM_Opcodes/                            # Contains opcode-related resources
│   │   │   └── EVM_Opcodes_*.xlsx                  # Excel file(s) listing EVM opcodes and metadata
│   │   ├── ABI_FeatureExtraction.py                # Extracts features from ABI (Application Binary Interface)
│   │   ├── Bytecode_FeatureExtraction.py           # Extracts bytecode-level features
│   │   ├── get_CodeMetrics.py                      # Calls external tools (i.e., solidity-code-metrics) to compute code metrics
│   │   ├── get_OpcodesList.py                      # Generates the EVM opcode reference list (EVM_Opcodes_*.xlsx)
│   │   ├── Library_FeatureExtraction.py            # Extracts library-based features
│   │   ├── Opcode_FeatureExtraction.py             # Extracts features from opcodes (e.g., opcode metrics)
│   │   ├── Timestamp_FeatureExtraction.py          # Extracts timestamp-based features
│   │   └── transactionIndex_FeatureExtraction.py   # Extracts transactionIndex-based features
│   │
│   ├── FeatureSelection/                           # Script for selecting relevant features for analysis/modeling
│   │   └── get_FilteredFeatures.py                 # Applies feature selection (uses classification defined in Feature list.xlsx)
│   │
│   ├── apply_DataPreprocessing.py                  # Cleans, normalizes, and transforms data
│   ├── apply_FeatureExtraction.py                  # Coordinates the execution of multiple feature extraction steps
│   ├── construct_FinalData.py                      # Merges feature sets and labels to construct the final dataset
│   ├── extract_SourceCodes.py                      # Extracts Solidity source code (included in Etherscan API responses)
│   ├── get_Addresses.py                            # Loads and filters smart contract addresses from input CSV files
│   ├── get_BlockFeatures.py                        # Retrieves transaction counts for each block
│   ├── get_ContractFeatures.py                     # Orchestrates retrieval of contract info from Etherscan
│   └── get_DataStatistics.py                       # Generates summary statistics and visualizations for the dataset
│
├── Statistics/                                     # Analysis outputs and statistical summaries
│
├── config.json                                     # Configuration file for paths and API key
├── DIVE_pipeline.yaml                              # YAML config defining the full data creation pipeline execution
├── DIVE.ipynb                                      # Interactive notebook for demonstrating the framework
├── Feature list.xlsx                               # Documentation of features and their descriptions
├── LICENSE.md                                      # License: CC BY-NC 4.0
├── README.md                                       # Project overview and usage instructions
├── requirements.txt                                # Python package dependencies
└── run_DIVE_Pipeline.py                            # Entrypoint to run the entire pipeline as a script
- See full instructions in Docs/initial-setup.md.
- Each function is explained in detail in Docs/usage.md.
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
🚫 Patent Rights Reserved
- This project may be covered by pending or granted patents. The authors reserve all rights under applicable patent laws.
- The use of this software does not grant any rights to use patented inventions.
- For commercial licensing or patent-related inquiries, please contact the authors directly.
🛡️ Disclaimer
- DIVE is provided as a research tool and is under active development. While we strive for reliability, we do not provide warranties or guarantees. Please use it responsibly and at your own discretion.