DIVE4Data/DIVE

DIVE Framework

A blockchain digging framework for constructing vulnerability-tagged smart contract datasets.


🔍 Key Features

The DIVE framework builds blockchain datasets through a pipeline of six core components:


1. 🧾 Feature Collection

Fetch smart contract and account data from public blockchains.

  • ✅ Currently supports Ethereum.
  • 🔗 Uses Etherscan.io as a data source.
  • 📊 Collects:
    • Contract metadata
    • Account-level information
    • Opcodes
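
As a rough illustration of the collection step, the sketch below builds a request URL for Etherscan's contract `getsourcecode` action and reads the first record out of a response. The helper names (`build_request_url`, `parse_contract_info`) and the sample payload are hypothetical, and this README does not state which Etherscan API version DIVE targets, so the base URL is an assumption.

```python
import json
from urllib.parse import urlencode

# Assumed base URL (Etherscan V1-style endpoint); DIVE's actual endpoint is not stated here.
ETHERSCAN_API = "https://api.etherscan.io/api"


def build_request_url(address: str, api_key: str) -> str:
    """Build the URL for the contract `getsourcecode` action (hypothetical helper)."""
    params = {
        "module": "contract",
        "action": "getsourcecode",
        "address": address,
        "apikey": api_key,
    }
    return f"{ETHERSCAN_API}?{urlencode(params)}"


def parse_contract_info(raw_json: str) -> dict:
    """Return the first result record, or an empty dict if the call failed."""
    payload = json.loads(raw_json)
    # On failure Etherscan-style responses carry status "0" and a string result.
    if payload.get("status") != "1" or not payload.get("result"):
        return {}
    return payload["result"][0]


# Hypothetical response, trimmed to a few fields for illustration.
sample = json.dumps({
    "status": "1",
    "message": "OK",
    "result": [{"ContractName": "Token", "CompilerVersion": "v0.8.19", "ABI": "[]"}],
})

url = build_request_url("0x0000000000000000000000000000000000000000", "YOUR_API_KEY")
info = parse_contract_info(sample)
print(url)
print(info["ContractName"])
```

The network call itself is left out on purpose, so the sketch runs offline; in practice the URL would be fetched with an HTTP client and the body passed to `parse_contract_info`.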

2. 🧠 Solidity Code Extraction

Retrieve verified Solidity source code for each contract and store it as .sol files.
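
A minimal sketch of this step, assuming the source arrives in a `SourceCode` field of an Etherscan-style record and that files are named after the contract address. Both the `save_source` helper and the naming scheme are hypothetical; the repository's `extract_SourceCodes.py` may work differently, and multi-file sources are not handled here.

```python
import tempfile
from pathlib import Path
from typing import Optional


def save_source(record: dict, address: str, out_dir: Path) -> Optional[Path]:
    """Write the record's source code to <out_dir>/<address>.sol (hypothetical layout).

    Returns the written path, or None when the record has no source,
    which is the case for unverified contracts.
    """
    source = record.get("SourceCode", "")
    if not source.strip():
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{address}.sol"
    path.write_text(source, encoding="utf-8")
    return path


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    record = {"SourceCode": "pragma solidity ^0.8.0;\ncontract Token {}\n"}
    saved = save_source(record, "0xabc", Path(tmp))
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
    skipped = save_source({"SourceCode": ""}, "0xdef", Path(tmp))

print(saved_name)
```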

3. 🧪 Feature Extraction

Extract structured features from smart contract attributes, including ABI, Timestamp, Library, TransactionIndex, Code Metrics, Input/Bytecode, and Opcode.
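
To make the idea concrete, here is a small sketch of ABI-based extraction that counts entry types in a contract's ABI JSON. The feature names (`n_functions`, `n_payable`, and so on) are hypothetical and chosen for illustration only; the features DIVE actually extracts are the ones documented in Feature list.xlsx.

```python
import json


def abi_features(abi_json: str) -> dict:
    """Derive a few count-style features from an ABI JSON string (illustrative only)."""
    entries = json.loads(abi_json)
    functions = [e for e in entries if e.get("type") == "function"]
    return {
        "n_functions": len(functions),
        "n_events": sum(1 for e in entries if e.get("type") == "event"),
        "n_payable": sum(1 for f in functions if f.get("stateMutability") == "payable"),
        "n_view_or_pure": sum(
            1 for f in functions if f.get("stateMutability") in ("view", "pure")
        ),
        "has_constructor": any(e.get("type") == "constructor" for e in entries),
    }


# Hypothetical ABI with one constructor, two functions, and one event.
sample_abi = json.dumps([
    {"type": "constructor", "inputs": [], "stateMutability": "nonpayable"},
    {"type": "function", "name": "deposit", "inputs": [], "outputs": [],
     "stateMutability": "payable"},
    {"type": "function", "name": "balanceOf",
     "inputs": [{"name": "owner", "type": "address"}],
     "outputs": [{"name": "", "type": "uint256"}],
     "stateMutability": "view"},
    {"type": "event", "name": "Transfer", "inputs": [], "anonymous": False},
])

features = abi_features(sample_abi)
print(features)
```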

4. 🏷️ Labeled Data Construction

Merge extracted features with ground-truth vulnerability labels to build a structured dataset.
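
The merge can be pictured as a join on the contract address. The sketch below uses plain dictionaries and assumes that rows are keyed by an `address` column, that addresses are compared case-insensitively, and that contracts without a label are dropped. All three are assumptions, as are the column names; `construct_FinalData.py` may handle these cases differently.

```python
def merge_features_with_labels(features, labels, key="address"):
    """Inner-join feature rows with label rows on `key`, ignoring address case."""
    label_by_key = {row[key].lower(): row for row in labels}
    merged = []
    for row in features:
        label_row = label_by_key.get(row[key].lower())
        if label_row is None:
            continue  # no ground-truth label: leave the contract out (assumed policy)
        combined = dict(row)
        # Copy every label column except the join key itself.
        combined.update({k: v for k, v in label_row.items() if k != key})
        merged.append(combined)
    return merged


# Hypothetical feature and label rows.
features = [
    {"address": "0xAAA", "n_functions": 12},
    {"address": "0xBBB", "n_functions": 3},
]
labels = [{"address": "0xaaa", "reentrancy": 1}]

dataset = merge_features_with_labels(features, labels)
print(dataset)
```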

5. 🧹 Data Preprocessing

Clean, normalize, and transform the data to prepare it for downstream analysis or machine learning tasks.
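
This README does not list the individual preprocessing steps, so the following is only a sketch of two common ones: dropping rows with missing numeric values and min-max scaling each numeric column to [0, 1]. The `preprocess` function and the column names are hypothetical.

```python
def preprocess(rows, numeric_columns):
    """Drop rows with missing numeric values, then min-max scale each column to [0, 1]."""
    complete = [r for r in rows if all(r.get(c) is not None for c in numeric_columns)]
    scaled = [dict(r) for r in complete]  # copy so the input rows stay untouched
    for col in numeric_columns:
        values = [r[col] for r in complete]
        if not values:
            continue
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in scaled:
            # A constant column has zero span; map it to 0.0 instead of dividing by zero.
            r[col] = (r[col] - lo) / span if span else 0.0
    return scaled


# Hypothetical rows; the third one has a missing value and is dropped.
rows = [
    {"address": "0xaaa", "n_functions": 2, "n_events": 1},
    {"address": "0xbbb", "n_functions": 10, "n_events": 1},
    {"address": "0xccc", "n_functions": None, "n_events": 4},
    {"address": "0xddd", "n_functions": 6, "n_events": 1},
]

clean = preprocess(rows, ["n_functions", "n_events"])
for row in clean:
    print(row)
```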

6. πŸ“Š Statistical Analysis & Visualization

Generate statistical summaries and visualizations to better understand the dataset's structure and characteristics.
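
As an example of the kind of summary this step might produce, the sketch below reports the label distribution and basic descriptive statistics for one numeric column using only the standard library. The `summarize` function and the column names are hypothetical, and plotting is left out.

```python
from collections import Counter
from statistics import mean, median


def summarize(rows, label_column, numeric_column):
    """Return label counts and basic statistics for one numeric column."""
    values = [r[numeric_column] for r in rows]
    return {
        "n_rows": len(rows),
        "label_counts": dict(Counter(r[label_column] for r in rows)),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical dataset: label 1 marks a vulnerable contract.
rows = [
    {"label": 1, "n_functions": 4},
    {"label": 0, "n_functions": 10},
    {"label": 0, "n_functions": 7},
]

summary = summarize(rows, "label", "n_functions")
print(summary)
```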


📦 Requirements


📁 Repository Structure

DIVE/
├── Datasets/                    # Generated datasets
│   ├── InitialCombinedData/     # Merged raw features before preprocessing
│   └── PreprocessedData/        # Cleaned, transformed datasets for ML
│
├── Docs/
│   ├── initial-setup.md        # Step-by-step guide for project installation and configuration
│   └── usage.md                # Detailed documentation for using framework functions and scripts
│
├── Features/                    # Extracted features
│   ├── API-based/               # Features collected from Etherscan APIs
│   │   ├── AccountInfo/         # Account-level features
│   │   ├── BlockInfo/           # Block transaction counts
│   │   ├── ContractsInfo/       # Contract metadata from Etherscan
│   │   └── Opcodes/             # Opcode data from Etherscan
│   ├── FE-based/                # Feature engineering outputs
│   │   ├── ABI-based/           # Features extracted from ABI
│   │   ├── CodeMetrics/         # Code metric data
│   │   │   ├── CodeMetrics/     # Parsed metric values
│   │   │   └── Reports/         # Raw/edited Markdown metric reports
│   │   │       ├── EditedReports/
│   │   │       ├── OriginalReports/
│   │   │       └── Raw_CodeMetrics/
│   │   ├── Input-based/         # Features derived from the Input attribute
│   │   ├── Library-based/       # Features derived from the Library attribute
│   │   ├── Opcode-based/        # Features derived from opcode-level analysis
│   │   ├── Timestamp-based/     # Features derived from the Timestamp attribute
│   │   └── TransactionIndex/    # Features derived from the TransactionIndex attribute
│
├── Labels/                      # Ground-truth labels for contracts
│
├── RawData/                     # Data collected or downloaded
│   ├── Samples/                 # Extracted Solidity source code samples
│   ├── SamplesSummary/          #
│   └── SC_Addresses/            # CSVs of smart contract addresses
│
├── Scripts/                             # Main processing and utility scripts
│
│   ├── FeatureExtraction/               # Scripts for extracting low-level features
│   │   ├── EVM_Opcodes/                 # Contains opcode-related resources
│   │   │   ├── EVM_Opcodes_*.xlsx       # Excel file(s) listing EVM opcodes and metadata
│   │   ├── ABI_FeatureExtraction.py     # Extracts features from ABI (Application Binary Interface)
│   │   ├── Bytecode_FeatureExtraction.py # Extracts bytecode-level features
│   │   ├── get_CodeMetrics.py           # Calls external tools (i.e., solidity-code-metrics) to compute code metrics
│   │   ├── get_OpcodesList.py           # Generates the EVM opcode reference list (EVM_Opcodes_*.xlsx)
│   │   ├── Library_FeatureExtraction.py # Extracts library-based features
│   │   ├── Opcode_FeatureExtraction.py  # Extracts features from opcodes (e.g., opcode metrics)
│   │   ├── Timestamp_FeatureExtraction.py # Extracts timestamp-based features
│   │   └── transactionIndex_FeatureExtraction.py # Extracts transactionIndex-based features
│
│   ├── FeatureSelection/                # Script for selecting relevant features for analysis/modeling
│   │   └── get_FilteredFeatures.py      # Applies feature selection (uses classification defined in Feature list.xlsx)
│
│   ├── apply_DataPreprocessing.py       # Cleans, normalizes, and transforms data
│   ├── apply_FeatureExtraction.py       # Coordinates the execution of multiple feature extraction steps
│   ├── construct_FinalData.py           # Merges feature sets and labels to construct the final dataset
│   ├── extract_SourceCodes.py           # Extracts Solidity source code (included in Etherscan API responses)
│   ├── get_Addresses.py                 # Loads and filters smart contract addresses from input CSV files
│   ├── get_BlockFeatures.py             # Retrieves transaction counts for each block
│   ├── get_ContractFeatures.py          # Orchestrates retrieval of contract info from Etherscan
│   └── get_DataStatistics.py            # Generates summary statistics and visualizations for the dataset
│
├── Statistics/                  # Analysis outputs and statistical summaries
│
├── config.json                  # Configuration file for paths and API key
├── DIVE_pipeline.yaml           # YAML config defining the full data creation pipeline execution
├── DIVE.ipynb                   # Interactive notebook for demonstrating the framework
├── Feature list.xlsx            # Documentation of features and their descriptions
├── LICENSE.md                   # License: CC BY-NC 4.0
├── README.md                    # Project overview and usage instructions
├── requirements.txt             # Python package dependencies
└── run_DIVE_Pipeline.py         # Entrypoint to run the entire pipeline as a script
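
The tree lists config.json as holding paths and the API key, but its schema is not shown in this README. The sketch below therefore uses hypothetical key names (`api_key`, `paths`) purely to illustrate loading such a file and failing early when a required entry is missing; check the real config.json for the actual keys.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical key names; the real config.json schema is not documented here.
REQUIRED_KEYS = ("api_key", "paths")


def load_config(path):
    """Load a JSON config file and verify that the assumed required keys exist."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [k for k in REQUIRED_KEYS if k not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return config


# Demonstration with a throwaway file instead of the repository's config.json.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(
        json.dumps({"api_key": "YOUR_API_KEY", "paths": {"raw_data": "RawData/"}}),
        encoding="utf-8",
    )
    config = load_config(config_path)

print(config["paths"]["raw_data"])
```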

🧭 Getting Started

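🔧 Initial Setup

See Docs/initial-setup.md for the step-by-step installation and configuration guide.

🛠️ Using Framework Functions

See Docs/usage.md for detailed documentation on the framework's functions and scripts.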


📦 License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

🚫 Patent Rights Reserved

  • This project may be covered by pending or granted patents. The authors reserve all rights under applicable patent laws.
  • The use of this software does not grant any rights to use patented inventions.
  • For commercial licensing or patent-related inquiries, please contact the authors directly.

🛡️ Disclaimer

  • DIVE is provided as a research tool and is under active development. While we strive for reliability, we do not provide warranties or guarantees. Please use it responsibly and at your own discretion.
