Intelligent Chunk Size Optimization for BeeGFS using Machine Learning Models
BeeChunker is an intelligent system for optimizing chunk sizes in BeeGFS (formerly the Fraunhofer parallel file system, FhGFS) storage systems. It analyzes file access patterns and predicts optimal chunk sizes using machine learning models, including Self-Organizing Maps (SOM), Random Forest (RF), and XGBoost.
BeeGFS is a parallel file system that distributes file data across multiple storage servers using chunks. The "chunk size" determines how a file is divided and distributed, which significantly impacts I/O performance. However, determining the optimal chunk size for a file is challenging as it depends on many factors:
- File size
- Access patterns (read vs. write ratio)
- I/O operation sizes
- Workload characteristics
- File type and extension
BeeChunker solves this problem by:
- Continuously monitoring file access patterns
- Training machine learning models on the collected data
- Predicting optimal chunk sizes based on file characteristics
- Automatically applying these optimizations to existing and new files
- Python 3.8 or higher
- BeeGFS installation (critical requirement)
- Access to BeeGFS command-line tools (beegfs-ctl)
- Root or sudo access (for some operations)
- Sufficient disk space for the monitoring database and log files
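A quick way to verify these prerequisites before installing is a small check script. A minimal sketch using only the standard library (the beegfs-ctl tool name comes from this document; the check logic itself is illustrative):

```python
import shutil
import sys

def check_prerequisites():
    """Report which BeeChunker prerequisites are not satisfied on this host."""
    issues = []
    if sys.version_info < (3, 8):
        issues.append("Python 3.8 or higher is required")
    if shutil.which("beegfs-ctl") is None:
        issues.append("beegfs-ctl not found in PATH (is BeeGFS installed?)")
    return issues

for issue in check_prerequisites():
    print("WARNING:", issue)
```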
The BeeChunker codebase is organized into several key components:
beechunker/
├── cli/ # Command-line interfaces
│ ├── monitor_cli.py # Monitor service CLI for tracking file access
│ ├── optimizer_cli.py # Optimizer service CLI for applying chunk optimizations
│ └── trainer_cli.py # Trainer service CLI for ML model training
├── common/ # Common utilities
│ ├── beechunker_logging.py # Logging setup
│ ├── config.py # Configuration management
│ ├── default_config.json # Default configuration
├── custom_types/events/
│ └── file_access_event.py # File access event class
├── ml/ # Machine learning components
│ ├── feature_engineering.py # Feature engineering
│ ├── feature_extraction.py # Extract features from raw data
│ ├── random_forest.py # Random Forest model
│ ├── som.py # Self-Organizing Map implementation ** NOT USED IN THE FINAL IMPLEMENTATION **
│ ├── visualization.py # Visualization tools
│ └── xgboost_model.py # XGBoost model implementation
├── monitor/ # Monitoring components
│ ├── access_tracker.py # File access tracking
│ └── db_manager.py # Database management
└── optimizer/ # Optimization components
├── chunk_manager.py # Chunk size management
└── file_watcher.py # New file detection
BeeChunker consists of three main services that work together:
The monitor service (monitor_cli.py) continuously watches BeeGFS mount points to track file access operations:
- Uses the watchdog library to detect file operations (read/write)
- Captures file metadata, including the current chunk size, using beegfs-ctl --getentryinfo
- Records access patterns, read/write operations, and performance metrics in a SQLite database
- Handles cleanup of old monitoring data to prevent database bloat
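The current chunk size the monitor records comes from parsing beegfs-ctl --getentryinfo output. A minimal sketch of that parsing step (the assumed output line format, e.g. "+ Chunksize: 512K", should be verified against your BeeGFS version):

```python
import re
from typing import Optional

def parse_chunk_size_kb(entry_info: str) -> Optional[int]:
    """Extract the chunk size in KiB from `beegfs-ctl --getentryinfo` output.

    Assumes a line like '+ Chunksize: 512K' or '+ Chunksize: 1M'.
    """
    match = re.search(r"Chunksize:\s*(\d+)\s*([KM])", entry_info)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return value * 1024 if unit == "M" else value

sample = "Entry type: file\n+ Chunksize: 512K\n+ Number of storage targets: 4"
print(parse_chunk_size_kb(sample))
```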
The trainer service (trainer_cli.py) analyzes the collected data to train machine learning models:
- Supports multiple ML models:
- Random Forest (RF): Ensemble learning method using decision trees for accurate chunk size prediction
- XGBoost: Gradient boosting implementation for high-performance predictions
- Self-Organizing Map (SOM): Unsupervised neural network creating a topological mapping of file access patterns (deprecated; proof of concept only)
- Processes raw access data through extensive feature engineering
- Creates visualizations of model insights (U-Matrix, Component Planes, Cluster Analysis)
- Saves trained models to disk for use by the optimizer
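The ot_quantile setting in the configuration hints at how training labels can be derived: accesses whose observed throughput clears a quantile threshold are treated as evidence of a good chunk size. A pure-Python sketch of that labeling idea (the function name and exact rule are illustrative assumptions, not the trainer's actual code):

```python
from statistics import quantiles

def label_optimal_samples(samples, ot_quantile=0.65):
    """Return the chunk sizes of samples whose throughput is at or above
    the given quantile; these can serve as positive training labels.

    `samples` is a list of dicts with 'chunk_size_kb' and 'throughput_mbps'.
    """
    values = [s["throughput_mbps"] for s in samples]
    # quantiles(n=100) yields the 1st..99th percentile cut points
    threshold = quantiles(values, n=100)[int(ot_quantile * 100) - 1]
    return [s["chunk_size_kb"] for s in samples if s["throughput_mbps"] >= threshold]
```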
The optimizer service (optimizer_cli.py) applies ML predictions to optimize file chunk sizes:
- Extracts features from files matching what the models were trained on
- Uses the trained models to predict optimal chunk sizes
- Creates new files with the optimal chunk size and swaps them in place
- Tracks optimization history and performance improvements
- Can be run continuously (service mode) or on-demand (CLI mode)
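Whatever a model predicts still has to respect the configured min_chunk_size/max_chunk_size bounds, and BeeGFS chunk sizes are conventionally powers of two (treat that as an assumption here). A sketch of such a post-prediction step:

```python
def snap_chunk_size(predicted_kb: int, min_kb: int = 64, max_kb: int = 4096) -> int:
    """Clamp a predicted chunk size to [min_kb, max_kb] and snap it to the
    nearest power of two, using the bounds from the configuration.

    Assumes min_kb is itself a power of two.
    """
    clamped = max(min_kb, min(max_kb, predicted_kb))
    # enumerate the powers of two within bounds and pick the closest one
    candidates = []
    size = min_kb
    while size <= max_kb:
        candidates.append(size)
        size *= 2
    return min(candidates, key=lambda c: abs(c - clamped))

print(snap_chunk_size(700))  # nearest power of two to 700 within [64, 4096]
```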
- Data Collection: The monitor service records file access operations in the SQLite database
- Feature Engineering: Raw data is processed to extract relevant features:
- File size and current chunk size
- Read/write operation counts and ratios
- Average and maximum read/write sizes
- File extension characteristics
- Access patterns and throughput metrics
- Model Training: The trainer service processes this data to train models that correlate file characteristics with optimal chunk sizes
- Prediction & Optimization: The optimizer applies these models to predict and set optimal chunk sizes for files
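The feature list above can be sketched as a small transformation from raw monitor counters to a model-ready feature dict (field names are illustrative, not necessarily those used in feature_engineering.py):

```python
def build_features(file_size, read_count, write_count,
                   total_read_bytes, total_write_bytes, current_chunk_kb):
    """Turn raw monitor counters into derived features such as the
    read/write ratio and average operation sizes."""
    total_ops = read_count + write_count
    return {
        "file_size": file_size,
        "current_chunk_kb": current_chunk_kb,
        "read_count": read_count,
        "write_count": write_count,
        "read_ratio": read_count / total_ops if total_ops else 0.0,
        "avg_read_size": total_read_bytes / read_count if read_count else 0.0,
        "avg_write_size": total_write_bytes / write_count if write_count else 0.0,
    }
```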
- Extract the contents of the submitted zip file:
unzip beechunker.zip
cd BeeChunker
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package and dependencies:
pip install -r requirements.txt
pip install -e .
- Create necessary directories:
sudo mkdir -p /opt/beechunker/data/logs
sudo mkdir -p /opt/beechunker/data/models
sudo chown -R $USER:$USER /opt/beechunker
- Configure BeeGFS mount points in a custom configuration file (optional; the default configuration is applied if you don't define one manually):
mkdir -p /opt/beechunker/data
cat > /opt/beechunker/data/config.json << EOL
{
"monitor": {
"db_path": "/opt/beechunker/data/access_patterns.db",
"log_path": "/opt/beechunker/data/logs/monitor.log",
"polling_interval": 300
},
"optimizer": {
"log_path": "/opt/beechunker/data/logs/optimizer.log",
"min_chunk_size": 64,
"max_chunk_size": 4096
},
"ml": {
"models_dir": "/opt/beechunker/data/models",
"log_path": "/opt/beechunker/data/logs/trainer.log",
"training_interval": 86400,
"min_training_samples": 100,
"som_iterations": 5000,
"n_estimators": 100,
"hgb_iter": 1000,
"ot_quantile": 0.65
},
"beegfs": {
"mount_points": [
"/mnt/beegfs"
]
}
}
EOL

BeeChunker provides three main services that can be run independently or as systemd services.
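Since the custom config file is optional, whatever reads it must fall back to defaults when the file is absent. A minimal sketch of that load-and-merge behaviour (key names come from the config above; the merge logic itself is an assumption, not BeeChunker's actual implementation):

```python
import json
import os

DEFAULTS = {
    "optimizer": {"min_chunk_size": 64, "max_chunk_size": 4096},
    "beegfs": {"mount_points": ["/mnt/beegfs"]},
}

def load_config(path="/opt/beechunker/data/config.json"):
    """Load the user config if present, overlaying it on the defaults."""
    config = {section: dict(values) for section, values in DEFAULTS.items()}
    if os.path.exists(path):
        with open(path) as f:
            user = json.load(f)
        for section, values in user.items():
            config.setdefault(section, {}).update(values)
    return config
```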
The project includes example service files in the services/ directory. You'll need to adapt these to your environment.
- Copy and modify the example service file:
sudo cp services/beechunker-monitor.service.example /etc/systemd/system/beechunker-monitor.service
sudo nano /etc/systemd/system/beechunker-monitor.service
- Update the service file with your specific paths and username:
[Unit]
Description=BeeChunker Monitor Service
After=network.target
[Service]
Type=simple
User=YOUR_USERNAME # Replace with your username
WorkingDirectory=/path/to/BeeChunker # Replace with your path
Environment="BEECHUNKER_CONFIG=/opt/beechunker/data/config.json"
ExecStart=/path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/monitor_cli.py run
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable beechunker-monitor.service
sudo systemctl start beechunker-monitor.service
- Check the service status:
sudo systemctl status beechunker-monitor.service
- Copy and modify the example service file:
sudo cp services/beechunker-optimizer.service.example /etc/systemd/system/beechunker-optimizer.service
sudo nano /etc/systemd/system/beechunker-optimizer.service
- Update the service file with your specific paths and username:
[Unit]
Description=BeeChunker Optimizer Service
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=YOUR_USERNAME # Replace with your username
WorkingDirectory=/path/to/BeeChunker # Replace with your path
Environment="BEECHUNKER_CONFIG=/opt/beechunker/data/config.json"
ExecStart=/path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/optimizer_cli.py run
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable beechunker-optimizer.service
sudo systemctl start beechunker-optimizer.service
- Check the service status:
sudo systemctl status beechunker-optimizer.service
- Copy and modify the example service file:
sudo cp services/beechunker-trainer.service.example /etc/systemd/system/beechunker-trainer.service
sudo nano /etc/systemd/system/beechunker-trainer.service
- Update the service file with your specific paths and username:
[Unit]
Description=BeeChunker Trainer Service
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=YOUR_USERNAME # Replace with your username
WorkingDirectory=/path/to/BeeChunker # Replace with your path
Environment="BEECHUNKER_CONFIG=/opt/beechunker/data/config.json"
ExecStart=/path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/trainer_cli.py train
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable beechunker-trainer.service
sudo systemctl start beechunker-trainer.service
- Check the service status (it might show as not running or stopped, which is normal):
sudo systemctl status beechunker-trainer.service

To run BeeChunker commands easily from anywhere on your system, you can create symbolic links to the main CLI scripts. This allows you to use commands like beechunker-monitor instead of the full path.
# Create symbolic links in a directory that's in your PATH
sudo ln -s $(pwd)/beechunker/cli/monitor_cli.py /usr/local/bin/beechunker-monitor
sudo ln -s $(pwd)/beechunker/cli/optimizer_cli.py /usr/local/bin/beechunker-optimizer
sudo ln -s $(pwd)/beechunker/cli/trainer_cli.py /usr/local/bin/beechunker-trainer
# Make them executable
sudo chmod +x /usr/local/bin/beechunker-monitor
sudo chmod +x /usr/local/bin/beechunker-optimizer
sudo chmod +x /usr/local/bin/beechunker-trainer

After creating these symbolic links, you can run commands like:
beechunker-monitor run
beechunker-optimizer optimize-file /path/to/file

BeeChunker comes with several pretrained models that can be used immediately, without having to collect data and train models from scratch. These models are located in the models/ directory of the repository and include:
- rf_model.joblib: The main Random Forest model (recommended for production)
- xgboost_model.json: The experimental XGBoost model
- candidate_chunks.joblib: List of candidate chunk sizes considered by the models
- feature_names.joblib: Feature names used by the models
- Several supporting model files
To use these pretrained models:
- Create the models directory in your BeeChunker configuration path:
sudo mkdir -p /opt/beechunker/data/models
sudo chown -R $USER:$USER /opt/beechunker/data/models
- Copy the pretrained models to this directory:
# Assuming you're in the BeeChunker root directory
cp -r models/* /opt/beechunker/data/models/
- Verify the models are in place:
ls -la /opt/beechunker/data/models/

You should see all the model files copied to this location.
Once the models are in place, you can use them with the optimizer without running the trainer:
# Use the Random Forest model (default and recommended)
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file --model-type rf
# Or try the XGBoost model
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file --model-type xgb

The optimizer will automatically load the corresponding pretrained model based on the --model-type parameter.
The pretrained models were developed based on extensive testing across various file sizes and access patterns:
- Random Forest (rf_model.joblib):
  - Best for general-purpose use
  - Performs well on files from 64 KB to 2 GB
  - More consistent predictions
  - Recommended for most use cases
- XGBoost (xgboost_model.json):
  - May provide better optimization for very large files (>1 GB)
  - More aggressive in its optimization recommendations
  - Can be used for specialized workloads
You can test the performance of these pretrained models using the provided utility scripts:
# Run the demo with the Random Forest model
python demo.py --model rf
# Run the demo with the XGBoost model
python demo.py --model xgb
# Compare both models
python model_comparison.py

These tests will create sample files, optimize them using the pretrained models, and report performance improvements.
If you wish to customize these models or train new ones based on your specific workload:
- Run the monitor service for some time to collect real-world access patterns:
python beechunker/cli/monitor_cli.py run
- Export the collected data:
python beechunker/cli/monitor_cli.py export-data --output ~/beechunker_training_data.csv
- Train a custom model:
python beechunker/cli/trainer_cli.py train --input-csv ~/beechunker_training_data.csv
Note that due to the current limitations in the trainer service, you may need to manually prepare the data for training.
Although the trainer service currently has limitations with model compatibility, you can still set up a cron job to run the trainer periodically. This may be useful for future versions when these issues are resolved.
# Edit crontab file
crontab -e

Add one of the following lines to the crontab file:
# Run the trainer daily at 2:00 AM
0 2 * * * /path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/trainer_cli.py train >> /opt/beechunker/data/logs/cron_trainer.log 2>&1
# Or, if you created symbolic links:
0 2 * * * /usr/local/bin/beechunker-trainer train >> /opt/beechunker/data/logs/cron_trainer.log 2>&1
Save the file to set up the cron job. Remember that due to the current limitations in the trainer service, this cron job might not work correctly until the preprocessing compatibility issues are resolved.
If you prefer to run the services manually or for testing purposes, you can run them directly from the command line:
# Start the monitoring service
python beechunker/cli/monitor_cli.py run
# Check monitoring statistics
python beechunker/cli/monitor_cli.py stats
# Clean up old data (keep last 30 days)
python beechunker/cli/monitor_cli.py cleanup --days 30

# Train models using data from the database
python beechunker/cli/trainer_cli.py train
# Train using a specific CSV file
python beechunker/cli/trainer_cli.py train --input-csv /path/to/data.csv
# Make a prediction for a specific file
python beechunker/cli/trainer_cli.py predict --file-size 1073741824 --read-count 100 --write-count 20

# Run the optimizer service continuously
python beechunker/cli/optimizer_cli.py run
# Optimize a single file
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file
# Choose a specific model type (rf, som, or xgb)
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file --model-type xgb
# Optimize all files in a directory
python beechunker/cli/optimizer_cli.py optimize-dir /path/to/directory --recursive
# Analyze without changing anything (dry run)
python beechunker/cli/optimizer_cli.py optimize-dir /path/to/directory --dry-run
# Analyze file to show predicted chunk size without changing
python beechunker/cli/optimizer_cli.py analyze /path/to/file
# Bulk optimize based on database query
python beechunker/cli/optimizer_cli.py bulk-optimize --min-access 10

BeeChunker includes two utility scripts to demonstrate and evaluate the system:
Important Notes on Model Availability:
- SOM Model (DEPRECATED): The Self-Organizing Map model was implemented as a proof of concept and is not intended for production use. It should be considered deprecated and unusable for actual optimization.
- Trainer Service Limitations: The trainer service currently has compatibility issues between models due to differing preprocessing requirements. These have not been fully resolved, so automatic model training may not work as expected; manual model comparisons and testing are recommended instead.
- Recommended Model: The Random Forest (RF) model is currently the most stable and is recommended for production use. XGBoost is available for experimental comparisons.
The demo.py script demonstrates the system by creating test files, simulating access patterns, and showing performance improvements from chunk size optimization:
# Run the demo using the Random Forest model (default)
python demo.py --model rf
# Run the demo using the XGBoost model
python demo.py --model xgb

The model_comparison.py script runs a comprehensive comparison between the Random Forest and XGBoost models:
# Run the model comparison with 3 trials per scenario (default)
python model_comparison.py
# Run with more trials for more robust results
python model_comparison.py --trials 5

This will generate detailed comparison plots in the comparison_plots/ directory and print a summary of model performance.
# Check service status
sudo systemctl status beechunker-monitor
sudo systemctl status beechunker-optimizer
sudo systemctl status beechunker-trainer
# View service logs
journalctl -u beechunker-monitor.service
journalctl -u beechunker-optimizer.service
journalctl -u beechunker-trainer.service
# View application logs
tail -f /opt/beechunker/data/logs/monitor.log
tail -f /opt/beechunker/data/logs/optimizer.log
tail -f /opt/beechunker/data/logs/trainer.log

- Missing BeeGFS Tools:
- Error: "Command 'beegfs-ctl' not found"
- Solution: Ensure BeeGFS is installed and tools are in PATH
- Permission Issues:
- Error: "Permission denied"
- Solution: Run with sudo or adjust file permissions
- Database Issues:
- Error: "Database not found" or "no such table"
- Solution: Ensure monitor service has run at least once to create database schema
- Model Training Failures:
- Error: "Not enough samples for training"
- Solution: Gather more access data (at least 100 samples by default)
- Service Won't Start:
  - Check logs: journalctl -u beechunker-monitor.service -n 50
  - Verify paths in the service file
The BeeChunker configuration file supports many customization options:
{
"monitor": {
"db_path": "/opt/beechunker/data/access_patterns.db",
"log_path": "/opt/beechunker/data/logs/monitor.log",
"polling_interval": 300 // Check interval in seconds
},
"optimizer": {
"log_path": "/opt/beechunker/data/logs/optimizer.log",
"min_chunk_size": 64, // Minimum chunk size in KB
"max_chunk_size": 4096 // Maximum chunk size in KB
},
"ml": {
"models_dir": "/opt/beechunker/data/models",
"log_path": "/opt/beechunker/data/logs/trainer.log",
"training_interval": 86400, // Training interval in seconds (daily)
"min_training_samples": 100, // Minimum samples required for training
"som_iterations": 5000, // SOM training iterations
"n_estimators": 100, // RF tree count
"hgb_iter": 1000, // XGBoost iterations
"ot_quantile": 0.65 // Optimal Throughput quantile threshold
},
"beegfs": {
"mount_points": ["/mnt/beegfs"] // BeeGFS mount points to monitor
}
}

- Database Size: The monitor database can grow large over time. Use the cleanup function regularly.
- CPU Usage: Model training can be CPU intensive. Consider running training during off-peak hours.
- Storage Overhead: Changing chunk sizes creates temporary files. Ensure sufficient free space.
- Optimization Frequency: Frequent chunk size changes can cause overhead. Use appropriate thresholds.
The monitor service tracks file operations using the watchdog library and BeeGFS command-line tools:
- Uses file system event handlers to detect read/write operations
- Records access events in a SQLite database
- Tracks file metadata including size, chunk size, and access patterns
- Maintains separate tables for file metadata, access events, and throughput metrics
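The separate tables mentioned above can be sketched as a minimal SQLite schema (table and column names here are illustrative; db_manager.py defines the real schema):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metadata (
    path TEXT PRIMARY KEY,
    file_size INTEGER,
    chunk_size_kb INTEGER
);
CREATE TABLE IF NOT EXISTS access_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT REFERENCES file_metadata(path),
    op TEXT CHECK (op IN ('read', 'write')),
    bytes INTEGER,
    ts REAL
);
"""

def init_db(db_path=":memory:"):
    """Create the monitoring tables if they do not already exist."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn
```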
BeeChunker implements different ML models for chunk size prediction:
- Random Forest (RF):
- Ensemble of decision trees for robust classification
- Stacks multiple RF models for higher accuracy
- Includes feature importance analysis
- Most stable and reliable model for production use
- XGBoost:
- Gradient boosting implementation for high accuracy
- Fast prediction with low memory footprint
- Handles complex feature interactions well
- Currently in experimental stage
- Self-Organizing Map (SOM) (DEPRECATED):
- Was implemented as a proof of concept only
- Not intended for production use
- Included in the codebase for academic purposes only
- Should NOT be used for actual chunk size optimization
The optimizer uses a sophisticated approach to change chunk sizes:
- Extracts features from the file to predict optimal chunk size
- Uses BeeGFS tools to create a new file with the optimal chunk size
- Copies data from the original file to the new file
- Performs an atomic swap to replace the original file
- Records the optimization in the database for tracking
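Steps 2–4 above boil down to a copy-then-atomic-replace pattern. A simplified standard-library sketch (the real implementation also creates the new file with the predicted chunk size via BeeGFS tools, which is omitted here):

```python
import os
import shutil

def swap_in_optimized_copy(original, optimized_tmp):
    """Replace `original` with `optimized_tmp` atomically.

    `optimized_tmp` is assumed to be a fully written copy (created with the
    new chunk size) on the same filesystem, so os.replace is atomic.
    """
    shutil.copystat(original, optimized_tmp)  # carry over permissions/times
    os.replace(optimized_tmp, original)       # atomic rename on POSIX
```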
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License - see the LICENSE file for details.
- BeeGFS team for their excellent parallel filesystem
- MiniSom library for the Self-Organizing Map implementation
- The scikit-learn and XGBoost teams for their machine learning libraries
This project was developed collaboratively by three team members, each contributing to specific components:
- Complete CLI structure and services integration (beechunker/cli/)
- System fundamentals and configuration (beechunker/common/)
- Custom data types and events (beechunker/custom_types/)
- Monitoring system for file access patterns (beechunker/monitor/)
- Chunk size optimization implementation (beechunker/optimizer/)
- System packaging and deployment (setup.py)
- Performance demonstration tool (demo.py)
- Model comparison framework (model_comparison.py)
- Service deployment examples (services/)
- Feature engineering pipeline (beechunker/ml/feature_engineering.py)
- Self-Organizing Maps prototype (beechunker/ml/som.py, models/som_model.joblib)
- Visualization utilities (beechunker/ml/visualization.py)
- Setting up the base BeeGFS system by installing and configuring the client kernel module
- Script for setting up BeeGFS system (Aryan/Build.sh)
- BeeGFS testing script (Aryan/combined_test.sh)
- BeeGFS I/O test script (Aryan/fio_tests.sh)
- Data generation script for ML model training (Aryan/beegfs_test.sh)
- Data pre-processing (beechunker/ml/feature_extraction.py)
- Data Analysis
- Random Forest implementation (beechunker/ml/random_forest.py)
- Feature extraction framework (beechunker/ml/feature_extraction.py)
- Test dataset creation (data/)
- Production model training and optimization:
- Base Random Forest model (models/rf_base.joblib)
- Ensemble Random Forest model (models/rf_model.joblib)
- Fine-tuning Random Forest using HGBoost (models/hgb_base.joblib)
- Candidate chunk list mapping chunk-size groups to file size (models/candidate_chunks.joblib)
- Feature names to keep important features consistent throughout (models/feature_names.joblib)
- XGBoost implementation (beechunker/ml/xgboost_model.py)
- Custom feature engineering for XGBoost (beechunker/ml/xgboost_feature_engine.py)
- Production model training and optimization:
- Final XGBoost model (V2) (models/xgboost_model.joblib)
- Base XGBoost model (V1) (branch: TVT_dev)
- Data Visualizations
