Intelligent Chunk Size Optimization for BeeGFS using Machine Learning Models
BeeChunker is an intelligent system for optimizing chunk sizes in BeeGFS (formerly the Fraunhofer parallel file system, FhGFS) storage systems. It analyzes file access patterns and predicts optimal chunk sizes using machine learning models, including Self-Organizing Maps (SOM), Random Forest (RF), and XGBoost.
BeeGFS is a parallel file system that distributes file data across multiple storage servers using chunks. The "chunk size" determines how a file is divided and distributed, which significantly impacts I/O performance. However, determining the optimal chunk size for a file is challenging as it depends on many factors:
- File size
- Access patterns (read vs. write ratio)
- I/O operation sizes
- Workload characteristics
- File type and extension
BeeChunker solves this problem by:
- Continuously monitoring file access patterns
- Training machine learning models on the collected data
- Predicting optimal chunk sizes based on file characteristics
- Automatically applying these optimizations to existing and new files
- Python 3.8 or higher
- BeeGFS installation (critical requirement)
- Access to BeeGFS command-line tools (beegfs-ctl)
- Root or sudo access (for some operations)
- Sufficient disk space for the monitoring database and log files
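A quick way to verify these prerequisites before installing is a small check script. A minimal sketch using only the standard library (the beegfs-ctl tool name comes from this document; the check logic itself is illustrative):

```python
import shutil
import sys

def check_prerequisites():
    """Report which BeeChunker prerequisites are not satisfied on this host."""
    issues = []
    if sys.version_info < (3, 8):
        issues.append("Python 3.8 or higher is required")
    if shutil.which("beegfs-ctl") is None:
        issues.append("beegfs-ctl not found in PATH (is BeeGFS installed?)")
    return issues

for issue in check_prerequisites():
    print("WARNING:", issue)
```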
The BeeChunker codebase is organized into several key components:
beechunker/
├── cli/ # Command-line interfaces
│ ├── monitor_cli.py # Monitor service CLI for tracking file access
│ ├── optimizer_cli.py # Optimizer service CLI for applying chunk optimizations
│ └── trainer_cli.py # Trainer service CLI for ML model training
├── common/ # Common utilities
│ ├── beechunker_logging.py # Logging setup
│ ├── config.py # Configuration management
│ ├── default_config.json # Default configuration
├── custom_types/events/
│ └── file_access_event.py # File access event class
├── ml/ # Machine learning components
│ ├── feature_engineering.py # Feature engineering
│ ├── feature_extraction.py # Extract features from raw data
│ ├── random_forest.py # Random Forest model
│ ├── som.py # Self-Organizing Map implementation ** NOT USED IN THE FINAL IMPLEMENTATION **
│ ├── visualization.py # Visualization tools
│ └── xgboost_model.py # XGBoost model implementation
├── monitor/ # Monitoring components
│ ├── access_tracker.py # File access tracking
│ └── db_manager.py # Database management
└── optimizer/ # Optimization components
├── chunk_manager.py # Chunk size management
└── file_watcher.py # New file detection
BeeChunker consists of three main services that work together:
The monitor service (monitor_cli.py) continuously watches BeeGFS mount points to track file access operations:
- Uses the watchdog library to detect file operations (read/write)
- Captures file metadata, including the current chunk size, using beegfs-ctl --getentryinfo
- Records access patterns, read/write operations, and performance metrics in a SQLite database
- Handles cleanup of old monitoring data to prevent database bloat
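The current chunk size the monitor records comes from parsing beegfs-ctl --getentryinfo output. A minimal sketch of that parsing step (the assumed output line format, e.g. "+ Chunksize: 512K", should be verified against your BeeGFS version):

```python
import re
from typing import Optional

def parse_chunk_size_kb(entry_info: str) -> Optional[int]:
    """Extract the chunk size in KiB from `beegfs-ctl --getentryinfo` output.

    Assumes a line like '+ Chunksize: 512K' or '+ Chunksize: 1M'.
    """
    match = re.search(r"Chunksize:\s*(\d+)\s*([KM])", entry_info)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return value * 1024 if unit == "M" else value

sample = "Entry type: file\n+ Chunksize: 512K\n+ Number of storage targets: 4"
print(parse_chunk_size_kb(sample))
```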
The trainer service (trainer_cli.py) analyzes the collected data to train machine learning models:
- Supports multiple ML models:
- Random Forest (RF): Ensemble learning method using decision trees for accurate chunk size prediction
- XGBoost: Gradient boosting implementation for high-performance predictions
- Self-Organizing Map (SOM): Unsupervised neural network creating a topological mapping of file access patterns (deprecated; proof of concept only)
- Processes raw access data through extensive feature engineering
- Creates visualizations of model insights (U-Matrix, Component Planes, Cluster Analysis)
- Saves trained models to disk for use by the optimizer
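The ot_quantile setting in the configuration hints at how training labels can be derived: accesses whose observed throughput clears a quantile threshold are treated as evidence of a good chunk size. A pure-Python sketch of that labeling idea (the function name and exact rule are illustrative assumptions, not the trainer's actual code):

```python
from statistics import quantiles

def label_optimal_samples(samples, ot_quantile=0.65):
    """Return the chunk sizes of samples whose throughput is at or above
    the given quantile; these can serve as positive training labels.

    `samples` is a list of dicts with 'chunk_size_kb' and 'throughput_mbps'.
    """
    values = [s["throughput_mbps"] for s in samples]
    # quantiles(n=100) yields the 1st..99th percentile cut points
    threshold = quantiles(values, n=100)[int(ot_quantile * 100) - 1]
    return [s["chunk_size_kb"] for s in samples if s["throughput_mbps"] >= threshold]
```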
The optimizer service (optimizer_cli.py) applies ML predictions to optimize file chunk sizes:
- Extracts features from files matching what the models were trained on
- Uses the trained models to predict optimal chunk sizes
- Creates new files with the optimal chunk size and swaps them in place
- Tracks optimization history and performance improvements
- Can be run continuously (service mode) or on-demand (CLI mode)
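Whatever a model predicts still has to respect the configured min_chunk_size/max_chunk_size bounds, and BeeGFS chunk sizes are conventionally powers of two (treat that as an assumption here). A sketch of such a post-prediction step:

```python
def snap_chunk_size(predicted_kb: int, min_kb: int = 64, max_kb: int = 4096) -> int:
    """Clamp a predicted chunk size to [min_kb, max_kb] and snap it to the
    nearest power of two, using the bounds from the configuration.

    Assumes min_kb is itself a power of two.
    """
    clamped = max(min_kb, min(max_kb, predicted_kb))
    # enumerate the powers of two within bounds and pick the closest one
    candidates = []
    size = min_kb
    while size <= max_kb:
        candidates.append(size)
        size *= 2
    return min(candidates, key=lambda c: abs(c - clamped))

print(snap_chunk_size(700))  # nearest power of two to 700 within [64, 4096]
```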
- Data Collection: The monitor service records file access operations in the SQLite database
- Feature Engineering: Raw data is processed to extract relevant features:
- File size and current chunk size
- Read/write operation counts and ratios
- Average and maximum read/write sizes
- File extension characteristics
- Access patterns and throughput metrics
- Model Training: The trainer service processes this data to train models that correlate file characteristics with optimal chunk sizes
- Prediction & Optimization: The optimizer applies these models to predict and set optimal chunk sizes for files
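The feature list above can be sketched as a small transformation from raw monitor counters to a model-ready feature dict (field names are illustrative, not necessarily those used in feature_engineering.py):

```python
def build_features(file_size, read_count, write_count,
                   total_read_bytes, total_write_bytes, current_chunk_kb):
    """Turn raw monitor counters into derived features such as the
    read/write ratio and average operation sizes."""
    total_ops = read_count + write_count
    return {
        "file_size": file_size,
        "current_chunk_kb": current_chunk_kb,
        "read_count": read_count,
        "write_count": write_count,
        "read_ratio": read_count / total_ops if total_ops else 0.0,
        "avg_read_size": total_read_bytes / read_count if read_count else 0.0,
        "avg_write_size": total_write_bytes / write_count if write_count else 0.0,
    }
```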
- Extract the contents of the submitted zip file:
unzip beechunker.zip
cd BeeChunker
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package and dependencies:
pip install -r requirements.txt
pip install -e .
- Create necessary directories:
sudo mkdir -p /opt/beechunker/data/logs
sudo mkdir -p /opt/beechunker/data/models
sudo chown -R $USER:$USER /opt/beechunker
- Configure BeeGFS mount points in a custom configuration file (optional; the default configuration is applied if you don't define one manually):
mkdir -p /opt/beechunker/data
cat > /opt/beechunker/data/config.json << EOL
{
"monitor": {
"db_path": "/opt/beechunker/data/access_patterns.db",
"log_path": "/opt/beechunker/data/logs/monitor.log",
"polling_interval": 300
},
"optimizer": {
"log_path": "/opt/beechunker/data/logs/optimizer.log",
"min_chunk_size": 64,
"max_chunk_size": 4096
},
"ml": {
"models_dir": "/opt/beechunker/data/models",
"log_path": "/opt/beechunker/data/logs/trainer.log",
"training_interval": 86400,
"min_training_samples": 100,
"som_iterations": 5000,
"n_estimators": 100,
"hgb_iter": 1000,
"ot_quantile": 0.65
},
"beegfs": {
"mount_points": [
"/mnt/beegfs"
]
}
}
EOL

BeeChunker provides three main services that can be run independently or as systemd services.
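Since the custom config file is optional, whatever reads it must fall back to defaults when the file is absent. A minimal sketch of that load-and-merge behaviour (key names come from the config above; the merge logic itself is an assumption, not BeeChunker's actual implementation):

```python
import json
import os

DEFAULTS = {
    "optimizer": {"min_chunk_size": 64, "max_chunk_size": 4096},
    "beegfs": {"mount_points": ["/mnt/beegfs"]},
}

def load_config(path="/opt/beechunker/data/config.json"):
    """Load the user config if present, overlaying it on the defaults."""
    config = {section: dict(values) for section, values in DEFAULTS.items()}
    if os.path.exists(path):
        with open(path) as f:
            user = json.load(f)
        for section, values in user.items():
            config.setdefault(section, {}).update(values)
    return config
```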
The project includes example service files in the services/ directory. You'll need to adapt these to your environment.
- Copy and modify the example service file:
sudo cp services/beechunker-monitor.service.example /etc/systemd/system/beechunker-monitor.service
sudo nano /etc/systemd/system/beechunker-monitor.service
- Update the service file with your specific paths and username:
[Unit]
Description=BeeChunker Monitor Service
After=network.target
[Service]
Type=simple
User=YOUR_USERNAME # Replace with your username
WorkingDirectory=/path/to/BeeChunker # Replace with your path
Environment="BEECHUNKER_CONFIG=/opt/beechunker/data/config.json"
ExecStart=/path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/monitor_cli.py run
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable beechunker-monitor.service
sudo systemctl start beechunker-monitor.service
- Check the service status:
sudo systemctl status beechunker-monitor.service
- Copy and modify the example service file:
sudo cp services/beechunker-optimizer.service.example /etc/systemd/system/beechunker-optimizer.service
sudo nano /etc/systemd/system/beechunker-optimizer.service
- Update the service file with your specific paths and username:
[Unit]
Description=BeeChunker Optimizer Service
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=YOUR_USERNAME # Replace with your username
WorkingDirectory=/path/to/BeeChunker # Replace with your path
Environment="BEECHUNKER_CONFIG=/opt/beechunker/data/config.json"
ExecStart=/path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/optimizer_cli.py run
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable beechunker-optimizer.service
sudo systemctl start beechunker-optimizer.service
- Check the service status:
sudo systemctl status beechunker-optimizer.service
- Copy and modify the example service file:
sudo cp services/beechunker-trainer.service.example /etc/systemd/system/beechunker-trainer.service
sudo nano /etc/systemd/system/beechunker-trainer.service
- Update the service file with your specific paths and username:
[Unit]
Description=BeeChunker Trainer Service
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=YOUR_USERNAME # Replace with your username
WorkingDirectory=/path/to/BeeChunker # Replace with your path
Environment="BEECHUNKER_CONFIG=/opt/beechunker/data/config.json"
ExecStart=/path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/trainer_cli.py train
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
- Enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable beechunker-trainer.service
sudo systemctl start beechunker-trainer.service
- Check the service status (it might show as not running or stopped, which is normal):
sudo systemctl status beechunker-trainer.service

To run BeeChunker commands easily from anywhere on your system, you can create symbolic links to the main CLI scripts. This allows you to use commands like beechunker-monitor instead of the full path.
# Create symbolic links in a directory that's in your PATH
sudo ln -s $(pwd)/beechunker/cli/monitor_cli.py /usr/local/bin/beechunker-monitor
sudo ln -s $(pwd)/beechunker/cli/optimizer_cli.py /usr/local/bin/beechunker-optimizer
sudo ln -s $(pwd)/beechunker/cli/trainer_cli.py /usr/local/bin/beechunker-trainer
# Make them executable
sudo chmod +x /usr/local/bin/beechunker-monitor
sudo chmod +x /usr/local/bin/beechunker-optimizer
sudo chmod +x /usr/local/bin/beechunker-trainer

After creating these symbolic links, you can run commands like:
beechunker-monitor run
beechunker-optimizer optimize-file /path/to/file

BeeChunker comes with several pretrained models that can be used immediately, without having to collect data and train models from scratch. These models are located in the models/ directory of the repository and include:
- rf_model.joblib: The main Random Forest model (recommended for production)
- xgboost_model.json: The experimental XGBoost model
- candidate_chunks.joblib: List of candidate chunk sizes considered by the models
- feature_names.joblib: Feature names used by the models
- Several supporting model files
To use these pretrained models:
- Create the models directory in your BeeChunker configuration path:
sudo mkdir -p /opt/beechunker/data/models
sudo chown -R $USER:$USER /opt/beechunker/data/models
- Copy the pretrained models to this directory:
# Assuming you're in the BeeChunker root directory
cp -r models/* /opt/beechunker/data/models/
- Verify the models are in place:
ls -la /opt/beechunker/data/models/

You should see all the model files copied to this location.
Once the models are in place, you can use them with the optimizer without running the trainer:
# Use the Random Forest model (default and recommended)
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file --model-type rf
# Or try the XGBoost model
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file --model-type xgb

The optimizer will automatically load the corresponding pretrained model based on the --model-type parameter.
The pretrained models were developed based on extensive testing across various file sizes and access patterns:
- Random Forest (rf_model.joblib):
  - Best for general-purpose use
  - Performs well on files from 64 KB to 2 GB
  - More consistent predictions
  - Recommended for most use cases
- XGBoost (xgboost_model.json):
  - May provide better optimization for very large files (>1 GB)
  - More aggressive in its optimization recommendations
  - Can be used for specialized workloads
You can test the performance of these pretrained models using the provided utility scripts:
# Run the demo with the Random Forest model
python demo.py --model rf
# Run the demo with the XGBoost model
python demo.py --model xgb
# Compare both models
python model_comparison.py

These tests will create sample files, optimize them using the pretrained models, and report performance improvements.
If you wish to customize these models or train new ones based on your specific workload:
- Run the monitor service for some time to collect real-world access patterns:
python beechunker/cli/monitor_cli.py run
- Export the collected data:
python beechunker/cli/monitor_cli.py export-data --output ~/beechunker_training_data.csv
- Train a custom model:
python beechunker/cli/trainer_cli.py train --input-csv ~/beechunker_training_data.csv
Note that due to the current limitations in the trainer service, you may need to manually prepare the data for training.
Although the trainer service currently has limitations with model compatibility, you can still set up a cron job to run the trainer periodically. This may be useful for future versions when these issues are resolved.
# Edit crontab file
crontab -e

Add one of the following lines to the crontab file:
# Run the trainer daily at 2:00 AM
0 2 * * * /path/to/BeeChunker/venv/bin/python /path/to/BeeChunker/beechunker/cli/trainer_cli.py train >> /opt/beechunker/data/logs/cron_trainer.log 2>&1
# Or, if you created symbolic links:
0 2 * * * /usr/local/bin/beechunker-trainer train >> /opt/beechunker/data/logs/cron_trainer.log 2>&1
Save the file to set up the cron job. Remember that due to the current limitations in the trainer service, this cron job might not work correctly until the preprocessing compatibility issues are resolved.
If you prefer to run the services manually or for testing purposes, you can run them directly from the command line:
# Start the monitoring service
python beechunker/cli/monitor_cli.py run
# Check monitoring statistics
python beechunker/cli/monitor_cli.py stats
# Clean up old data (keep last 30 days)
python beechunker/cli/monitor_cli.py cleanup --days 30

# Train models using data from the database
python beechunker/cli/trainer_cli.py train
# Train using a specific CSV file
python beechunker/cli/trainer_cli.py train --input-csv /path/to/data.csv
# Make a prediction for a specific file
python beechunker/cli/trainer_cli.py predict --file-size 1073741824 --read-count 100 --write-count 20

# Run the optimizer service continuously
python beechunker/cli/optimizer_cli.py run
# Optimize a single file
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file
# Choose a specific model type (rf, som, or xgb)
python beechunker/cli/optimizer_cli.py optimize-file /path/to/file --model-type xgb
# Optimize all files in a directory
python beechunker/cli/optimizer_cli.py optimize-dir /path/to/directory --recursive
# Analyze without changing anything (dry run)
python beechunker/cli/optimizer_cli.py optimize-dir /path/to/directory --dry-run
# Analyze file to show predicted chunk size without changing
python beechunker/cli/optimizer_cli.py analyze /path/to/file
# Bulk optimize based on database query
python beechunker/cli/optimizer_cli.py bulk-optimize --min-access 10

BeeChunker includes two utility scripts to demonstrate and evaluate the system:
Important Notes on Model Availability:
- SOM Model (DEPRECATED): The Self-Organizing Map model was implemented as a proof of concept and is not intended for production use. It should be considered deprecated and unusable for actual optimization.
- Trainer Service Limitations: The trainer service currently has compatibility issues between models due to differing preprocessing requirements. These have not been fully resolved, so automatic model training may not work as expected; manual model comparisons and testing are recommended instead.
- Recommended Model: The Random Forest (RF) model is currently the most stable and is recommended for production use. XGBoost is available for experimental comparisons.
The demo.py script demonstrates the system by creating test files, simulating access patterns, and showing performance improvements from chunk size optimization:
# Run the demo using the Random Forest model (default)
python demo.py --model rf
# Run the demo using the XGBoost model
python demo.py --model xgb

The model_comparison.py script runs a comprehensive comparison between the Random Forest and XGBoost models:
# Run the model comparison with 3 trials per scenario (default)
python model_comparison.py
# Run with more trials for more robust results
python model_comparison.py --trials 5

This will generate detailed comparison plots in the comparison_plots/ directory and print a summary of model performance.
# Check service status
sudo systemctl status beechunker-monitor
sudo systemctl status beechunker-optimizer
sudo systemctl status beechunker-trainer
# View service logs
journalctl -u beechunker-monitor.service
journalctl -u beechunker-optimizer.service
journalctl -u beechunker-trainer.service
# View application logs
tail -f /opt/beechunker/data/logs/monitor.log
tail -f /opt/beechunker/data/logs/optimizer.log
tail -f /opt/beechunker/data/logs/trainer.log

- Missing BeeGFS Tools:
- Error: "Command 'beegfs-ctl' not found"
- Solution: Ensure BeeGFS is installed and tools are in PATH
- Permission Issues:
- Error: "Permission denied"
- Solution: Run with sudo or adjust file permissions
- Database Issues:
- Error: "Database not found" or "no such table"
- Solution: Ensure monitor service has run at least once to create database schema
- Model Training Failures:
- Error: "Not enough samples for training"
- Solution: Gather more access data (at least 100 samples by default)
- Service Won't Start:
  - Check logs: journalctl -u beechunker-monitor.service -n 50
  - Verify paths in the service file
The BeeChunker configuration file supports many customization options:
{
"monitor": {
"db_path": "/opt/beechunker/data/access_patterns.db",
"log_path": "/opt/beechunker/data/logs/monitor.log",
"polling_interval": 300 // Check interval in seconds
},
"optimizer": {
"log_path": "/opt/beechunker/data/logs/optimizer.log",
"min_chunk_size": 64, // Minimum chunk size in KB
"max_chunk_size": 4096 // Maximum chunk size in KB
},
"ml": {
"models_dir": "/opt/beechunker/data/models",
"log_path": "/opt/beechunker/data/logs/trainer.log",
"training_interval": 86400, // Training interval in seconds (daily)
"min_training_samples": 100, // Minimum samples required for training
"som_iterations": 5000, // SOM training iterations
"n_estimators": 100, // RF tree count
"hgb_iter": 1000, // XGBoost iterations
"ot_quantile": 0.65 // Optimal Throughput quantile threshold
},
"beegfs": {
"mount_points": ["/mnt/beegfs"] // BeeGFS mount points to monitor
}
}

- Database Size: The monitor database can grow large over time. Use the cleanup function regularly.
- CPU Usage: Model training can be CPU intensive. Consider running training during off-peak hours.
- Storage Overhead: Changing chunk sizes creates temporary files. Ensure sufficient free space.
- Optimization Frequency: Frequent chunk size changes can cause overhead. Use appropriate thresholds.
The monitor service tracks file operations using the watchdog library and BeeGFS command-line tools:
- Uses file system event handlers to detect read/write operations
- Records access events in a SQLite database
- Tracks file metadata including size, chunk size, and access patterns
- Maintains separate tables for file metadata, access events, and throughput metrics
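The separate tables mentioned above can be sketched as a minimal SQLite schema (table and column names here are illustrative; db_manager.py defines the real schema):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metadata (
    path TEXT PRIMARY KEY,
    file_size INTEGER,
    chunk_size_kb INTEGER
);
CREATE TABLE IF NOT EXISTS access_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT REFERENCES file_metadata(path),
    op TEXT CHECK (op IN ('read', 'write')),
    bytes INTEGER,
    ts REAL
);
"""

def init_db(db_path=":memory:"):
    """Create the monitoring tables if they do not already exist."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn
```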
BeeChunker implements different ML models for chunk size prediction:
- Random Forest (RF):
- Ensemble of decision trees for robust classification
- Stacks multiple RF models for higher accuracy
- Includes feature importance analysis
- Most stable and reliable model for production use
- XGBoost:
- Gradient boosting implementation for high accuracy
- Fast prediction with low memory footprint
- Handles complex feature interactions well
- Currently in experimental stage
- Self-Organizing Map (SOM) (DEPRECATED):
- Was implemented as a proof of concept only
- Not intended for production use
- Included in the codebase for academic purposes only
- Should NOT be used for actual chunk size optimization
The optimizer uses a sophisticated approach to change chunk sizes:
- Extracts features from the file to predict optimal chunk size
- Uses BeeGFS tools to create a new file with the optimal chunk size
- Copies data from the original file to the new file
- Performs an atomic swap to replace the original file
- Records the optimization in the database for tracking
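Steps 2–4 above boil down to a copy-then-atomic-replace pattern. A simplified standard-library sketch (the real implementation also creates the new file with the predicted chunk size via BeeGFS tools, which is omitted here):

```python
import os
import shutil

def swap_in_optimized_copy(original, optimized_tmp):
    """Replace `original` with `optimized_tmp` atomically.

    `optimized_tmp` is assumed to be a fully written copy (created with the
    new chunk size) on the same filesystem, so os.replace is atomic.
    """
    shutil.copystat(original, optimized_tmp)  # carry over permissions/times
    os.replace(optimized_tmp, original)       # atomic rename on POSIX
```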
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License - see the LICENSE file for details.
- BeeGFS team for their excellent parallel filesystem
- MiniSom library for the Self-Organizing Map implementation
- The scikit-learn and XGBoost teams for their machine learning libraries
This project was developed collaboratively by three team members, each contributing to specific components:
- Complete CLI structure and services integration (beechunker/cli/)
- System fundamentals and configuration (beechunker/common/)
- Custom data types and events (beechunker/custom_types/)
- Monitoring system for file access patterns (beechunker/monitor/)
- Chunk size optimization implementation (beechunker/optimizer/)
- System packaging and deployment (setup.py)
- Performance demonstration tool (demo.py)
- Model comparison framework (model_comparison.py)
- Service deployment examples (services/)
- Feature engineering pipeline (beechunker/ml/feature_engineering.py)
- Self-Organizing Maps prototype (beechunker/ml/som.py, models/som_model.joblib)
- Visualization utilities (beechunker/ml/visualization.py)
- Setting up the base BeeGFS system by installing and configuring the client kernel module
- Script for setting up BeeGFS system (Aryan/Build.sh)
- BeeGFS testing script (Aryan/combined_test.sh)
- BeeGFS I/O test script (Aryan/fio_tests.sh)
- Data generation script for ML model training (Aryan/beegfs_test.sh)
- Data pre-processing (beechunker/ml/feature_extraction.py)
- Data Analysis
- Random Forest implementation (beechunker/ml/random_forest.py)
- Feature extraction framework (beechunker/ml/feature_extraction.py)
- Test dataset creation (data/)
- Production model training and optimization:
- Base Random Forest model (models/rf_base.joblib)
- Ensemble Random Forest model (models/rf_model.joblib)
- Fine-tuning Random Forest using HGBoost (models/hgb_base.joblib)
- Candidate chunk list mapping chunk-size groups to file size (models/candidate_chunks.joblib)
- Feature names to keep important features consistent throughout (models/feature_names.joblib)
- XGBoost implementation (beechunker/ml/xgboost_model.py)
- Custom feature engineering for XGBoost (beechunker/ml/xgboost_feature_engine.py)
- Production model training and optimization:
- Final XGBoost model (V2) (models/xgboost_model.joblib)
- Base XGBoost model (V1) (branch: TVT_dev)
- Data Visualizations
