Infernal is a powerful yet lightweight tool for running Large Language Models (LLMs) locally with blazing fast inference, comprehensive benchmarking, and model finetuning capabilities. Built on top of llama.cpp for maximum performance.
- Fast Local Inference: Lightning-fast model execution using llama.cpp
- Simple Downloads: Pull models directly from HuggingFace with zero configuration
- Interactive Chat: Engage in real-time conversations with your models
- Performance Benchmarking: Detailed performance metrics and analysis
- Model Finetuning: Train custom models using LoRA/PEFT with simple Modelfiles
- Clean Management: Simple model organization with automatic configuration
- Zero Setup: No complex configuration files - just download and run
Method 1: Using pip (Recommended)
git clone https://github.com/chandraprvkvsh/Infernal.git
cd infernal
pip install .
Method 2: Direct installation
git clone https://github.com/chandraprvkvsh/Infernal.git
cd infernal
pip install -r requirements.txt
After Method 1, you can use the infernal
command globally. For Method 2, use python infernal.py
.
Download a model:
Using pip installation:
infernal pull --repo-id TheBloke/Llama-2-7B-Chat-GGUF --filename llama-2-7b-chat.Q4_K_M.gguf
Or with direct URL:
infernal pull --url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
Using direct method:
python infernal.py pull --repo-id TheBloke/Llama-2-7B-Chat-GGUF --filename llama-2-7b-chat.Q4_K_M.gguf
List your models:
infernal list
or
python infernal.py list
Run inference:
Single prompt:
infernal run llama-2-7b-chat.Q4_K_M.gguf --prompt "Explain quantum computing in simple terms"
Interactive chat:
infernal run llama-2-7b-chat.Q4_K_M.gguf --interactive
Using direct method:
python infernal.py run llama-2-7b-chat.Q4_K_M.gguf --prompt "Explain quantum computing"
Benchmark performance:
infernal benchmark llama-2-7b-chat.Q4_K_M.gguf --prompt "Write a short story about AI" --repeat 5
Using direct method:
python infernal.py benchmark llama-2-7b-chat.Q4_K_M.gguf --prompt "Write a story" --repeat 3
Remove models:
infernal remove llama-2-7b-chat.Q4_K_M.gguf
or
python infernal.py remove llama-2-7b-chat.Q4_K_M.gguf
One of Infernal's most powerful features is the ability to finetune LLMs using a simple Modelfile approach, similar to Docker's approach but for AI models.
A Modelfile is a simple text file that describes how to finetune a model. It contains the base model, training parameters, and example conversations.
Here's a complete example (Modelfile
):
# Infernal Modelfile for Customer Support Bot
# Your HuggingFace token (required for private models)
HF_TOKEN your_huggingface_token_here
# Base model to finetune
FROM meta-llama/Llama-3.2-1B-Instruct
# Training configuration
PARAMETER device cuda
PARAMETER max_length 512
PARAMETER learning_rate 3e-5
PARAMETER epochs 3
PARAMETER batch_size 2
PARAMETER weight_decay 0.01
PARAMETER warmup_steps 100
# LoRA configuration for efficient training
PARAMETER lora true
PARAMETER lora_r 8
PARAMETER lora_alpha 32
PARAMETER lora_dropout 0.05
PARAMETER lora_target_modules q_proj,v_proj,k_proj,o_proj
# System prompt
SYSTEM You are a helpful customer support assistant for TechCorp. Be friendly, professional, and provide clear solutions.
# Training examples (as many as you need)
MESSAGE user How do I reset my password?
MESSAGE assistant To reset your password: 1) Go to the login page 2) Click "Forgot Password" 3) Enter your email 4) Check your email for reset instructions 5) Follow the link to create a new password. If you need further help, please contact our support team.
MESSAGE user My order hasn't arrived yet. What should I do?
MESSAGE assistant I understand your concern about your delayed order. Let me help you: 1) Check your email for tracking information 2) Use the tracking number on our website 3) If it's been more than 5 business days, please provide your order number and I'll investigate further. We apologize for any inconvenience.
MESSAGE user How do I return a defective product?
MESSAGE assistant For defective product returns: 1) Visit our Returns page 2) Select "Defective Item" as the reason 3) Upload photos of the defect 4) Print the prepaid return label 5) Package securely and ship back. You'll receive a replacement or refund within 5-7 business days after we receive the item.
MESSAGE user What's your refund policy?
MESSAGE assistant Our refund policy allows returns within 30 days of purchase. Items must be unused and in original packaging. Refunds typically process within 3-5 business days after we receive the returned item. Digital products and personalized items are non-refundable. Would you like help starting a return?
Model Configuration:
FROM
: Base model from HuggingFace (required)HF_TOKEN
: Your HuggingFace access tokenSYSTEM
: System prompt that defines the assistant's role
Training Parameters:
device
: Training device (cuda
for GPU,cpu
for CPU)epochs
: Number of training cycles (default: 3)batch_size
: Training batch size (default: 2)learning_rate
: Learning rate (default: 2e-5)max_length
: Maximum sequence length (default: 2048)weight_decay
: Weight decay for regularizationwarmup_steps
: Number of warmup steps
LoRA/PEFT Parameters (for efficient training):
lora
: Enable LoRA training (true
/false
)lora_r
: LoRA rank (default: 8)lora_alpha
: LoRA alpha parameter (default: 32)lora_dropout
: LoRA dropout rate (default: 0.05)lora_target_modules
: Target modules for LoRA (e.g.,q_proj,v_proj
)
Once you have your Modelfile ready:
Basic finetuning
infernal finetune --modelfile Modelfile --output my-custom-bot
Override parameters from command line
infernal finetune --modelfile Modelfile --output my-custom-bot --epochs 5 --batch-size 4
Specify a custom name for the model
infernal finetune --modelfile Modelfile --output my-custom-bot --name "customer-support-v1"
The finetuning process will:
- Download the base model from HuggingFace
- Prepare training data from your MESSAGE examples
- Apply LoRA adapters for efficient training (if enabled)
- Train the model using HuggingFace Trainer
- Merge adapters back into the base model
- Save the result in standard Hugging Face format
You can optionally convert the merged model to GGUF format using llama-cpp
, and push the converted model to Hugging Face if desired.
Then use your custom model:
infernal pull --url https://huggingface.co/my-username/my-custom-bot/resolve/main/my-custom-bot.gguf
infernal run my-custom-bot --interactive
For better results:
- Use 10-50 example conversations in your Modelfile
- Keep examples focused on your specific use case
- Use consistent formatting in your MESSAGE blocks
- Test with small models first (1B-3B parameters)
For efficient training:
- Enable LoRA (
PARAMETER lora true
) to reduce memory usage - Use smaller batch sizes if you run out of memory
- Start with fewer epochs (3-5) to avoid overfitting
Hardware recommendations:
- CPU training: Works but slow, use small models and batch_size=1
- GPU training: Much faster, can use larger models and batch sizes
- Memory: 8GB+ RAM minimum, 16GB+ recommended
Infernal provides comprehensive performance analysis with accurate timing measurements:
- Time to First Token (TTFT): Latency before first response token
- Throughput: Tokens generated per second during generation phase
- Total Token Rate: Combined input/output processing speed
- Memory Usage: Peak RAM consumption during inference
- Generation Time: Pure text generation time (excluding TTFT)
Infernal uses actual measured timings without artificial thresholds, providing authentic performance data that reflects real-world usage patterns.
Run 1/3 - Prompt 1
Time to first token: 0.245 seconds
Generation time (After-TTFT): 2.156 seconds
Throughput (generated tok/sec): 23.45
Total tokens/sec: 19.87
Peak memory usage: 8547.23 MB
Averages across all runs:
Total runs: 3
Total tokens generated: 150
Total time required: 6.78 seconds
Avg Time to first token: 0.251 seconds
Avg throughput (generated tok/sec): 22.89
Avg total tokens/sec: 19.34
Avg Peak memory usage: 8521.45 MB
Run benchmarks multiple times for statistical accuracy:
infernal benchmark model.gguf --prompt "Test prompt" --repeat 10
You can also benchmark using prompts from a Modelfile:
infernal benchmark model.gguf --promptfile Modelfile
This project uses modern Python packaging with pyproject.toml
instead of the traditional setup.py
. This provides:
- Cleaner dependency management: All project metadata in one file
- Modern build system: Uses
hatchling
as the build backend - Automatic script creation:
infernal
command is automatically available system-wide - Better development workflow: Use
pip install -e .
for editable installs
Benefits:
- Run
infernal
command from anywhere - Cleaner project structure
- Modern Python packaging standards
- Easy uninstallation with
pip uninstall infernal
Traditional approach using requirements.txt
for those who prefer direct control:
- Manual dependency installation:
pip install -r requirements.txt
- Direct script execution:
python infernal.py [command]
- No system-wide installation: Runs only in project directory
When to use each method:
- Use Method 1 for permanent installation and system-wide access
- Use Method 2 for development, testing, or temporary usage
By default, models are stored in a models/
directory with a config.json
file for management. Change the storage location with:
infernal --models-dir /path/to/models pull --url [URL]
Models are configured automatically, but you can modify settings in models/config.json
:
{
"models": {},
"default_model": null,
"settings": {
"max_tokens": 2048,
"temperature": 0.2,
"top_p": 0.7
}
}
- GGUF files: Primary format for llama.cpp compatibility
- Quantized models: Q4_K_M, Q5_K_M, Q8_0, etc.
- HuggingFace integration: Direct downloads from HF repositories
- Python 3.9+
- 4GB RAM (for small 1B-3B models)
- 8GB+ RAM (for 7B models)
- 16GB+ RAM (for 13B+ models)
- 2GB+ free disk space per model
- 16GB+ RAM for comfortable usage with 7B models
- SSD storage for faster model loading
- Multi-core CPU for better throughput
- NVIDIA GPU for finetuning (optional but recommended)
- Choose appropriate size: Start with 7B models for best speed/quality balance
- Use quantized models: Q4_K_M offers good performance with reduced memory
- Consider context length: Shorter contexts process faster
For NVIDIA GPU acceleration:
pip uninstall llama-cpp-python
pip install llama-cpp-python --force-reinstall
- Close unnecessary applications before running large models
- Use swap space if physical RAM is limited
- Monitor memory usage during benchmarking
"llama-cpp-python not installed"
pip install llama-cpp-python
If issues persist, try:
pip install llama-cpp-python --force-reinstall --no-cache-dir
Permission errors with pip install
pip install -e . --user
or use virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -e .
Out of memory errors
- Try smaller models (7B instead of 13B)
- Use more aggressive quantization (Q4_0 instead of Q5_K_M)
- Reduce context size in the code
- Add swap space:
sudo swapon /swapfile
Slow performance
- Verify CPU utilization with
top
or Task Manager - Ensure model file is on fast storage (SSD)
- Check for background processes consuming resources
- Consider GPU acceleration for supported hardware
Model download failures
- Check internet connection and HuggingFace availability
- Verify model repository exists and is public
- Try downloading smaller test model first
- Use
--url
method if--repo-id
fails
CUDA out of memory during training Reduce batch size:
PARAMETER batch_size 1
Enable gradient accumulation:
PARAMETER gradient_accumulation_steps 4
Use LoRA for efficient training:
PARAMETER lora true
PARAMETER load_in_4bit true
Slow training on CPU Use smaller models for CPU training:
FROM microsoft/DialoGPT-small
Reduce sequence length:
PARAMETER max_length 256
Use minimal training:
PARAMETER epochs 1
Import errors after installation
which python
pip list | grep infernal
Reinstall in development mode:
pip uninstall infernal
pip install -e .
Command not found after pip install
pip show infernal
Add pip install directory to PATH if needed.
infernal/
├── infernal.py # Main application code
├── requirements.txt # Dependencies for direct installation
├── pyproject.toml # Modern Python project configuration
├── README.md # This documentation
├── tests/ # Test suite
│ ├── test_benchmark.py
│ ├── test_cli.py
│ ├── test_inference.py
│ ├── test_model_management.py
│ └── test_integration.py
├── Modelfile # Example Modelfile for finetuning
└── models/ # Model storage (created automatically)
└── config.json # Model configuration
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch:
git checkout -b feature-name
- Make your changes following the existing code style
- Add tests for new functionality
- Update documentation if needed
- Submit a pull request
MIT License - see LICENSE file for details.
- llama.cpp - High-performance inference engine
- Hugging Face - Model hosting and distribution platform
- Rich - Beautiful terminal output library
- Click - Command-line interface framework
- PEFT - Parameter-Efficient Fine-tuning library
Unleash the power of local LLMs with Infernal - inference, benchmarking, and finetuning made simple.