xTuring makes it simple, fast, and cost-efficient to fine-tune open-source LLMs (e.g., GPT-OSS, LLaMA/LLaMA 2, Falcon, GPT-J, GPT-2, OPT, Bloom, Cerebras, Galactica) on your own data, locally or in your private cloud.
Why xTuring:
- Simple API for data prep, training, and inference
- Private by default: run locally or in your VPC
- Efficient: LoRA and low‑precision (INT8/INT4) to cut costs
- Scales from CPU/laptop to multi‑GPU easily
- Evaluate models with built‑in metrics (e.g., perplexity)
Install xTuring from PyPI:
pip install xturing
Run a small, CPU‑friendly example first:
from xturing.datasets import InstructionDataset
from xturing.models import BaseModel
# Load a toy instruction dataset (Alpaca format)
dataset = InstructionDataset("./examples/models/llama/alpaca_data")
# Start small for quick iterations (works on CPU)
model = BaseModel.create("distilgpt2_lora")
# Fine‑tune and then generate
model.finetune(dataset=dataset)
output = model.generate(texts=["Explain quantum computing for beginners."])
print(f"Model output: {output}")
Want bigger models and reasoning controls? Try GPT‑OSS variants (requires significant resources):
from xturing.models import BaseModel
# 120B or 20B variants; also support LoRA/INT8/INT4 configs
model = BaseModel.create("gpt_oss_20b_lora")
You can find the sample data folder in the repository at examples/models/llama/alpaca_data (the path used in the example above).
Highlights from recent updates:
- GPT‑OSS integration – Use and fine‑tune `gpt_oss_120b` and `gpt_oss_20b` with off‑the‑shelf, INT8, LoRA, LoRA+INT8, and LoRA+INT4 options. Includes configurable reasoning levels and harmony response format support.
from xturing.models import BaseModel
# Use the production-ready 120B model
model = BaseModel.create('gpt_oss_120b_lora')
# Or use the efficient 20B model for faster inference
model = BaseModel.create('gpt_oss_20b_lora')
# Both models support reasoning levels via system prompts
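The code comment above notes that reasoning levels are set through system prompts. As a minimal, hedged sketch (the exact prompt wiring is an assumption here, not a documented xTuring API; GPT-OSS models read a harmony-style `Reasoning: <level>` directive), you could prepend the directive to the text passed to `generate()`:

```python
from xturing.models import BaseModel

# Assumed prompt format: a harmony-style reasoning directive followed by the
# user request, joined into a single prompt string.
model = BaseModel.create("gpt_oss_20b_lora")

system_prompt = "Reasoning: high"  # assumed reasoning-level directive
user_prompt = "Explain quantum computing for beginners."

output = model.generate(texts=[f"{system_prompt}\n\n{user_prompt}"])
print(output)
```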
- LLaMA 2 integration – Off‑the‑shelf, INT8, LoRA, LoRA+INT8, and LoRA+INT4 via `GenericModel` or `Llama2`.
from xturing.models import Llama2
model = Llama2()
## or
from xturing.models import BaseModel
model = BaseModel.create('llama2')
- Evaluation – Evaluate any causal LM on any dataset. Currently supports `perplexity`.
# Make the necessary imports
from xturing.datasets import InstructionDataset
from xturing.models import BaseModel
# Load the desired dataset
dataset = InstructionDataset('../llama/alpaca_data')
# Load the desired model (try GPT-OSS for advanced reasoning)
model = BaseModel.create('gpt_oss_20b')
# Run the Evaluation of the model on the dataset
result = model.evaluate(dataset)
# Print the result
print(f"Perplexity of the evalution: {result}")
- INT4 precision – Fine‑tune many LLMs with INT4 using `GenericLoraKbitModel`.
# Make the necessary imports
from xturing.datasets import InstructionDataset
from xturing.models import GenericLoraKbitModel
# Load the desired dataset
dataset = InstructionDataset('../llama/alpaca_data')
# Load the desired model for INT4 fine-tuning
model = GenericLoraKbitModel('tiiuae/falcon-7b')
# Run the fine-tuning
model.finetune(dataset)
- CPU inference – Run inference on CPUs (including laptops) via Intel® Extension for Transformers, using weight‑only quantization and optimized kernels on Intel platforms.
# Make the necessary imports
from xturing.models import BaseModel
# Initialize the model: quantize it with weight-only algorithms and replace
# the linear layers with ITREX's qbits_linear kernel
model = BaseModel.create("llama2_int8")
# Once the model has been quantized, do inferences directly
output = model.generate(texts=["Why LLM models are becoming so important?"])
print(output)
- Batching – Set `batch_size` in `.generate()` and `.evaluate()` to speed up processing.
# Make the necessary imports
from xturing.datasets import InstructionDataset
from xturing.models import GenericLoraKbitModel
# Load the desired dataset
dataset = InstructionDataset('../llama/alpaca_data')
# Load the desired model (INT4 + LoRA)
model = GenericLoraKbitModel('tiiuae/falcon-7b')
# Generate outputs for the desired prompts in batches
outputs = model.generate(dataset=dataset, batch_size=10)
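The same `batch_size` argument applies to evaluation. A minimal sketch, reusing the model and dataset from the snippet above:

```python
# Evaluate in batches as well; larger batches trade memory for throughput
result = model.evaluate(dataset, batch_size=10)
print(f"Perplexity: {result}")
```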
To see these pieces in action, start with the LLaMA LoRA INT4 working example in the repository. The GenericModel working example is also worth reviewing for a broader picture.
Chat with a fine-tuned model from the CLI playground:
$ xturing chat -m "<path-to-model-folder>"
Or fine-tune a model and explore it in the browser playground:
from xturing.datasets import InstructionDataset
from xturing.models import BaseModel
from xturing.ui import Playground
dataset = InstructionDataset("./alpaca_data")
model = BaseModel.create("<model_name>")
model.finetune(dataset=dataset)
model.save("llama_lora_finetuned")
Playground().launch() ## launches localhost UI
Tutorials (each follows the same create, fine-tune, and save flow; see the sketch after this list):
- Preparing your dataset
- Cerebras-GPT fine-tuning with LoRA and INT8
- Cerebras-GPT fine-tuning with LoRA
- LLaMA fine-tuning with LoRA and INT8
- LLaMA fine-tuning with LoRA
- LLaMA fine-tuning
- GPT-J fine-tuning with LoRA and INT8
- GPT-J fine-tuning with LoRA
- GPT-2 fine-tuning with LoRA
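A rough sketch of that flow for the LLaMA LoRA + INT8 case (dataset path and output folder are placeholders):

```python
from xturing.datasets import InstructionDataset
from xturing.models import BaseModel

# Alpaca-format instruction dataset (placeholder path)
dataset = InstructionDataset("./alpaca_data")

# LoRA + INT8 variant of LLaMA, following the key templates listed further below
model = BaseModel.create("llama_lora_int8")

# Fine-tune, then persist the weights for later inference
model.finetune(dataset=dataset)
model.save("./llama_lora_int8_finetuned")
```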
Here is a comparison of the performance of different fine-tuning techniques on the LLaMA 7B model, using the Alpaca dataset (52K instructions) for fine-tuning.
Hardware:
4x A100 40GB GPUs, 335GB CPU RAM
Fine-tuning parameters:
{
'maximum sequence length': 512,
'batch size': 1,
}
| LLaMA-7B | DeepSpeed + CPU Offloading | LoRA + DeepSpeed | LoRA + DeepSpeed + CPU Offloading |
|---|---|---|---|
| GPU | 33.5 GB | 23.7 GB | 21.9 GB |
| CPU | 190 GB | 10.2 GB | 14.9 GB |
| Time/epoch | 21 hours | 20 mins | 20 mins |
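To run with the same parameters yourself, recent xTuring versions expose a tunable fine-tuning configuration. A hedged sketch, assuming the `finetuning_config()` accessor and the `batch_size`/`max_length` field names (check the config object your installed version returns):

```python
from xturing.datasets import InstructionDataset
from xturing.models import BaseModel

dataset = InstructionDataset("./alpaca_data")
model = BaseModel.create("llama_lora")

# Assumed accessor and field names; inspect the returned object to confirm
# what your xTuring version actually exposes.
config = model.finetuning_config()
config.batch_size = 1
config.max_length = 512

model.finetune(dataset=dataset)
```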
Contribute by submitting your performance results on other GPUs: create an issue with your hardware specifications, memory consumption, and time per epoch.
We have already fine-tuned some models that you can use as your base or start playing with. Here is how you would load them:
from xturing.models import BaseModel
model = BaseModel.load("x/distilgpt2_lora_finetuned_alpaca")
| Model | Dataset | Path |
|---|---|---|
| DistilGPT-2 LoRA | alpaca | x/distilgpt2_lora_finetuned_alpaca |
| LLaMA LoRA | alpaca | x/llama_lora_finetuned_alpaca |
Below is a list of all the models supported via the `BaseModel` class of xTuring and their corresponding keys to load them.
| Model | Key |
|---|---|
| Bloom | bloom |
| Cerebras | cerebras |
| DistilGPT-2 | distilgpt2 |
| Falcon-7B | falcon |
| Galactica | galactica |
| GPT-OSS (20B/120B) | gpt_oss_20b, gpt_oss_120b |
| GPT-J | gptj |
| GPT-2 | gpt2 |
| LLaMA | llama |
| LLaMA 2 | llama2 |
| OPT-1.3B | opt |
The above are the base variants. Use these templates for the LoRA, INT8, and INT8 + LoRA versions:
| Version | Template |
|---|---|
| LoRA | <model_key>_lora |
| INT8 | <model_key>_int8 |
| INT8 + LoRA | <model_key>_lora_int8 |
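For example, combining the `falcon` key with these templates gives `falcon_lora`, `falcon_int8`, and `falcon_lora_int8`:

```python
from xturing.models import BaseModel

# LoRA + INT8 version of Falcon-7B, built from the key template above
model = BaseModel.create("falcon_lora_int8")
```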
To load a model's INT4 + LoRA version, use the `GenericLoraKbitModel` class:
model = GenericLoraKbitModel('<model_path>')
Replace `<model_path>` with a local directory or a Hugging Face model like `facebook/opt-1.3b`.
Roadmap:
- Support for `LLaMA`, `GPT-J`, `GPT-2`, `OPT`, `Cerebras-GPT`, `Galactica` and `Bloom` models
- Dataset generation using self-instruction
- Low-precision LoRA fine-tuning and unsupervised fine-tuning
- INT8 low-precision fine-tuning support
- OpenAI, Cohere and AI21 Studio model APIs for dataset generation
- Added fine-tuned checkpoints for some models to the hub
- INT4 LLaMA LoRA fine-tuning demo
- INT4 LLaMA LoRA fine-tuning with INT4 generation
- Support for a `GenericModel` wrapper
- Support for the `Falcon-7B` model
- INT4 low-precision fine-tuning support
- Evaluation of LLM models
- INT3, INT2, INT1 low-precision fine-tuning support
- Support for Stable Diffusion
If you have any questions, you can create an issue on this repository.
You can also join our Discord server and start a discussion in the #xturing channel.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
As an open source project in a rapidly evolving field, we welcome contributions of all kinds, including new features and better documentation. Please read our contributing guide to learn how you can get involved.