This repository contains the implementation of SSPO and baselines (DPO, ORPO, SimPO, KTO, SSRM, SPA).
## Installation

- Create a virtual environment named `sspo` with Python 3.10 or higher:

  ```bash
  conda create -n sspo python==3.10.0
  conda activate sspo
  ```

- Install required packages:

  ```bash
  cd SSPO
  pip install -r requirements.txt
  ```

## Data Preprocessing

- Preprocess the data:
  ```bash
  python preprocessing_data/preprocessing_ultrachat.py --fb [feedback_ratio] --ch [chat_ratio]
  python preprocessing_data/preprocessing_medical.py --fb [feedback_ratio] --ch [chat_ratio]
  python preprocessing_data/preprocessing_business.py --fb [feedback_ratio] --ch [chat_ratio]
  ```

## SSPO Training

- Generate the YAML configuration and training command:
  ```bash
  python examples/train/make_yaml.py
  python examples/train/make_yaml_medical.py
  python examples/train/make_yaml_business.py
  ```

- Execute training:

  ```bash
  # Copy the generated command from the make_yaml.py output
  # and paste it into examples/train/train.sh, then run:
  bash examples/train/train.sh
  ```

## Baselines (DPO, ORPO, SimPO, KTO)

Follow the same steps as for SSPO, but change the training method in examples/train/train.sh.
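For orientation, a make_yaml.py-style script reduces to writing a config file and printing the matching launch command. The sketch below is a minimal illustration of that pattern only; the config fields, base model, and launch command are assumptions for this example, not the repository's actual schema.

```python
# Minimal sketch of a config-generating script in the spirit of make_yaml.py.
# All field names, the base model, and the launch command are illustrative
# assumptions, not the repository's actual schema.
from pathlib import Path


def make_yaml(method: str, beta: float, out_dir: str = ".") -> str:
    """Write a small training config and return the command that would run it."""
    config = "\n".join([
        "model_name_or_path: meta-llama/Meta-Llama-3-8B",  # assumed base model
        f"method: {method}",  # e.g. sspo, dpo, orpo, simpo, kto
        f"beta: {beta}",      # preference-loss temperature
        "learning_rate: 5.0e-7",
        "num_train_epochs: 1",
    ])
    path = Path(out_dir) / f"{method}.yaml"
    path.write_text(config + "\n")
    # In the real workflow, the printed command would be pasted into train.sh.
    return f"accelerate launch examples/train/run.py {path}"


if __name__ == "__main__":
    print(make_yaml("sspo", beta=0.1))
```

Swapping the `method` argument (e.g. `dpo` instead of `sspo`) is the kind of change the baselines above would make in train.sh.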
## SPA

We follow the implementation from the SPA repository; please refer to that repository for detailed instructions.
## SSRM

- Generate additional unlabeled responses:

  ```bash
  python examples/SSRM/generate_responses.py
  ```

- Perform pseudo-labeling using a pre-trained reward model:

  ```bash
  python examples/SSRM/pseudo_label.py
  ```

- Filter the data based on a confidence threshold:

  ```bash
  python examples/SSRM/conf_threshold.py
  ```

- Merge the feedback data:

  ```bash
  python examples/SSRM/merge_json.py
  ```

- Execute the complete SSRM training pipeline:

  ```bash
  # Configure the number of iterations in examples/SSRM/train-ssrm.sh;
  # the script runs steps 1-4 for the specified number of iterations.
  bash examples/SSRM/train-ssrm.sh
  ```

## Notes

- Make sure to adjust the hyperparameters in the YAML configuration file generated by make_yaml.py.
- For SSRM, you can control the number of iterations by modifying the commands in train-ssrm.sh.
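The SSRM pseudo-labeling and confidence-filtering steps above can be sketched as follows. This is an illustration only: the reward model is replaced by a toy scoring function, and the confidence rule (a Bradley-Terry-style sigmoid over the score gap) and the threshold value are assumptions standing in for whatever pseudo_label.py and conf_threshold.py actually implement.

```python
# Illustrative sketch of pseudo-labeling response pairs with a reward model
# and keeping only confident labels. The scorer below is a toy stand-in.
import math


def reward_score(response: str) -> float:
    """Stand-in for a reward model; real code would query a trained RM."""
    return float(len(response)) / 100.0  # toy heuristic, assumption


def pseudo_label(pairs, threshold=0.8):
    """Label (prompt, resp_a, resp_b) triples; keep only confident pairs.

    Confidence is modeled as the Bradley-Terry probability that the
    higher-scored response wins: sigmoid(|r_a - r_b|).
    """
    kept = []
    for prompt, resp_a, resp_b in pairs:
        r_a, r_b = reward_score(resp_a), reward_score(resp_b)
        confidence = 1.0 / (1.0 + math.exp(-abs(r_a - r_b)))
        if confidence >= threshold:  # conf_threshold.py-style filtering
            chosen, rejected = (resp_a, resp_b) if r_a >= r_b else (resp_b, resp_a)
            kept.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return kept
```

The surviving records would then be merged with the human-labeled feedback data (the merge_json.py step) before the next training iteration.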