Installation

You first need to initialize and update the submodule by running the following commands:

git submodule init
git submodule update
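
Equivalently, the two steps can be combined into a single command:

git submodule update --init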

Then, you can install the required packages for OpenRLHF with the following commands:

conda create -n <env_name> python=3.10  # choose an environment name; the Python version here is only a suggestion
conda activate <env_name>
cd OpenRLHF
pip install -e .[vllm]
pip install evalplus # required for rule-based reward for code generation
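
As an optional sanity check (not part of the original instructions), you can verify that the key packages import correctly inside the new environment:

python -c "import openrlhf, vllm, evalplus; print('environment OK')"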

Data Preparation

  • To build AceCode-87K-hard, which keeps only the hardest 25% of the examples and makes RL training faster, run the following command:
python scripts/get_hard_data.py --dataset_path "TIGER-Lab/AceCode-87K" --output_path "./data/acecode_87K/acecode_87K.json" --only_keep_hard_examples True
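
Optionally, you can confirm that the filtered file was written and see how many entries it contains (a quick hedged check, assuming the script writes a single JSON file at the path above):

python -c "import json; data = json.load(open('./data/acecode_87K/acecode_87K.json')); print(len(data), 'entries')"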

Reward Model Preparation

AceCodeRM-7B was trained with LlamaFactory, so its checkpoint format differs slightly from the OpenRLHF RM format, but the two are largely the same. The only difference is that LlamaFactory enables bias=True for the final linear layer (the value head), while OpenRLHF uses bias=False.

There are two ways to use the RM for RL training:

  • Directly set reward_pretrain="TIGER-Lab/AceCodeRM-7B" and value_head_prefix="summary" in the RL training script.
  • Convert the RM to OpenRLHF format weights with the following command:
python scripts/change_lf_rm_to_openrlhf_rm.py --lf_rm_model_path "TIGER-Lab/AceCodeRM-7B" --openrlhf_rm_model_path "./models/AceCodeRM-7B-openrlhf" --push_to_hub False

Then, set reward_pretrain="./models/AceCodeRM-7B-openrlhf" and value_head_prefix="score" in the RL training script.
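
If you are unsure which prefix a given checkpoint expects, one hedged way to check (assuming the checkpoint is stored as sharded safetensors with a model.safetensors.index.json, as is typical for 7B models on Hugging Face) is to list the weight names that contain the value head:

python -c "
import json
from huggingface_hub import hf_hub_download
index_file = hf_hub_download('TIGER-Lab/AceCodeRM-7B', 'model.safetensors.index.json')
weight_names = json.load(open(index_file))['weight_map']
print([name for name in weight_names if 'summary' in name or 'score' in name])
"

Whichever prefix appears in the weight names ("summary" for the original LlamaFactory checkpoint, "score" after conversion) is the value to pass as value_head_prefix.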

(Note: we use LlamaFactory to train the RM for historical reasons. We have also tried training the RM with OpenRLHF, and the performance is similar.)

Training RL

Please run export WANDB_API_KEY=your_wandb_api_key before running the following scripts so that training metrics are logged to Weights & Biases.

  • with reward model
bash scripts/train_reinforce_ray.sh # REINFORCE++
# and change the following variables in the script
# policy_pretrain="Your initial policy model"
# reward_pretrain="TIGER-Lab/AceCodeRM-7B"
# dataset_path="./data/acecode_87K/acecode_87K.json"
# run_name="Your run name"
  • with rule-based reward (binary pass rate)
bash scripts/train_reinforce_ray_rule_rm.sh # REINFORCE++
# and change the following variables in the script
# policy_pretrain="Your initial policy model"
# binary_reward=True 
# dataset_path="./data/acecode_87K/acecode_87K.json"
# run_name="Your run name"
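
For intuition, the following is a minimal sketch, not the repository's actual reward code, of one plausible reading of "binary pass rate": the generated program receives full reward only when every test passes, whereas a fractional pass rate gives partial credit:

python - <<'EOF'
# Hypothetical per-test outcomes for one generated program (illustration only).
results = [True, True, False, True]
fractional_reward = sum(results) / len(results)   # partial credit, e.g. 0.75
binary_reward = 1.0 if all(results) else 0.0      # full credit only if all tests pass
print(fractional_reward, binary_reward)
EOF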