Enforce LLMs to follow rules

Recent Large Language Models (LLMs) are optimized for general-purpose tasks, but they are not guaranteed to follow the rules of a given task. Prompt engineering, which curates the prompts for a task, has been shown to improve the rate of rule following by only a small margin. In this report, we demonstrate that our curriculum learning method can effectively increase the rate of rule following across different LLMs. In particular, two-step curriculum learning, where each step combines proximal policy optimization (PPO) and low-rank adaptation (LoRA), is the most effective. For simpler tasks, a single step of PPO with LoRA is sufficient. We also demonstrate that combining PPO and LoRA requires well-chosen hyperparameter values to achieve high rates of rule following.

Prerequisites

Install the necessary Python libraries.

pip install -r requirements.txt

All training and testing scripts are run using the SLURM workload manager. I ran the experiments on SoC Compute Clusters. The cluster must have the following GPUs:

h100-47: used to run tests of rule-following
h100-96: used to run fine-tuning experiments (PPO/prompt tuning/curriculum learning)

Directory structure

Script entry points are in scripts/ folder.

evaluate

This experiment measures the quality of LLM outputs. It evaluates how much the quality of outputs changes after the LLM is fine-tuned to follow rules. Note that this experiment is only run on the game of ultimate tic-tac-toe.

The source code of the script that evaluates the LLM for this experiment is in nknguyenhc/ultimate-tictactoe/tree/fyp. The compiled JAR file has been committed to this directory.

To run one experiment,

Run a test script with scripts/test.slurm for the game of ultimate tic-tac-toe, before or after fine-tuning. Obtain the output log (as indicated in the --output SBATCH parameter) and put the log in this directory, renaming it to remove the result. prefix and the .txt suffix, e.g. rename result.LiquidAI.LFM2-350M.txt to LiquidAI.LFM2-350M.
In evaluate.slurm, edit the last argument of the bash command to point to the output log you have just put in this directory, e.g. LiquidAI.LFM2-350M.
Optionally, update the SBATCH parameters --output and --error. This will be the file containing stdout and stderr of this evaluation script.
Send the batch script.

sbatch evaluate.slurm

The result of evaluation is then stored in result.{model name}.txt, e.g. result.LiquidAI.LFM2-350M.txt.
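The renaming in the first step can be scripted. The following is a minimal sketch, not part of the repository: the helper name evaluation_log_name is hypothetical, and it only assumes the result.{name}.txt naming described above.

```python
def evaluation_log_name(test_output: str) -> str:
    """Strip the 'result.' prefix and '.txt' suffix from a test output filename.

    Hypothetical helper for illustration; the repository does this step by hand.
    """
    prefix, suffix = "result.", ".txt"
    if not (test_output.startswith(prefix) and test_output.endswith(suffix)):
        raise ValueError(f"unexpected test output name: {test_output!r}")
    # Drop the prefix from the front and the suffix from the back.
    return test_output[len(prefix):-len(suffix)]


print(evaluation_log_name("result.LiquidAI.LFM2-350M.txt"))  # LiquidAI.LFM2-350M
```

The returned name is what the last argument of the bash command in evaluate.slurm should point to.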

fine_tuning

This folder contains the core components of the fine-tuning processes (PPO, prompt tuning and curriculum learning).

c_dataset.py: dataset for connect-4 game, for PPO
c_prompt_tuning_dataset.py: dataset for connect-4 game, for prompt tuning
c_reward.py: reward model for connect-4, for both PPO and prompt tuning
cc_dataset.py: dataset for xiangqi game, for PPO and the final step of curriculum learning
cc_piece_movement_dataset.py: dataset for xiangqi game, for the piece movement step of curriculum learning
cc_reward.py: reward model for xiangqi game, for PPO and for the final and piece movement steps of curriculum learning
cc_valid_start_dataset.py: dataset for xiangqi game, for the valid starting position step of curriculum learning
cc_valid_start_reward.py: reward model for xiangqi game, for the valid starting position step of curriculum learning
ult_ttt_dataset.py: dataset for ultimate tic-tac-toe game, for PPO
ult_ttt_prompt_tuning_dataset.py: dataset for ultimate tic-tac-toe game, for prompt tuning
ult_ttt_reward.py: reward model for ultimate tic-tac-toe game, for both PPO and prompt tuning

games

This folder contains code for game logic, and the accompanying test scripts (using unittest module).

Connect-4:
- connect_4.py: Main logic for the game
- connect_4_config.py: Configuration for this game (width, height, steal rules, how many pieces in a row to win). Note that the experiments have only been carried out for the current configuration indicated.
- connect_4_test.py: Test cases for this game
Tic-tac-toe:
- ttt.py: Main logic for this game
Ultimate tic-tac-toe:
- ult_ttt.py: Main logic for this game
- ult_ttt_test.py: Test cases for this game
Xiangqi:
- xiangqi.py: Main logic for this game
- xiangqi_test.py: Test cases for this game

misc

This folder contains miscellaneous scripts to analyse training results and plot the graphs found in my FYP report.

analysis.py: Runs analysis and plots graphs from a run log of fine-tuning (e.g. result.ppo.meta-llama.Llama-3.1-8B-Instruct.out)
cc_stage_graph.py: Plot analysis graphs of the 3-stage curricula
overall.py: Plot performance graphs of different methods on each game

scripts

This folder contains various entry points for fine-tuning and rule following tests.

cl.slurm

This is our main experiment, which is on curriculum learning. The script is used to run one step within a curriculum. Hence to run a full curriculum, you need to run this script multiple times.

Update the output_dir parameter. This will be the name of the folder containing the model. If the folder is not yet created, the script will create a new folder.
Update the model_name_or_path parameter to the model that you want to fine-tune. For the first step, this must be a model available on Hugging Face, e.g. LiquidAI/LFM2-350M points to https://huggingface.co/LiquidAI/LFM2-350M. For each subsequent step, this must point to the local directory containing the model from the previous training step.
Update the step parameter to indicate the training step.

Use vls to train on valid starting positions.
Use pm to train on piece movement.
Use final to train on generating full moves.

Optionally, update the SBATCH parameters --output and --error. This will be the file containing stdout and stderr of the training script.
Send the batch script.

sbatch scripts/cl.slurm

After each step, there are two tests being run:

Testing on valid starting positions and valid moves. The result is stored in result.valid_start.{output_dir}.txt.
Testing on piece movement. The result is stored in result.piece_movement.{output_dir}.txt.
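Because each curriculum step starts from the model saved by the previous step, the chain of model_name_or_path and output_dir values can be sketched as follows. This is an illustrative sketch, not the repository's code: the names plan_curriculum and CurriculumRun, the output folder naming scheme (./cl.<step>) and the vls, pm, final ordering are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CurriculumRun:
    step: str                # value for the step parameter (vls, pm or final)
    model_name_or_path: str  # model this run starts from
    output_dir: str          # folder this run saves its model to


def plan_curriculum(base_model: str, steps: List[str], prefix: str = "cl") -> List[CurriculumRun]:
    """Chain the steps: each run loads the previous run's output_dir."""
    runs = []
    current = base_model  # the first step loads a Hugging Face model
    for step in steps:
        output_dir = f"./{prefix}.{step}"  # hypothetical naming scheme
        runs.append(CurriculumRun(step, current, output_dir))
        current = output_dir  # the next step loads this local directory
    return runs


for run in plan_curriculum("LiquidAI/LFM2-350M", ["vls", "pm", "final"]):
    print(run.step, run.model_name_or_path, "->", run.output_dir)
```

Each printed row corresponds to one submission of scripts/cl.slurm after editing the three parameters accordingly.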

ppo.slurm

This is the experiment of running PPO with LoRA on ultimate tic-tac-toe, connect-4 or xiangqi.

Update the game parameter to the game that you want to test on (ult-ttt, connect-4 or xiangqi).
Update the model_name_or_path parameter to the model that you want to fine-tune. This model must be available on Hugging Face, e.g. meta-llama/Llama-3.1-8B-Instruct points to https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. You may use the current value indicated in the script to get started.
Update the trust_remote_code parameter to the boolean value used when loading the LLM. Use False for models from Microsoft, e.g. microsoft/Phi-3-mini-4k-Instruct and microsoft/phi-4, and True for all other models.
Optionally, update the output_dir parameter. This will be the name of the folder containing the model. If the folder is not yet created, the script will create a new folder.
Optionally, update the SBATCH parameters --output and --error. This will be the file containing stdout and stderr of the training script.
Send the batch script.

sbatch scripts/ppo.slurm

After the training script has run, the model will be saved to the folder indicated in output_dir. Run the test script with test.slurm on this fine-tuned model.

prompt_tuning.slurm

This is the experiment of running PPO with prefix tuning on ultimate tic-tac-toe and connect-4.

Update the game parameter to the game that you want to test on (ult-ttt or connect-4).
Update the model_name_or_path parameter to the model that you want to fine-tune. This model must be available on Hugging Face, e.g. google/gemma-2-2b-it points to https://huggingface.co/google/gemma-2-2b-it. You may use the current value indicated in the script to get started.
Update the trust_remote_code parameter to the boolean value used when loading the LLM. Use False for models from Microsoft, e.g. microsoft/Phi-3-mini-4k-Instruct and microsoft/phi-4, and True for all other models.
Optionally, update the output_dir parameter. This will be the name of the folder containing the model. If the folder is not yet created, the script will create a new folder.
Optionally, update the SBATCH parameters --output and --error. This will be the file containing stdout and stderr of the training script.
Send the batch script.

sbatch scripts/prompt_tuning.slurm

After the training script has run, the model will be saved to the folder indicated in output_dir. Run the test script with test.slurm on this fine-tuned model.

test.slurm

This is the entry point for the various test scripts. Note that the result of a test is stored in result.{normalized model name}.txt. The normalized model name is the model name with / replaced by . and any extra .'s removed where necessary, e.g. the result of the test on ./google.gemma-2-2b-it is stored in result.google.gemma-2-2b-it.txt.
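To predict where a test result will land, the naming rule can be approximated as below. This is a sketch under assumptions: the function names are hypothetical, and "removing the extra .'s" is interpreted here as collapsing repeated dots and stripping leading and trailing dots, which reproduces the example above but may differ from what the test script actually does in other cases.

```python
import re


def normalized_model_name(model_name_or_path: str) -> str:
    """Approximate the normalization described in the text (assumed rule)."""
    dotted = model_name_or_path.replace("/", ".")
    dotted = re.sub(r"\.{2,}", ".", dotted)  # assumption: collapse runs of dots
    return dotted.strip(".")                 # assumption: drop leading/trailing dots


def result_file(model_name_or_path: str) -> str:
    """Name of the file in which the test result is expected to be stored."""
    return f"result.{normalized_model_name(model_name_or_path)}.txt"


print(result_file("./google.gemma-2-2b-it"))  # result.google.gemma-2-2b-it.txt
print(result_file("google/gemma-2-2b-it"))    # result.google.gemma-2-2b-it.txt
```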

Tic-tac-toe

To run a test script on the game of tic-tac-toe, edit the entry point command to:

python scripts/test.py \
    --model_name_or_path <model name> \
    --trust_remote_code <True/False> \
    --game ttt

Replacing

<model name> with a model available on Hugging Face
<True/False> with the actual value: use False for models from Microsoft, e.g. microsoft/Phi-3-mini-4k-Instruct and microsoft/phi-4, and True for all other models.

Ultimate tic-tac-toe and connect-4

To run a test script on the game of ultimate tic-tac-toe or connect-4, edit the entry point command to:

python scripts/test.py \
    --mode <ppo/prompt-tuning> \
    --model_name_or_path <model name> \
    --trust_remote_code <True/False> \
    --game <ult-ttt/connect-4>

Replacing

<ppo/prompt-tuning> with the actual mode. Use ppo if testing a model from hugging face or a model after PPO. Use prompt-tuning if testing a model after prompt tuning. Note that prompt tuning testing requires loading the model in a different way, hence the split in mode.
<model name> with a model available on Hugging Face, or a fine-tuned model available in a local directory, e.g. ./google.gemma-2-2b-it.
<True/False> with the actual value: use False for models from Microsoft, e.g. microsoft/Phi-3-mini-4k-Instruct and microsoft/phi-4, and True for all other models.
<ult-ttt/connect-4> with the actual game to test the model on. Use ult-ttt for ultimate tic-tac-toe, use connect-4 for connect-4.

Xiangqi

To run a test script on the game of xiangqi, edit the entry point command to:

python scripts/test.py \
    --step <vls/pm> \
    --model_name_or_path <model name> \
    --trust_remote_code <True/False> \
    --game xiangqi

Replacing

<vls/pm> with the metric to test the LLM on. Use vls to test for valid starting positions; this can also be used to test the model before fine-tuning. Use pm to test for piece movement.
<model name> with a model available on Hugging Face, or a fine-tuned model available in a local directory, e.g. ./google.gemma-2-2b-it.
<True/False> with the actual value: use False for models from Microsoft, e.g. microsoft/Phi-3-mini-4k-Instruct and microsoft/phi-4, and True for all other models.
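The three command variants above differ only in which extra flag each game needs, which can be summarised in a small helper. This is a hedged sketch, not part of the repository: build_test_command and trust_remote_code_for are hypothetical names, and treating a local directory such as ./microsoft.phi-4 as a Microsoft model is an assumption. Only the flags shown in the commands above are used.

```python
from typing import List, Optional


def trust_remote_code_for(model_name: str) -> bool:
    """False for Microsoft models, True otherwise (rule stated in the text)."""
    # Assumption: a local path like ./microsoft.phi-4 also counts as a Microsoft model.
    name = model_name.lstrip("./").lower()
    return not name.startswith("microsoft")


def build_test_command(game: str, model_name: str,
                       mode: Optional[str] = None,
                       step: Optional[str] = None) -> List[str]:
    """Assemble the scripts/test.py command line for a given game."""
    cmd = ["python", "scripts/test.py"]
    if game in ("ult-ttt", "connect-4"):
        if mode not in ("ppo", "prompt-tuning"):
            raise ValueError("mode must be 'ppo' or 'prompt-tuning'")
        cmd += ["--mode", mode]
    elif game == "xiangqi":
        if step not in ("vls", "pm"):
            raise ValueError("step must be 'vls' or 'pm'")
        cmd += ["--step", step]
    elif game != "ttt":  # tic-tac-toe takes neither --mode nor --step
        raise ValueError(f"unknown game: {game!r}")
    cmd += ["--model_name_or_path", model_name,
            "--trust_remote_code", str(trust_remote_code_for(model_name)),
            "--game", game]
    return cmd


print(" ".join(build_test_command("xiangqi", "./google.gemma-2-2b-it", step="pm")))
```

The resulting command is what would replace the entry point command inside test.slurm.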
