Recent Large Language Models (LLMs) are optimized for general-purpose tasks, but they are not guaranteed to follow the rules of a given task. Prompt engineering, the practice of curating prompts for a task, is shown to improve the rate of rule following only by a small margin. In this report, we demonstrate that our curriculum learning method can effectively increase the rate of rule following across different LLMs. In particular, two-step curriculum learning, where each step involves proximal policy optimization (PPO) and low-rank adaptation (LoRA), is the most effective. For simpler tasks, one step of PPO with LoRA is sufficient. We also demonstrate that combining PPO and LoRA requires the right hyper-parameter values to achieve high rates of rule following.
- Install the necessary Python libraries: `pip install -r requirements.txt`
- All training and testing scripts are run using the SLURM workload manager. I ran the experiments on SoC Compute Clusters. The cluster must have the following GPUs:
  - `h100-47`: used to run tests of rule following
  - `h100-96`: used to run fine-tuning experiments (PPO/prompt tuning/curriculum learning)
Script entry points are in the `scripts/` folder.
This is the experiment on the quality of LLM outputs. It evaluates how much the quality of outputs changes after the LLM is fine-tuned to follow rules. Note that this experiment is only run on the game of ultimate tic-tac-toe.
The source code of the evaluator used in this experiment is in nknguyenhc/ultimate-tictactoe/tree/fyp. The compiled JAR file has been committed to this directory.
To run one experiment:
- Run a test script with `scripts/test.slurm` for the game of ultimate tic-tac-toe, before or after fine-tuning. Obtain the output log (as indicated in the `--output` SBATCH parameter) and put the log in this directory, renaming it to remove the `result.` prefix and `.txt` suffix, e.g. rename `result.LiquidAI.LFM2-350M.txt` to `LiquidAI.LFM2-350M`.
- In `evaluate.slurm`, edit the last argument of the bash command to point to the output log you have just put in the current directory, e.g. `LiquidAI.LFM2-350M`.
- Optionally, update the SBATCH parameters `--output` and `--error`. These will be the files containing stdout and stderr of this evaluation script.
- Send the batch script: `sbatch evaluate.slurm`

The result of the evaluation is then stored in `result.{model name}.txt`, e.g. `result.LiquidAI.LFM2-350M.txt`.
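The renaming rule in the steps above can be sketched in Python. This is only an illustration of the naming convention described here; the helper names `evaluation_log_name` and `evaluation_result_name` are hypothetical and are not part of the repository.

```python
from pathlib import Path


def evaluation_log_name(test_log: str) -> str:
    """Strip the 'result.' prefix and '.txt' suffix from a test log filename."""
    name = Path(test_log).name
    if not (name.startswith("result.") and name.endswith(".txt")):
        raise ValueError(f"unexpected test log name: {name!r}")
    return name[len("result."):-len(".txt")]


def evaluation_result_name(model_name: str) -> str:
    """Name of the file in which the evaluation result is stored."""
    return f"result.{model_name}.txt"


if __name__ == "__main__":
    log_name = evaluation_log_name("result.LiquidAI.LFM2-350M.txt")
    print(log_name)                          # name to pass to evaluate.slurm
    print(evaluation_result_name(log_name))  # where the evaluation result lands
```

The renamed log and the evaluation result differ only by the `result.` prefix and `.txt` suffix, so the evaluation output reuses the name that the original test log had before renaming.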
This folder contains critical components of the fine-tuning processes (PPO/prompt tuning/curriculum learning).
- `c_dataset.py`: dataset for connect-4 game, for PPO
- `c_prompt_tuning_dataset.py`: dataset for connect-4 game, for prompt tuning
- `c_reward.py`: reward model for connect-4, for both PPO and prompt tuning
- `cc_dataset.py`: dataset for xiangqi game, for PPO and final step of curriculum learning
- `cc_piece_movement_dataset.py`: dataset for xiangqi game, for piece movement step of curriculum learning
- `cc_reward.py`: reward model for xiangqi game, for PPO and final step and piece movement step of curriculum learning
- `cc_valid_start_dataset.py`: dataset for xiangqi game, for valid starting position step of curriculum learning
- `cc_valid_start_reward.py`: reward model for xiangqi game, for valid starting position step of curriculum learning
- `ult_ttt_dataset.py`: dataset for ultimate tic-tac-toe game, for PPO
- `ult_ttt_prompt_tuning_dataset.py`: dataset for ultimate tic-tac-toe game, for prompt tuning
- `ult_ttt_reward.py`: reward model for ultimate tic-tac-toe game, for both PPO and prompt tuning
This folder contains code for game logic, and the accompanying test scripts (using the `unittest` module).
- Connect-4:
  - `connect_4.py`: Main logic for the game
  - `connect_4_config.py`: Configuration for this game (width, height, steal rules, how many pieces in a row to win). Note that the experiments have only been carried out for the current configuration indicated.
  - `connect_4_test.py`: Test cases for this game
- Tic-tac-toe:
  - `ttt.py`: Main logic for this game
- Ultimate tic-tac-toe:
  - `ult_ttt.py`: Main logic for this game
  - `ult_ttt_test.py`: Test cases for this game
- Xiangqi:
  - `xiangqi.py`: Main logic for this game
  - `xiangqi_test.py`: Test cases for this game
This folder contains miscellaneous scripts to analyse training results and plot the graphs found in my FYP report.
- `analysis.py`: Run analysis and plot graphs from a run log of fine-tuning (e.g. `result.ppo.meta-llama.Llama-3.1-8B-Instruct.out`)
- `cc_stage_graph.py`: Plot analysis graphs of the 3-stage curricula
- `overall.py`: Plot performance graphs of different methods on each game
This folder contains various entry points for fine-tuning and rule following tests.
This is our main experiment, which is on curriculum learning. The script runs one step within a curriculum. Hence, to run a full curriculum, you need to run this script multiple times, once per step.
- Update the `output_dir` parameter. This will be the name of the folder containing the model. If the folder is not yet created, the script will create a new folder.
- Update the `model_name_or_path` parameter to the model that you want to fine-tune. For the first step, this must be a model available on Hugging Face, e.g. `LiquidAI/LFM2-350M` points to https://huggingface.co/LiquidAI/LFM2-350M. For each subsequent step, this must point to a local directory containing the model from the previous training step.
- Update the `step` parameter to indicate the training step.
  - Use `vls` to train on valid starting positions.
  - Use `pm` to train on piece movement.
  - Use `final` to train on generating full moves.
- Optionally, update the SBATCH parameters `--output` and `--error`. These will be the files containing stdout and stderr of the training script.
- Send the batch script: `sbatch scripts/cl.slurm`

After each step, two tests are run:
- Testing on valid starting positions and valid moves. The result is stored in `result.valid_start.{output_dir}.txt`.
- Testing on piece movement. The result is stored in `result.piece_movement.{output_dir}.txt`.
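Because each step consumes the model produced by the previous one, it can help to write down the parameter values for every run before editing `scripts/cl.slurm`. The sketch below only plans those values; it does not submit any job. The names `CurriculumStep` and `plan_curriculum`, the `{run_name}.{step}` folder naming, and the assumption that `output_dir` is reachable as `./{output_dir}` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

VALID_STEPS = ("vls", "pm", "final")  # step names as listed above


@dataclass(frozen=True)
class CurriculumStep:
    step: str
    model_name_or_path: str
    output_dir: str

    def result_files(self) -> tuple[str, str]:
        """Result files written by the two tests that run after the step."""
        return (
            f"result.valid_start.{self.output_dir}.txt",
            f"result.piece_movement.{self.output_dir}.txt",
        )


def plan_curriculum(base_model: str, steps: list[str], run_name: str) -> list[CurriculumStep]:
    """Chain steps so that each one starts from the previous step's output folder."""
    plan = []
    model = base_model  # first step: a model available on Hugging Face
    for step in steps:
        if step not in VALID_STEPS:
            raise ValueError(f"unknown step {step!r}, expected one of {VALID_STEPS}")
        output_dir = f"{run_name}.{step}"  # hypothetical naming scheme
        plan.append(CurriculumStep(step, model, output_dir))
        model = f"./{output_dir}"  # assumed local path of the saved model
    return plan


if __name__ == "__main__":
    for s in plan_curriculum("LiquidAI/LFM2-350M", ["vls", "pm", "final"], "lfm2-cl"):
        print(s.step, s.model_name_or_path, s.output_dir, s.result_files())
```

Each printed row corresponds to one submission of `sbatch scripts/cl.slurm` with `step`, `model_name_or_path`, and `output_dir` set to the listed values.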
This is the experiment of running PPO with LoRA on ultimate tic-tac-toe, connect-4 or xiangqi.
- Update the `game` parameter to the game that you want to test on (`ult-ttt`, `connect-4` or `xiangqi`).
- Update the `model_name_or_path` parameter to the model that you want to fine-tune. This model must be available on Hugging Face, e.g. `meta-llama/Llama-3.1-8B-Instruct` points to https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. You may use the current value indicated in the script to get started.
- Update the `trust_remote_code` parameter to the boolean value to use when loading the LLM. Use `True` for all models except those from `microsoft` (e.g. `microsoft/Phi-3-mini-4k-Instruct` and `microsoft/phi-4`), for which you should use `False`.
- Optionally, update the `output_dir` parameter. This will be the name of the folder containing the model. If the folder is not yet created, the script will create a new folder.
- Optionally, update the SBATCH parameters `--output` and `--error`. These will be the files containing stdout and stderr of the training script.
- Send the batch script: `sbatch script.slurm`

After the training script has run, the model will be saved to the folder indicated in `output_dir`. Run the test script with `test.slurm` on this fine-tuned model.
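The `trust_remote_code` rule above can be expressed as a small helper. This is a sketch of the rule as stated in this README, not code from the repository. The function name is hypothetical, and it only inspects Hugging Face identifiers of the form `organisation/model`, so the value for a local fine-tuned folder still has to be chosen by hand.

```python
def trust_remote_code_for(model_name: str) -> bool:
    """Return the README's suggested trust_remote_code value for a hub model id."""
    organisation = model_name.split("/", 1)[0]
    # README rule: False for models from microsoft, True for everything else.
    return organisation != "microsoft"


if __name__ == "__main__":
    for name in ("meta-llama/Llama-3.1-8B-Instruct", "microsoft/phi-4"):
        print(name, trust_remote_code_for(name))
```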
This is the experiment of running PPO with prefix tuning on ultimate tic-tac-toe and connect-4.
- Update the `game` parameter to the game that you want to test on (`ult-ttt` or `connect-4`).
- Update the `model_name_or_path` parameter to the model that you want to fine-tune. This model must be available on Hugging Face, e.g. `google/gemma-2-2b-it` points to https://huggingface.co/google/gemma-2-2b-it. You may use the current value indicated in the script to get started.
- Update the `trust_remote_code` parameter to the boolean value to use when loading the LLM. Use `True` for all models except those from `microsoft` (e.g. `microsoft/Phi-3-mini-4k-Instruct` and `microsoft/phi-4`), for which you should use `False`.
- Optionally, update the `output_dir` parameter. This will be the name of the folder containing the model. If the folder is not yet created, the script will create a new folder.
- Optionally, update the SBATCH parameters `--output` and `--error`. These will be the files containing stdout and stderr of the training script.
- Send the batch script: `sbatch script.slurm`

After the training script has run, the model will be saved to the folder indicated in `output_dir`. Run the test script with `test.slurm` on this fine-tuned model.
This is the entry point for the various test scripts. Note that the result of a test is stored in `result.{normalized model name}.txt`. The normalized model name is the model name with `/` replaced by `.`, removing the extra `.`'s where necessary, e.g. the result of the test on `./google.gemma-2-2b-it` is stored in `result.google.gemma-2-2b-it.txt`.
- Tic-tac-toe

  To run a test script on the game of tic-tac-toe, edit the entry point command to:

      python scripts/test.py \
        --model_name_or_path <model name> \
        --trust_remote_code <True/False> \
        --game ttt

  Replacing:
  - `<model name>` with a model available on Hugging Face.
  - `<True/False>` with the actual value: use `True` for all models, `False` for models from `microsoft`, e.g. `microsoft/Phi-3-mini-4k-Instruct` and `microsoft/phi-4`.
- Ultimate tic-tac-toe and connect-4

  To run a test script on the game of ultimate tic-tac-toe or connect-4, edit the entry point command to:

      python scripts/test.py \
        --mode <ppo/prompt-tuning> \
        --model_name_or_path <model name> \
        --trust_remote_code <True/False> \
        --game <ult-ttt/connect-4>

  Replacing:
  - `<ppo/prompt-tuning>` with the actual mode. Use `ppo` if testing a model from Hugging Face or a model after PPO. Use `prompt-tuning` if testing a model after prompt tuning. Note that prompt tuning testing requires loading the model in a different way, hence the split in mode.
  - `<model name>` with a model available on Hugging Face, or a fine-tuned model available in a local directory, e.g. `./google.gemma-2-2b-it`.
  - `<True/False>` with the actual value: use `True` for all models, `False` for models from `microsoft`, e.g. `microsoft/Phi-3-mini-4k-Instruct` and `microsoft/phi-4`.
  - `<ult-ttt/connect-4>` with the actual game to test the model on. Use `ult-ttt` for ultimate tic-tac-toe, and `connect-4` for connect-4.
- Xiangqi

  To run a test script on the game of xiangqi, edit the entry point command to:

      python scripts/test.py \
        --step <vls/pm> \
        --model_name_or_path <model name> \
        --trust_remote_code <True/False> \
        --game xiangqi

  Replacing:
  - `<vls/pm>` with the metric to test the LLM on. Use `vls` to test for valid starting positions; this can also be used to test the model before fine-tuning. Use `pm` to test for piece movement.
  - `<model name>` with a model available on Hugging Face, or a fine-tuned model available in a local directory, e.g. `./google.gemma-2-2b-it`.
  - `<True/False>` with the actual value: use `True` for all models, `False` for models from `microsoft`, e.g. `microsoft/Phi-3-mini-4k-Instruct` and `microsoft/phi-4`.
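The three command shapes above differ only in one game-specific flag (`--mode` for ultimate tic-tac-toe and connect-4, `--step` for xiangqi, neither for tic-tac-toe). The sketch below assembles the command line from those rules so the right flag is not forgotten. `build_test_command` is a hypothetical helper, not part of `scripts/test.py`, and it only builds the argument list without running anything.

```python
import shlex
from typing import Optional


def build_test_command(
    game: str,
    model_name_or_path: str,
    trust_remote_code: bool,
    mode: Optional[str] = None,
    step: Optional[str] = None,
) -> list[str]:
    """Assemble the scripts/test.py command for a game, following the rules above."""
    args = ["python", "scripts/test.py"]
    if game in ("ult-ttt", "connect-4"):
        if mode not in ("ppo", "prompt-tuning"):
            raise ValueError("ult-ttt and connect-4 need mode 'ppo' or 'prompt-tuning'")
        args += ["--mode", mode]
    elif game == "xiangqi":
        if step not in ("vls", "pm"):
            raise ValueError("xiangqi needs step 'vls' or 'pm'")
        args += ["--step", step]
    elif game != "ttt":
        raise ValueError(f"unknown game {game!r}")
    args += [
        "--model_name_or_path", model_name_or_path,
        "--trust_remote_code", str(trust_remote_code),
        "--game", game,
    ]
    return args


if __name__ == "__main__":
    command = build_test_command("xiangqi", "./google.gemma-2-2b-it", True, step="pm")
    print(shlex.join(command))  # paste into the entry point of the test script
```

Returning a list rather than a string keeps each argument separate, so the same value can be printed with `shlex.join` or handed to `subprocess.run` unchanged.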