Commit

add extended abstract to README
zenglingqi647 committed Dec 24, 2023
1 parent 202b306 commit 7bcf1df
Showing 5 changed files with 48 additions and 45 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,7 +3,8 @@
.vscode
setup.sh
rl-starter-files/storage
rl-starter-files/storag
rl-starter-files/storag*
rl-starter-files/submissions
ctrl.sh
rl-starter-files/evaluate/
rl-starter-files/log/
56 changes: 15 additions & 41 deletions README.md
@@ -1,3 +1,18 @@
# Final Project for COMPSCI 285
## Extended Abstract

In the dynamic field of Reinforcement Learning (RL) research, mastering long-horizon sparse-reward tasks remains a formidable challenge. Traditional RL agents, lacking prior knowledge, rely solely on sparse environmental rewards to discern effective actions. In contrast, humans leverage past knowledge to efficiently adapt and accomplish new tasks, a capability that current RL methodologies often lack.

Recent advancements in Large Language Models (LLMs) like GPT and BERT have showcased their remarkable ability to encode vast world knowledge and perform contextual reasoning. However, LLMs are not inherently grounded in specific tasks or environments, and directly using them as primary agents can lead to uncertainty and instability. Moreover, the computational demands of these pre-trained models, with billions or trillions of parameters, make them impractical for local deployment. Concurrently, Hierarchical Reinforcement Learning (HRL) methods have shown promise in managing complex tasks by exploiting their hierarchical nature, central to which is an effective high-level planning policy that can reason about composing skills.

Our work leverages the planning capabilities of LLMs to augment a Deep Q-Network (DQN) planner within an HRL framework. We first train a set of basic skill agents using curriculum learning in various BabyAI environments. These skills are then composed using a DQN planner, forming the basis of our hierarchical approach. The DQN planner is further augmented with an LLM-matching reward bonus, based on the similarity between its decisions and those suggested by the LLM during training. This integration offers three key advantages:
1. It eliminates the need for an LLM during evaluation, which is beneficial in environments with limited internet access, reduces LLM API costs, and avoids rate-limiting issues.
2. The DQN's learning process is accelerated by providing additional signals that would typically require manual specification.
3. With sufficient experience, the system retains the potential to outperform pure LLM-based planners.

We benchmark our method against various baseline models in multiple complex test environments. These include models trained using Proximal Policy Optimization (PPO), Advantage Actor Critic (A2C), a standalone DQN planner, and a pure LLM planner. Our approach demonstrates improved performance over these baselines in certain scenarios and can even surpass a pure LLM-based planner due to the DQN's additional optimization. Both the LLM and DQN are shown to contribute to the model's performance through ablation studies. However, our method's performance is not consistently optimal, highlighting future research directions such as further training and benchmarking against state-of-the-art models in the BabyAI environment.


# Directory structure
experimental-code:
Draft code; not actively used
@@ -12,45 +27,4 @@ Setting up the repository:
```
cd rl-starter-files
pip install -r requirements.txt
```

Current training script:
```
cd rl-starter-files/
python -m scripts.train --algo ppo --env BabyAI-GoToImpUnlock-v0 --model GoToImpUnlock0.0005Ask --text --save-interval 10 --frames 250000 --gpt
```
The problem is that even an ask probability of 0.0005 is still problematic: training takes a very long time.

# TODO
### Baselines
Basic:
> PPO, A2C only
Exploration(?):
> RND: https://opendilab.github.io/DI-engine/12_policies/rnd.html
> BeBold, NovelD: https://github.com/tianjunz/NovelD
> DEIR

### **Update**
- Bash scripts for experiments on different BabyAI and MiniGrid environments can be found in `babyai.sh` and `minigrid.sh`.

- The reshaped reward, with GPT predicting either a single action or the next few actions (currently hardcoded to 10), is implemented and merged into `train.py` and the `utils` folder.

- Added `eval2excel.py` to run evaluation and convert the results to Excel files.


To run the LLaMA 2 interface:
```
/data1/lzengaf/cs285/proj/minigrid/experimental-code/llm-interface/llama2_interface.py
```
first run:
```
pip install langchain cmake
export CMAKE_ARGS="-DLLAMA_METAL=on"
FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
curl https://ollama.ai/install.sh | sh
```
23 changes: 23 additions & 0 deletions file.py
@@ -0,0 +1,23 @@
# import os
# import shutil


# Cleanup helper (commented out): remove `status.pt` from each run directory if
# present; otherwise delete the run directory entirely.
# root = '/data1/lzengaf/cs285/proj/goto/storage'
# for dir in os.listdir(root):
#     if os.path.exists(os.path.join(root, dir, 'status.pt')):
#         os.remove(os.path.join(root, dir, 'status.pt'))
#         print('removed status.pt in {}'.format(dir))
#     else:
#         print('------- no status.pt in {}'.format(dir))
#         shutil.rmtree(os.path.join(root, dir))


\paragraph{DQN Planner:} The DQN planner operates as a conventional reinforcement learning agent. It outputs discrete actions, each corresponding to a specific skill and goal combination. The DQN is trained to maximize a reward signal based on the performance of the selected skills in achieving the set goals.
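
A minimal sketch of how such a discrete action space can be enumerated; the skill names, colors, and object types below are illustrative assumptions, not the exact vocabularies used in this repository:

```python
from itertools import product

# Hypothetical skill and goal vocabularies (assumed for illustration).
SKILLS = ["goto", "pickup", "open"]
COLORS = ["red", "green", "blue", "grey", "purple", "yellow"]
OBJECTS = ["ball", "box", "key", "door"]

# One discrete DQN action per (skill, color, object-type) combination.
ACTIONS = list(product(SKILLS, COLORS, OBJECTS))

def action_to_goal(action_index: int) -> tuple:
    """Map a DQN output index back to its (skill, color, object) combination."""
    return ACTIONS[action_index]

print(len(ACTIONS))       # size of the DQN's discrete action space: 72
print(action_to_goal(0))  # ('goto', 'red', 'ball')
```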

\paragraph{LLM Integration:} The LLM is utilized to provide strategic guidance to the DQN planner. Unlike the DQN, which is queried at fixed environment step intervals, the LLM is queried only during the training phase of the DQN planner. The LLM's output is validated so that every word comes from a pre-defined dictionary, which makes the output easier to parse and harder to deviate from the desired format.
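
A minimal sketch of this validation step, assuming the LLM replies with a short whitespace-separated phrase such as "pickup red key"; the vocabulary here is illustrative, not the repository's actual dictionary:

```python
# Allowed vocabulary; in practice this would mirror the planner's skill/goal dictionaries.
ALLOWED_WORDS = {
    "goto", "pickup", "open",
    "red", "green", "blue", "grey", "purple", "yellow",
    "ball", "box", "key", "door",
}

def validate_llm_output(text: str):
    """Return the token list if every word is in the dictionary, else None (reject the suggestion)."""
    tokens = text.lower().strip().split()
    if tokens and all(tok in ALLOWED_WORDS for tok in tokens):
        return tokens
    return None

print(validate_llm_output("Pickup red key"))   # ['pickup', 'red', 'key']
print(validate_llm_output("grab the thingy"))  # None -> rejected; re-query or ignore
```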

\paragraph{Query Mechanism:} The planner queries the DQN at fixed environment step intervals for immediate decision-making. In contrast, the LLM is queried after a predetermined number of DQN queries.
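
The cadence could look like the following sketch; `--ask-every 4` appears in the training scripts in this commit, but the LLM-query interval used here is an assumed value:

```python
ASK_EVERY = 4             # query the DQN planner every 4 environment steps (mirrors --ask-every 4)
LLM_EVERY_N_QUERIES = 10  # query the LLM once per 10 DQN queries (assumed value)

dqn_queries = 0
for env_step in range(40):
    if env_step % ASK_EVERY == 0:
        dqn_queries += 1
        ask_llm = (dqn_queries % LLM_EVERY_N_QUERIES == 0)
        print(f"step {env_step:2d}: query DQN" + (" + query LLM" if ask_llm else ""))
```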

\paragraph{Reward Shaping:} A novel aspect of our approach is the reshaping of the planner reward. A reward bonus is given when the DQN's output matches the LLM's recommendation. The reward calculation is as follows: a match in chosen skills yields a reward of 1; similarly, matches in object color and type each contribute a reward of 1. Non-matches receive no reward. The final reward is the average of these individual rewards, promoting alignment with LLM guidance.
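
Concretely, the bonus is the average of three indicator terms. A sketch, assuming both the DQN's choice and the LLM's recommendation are represented as (skill, color, object-type) triples; the function name is ours:

```python
def llm_match_bonus(dqn_choice, llm_choice):
    """Average of three indicator rewards: skill match, object color match, object type match."""
    skill_r = 1.0 if dqn_choice[0] == llm_choice[0] else 0.0
    color_r = 1.0 if dqn_choice[1] == llm_choice[1] else 0.0
    type_r = 1.0 if dqn_choice[2] == llm_choice[2] else 0.0
    return (skill_r + color_r + type_r) / 3.0

# Skill and color agree, object type differs -> bonus of 2/3.
print(llm_match_bonus(("goto", "red", "ball"), ("goto", "red", "key")))  # 0.666...
```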

\paragraph{Training and Updating:} The DQN is continuously trained and updated based on the reshaped reward signal. This training process incorporates feedback from both the environment and the LLM, ensuring that the DQN's policy evolves to effectively integrate the strategic guidance provided by the LLM.
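
A sketch of how the bonus could be folded into the reward the DQN learns from; the additive combination and the weight are assumptions, not necessarily the exact scheme used here:

```python
BONUS_WEIGHT = 1.0  # assumed scale for the LLM-matching bonus

def reshaped_reward(env_reward: float, llm_bonus: float) -> float:
    """Combine the sparse environment reward with the LLM-matching bonus (additive form assumed)."""
    return env_reward + BONUS_WEIGHT * llm_bonus

# Example: no environment reward yet, but a 2/3 match with the LLM's suggestion.
print(reshaped_reward(0.0, 2.0 / 3.0))  # 0.666...
# The transition (s, a, reshaped_r, s', done) is then stored in the replay buffer
# and the DQN is updated with the usual TD target.
```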
4 changes: 2 additions & 2 deletions scripts/baseline_ppo.sh
@@ -13,12 +13,12 @@ cd ../rl-starter-files
# python -m scripts.train --algo ppo --env BabyAI-UnblockPickup-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1
# python -m scripts.train --algo a2c --env BabyAI-UnblockPickup-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1

for i in BabyAI-GoToObjMaze-v0 BabyAI-UnblockPickup-v0 BabyAI-GoToObjMazeOpen-v0 BabyAI-Pickup-v0 BabyAI-Pickup-v0
for i in BabyAI-GoTo-v0
do
python -m scripts.train --algo a2c --env $i --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1
done

for i in BabyAI-UnblockPickup-v0 BabyAI-GoToObjMazeOpen-v0 BabyAI-Pickup-v0 BabyAI-Pickup-v0
for i in BabyAI-GoTo-v0
do
python -m scripts.train --algo ppo --env $i --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1
done
7 changes: 6 additions & 1 deletion scripts/dqn_augmented_planner.sh
@@ -1,13 +1,18 @@
cd ../rl-starter-files

# tmux kill-session

# tmux attach -t
# python -m scripts.train --algo base --env BabyAI-GoToImpUnlock-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 2 --dqn-batch-size 64

# python -m scripts.train --algo base --env BabyAI-GoToSeq-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 2 --dqn-batch-size 64
# BabyAI-GoTo-v0

python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4
python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4 --model BabyAI-GoToObjMazeOpen-v0_base_rec20_f1000000_fp40_seed1_llmplannergpt_askevery4.0_23-12-13-17-48-29

python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4 --model BabyAI-GoToObjMaze-v0_base_rec20_f1000000_fp40_seed1_llmplannergpt_askevery4.0_23-12-13-17-48-39

python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4 --model BabyAI-GoTo-v0_base_rec20_f1000000_fp40_seed1_llmplannergpt_askevery4.0_23-12-13-17-48-36
# python -m scripts.train --algo base --env BabyAI-UnblockPickup-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 3 --dqn-batch-size 64

# python -m scripts.train --algo base --env BabyAI-GoToObjMazeOpen-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 3 --dqn-batch-size 64
