Commit

add extended abstract to README
zenglingqi647 committed Dec 24, 2023
1 parent 202b306 commit 7bcf1df
Showing 5 changed files with 48 additions and 45 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,7 +3,8 @@
.vscode
setup.sh
rl-starter-files/storage
rl-starter-files/storag
rl-starter-files/storag*
rl-starter-files/submissions
ctrl.sh
rl-starter-files/evaluate/
rl-starter-files/log/
56 changes: 15 additions & 41 deletions README.md
@@ -1,3 +1,18 @@
# Final Project for COMPSCI 285
## Extended Abstract

In the dynamic field of Reinforcement Learning (RL) research, mastering long-horizon sparse-reward tasks remains a formidable challenge. Traditional RL agents, lacking prior knowledge, rely solely on sparse environmental rewards to discern effective actions. In contrast, humans leverage past knowledge to efficiently adapt and accomplish new tasks, a capability that current RL methodologies often lack.

Recent advancements in Large Language Models (LLMs) like GPT and BERT have showcased their remarkable ability to encode vast world knowledge and perform contextual reasoning. However, LLMs are not inherently grounded in specific tasks or environments, and directly using them as primary agents can lead to uncertainty and instability. Moreover, the computational demands of these pre-trained models, with billions or trillions of parameters, make them impractical for local deployment. Concurrently, Hierarchical Reinforcement Learning (HRL) methods have shown promise in managing complex tasks by exploiting their hierarchical nature, central to which is an effective high-level planning policy that can reason about composing skills.

Our work leverages the planning capabilities of LLMs to augment a Deep Q-Network (DQN) planner within an HRL framework. We first train a set of basic skill agents using curriculum learning in various BabyAI environments. These skills are then composed using a DQN planner, forming the basis of our hierarchical approach. The DQN planner is further augmented with an LLM-matching reward bonus, based on the similarity between its decisions and those suggested by the LLM during training. This integration offers three key advantages:
1. It eliminates the need for an LLM during evaluation, which is beneficial in environments with limited internet access, reduces LLM API costs, and avoids rate-limiting issues.
2. The DQN's learning process is accelerated by providing additional signals that would typically require manual specification.
3. With sufficient experience, the system retains the potential to outperform pure LLM-based planners.

We benchmark our method against various baseline models in multiple complex test environments. These include models trained using Proximal Policy Optimization (PPO), Advantage Actor Critic (A2C), a standalone DQN planner, and a pure LLM planner. Our approach demonstrates improved performance over these baselines in certain scenarios and can even surpass a pure LLM-based planner due to the DQN's additional optimization. Both the LLM and DQN are shown to contribute to the model's performance through ablation studies. However, our method's performance is not consistently optimal, highlighting future research directions such as further training and benchmarking against state-of-the-art models in the BabyAI environment.


# Directory structure
experimental-code:
Draft code; not actively used
@@ -12,45 +27,4 @@ Setting up the repository:
```
cd rl-starter-files
pip install -r requirements.txt
```

Current training script:
```
cd rl-starter-files/
python -m scripts.train --algo ppo --env BabyAI-GoToImpUnlock-v0 --model GoToImpUnlock0.0005Ask --text --save-interval 10 --frames 250000 --gpt
```
The problem is that even an ask probability of 0.0005 is still problematic: training takes a very long time.

# TODO
### Baselines
Basic:
> PPO, A2C only
Exploration(?):
> RND: https://opendilab.github.io/DI-engine/12_policies/rnd.html
> BeBold, NovelD: https://github.com/tianjunz/NovelD
> DEIR

### **Update**
- Bash scripts for experiments on different BabyAI and MiniGrid environments can be found in `babyai.sh` and `minigrid.sh`.

- The reshaped reward, with GPT predicting either a single action or the next few actions (currently hardcoded to 10), is implemented and merged into `train.py` and the `utils` folder.

- Added `eval2excel.py` to run evaluation and convert the results to Excel files.


To run the LLaMA 2 interface:
```
/data1/lzengaf/cs285/proj/minigrid/experimental-code/llm-interface/llama2_interface.py
```
first run:
```
pip install langchain cmake
export CMAKE_ARGS="-DLLAMA_METAL=on"
FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
curl https://ollama.ai/install.sh | sh
```
23 changes: 23 additions & 0 deletions file.py
@@ -0,0 +1,23 @@
# import os
# import shutil


# Cleanup helper (commented out): remove `status.pt` from each run directory if
# present; otherwise delete the run directory entirely.
# root = '/data1/lzengaf/cs285/proj/goto/storage'
# for dir in os.listdir(root):
#     if os.path.exists(os.path.join(root, dir, 'status.pt')):
#         os.remove(os.path.join(root, dir, 'status.pt'))
#         print('removed status.pt in {}'.format(dir))
#     else:
#         print('------- no status.pt in {}'.format(dir))
#         shutil.rmtree(os.path.join(root, dir))


\paragraph{DQN Planner:} The DQN planner operates as a conventional reinforcement learning agent. It outputs discrete actions, each corresponding to a specific skill and goal combination. The DQN is trained to maximize a reward signal based on the performance of the selected skills in achieving the set goals.
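
A minimal sketch of how such a discrete action space can be enumerated; the skill names, colors, and object types below are illustrative assumptions, not the exact vocabularies used in this repository:

```python
from itertools import product

# Hypothetical skill and goal vocabularies (assumed for illustration).
SKILLS = ["goto", "pickup", "open"]
COLORS = ["red", "green", "blue", "grey", "purple", "yellow"]
OBJECTS = ["ball", "box", "key", "door"]

# One discrete DQN action per (skill, color, object-type) combination.
ACTIONS = list(product(SKILLS, COLORS, OBJECTS))

def action_to_goal(action_index: int) -> tuple:
    """Map a DQN output index back to its (skill, color, object) combination."""
    return ACTIONS[action_index]

print(len(ACTIONS))       # size of the DQN's discrete action space: 72
print(action_to_goal(0))  # ('goto', 'red', 'ball')
```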

\paragraph{LLM Integration:} The LLM is utilized to provide strategic guidance to the DQN planner. Unlike the DQN, which is queried at fixed environment step intervals, the LLM is queried only during the training phase of the DQN planner. The LLM's output is validated so that every word comes from a pre-defined dictionary, which makes the output easier to parse and harder to deviate from the desired format.
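
A minimal sketch of this validation step, assuming the LLM replies with a short whitespace-separated phrase such as "pickup red key"; the vocabulary here is illustrative, not the repository's actual dictionary:

```python
# Allowed vocabulary; in practice this would mirror the planner's skill/goal dictionaries.
ALLOWED_WORDS = {
    "goto", "pickup", "open",
    "red", "green", "blue", "grey", "purple", "yellow",
    "ball", "box", "key", "door",
}

def validate_llm_output(text: str):
    """Return the token list if every word is in the dictionary, else None (reject the suggestion)."""
    tokens = text.lower().strip().split()
    if tokens and all(tok in ALLOWED_WORDS for tok in tokens):
        return tokens
    return None

print(validate_llm_output("Pickup red key"))   # ['pickup', 'red', 'key']
print(validate_llm_output("grab the thingy"))  # None -> rejected; re-query or ignore
```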

\paragraph{Query Mechanism:} The planner queries the DQN at fixed environment step intervals for immediate decision-making. In contrast, the LLM is queried after a predetermined number of DQN queries.
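
The cadence could look like the following sketch; `--ask-every 4` appears in the training scripts in this commit, but the LLM-query interval used here is an assumed value:

```python
ASK_EVERY = 4             # query the DQN planner every 4 environment steps (mirrors --ask-every 4)
LLM_EVERY_N_QUERIES = 10  # query the LLM once per 10 DQN queries (assumed value)

dqn_queries = 0
for env_step in range(40):
    if env_step % ASK_EVERY == 0:
        dqn_queries += 1
        ask_llm = (dqn_queries % LLM_EVERY_N_QUERIES == 0)
        print(f"step {env_step:2d}: query DQN" + (" + query LLM" if ask_llm else ""))
```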

\paragraph{Reward Shaping:} A novel aspect of our approach is the reshaping of the planner reward. A reward bonus is given when the DQN's output matches the LLM's recommendation. The reward calculation is as follows: a match in chosen skills yields a reward of 1; similarly, matches in object color and type each contribute a reward of 1. Non-matches receive no reward. The final reward is the average of these individual rewards, promoting alignment with LLM guidance.
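
Concretely, the bonus is the average of three indicator terms. A sketch, assuming both the DQN's choice and the LLM's recommendation are represented as (skill, color, object-type) triples; the function name is ours:

```python
def llm_match_bonus(dqn_choice, llm_choice):
    """Average of three indicator rewards: skill match, object color match, object type match."""
    skill_r = 1.0 if dqn_choice[0] == llm_choice[0] else 0.0
    color_r = 1.0 if dqn_choice[1] == llm_choice[1] else 0.0
    type_r = 1.0 if dqn_choice[2] == llm_choice[2] else 0.0
    return (skill_r + color_r + type_r) / 3.0

# Skill and color agree, object type differs -> bonus of 2/3.
print(llm_match_bonus(("goto", "red", "ball"), ("goto", "red", "key")))  # 0.666...
```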

\paragraph{Training and Updating:} The DQN is continuously trained and updated based on the reshaped reward signal. This training process incorporates feedback from both the environment and the LLM, ensuring that the DQN's policy evolves to effectively integrate the strategic guidance provided by the LLM.
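
A sketch of how the bonus could be folded into the reward the DQN learns from; the additive combination and the weight are assumptions, not necessarily the exact scheme used here:

```python
BONUS_WEIGHT = 1.0  # assumed scale for the LLM-matching bonus

def reshaped_reward(env_reward: float, llm_bonus: float) -> float:
    """Combine the sparse environment reward with the LLM-matching bonus (additive form assumed)."""
    return env_reward + BONUS_WEIGHT * llm_bonus

# Example: no environment reward yet, but a 2/3 match with the LLM's suggestion.
print(reshaped_reward(0.0, 2.0 / 3.0))  # 0.666...
# The transition (s, a, reshaped_r, s', done) is then stored in the replay buffer
# and the DQN is updated with the usual TD target.
```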
4 changes: 2 additions & 2 deletions scripts/baseline_ppo.sh
@@ -13,12 +13,12 @@ cd ../rl-starter-files
# python -m scripts.train --algo ppo --env BabyAI-UnblockPickup-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1
# python -m scripts.train --algo a2c --env BabyAI-UnblockPickup-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1

for i in BabyAI-GoToObjMaze-v0 BabyAI-UnblockPickup-v0 BabyAI-GoToObjMazeOpen-v0 BabyAI-Pickup-v0 BabyAI-Pickup-v0
for i in BabyAI-GoTo-v0
do
python -m scripts.train --algo a2c --env $i --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1
done

for i in BabyAI-UnblockPickup-v0 BabyAI-GoToObjMazeOpen-v0 BabyAI-Pickup-v0 BabyAI-Pickup-v0
for i in BabyAI-GoTo-v0
do
python -m scripts.train --algo ppo --env $i --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size -1
done
7 changes: 6 additions & 1 deletion scripts/dqn_augmented_planner.sh
@@ -1,13 +1,18 @@
cd ../rl-starter-files

# tmux kill-session

# tmux attach -t
# python -m scripts.train --algo base --env BabyAI-GoToImpUnlock-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 2 --dqn-batch-size 64

# python -m scripts.train --algo base --env BabyAI-GoToSeq-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 2 --dqn-batch-size 64
# BabyAI-GoTo-v0

python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4
python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4 --model BabyAI-GoToObjMazeOpen-v0_base_rec20_f1000000_fp40_seed1_llmplannergpt_askevery4.0_23-12-13-17-48-29

python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4 --model BabyAI-GoToObjMaze-v0_base_rec20_f1000000_fp40_seed1_llmplannergpt_askevery4.0_23-12-13-17-48-39

python -m scripts.train --algo base --env BabyAI-GoTo-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 4 --model BabyAI-GoTo-v0_base_rec20_f1000000_fp40_seed1_llmplannergpt_askevery4.0_23-12-13-17-48-36
# python -m scripts.train --algo base --env BabyAI-UnblockPickup-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 3 --dqn-batch-size 64

# python -m scripts.train --algo base --env BabyAI-GoToObjMazeOpen-v0 --text --frames 1000000 --log-interval 10 --recurrence 20 --save-interval 15 --batch-size 1280 --procs 64 --frames-per-proc 40 --obs-size 0 --use-dqn --llm-augmented --llm-planner-variant gpt --ask-every 3 --dqn-batch-size 64
