Rule-Bottleneck Reinforcement Learning: Joint Explanation and Decision Optimization for Resource Allocation with Language Agents

Run the code with the following command:

python sac_attention.py --llm="gpt-4o-mini"

Visualize the results with TensorBoard:

tensorboard --logdir=runs

Alternatively, run the code with the flag --track and visualize the results in Weights & Biases.
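For example, a tracked run (using only flags documented in the help output below) could look like:

python sac_attention.py --llm="gpt-4o-mini" --track --wandb-project-name=rulebots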

To see the full functionality of the code, run:

python sac_attention.py --help
usage: sac_attention.py [-h] [OPTIONS]

╭─ options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ -h, --help              show this help message and exit                                                                                            │
│ --exp-name STR          the name of this experiment (default: sac_attention)                                                                       │
│ --seed INT              seed of the experiment (default: 1)                                                                                        │
│ --torch-deterministic, --no-torch-deterministic                                                                                                    │
│                         if toggled, `torch.backends.cudnn.deterministic=False` (default: True)                                                     │
│ --cuda, --no-cuda       if toggled, cuda will be enabled by default (default: True)                                                                │
│ --track, --no-track     if toggled, this experiment will be tracked with Weights and Biases (default: False)                                       │
│ --wandb-project-name STR                                                                                                                           │
│                         the wandb's project name (default: rulebots)                                                                               │
│ --wandb-entity {None}|STR                                                                                                                          │
│                         the entity (team) of wandb's project (default: None)                                                                       │
│ --capture-video, --no-capture-video                                                                                                                │
│                         whether to capture videos of the agent performances (check out `videos` folder) (default: False)                           │
│ --log-frequency INT     the logging frequency of the algorithm (default: 1)                                                                        │
│ --log-examples-interval INT                                                                                                                        │
│                         the logging frequency of the examples (default: 20)                                                                        │
│ --resume, --no-resume   if toggled, tries to resume training from the latest checkpoint (default: False)                                           │
│ --ckpt-interval INT     the saving interval of the model (default: 1)                                                                              │
│ --overwrite-ckpt, --no-overwrite-ckpt                                                                                                              │
│                         if toggled and resuming is on, it will start fresh in resume mode, otherwise ignored (default: False)                      │
│ --env-id STR            the id of the environment (default: Uganda)                                                                                │
│ --num-envs INT          the number of parallel game environments (default: 4)                                                                      │
│ --agent {base_agent,llm_rules_agent,no_thoughts_agent}                                                                                             │
│                         the agent to use (default: llm_rules_agent)                                                                                │
│ --parallel-pipeline, --no-parallel-pipeline                                                                                                        │
│                         if toggled, the pipeline will be parallelized (default: True)                                                              │
│ --total-timesteps INT   total timesteps of the experiments (default: 1250)                                                                         │
│ --gamma FLOAT           the discount factor gamma (default: 0.95)                                                                                  │
│ --tau FLOAT             target smoothing coefficient (default: 1) (default: 1.0)                                                                   │
│ --batch-size INT        the batch size of sample from the reply memory (default: 32)                                                               │
│ --learning-starts INT   timestep to start learning (default: 64)                                                                                   │
│ --policy-lr FLOAT       the learning rate of the policy network optimizer (default: 0.0003)                                                        │
│ --q-lr FLOAT            the learning rate of the Q network network optimizer (default: 0.0003)                                                     │
│ --update-frequency FLOAT|INT                                                                                                                       │
│                         the frequency of training updates (default: 0.0625)                                                                        │
│ --warmup-updates INT    the number of warmup updates to the value function on the first iteration. (default: 64)                                   │
│ --target-network-frequency INT                                                                                                                     │
│                         the frequency of updates for the target networks (default: 64)                                                             │
│ --alpha FLOAT           Entropy regularization coefficient. (default: 0.01)                                                                        │
│ --autotune, --no-autotune                                                                                                                          │
│                         automatic tuning of the entropy coefficient (default: False)                                                               │
│ --target-entropy-scale FLOAT                                                                                                                       │
│                         coefficient for scaling the autotune entropy target (default: 0.89)                                                        │
│ --num-eval-steps INT    the number of steps to run in each eval environment per policy rollout (default: 64)                                       │
│ --eval-interval INT     the evaluation interval (default: 1)                                                                                       │
│ --eval-deterministic, --no-eval-deterministic                                                                                                      │
│                         if toggled, the evaluation will be deterministic (default: True)                                                           │
│ --rolling-rewards-window INT                                                                                                                       │
│                         the rolling rewards window (default: 64)                                                                                   │
│ --num-rules INT         The number of rules for rule-based LLM-only agent (default: 10)                                                            │
│ --llm                                                                                                                                              │
│ {google/gemma-2b-it,meta-llama/Llama-3.2-3B-Instruct-Turbo,meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo,meta-llama/Llama-3.3-70B-Instruct-Turbo,m… │
│                         the language model to use (default: gpt-4o-mini-huit)                                                                      │
│ --embedder-lm STR       the language model to use for embeddings (default: togethercomputer/m2-bert-80M-8k-retrieval)                              │
│ --embed-dim INT         the dimension of the embeddings (default: 768)                                                                             │
│ --hidden-dim INT        the hidden dimension of the networks (default: 16)                                                                         │
│ --buffer-collection-steps INT                                                                                                                      │
│                         the number of steps to collect data to the buffer (default: 64)                                                            │
│ --load-buffer, --no-load-buffer                                                                                                                    │
│                         if toggled, the agent will load the buffer from the pickle file if it exists (default: True)                               │
│ --buffer-size INT       the replay memory buffer size (default: 1024)                                                                              │
│ --compile-torch, --no-compile-torch                                                                                                                │
│                         Torch compile (default: False)                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
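As an illustration, the flags above can be combined into a single invocation; for instance, the following trains the rule-based LLM agent on the Uganda environment with tracking enabled:

python sac_attention.py --llm="gpt-4o-mini" --env-id=Uganda --agent=llm_rules_agent --num-envs=4 --total-timesteps=1250 --seed=1 --track

Apart from --llm and --track, every value shown matches the documented defaults, so this is equivalent to the shorter command at the top of this README with tracking turned on.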
