
A Predator-Prey-Grass multi-agent gridworld environment, featuring dynamic spawning and deletion of agents and partial observability.

doesburg11/PredPreyGrass




Multi-Agent Reinforcement Learning (MARL)

The Predator-Prey-Grass gridworld is a multi-agent environment with dynamic deletion and spawning of partially observing agents. The environment and algorithms are implemented in two separate solutions:

  • PettingZoo solution: a single network for all agents (centralized learning), trained with the external Stable-Baselines3 PPO algorithm applied to the PettingZoo multi-agent environment (AECEnv).
  • RLlib solution: separate networks for Predators and Prey (decentralized learning), trained with the native RLlib PPO algorithm applied to the RLlib new API stack multi-agent environment (MultiAgentEnv).

Pred-Prey-Grass MARL with PettingZoo/SB3 PPO (centralized training)


Overview

The MARL environment predpregrass_base.py is implemented using PettingZoo, and the agents are trained with Stable-Baselines3 (SB3) PPO. In essence, this solution demonstrates how SB3 can be adapted for MARL using parallel environments and centralized training.

Environment dynamics

The learning agents, Predators (red) and Prey (blue), both expend energy as they move around and replenish it by eating. Prey eat Grass (green), and Predators eat Prey when they end up on the same grid cell. The eating agent obtains all the energy of the eaten resource. Predators die of starvation when their energy runs out; Prey die either of starvation or from being eaten by a Predator. Both types of learning agents reproduce asexually when their energy level, raised by eating, exceeds a certain threshold. In the base configuration, newly created agents are placed at random anywhere in the gridworld. The learning agents learn to move based on their partial observations of the environment (the transparent red and blue squares).

Configuration

Rewards (for stepping, eating, dying, and reproducing) are aggregated and can be adjusted in the environment configuration file.
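
Purely as an illustration (the actual key names and values live in predpreygrass/pettingzoo/config/config_predpreygrass.py and may differ), the reward settings might look something like this:

    # Hypothetical reward keys; consult config_predpreygrass.py for the real ones.
    reward_config = {
        "step_reward_predator": 0.0,           # per-step cost/reward for moving
        "step_reward_prey": 0.0,
        "catch_reward_predator": 0.0,          # reward for eating a Prey
        "eat_reward_prey": 0.0,                # reward for eating Grass
        "death_reward_predator": 0.0,          # penalty on starvation
        "death_reward_prey": 0.0,              # penalty on starvation or being eaten
        "reproduction_reward_predator": 10.0,  # illustrative non-zero value
        "reproduction_reward_prey": 10.0,
    }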

Training

Stable-Baselines3 is originally designed for single-agent training. In this solution, training therefore uses a single unified network for both Predators and Prey.

How SB3 PPO is used in the Predator-Prey-Grass Multi-Agent Setting

1. PettingZoo AEC to Parallel Conversion

  • The environment is initially implemented as an Agent Environment Cycle (AEC) environment using PettingZoo (predpregrass_aec.py, which inherits from predpregrass_base.py).
  • It is wrapped and converted into a parallel environment using aec_to_parallel() inside trainer.py, as sketched below.
  • This conversion enables multiple agents to take actions simultaneously rather than sequentially.
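
A minimal sketch of the conversion, assuming a hypothetical constructor for the AEC environment (the actual wrapping happens inside trainer.py):

    from pettingzoo.utils.conversions import aec_to_parallel

    # Hypothetical constructor; the actual AEC environment is defined in predpregrass_aec.py.
    aec_env = make_predpreygrass_aec_env()

    # After conversion, all agents submit their actions simultaneously at each step.
    parallel_env = aec_to_parallel(aec_env)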

2. Treating Multi-Agent Learning as a Single-Agent Problem

  • SB3 PPO expects a single-agent Gymnasium-style environment.
  • The converted parallel environment stacks observations and actions for all agents, making it appear as a single large observation-action space.
  • PPO then treats the multi-agent problem as a centralized learning problem, where all agents share one policy.
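
This treatment presumes that all agents share identical observation and action spaces. Continuing the sketch above, a quick sanity check on the parallel environment could look like this:

    # Centralized learning with one shared policy requires identical per-agent spaces.
    first_agent = parallel_env.possible_agents[0]
    assert all(
        parallel_env.observation_space(agent) == parallel_env.observation_space(first_agent)
        and parallel_env.action_space(agent) == parallel_env.action_space(first_agent)
        for agent in parallel_env.possible_agents
    )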

3. Performance Optimization with Vectorized Environments

  • The environment is further wrapped using SuperSuit:
    import supersuit as ss
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, num_vec_envs, num_cpus=num_cores, base_class="stable_baselines3")
  • This enables running multiple instances of the environment in parallel, significantly improving training efficiency.
  • The training process treats the multi-agent setup as a single centralized policy, where PPO learns from the collective experiences of all agents.
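
Putting the three steps together, the training setup might look roughly like the sketch below; the policy type, number of vectorized copies, and hyperparameters are placeholders rather than the values used in the repository's training script:

    import supersuit as ss
    from stable_baselines3 import PPO

    # parallel_env is the converted environment from step 1.
    vec_env = ss.pettingzoo_env_to_vec_env_v1(parallel_env)
    vec_env = ss.concat_vec_envs_v1(vec_env, 8, num_cpus=8, base_class="stable_baselines3")

    # One shared PPO network learns from the pooled experience of all agents.
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=10_000_000)
    model.save("ppo_predpreygrass")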

Pred-Prey-Grass MARL with RLlib new API stack (decentralized training)

Overview

Using only one network has its limitations, because Predators and Prey lack true specialization in their training. The RLlib new API stack framework circumvents this limitation, albeit at the cost of considerably more compute time.

Environment dynamics

The environment dynamics of the RLlib environment (predpregrass_rllib_env.py) are largely the same as in the PettingZoo environment. However, newly spawned agents are placed in the vicinity of their parent rather than at random anywhere in the gridworld. Under the hood, the implementation is somewhat different: agent data are stored in arrays rather than in a separate agent class. This is largely the result of experimenting with the compute time of the step function.
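
Purely as an illustration of the array-based bookkeeping (the actual attribute names in predpregrass_rllib_env.py may differ), agent state could be held in flat arrays like this:

    import numpy as np

    # Hypothetical array-based agent state; one slot per potential agent.
    max_agents = 100
    agent_positions = np.zeros((max_agents, 2), dtype=np.int32)  # (x, y) grid cell per slot
    agent_energies = np.zeros(max_agents, dtype=np.float32)      # current energy per slot
    agent_alive = np.zeros(max_agents, dtype=bool)               # active slots

    # Spawning or deleting an agent amounts to toggling a slot and updating its rows,
    # avoiding creation and garbage collection of per-agent Python objects in the step loop.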

Configuration

As in the PettingZoo environment, rewards can be adjusted in a separate environment configuration file.

Training

Training follows the RLlib new API stack protocol. The training configuration is more out-of-the-box than in the PettingZoo/SB3 solution, but it is much more applicable to MARL in general, and to decentralized training in particular.
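
A minimal sketch of what a decentralized, two-policy PPO configuration on the RLlib new API stack might look like; the policy names, agent-id prefixes, environment import path, and training loop are placeholders rather than the repository's actual settings:

    from ray.rllib.algorithms.ppo import PPOConfig

    # Placeholder import path; the actual environment class lives in predpregrass_rllib_env.py.
    from predpreygrass.rllib.predpregrass_rllib_env import PredPreyGrass

    config = (
        PPOConfig()
        .environment(env=PredPreyGrass, env_config={})
        .multi_agent(
            # Separate policies so Predators and Prey can specialize.
            policies={"predator_policy", "prey_policy"},
            policy_mapping_fn=lambda agent_id, *args, **kwargs: (
                "predator_policy" if agent_id.startswith("predator") else "prey_policy"
            ),
        )
    )

    algo = config.build()
    for _ in range(10):
        results = algo.train()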

A principal difference between the second (RLlib) solution and the first (PettingZoo/SB3) solution is that the concurrent agents become part of each other's environment, rather than being merged into one combined "super" agent. Since the environment of the first solution consists only of (static) grass objects, the environment dynamics of the second solution change dramatically. This is probably one of the reasons that the training time of the RLlib solution is a multiple of that of the PettingZoo/SB3 solution. This is, however, a hypothesis and subject to future investigation.

Emergent Behaviors from the PettingZoo/SB3 solution

Training the single-objective environment predpregrass_base.py with the SB3 PPO algorithm is an example of how elaborate behaviors can emerge from simple rules in agent-based models. In the MARL example displayed above, the learning agents obtain rewards solely through reproduction; all other reward options are set to zero in the environment configuration. Despite this relatively sparse reward structure, maximizing these rewards results in elaborate emergent behaviors such as:

  • Predators hunting Prey
  • Prey finding and eating grass
  • Predators hovering around grass to catch Prey
  • Prey trying to escape Predators

Moreover, these learned behaviors lead to more complex emergent dynamics at the ecosystem level: the trained agents display a classic Lotka–Volterra pattern over time.

More emergent behavior and findings are described on our website.

Installation

Editor used: Visual Studio Code 1.98.2 on Linux Mint 21.3 Cinnamon

  1. Clone the repository:
    git clone https://github.com/doesburg11/PredPreyGrass.git
  2. Open Visual Studio Code and execute:
    • Press ctrl+shift+p
    • Type and choose: "Python: Create Environment..."
    • Choose environment: Conda
    • Choose interpreter: Python 3.11.11 or higher
    • Open a new terminal
    • pip install -e .
  3. Install the following requirements:
    • pip install pettingzoo==1.24.3
    • pip install stable-baselines3[extra]==2.5.0
    • conda install -y -c conda-forge gcc=12.1.0
    • pip install supersuit==3.9.3
    • pip install ray[rllib]==2.43.0
    • pip install tensorboard==2.18.0

Getting started

Visualize a random policy with the PettingZoo/SB3 solution

In Visual Studio Code run: predpreygrass/pettingzoo/eval/evaluate_random_policy.py

Train and visualize a trained model using PPO from Stable-Baselines3

Adjust parameters accordingly in:

predpreygrass/pettingzoo/config/config_predpreygrass.py

In Visual Studio Code run:

predpreygrass/pettingzoo/train/train_sb3_ppo_parallel_wrapped_aec_env.py

To evaluate and visualize after training follow instructions in:

predpreygrass/pettingzoo/eval/evaluate_ppo_from_file_aec_env.py

Batch training and evaluating in one go:

predpreygrass/pettingzoo/eval/parameter_variation_train_wrapped_to_parallel_and_evaluate_aec.py

References