ML-Agents DodgeBall Extended Battle Scenario

Overview

The ML-Agents DodgeBall environment is a third-person cooperative shooter where players try to pick up as many balls as they can, then throw them at their opponents. It comprises two game modes: Elimination and Capture the Flag. In Elimination, each group tries to eliminate all members of the other group by hitting them with balls. In Capture the Flag, players try to steal the other team’s flag and bring it back to their base. In both modes, players can hold up to four balls, and dash to dodge incoming balls and go through hedges. You can find more information about the environment at the corresponding blog post.

In this project, we used the Elimination game-mode to explore modifying the DodgeBall environment to serve as a proxy for high-fidelity military simulations. We modified both the dodgeball agents' functionality and the arenas they were tested on in order to better approximate a real battle scenario. This document will detail the most significant changes we made and discuss the method we developed for reducing the number of training steps needed to learn an intelligent cooperative policy.

Installation and Play

To open this repository, you will need to install the Unity editor version 2020.2.6.

Clone this repository by running:

git clone https://github.com/calebkoresh/ml-agents-dodgeball-env-ICT

Open the root folder in Unity. Then, navigate to Assets/Dodgeball/Scenes/TitleScreen.unity, open it, and hit the play button to play against pretrained agents. You can also build this scene (along with the Elimination.unity and CaptureTheFlag.unity scenes) into a game build and play from there.

Scenes

This project provides eight scenes in Assets/Dodgeball/Scenes/. They are:

  • Large_Obs.unity

  • Large_Obs_Dense.unity

  • XL_Obs.unity

  • XL_Obs_Dense.unity

  • Large_WPM_Obs.unity

  • Large_WPM_Obs_Dense.unity

  • XL_WPM_Obs.unity

  • XL_WPM_Obs_Dense.unity

Obs marks the scenarios that include the modified observation space, as opposed to those that do not. Dense marks the scenarios with dense obstacles. WPM stands for waypoint manual, the final iteration of our waypoint movement system, which is discussed in the Waypoint Movement section of this document. Large and XL refer to the two arena sizes in which we tested our waypoint movement system against the original continuous implementation.

Elimination

In the elimination scenes, four players face off against another team of four. Balls are dropped throughout the stage, and players must pick up balls and throw them at opponents. If a player is hit twice by an opponent, they are "out", and sent to the penalty podium in the top-center of the stage.

The original dodgeball environment includes a Capture the Flag option, but we did not use it in this project. All results and scenes use the Elimination game mode.

Training

ML-Agents DodgeBall was built using ML-Agents Release 18 (Unity package 2.1.0-exp.1). We recommend the matching version of the Python trainers (Version 0.27.0), though newer trainers should work. See the Releases Page on the ML-Agents GitHub for more version information.

To train DodgeBall, in addition to downloading and opening this environment, you will need to install the ML-Agents Python package. Follow the getting started guide for more information on how to use the ML-Agents trainers.

You will need to use either the official Unity scenes or the eight additional scenes provided for training. Since training takes a long time, we recommend building these scenes into a Unity build.

Two configuration YAML files for ML-Agents are provided: DodgeBall.yaml and DodgeBall_seperate_policies.yaml. The separate-policies YAML is used to train the two types of agents discussed in this project: long-range and short-range. You can uncomment and increase the number of environments (num_envs) depending on your computer's capabilities.

After tens of millions of steps (this will take many, many hours!) your agents will start to improve. As with any self-play run, you should observe your ELO increase over time. Check out these videos (Elimination, Capture the Flag) for an example of what kind of behaviors to expect at different stages of training. In our experiments, we trained agents for 20M steps to get a good understanding of learning capabilities, but this is not nearly enough to reach convergence. Unity trained the original (simpler) models for 160M steps. These extreme training times are the inspiration for this project, as new methods are needed to reduce the computational requirements of reinforcement learning projects.
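To make the ELO trend above easier to interpret, here is a minimal sketch of the standard Elo update, written in Python. The K-factor of 16 and the starting rating of 1200 are assumptions chosen for illustration; the values actually used by the trainer may differ.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Expected score (between 0 and 1) of `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def elo_change(rating: float, opponent_rating: float, result: float, k: float = 16.0) -> float:
    """Rating change after one match; result is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

    k is an assumed K-factor, not a value taken from this project.
    """
    return k * (result - expected_score(rating, opponent_rating))


# Two equally rated teams (1200 is an assumed starting rating):
# the winner gains half of K and the loser drops by the same amount.
change = elo_change(1200.0, 1200.0, result=1.0)
print(change)  # 8.0
```

Because the change shrinks as the expected score approaches the actual result, a rating that keeps rising in self-play suggests the current policy keeps beating its own earlier snapshots.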

Environment Parameters

To produce the results in the blog post, we used the default environment as it is in this repo. However, we also provide environment parameters to adjust reward functions and control the environment from the trainer. You may find it useful, for instance, to experiment with curriculums.

  • is_capture_the_flag: Set this parameter to 1 to override the scene's game mode setting, and change it to Capture the Flag. Set to 0 for Elimination.

  • time_bonus_scale (default = 1.0 for Elimination, and 0.0 for CTF): Multiplier for negative reward given for taking too long to finish the game. Set to 1.0 for a -1.0 reward if it takes the maximum number of steps to finish the match.

  • elimination_hit_reward (default = 0.1): In Elimination, a reward given to an agent when it hits an opponent with a ball.
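As a rough illustration of how time_bonus_scale behaves, the Python sketch below assumes the penalty grows linearly with the fraction of the match that has elapsed. The parameter description suggests this but does not state it, and the function name and step counts are hypothetical.

```python
def time_penalty(steps_taken: int, max_steps: int, time_bonus_scale: float = 1.0) -> float:
    """Negative reward for slow matches, assuming a linear penalty (an assumption).

    With time_bonus_scale = 1.0, a match that uses every available step yields -1.0.
    """
    fraction = min(max(steps_taken / max_steps, 0.0), 1.0)  # clamp to [0, 1]
    return -time_bonus_scale * fraction


# Hypothetical step budget of 5000 steps per match.
print(time_penalty(5000, 5000))        # -1.0
print(time_penalty(2500, 5000))        # -0.5
print(time_penalty(2500, 5000, 0.5))   # -0.25
```

Setting time_bonus_scale to 0.0, the CTF default, removes the penalty entirely under this model.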

Extending DodgeBall to Emulate Military Training Scenarios

Infinite Ammunition

The first change that was needed to convert the original dodgeball scenario into a military-esque scenario was infinite ammunition. We implemented a system that destroys projectiles on impact and returns them into the possession of the agent. This removes the need to go and recover balls, which distracts from tactical movement and adds an unnecessary layer of complexity for the agents to learn.

3D Terrains

The next step was developing more realistic terrain. Battle scenarios will seldom occur on flat ground, so we imported data from the Razish Army Training Facility which allowed us to train our agents on a low-fidelity version of real-world training terrain. All the scenarios we tested included hills, which can be distinguished by the areas with different lighting and contour.

This new setup requires additional raycasts, so that the agents can detect opponents or walls that are not at the same altitude as them. This is crucial for developing intelligent policies when uneven terrain is introduced.
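One way to picture the additional raycasts is as the same horizontal fan of rays repeated at several pitch angles. The Python sketch below only computes ray directions; the specific yaw and pitch angles are hypothetical and are not taken from this project's sensor configuration. It uses a y-up, z-forward convention as Unity does.

```python
import math
from typing import List, Sequence, Tuple


def ray_directions(yaw_degrees: Sequence[float],
                   pitch_degrees: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Unit direction vectors (x, y, z) for every yaw/pitch pair.

    y is up and z is forward at yaw 0; positive pitch tilts the ray upward.
    """
    directions = []
    for pitch in pitch_degrees:
        for yaw in yaw_degrees:
            p, y = math.radians(pitch), math.radians(yaw)
            directions.append((math.cos(p) * math.sin(y),
                               math.sin(p),
                               math.cos(p) * math.cos(y)))
    return directions


# Hypothetical layout: three horizontal angles at three elevations.
rays = ray_directions(yaw_degrees=[-30.0, 0.0, 30.0], pitch_degrees=[-15.0, 0.0, 15.0])
print(len(rays))  # 9
```

A flat arena only needs the pitch-0 row; the tilted rows are what let an agent at the bottom of a hill detect an opponent at the top.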

Modified Observation and Action Spaces

Some modifications were made to the agents' observation and action spaces to better fit our needs. The observation spaces are smaller because we removed observations that only apply to the Capture the Flag game mode. Additionally, the dash action was removed as it was a bit awkward in our scenario, especially when moving along waypoints.

Shooting Vertically

Another obvious addition to our scenario was the ability to shoot vertically. Agents should be able to fire at angles other than parallel to the ground so that they can target opponents at different altitudes.

Aim-assist

We attempted to train some models that were able to choose the angle of their shots, but this drastically increased the complexity of the environment. Our solution was to implement aim-assist, which targets the opponent closest to the shooter's forward direction and automatically fires directly at it. This removes the need for fine-tuning aim and encourages learning intelligent positioning and movement over high-precision skills. This method achieved far better results, so it was used in most of our simulations and all the experiments in this repository.
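The target selection described above can be sketched in a few lines of Python. This is not the project's C# implementation: it assumes "closest to the forward direction" means the smallest angle between the shooter's forward vector and the direction to the opponent, and the function names and the range cutoff parameter are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(a: Vec3) -> float:
    return math.sqrt(_dot(a, a))


def pick_target(shooter_pos: Vec3, forward: Vec3,
                opponents: Sequence[Vec3], max_range: float) -> Optional[int]:
    """Index of the in-range opponent best aligned with `forward`, or None.

    `forward` must be non-zero. Alignment is the cosine of the angle between
    `forward` and the direction to the opponent (1.0 means dead ahead).
    """
    best_index, best_alignment = None, -math.inf
    forward_length = _norm(forward)
    for index, position in enumerate(opponents):
        offset = _sub(position, shooter_pos)
        distance = _norm(offset)
        if distance == 0.0 or distance > max_range:
            continue  # skip opponents on top of the shooter or out of range
        alignment = _dot(offset, forward) / (distance * forward_length)
        if alignment > best_alignment:
            best_index, best_alignment = index, alignment
    return best_index


def aim_direction(shooter_pos: Vec3, target_pos: Vec3) -> Vec3:
    """Unit vector from shooter to target, including the vertical component."""
    offset = _sub(target_pos, shooter_pos)
    length = _norm(offset)
    return (offset[0] / length, offset[1] / length, offset[2] / length)


opponents = [(5.0, 0.0, 5.0), (0.5, 2.0, 10.0), (0.0, 0.0, 50.0)]
index = pick_target((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), opponents, max_range=20.0)
print(index)  # 1
```

In this example the third opponent is directly ahead but beyond the range cutoff, so the second one is chosen, and the resulting aim direction has a positive vertical component because that opponent stands on higher ground.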

Introducing Roles

We also investigated introducing different roles within the same team. We hoped to see whether the agents could learn a more complicated strategy to cooperate and utilize each individual's strengths. This was studied using short-range and long-range units with different capabilities. The short-range units have half the aim-assist range but twice the fire rate. We found that the agents did in fact learn their roles. Short-range units learned more aggressive policies, and the long-range units tended to remain in the rear. The long-range units can be distinguished by their darker color.
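The relationship between the two roles can be written down compactly. In the Python sketch below, only the ratios (half the aim-assist range, twice the fire rate) come from the description above; the class name, field names, and baseline numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    aim_assist_range: float   # distance within which aim-assist can lock on
    shots_per_second: float   # fire rate


def short_range_from(long_range: Role) -> Role:
    """Derive the short-range role: half the aim-assist range, twice the fire rate."""
    return Role(name="short_range",
                aim_assist_range=long_range.aim_assist_range / 2.0,
                shots_per_second=long_range.shots_per_second * 2.0)


# Hypothetical baseline values, chosen only to make the ratios visible.
LONG_RANGE = Role(name="long_range", aim_assist_range=40.0, shots_per_second=1.0)
SHORT_RANGE = short_range_from(LONG_RANGE)
print(SHORT_RANGE)
```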

role_demo_video.mp4

Waypoint Movement

Due to the large computational requirements of reinforcement learning, we were not able to run our simulations for the same 160 million training steps as the original project. This, combined with the increased complexity of our environments, led us to develop a method to reduce training time. We developed a waypoint movement system that aims to reduce the complexity of our environments and the frequency of reinforcement learning steps while retaining the core positional strategy. This system limits agents to walking along waypoints that we generate on the terrain, allowing us to automate shooting and use reinforcement learning only for the agents' movement. We request a decision from the learned policy only at each waypoint, and that decision is translated to the direction of travel to the next waypoint. This increases the time between decisions by 700% while maintaining and sometimes improving upon tactical performance. We also developed a system to automatically generate these waypoints so the system can be quickly implemented on any Unity terrain. The code for waypoint generation is found under Assets/ScoutMission/WaypointGeneration/.
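The step from "policy decision" to "next waypoint" can be sketched as follows, in Python rather than the project's C#. The sketch assumes the policy outputs one of nine discrete actions (stay, or one of eight compass directions) and that each waypoint knows its neighbors; the data layout, the function name, and the rule of staying put when no neighbor lies in the chosen direction are all assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, z) position on the ground plane

# Eight compass directions, clockwise from +z ("north"); action 0 means stay.
DIRECTIONS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]


def next_waypoint(current: Point, neighbors: Sequence[Point], action: int) -> Point:
    """Neighbor best aligned with the chosen direction, or `current` if none qualifies."""
    if action == 0 or not neighbors:
        return current
    dx, dz = DIRECTIONS[action - 1]
    direction_length = math.hypot(dx, dz)
    best, best_alignment = current, 0.0  # require a positive alignment to move
    for neighbor in neighbors:
        ox, oz = neighbor[0] - current[0], neighbor[1] - current[1]
        length = math.hypot(ox, oz)
        if length == 0.0:
            continue
        alignment = (ox * dx + oz * dz) / (length * direction_length)
        if alignment > best_alignment:
            best, best_alignment = neighbor, alignment
    return best


neighbors = [(0.0, 10.0), (10.0, 0.0), (-10.0, 0.0)]
print(next_waypoint((0.0, 0.0), neighbors, action=1))  # (0.0, 10.0)
print(next_waypoint((0.0, 0.0), neighbors, action=5))  # (0.0, 0.0)
```

Between decisions the agent simply walks toward the returned waypoint, which is what lets the policy be queried far less often than with continuous movement.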

wp_movement_video.mp4

Trained Policies

Below are videos of the final trained policies for each of the scenarios we created, along with their corresponding ELO scores from self-play. We used ELO score as our metric for learning, but due to the differences between continuous and waypoint scenarios it is not a perfect metric. Our solution was to test the policies for each movement system directly against each other to see which performed better. This requires removing the waypoint constraints and thus creates a disadvantage for agents that were not trained in these conditions. Nevertheless, the waypoint-based agents consistently outcompete the continuous movement agents. They also achieve higher ELO scores in most scenarios.

Note: Some of the videos had to be cropped to meet GitHub's file size limitations.

Small Continuous

large_wp_video.mp4

Small Waypoint

L_VIDEO.mp4

Small Continuous with Dense Obstacles

LD_Video.mp4

Small Waypoint with Dense Obstacles

Large_Dense_Vid.mp4

Large Continuous

XL_Video.mp4

Large Waypoint

XL_WP_Video.mp4

Large Continuous with Dense Obstacles

XL_Dense_Video.mp4

Large Waypoint with Dense Obstacles

XL_WP_Dense_Vid.mp4

Continuous VS. Waypoint ELO Scores

Small Arena

Small Arena with Dense Obstacles

Large Arena

Large Arena with Dense Obstacles

Verification

In addition to tracking ELO as an indicator of learning, we tested the waypoint-based agents directly against a team of agents that were trained using the original continuous movement. This was accomplished by removing the waypoints while retaining the longer time between decisions and the discretized movement. In other words, the waypoint-based team picks one of 8 directions or stands still, and then continues that course of action for 40 fixed updates. The continuous movement team, on the other hand, retains its normal movement and makes decisions every 5 fixed updates. Even though the continuous movement team has home-court advantage, the policies learned by our waypoint movement method were able to consistently outperform the continuous movement team.
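The two decision intervals quoted above can be compared with a short calculation. The Python sketch below uses the 40 and 5 fixed-update intervals from the text; the match length of 4000 fixed updates is a hypothetical number used only to show the effect on decision count.

```python
WAYPOINT_DECISION_INTERVAL = 40    # fixed updates between decisions (from the text)
CONTINUOUS_DECISION_INTERVAL = 5   # fixed updates between decisions (from the text)


def percent_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def decisions_per_match(fixed_updates: int, interval: int) -> int:
    """Number of policy decisions requested during a match of the given length."""
    return fixed_updates // interval


print(percent_increase(WAYPOINT_DECISION_INTERVAL, CONTINUOUS_DECISION_INTERVAL))  # 700.0

# Hypothetical match length of 4000 fixed updates.
print(decisions_per_match(4000, WAYPOINT_DECISION_INTERVAL))    # 100
print(decisions_per_match(4000, CONTINUOUS_DECISION_INTERVAL))  # 800
```

Going from 5 to 40 fixed updates is an eightfold interval, which corresponds to a 700% increase in the time between decisions.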

Each scenario was run 100 times, with scores and video provided below.

Small Arena

L_test_video.mp4

Small Arena with Dense Obstacles

LDVS_Video.mp4

Large Arena

XL_vs.mp4

Large Arena with Dense Obstacles

XL_VS_Dense.mp4

Conclusion

Automatically generating waypoints is an efficient way to discretize the state space in military training scenarios so that reinforcement learning agents can learn intelligent policies more quickly. Our results show that agents trained on a waypoint system can transfer their knowledge back into a continuous space and outperform agents that were not trained using waypoints. Furthermore, the advantage seems to grow as the complexity of the terrain increases.

In general, it seems most effective to use reinforcement learning to decide higher level behavior and strategy while hard-coding fine skills like aiming a projectile. Furthermore, state spaces should be discretized whenever possible to increase learning speed.
