Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of [Stable Baselines](https://github.com/hill-a/stable-baselines).
You can read a detailed presentation of Stable Baselines3 in the [v1.0 blog post](https://araffin.github.io/post/sb3/) or our [JMLR paper](https://jmlr.org/papers/volume22/20-1364/20-1364.pdf).
@@ -85,7 +85,7 @@ Documentation is available online: [https://sb3-contrib.readthedocs.io/](https:/
## Stable-Baselines Jax (SBX)
-[Stable Baselines Jax (SBX)](https://github.com/araffin/sbx) is a proof of concept version of Stable-Baselines3 in Jax.
+[Stable Baselines Jax (SBX)](https://github.com/araffin/sbx) is a proof of concept version of Stable-Baselines3 in Jax, with recent algorithms like DroQ or CrossQ.
It provides a minimal number of features compared to SB3 but can be much faster (up to 20x times!): https://twitter.com/araffin2/status/1590714558628253698
@@ -192,7 +192,7 @@ All the following examples can be executed online using Google Colab notebooks:
<bid="f1">1</b>: Implemented in [SB3 Contrib](https://github.com/Stable-Baselines-Team/stable-baselines3-contrib) GitHub repository.
Actions `gym.spaces`:
-* `Box`: A N-dimensional box that containes every point in the action space.
+* `Box`: A N-dimensional box that contains every point in the action space.
* `Discrete`: A list of possible actions, where each timestep only one of the actions can be used.
* `MultiDiscrete`: A list of possible actions, where each timestep only one action of each discrete set can be used.
* `MultiBinary`: A list of possible actions, where each timestep any of the actions can be used in any combination.
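For illustration, a minimal sketch of how these action spaces can be declared with `gymnasium` (the particular shapes and sizes below are arbitrary examples, not taken from the README):

```python
import numpy as np
from gymnasium import spaces

# Box: continuous actions, e.g. 3 values each in [-1, 1]
box_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
# Discrete: exactly one of n actions per timestep
discrete_space = spaces.Discrete(4)
# MultiDiscrete: one action from each discrete set per timestep
multi_discrete_space = spaces.MultiDiscrete([3, 2])
# MultiBinary: any combination of n binary switches per timestep
multi_binary_space = spaces.MultiBinary(4)

print(box_space.sample(), discrete_space.sample())
```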
docs/guide/rl_tips.rst: 22 additions & 21 deletions
@@ -4,7 +4,7 @@
Reinforcement Learning Tips and Tricks
======================================
-The aim of this section is to help you do reinforcement learning experiments.
+The aim of this section is to help you run reinforcement learning experiments.
It covers general advice about RL (where to start, which algorithm to choose, how to evaluate an algorithm, ...),
as well as tips and tricks when using a custom environment or implementing an RL algorithm.
@@ -14,6 +14,11 @@ as well as tips and tricks when using a custom environment or implementing an RL
this section in more details. You can also find the `slides here <https://araffin.github.io/slides/rlvs-tips-tricks/>`_.
+.. note::
+
+    We also have a `video on Designing and Running Real-World RL Experiments <https://youtu.be/eZ6ZEpCi6D8>`_, slides `can be found online <https://araffin.github.io/slides/design-real-rl-experiments/>`_.
+
+
General advice when using Reinforcement Learning
================================================
@@ -103,19 +108,19 @@ and this `issue <https://github.com/hill-a/stable-baselines/issues/199>`_ by Cé
Which algorithm should I use?
=============================
-There is no silver bullet in RL, depending on your needs and problem, you may choose one or the other.
+There is no silver bullet in RL, you can choose one or the other depending on your needs and problems.
The first distinction comes from your action space, i.e., do you have discrete (e.g. LEFT, RIGHT, ...)
or continuous actions (ex: go to a certain speed)?
-Some algorithms are only tailored for one or the other domain: ``DQN`` only supports discrete actions, where ``SAC`` is restricted to continuous actions.
+Some algorithms are only tailored for one or the other domain: ``DQN`` supports only discrete actions, while ``SAC`` is restricted to continuous actions.

-The second difference that will help you choose is whether you can parallelize your training or not.
+The second difference that will help you decide is whether you can parallelize your training or not.
If what matters is the wall clock training time, then you should lean towards ``A2C`` and its derivatives (PPO, ...).
Take a look at the `Vectorized Environments <vec_envs.html>`_ to learn more about training with multiple workers.
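As an illustration of multi-worker training, a minimal sketch with ``PPO`` and a vectorized environment (the environment id and number of workers are arbitrary choices for this example):

.. code-block:: python

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    # 8 environments collecting experience in parallel (arbitrary number for this sketch)
    vec_env = make_vec_env("CartPole-v1", n_envs=8)
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=50_000)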
-To accelerate training, you can also take a look at `SBX`_, which is SB3 + Jax, it has fewer features than SB3 but can be up to 20x faster than SB3 PyTorch thanks to JIT compilation of the gradient update.
+To accelerate training, you can also take a look at `SBX`_, which is SB3 + Jax, it has less features than SB3 but can be up to 20x faster than SB3 PyTorch thanks to JIT compilation of the gradient update.

-In sparse reward settings, we either recommend to use dedicated methods like HER (see below) or population-based algorithms like ARS (available in our :ref:`contrib repo <sb3_contrib>`).
+In sparse reward settings, we either recommend using either dedicated methods like HER (see below) or population-based algorithms like ARS (available in our :ref:`contrib repo <sb3_contrib>`).
To sum it up:
@@ -146,7 +151,7 @@ Continuous Actions
Continuous Actions - Single Process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Current State Of The Art (SOTA) algorithms are ``SAC``, ``TD3`` and ``TQC`` (available in our :ref:`contrib repo <sb3_contrib>`).
+Current State Of The Art (SOTA) algorithms are ``SAC``, ``TD3``, ``CrossQ`` and ``TQC`` (available in our :ref:`contrib repo <sb3_contrib>` and :ref:`SBX (SB3 + Jax) repo <sbx>`).
Please use the hyperparameters in the `RL zoo <https://github.com/DLR-RM/rl-baselines3-zoo>`_ for best results.
If you want an extremely sample-efficient algorithm, we recommend using the `DroQ configuration <https://twitter.com/araffin2/status/1575439865222660098>`_ in `SBX`_ (it does many gradient steps per step in the environment).
@@ -155,8 +160,7 @@ If you want an extremely sample-efficient algorithm, we recommend using the `Dro
Continuous Actions - Multiprocessed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Take a look at ``PPO``, ``TRPO`` (available in our :ref:`contrib repo <sb3_contrib>`) or ``A2C``. Again, don't forget to take the hyperparameters from the `RL zoo <https://github.com/DLR-RM/rl-baselines3-zoo>`_
-for continuous actions problems (cf *Bullet* envs).
+Take a look at ``PPO``, ``TRPO`` (available in our :ref:`contrib repo <sb3_contrib>`) or ``A2C``. Again, don't forget to take the hyperparameters from the `RL zoo <https://github.com/DLR-RM/rl-baselines3-zoo>`_ for continuous actions problems (cf *Bullet* envs).
.. note::
@@ -181,26 +185,23 @@ Tips and Tricks when creating a custom environment
If you want to learn about how to create a custom environment, we recommend you read this `page <custom_env.html>`_.
-We also provide a `colab notebook <https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/master/5_custom_gym_env.ipynb>`_ for
-a concrete example of creating a custom gym environment.
+We also provide a `colab notebook <https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/master/5_custom_gym_env.ipynb>`_ for a concrete example of creating a custom gym environment.
Some basic advice:
-- always normalize your observation space when you can, i.e., when you know the boundaries
-- normalize your action space and make it symmetric when continuous (cf potential issue below) A good practice is to rescale your actions to lie in [-1, 1]. This does not limit you as you can easily rescale the action inside the environment
-- start with shaped reward (i.e. informative reward) and simplified version of your problem
-- debug with random actions to check that your environment works and follows the gym interface:
+- always normalize your observation space if you can, i.e. if you know the boundaries
+- normalize your action space and make it symmetric if it is continuous (see potential problem below) A good practice is to rescale your actions so that they lie in [-1, 1]. This does not limit you, as you can easily rescale the action within the environment.
+- start with a shaped reward (i.e. informative reward) and a simplified version of your problem
+- debug with random actions to check if your environment works and follows the gym interface (with ``check_env``, see below)
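To make the action-rescaling advice above concrete, here is a sketch of how a custom environment might map a normalized action in [-1, 1] back to its physical range inside ``step()`` (the bounds and the helper name are hypothetical):

.. code-block:: python

    import numpy as np

    # Hypothetical physical bounds of the true action space
    LOW, HIGH = np.array([0.0, -10.0]), np.array([5.0, 10.0])

    def rescale_action(normalized_action: np.ndarray) -> np.ndarray:
        """Map an action in [-1, 1] to the bounds [LOW, HIGH]."""
        return LOW + 0.5 * (normalized_action + 1.0) * (HIGH - LOW)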
-Two important things to keep in mind when creating a custom environment is to avoid breaking Markov assumption
+Two important things to keep in mind when creating a custom environment are avoiding breaking the Markov assumption
and properly handle termination due to a timeout (maximum number of steps in an episode).
-For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations
-as input.
+For example, if there is a time delay between action and observation (e.g. due to wifi communication), you should provide a history of observations as input.

Termination due to timeout (max number of steps per episode) needs to be handled separately.
You should return ``truncated = True``.
If you are using the gym ``TimeLimit`` wrapper, this will be done automatically.
-You can read `Time Limit in RL <https://arxiv.org/abs/1712.00378>`_ or take a look at the `RL Tips and Tricks video <https://www.youtube.com/watch?v=Ikngt0_DXJg>`_
-for more details.
+You can read `Time Limit in RL <https://arxiv.org/abs/1712.00378>`_, take a look at the `Designing and Running Real-World RL Experiments video <https://youtu.be/eZ6ZEpCi6D8>`_ or `RL Tips and Tricks video <https://www.youtube.com/watch?v=Ikngt0_DXJg>`_ for more details.
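As a sketch of how the timeout can be handled in practice with ``gymnasium`` (the 500-step limit and ``Pendulum-v1`` are arbitrary stand-ins for your own setup):

.. code-block:: python

    import gymnasium as gym

    # Passing ``max_episode_steps`` applies the ``TimeLimit`` wrapper, which returns
    # ``truncated=True`` once the step limit is reached (arbitrary value here).
    env = gym.make("Pendulum-v1", max_episode_steps=500)

    obs, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())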
We provide a helper to check that your environment runs without error:
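A sketch of how that helper is typically used (``Pendulum-v1`` is only a stand-in for your own environment):

.. code-block:: python

    import gymnasium as gym
    from stable_baselines3.common.env_checker import check_env

    env = gym.make("Pendulum-v1")  # replace with your custom environment
    # Raises an error / prints warnings if the env does not follow the Gym API
    check_env(env, warn=True)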
@@ -234,7 +235,7 @@ If you want to quickly try a random agent on your environment, you can also do:
Most reinforcement learning algorithms rely on a Gaussian distribution (initially centered at 0 with std 1) for continuous actions.
So, if you forget to normalize the action space when using a custom environment,
-this can harm learning and be difficult to debug (cf attached image and `issue #473 <https://github.com/hill-a/stable-baselines/issues/473>`_).
+this can harm learning and can be difficult to debug (cf attached image and `issue #473 <https://github.com/hill-a/stable-baselines/issues/473>`_).