docs/misc/changelog.rst (+7 -2: 7 additions & 2 deletions)
@@ -3,10 +3,10 @@
 Changelog
 ==========
 
-Release 2.4.0a10 (WIP)
+Release 2.4.0a11 (WIP)
 --------------------------
 
-**New algorithm: CrossQ in SB3 Contrib**
+**New algorithm: CrossQ in SB3 Contrib, Gymnasium v1.0 support**
 
 .. note::
 
@@ -24,12 +24,14 @@ Release 2.4.0a10 (WIP)
 
 Breaking Changes:
 ^^^^^^^^^^^^^^^^^
+- Increase minimum required version of Gymnasium to 0.29.1
 
 New Features:
 ^^^^^^^^^^^^^
 - Added support for ``pre_linear_modules`` and ``post_linear_modules`` in ``create_mlp`` (useful for adding normalization layers, like in DroQ or CrossQ)
 - Enabled np.ndarray logging for TensorBoardOutputFormat as histogram (see GH#1634) (@iwishwasaneagle)
 - Updated env checker to warn users when using multi-dim array to define `MultiDiscrete` spaces
+- Added support for Gymnasium v1.0
 
 Bug Fixes:
 ^^^^^^^^^^
@@ -57,6 +59,7 @@ Bug Fixes:
 `SBX`_ (SB3 + Jax)
 ^^^^^^^^^^^^^^^^^^
 - Added CNN support for DQN
+- Bug fix for SAC and related algorithms, optimize log of ent coeff to be consistent with SB3
 
 Deprecations:
 ^^^^^^^^^^^^^
@@ -69,6 +72,7 @@ Others:
 - Added a warning to recommend using CPU with on policy algorithms (A2C/PPO) and ``MlpPolicy``
 - Switched to uv to download packages faster on GitHub CI
 - Updated dependencies for read the doc
+- Removed unnecessary ``copy_obs_dict`` method for ``SubprocVecEnv``, remove the use of ordered dict and rename ``flatten_obs`` to ``stack_obs``
 
 Bug Fixes:
 ^^^^^^^^^^
@@ -77,6 +81,7 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Updated PPO doc to recommend using CPU with ``MlpPolicy``
 - Clarified documentation about planned features and citing software
+- Added a note about the fact we are optimizing log of ent coeff for SAC
docs/modules/sac.rst (+3: 3 additions & 0 deletions)
@@ -35,6 +35,9 @@ Notes
 which is the equivalent to the inverse of reward scale in the original SAC paper.
 The main reason is that it avoids having too high errors when updating the Q functions.
 
+.. note::
+    When automatically adjusting the temperature (alpha/entropy coefficient), we optimize the logarithm of the entropy coefficient instead of the entropy coefficient itself. This is consistent with the original implementation and has proven to be more stable
+    (see issues `GH#36 <https://github.com/DLR-RM/stable-baselines3/issues/36>`_, `#55 <https://github.com/araffin/sbx/issues/55>`_ and others).
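
The note added to ``sac.rst`` can be illustrated with a small, dependency-free sketch. It is not the SB3 or SBX code: the function name ``update_log_ent_coef``, the plain gradient-descent step (the libraries use an optimizer such as Adam on tensors), and the sample log-probabilities are all assumptions made for illustration. It only shows why updating the logarithm keeps the coefficient positive, whereas updating the coefficient directly with the same kind of step can push it below zero.

```python
import math


def update_log_ent_coef(log_ent_coef, log_probs, target_entropy, lr):
    """One plain gradient-descent step on the log of the entropy coefficient.

    Assumed loss (illustrative): -mean(log_ent_coef * (log_prob + target_entropy)),
    whose gradient w.r.t. log_ent_coef is -mean(log_prob + target_entropy).
    """
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_ent_coef - lr * grad


# Hypothetical batch: policy entropy (about 1.0) is above the target (-2.0),
# so the entropy coefficient should shrink.
log_probs = [-0.5, -1.5]
target_entropy = -2.0
lr = 0.01

log_ent_coef = math.log(1.0)  # start from ent_coef = 1.0
direct_ent_coef = 1.0         # same start, but updated directly for comparison

for _ in range(1000):
    log_ent_coef = update_log_ent_coef(log_ent_coef, log_probs, target_entropy, lr)
    # Direct variant: loss -mean(ent_coef * (log_prob + target_entropy)) has the same gradient.
    direct_grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    direct_ent_coef -= lr * direct_grad

ent_coef = math.exp(log_ent_coef)  # always strictly positive
```

In this toy run, ``ent_coef`` shrinks towards zero but stays positive, while ``direct_ent_coef`` becomes negative, which would not be a valid temperature.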