
Commit e4f4f12

Add note about SAC ent coeff optimization (DLR-RM#2037)
* Allow new sphinx version
* Add note about SAC ent coeff and add DQN tutorial link
Parent commit: 8f0b488

5 files changed: +8 −2 lines


docs/conda_env.yml

Lines changed: 1 addition & 1 deletion
@@ -14,6 +14,6 @@ dependencies:
   - pandas
   - numpy>=1.20,<2.0
   - matplotlib
-  - sphinx>=5,<8
+  - sphinx>=5,<9
   - sphinx_rtd_theme>=1.3.0
   - sphinx_copybutton

docs/misc/changelog.rst

Lines changed: 2 additions & 0 deletions
@@ -59,6 +59,7 @@ Bug Fixes:
 `SBX`_ (SB3 + Jax)
 ^^^^^^^^^^^^^^^^^^
 - Added CNN support for DQN
+- Bug fix for SAC and related algorithms, optimize log of ent coeff to be consistent with SB3

 Deprecations:
 ^^^^^^^^^^^^^
@@ -80,6 +81,7 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Updated PPO doc to recommend using CPU with ``MlpPolicy``
 - Clarified documentation about planned features and citing software
+- Added a note about the fact we are optimizing log of ent coeff for SAC

 Release 2.3.2 (2024-04-27)
 --------------------------

docs/modules/dqn.rst

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ Notes

 - Original paper: https://arxiv.org/abs/1312.5602
 - Further reference: https://www.nature.com/articles/nature14236
+- Tutorial "From Tabular Q-Learning to DQN": https://github.com/araffin/rlss23-dqn-tutorial

 .. note::
     This implementation provides only vanilla Deep Q-Learning and has no extensions such as Double-DQN, Dueling-DQN and Prioritized Experience Replay.

docs/modules/sac.rst

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ Notes
     which is the equivalent to the inverse of reward scale in the original SAC paper.
     The main reason is that it avoids having too high errors when updating the Q functions.

+.. note::
+    When automatically adjusting the temperature (alpha/entropy coefficient), we optimize the logarithm of the entropy coefficient instead of the entropy coefficient itself. This is consistent with the original implementation and has proven to be more stable
+    (see issues `GH#36 <https://github.com/DLR-RM/stable-baselines3/issues/36>`_, `#55 <https://github.com/araffin/sbx/issues/55>`_ and others).

 .. note::
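For readers skimming the commit, here is a minimal PyTorch-style sketch of what the new note describes: parameterizing the SAC temperature through its logarithm and taking gradient steps on log(alpha) rather than on alpha directly. The variable names, the target-entropy heuristic, and the learning rate below are illustrative assumptions, not the exact stable-baselines3 code.

import torch as th

action_dim = 6                        # assumed size of a continuous action space (illustrative)
target_entropy = -float(action_dim)   # common heuristic: -dim(A)

# Parameterize the temperature through its log, so alpha = exp(log_alpha) stays positive.
log_ent_coef = th.zeros(1, requires_grad=True)               # log(1.0) = 0.0 as initial value
ent_coef_optimizer = th.optim.Adam([log_ent_coef], lr=3e-4)

def update_ent_coef(log_prob: th.Tensor) -> th.Tensor:
    """One temperature update, given log-probabilities of actions sampled from the policy."""
    # Gradient flows only through log_ent_coef; the policy term is treated as a constant.
    ent_coef_loss = -(log_ent_coef * (log_prob + target_entropy).detach()).mean()
    ent_coef_optimizer.zero_grad()
    ent_coef_loss.backward()
    ent_coef_optimizer.step()
    # alpha (the entropy coefficient) used in the actor and critic losses
    return log_ent_coef.detach().exp()

# Usage with dummy log-probabilities for a batch of 256 sampled actions:
alpha = update_ent_coef(log_prob=-th.rand(256) * action_dim)

Working in log space avoids any clipping to keep alpha positive and makes each gradient step roughly multiplicative in alpha, which lines up with the stability observations in the linked issues.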

setup.py

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@
         "black>=24.2.0,<25",
     ],
     "docs": [
-        "sphinx>=5,<8",
+        "sphinx>=5,<9",
         "sphinx-autobuild",
         "sphinx-rtd-theme>=1.3.0",
         # For spelling
