diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 12 additions & 8 deletions
diff --git a/docs/conda_env.yml b/docs/conda_env.yml
Lines changed: 2 additions & 2 deletions
diff --git a/docs/misc/changelog.rst b/docs/misc/changelog.rst
Lines changed: 7 additions & 2 deletions
diff --git a/docs/modules/dqn.rst b/docs/modules/dqn.rst
Lines changed: 1 addition & 0 deletions
diff --git a/docs/modules/sac.rst b/docs/modules/sac.rst
Lines changed: 3 additions & 0 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 0 additions & 1 deletion
diff --git a/setup.py b/setup.py
Lines changed: 17 additions & 28 deletions
diff --git a/stable_baselines3/common/vec_env/dummy_vec_env.py b/stable_baselines3/common/vec_env/dummy_vec_env.py
Lines changed: 4 additions & 4 deletions
diff --git a/stable_baselines3/common/vec_env/patch_gym.py b/stable_baselines3/common/vec_env/patch_gym.py
Lines changed: 1 addition & 1 deletion
diff --git a/stable_baselines3/common/vec_env/subproc_vec_env.py b/stable_baselines3/common/vec_env/subproc_vec_env.py
Lines changed: 17 additions & 17 deletions
@@ -21,7 +21,12 @@ jobs:
     strategy:
       matrix:
         python-version: ["3.8", "3.9", "3.10", "3.11"]
-
+        include:
+          # Default version
+          - gymnasium-version: "1.0.0"
+          # Add a new config to test gym<1.0
+          - python-version: "3.10"
+            gymnasium-version: "0.29.1"
     steps:
       - uses: actions/checkout@v3
       - name: Set up Python ${{ matrix.python-version }}
@@ -37,15 +42,14 @@ jobs:
           # See https://github.com/astral-sh/uv/issues/1497
           uv pip install --system torch==2.3.1+cpu --index https://download.pytorch.org/whl/cpu
 
-          # Install Atari Roms
-          uv pip install --system autorom
-          wget https://gist.githubusercontent.com/jjshoots/61b22aefce4456920ba99f2c36906eda/raw/00046ac3403768bfe45857610a3d333b8e35e026/Roms.tar.gz.b64
-          base64 Roms.tar.gz.b64 --decode &> Roms.tar.gz
-          AutoROM --accept-license --source-file Roms.tar.gz
-
-          uv pip install --system .[extra_no_roms,tests,docs]
+          uv pip install --system .[extra,tests,docs]
           # Use headless version
           uv pip install --system opencv-python-headless
+      - name: Install specific version of gym
+        run: |
+          uv pip install --system gymnasium==${{ matrix.gymnasium-version }}
+        # Only run for python 3.10, downgrade gym to 0.29.1
+        if: matrix.gymnasium-version != '1.0.0'
       - name: Lint with ruff
         run: |
           make lint
 
@@ -8,12 +8,12 @@ dependencies:
   - python=3.11
   - pytorch=2.5.0=py3.11_cpu_0
   - pip:
-    - gymnasium>=0.28.1,<0.30
+    - gymnasium>=0.29.1,<1.1.0
     - cloudpickle
     - opencv-python-headless
     - pandas
     - numpy>=1.20,<2.0
     - matplotlib
-    - sphinx>=5,<8
+    - sphinx>=5,<9
     - sphinx_rtd_theme>=1.3.0
     - sphinx_copybutton
@@ -3,10 +3,10 @@
 Changelog
 ==========
 
-Release 2.4.0a10 (WIP)
+Release 2.4.0a11 (WIP)
 --------------------------
 
-**New algorithm: CrossQ in SB3 Contrib**
+**New algorithm: CrossQ in SB3 Contrib, Gymnasium v1.0 support**
 
 .. note::
 
@@ -24,12 +24,14 @@ Release 2.4.0a10 (WIP)
 
 Breaking Changes:
 ^^^^^^^^^^^^^^^^^
+- Increase minimum required version of Gymnasium to 0.29.1
 
 New Features:
 ^^^^^^^^^^^^^
 - Added support for ``pre_linear_modules`` and ``post_linear_modules`` in ``create_mlp`` (useful for adding normalization layers, like in DroQ or CrossQ)
 - Enabled np.ndarray logging for TensorBoardOutputFormat as histogram (see GH#1634) (@iwishwasaneagle)
 - Updated env checker to warn users when using multi-dim array to define `MultiDiscrete` spaces
+- Added support for Gymnasium v1.0
 
 Bug Fixes:
 ^^^^^^^^^^
@@ -57,6 +59,7 @@ Bug Fixes:
 `SBX`_ (SB3 + Jax)
 ^^^^^^^^^^^^^^^^^^
 - Added CNN support for DQN
+- Fixed a bug in SAC and related algorithms: the log of the entropy coefficient is now optimized, to be consistent with SB3
 
 Deprecations:
 ^^^^^^^^^^^^^
@@ -69,6 +72,7 @@ Others:
 - Added a warning to recommend using CPU with on policy algorithms (A2C/PPO) and ``MlpPolicy``
 - Switched to uv to download packages faster on GitHub CI
 - Updated dependencies for read the doc
+- Removed unnecessary ``copy_obs_dict`` method for ``SubprocVecEnv``, removed the use of ordered dict and renamed ``flatten_obs`` to ``stack_obs``
 
 Bug Fixes:
 ^^^^^^^^^^
@@ -77,6 +81,7 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Updated PPO doc to recommend using CPU with ``MlpPolicy``
 - Clarified documentation about planned features and citing software
+- Added a note explaining that SAC optimizes the log of the entropy coefficient
 
 Release 2.3.2 (2024-04-27)
 --------------------------
 
@@ -25,6 +25,7 @@ Notes
 
 - Original paper: https://arxiv.org/abs/1312.5602
 - Further reference: https://www.nature.com/articles/nature14236
+- Tutorial "From Tabular Q-Learning to DQN": https://github.com/araffin/rlss23-dqn-tutorial
 
 .. note::
     This implementation provides only vanilla Deep Q-Learning and has no extensions such as Double-DQN, Dueling-DQN and Prioritized Experience Replay.
 
@@ -35,6 +35,9 @@ Notes
     which is the equivalent to the inverse of reward scale in the original SAC paper.
     The main reason is that it avoids having too high errors when updating the Q functions.
 
+.. note::
+    When automatically adjusting the temperature (alpha/entropy coefficient), we optimize the logarithm of the entropy coefficient instead of the entropy coefficient itself. This is consistent with the original implementation and has proven to be more stable
+    (see issues `GH#36 <https://github.com/DLR-RM/stable-baselines3/issues/36>`_, `#55 <https://github.com/araffin/sbx/issues/55>`_ and others).
 
 .. note::
 
 
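The note added to sac.rst says the logarithm of the entropy coefficient is optimized rather than the coefficient itself. A minimal pure-Python sketch of that idea follows. It assumes a loss of the form `-mean(log_ent_coef * (log_prob + target_entropy))`; the function name, learning rate, and sample values are hypothetical and only for illustration, not the SB3 implementation.

```python
import math


def update_log_ent_coef(log_ent_coef, log_probs, target_entropy, lr):
    """One gradient-descent step on the log of the entropy coefficient (illustrative)."""
    # Assumed loss: -mean(log_ent_coef * (log_prob + target_entropy))
    # Its derivative w.r.t. log_ent_coef is -mean(log_prob + target_entropy)
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_ent_coef - lr * grad


# Start from ent_coef = 1.0, i.e. log_ent_coef = 0.0
log_ent_coef = math.log(1.0)
for _ in range(1000):
    # Hypothetical batch of action log-probabilities
    log_ent_coef = update_log_ent_coef(log_ent_coef, [-3.0, -2.5], target_entropy=-1.0, lr=0.01)

# The coefficient is recovered with exp(), so it stays strictly positive
# no matter how far the unconstrained log parameter moves.
ent_coef = math.exp(log_ent_coef)
```

Because the optimized variable is unconstrained and the coefficient is obtained through `exp`, the coefficient cannot become negative, which is one plausible reason the log parameterization behaves more stably.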
@@ -18,7 +18,6 @@ ignore = ["B028", "RUF013"]
 # ClassVar, implicit optional check not needed for tests
 "./tests/*.py" = ["RUF012", "RUF013"]
 
-
 [tool.ruff.lint.mccabe]
 # Unlike Flake8, default to a complexity level of 10.
 max-complexity = 15
 
@@ -70,37 +70,13 @@
 
 """  # noqa:E501
 
-# Atari Games download is sometimes problematic:
-# https://github.com/Farama-Foundation/AutoROM/issues/39
-# That's why we define extra packages without it.
-extra_no_roms = [
-    # For render
-    "opencv-python",
-    "pygame",
-    # Tensorboard support
-    "tensorboard>=2.9.1",
-    # Checking memory taken by replay buffer
-    "psutil",
-    # For progress bar callback
-    "tqdm",
-    "rich",
-    # For atari games,
-    "shimmy[atari]~=1.3.0",
-    "pillow",
-]
-
-extra_packages = extra_no_roms + [  # noqa: RUF005
-    # For atari roms,
-    "autorom[accept-rom-license]~=0.6.1",
-]
-
 
 setup(
     name="stable_baselines3",
     packages=[package for package in find_packages() if package.startswith("stable_baselines3")],
     package_data={"stable_baselines3": ["py.typed", "version.txt"]},
     install_requires=[
-        "gymnasium>=0.28.1,<0.30",
+        "gymnasium>=0.29.1,<1.1.0",
         "numpy>=1.20,<2.0",  # PyTorch not compatible https://github.com/pytorch/pytorch/issues/107302
         "torch>=1.13",
         # For saving models
@@ -125,16 +101,29 @@
             "black>=24.2.0,<25",
         ],
         "docs": [
-            "sphinx>=5,<8",
+            "sphinx>=5,<9",
             "sphinx-autobuild",
             "sphinx-rtd-theme>=1.3.0",
             # For spelling
             "sphinxcontrib.spelling",
             # Copy button for code snippets
             "sphinx_copybutton",
         ],
-        "extra": extra_packages,
-        "extra_no_roms": extra_no_roms,
+        "extra": [
+            # For render
+            "opencv-python",
+            "pygame",
+            # Tensorboard support
+            "tensorboard>=2.9.1",
+            # Checking memory taken by replay buffer
+            "psutil",
+            # For progress bar callback
+            "tqdm",
+            "rich",
+            # For atari games,
+            "ale-py>=0.9.0",
+            "pillow",
+        ],
     },
     description="Pytorch version of Stable Baselines, implementations of reinforcement learning algorithms.",
     author="Antonin Raffin",
 
@@ -8,7 +8,7 @@
 
 from stable_baselines3.common.vec_env.base_vec_env import VecEnv, VecEnvIndices, VecEnvObs, VecEnvStepReturn
 from stable_baselines3.common.vec_env.patch_gym import _patch_env
-from stable_baselines3.common.vec_env.util import copy_obs_dict, dict_to_obs, obs_space_info
+from stable_baselines3.common.vec_env.util import dict_to_obs, obs_space_info
 
 
 class DummyVecEnv(VecEnv):
@@ -110,12 +110,12 @@ def _save_obs(self, env_idx: int, obs: VecEnvObs) -> None:
                 self.buf_obs[key][env_idx] = obs[key]  # type: ignore[call-overload]
 
     def _obs_from_buf(self) -> VecEnvObs:
-        return dict_to_obs(self.observation_space, copy_obs_dict(self.buf_obs))
+        return dict_to_obs(self.observation_space, deepcopy(self.buf_obs))
 
     def get_attr(self, attr_name: str, indices: VecEnvIndices = None) -> List[Any]:
         """Return attribute from vectorized environment (see base class)."""
         target_envs = self._get_target_envs(indices)
-        return [getattr(env_i, attr_name) for env_i in target_envs]
+        return [env_i.get_wrapper_attr(attr_name) for env_i in target_envs]
 
     def set_attr(self, attr_name: str, value: Any, indices: VecEnvIndices = None) -> None:
         """Set attribute inside vectorized environments (see base class)."""
@@ -126,7 +126,7 @@ def set_attr(self, attr_name: str, value: Any, indices: VecEnvIndices = None) ->
     def env_method(self, method_name: str, *method_args, indices: VecEnvIndices = None, **method_kwargs) -> List[Any]:
         """Call instance methods of vectorized environments."""
         target_envs = self._get_target_envs(indices)
-        return [getattr(env_i, method_name)(*method_args, **method_kwargs) for env_i in target_envs]
+        return [env_i.get_wrapper_attr(method_name)(*method_args, **method_kwargs) for env_i in target_envs]
 
     def env_is_wrapped(self, wrapper_class: Type[gym.Wrapper], indices: VecEnvIndices = None) -> List[bool]:
         """Check if worker environments are wrapped with a given wrapper"""
 
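The switch from `getattr(env_i, ...)` to `env_i.get_wrapper_attr(...)` presumably accounts for wrappers in Gymnasium v1.0 no longer forwarding arbitrary attribute lookups to the wrapped environment. The sketch below illustrates that lookup chain with hypothetical stand-in classes (`BaseEnv`, `Wrapper`, and the `difficulty` attribute are not Gymnasium code), so it runs without Gymnasium installed.

```python
class BaseEnv:
    """Hypothetical stand-in for an environment with a custom attribute."""

    def __init__(self):
        self.difficulty = 3

    def get_wrapper_attr(self, name):
        # Bottom of the stack: plain attribute lookup
        return getattr(self, name)


class Wrapper:
    """Hypothetical stand-in for a wrapper without implicit attribute forwarding."""

    def __init__(self, env):
        self.env = env

    def get_wrapper_attr(self, name):
        # Look on this wrapper first, then walk down the wrapper stack
        if hasattr(self, name):
            return getattr(self, name)
        return self.env.get_wrapper_attr(name)


wrapped = Wrapper(Wrapper(BaseEnv()))

# Plain attribute access does not see through the wrappers...
found_directly = hasattr(wrapped, "difficulty")
# ...while the explicit lookup reaches the innermost environment.
value = wrapped.get_wrapper_attr("difficulty")
```

With this pattern, `get_attr` and `env_method` keep working for attributes defined on the base environment even when several wrappers sit on top of it.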
@@ -43,7 +43,7 @@ def _patch_env(env: Union["gym.Env", gymnasium.Env]) -> gymnasium.Env:  # pragma
             "Missing shimmy installation. You provided an OpenAI Gym environment. "
             "Stable-Baselines3 (SB3) has transitioned to using Gymnasium internally. "
             "In order to use OpenAI Gym environments with SB3, you need to "
-            "install shimmy (`pip install 'shimmy>=0.2.1'`)."
+            "install shimmy (`pip install 'shimmy>=2.0'`)."
         ) from e
 
     warnings.warn(
 
@@ -1,6 +1,5 @@
 import multiprocessing as mp
 import warnings
-from collections import OrderedDict
 from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Type, Union
 
 import gymnasium as gym
@@ -54,10 +53,10 @@ def _worker(
             elif cmd == "get_spaces":
                 remote.send((env.observation_space, env.action_space))
             elif cmd == "env_method":
-                method = getattr(env, data[0])
+                method = env.get_wrapper_attr(data[0])
                 remote.send(method(*data[1], **data[2]))
             elif cmd == "get_attr":
-                remote.send(getattr(env, data))
+                remote.send(env.get_wrapper_attr(data))
             elif cmd == "set_attr":
                 remote.send(setattr(env, data[0], data[1]))  # type: ignore[func-returns-value]
             elif cmd == "is_wrapped":
@@ -129,7 +128,7 @@ def step_wait(self) -> VecEnvStepReturn:
         results = [remote.recv() for remote in self.remotes]
         self.waiting = False
         obs, rews, dones, infos, self.reset_infos = zip(*results)  # type: ignore[assignment]
-        return _flatten_obs(obs, self.observation_space), np.stack(rews), np.stack(dones), infos  # type: ignore[return-value]
+        return _stack_obs(obs, self.observation_space), np.stack(rews), np.stack(dones), infos  # type: ignore[return-value]
 
     def reset(self) -> VecEnvObs:
         for env_idx, remote in enumerate(self.remotes):
@@ -139,7 +138,7 @@ def reset(self) -> VecEnvObs:
         # Seeds and options are only used once
         self._reset_seeds()
         self._reset_options()
-        return _flatten_obs(obs, self.observation_space)
+        return _stack_obs(obs, self.observation_space)
 
     def close(self) -> None:
         if self.closed:
@@ -206,27 +205,28 @@ def _get_target_remotes(self, indices: VecEnvIndices) -> List[Any]:
         return [self.remotes[i] for i in indices]
 
 
-def _flatten_obs(obs: Union[List[VecEnvObs], Tuple[VecEnvObs]], space: spaces.Space) -> VecEnvObs:
+def _stack_obs(obs_list: Union[List[VecEnvObs], Tuple[VecEnvObs]], space: spaces.Space) -> VecEnvObs:
     """
-    Flatten observations, depending on the observation space.
+    Stack observations (convert from a list of single env obs to a stack of obs),
+    depending on the observation space.
 
-    :param obs: observations.
+    :param obs_list: observations.
                 A list or tuple of observations, one per environment.
                 Each environment observation may be a NumPy array, or a dict or tuple of NumPy arrays.
-    :return: flattened observations.
-            A flattened NumPy array or an OrderedDict or tuple of flattened numpy arrays.
+    :return: Concatenated observations.
+            A NumPy array or a dict or tuple of stacked numpy arrays.
             Each NumPy array has the environment index as its first axis.
     """
-    assert isinstance(obs, (list, tuple)), "expected list or tuple of observations per environment"
-    assert len(obs) > 0, "need observations from at least one environment"
+    assert isinstance(obs_list, (list, tuple)), "expected list or tuple of observations per environment"
+    assert len(obs_list) > 0, "need observations from at least one environment"
 
     if isinstance(space, spaces.Dict):
-        assert isinstance(space.spaces, OrderedDict), "Dict space must have ordered subspaces"
-        assert isinstance(obs[0], dict), "non-dict observation for environment with Dict observation space"
-        return OrderedDict([(k, np.stack([o[k] for o in obs])) for k in space.spaces.keys()])
+        assert isinstance(space.spaces, dict), "Dict space must have ordered subspaces"
+        assert isinstance(obs_list[0], dict), "non-dict observation for environment with Dict observation space"
+        return {key: np.stack([single_obs[key] for single_obs in obs_list]) for key in space.spaces.keys()}  # type: ignore[call-overload]
     elif isinstance(space, spaces.Tuple):
-        assert isinstance(obs[0], tuple), "non-tuple observation for environment with Tuple observation space"
+        assert isinstance(obs_list[0], tuple), "non-tuple observation for environment with Tuple observation space"
         obs_len = len(space.spaces)
-        return tuple(np.stack([o[i] for o in obs]) for i in range(obs_len))  # type: ignore[index]
+        return tuple(np.stack([single_obs[i] for single_obs in obs_list]) for i in range(obs_len))  # type: ignore[index]
     else:
-        return np.stack(obs)  # type: ignore[arg-type]
+        return np.stack(obs_list)  # type: ignore[arg-type]
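
The dict branch of `_stack_obs` can be illustrated with a short standalone sketch. The observation keys and shapes below are hypothetical; only the dict comprehension mirrors the line added in the diff.

```python
import numpy as np

# Two environments, each returning a dict observation (hypothetical keys)
obs_list = [
    {"position": np.array([0.0, 1.0]), "image": np.zeros((2, 2))},
    {"position": np.array([2.0, 3.0]), "image": np.ones((2, 2))},
]

# Stack per key: the environment index becomes the first axis of each array
stacked = {key: np.stack([single_obs[key] for single_obs in obs_list]) for key in obs_list[0]}

position_shape = stacked["position"].shape
image_shape = stacked["image"].shape
```

Each entry keeps its per-environment shape and gains a leading axis of length 2, one slot per environment, which is why "stack" describes the operation better than "flatten".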