This repository was archived by the owner on Oct 31, 2023. It is now read-only.

Conversation

Aladoro
Contributor

@Aladoro commented Jan 15, 2022

The default replay buffer requires very high RAM and results in frequent crashes due to PyTorch's data-loader memory leak issue. Thus, I have efficiently re-implemented DrQv2's replay buffer entirely in NumPy, taking only about 20 GB of RAM to store all 1,000,000 transitions. Moreover, with this implementation, there is no need to wait for a trajectory to be completed before adding new transitions to the memory used for sampling.

FPS of this NumPy implementation appears to be identical (perhaps very slightly higher) on all machines I have tested it on. Potentially, this could also lead to (very minimal) performance gains, since the agent can now sample replay transitions from its latest trajectory.

I have kept the original dataloader replay buffer as the default. The new replay buffer can be used by running train.py with the replay_buffer=numpy option.
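In rough terms, the idea is the following (a simplified sketch, not the exact code in the diff): frames are stored once in a uint8 NumPy ring buffer, and the stacked observations are rebuilt at sample time. The shapes, frame_stack=3, action dimension, and the boundary handling are illustrative assumptions.

```python
import numpy as np


class NumpyReplayBuffer:
    """Sketch of a NumPy ring buffer that stores each 3 x 84 x 84 frame once."""

    def __init__(self, capacity, frame_shape=(3, 84, 84), action_dim=6, frame_stack=3):
        self.capacity = capacity
        self.frame_stack = frame_stack
        self.frames = np.empty((capacity, *frame_shape), dtype=np.uint8)
        self.actions = np.empty((capacity, action_dim), dtype=np.float32)
        self.rewards = np.empty((capacity,), dtype=np.float32)
        self.dones = np.empty((capacity,), dtype=bool)
        self.idx = 0
        self.full = False

    def add(self, frame, action, reward, done):
        # Transitions become available for sampling immediately,
        # without waiting for the episode to finish.
        self.frames[self.idx] = frame
        self.actions[self.idx] = action
        self.rewards[self.idx] = reward
        self.dones[self.idx] = done
        self.idx = (self.idx + 1) % self.capacity
        self.full = self.full or self.idx == 0

    def _stacked(self, indices):
        # Gather frame_stack consecutive frames ending at each index and
        # flatten them into a 9 x 84 x 84 observation.
        offsets = np.arange(-self.frame_stack + 1, 1)
        stacked_idx = (indices[:, None] + offsets) % self.capacity
        frames = self.frames[stacked_idx]  # (B, stack, C, H, W)
        return frames.reshape(len(indices), -1, *frames.shape[-2:])

    def sample(self, batch_size):
        high = self.capacity if self.full else self.idx
        # Simplified: ignores episode boundaries and buffer wrap-around.
        idx = np.random.randint(self.frame_stack, high - 1, size=batch_size)
        obs = self._stacked(idx)
        next_obs = self._stacked(idx + 1)
        return obs, self.actions[idx], self.rewards[idx], self.dones[idx], next_obs
```

At 1,000,000 transitions of 3 x 84 x 84 uint8 frames, the image storage alone comes to roughly 21 GB, consistent with the ~20 GB figure above.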

@facebook-github-bot added the CLA Signed label on Jan 15, 2022
@denisyarats
Contributor

Hm, this is very similar to the replay buffer that I have in the original DrQ (https://github.com/denisyarats/drq/blob/master/replay_buffer.py).
The reason I decided to switch to PyTorch dataloaders is to take advantage of pin_memory and to offload the CPU->GPU data copy into a separate thread. In my experimentation this showed significant training time gains.
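The pattern looks roughly like this (a simplified sketch, not the exact code in this repo): an IterableDataset that samples transitions, wrapped in a DataLoader whose worker processes and pin_memory=True let host-side batching and pinned-memory copies overlap with training. The ReplayDataset name and parameters are illustrative.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader


class ReplayDataset(IterableDataset):
    def __init__(self, sample_fn):
        self.sample_fn = sample_fn  # returns one transition as NumPy arrays

    def __iter__(self):
        while True:
            yield self.sample_fn()


def make_loader(sample_fn, batch_size=256, num_workers=4):
    dataset = ReplayDataset(sample_fn)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,  # sampling happens outside the main process
        pin_memory=True,          # enables fast, asynchronous host->GPU copies
    )


# In the training loop, pinned tensors can then be moved with non_blocking=True
# so the copy overlaps with computation already queued on the GPU:
# batch = next(loader_iter)
# batch = [t.cuda(non_blocking=True) for t in batch]
```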

@Aladoro
Contributor Author

Aladoro commented Jan 15, 2022

The replay buffer in the original DrQ is actually quite different and should use around six times the amount of RAM, since each 9 x 84 x 84 observation is saved in both the observation and next_observation NumPy arrays, while my implementation saves each 3 x 84 x 84 frame only once.
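The factor works out as follows (a quick illustrative calculation, assuming uint8 pixels and the 1,000,000-transition capacity):

```python
# Per-step image storage for the two layouts (uint8 pixels).
frame = 3 * 84 * 84            # one 3 x 84 x 84 frame, stored once per step
stacked = 9 * 84 * 84          # one 9 x 84 x 84 stacked observation

per_step_new = frame           #  21,168 bytes
per_step_old = 2 * stacked     # 127,008 bytes: obs and next_obs both stored

print(per_step_old / per_step_new)      # -> 6.0
print(per_step_new * 1_000_000 / 1e9)   # -> ~21 GB of image data for 1M steps
```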

I have tested my implementation extensively on both my home and lab machines and have not experienced any slowdown whatsoever ^^

@dvstter

dvstter commented May 12, 2023

I've experienced crashes multiple times. The only clue points to the dataloader, but I cannot locate the error. Thanks for your contribution, I'd like to try your implementation!
