Snapshot for preemption #155
base: master
Thanks for your PR! It is really good to have better resumability.
General comments on resumability
First, let me summarize what I think needs to be done to achieve resumability. Please comment if I missed something. I checked the points supported by this PR.
Things that need to be snapshotted for resumability, excluding randomness:
RNG-related things that need to be snapshotted for complete resumability:
This is a large list, and it would be a tough task to support all of it. I think it is ok to start supporting only part of it if
Specific comments on this PR
|
Thank you for the detailed comments!!
What I skip in this PR
What I implement
|
/test |
Successfully created a job for commit dde7ebf: |
Sorry, I fixed the linter problem |
(I forgot to write this) |
Hi! Is there any action required for this PR to be merged? |
pfrl currently does not support snapshotting a training run, which is important on many job systems such as Kubernetes.
This PR adds support for saving and loading a snapshot, including the replay buffer.
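To make the idea concrete, here is a minimal sketch of the save/load cycle, using only the standard library. It is not the code in this PR and does not use pfrl's API: `save_snapshot`, `load_snapshot`, the `snapshot.pkl` file name, and the `deque`-based buffer are all hypothetical names chosen for illustration. The snapshot is written to a temporary file and then renamed, on the assumption that a job may be preempted in the middle of a write.

```python
import os
import pickle
import tempfile
from collections import deque

SNAPSHOT_NAME = "snapshot.pkl"  # hypothetical file name, not from the PR


def save_snapshot(dirname, step, replay_buffer):
    """Atomically write the current step and replay buffer contents."""
    os.makedirs(dirname, exist_ok=True)
    state = {"step": step, "replay_buffer": list(replay_buffer)}
    # Write to a temporary file in the same directory, then rename it, so a
    # preemption during the write never leaves a truncated snapshot behind.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, os.path.join(dirname, SNAPSHOT_NAME))


def load_snapshot(dirname, capacity):
    """Return (step, replay_buffer); start fresh if no snapshot exists."""
    path = os.path.join(dirname, SNAPSHOT_NAME)
    if not os.path.exists(path):
        return 0, deque(maxlen=capacity)
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], deque(state["replay_buffer"], maxlen=capacity)
```

A training loop would call `load_snapshot` once at startup and `save_snapshot` at every checkpoint interval. Model and optimizer parameters, as well as RNG state, would need to be stored alongside these fields for a complete resume.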
Done
python examples/gym/train_dqn_gym.py --env CartPole-v0 --steps=5000 --eval-n-runs=10 --eval-interval=1000 --load_snapshot --checkpoint-freq=1000
Not Done
Could you check the current implementation strategy and give some ideas on how to implement the above points?