Hanging after trying to save weights and not completing execution of script #6

Open
twfroelich opened this issue Dec 2, 2021 · 3 comments

@twfroelich

Hello,

I've been running into an issue where run_distributed() never completes execution. I've encountered this both with and without saving the weights during a run. With verbose turned on, it looks like one worker completes its task before the others, which causes the program to lock up. I've included a screenshot of the REPL for 2 workers, with verbose = true and 4 cycles. After several hours, I end up killing the process, as there is no other way to continue.

[Image: screenshot of the REPL output for 2 workers, verbose = true, 4 cycles]

@DhairyaLGandhi
Owner

That is odd. Do you ever see the remaining processes finish their work? It could be that an error is getting consumed by the tasks, so that syncgrads never reaches the state where it closes all the channels and returns. It would help to have more context on how to reproduce this.
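To illustrate the failure mode described above, here is a minimal sketch in plain Julia. It is not the project's syncgrads code: `failing_worker` and `results` are hypothetical names, and the error is simulated. It shows how an exception inside an `@async` task goes unnoticed by a consumer waiting on a channel, and how `bind` turns that silent hang into a visible error.

```julia
# Hypothetical worker: it fails before it ever puts a result on the channel.
function failing_worker(ch::Channel)
    error("simulated worker failure")
    put!(ch, :done)  # never reached
end

results = Channel{Symbol}(1)
task = @async failing_worker(results)

# Without this line, the `take!` below would block forever: the task has
# died, but nothing ever closes the channel it was supposed to write to.
# `bind` ties the channel's lifetime to the task, so when the task fails
# the channel is closed and waiters receive an exception.
bind(results, task)

outcome = try
    take!(results)
catch err
    err  # the failure is now visible to the consumer instead of a hang
end

println("task failed: ", istaskfailed(task))
println("consumer saw an exception: ", outcome isa Exception)
```

If something similar is happening here, the consumer side would keep waiting on a channel whose producer has already errored, which would match the symptoms in the report.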

@twfroelich
Author

Even after waiting several hours, the remaining processes never finish saving. I see a similar issue when setting saveweights=false and the number of cycles to 1.

I agree with your thoughts on the problem. Is there a better way than the verbose output to see what each task is doing?
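One possible approach, sketched here under assumptions, is to wrap each task body so that its start, finish, and any exception are logged with a name. The helper `monitored` and the worker names are hypothetical and not part of the package. The sketch uses only Base Julia (`@async`, the logging macros, `catch_backtrace`, `istaskfailed`).

```julia
# Hypothetical helper: run `f` in a task and log its lifecycle under `name`.
function monitored(f, name::AbstractString)
    return @async try
        @info "task started" name
        result = f()
        @info "task finished" name
        result
    catch err
        # Log the error with its backtrace, then rethrow so the task is
        # still marked as failed for anyone who waits on it.
        @error "task failed" name exception = (err, catch_backtrace())
        rethrow()
    end
end

ok = monitored(() -> sum(1:10), "worker-1")
bad = monitored(() -> error("simulated failure"), "worker-2")

wait(ok)
try
    wait(bad)  # waiting on a failed task throws, so guard it
catch
end

# Summarise which tasks failed.
statuses = Dict("worker-1" => istaskfailed(ok), "worker-2" => istaskfailed(bad))
println(statuses)
```

On Julia 1.7 and later, `errormonitor(task)` offers a similar effect with less code by printing an error log when the task fails.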

@DhairyaLGandhi
Owner

I have updated the project significantly so that we can use tasks instead of processes to communicate with the GPUs, but the process-based methods are still in there to facilitate multi-node training. If you're interested, it would be cool to check out the newest changes, which should be significantly more efficient and faster.
