I've been running into an issue where run_distributed() never completes execution. I've encountered this both with and without trying to save the weights during a run. With verbose turned on, it looks like one worker completes its task before the other, which causes the program to lock up. I've included a screenshot of the REPL for 2 workers, with verbose = true and 4 cycles. After several hours, I end up killing the process, as there is no other way to continue.
That is odd. Do you ever see the remaining processes finish their work? It could be that an error is getting consumed by the tasks, so that syncgrads never reaches the state where it closes all the channels and returns. Could you add some more context on how to reproduce this?
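To illustrate the kind of failure described above, here is a minimal sketch in plain Julia. It is not taken from this package: `take_with_bind` and the simulated worker are hypothetical names. It assumes only Base's `Channel`, `@async`, and `bind`. A task started with `@async` keeps its exception to itself, so a consumer waiting on the channel would block forever. Binding the channel to the task closes the channel when the task ends and passes the failure on to whoever is waiting.

```julia
# Hypothetical helper: run one simulated worker and report what the consumer sees.
function take_with_bind()
    results = Channel{Int}(1)

    # Simulated worker that fails before it can publish a result.
    task = @async begin
        error("simulated worker failure")
        put!(results, 1)
    end

    # Without this line, the `take!` below would wait forever, because the
    # task's exception is stored in the Task and never reaches the consumer.
    bind(results, task)

    try
        return take!(results)
    catch err
        # The channel was closed because of the failed task, so the consumer
        # gets an exception instead of hanging.
        return err
    end
end

outcome = take_with_bind()
println("consumer received: ", typeof(outcome))
```

On Julia 1.7 and later, calling `errormonitor(task)` on each spawned task is another option: it logs the failure as soon as the task dies, which can be easier to read than verbose output.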
Even after waiting several hours, the remaining processes never finish saving. I face a similar issue with saveweights=false and the number of cycles set to 1.
I agree with your thoughts on the problem. Is there a better way than the verbose output to see what each task is doing?
I have updated the project significantly so that we can use tasks to communicate with the GPUs instead, but the process-based methods are still in there to facilitate multi-node training. If you're interested, it would be cool to check out the newest changes, which should be significantly more efficient and faster.