nslink.NetNS is possibly not safe to use in multithreaded applications #1213
Comments
Any update when this issue will be fixed? |
It might be fixed now in the master branch: the async core has been merged, and the whole NetNS functionality was rewritten. The documentation update is on its way. |
Please check whether it works for you now on the master branch (not released yet, so install/use from git). Notice that you can use The docs are being updated and will be merged/published asap. |
We are now using the current master branch in our test servers and will monitor over the next few days if any deadlocks occur (I do not expect so, the changes look like they should fix the issue). For other people experiencing the same problem, note that until the default is changed in Python 3.14, I assume you additionally need to set the default start method for processes to "spawn" for this to work: https://docs.python.org/3/library/multiprocessing.html#multiprocessing-start-methods |
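For readers who want to try the workaround described above, here is a minimal sketch of switching the process-wide start method to "spawn". It assumes, as the comment does, that the library honors the `multiprocessing` default; `use_spawn_by_default` is a hypothetical helper name, not part of pyroute2.

```python
import multiprocessing as mp


def use_spawn_by_default():
    """Make "spawn" the process-wide default start method."""
    # force=True allows the call even if a start method was already
    # selected earlier in this process; without it a second call raises
    # RuntimeError.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    # Call this once, early, before any threads or child processes exist.
    print(use_spawn_by_default())
```

With "spawn", the main module is re-imported in each child, so the call belongs under the `if __name__ == "__main__":` guard.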
Unfortunately, we weren't able to properly test it after all, because we were using the |
Oops, fixing! Thanks |
@philipp-karcher , |
Thanks, it seems to work now! Unfortunately, both spawn and forkserver seem to perform much worse than fork for us, so we will probably still need to contain our usage of NetNS to a single-threaded process anyway :/ |
@philipp-karcher unlike pre-0.9.x, the current master does not need any running child processes after fork/spawn: the child only needs to set the netns, open the socket, send the fd back to the parent, and exit. So there is still some room for optimization, since we don't need a correctly running Python interpreter in the child. If you can prepare a test case that I can use to collect metrics and optimize this particular routine, that would be great. |
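The "open the socket in a short-lived child and pass the fd back" pattern can be sketched roughly as follows. This is not the pyroute2 implementation: `open_socket_in_child` and its `setup` hook are hypothetical names, a UDP socket stands in for the netlink socket, and the namespace switch is left to the caller (for example via `os.setns()` on Python 3.12+ with sufficient privileges). It requires Python 3.9+ for `socket.send_fds` / `socket.recv_fds`.

```python
import os
import socket


def open_socket_in_child(setup=None):
    """Fork a short-lived child, open a socket there, hand it to the parent.

    `setup` is a hypothetical hook that runs in the child before the socket
    is created, e.g. to enter a network namespace.
    """
    parent_end, child_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: do the minimum and leave via os._exit(), so that no
        # cleanup handlers inherited from the parent run here.
        status = 1
        try:
            parent_end.close()
            if setup is not None:
                setup()
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            # The descriptor travels as SCM_RIGHTS ancillary data.
            socket.send_fds(child_end, [b"fd"], [sock.fileno()])
            status = 0
        finally:
            os._exit(status)
    # Parent: receive the descriptor, then reap the child.
    child_end.close()
    try:
        _data, fds, _flags, _addr = socket.recv_fds(parent_end, 16, 1)
    finally:
        parent_end.close()
        os.waitpid(pid, 0)
    if not fds:
        raise RuntimeError("child exited without sending a descriptor")
    # Wrap the received descriptor; family and type are detected from it.
    return socket.socket(fileno=fds[0])
```

The child exits as soon as the descriptor is sent, so nothing keeps running after the call returns; the socket stays bound to whatever namespace the child was in when it created it.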
Just executing this (with the namespaces already created) takes less than 1 second with fork, about 7 seconds with forkserver and about 10 seconds with spawn on my local machine (which seems to match the performance difference of about x7-10 on our test servers). |
Thanks. The issue looks quite serious to me, I'll try to find a solution asap. |
@philipp-karcher I tested several ways of dealing with netns, and so far I'm here: #1261 It's still fork-based, but:
Subprocess- and multiprocessing-based solutions are much slower, so I believe I will continue with this "hacky" one. By the weekend I plan to add a thread-enabled stress test to prove that the solution is reliable under such conditions. Another experiment, with garbage collection, is planned to improve stability further. |
So if you have any particular examples of threaded architecture, let me know and I'll include them in the testing. |
I think it's reasonable to prioritize performance here and continue to use fork, as long as these potential risks are documented somewhere.
I'm not sure that the deadlocks can really be prevented by changing the code (but tbh the details of this go a bit over my head): https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555/2
Unfortunately, deadlocks like this are very unpredictable (we didn't have any problems for years before this, and our code ran 24/7 on multiple servers), and reliability is important for us. So even if intensive testing showed no problems, we would still opt to rework our architecture and only use NetNS from a single-threaded process (or rather, we already have an ugly workaround that does this; it just needs to be properly implemented at some point). |
Okay, understood. The fork-based approach in this branch (#1261) is faster than before, and much safer, as it presumes that the child process is in a broken state by default. More safety measures will be introduced :) While I fully understand that issues in the child of a multithreaded process cannot be fixed from within Python code, they can still be mitigated on both sides (e.g. in the worst case I can safely kill a child after a timeout and try again if it fails, and so on). |
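The "kill the child after a timeout and try again" mitigation could look roughly like this. It is a sketch under stated assumptions, not the code from #1261: `run_in_child`, `task`, and the default timeout and retry values are all hypothetical.

```python
import os
import signal
import time


def run_in_child(task, timeout=1.0, retries=3, poll=0.01):
    """Run task() in a forked child; kill and retry if it hangs or fails."""
    for _attempt in range(retries):
        pid = os.fork()
        if pid == 0:
            # Child: treat any exception as failure and never return
            # into the parent's code path.
            status = 1
            try:
                task()
                status = 0
            finally:
                os._exit(status)
        deadline = time.monotonic() + timeout
        while True:
            done, wait_status = os.waitpid(pid, os.WNOHANG)
            if done:
                break
            if time.monotonic() >= deadline:
                # Assume the child is deadlocked: kill it and reap it.
                os.kill(pid, signal.SIGKILL)
                _, wait_status = os.waitpid(pid, 0)
                break
            time.sleep(poll)
        if os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0:
            return True
    return False
```

A deadlocked child costs at most `timeout` seconds per attempt here, at the price of polling in the parent; the parent itself is never blocked indefinitely.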
Thanks for the use case, btw. I have to focus more on reliability in the project. |
We have been dealing with very mysterious deadlocks in a multithreaded application, and I believe that nslink.NetNS is the culprit. It calls os.fork in its constructor, which is unsafe when mixed with threads (for this reason, calling os.fork from a multithreaded process emits a DeprecationWarning since Python 3.12):
https://docs.python.org/3/library/os.html#os.fork
https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555
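As a rough illustration of the condition in question, an application can check whether other threads are alive before it reaches a code path that forks. `fork_is_risky` is a hypothetical helper, and the check is only a heuristic.

```python
import threading


def fork_is_risky():
    """Return True if threads other than the caller are alive."""
    # Only threads known to the threading module are counted; threads
    # created directly by C extensions are invisible to this check.
    return threading.active_count() > 1
```

A caller could log a warning or defer namespace work to a dedicated single-threaded process when this returns True.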