Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nslink.NetNS is possibly not safe to use in multithreaded applications #1213

Open
philipp-karcher opened this issue Oct 1, 2024 · 16 comments

Comments

@philipp-karcher
Copy link

We have been dealing with very mysterious deadlocks in a multithreaded application, and I believe that nslink.NetNS is the culprit. It calls os.fork in its constructor, which is unsafe when mixed with threads (and for this reason deprecated in Python 3.12):

https://docs.python.org/3/library/os.html#os.fork
https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555

@m4123
Copy link

m4123 commented Dec 11, 2024

Any update when this issue will be fixed?

@svinota
Copy link
Owner

svinota commented Dec 20, 2024

It might be fixed now in the master branch after merging the async core, the whole NetNS functionality was rewritten.

The documentation update is on its way.

@svinota
Copy link
Owner

svinota commented Jan 15, 2025

Please check if it works for you now, see the master branch (now released yet, so install/use from git).

Notice that you can use IPRoute(netns='...') instead of NetNS('...'); also AsyncIPRoute() is available, so you can use the library now in the asyncio-based projects.

The docs are being updated, will be merged/published asap.

@philipp-karcher
Copy link
Author

We are now using the current master branch in our test servers and will monitor over the next few days if any deadlocks occur (I do not expect so, the changes look like they should fix the issue).

For other people experiencing the same problem, note that until the default is changed in Python 3.14, I assume you additionally need to set the default start method for processes to "spawn" for this to work: https://docs.python.org/3/library/multiprocessing.html#multiprocessing-start-methods

@philipp-karcher
Copy link
Author

Unfortunately, we weren't able to properly test it after all, because we were using the remove method of NetNS, which appears to have been removed.

@svinota
Copy link
Owner

svinota commented Jan 27, 2025

Oops, fixing! Thanks

svinota added a commit that referenced this issue Jan 27, 2025
svinota added a commit that referenced this issue Jan 27, 2025
iproute: bring NetNS.remove() back

Bug-Url: #1260
Bug-Url: #1213
@svinota
Copy link
Owner

svinota commented Jan 27, 2025

@philipp-karcher , NetNS.remove() should be fixed now

@philipp-karcher
Copy link
Author

Thanks, it seems to work now!

Unfortunately, both spawn and forkserver seem to perform much worse than fork for us, so we will probably still need to contain our usage of NetNS to a single-threaded process anyway :/

@svinota
Copy link
Owner

svinota commented Jan 27, 2025

@philipp-karcher unlike pre-0.9.x, the current master does not need any running child processes after fork/spawn — the child only need to set the netns, open the socket, and send the fd back to the parent — and exit after that.

So there is still some room for optimization, we don't need any correctly running Python interpreter in the child.

If you prepare some testcase that I can use to collect the metrics and optimize this particular routine, it would be great.

@philipp-karcher
Copy link
Author

method = "fork"
count = 100
multiprocessing.set_start_method(method)
start = time.time()
for i in range(count):
    NetNS(f"T{i}")
print(f"Creating {count} NetNS objects took {time.time() - start} seconds with start method {method}".)

Just executing this (with the namespaces already created) takes less than 1 second with fork, about 7 seconds with forkserver and about 10 seconds with spawn on my local machine (which seems to match the performance difference of about x7-10 on our test servers).

@svinota
Copy link
Owner

svinota commented Jan 28, 2025

Thanks. The issue looks quite serious to me, I'll try to find a solution asap.

@svinota
Copy link
Owner

svinota commented Jan 31, 2025

@philipp-karcher I tested several ways of dealing with netns, and so far I'm here: #1261

It's still fork-based, but:

  1. the code in the child after fork() doesn't rely on variables that could lead to deadlocks; actually, the child is as reduced as possible
  2. and the child process is discarded as soon as it sends back the socket FD (one of the key differences from pre-0.9.1)
  3. one extra fork() in netns.create() is eliminated (another key difference)

Subprocess and multiprocessing-based solutions are much slower, so I believe I will continue with this "hacky" one.

Until the weekend I plan to introduce some thread-enabled stress test to prove that the solution is reliable under such conditions. Another experiment is planned with the garbage collection, again to improve the stability.

@svinota
Copy link
Owner

svinota commented Jan 31, 2025

So if you have any particular examples of threaded architecture — let me know, I'll include them in the testing

@philipp-karcher
Copy link
Author

I think it's reasonable to prioritize the performance here and continue to use fork, as long as this potential risks will be documented somewhere.

the code in the child after fork() doesn't rely on variables that could lead to deadlocks; actually, the child is as reduced as possible

I'm not sure that the deadlocks can really be prevented by changing the code (but tbh the details of this go a bit over my head): https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555/2

While that is a great thing for you to do, unfortunately on POSIX doing enough isn’t actually possible. It is impossible to safely code threads that can be used in an application that calls fork() in Python because the CPython runtime itself is by definition unsafe for use after fork(). No C API other than those listed as “async-signal-safe” in signal-safety(7) - Linux manual page and the related POSIX standard(s) can be presumed compatible with use in a forked child process. The CPython runtime requires way more.

Unfortunately, deadlocks like this are very unpredictable (we didn't have any problems for years before this, and our code ran 24/7 on multiple servers) and reliability is important for us. So even if intensive testing would show no problems, we would still opt to just rework our architecture and only use NetNS from a single-threaded process (or rather we already have an ugly workaround that does this, it just needs to be properly implemented at some point).

@svinota
Copy link
Owner

svinota commented Jan 31, 2025

Okay, understand.

The fork-based approach in this branch (#1261) is faster than before; and is much safer as it presumes that the child process is in broken state by default. And more safety measures will be introduced :)

While totally understanding that issues in the child of a multithreaded process can not be fixed from within python code, they still can be mitigated on both sides (e.g. in the worst case I can safely kill a child after a timeout and try again if it fails, and so on)

@svinota
Copy link
Owner

svinota commented Jan 31, 2025

Thanks for the use case, btw. I have to focus more on reliability in the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants