
Conversation

blester125
Copy link
Collaborator

This PR adds an `asyncify` function to try to turn sync code into async code. Still testing if this speeds things up.
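For reference, a minimal sketch of what such an `asyncify` wrapper could look like, using the stdlib event loop's default thread pool (this is an illustration, not the PR's actual implementation; `slow_hash` is a hypothetical stand-in for the numba hash):

```python
import asyncio
import functools

def asyncify(fn):
    """Sketch: run a blocking function in the event loop's default
    thread pool so other async tasks can run in the meantime."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(fn, *args, **kwargs)
        )
    return wrapper

@asyncify
def slow_hash(data: bytes) -> int:
    # Hypothetical stand-in for the blocking (numba-accelerated) LSH hash.
    return sum(data) % 257

async def main():
    # The hash calls now run in worker threads, so the loop stays free
    # to service other awaitables while they compute.
    return await asyncio.gather(*(slow_hash(b"tensor") for _ in range(4)))

results = asyncio.run(main())
```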
@blester125
Copy link
Collaborator Author

Currently there seem to be issues with the threading library numba uses. On my Linux box the `omp` module is used, but it isn't available on OSX, so the default workqueue layer has issues there. There is also an Intel TBB library, but there are issues getting that installed (I had to install it system-wide on Linux, and on OSX the suggested `pip install tbb` doesn't work).

@blester125
Copy link
Collaborator Author

blester125 commented Apr 21, 2023

Some clarifications:

  • The default threading library numba uses can't handle a threaded numba function (one using `prange`, etc.) being called from different threads. This is what happens when we asyncify the hash using the thread executor.
  • TBB was just the first threading library I tried, as it was supposed to be pip-installable. Using the OpenMP library is another option; it is just harder to install.
    • Having to install prerequisites beyond pip can induce a lot of friction, which we already have some of from needing git-lfs.
  • We want to async the hash because it is currently serial: for the whole runtime from the last await, through hashing, to the next await, no other async task can run (i.e. we can't start shoving a tensor through a pipe to git-lfs or run async interactions with tensorstore).
  • We talked about using multiprocessing; the main issue with that is we would be working in m = # of processes chunks, and long IO tasks like piping tensors to git-lfs would not have concurrency beyond that m.
    • Scaling m > `mp.cpu_count()` could help on the IO-bound parts but would probably cause contention in the compute-bound parts.
    • Additionally, we would need to be careful that the parameter tree is not loaded before the `mp.Pool` is created, otherwise each process would have a copy of the parameters.
    • This would also add an extra IPC cost, as parameters would need to be passed from the main process to each worker, or we would need to use a shared memory array and some sort of numpy translation layer (and ensure it doesn't result in copies).
      • I did some research on shared memory (we could use the `RawArray` class, as we don't need to lock it; the work is embarrassingly parallel). It seems like we could use this in a smudge (as we know the shapes of the parameters from the metadata), but during clean we would need to use IPC (if we load the model to get the shapes, the model parameters would be copied in the fork and each process would have a full copy of the model :()
  • We want to test whether the git-lfs filter process is able to handle concurrent requests.
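For concreteness, here is a zero-copy sketch of the `RawArray` idea from the shared-memory bullet, assuming the parameter shape is known from metadata as it would be during smudge (the shape and names here are illustrative):

```python
import numpy as np
from multiprocessing.sharedctypes import RawArray

# During smudge the parameter shape is known from the stored metadata,
# so the shared buffer can be allocated up front. RawArray carries no
# lock, which is fine here: the workload is embarrassingly parallel.
shape = (4, 8)
buf = RawArray('f', shape[0] * shape[1])

# Zero-copy numpy view over the shared buffer: writes through `params`
# land directly in memory that fork-inherited workers would also see.
params = np.frombuffer(buf, dtype=np.float32).reshape(shape)
params[:] = 1.0
```

Note that a `RawArray` can't be pickled and sent to pool workers as an argument; workers would need to inherit it via fork (or receive it through a pool initializer), which is part of why the clean path, where shapes aren't known up front, is awkward.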

@blester125 blester125 linked an issue Apr 21, 2023 that may be closed by this pull request
@blester125 blester125 marked this pull request as draft April 21, 2023 20:06
@craffel
Copy link
Contributor

craffel commented Apr 25, 2023

The default threading library numba uses can't handle a threaded numba function (one using `prange`, etc.) being called from different threads. This is what happens when we asyncify the hash using the thread executor.

...

We want to async the hash because it is currently serial: for the whole runtime from the last await, through hashing, to the next await, no other async task can run (i.e. we can't start shoving a tensor through a pipe to git-lfs or run async interactions with tensorstore).

The latter is caused by the former, right? I wonder if we can use a different library than numba for acceleration that might better support the threading we want to do. If we are only using numba for accelerating a slow for loop or something, we could use some other library (maybe Cython or something, no idea which ones support what we want). Or is the issue that numba is already doing multithreading, so we can't do additional threading during the numba call?

@blester125
Copy link
Collaborator Author

Currently our LSH hash is implemented in numba and is already using multithreading (to parallelize a for loop; this also avoids materializing a large intermediate array), but that threading is all encapsulated within numba.

From the perspective of the caller, this hash call is blocking: it uses threads under the hood, but the caller needs to sit there until it is done. Normally in async code, the concurrency comes from waiting (on things like IO, or on this hash to finish), but the hash never says "ok, I'm waiting, you can run other things."

The default way to "asyncify" something like that is to run it in another thread. This gives us back a wrapper that will say "ok, I'm waiting" until the computation is done in that thread, so we can await on that in async code and everyone is happy.

The issue is that this means we are now launching multithreaded numba code from multiple threads, which numba's default threading layer can't deal with. Different threading layers enable different safety guarantees (see the numba threading-layer docs). For example, the OpenMP layer on Linux can handle multithreaded numba being called from multiple threads, but it couldn't handle multithreaded numba being called from multiple processes.
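For what it's worth, numba's documented `NUMBA_THREADING_LAYER` environment variable lets you request a specific layer up front; it has to be set before numba is imported, and whether `omp` is actually available still depends on the platform, per the discussion above:

```python
import os

# Request a thread-safe layer before numba is imported. Per the numba
# docs, the GNU OpenMP ("omp") and TBB ("tbb") layers are thread-safe,
# while the default fallback ("workqueue") is not; of these, only TBB
# is also fork-safe, matching the behavior described above.
os.environ["NUMBA_THREADING_LAYER"] = "omp"
```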

@craffel
Copy link
Contributor

craffel commented Apr 26, 2023

Hm, I see. While there may be a solution, this seems like something that we can punt on for now.

Development

Successfully merging this pull request may close these issues.

Ability to Async blocking computations