Performance questions #96
Not much attention has been paid to performance in the bindings in this repository, so this isn't altogether surprising. I have never personally written high-performance Python, so I probably made a number of mistakes when writing these bindings. If you're looking for performance, though, I would recommend the Rust bindings rather than the Python bindings, as there we have indeed focused on performance and the overhead is significantly smaller.
Thanks. I was actually wondering whether wasm is a viable option for Python libraries that rely on C/C++/Rust: you would only need to build one wasm file and then load it with wasmtime-py, instead of generating many platform wheels, which can be a headache. So I wanted to compare performance as a first step. Thanks a lot for your response; I will close the issue.
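For context, a minimal sketch of that workflow (not from the issue; the `my_lib.wasm` file and its `add` export are made-up names): a single prebuilt wasm artifact loaded with wasmtime-py instead of per-platform wheels.

```python
# Hypothetical example: one prebuilt my_lib.wasm shipped for all platforms.
from wasmtime import Store, Module, Instance

store = Store()
module = Module.from_file(store.engine, "my_lib.wasm")  # the single build artifact
instance = Instance(store, module, [])

add = instance.exports(store)["add"]  # assumed exported function
print(add(store, 1, 2))
```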
That's definitely an intended use case for bindings like this, and I'll emphasize again that the cost here isn't intrinsic to these bindings or to Python; it's probably just that I don't know how to write high-performance Python. PRs for improvements are of course always welcome as well.
I've been doing some experiments to identify bottlenecks. The next obvious one is that there seems to be a fixed context-switch cost between Python and WASM (in either direction). I can demonstrate it with the following cdb_djp_hash.c: I estimated the cost of hashing strings of different lengths (13, 26, 130, 1300), and the surprising finding was that they all took the same time, ~40ms, regardless of the length of the string. This indicates that most of the 40ms is a constant switching cost, independent of the actual work done inside the loop. I've added some commented-out profiling to identify exactly where this time is spent: https://github.com/muayyad-alsadi/wasm-demos/blob/main/cdb_djp_hash/cdb_djp_hash.py#L88
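A sketch of that kind of measurement, assuming the `cdb_djp_hash_wasm` wrapper from the linked cdb_djp_hash.py is importable (only the timing logic is shown here):

```python
import time

def per_call_time(fn, arg, n=100):
    """Average wall-clock time of one call to fn(arg)."""
    start = time.perf_counter()
    for _ in range(n):
        fn(arg)
    return (time.perf_counter() - start) / n

if __name__ == "__main__":
    from cdb_djp_hash import cdb_djp_hash_wasm  # assumed import, see the linked repo
    for length in (13, 26, 130, 1300):
        s = "a" * length
        # If this prints roughly the same time for every length, the cost is
        # dominated by the Python<->WASM boundary rather than the hashing loop.
        print(length, per_call_time(cdb_djp_hash_wasm, s))
```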
Doing 10k calls:

```python
from cProfile import Profile

pr = Profile()
pr.enable()
for i in range(10000):
    cdb_djp_hash_wasm(large_a)
pr.disable()
# pr.print_stats()
# pr.print_stats('cumulative')
pr.print_stats('calls')
```

So, out of the ~800ms total, only 169ms was taken by the actual WASM call.

And when doing it with `np_mem = np.frombuffer(fast_mem.get_buffer_ptr(), dtype=np.uint8)`: since we are doing 10k calls I should see ~10k operations, but what does not make sense is seeing 210k operations.
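For reference, a minimal sketch of that zero-copy NumPy view over wasm linear memory. The `get_buffer_ptr` call above comes from the commenter's patched bindings; this sketch assumes only the public `Memory.data_ptr` / `Memory.data_len` API, and the one-page WAT module is a made-up example:

```python
import ctypes
import numpy as np
from wasmtime import Store, Module, Instance

store = Store()
# Made-up module that just exports one page of linear memory.
module = Module(store.engine, '(module (memory (export "mem") 1))')
instance = Instance(store, module, [])
mem = instance.exports(store)["mem"]

length = mem.data_len(store)                 # size of the memory in bytes
ptr = mem.data_ptr(store)                    # POINTER(c_ubyte) into linear memory
buf = ctypes.cast(ptr, ctypes.POINTER(ctypes.c_ubyte * length)).contents
np_mem = np.frombuffer(buf, dtype=np.uint8)  # zero-copy view, no per-byte Python calls
# Note: the view is invalidated if the wasm memory grows.

np_mem[:4] = [1, 2, 3, 4]                    # bulk writes go straight into wasm memory
print(np_mem[:8])
```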
TL;DR: actionable items
I think I fixed that in #137. @alexprengere, would you please test my approach and report how much performance gain it achieves?
I was able to eliminate the 40ms of wasted time. Here is the code:

```python
# gcd_alt.py
import ctypes

from wasmtime import Store, Module, Instance, WasmtimeError
from wasmtime import _ffi as ffi
from wasmtime._func import enter_wasm
from wasmtime._bindings import wasmtime_val_raw_t

store = Store()
module = Module.from_file(store.engine, './gcd.wat')
instance = Instance(store, module, [])


def func_init(func, store):
    # Precompute the raw ctypes layout for this function's params/results
    # so each call can skip the generic type-checking/conversion path.
    ty = func.type(store)
    ty_params = ty.params
    ty_results = ty.results
    params_str = [str(i) for i in ty_params]  # a list, not a generator, so it survives repeated calls
    params_n = len(ty_params)
    results_n = len(ty_results)
    n = max(params_n, results_n)
    raw_type = wasmtime_val_raw_t * n
    func.raw_type = raw_type

    def _create_raw(*params):
        raw = raw_type()
        for i, param_str in enumerate(params_str):
            setattr(raw[i], param_str, params[i])
        return raw

    func._create_raw = _create_raw


_gcd_in = instance.exports(store)["gcd"]
func_init(_gcd_in, store)


def gcd(a, b):
    raw = _gcd_in._create_raw(a, b)
    raw_ptr_casted = ctypes.cast(raw, ctypes.POINTER(wasmtime_val_raw_t))
    with enter_wasm(store) as trap:
        # Call the export through the unchecked FFI entry point to avoid
        # the per-call overhead of the high-level Func.__call__ wrapper.
        error = ffi.wasmtime_func_call_unchecked(
            store._context,
            ctypes.byref(_gcd_in._func),
            raw_ptr_casted,
            trap)
        if error:
            raise WasmtimeError._from_ptr(error)
    return raw[0].i32


print("gcd(6, 27) = %d" % gcd(6, 27))
```

And here is the benchmark:

```python
import wasmtime.loader
import time
from math import gcd as math_gcd

from gcd import gcd as wasm_gcd
from gcd_alt import gcd as wasm_gcd_alt


def python_gcd(x, y):
    while y:
        x, y = y, x % y
    return abs(x)


N = 1_000
for gcdf in math_gcd, python_gcd, wasm_gcd, wasm_gcd_alt:
    start_time = time.perf_counter()
    for _ in range(N):
        g = gcdf(16516842, 154654684)
    total_time = time.perf_counter() - start_time
    print(total_time)
```

And here is the result:
We have three types of performance bottlenecks. I've addressed the first two, and I'll create a ticket with details for the last one.
Incredible work! I was trying to see what kind of overhead there is to call a wasm function from Python.
I am using WSL2 on Windows, with a recent Fedora and CPython 3.10.
Re-using the exact gcd.wat from the examples, it looks like any call has a "cost" of about 30μs.
The code I used is a simple timer to compare performance:
This returns about:
Note that I tested this with an empty "hello world", and the 30μs are still there.
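A sketch of how that fixed overhead can be isolated with a no-op export, so that nothing but the Python-to-WASM boundary is timed (an illustration, not the exact snippet used here):

```python
import time
from wasmtime import Store, Module, Instance

store = Store()
# A no-op export: the call does no work, so the measured time is all call overhead.
module = Module(store.engine, '(module (func (export "noop")))')
instance = Instance(store, module, [])
noop = instance.exports(store)["noop"]

N = 10_000
start = time.perf_counter()
for _ in range(N):
    noop(store)
per_call = (time.perf_counter() - start) / N
print(f"~{per_call * 1e6:.1f} us per call")  # roughly the fixed boundary cost
```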
I am wondering about 2 things: