Commit 68b61d9

fix: correct output buffer sizing — use per-thread cap, not total
output_cap was n*64 (total entries), but the kernel indexes the buffer as tid*output_cap, so the allocation grew to n*n*64*5 uint32s (~128 GB at n=10000). Fixed to 64 entries per thread, which gives n*64*5 uint32s = 12.8 MB at n=10000.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
1 parent 24ea722 commit 68b61d9

1 file changed: +3 -5 lines changed

emojiasm/gpu.py

Lines changed: 3 additions & 5 deletions
@@ -431,13 +431,11 @@ def gpu_run(
     # Conservative estimate: allow up to 64 output entries per thread.
     is_tier2 = tier == 2
     if is_tier2:
-        output_cap = n * 64
+        max_out_per_thread = 64  # max output entries per thread
     else:
-        output_cap = 0
+        max_out_per_thread = 0
 
-    # Max output entries per thread for Tier 2
-    max_out_per_thread = output_cap
-    output_cap_array = mx.array([output_cap], dtype=mx.uint32)
+    output_cap_array = mx.array([max_out_per_thread], dtype=mx.uint32)
 
     # Get (cached) kernel
     kernel = _get_kernel()
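
The sizing arithmetic in the commit message can be checked with a small sketch. This is not code from emojiasm/gpu.py: the helper name output_buffer_bytes is hypothetical, and the 5 uint32 words per output entry are an assumption taken from the "*5" factor in the commit message. It only illustrates why passing a total count where the kernel expects a per-thread stride multiplies the allocation by n.

```python
UINT32_BYTES = 4
WORDS_PER_ENTRY = 5  # assumption: inferred from the "*5" factor in the commit message


def output_buffer_bytes(n_threads, per_thread_cap, words_per_entry=WORDS_PER_ENTRY):
    """Bytes needed when thread `tid` writes entries starting at tid * per_thread_cap.

    Hypothetical helper for illustration; it is not part of emojiasm.
    """
    total_entries = n_threads * per_thread_cap
    return total_entries * words_per_entry * UINT32_BYTES


n = 10_000

# Before the fix: the "cap" passed to the kernel was the total (n * 64),
# yet the kernel used it as a per-thread stride.
buggy = output_buffer_bytes(n, per_thread_cap=n * 64)

# After the fix: the cap is a true per-thread limit of 64 entries.
fixed = output_buffer_bytes(n, per_thread_cap=64)

print(f"buggy: {buggy / 1e9:.1f} GB")
print(f"fixed: {fixed / 1e6:.1f} MB")
```

Under these assumptions the two results differ by exactly a factor of n, which matches the ~128 GB and 12.8 MB figures quoted in the commit message.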

0 commit comments
