feat: Cold L2 Cache Benchmarking with Rotating Buffers #2213
base: main
Conversation
Walkthrough
Refactors benchmark/test wrappers to accept explicit per-run inputs and introduces rotating-buffer utilities for cold-L2 benchmarking in `flashinfer/testing/utils.py`. Threads `rotate_buffers` and `input_args`/`input_kwargs` through `bench_gpu_time`, CUDA-graph, CUPTI, and CUDA-event timing paths, and updates the attention, gemm, and moe benchmark routines to pass richer argument sets into their `run_backend`/`run_backend_wrapper` functions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Bench as Bench Runner
    participant BenchAPI as bench_gpu_time
    participant Rotator as RotatingBufferMgr
    participant CUGraph as CUDA Graph / Capture
    participant CUPTI as CUPTI timing (optional)
    participant GPU as GPU / Backend
    Bench->>BenchAPI: request timing(fn, rotate_buffers=True, input_args)
    BenchAPI->>Rotator: detect GPU tensors & compute rotations
    Rotator-->>BenchAPI: rotated input copies (N)
    BenchAPI->>CUGraph: capture graph with rotated inputs (iter 1..N)
    CUGraph->>GPU: replay captures per rotation
    alt CUPTI available
        BenchAPI->>CUPTI: measure with input_args (rotated if enabled)
        CUPTI->>GPU: timed runs
    end
    BenchAPI->>Bench: aggregated timing results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Summary of Changes
Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the accuracy of GPU performance benchmarking by introducing a "cold L2 cache" measurement capability. By implementing rotating buffers, the system now cycles through distinct memory regions for each kernel execution within a benchmark, preventing misleadingly high performance figures that can arise from a "hot" L2 cache. This is particularly vital for memory-bound kernels, where cache effects can heavily influence observed speeds, ensuring that benchmark results more faithfully represent real-world cold-start scenarios.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Code Review
This pull request introduces a new 'rotating buffer' mechanism for cold-L2 cache benchmarking across various attention, GEMM, and MoE routines. The flashinfer/testing/utils.py file was updated to include utilities for calculating L2 cache size, extracting GPU tensors, cloning tensor structures while preserving strides, and creating multiple copies of input arguments for buffer rotation. The core benchmarking functions (bench_gpu_time_with_cuda_event, bench_gpu_time_with_cudagraph, bench_gpu_time_with_cupti, and the main bench_gpu_time wrapper) were refactored to accept explicit input_args and input_kwargs, and a new rotate_buffers flag. When rotate_buffers is enabled with CUDA graphs, the system automatically determines the necessary number of buffer copies based on L2 cache size and input tensor sizes, then rotates through these copies during graph capture to ensure cold-L2 cache conditions for each kernel invocation. Correspondingly, the run_backend_wrapper and run_backend functions in benchmarks/routines/attention.py, benchmarks/routines/gemm.py, and benchmarks/routines/moe.py were updated to accept all relevant input tensors and parameters as explicit arguments, facilitating their use with the new rotating buffer mechanism in the benchmarking utilities. Additionally, error returns for unsupported backends were changed from res to None in attention benchmarks.
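As a rough caller-side sketch of the reworked interface described above (parameter names follow the review summary; defaults and the exact return shape may differ in the actual PR):

```python
import statistics

import torch
from flashinfer.testing.utils import bench_gpu_time

def run_backend(q, k, out):
    # Stand-in for a real attention/GEMM backend call being benchmarked.
    torch.matmul(q, k.transpose(-1, -2), out=out)

q = torch.randn(64, 128, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(64, 128, 128, device="cuda", dtype=torch.bfloat16)
out = torch.empty(64, 128, 128, device="cuda", dtype=torch.bfloat16)

# rotate_buffers clones the GPU tensors in input_args so consecutive kernel
# launches captured inside the CUDA graph read from different memory regions.
times_ms = bench_gpu_time(
    fn=run_backend,
    input_args=(q, k, out),
    use_cuda_graph=True,
    rotate_buffers=True,
)
print(f"median latency: {statistics.median(times_ms):.3f} ms")  # assumes a list of per-replay times
```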
/bot run
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
benchmarks/routines/gemm.py (1)
841-982: Fix `mm_fp4`: `mat2_fp4_trtllm`/`mat2_inv_s_trtllm` can be undefined when `trtllm` isn't selected.
Because these are now included in `input_args`, they must exist regardless of backend choice. Minimal fix: initialize defaults before the conditional.

```diff
@@
-    if "trtllm" in backends:
+    # Defaults so we can always thread these through input_args
+    mat2_fp4_trtllm = mat2_fp4
+    mat2_inv_s_trtllm = mat2_inv_s
+    if "trtllm" in backends:
         mat2_fp4_trtllm, mat2_inv_s_trtllm = flashinfer.nvfp4_quantize(
             mat2,
             global_sf_mat2,
             sfLayout=flashinfer.SfLayout.layout_128x4,
             do_shuffle=True,
         )
```

flashinfer/testing/utils.py (1)
880-1012: Critical: CUPTI-missing fallback drops `input_args`/`input_kwargs`, causing incorrect behavior.
When CUPTI isn't installed or is < 13, and `use_cuda_graph=False`, the fallback to `bench_gpu_time_with_cuda_event` is called without passing the provided `input_args` and `input_kwargs`, while the `use_cuda_graph=True` fallback correctly passes them to `bench_gpu_time_with_cudagraph`. This causes the function to ignore user-provided arguments.

Fix:

```diff
     else:
         return bench_gpu_time_with_cuda_event(
             fn=fn,
             dry_run_iters=dry_run_iters,
             repeat_iters=repeat_iters,
             dry_run_time_ms=dry_run_time_ms,
             repeat_time_ms=repeat_time_ms,
             l2_flush=l2_flush,
             l2_flush_size_mb=l2_flush_size_mb,
             l2_flush_device=l2_flush_device,
             sleep_after_run=sleep_after_run,
+            input_args=input_args,
+            input_kwargs=input_kwargs,
         )
```

benchmarks/routines/attention.py (1)
499-590: Guard unsupported backends and fix `trtllm-gen` KV stride selection (decode).
The decode routine supports only `["fa2", "fa2_tc", "trtllm-gen", "cudnn", "trtllm-native"]`, but the argument parser allows `["fa2", "fa2_tc", "fa3", "cudnn", "cutlass", "trtllm-gen", "trtllm-native", "trtllm-gen-native"]`. Three unsupported backends slip through:

- In the refcheck path (lines 542–552), calling `.detach()` on `None` will crash.
- `kv_cache_for_trt` is prepared and FP8-converted (lines 358–497) but never passed to `run_backend_wrapper`; trtllm-gen receives the incorrectly-strided `kv_cache` instead.
- `bench_gpu_time` is called unconditionally (line 566), benchmarking unsupported backends as no-ops.

The proposed patch correctly:

- Conditionally passes `kv_cache_for_trt` to trtllm-gen
- Guards refcheck against `None` with an early `continue`
- Skips benchmarking of unsupported backends
🧹 Nitpick comments (3)
benchmarks/routines/attention.py (2)
22-43: Prefer `warnings.warn` (and consider de-duping) for deprecated backend renames.
`print()` is fine for a CLI, but `warnings.warn(..., DeprecationWarning/UserWarning)` is easier to filter and won't get lost in perf logs. Also consider de-duping if `--backends` contains repeated deprecated entries.

Also applies to: 180-185
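A minimal sketch of that suggestion; the rename mapping and helper name here are illustrative, not taken from the PR:

```python
import warnings

# Illustrative rename table; the real deprecated-name mapping in the PR may differ.
DEPRECATED_BACKEND_RENAMES = {"trtllm-gen-native": "trtllm-native"}

def normalize_backends(backends):
    seen, normalized = set(), []
    for name in backends:
        new_name = DEPRECATED_BACKEND_RENAMES.get(name, name)
        if new_name != name:
            warnings.warn(
                f"Backend '{name}' is deprecated; use '{new_name}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        if new_name not in seen:  # de-dupe repeated (possibly renamed) entries
            seen.add(new_name)
            normalized.append(new_name)
    return normalized

print(normalize_backends(["fa2", "trtllm-gen-native", "trtllm-native"]))
# -> ['fa2', 'trtllm-native'], with one DeprecationWarning emitted
```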
992-1099: `rotate_buffers=True` + "fat" `input_args` may defeat the cold-L2 intent or blow up memory.
Right now `input_args` includes large scratch buffers (e.g., `workspace_buffer`) plus index tables; the rotation heuristic counts all CUDA tensors, which can (a) disable rotation because "inputs exceed 5×L2" even if the kernel's real working set doesn't, or (b) clone huge scratch tensors many times if rotation triggers.

Suggestion: pass only the tensors that materially contribute to the memory working set into `input_args` (or add an exclusion mechanism in `bench_gpu_time_with_cudagraph`), as sketched below.

Also applies to: 1485-1607, 1942-2039
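For instance, a sketch of the caller-side workaround using functools.partial (tensor shapes and the wrapper body are placeholders, not the PR's actual code):

```python
import functools

import torch
from flashinfer.testing.utils import bench_gpu_time

def run_backend_wrapper(q, kv_cache, out, workspace_buffer, kv_indptr):
    # Placeholder for a real paged-attention backend call.
    out.copy_(q)

q = torch.randn(32, 8, 128, device="cuda", dtype=torch.bfloat16)
kv_cache = torch.randn(1024, 2, 8, 16, 128, device="cuda", dtype=torch.bfloat16)
out = torch.empty_like(q)
workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
kv_indptr = torch.arange(33, dtype=torch.int32, device="cuda")

# Bind the scratch/index tensors so they stay fixed across rotations and are
# invisible to the rotation heuristic; only data-carrying tensors go through
# input_args and get cloned.
run_fn = functools.partial(
    run_backend_wrapper,
    workspace_buffer=workspace_buffer,
    kv_indptr=kv_indptr,
)
times_ms = bench_gpu_time(
    fn=run_fn,
    input_args=(q, kv_cache, out),
    use_cuda_graph=True,
    rotate_buffers=True,
)
```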
benchmarks/routines/moe.py (1)
928-952: Avoid `selected_experts.to(torch.int)` inside the timed callable.
Move the cast out of `run_cutlass` and pass the already-cast tensor in `input_args_for_bench` to avoid measuring conversion overhead.

Example:

```diff
@@
-    routing_weights, selected_experts = _compute_routing(router_logits, top_k)
+    routing_weights, selected_experts = _compute_routing(router_logits, top_k)
+    selected_experts_i32 = selected_experts.to(torch.int)
@@
-    def run_cutlass(x, selected_experts, routing_weights, w31_local, w2_local, out):
+    def run_cutlass(x, selected_experts_i32, routing_weights, w31_local, w2_local, out):
         return cutlass_fused_moe(
             x,
-            selected_experts.to(torch.int),
+            selected_experts_i32,
             routing_weights,
@@
-    input_args_for_bench = (x, selected_experts, routing_weights, w31_local, w2_local, out)
+    input_args_for_bench = (x, selected_experts_i32, routing_weights, w31_local, w2_local, out)
```

Also applies to: 988-1019, 1078-1104, 1123-1135
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- benchmarks/routines/attention.py (12 hunks)
- benchmarks/routines/gemm.py (12 hunks)
- benchmarks/routines/moe.py (15 hunks)
- flashinfer/testing/utils.py (15 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
benchmarks/routines/gemm.py (1)
- flashinfer/testing/utils.py (1)
  - bench_gpu_time (1399-1537)

benchmarks/routines/attention.py (4)
- flashinfer/testing/utils.py (1)
  - bench_gpu_time (1399-1537)
- benchmarks/bench_append_paged_kv_cache.py (1)
  - fn (116-127)
- benchmarks/bench_append_paged_mla_kv_cache.py (1)
  - fn (100-111)
- benchmarks/bench_fused_add_rmsnorm.py (1)
  - fn (41-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (3)
benchmarks/routines/gemm.py (1)
265-305: Signature expansion + `bench_gpu_time(input_args=...)` wiring looks consistent.
The explicit argument passing should play nicely with the new `bench_gpu_time` entry point and avoids closure-capture surprises.

Also applies to: 445-491, 648-700
benchmarks/routines/moe.py (1)
674-770: FP4 MoE runner signature threading into `bench_gpu_time(input_args=...)` is consistent.
Good alignment with the new benchmarking interface.

flashinfer/testing/utils.py (1)
731-877: Plumbing of `input_args`/`input_kwargs` through the event/graph/unified entry point looks good.
The `call_fn()` abstraction makes behavior consistent across backends and aligns with the repo-wide signature expansions.

Also applies to: 1170-1397, 1399-1537
Review context — new rotating-buffer utilities in flashinfer/testing/utils.py (excerpt, lines 38–211):

```python
def get_l2_cache_size(device=None) -> int:
    """
    Get L2 cache size in bytes for the given CUDA device.
    Args:
        device: CUDA device (int, torch.device, or None for current device).
    Returns:
        L2 cache size in bytes.
    """
    if device is None:
        device = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device)
    return props.L2_cache_size


def _calculate_tensor_bytes(tensors: List[torch.Tensor]) -> int:
    """
    Calculate total bytes of tensors residing on GPU.
    Assumes all tensors are on the same device.
    Args:
        tensors: List of torch.Tensor objects.
    Returns:
        Total bytes occupied by GPU tensors (CPU tensors are ignored).
    """
    total = 0
    for t in tensors:
        if isinstance(t, torch.Tensor) and t.is_cuda:
            total += t.numel() * t.element_size()
    return total


def _extract_gpu_tensors(obj) -> List[torch.Tensor]:
    """
    Recursively extract all GPU-resident tensors from a nested structure
    of lists, tuples, and dicts.
    Args:
        obj: Object to extract tensors from (can be tensor, list, tuple, dict, or other).
    Returns:
        Flat list of tensors on GPU found in the structure.
    """
    tensors = []
    if isinstance(obj, torch.Tensor) and obj.is_cuda:
        tensors.append(obj)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            tensors.extend(_extract_gpu_tensors(item))
    elif isinstance(obj, dict):
        for v in obj.values():
            tensors.extend(_extract_gpu_tensors(v))
    return tensors


def calculate_rotation_count(
    tensors: List[torch.Tensor], device=None, min_rotations: int = 2
) -> int:
    """
    Calculate the number of buffer copies needed to ensure cold L2 cache.
    The function uses conservative thresholds to account for:
    - LRU eviction being gradual (not all data evicted when capacity exceeded)
    - Cache associativity effects (some data may persist in non-conflicting sets)
    - Hardware prefetching behavior
    Returns 1 (no rotation needed) only when tensor size substantially exceeds
    L2 cache (>= 5x), ensuring cache effects are truly negligible.
    Args:
        tensors: List of tensors to consider for rotation (must be on GPU).
        device: Device for L2 cache query (None for current device).
        min_rotations: Minimum number of rotations when rotation is needed.
    Returns:
        Number of buffer copies needed (1 means no rotation needed).
    """
    l2_size = get_l2_cache_size(device)
    total_bytes = _calculate_tensor_bytes(tensors)

    if total_bytes == 0:
        return 1  # No tensors to rotate

    # Use aggressive threshold: only skip rotation if tensors far exceed L2 (5x)
    # This ensures cache effects are truly negligible even with prefetching
    safe_cache_threshold = l2_size * 5
    if total_bytes >= safe_cache_threshold:
        return 1  # Tensors far exceed L2, no rotation needed

    # Conservative formula: ensure between any two uses of the same buffer,
    # we've accessed enough data to fully flush L2 with margin
    # Using safe_cache_threshold ensures we account for all cache effects
    num_rotations = math.ceil(safe_cache_threshold / total_bytes) + 1

    return max(min_rotations, num_rotations)


def _clone_structure(obj):
    """
    Deep clone a nested structure, cloning GPU tensors with detach().clone()
    while preserving scalars, booleans, and other non-tensor values.
    For non-contiguous tensors (e.g., created with as_strided), this function
    preserves the stride pattern using torch.empty_strided() + copy_(). This is
    important for backends like cuDNN that expect specific memory layouts.
    Args:
        obj: Object to clone (tensor, list, tuple, dict, or other).
    Returns:
        Cloned structure with GPU tensors cloned, other values preserved.
    """
    if isinstance(obj, torch.Tensor):
        if obj.is_cuda:
            if obj.is_contiguous():
                return obj.detach().clone()
            else:
                # Preserve stride pattern for non-contiguous tensors
                # (e.g., as_strided views used by cuDNN paged attention)
                result = torch.empty_strided(
                    obj.size(),
                    obj.stride(),
                    dtype=obj.dtype,
                    device=obj.device,
                )
                result.copy_(obj.detach())
                return result
        else:
            return obj  # CPU tensors returned as-is
    elif isinstance(obj, list):
        return [_clone_structure(item) for item in obj]
    elif isinstance(obj, tuple):
        return tuple(_clone_structure(item) for item in obj)
    elif isinstance(obj, dict):
        return {k: _clone_structure(v) for k, v in obj.items()}
    else:
        # Non-tensor, non-container: return as-is (e.g., int, float, str, bool, None)
        return obj


def _create_rotated_buffer_copies(
    input_args: Tuple, input_kwargs: dict, num_rotations: int
) -> List[Tuple[Tuple, dict]]:
    """
    Create multiple copies of input_args and input_kwargs for buffer rotation.
    The first copy (index 0) uses the original args/kwargs.
    Subsequent copies clone all GPU tensors while preserving other values.
    Args:
        input_args: Positional arguments tuple.
        input_kwargs: Keyword arguments dict.
        num_rotations: Number of buffer copies to create.
    Returns:
        List of (args, kwargs) tuples, one for each rotation index.
    """
    if num_rotations <= 1:
        return [(input_args, input_kwargs)]

    copies = []
    # First copy uses original args/kwargs
    copies.append((input_args, input_kwargs))

    # Create cloned copies for remaining rotations
    for _ in range(num_rotations - 1):
        cloned_args = _clone_structure(input_args)
        cloned_kwargs = _clone_structure(input_kwargs)
        copies.append((cloned_args, cloned_kwargs))

    return copies
```
Cap rotation count to avoid pathological OOM (tiny inputs).
For very small total_bytes, ceil((5×L2)/total_bytes) can explode (thousands+ copies). Add a max_rotations cap and warn/clamp.
Example patch:
```diff
@@
-def calculate_rotation_count(
-    tensors: List[torch.Tensor], device=None, min_rotations: int = 2
-) -> int:
+def calculate_rotation_count(
+    tensors: List[torch.Tensor],
+    device=None,
+    min_rotations: int = 2,
+    max_rotations: int = 64,
+) -> int:
@@
-    num_rotations = math.ceil(safe_cache_threshold / total_bytes) + 1
-
-    return max(min_rotations, num_rotations)
+    num_rotations = math.ceil(safe_cache_threshold / total_bytes) + 1
+    num_rotations = max(min_rotations, num_rotations)
+    if num_rotations > max_rotations:
+        warnings.warn(
+            f"Requested {num_rotations} rotating buffer copies (inputs={total_bytes} bytes, "
+            f"L2={l2_size} bytes); clamping to max_rotations={max_rotations}.",
+            category=UserWarning,
+            stacklevel=2,
+        )
+        num_rotations = max_rotations
+    return num_rotations
```

🤖 Prompt for AI Agents
flashinfer/testing/utils.py around lines 38 to 211: the current
calculate_rotation_count can produce an extremely large num_rotations when
total_bytes is tiny (ceil((5*L2)/total_bytes) explodes), risking OOM; clamp the
computed rotations with a reasonable max_rotations (either hard-coded, e.g. 16
or 32, or passed as an optional parameter) and log/warn when you clamp so
callers know rotation was limited; implement by adding a max_rotations argument
(default 16), compute num_rotations as now, then set num_rotations =
min(num_rotations, max_rotations) and emit a warning via warnings.warn or logger
if clamping occurred, finally return max(min_rotations, num_rotations).
Review context — docstring excerpt from flashinfer/testing/utils.py:

```python
        num_iters_within_graph (int): Kernel calls per graph (CUDA graph mode only,
            default: 10).
        rotate_buffers (bool): If True, use rotating buffers for cold-L2 benchmarking
            (CUDA graph mode only, default: False).
```
Why do we have this constraint? From my understanding rotate_buffer should also work without CUDAGraph?
There is no technical constraint, but I did not implement it because when running outside of CUDA graph, we can simply flush L2 between runs with a `buffer.zero_()` (here):

```python
    torch.cuda.synchronize()
    for iter_idx in range(repeat_iters):
        if l2_flush:
            buffer.zero_()
        start_events[iter_idx].record()
        call_fn()
        end_events[iter_idx].record()
```
It is the CUDA graph + CUDA events measurements that primarily require rotating buffers, because we launch the kernel `num_iters_within_graph` times within the CUDA graph and cannot flush the cache in between:
```python
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        if rotate_buffers and num_rotations > 1:
            # Capture with rotating buffers: use buffer[iter % num_rotations]
            for iter_idx in range(num_iters_within_graph):
                buf_idx = iter_idx % num_rotations
                call_fn_with_rotation(buf_idx)
        else:
            # Non-rotating capture (uses original args if provided)
            for _ in range(num_iters_within_graph):
                call_fn()
    torch.cuda.synchronize()
    ...
    ...
    torch.cuda.synchronize()
    for iter_idx in range(repeat_iters):
        if l2_flush:
            buffer.zero_()
        start_events[iter_idx].record()
        g.replay()
        end_events[iter_idx].record()
```
However, I can see the untidiness of users needing to think about L2 flush vs. rotating buffers.
Do you have any thoughts that come to mind whether or how we might want to clean this up?
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/testing/utils.py (1)
881-1011: Bug: CUPTI-unavailable fallback drops `input_args`/`input_kwargs`.
This makes `bench_gpu_time(enable_cupti=True, ...)` fail for `fn` that needs explicit args when CUPTI isn't present.

```diff
     else:
         return bench_gpu_time_with_cuda_event(
             fn=fn,
             dry_run_iters=dry_run_iters,
             repeat_iters=repeat_iters,
             dry_run_time_ms=dry_run_time_ms,
             repeat_time_ms=repeat_time_ms,
             l2_flush=l2_flush,
             l2_flush_size_mb=l2_flush_size_mb,
             l2_flush_device=l2_flush_device,
             sleep_after_run=sleep_after_run,
+            input_args=input_args,
+            input_kwargs=input_kwargs,
         )
```
♻️ Duplicate comments (1)
flashinfer/testing/utils.py (1)
95-135: Cap `calculate_rotation_count` to avoid pathological OOM for tiny inputs.
This still has the unbounded `ceil((5×L2)/total_bytes)` blow-up risk for small `total_bytes` (thousands+ copies). Please clamp and warn (as previously suggested).

```diff
 def calculate_rotation_count(
-    tensors: List[torch.Tensor], device=None, min_rotations: int = 2
+    tensors: List[torch.Tensor],
+    device=None,
+    min_rotations: int = 2,
+    max_rotations: int = 64,
 ) -> int:
@@
-    num_rotations = math.ceil(safe_cache_threshold / total_bytes) + 1
-
-    return max(min_rotations, num_rotations)
+    num_rotations = math.ceil(safe_cache_threshold / total_bytes) + 1
+    num_rotations = max(min_rotations, num_rotations)
+    if num_rotations > max_rotations:
+        warnings.warn(
+            f"Requested {num_rotations} rotating buffer copies (inputs={total_bytes} bytes, "
+            f"L2={l2_size} bytes); clamping to max_rotations={max_rotations}.",
+            category=UserWarning,
+            stacklevel=2,
+        )
+        num_rotations = max_rotations
+    return num_rotations
```
🧹 Nitpick comments (2)
flashinfer/testing/utils.py (2)
54-92: Dedup extracted GPU tensors to avoid double-counting and inconsistent rotation sizing.
If the same tensor object is referenced multiple times in `input_args`/`input_kwargs`, you'll currently count it multiple times.

```diff
 def _extract_gpu_tensors(obj) -> List[torch.Tensor]:
@@
-    tensors = []
-    if isinstance(obj, torch.Tensor) and obj.is_cuda:
-        tensors.append(obj)
+    tensors: List[torch.Tensor] = []
+    seen: set[int] = set()
+
+    def visit(x):
+        if isinstance(x, torch.Tensor) and x.is_cuda:
+            k = id(x)
+            if k not in seen:
+                seen.add(k)
+                tensors.append(x)
+        elif isinstance(x, (list, tuple)):
+            for item in x:
+                visit(item)
+        elif isinstance(x, dict):
+            for v in x.values():
+                visit(v)
+
+    visit(obj)
-    elif isinstance(obj, (list, tuple)):
-        for item in obj:
-            tensors.extend(_extract_gpu_tensors(item))
-    elif isinstance(obj, dict):
-        for v in obj.values():
-            tensors.extend(_extract_gpu_tensors(v))
     return tensors
```
1275-1544: Use the inputs' CUDA device for L2 sizing (and optionally guard misconfiguration of `rotate_buffers`).
Right now rotation sizing is keyed off `l2_flush_device`, which may differ from the tensors' device in multi-GPU situations.

```diff
-    num_rotations = calculate_rotation_count(gpu_tensors, l2_flush_device)
+    num_rotations = calculate_rotation_count(gpu_tensors, gpu_tensors[0].device)
```

Optional (to reduce user confusion): warn if `rotate_buffers=True` while `use_cuda_graph=False` in `bench_gpu_time`, since it won't do anything in the CUDA-event path.
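That optional guard could be a small helper along these lines (hypothetical sketch, not part of the PR):

```python
import warnings

def _warn_if_rotation_ignored(rotate_buffers: bool, use_cuda_graph: bool) -> None:
    """Hypothetical helper: flag rotate_buffers requests that the CUDA-event path ignores."""
    if rotate_buffers and not use_cuda_graph:
        warnings.warn(
            "rotate_buffers=True has no effect when use_cuda_graph=False; "
            "the CUDA-event path flushes L2 between iterations instead.",
            UserWarning,
            stacklevel=3,
        )
```

bench_gpu_time could call such a helper once before dispatching to the event-based path.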
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- flashinfer/testing/utils.py (15 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (1)
flashinfer/testing/utils.py (1)
731-879: CUDA-event path arg plumbing looks consistent.
Review context — `get_l2_cache_size` in flashinfer/testing/utils.py (same code as shown above).
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, check the exact content of the file and confirm the code snippet
head -60 flashinfer/testing/utils.py | tail -30

# Search for get_l2_cache_size usage in the codebase
rg "get_l2_cache_size" --type py -B 2 -A 2

# Check for other device handling patterns in the file
rg "torch\.cuda\." flashinfer/testing/utils.py -B 1 -A 1

# Check if there are CUDA availability checks elsewhere in the codebase
rg "torch\.cuda\.is_available" --type py | head -20

# Check the context of get_l2_cache_size calls - especially in calculate_rotation_count
rg "calculate_rotation_count" --type py -B 5 -A 10

# Look at the full function that calls get_l2_cache_size
sed -n '60,100p' flashinfer/testing/utils.py

# Check if there's any CUDA requirement or guard at module level
head -30 flashinfer/testing/utils.py
```

🌐 Web query: torch.cuda.get_device_properties PyTorch device type handling str torch.device

💡 Result:

Short answer
- torch.cuda.get_device_properties accepts a torch.device, an int (legacy ordinal), or a device string (e.g. "cuda", "cuda:0"). If device is None it uses the current CUDA device. The function is intended for CUDA devices (not CPU/mps), so pass a CUDA device (or you'll get an error). [1][2]

Example
- torch.cuda.get_device_properties(torch.device("cuda:0"))
- torch.cuda.get_device_properties("cuda:0")
- torch.cuda.get_device_properties(0)      # legacy ordinal
- torch.cuda.get_device_properties(None)   # uses current device

Refs
- torch.cuda.get_device_properties docs (accepts torch.device / int / str; None → current device). [1]
- torch.device description (device string formats like "cuda", "cuda:0", and semantics). [2]

🏁 Scripts executed:

```shell
# Verify device handling patterns in flashinfer
rg "torch\.device\(" flashinfer/testing/utils.py

# Check if get_device_properties is called elsewhere with different device types
rg "get_device_properties" --type py -B 2 -A 2

# Verify the actual torch.cuda behavior with a focused test
python3 << 'EOF'
import torch
# Check if torch.cuda.get_device_properties accepts various types
# (Note: we're just checking the signature, not actually running on GPU)
try:
    # Check the function signature
    import inspect
    sig = inspect.signature(torch.cuda.get_device_properties)
    print("Signature:", sig)
    print("\nDocstring snippet:")
    print(torch.cuda.get_device_properties.__doc__[:500])
except Exception as e:
    print(f"Error: {e}")
EOF
```
Add CUDA availability check to get_l2_cache_size (optional defensive programming).
The function will fail in CPU-only environments. While get_l2_cache_size is currently only called from benchmarking code that already requires CUDA, a defensive check makes the function more robust if reused elsewhere:
```diff
 def get_l2_cache_size(device=None) -> int:
     """
     Get L2 cache size in bytes for the given CUDA device.
     Args:
         device: CUDA device (int, torch.device, or None for current device).
     Returns:
         L2 cache size in bytes.
     """
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is not available; cannot query L2 cache size.")
     if device is None:
         device = torch.cuda.current_device()
     props = torch.cuda.get_device_properties(device)
     return props.L2_cache_size
```

Note: torch.cuda.get_device_properties already accepts int, torch.device, and str device specifiers directly, so no device type normalization is needed.
🤖 Prompt for AI Agents
In flashinfer/testing/utils.py around lines 38 to 52, the function
get_l2_cache_size calls torch.cuda APIs and will raise on CPU-only systems; add
a defensive CUDA availability check at the top of the function (e.g., if not
torch.cuda.is_available(): raise RuntimeError("CUDA is not available; cannot get
L2 cache size")), keep the existing device handling (no normalization needed),
and return the L2 cache size as before.
Review context — `_clone_structure` and `_create_rotated_buffer_copies` in flashinfer/testing/utils.py (same code as shown above).
Preserve tensor aliasing when cloning inputs (memoize clones; share memo across args+kwargs).
Right now, if a tensor is referenced multiple times (or shared between input_args and input_kwargs), each occurrence becomes a distinct clone in rotated copies. That can change semantics for alias-sensitive kernels and also inflates memory.
```diff
-def _clone_structure(obj):
+def _clone_structure(obj, _memo: Optional[dict[int, Any]] = None):
@@
-    if isinstance(obj, torch.Tensor):
+    if _memo is None:
+        _memo = {}
+    if isinstance(obj, torch.Tensor):
+        k = id(obj)
+        if k in _memo:
+            return _memo[k]
         if obj.is_cuda:
             if obj.is_contiguous():
-                return obj.detach().clone()
+                cloned = obj.detach().clone()
+                _memo[k] = cloned
+                return cloned
             else:
@@
-                result.copy_(obj.detach())
-                return result
+                result.copy_(obj.detach())
+                _memo[k] = result
+                return result
@@
-    elif isinstance(obj, list):
-        return [_clone_structure(item) for item in obj]
+    elif isinstance(obj, list):
+        return [_clone_structure(item, _memo) for item in obj]
     elif isinstance(obj, tuple):
-        return tuple(_clone_structure(item) for item in obj)
+        return tuple(_clone_structure(item, _memo) for item in obj)
     elif isinstance(obj, dict):
-        return {k: _clone_structure(v) for k, v in obj.items()}
+        return {k: _clone_structure(v, _memo) for k, v in obj.items()}
@@
 def _create_rotated_buffer_copies(
@@
     for _ in range(num_rotations - 1):
-        cloned_args = _clone_structure(input_args)
-        cloned_kwargs = _clone_structure(input_kwargs)
+        memo: dict[int, Any] = {}
+        cloned_args = _clone_structure(input_args, memo)
+        cloned_kwargs = _clone_structure(input_kwargs, memo)
         copies.append((cloned_args, cloned_kwargs))
```
📌 Description
`bench_gpu_time_with_cudagraph` uses CUDA events with multiple kernel iterations within the graph to amortize timing overhead. However, the L2 cache becomes hot after the first iteration, leading to misleadingly better performance for memory-bound kernels. (Note: this is not an issue for CUPTI-based measurements with `bench_gpu_time_with_cupti`, where the measurement overhead is small and thus the L2 cache can be flushed between kernel launches.) This PR implements rotating buffers that cycle through different memory regions across iterations, ensuring a cold L2 cache for each kernel invocation.
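To make the sizing concrete, here is a small worked example of the heuristic in `calculate_rotation_count` (the 50 MB L2 figure is an assumed, H100-class value):

```python
import math

# Assumed sizes: a 50 MB L2 cache and ~25 MB of benchmarked input tensors.
l2_bytes = 50 * 1024 * 1024
input_bytes = 25 * 1024 * 1024

# Mirrors calculate_rotation_count: rotate unless inputs already exceed 5x the L2 capacity.
safe_threshold = 5 * l2_bytes
num_rotations = max(2, math.ceil(safe_threshold / input_bytes) + 1)
print(num_rotations)  # -> 11 copies, so consecutive captured calls never reuse a still-warm buffer
```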
🔍 Related Issues
#2187
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests pass (unittest, etc.).

Reviewer Notes