refactor: update fa3 codebase [part 2] #2192
base: main
Conversation
Walkthrough

The PR introduces Hopper kernel optimizations for attention mechanisms, including FP8 transposed V loading infrastructure, L2-aware scheduling with longest-processing-time (LPT) support, and a new barrier-optimized TMA pipeline class. It adds a benchmarking script for FlashInfer/FA3 performance comparison, and modifies mainloop masking patterns, epilogue operations, and tile scheduling to support improved causal attention and FP8 quantization paths.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request continues the refactoring of the FlashAttention-3 (FA3) codebase within FlashInfer, with a primary focus on enhancing the efficiency and performance of FP8 operations. It integrates advanced memory access patterns and scheduling strategies to better leverage GPU architecture, aiming to close the performance gap with native FA3 implementations. The changes are validated through a new, detailed benchmarking suite.
Code Review
This pull request introduces significant performance optimizations to the FA3 codebase, particularly for FP8 data types. The changes include a new comprehensive benchmark script, L2-aware scheduling, a workaround for a Cutlass performance regression, Longest-Processing-Time-First (LPT) scheduling, and FA3-style same-shape transpose for FP8. The changes are well-commented and demonstrate a deep understanding of CUDA and Cutlass for performance tuning. My review focuses on the new benchmark script, where I've identified a couple of inefficiencies in the test setup code. The CUDA-level changes appear to be solid optimizations.
```python
kv_indptr = torch.tensor(
    [0]
    + [
        sum((kv_lens[i] + page_size - 1) // page_size for i in range(j + 1))
        for j in range(batch_size)
    ],
    dtype=torch.int32,
    device=device,
)
```
The creation of kv_indptr has a quadratic time complexity (O(N^2)) with respect to batch_size due to the nested loop structure (sum inside a list comprehension). For a large batch_size, this can be very slow. This can be computed in linear time. This pattern is repeated on lines 571-579.
```python
num_pages_per_seq = torch.tensor(
    [(kv_len + page_size - 1) // page_size for kv_len in kv_lens],
    dtype=torch.int32,
    device=device,
)
kv_indptr = torch.nn.functional.pad(torch.cumsum(num_pages_per_seq, 0), (1, 0))
```

```python
cu_seqlens = torch.tensor(
    [0] + list(torch.cumsum(torch.tensor(qo_lens), 0).numpy()),
    dtype=torch.int32,
    device=device,
)
```
The creation of cu_seqlens is inefficient. It involves creating a tensor on the CPU, performing a cumulative sum, converting it to a NumPy array, then to a list, and finally creating a new tensor on the GPU. This can be performed more efficiently and directly on the GPU to avoid unnecessary CPU-GPU data transfers and conversions. This pattern is repeated in other parts of the script (e.g., lines 199-203, 291-295, etc.).
```python
cu_seqlens = torch.nn.functional.pad(
    torch.cumsum(torch.tensor(qo_lens, device=device, dtype=torch.int32), dim=0), (1, 0)
)
```
Actionable comments posted: 1
🧹 Nitpick comments (5)
include/flashinfer/attention/hopper/sm90_pipeline_no_cluster.cuh (2)
18-24: Well-documented workaround for Cutlass regression. The comment clearly explains the rationale for this custom pipeline class. However, consider adding the Cutlass version check or a link to the relevant Cutlass issue/commit for future maintainability.
Consider adding a reference to the specific Cutlass commit or issue that introduced this regression:
```diff
 // As of Cutlass v3.6.0, if size(ClusterShape) == 1, PipelineTmaAsync has all threads
 // signaling the barrier during consumer_release. This causes a perf regression in FA3
 // forward pass (especially hdim 128 causal). We instead reimplement the version of
 // PipelineTmaAsync before v3.6.0 where only 1 out of 128 threads signals the barrier.
+// TODO: Remove this workaround once upstream Cutlass provides a fix.
 //
 // Assumption: params.num_consumers % NumThreadsPerWarpGroup == 0
```
36-53: Consider adding a static_assert or runtime check for num_consumers alignment. The documented assumption (params.num_consumers % NumThreadsPerWarpGroup == 0) is critical for correctness but not enforced. The calculation at lines 43-44 uses ceiling division, which could mask misalignment issues.

```diff
 static CUTLASS_DEVICE void init_barriers(SharedStorage& storage, Params params) {
+  // Enforce documented assumption
+  assert(params.num_consumers % cutlass::NumThreadsPerWarpGroup == 0 &&
+         "num_consumers must be a multiple of NumThreadsPerWarpGroup");
   int warp_idx = cutlass::canonical_warp_idx_sync();
   bool is_initializing_warp = (warp_idx == 0);
```

benchmarks/bench_fa3_comparison.py (3)
78-83: Consider capturing loop variables explicitly in lambdas. While the current code works correctly (variables are used immediately within the same iteration), Ruff flags B023 warnings because lambdas reference loop variables. This pattern repeats throughout the file and violates linting best practices.
Apply this pattern to explicitly capture variables (example for this section, apply similarly to lines 147, 218, 319, 333, 456, 497, 635, 687):
```diff
-        fi_time = bench_fn(lambda: wrapper.run(q, k, v))
-        fa3_time = bench_fn(
-            lambda: fa3_varlen_func(
-                q, k, v, cu_seqlens, cu_seqlens, seq_len, seq_len, causal=True
-            )
-        )
+        fi_time = bench_fn(lambda w=wrapper, q_=q, k_=k, v_=v: w.run(q_, k_, v_))
+        fa3_time = bench_fn(
+            lambda q_=q, k_=k, v_=v, cu=cu_seqlens, sl=seq_len: fa3_varlen_func(
+                q_, k_, v_, cu, cu, sl, sl, causal=True
+            )
+        )
```

This makes the variable capture explicit and silences the linter warnings.
128-132: Consider using iterable unpacking for cleaner code. Line 129 (and similar patterns at lines 200, 292, 428, 587) uses list concatenation, which can be simplified with iterable unpacking.
Apply this pattern:
```diff
-    cu_seqlens = torch.tensor(
-        [0] + list(torch.cumsum(torch.tensor(qo_lens), 0).numpy()),
-        dtype=torch.int32,
-        device=device,
-    )
+    cu_seqlens = torch.tensor(
+        [0, *torch.cumsum(torch.tensor(qo_lens), 0).tolist()],
+        dtype=torch.int32,
+        device=device,
+    )
```

This is more idiomatic Python and avoids the intermediate list conversion.
302-349: Consider logging exceptions for debugging. The broad exception handling is appropriate for a benchmarking script to ensure graceful degradation. However, silently swallowing exceptions (lines 323, 346) makes debugging difficult when implementations fail.
Add minimal logging:
```diff
         except Exception:
+            # Silently skip if FP8 not supported
             fi_time = None
             fi_tflops = None
```

Or for better debugging:

```diff
-        except Exception:
+        except Exception as e:
+            print(f"  (FlashInfer FP8 skipped: {type(e).__name__})")
             fi_time = None
             fi_tflops = None
```

This pattern applies to similar exception handlers at lines 346, 460, 508, 639, 701.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- benchmarks/bench_fa3_comparison.py (1 hunks)
- include/flashinfer/attention/hopper/kernel_traits.cuh (2 hunks)
- include/flashinfer/attention/hopper/mainloop.cuh (2 hunks)
- include/flashinfer/attention/hopper/mainloop_mma.cuh (4 hunks)
- include/flashinfer/attention/hopper/prefill_sm90.cuh (1 hunks)
- include/flashinfer/attention/hopper/quantization/epilogue.cuh (2 hunks)
- include/flashinfer/attention/hopper/quantization/kernel_traits.cuh (4 hunks)
- include/flashinfer/attention/hopper/quantization/mainloop_load.cuh (8 hunks)
- include/flashinfer/attention/hopper/quantization/mainloop_mma.cuh (2 hunks)
- include/flashinfer/attention/hopper/quantization/mainloop_sparse_load.cuh (2 hunks)
- include/flashinfer/attention/hopper/quantization/prefill_sm90.cuh (2 hunks)
- include/flashinfer/attention/hopper/sm90_pipeline_no_cluster.cuh (1 hunks)
- include/flashinfer/attention/hopper/tile_scheduler.cuh (3 hunks)
- include/flashinfer/attention/hopper/utils.cuh (1 hunks)
- include/flashinfer/attention/scheduler.cuh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
- include/flashinfer/attention/hopper/kernel_traits.cuh
- include/flashinfer/attention/hopper/quantization/mainloop_mma.cuh
- include/flashinfer/attention/hopper/quantization/epilogue.cuh
🧬 Code graph analysis (1)
benchmarks/bench_fa3_comparison.py (2)
flashinfer/testing/utils.py (1)
- bench_gpu_time_with_cupti (646-878)

flashinfer/prefill.py (2)
- BatchPrefillWithRaggedKVCacheWrapper (2316-3093)
- BatchPrefillWithPagedKVCacheWrapper (1305-2299)
🪛 Ruff (0.14.8)
benchmarks/bench_fa3_comparison.py
78-78: Function definition does not bind loop variable wrapper
(B023)
78-78: Function definition does not bind loop variable q
(B023)
78-78: Function definition does not bind loop variable k
(B023)
78-78: Function definition does not bind loop variable v
(B023)
81-81: Function definition does not bind loop variable q
(B023)
81-81: Function definition does not bind loop variable k
(B023)
81-81: Function definition does not bind loop variable v
(B023)
81-81: Function definition does not bind loop variable cu_seqlens
(B023)
81-81: Function definition does not bind loop variable cu_seqlens
(B023)
81-81: Function definition does not bind loop variable seq_len
(B023)
81-81: Function definition does not bind loop variable seq_len
(B023)
129-129: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
147-147: Function definition does not bind loop variable wrapper
(B023)
147-147: Function definition does not bind loop variable q
(B023)
147-147: Function definition does not bind loop variable k
(B023)
147-147: Function definition does not bind loop variable v
(B023)
150-150: Function definition does not bind loop variable q
(B023)
150-150: Function definition does not bind loop variable k
(B023)
150-150: Function definition does not bind loop variable v
(B023)
150-150: Function definition does not bind loop variable cu_seqlens
(B023)
150-150: Function definition does not bind loop variable cu_seqlens
(B023)
150-150: Function definition does not bind loop variable seq_len
(B023)
150-150: Function definition does not bind loop variable seq_len
(B023)
192-192: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
200-200: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
218-218: Function definition does not bind loop variable wrapper
(B023)
218-218: Function definition does not bind loop variable q
(B023)
218-218: Function definition does not bind loop variable k
(B023)
218-218: Function definition does not bind loop variable v
(B023)
221-221: Function definition does not bind loop variable q
(B023)
221-221: Function definition does not bind loop variable k
(B023)
221-221: Function definition does not bind loop variable v
(B023)
221-221: Function definition does not bind loop variable cu_seqlens
(B023)
221-221: Function definition does not bind loop variable cu_seqlens
(B023)
221-221: Function definition does not bind loop variable max_seqlen
(B023)
221-221: Function definition does not bind loop variable max_seqlen
(B023)
292-292: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
319-319: Function definition does not bind loop variable wrapper
(B023)
319-319: Function definition does not bind loop variable q_fp8
(B023)
319-319: Function definition does not bind loop variable k_fp8
(B023)
319-319: Function definition does not bind loop variable v_fp8
(B023)
319-319: Function definition does not bind loop variable s_q
(B023)
319-319: Function definition does not bind loop variable s_k
(B023)
319-319: Function definition does not bind loop variable s_v
(B023)
323-323: Do not catch blind exception: Exception
(BLE001)
333-333: Function definition does not bind loop variable q_fp8
(B023)
334-334: Function definition does not bind loop variable k_fp8
(B023)
335-335: Function definition does not bind loop variable v_fp8
(B023)
336-336: Function definition does not bind loop variable cu_seqlens
(B023)
337-337: Function definition does not bind loop variable cu_seqlens
(B023)
338-338: Function definition does not bind loop variable seq_len
(B023)
339-339: Function definition does not bind loop variable seq_len
(B023)
346-346: Do not catch blind exception: Exception
(BLE001)
428-428: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
456-456: Function definition does not bind loop variable wrapper
(B023)
456-456: Function definition does not bind loop variable q_fp16
(B023)
456-456: Function definition does not bind loop variable kv_data_fp16
(B023)
460-460: Do not catch blind exception: Exception
(BLE001)
497-497: Function definition does not bind loop variable q_fa3
(B023)
498-498: Function definition does not bind loop variable k_cache_fa3
(B023)
499-499: Function definition does not bind loop variable v_cache_fa3
(B023)
500-500: Function definition does not bind loop variable cache_seqlens
(B023)
501-501: Function definition does not bind loop variable page_table
(B023)
508-508: Do not catch blind exception: Exception
(BLE001)
587-587: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
635-635: Function definition does not bind loop variable wrapper
(B023)
635-635: Function definition does not bind loop variable q_fp8
(B023)
635-635: Function definition does not bind loop variable kv_data_fp8
(B023)
635-635: Function definition does not bind loop variable s_q
(B023)
635-635: Function definition does not bind loop variable s_k
(B023)
635-635: Function definition does not bind loop variable s_v
(B023)
639-639: Do not catch blind exception: Exception
(BLE001)
687-687: Function definition does not bind loop variable q_fa3
(B023)
688-688: Function definition does not bind loop variable k_cache_fa3
(B023)
689-689: Function definition does not bind loop variable v_cache_fa3
(B023)
690-690: Function definition does not bind loop variable cache_seqlens
(B023)
691-691: Function definition does not bind loop variable page_table
(B023)
692-692: Function definition does not bind loop variable q_descale_fa3
(B023)
693-693: Function definition does not bind loop variable k_descale_fa3
(B023)
694-694: Function definition does not bind loop variable v_descale_fa3
(B023)
701-701: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (47)
include/flashinfer/attention/hopper/utils.cuh (1)
144-166: LGTM - FP8 output permutation utility is well-structured. The function correctly:
- Validates fragment layout with comprehensive static_asserts
- Uses group_modes to flatten the tensor for easier iteration
- Swaps adjacent elements to reverse the FP8 MMA register permutation before smem writes

The swap pattern (1, j, 2*i) <-> (0, j, 2*i+1) correctly undoes the byte permutation applied during FP8 MMA.
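As a rough host-side illustration of that swap (plain NumPy, not the CUDA utility; the array shape (2, J, 2*I) and the function name are assumptions for the example):

```python
import numpy as np

def undo_fp8_mma_permutation(frag: np.ndarray) -> np.ndarray:
    """Swap frag[1, j, 2*i] <-> frag[0, j, 2*i+1] for all i, j."""
    out = frag.copy()
    _, num_j, num_k = frag.shape  # expects shape (2, J, 2*I)
    for j in range(num_j):
        for i in range(num_k // 2):
            out[1, j, 2 * i] = frag[0, j, 2 * i + 1]
            out[0, j, 2 * i + 1] = frag[1, j, 2 * i]
    return out
```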
include/flashinfer/attention/hopper/mainloop.cuh (1)

204-206: LGTM - TMA cache hints for L2 optimization. The EVICT_LAST cache hints are correctly applied to all K and V TMA loads. This hint marks loaded data as least-recently-used in L2, which is appropriate for attention K/V tiles that are typically consumed once per Q tile iteration. This complements the L2-aware scheduling changes in scheduler.cuh.

Also applies to: 233-242, 249-251
include/flashinfer/attention/scheduler.cuh (3)
799-799: Verify the cost function coefficient change. The cost function changed from 2 * qo_len + kv_len to 0.05 * qo_len + kv_len. This dramatically reduces the weight of qo_len relative to kv_len (a 40x reduction). Please confirm this is intentional and aligned with FA3's cost model. The change makes the scheduler heavily prioritize KV length in load-balancing decisions.
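To make the shift concrete, a small standalone comparison (not part of the PR; the numbers are only a worked example):

```python
def old_cost(qo_len: int, kv_len: int) -> float:
    return 2 * qo_len + kv_len

def new_cost(qo_len: int, kv_len: int) -> float:
    return 0.05 * qo_len + kv_len

# For qo_len = kv_len = 4096, the qo term drops from two thirds of the total
# cost (8192 / 12288) to under 5% (204.8 / ~4300.8), so the LPT ordering is
# now driven almost entirely by kv_len.
print(old_cost(4096, 4096))  # 12288
print(new_cost(4096, 4096))  # ~4300.8
```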
932-997: L2-aware scheduling implementation looks correct. The implementation follows FA3's approach:
- Uses a conservative 8MB L2 estimate (appropriate for cache sharing scenarios)
- Stepped nheads_in_l2 values (16, 8, 4, 2, 1) match FA3's heuristics
- Swizzle-based head grouping enables K/V reuse within L2 cache
- Inner loop order (q_tiles in LPT order, then heads) maximizes L2 hit rate

The size_l2 constant at line 935 is hardcoded. Consider making this configurable for different GPU architectures if needed in the future. A sketch of the traversal order this strategy produces is shown below.
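A simplified host-side sketch of that traversal order (not the actual scheduler code; the function name, the per-head K/V byte estimate, and the 8MB default are assumptions for illustration):

```python
def l2_aware_order(num_heads, num_q_tiles, kv_bytes_per_head, size_l2=8 * 1024 * 1024):
    """Group heads so their K/V fits in the L2 budget, then emit all q-tiles for a
    head group (largest-work tiles first, i.e. LPT under causal masking) before
    moving on, so the group's K/V stays resident in L2."""
    nheads_in_l2 = 16
    while nheads_in_l2 > 1 and nheads_in_l2 * kv_bytes_per_head > size_l2:
        nheads_in_l2 //= 2  # step through 16, 8, 4, 2, 1

    order = []
    for group_start in range(0, num_heads, nheads_in_l2):
        group = range(group_start, min(group_start + nheads_in_l2, num_heads))
        for q_tile in reversed(range(num_q_tiles)):  # LPT: most KV work first
            for head in group:
                order.append((head, q_tile))
    return order
```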
904-912: LGTM - KV tile size heuristics for L2 estimation. The cta_tile_kv values are appropriately chosen based on head dimensions:
- 192 for non-causal HD128 (larger tiles possible without causal masking overhead)
- 64 for HD > 128 (smaller tiles to fit larger head dimensions)
These are used for L2 cache capacity estimation, aligning with the actual kernel tile sizes.
include/flashinfer/attention/hopper/mainloop_mma.cuh (4)
59-61: LGTM - Identity tensor hoisted outside loops. Creating the identity tensor and its partition once outside the loops is a good optimization, avoiding redundant tensor allocations in each iteration.
150-237: LGTM - FA3-style compile-time specialized mask functions. The refactoring introduces clean, specialized mask functions:
- causal_mask_with_seqlen_fn: First iteration with boundary checks
- causal_mask_fn: Masking iterations without redundant seqlen checks
- no_mask_fn: Main loop (no causal masking)
- swa_left_mask_fn: Sliding window left boundary

This compile-time specialization enables the compiler to generate optimized code for each masking scenario without runtime branching overhead.
250-310: LGTM - Pipelined forward step with compile-time mask dispatch. The fwd_step lambda correctly implements the pipelined attention computation:
- Wait for K and start QK gemm
- Rescale O (if not first iteration)
- Wait for V and start PV gemm
- Apply compile-time specialized mask
- Update attention state
- Release pipelines and advance
The loop passes kv_tile_idx_decrement(kv_tile_idx) to fwd_step because the function loads K for the next iteration while processing V from the current iteration; this correctly implements the software pipeline.
282-283: LGTM - Masking steps calculation. The n_masking_steps correctly accounts for the causal diagonal:
- For causal: ceil_div(CTA_Q, CTA_KV) iterations cover the diagonal region
- MULTIITEMSCORING adds +1 for additional boundary handling

A tiny worked example of this count follows.
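The arithmetic is plain ceiling division; for instance, assuming a 128-row Q tile and 64-column KV tiles (illustrative sizes, not necessarily the kernel's defaults):

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

# A 128-row Q tile against 64-column KV tiles needs 2 masked iterations to
# cover the causal diagonal (one more if MULTIITEMSCORING adds +1).
assert ceil_div(128, 64) == 2
assert ceil_div(128, 96) == 2  # non-divisible case still rounds up
```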
include/flashinfer/attention/hopper/quantization/prefill_sm90.cuh (2)
236-242: LGTM - Explicit SmemLayoutVt template parameter. Passing CollectiveMainloop::SmemLayoutVt explicitly to mma_fp8 ensures the correct V transpose layout is used, aligning with the FA3-style layout infrastructure introduced across the kernel traits and mainloop components.
263-264: LGTM - LPT scheduling enabled for causal attention. Using SingleTileScheduler</*LPT=*/CAUSAL> enables Longest Processing Time (LPT) scheduling when causal masking is active. This improves load balancing because causal tiles near the diagonal have significantly less work (fewer valid KV positions), and LPT scheduling helps distribute this uneven workload more evenly across SMs.

include/flashinfer/attention/hopper/prefill_sm90.cuh (1)
301-302: LGTM! LPT scheduling for causal attention is a sensible optimization. The change enables Longest-Processing-Time-First scheduling when CAUSAL=true. This improves load balancing since causal attention tiles have triangular workloads: later tiles process more KV tokens, so processing them first helps even out GPU utilization.
20-20: LGTM! Include for no-cluster pipeline variant.
114-118: VerifyPipelineTmaAsyncNoClusterclass exists and is properly implemented.
PipelineTmaAsyncNoClusteris not part of official CUTLASS and must be a custom FlashInfer class. Confirm:
- The class definition exists in the codebase with correct
NUM_STAGEStemplate parameter- The barrier signaling optimization (1 thread vs. all threads) is actually implemented
- The class properly integrates with Cutlass 3.6+ without regressions
include/flashinfer/attention/hopper/quantization/epilogue.cuh (2)
17-17: LGTM! Required include for epilogue detail utilities.
43-49: LGTM! Dynamic copy operation selection for FP8 accumulators. Using sm90_get_smem_store_op_for_accumulator is the correct approach: it automatically selects the appropriate SMEM store operation based on the output stride and data type, rather than hardcoding a specific copy operation. This improves correctness for FP8 paths where the optimal copy operation may differ from F16/BF16.

include/flashinfer/attention/hopper/tile_scheduler.cuh (4)
16-19: LGTM! Clear documentation for the LPT scheduling feature. The comments effectively explain the purpose: reversing block indices so that tiles with more KV tokens (higher indices in causal attention) are processed first for better load balancing.
30-30: LGTM! Added num_qo_tiles to Params for LPT reversal.
36-36: LGTM! Propagating num_qo_tiles from Arguments to Params.
66-71: LGTM! LPT reversal implementation is correct. The reversal formula params.num_qo_tiles - 1 - q_tile_idx correctly maps:
- blockIdx.x = 0 → highest tile index (most work in causal attention)
- blockIdx.x = num_qo_tiles - 1 → tile index 0 (least work)

This ensures tiles with more KV tokens are processed first, improving GPU utilization through better load balancing.
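A minimal host-side illustration of the same mapping (the kernel computes this from blockIdx.x; the function name here is invented for the example):

```python
def lpt_q_tile_index(block_idx: int, num_qo_tiles: int, lpt: bool = True) -> int:
    # With LPT enabled, block 0 picks the last (largest-work) q tile under causal
    # masking; without it, blocks map to tiles in natural order.
    return num_qo_tiles - 1 - block_idx if lpt else block_idx

assert [lpt_q_tile_index(b, 4) for b in range(4)] == [3, 2, 1, 0]
```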
include/flashinfer/attention/hopper/quantization/mainloop_load.cuh (8)
51-54: LGTM! Transposed stride definition for FA3-style V loading. The stride (D, N, H) with (_1, int64_t, int64_t) correctly represents loading V with dimension D as the innermost (contiguous) dimension, enabling direct loading into MN-major shared memory.
71-79: LGTM! TMA_V updated for transposed V loading path. The tile shape change from select<1,2> to select<2,1> correctly reflects the transposed (D, N) tile shape for V, aligning with the MN-major SmemLayoutV.
188-199: LGTM! In-kernel V tensor and transpose source updated for transposed layout. The changes correctly:
- Create mV with transposed shape (D, N, H) for the TMA tensor
- Use sVt_src (SmemLayoutV, MN-major) as the TMA destination and transpose source
- Keep sVt_tgt (SmemLayoutVt, K-major) as the transpose target for MMA consumption
210-212: LGTM! gV tile uses transposed shape select<2, 1>. Consistent with the TMA_V definition: the tile shape is now (HEAD_DIM, CTA_KV) instead of (CTA_KV, HEAD_DIM).
272-272: LGTM! Transpose source updated to sVt_src.
297-297: LGTM! Consistent transpose source usage in pipeline loop.
319-319: LGTM! Consistent transpose source in final loop iteration.
132-143: Verify stride construction correctness for the transposed V tensor. The transposed layout construction looks correct conceptually: swapping shape and stride elements to create a (D, N, H) layout from the original (N, D, H). However, this relies on layout_V having the expected stride pattern where stride_D is 1 (contiguous head dimension). Verify that the input layout_V always has stride_D = 1 (row-major within each head) as assumed by this transpose:

```bash
#!/bin/bash
# Search for how layout_V is constructed/passed to verify stride assumptions
rg -n "layout_V" --type cpp -B2 -A2 | head -60
# Check get_gmem_layout to see what strides are typically used
ast-grep --pattern $'get_gmem_layout($_, $_, $_, $_, $_)'
```

include/flashinfer/attention/hopper/quantization/mainloop_mma.cuh (2)
17-22: LGTM! Clean template parameterization for layout flexibility. The introduction of SmemLayoutVt_ as a template parameter with clear documentation enables the mainloop to support both TMA (FA3-style) and sparse paths with their respective layouts.
133-136: Verify causal masking step count calculation. The n_masking_steps calculation assumes ceil_div(CTA_Q, CTA_KV) is correct for causal masking. With different tile sizes, ensure this formula correctly determines the number of tiles requiring masking.
23-40: Verify all call sites are updated for the new template parameter. The function signature includes a new template parameter SmemLayoutVt_. Ensure all callers of mma_fp8 are updated to provide the correct template argument.

include/flashinfer/attention/hopper/quantization/mainloop_sparse_load.cuh (4)
64-67: LGTM! Consistent sparse-path layout naming. The layout aliases are correctly updated to use SmemLayoutVSparse and SmemLayoutVtSparse, maintaining consistency with the new naming conventions in kernel_traits.cuh.
194-202: LGTM! Correct sparse transpose implementation. The transpose source/target layouts and transposer are correctly updated to use sparse-specific variants that handle the (CTA_KV, HEAD_DIM, STAGES) layout from cp.async loading.
408-437: Verify parity toggle logic in the main loop. The parity double-buffering pattern is used to overlap prefetch and load operations. The toggle sequence appears correct (parity ^= 1 before prefetch, parity ^= 1 before load from previous prefetch, parity ^= 1 before next load), but verify the invariant is maintained across iterations.
280-298: Guard against out-of-bounds page table access at line 292. Validating kv_idx_read < kv_len does not guarantee that page_iter will be within bounds of kv_indices_ptr. When kv_idx_read is valid but page_iter exceeds the page table size, accessing kv_indices_ptr[page_iter] reads garbage data. Add bounds checking on page_iter, or ensure the kv_indices_ptr size is guaranteed to accommodate all possible page_iter values derived from the page size divmod.

include/flashinfer/attention/hopper/sm90_pipeline_no_cluster.cuh (1)
95-100: LGTM! Efficient single-thread barrier signaling. The consumer_release implementation correctly selects only threads at warp-group boundaries (threadIdx.x % NumThreadsPerWarpGroup == 0) to signal the barrier, avoiding the all-threads signaling that caused the regression in Cutlass v3.6.0.

include/flashinfer/attention/hopper/quantization/kernel_traits.cuh (6)
50-60: Excellent documentation of the FA3-style transpose approach. The comment clearly explains the key insight about TMA loading V with transposed gmem strides and the difference between TMA and sparse paths. This will be valuable for future maintainers.
97-99: Verify the assertion covers all valid configurations. The static_assert requires either kHeadDim or kBlockN to be a multiple of 64, which aligns with the 64x64 transpose atom. Confirm this constraint is documented for users configuring custom tile shapes.
189-236: LGTM! Clean FA3-style transpose implementation. The do_transpose method correctly:
- Uses LDSM.T for loading from MN-major smem
- Applies byte permutation for FP8 element reordering
- Uses STSM for storing to K-major smem
- Includes ILP optimization for better instruction-level parallelism
397-403: Good documentation explaining the SharedStorage layout choice. The comment clearly explains why SmemLayoutVSparse is used for SharedStorage (to support the sparse/paged KV path) while noting that both layouts have the same cosize for memory allocation.
305-308: Verify loop bounds match layout dimensions. The nested loops iterate over shape<2> and shape<1> of SmemLayoutVSparseTransposeSrc. Confirm these dimensions correspond to the correct tile subdivisions and that the loop ordering (outer loop over shape<2>, inner loop over shape<1>) is appropriate for the transpose operation.
390-403: Verify whether PipelineTmaAsyncNoCluster exists as a custom implementation in FlashInfer. The recommended replacement PipelineTmaAsyncNoCluster<NUM_STAGES> does not appear in standard CUTLASS sources or documentation. If this is a custom wrapper or variant specific to FlashInfer (e.g., defined in sm90_pipeline_no_cluster.cuh), verify that both the custom class and the header file exist in the codebase before applying this change. Additionally, confirm that the Cutlass v3.6.0 regression affecting PipelineTmaAsync is accurately characterized.

benchmarks/bench_fa3_comparison.py (5)
19-44: LGTM! The helper functions are well-structured:
- calc_tflops correctly computes TFLOPS using the standard attention formula (4 × batch × seq² × heads × dim), with appropriate scaling for causal masks.
- calc_tflops_varlen properly handles variable-length sequences by summing per-sequence FLOPs.
- bench_fn returns the median time, which is a robust choice for benchmarking.

A rough sketch of the TFLOPS formula is shown below.
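This is a self-contained sketch of such a helper (the name calc_tflops_sketch and its parameters are illustrative; the benchmark's actual signature may differ):

```python
def calc_tflops_sketch(batch_size, seq_len, num_heads, head_dim, time_ms, causal=False):
    # Two GEMMs (QK^T and PV), each ~2 * seq^2 * head_dim MACs per head and batch.
    flops = 4 * batch_size * seq_len * seq_len * num_heads * head_dim
    if causal:
        flops /= 2  # only the lower-triangular half of the score matrix is computed
    return flops / (time_ms * 1e-3) / 1e12
```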
250-261: LGTM! The per-head symmetric quantization implementation is correct:
- Properly handles both FP8 formats (e4m3fn and e5m2) with appropriate ranges
- Uses per-head scaling (reducing over dims 0 and 2)
- Includes safeguards against division by zero with min=1e-6 clamping
- Correctly broadcasts scale factors for element-wise quantization

A minimal sketch of this quantization scheme follows.
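For reference, a minimal sketch of this kind of per-head symmetric FP8 quantization, assuming a (seq_len, num_heads, head_dim) layout; the helper name and exact details in the benchmark may differ:

```python
import torch

def quantize_per_head_fp8(x: torch.Tensor, dtype=torch.float8_e4m3fn):
    # x: (seq_len, num_heads, head_dim); one scale per head (reduce over dims 0 and 2).
    fp8_max = torch.finfo(dtype).max
    amax = x.abs().amax(dim=(0, 2)).clamp(min=1e-6)
    scale = amax / fp8_max
    x_fp8 = (x / scale.view(1, -1, 1)).to(dtype)  # broadcast scale over seq and dim
    return x_fp8, scale  # dequantize with x_fp8.float() * scale.view(1, -1, 1)
```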
1-718: Well-structured benchmarking script. This is a comprehensive benchmarking suite that effectively compares FlashInfer and FA3 across multiple scenarios:
- Various batch sizes and sequence lengths
- Variable-length sequences
- FP8 quantization paths
- Paged KV cache with different page sizes
- Graceful handling of implementation differences
The progressive test configurations and clear output formatting make this a valuable benchmarking tool.
673-683: The q_descale computation for GQA is correct and matches FA3's API expectations. FA3's flash_attn_with_kvcache expects descale parameters with shape (num_sequences, num_kv_heads). For GQA, this means one descale value per KV head group. The code correctly reshapes s_q from [num_qo_heads] to [num_kv_heads, num_qo_heads // num_kv_heads], takes the mean across query heads within each KV group, and expands to (batch_size, num_kv_heads). This aligns with FA3's requirement that descales be per-KV-head for GQA models, where the CUDA kernel internally expands them to match per-query entries.
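A small illustrative sketch of that reshape/mean/expand step (shapes and names are taken from the description above, not copied from the benchmark):

```python
import torch

batch_size, num_qo_heads, num_kv_heads = 2, 32, 8
s_q = torch.rand(num_qo_heads)  # per-query-head scales

q_descale = (
    s_q.view(num_kv_heads, num_qo_heads // num_kv_heads)  # group query heads by KV head
    .mean(dim=1)                                           # one value per KV-head group
    .expand(batch_size, num_kv_heads)                      # (num_sequences, num_kv_heads)
)
assert q_descale.shape == (batch_size, num_kv_heads)
```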
```cpp
template <typename TileShape_QKD, typename Element, int NUM_STAGES>
struct TranposeTraits_64x64 {
  using TransposeShapeAtom_ = Shape<_64, _64>;
  using TransElement = Element;
  static_assert(cutlass::sizeof_bits_v<TransElement> == 8);

  using SmemShapeLDSM = Shape<Shape<_8, _8>, Shape<_16, _4>>;
  using SmemShapeSTSM = Shape<Shape<_16, _4>, Shape<_16, _4>>;
  static constexpr int kHeadDim = get<2>(TileShape_QKD{});
  static constexpr int kBlockN = get<1>(TileShape_QKD{});

  // MN-major for TMA loading (V is loaded with transposed gmem strides)
  static constexpr cute::GMMA::Major TmaMajorV = GMMA::Major::MN;
  // K-major for MMA consumption (required for FP8)
  static constexpr cute::GMMA::Major MmaMajorV = GMMA::Major::K;

  // ==================== TMA Path Layouts (FA3-style same-shape) ====================
  // SmemLayoutVtTma: MN-major layout for TMA load, shape (HEAD_DIM, CTA_KV, STAGES)
  using SmemLayoutAtomVtTma =
      decltype(cutlass::gemm::collective::detail::ss_smem_selector<TmaMajorV, Element,
                                                                   Int<kHeadDim>, Int<kBlockN>>());
  using SmemLayoutVtTma = decltype(tile_to_shape(
      SmemLayoutAtomVtTma{}, make_shape(Int<kHeadDim>{}, Int<kBlockN>{}, Int<NUM_STAGES>{}),
      cute::Step<_2, _1, _3>{}));  // MN-major ordering

  // SmemLayoutVtMma: K-major layout for MMA, same shape (HEAD_DIM, CTA_KV, STAGES)
  using SmemLayoutAtomVtMma =
      decltype(cutlass::gemm::collective::detail::ss_smem_selector<MmaMajorV, Element,
                                                                   Int<kHeadDim>, Int<kBlockN>>());
  using SmemLayoutVtMma = decltype(tile_to_shape(
      SmemLayoutAtomVtMma{}, make_shape(Int<kHeadDim>{}, Int<kBlockN>{}, Int<NUM_STAGES>{}),
      cute::Step<_1, _2, _3>{}));  // K-major ordering

  // For TMA path: SmemLayoutV = SmemLayoutVtTma (MN-major, for TMA load)
  using SmemLayoutV = SmemLayoutVtTma;
  using SmemLayoutVt = SmemLayoutVtMma;

  // FA3-style LDSM/STSM tiled copies for TMA path transpose
  static constexpr bool kHeadDimMultiple64 = kHeadDim % 64 == 0;
  static_assert(kHeadDimMultiple64 || kBlockN % 64 == 0,
                "Either kHeadDim or kBlockN must be multiple of 64");
```
Typo in struct name: TranposeTraits should be TransposeTraits.
The struct name has a typo (missing 's' in "Transpose").
```diff
-template <typename TileShape_QKD, typename Element, int NUM_STAGES>
-struct TranposeTraits_64x64 {
+template <typename TileShape_QKD, typename Element, int NUM_STAGES>
+struct TransposeTraits_64x64 {
```

Also update the alias at line 373:

```diff
-  using VTranposeTraits = TranposeTraits_64x64<TileShape_QKD, DTypeKV, NUM_STAGES>;
+  using VTransposeTraits = TransposeTraits_64x64<TileShape_QKD, DTypeKV, NUM_STAGES>;
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In include/flashinfer/attention/hopper/quantization/kernel_traits.cuh around
lines 61 to 99, rename the struct from TranposeTraits_64x64 to
TransposeTraits_64x64 (fix the missing 's') and update any corresponding
internal uses/typedefs to match the corrected name; also update the alias
referenced at line 373 to use TransposeTraits_64x64 instead of the old
misspelled name so all references remain consistent.
```diff
 }

-inline float cost_function(int qo_len, int kv_len) { return 2 * float(qo_len) + kv_len; }
+inline float cost_function(int qo_len, int kv_len) { return 0.05 * float(qo_len) + kv_len; }
```
Is this based on profiling? This will also change the cost for persistent FA2?
(I tried fitting a quadratic cost function over a bunch of seq lens for persistent FA2 + overlap, didn't work well)
I think FA4 just strides by num CTAs without any load balancing? I plan to try opportunistically fetching from the two work queues next in the persistent FA2, like in POD
Yes it's based on profiling.
But as you mention, none of them work well (the cost model can not model the interference of L2, etc).
FA3's dynamic LPT scheduler (it might not be necessary to still stride by num_ctas)
https://github.com/Dao-AILab/flash-attention/blob/c783ab2f7e05ba1cd79ecfe0e6e109a4e3f6e542/hopper/tile_scheduler.hpp#L338
/bot run

[SUCCESS] Pipeline #39915642: 11/20 passed
📌 Description
Follow-up of #2111, adding more optimizations in FA3 (mostly around FP8).
🔍 Related Issues
#2111
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Installed the hooks with pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- All tests are passing (unittest, etc.).

Reviewer Notes
cc @Edenzzzz
Summary by CodeRabbit
Release Notes
New Features
Performance Improvements