-
Notifications
You must be signed in to change notification settings - Fork 594
Port TRT-LLM communication kernels to flashinfer #2102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds a throughput-optimized MOE All-to-All (A2A) backend: new CUDA dispatch/combine/sanitize kernels and headers, C++ TVM/FFI bridge exposing workspace/init/dispatch/combine/sanitize/metainfo APIs, Python JIT/AOT integration and high-level MoeAlltoAll manager, environment helpers, metadata utilities, comprehensive tests, and CI script updates. Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant PyHost as Python Host (MoeAlltoAll)
participant JIT as JIT Module
participant CppFFI as C++ FFI
participant CUDA as CUDA Kernels
participant Net as Network / P2P
Note over App,CUDA: Dispatch Phase
App->>PyHost: moe_a2a_dispatch(...)
PyHost->>JIT: call registered dispatch op (workspace, metainfo)
JIT->>CppFFI: FFI bridge -> moe_a2a_dispatch
CppFFI->>CUDA: launch prepare_dispatch kernel
CUDA->>CUDA: init counters/flags
CppFFI->>CUDA: launch dispatch kernel
CUDA->>Net: write/send per-rank payloads (send buffers / send_counters)
Net->>CUDA: peers receive payloads (recv buffers)
Note over App,CUDA: Combine Phase
App->>PyHost: moe_a2a_combine(...)
PyHost->>JIT: call registered combine op
JIT->>CppFFI: FFI bridge -> moe_a2a_combine
CppFFI->>CUDA: launch prepare_combine (copy recv->workspace)
CppFFI->>CUDA: launch combine kernel
CUDA->>CUDA: per-token aggregation and top-k combine
CUDA->>CppFFI: set completion flags / return outputs
Note over App,CUDA: Sanitization (optional)
App->>PyHost: moe_a2a_sanitize_expert_ids(...)
PyHost->>JIT: sanitize op
JIT->>CppFFI: bridge to sanitize entry
CppFFI->>CUDA: launch sanitize kernel (mask invalid expert ids)
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
710a388 to
bd82a2b
Compare
| #define check_timeout(s) false | ||
| #else | ||
| // 300 * 2000 MHz - should be high enough on any GPU but will prevent a hang | ||
| #define check_timeout(s) ((clock64() - (s)) > (300ll * 2000ll * 1000ll * 1000ll)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have manually added this, can I get someone to sanity check my logic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
f766bfe to
8cdf8d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (18)
csrc/nv_internal/cpp/common/envUtils.cpp (1)
357-357: Consider cachinggetEnvEplbForceGdrcopylike other bool env helpers
getEnvEplbForceGdrcopycallsgetBoolEnv(and thusstd::getenv) on every invocation, while most other helpers in this file cache the value in astaticlocal. Functionally this is fine, but for consistency and to avoid repeated env lookups in hot paths you might want to align it:-bool getEnvEplbForceGdrcopy() { return getBoolEnv("TRTLLM_EPLB_FORCE_GDRCOPY"); } +bool getEnvEplbForceGdrcopy() { + static bool const forceGdrcopy = getBoolEnv("TRTLLM_EPLB_FORCE_GDRCOPY"); + return forceGdrcopy; +}Not critical, but it would match the rest of the env-utils style.
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (2)
428-431: Consider documenting why the acquire fence is commented out.The
fence.acquire.sysat line 430 is commented out after the dispatch wait loop. While the combine kernel (line 735) does have an acquire fence, having it commented here without explanation could cause confusion for future maintainers. If this is intentional (relying on the combine kernel's fence), a brief comment explaining the design decision would help.} - // asm volatile("fence.acquire.sys;"); + // NOTE: Acquire fence intentionally omitted here; combine kernel provides + // the acquire semantics before reading peer data. #endif
596-609: Generic fallback is unreachable code.The generic fallback reduction loop (lines 599-608) can never be reached because the
SWITCH_TOP_Kmacro (lines 53-78) only allowsTOP_Kvalues of 1, 2, 4, or 8, and all these cases have explicit handling above. Consider removing this dead code or adding astatic_assertto document the constraint.} else if constexpr (TOP_K == 1) { // nothing to do - } else { - // Generic fallback: accumulate all into acc[0] - T* a0 = reinterpret_cast<T*>(&acc[0]); -#pragma unroll - for (int k = 1; k < TOP_K; ++k) { - T* ak = reinterpret_cast<T*>(&acc[k]); -#pragma unroll - for (int j = 0; j < elems_per_vec; ++j) { - a0[j] += ak[j]; - } - } + } else { + static_assert(TOP_K == 1 || TOP_K == 2 || TOP_K == 4 || TOP_K == 8, + "Only TOP_K values 1, 2, 4, 8 are supported"); }scripts/task_test_multi_node_comm_kernels.sh (1)
9-13: Disabling cache cleanup may cause stale import issues.The cache cleanup commands are commented out. If module refactoring occurs between test runs, stale
.pycfiles could cause import errors or unexpected behavior. Consider re-enabling these commands or documenting why they're disabled.tests/comm/test_trtllm_moe_alltoall.py (4)
1-2: Copyright year should be updated to 2025.The license header shows 2024 but this is a new file created in 2025.
-Copyright (c) 2024 by FlashInfer team. +Copyright (c) 2025 by FlashInfer team.
112-112: Potential issue with payload size calculation.
x[0].numel()gets the number of elements in the first row, but ifinput_tensorsis a list of 2D tensors, this calculates size per token correctly. However, the variable namepayload_size_per_tokenand the indexingx[0]could be clearer.- payload_size_per_token = sum([x[0].numel() * x.itemsize for x in input_tensors]) + payload_size_per_token = sum([x.shape[-1] * x.element_size() for x in input_tensors])
207-236: CUDA streams created but not explicitly cleaned up.The
cuda_streams_all_rankslist creates CUDA streams that are not explicitly destroyed. While Python's garbage collector will eventually clean them up, for test reliability consider using a context manager or explicit cleanup.
411-411: Minor typo in comment.Extra slash at end of comment.
- # For each expert selected for this token/ + # For each expert selected for this tokentests/comm/test_mnnvl_moe_alltoall.py (4)
37-46: Consider usingraisewithout exception name per Python best practices.The explicit
raise eis redundant; bareraisepreserves the traceback better.def safe_run(func, *args, **kwargs): comm = MPI.COMM_WORLD try: func(*args, **kwargs) except MPIExit as e: - raise e + raise except Exception as e: traceback.print_exc() comm.allgather(True) - raise e + raise
49-51: Test fixture should yield for proper cleanup semantics.Even though no cleanup is needed, the fixture pattern should include
yieldfor consistency.@pytest.fixture(autouse=True) def setup_test(): torch.manual_seed(0x1234) + yield
571-576: Blind exception catch may mask real initialization errors.Catching bare
Exceptionwhen checking MNNVL support could hide legitimate configuration issues. Consider catching specific exception types or at least logging the exception.try: MnnvlMemory.initialize() if not MnnvlMemory.supports_mnnvl(): pytest.skip("MNNVL not supported on this system") - except Exception: + except (RuntimeError, pynvml.NVMLError) as e: + # Log exception for debugging if needed pytest.skip("MNNVL not supported on this system")
709-712: Unused variableexpert_id_payload_indexas flagged by static analysis.The unpacked variable is never used. Either prefix with underscore or remove from unpacking.
- payloads, expert_id_payload_index = make_bfloat16_payloads( + payloads, _expert_id_payload_index = make_bfloat16_payloads( local_num_tokens, hidden_size, top_k, rank, token_selected_experts )flashinfer/comm/trtllm_moe_alltoall.py (5)
8-8: TODO comment should be addressed or tracked.The
# TODO Reviewcomment at the top suggests this module needs review. Consider removing after review or converting to a tracked issue.Would you like me to open an issue to track any remaining review items?
351-351: Mutable class attribute should useClassVarannotation.Per static analysis and Python best practices, mutable class attributes should be annotated with
typing.ClassVar.+from typing import ClassVar + class MoeAlltoAll: ... # Single shared workspace across the process - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {}
456-463: Assertions for validation could use proper exceptions in production.Using
assertfor validation is acceptable for debug builds but these checks may be skipped in optimized Python (python -O). Consider using explicitif/raisefor critical invariants.
610-610: Inefficient way to get element size.Creating an empty tensor just to get element size is wasteful. Use
torch.finfoortorch.iinfoor a lookup table instead.- element_size = torch.tensor([], dtype=dtype).element_size() + # More efficient: use dtype itemsize directly + element_size = torch.empty(0, dtype=dtype).element_size()Or better, consider caching element sizes or using:
element_size = torch.finfo(dtype).bits // 8 if dtype.is_floating_point else torch.iinfo(dtype).bits // 8
621-628:__all__is not sorted as noted by static analysis.Consider sorting for consistency, though this is a minor issue.
__all__ = [ "MoeAlltoAll", "moe_a2a_initialize", + "moe_a2a_combine", "moe_a2a_dispatch", - "moe_a2a_combine", + "moe_a2a_get_workspace_size_per_rank", "moe_a2a_sanitize_expert_ids", - "moe_a2a_get_workspace_size_per_rank", ]csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1)
78-120: Well-documented struct with clear field descriptions.The
MoeA2ADispatchParamsstruct has excellent inline documentation explaining each field's purpose and dimensions. The TODO on line 90-91 about renamingmax_tokens_per_ranktoruntime_max_tokens_per_rankshould be tracked.Would you like me to open an issue to track the TODO about renaming
max_tokens_per_rank?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
csrc/nv_internal/cpp/common/envUtils.cpp(2 hunks)csrc/nv_internal/tensorrt_llm/common/envUtils.h(2 hunks)csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu(1 hunks)csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h(1 hunks)csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h(1 hunks)csrc/trtllm_moe_a2a.cu(1 hunks)docs/api/comm.rst(1 hunks)flashinfer/aot.py(1 hunks)flashinfer/comm/__init__.py(1 hunks)flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)flashinfer/jit/__init__.py(1 hunks)flashinfer/jit/comm.py(1 hunks)scripts/task_test_multi_node_comm_kernels.sh(1 hunks)tests/comm/test_mnnvl_memory.py(1 hunks)tests/comm/test_mnnvl_moe_alltoall.py(1 hunks)tests/comm/test_trtllm_moe_alltoall.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (11)
flashinfer/jit/__init__.py (1)
flashinfer/jit/comm.py (1)
gen_mnnvl_a2a_module(83-109)
csrc/nv_internal/tensorrt_llm/common/envUtils.h (1)
csrc/nv_internal/cpp/common/envUtils.cpp (8)
getEnvKVCacheTimeOutputPath(275-278)getEnvKVCacheTimeOutputPath(275-275)getEnvMoeA2AOneBlockPerToken(326-333)getEnvMoeA2AOneBlockPerToken(326-326)getEnvMoeA2ADispatchBlockSize(347-350)getEnvMoeA2ADispatchBlockSize(347-347)getEnvMoeA2ACombineBlockSize(352-355)getEnvMoeA2ACombineBlockSize(352-352)
tests/comm/test_mnnvl_memory.py (1)
flashinfer/comm/mapping.py (1)
local_rank(391-392)
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h (1)
csrc/trtllm_moe_a2a.cu (2)
getMoeA2AMetaInfoIndexPairs(395-407)getMoeA2AMetaInfoIndexPairs(395-395)
csrc/trtllm_moe_a2a.cu (1)
csrc/nv_internal/cpp/common/envUtils.cpp (2)
getEnvMoeA2AOneBlockPerToken(326-333)getEnvMoeA2AOneBlockPerToken(326-326)
flashinfer/aot.py (1)
flashinfer/jit/comm.py (1)
gen_mnnvl_a2a_module(83-109)
flashinfer/jit/comm.py (1)
flashinfer/jit/core.py (2)
JitSpec(213-312)gen_jit_spec(315-381)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
csrc/nv_internal/cpp/common/envUtils.cpp (4)
getEnvMoeA2ADispatchBlockSize(347-350)getEnvMoeA2ADispatchBlockSize(347-347)getEnvMoeA2ACombineBlockSize(352-355)getEnvMoeA2ACombineBlockSize(352-352)
flashinfer/comm/trtllm_moe_alltoall.py (4)
flashinfer/comm/mnnvl.py (5)
MnnvlMemory(232-551)MnnvlConfig(224-229)as_torch_strided_tensor(264-273)initialize(276-285)set_comm_from_config(288-293)flashinfer/comm/mapping.py (2)
Mapping(21-475)moe_ep_rank(349-350)flashinfer/jit/comm.py (1)
gen_mnnvl_a2a_module(83-109)include/flashinfer/trtllm/fused_moe/runner.h (1)
num_experts(263-263)
tests/comm/test_mnnvl_moe_alltoall.py (2)
flashinfer/comm/trtllm_moe_alltoall.py (4)
MoeAlltoAll(336-618)dispatch(484-541)get_combine_payload_tensor_in_workspace(585-618)combine(543-583)flashinfer/comm/mnnvl.py (3)
MnnvlMemory(232-551)initialize(276-285)supports_mnnvl(545-551)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (3)
csrc/nv_internal/tensorrt_llm/common/envUtils.h (1)
tensorrt_llm(23-104)csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h (1)
mnnvl_throughput(25-58)csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (8)
moe_a2a_dispatch_launch(445-506)moe_a2a_dispatch_launch(445-445)moe_a2a_prepare_dispatch_launch(436-439)moe_a2a_prepare_dispatch_launch(436-436)moe_a2a_combine_launch(792-842)moe_a2a_combine_launch(792-792)moe_a2a_sanitize_expert_ids_launch(864-872)moe_a2a_sanitize_expert_ids_launch(864-866)
🪛 Clang (14.0.6)
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h
[error] 19-19: 'array' file not found
(clang-diagnostic-error)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
[error] 18-18: 'cuda_bf16.h' file not found
(clang-diagnostic-error)
🪛 Ruff (0.14.6)
flashinfer/comm/trtllm_moe_alltoall.py
351-351: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
443-443: Avoid specifying long messages outside the exception class
(TRY003)
445-445: Avoid specifying long messages outside the exception class
(TRY003)
606-608: Avoid specifying long messages outside the exception class
(TRY003)
621-628: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tests/comm/test_mnnvl_moe_alltoall.py
34-34: Avoid specifying long messages outside the exception class
(TRY003)
42-42: Use raise without specifying exception name
Remove exception name
(TRY201)
46-46: Use raise without specifying exception name
Remove exception name
(TRY201)
575-575: Do not catch blind exception: Exception
(BLE001)
676-676: Do not catch blind exception: Exception
(BLE001)
710-710: Unpacked variable expert_id_payload_index is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🔇 Additional comments (38)
csrc/nv_internal/cpp/common/envUtils.cpp (2)
275-277: KV cache time output path now cached by reference – behavior looks consistentReturning
std::string const&to a function-local static is safe here and matches the pattern used by other helpers in this file (single evaluation of the env var, cheap repeated access). The only behavior change is that changes toTRTLLM_KVCACHE_TIME_OUTPUT_PATHafter the first call are no longer observed, which is usually fine for env-based config.If you don’t rely on mutating the env mid-process (including in tests), this change looks good to me.
326-355: Based on my analysis with the available information, I can now rewrite the review comment:Verify MOE A2A environment variable prefixes (
TLLM_vsTRTLLM_) against project documentation and call sites; fix misleading comment insanitizeBlockSizeto reflect "round up" instead of "nearest"The new helpers follow the file's consistent patterns (single-read statics, sane defaults), but need attention in two areas:
Env var prefixes for MoE A2A knobs
- These functions read
TLLM_MOE_A2A_ONE_BLOCK_PER_TOKEN,TLLM_MOE_A2A_DISPATCH_BLOCK_SIZE, andTLLM_MOE_A2A_COMBINE_BLOCK_SIZE- Review the project documentation, call sites, and surrounding code to confirm whether the
TLLM_prefix matches intended usage; most other envs in this file appear to useTRTLLM_prefix and silently using the wrong names would be problematic.
sanitizeBlockSizecomment accuracy
- The implementation rounds up to a multiple of 32:
block = (block + 31) / 32 * 32;(e.g., 33 → 64, not 32)- The comment currently states "Round to nearest multiple of 32 (warp size)", which is misleading.
Fix the comment:
- // Round to nearest multiple of 32 (warp size)
- // Round up to the next multiple of 32 (warp size)
The extra `if (block == 0) block = 256;` check after clamping is redundant but harmless.csrc/nv_internal/tensorrt_llm/common/envUtils.h (1)
95-102: LGTM! New MoE A2A environment variable accessors are properly declared.The three new accessor functions (
getEnvMoeA2AOneBlockPerToken,getEnvMoeA2ADispatchBlockSize,getEnvMoeA2ACombineBlockSize) are well-documented with default behaviors and align with their implementations inenvUtils.cpp.csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h (3)
19-22: Static analysis false positive - standard headers are valid.The Clang error about
<array>not being found is a false positive. This is a standard C++11 header that should be available in any modern C++ environment. The includes are correct.
28-43: LGTM! Well-structured metadata index enum.The
MoeA2AMetaInfoIndexenum provides clear, sequential indexing for metadata fields withNUM_METAINFO_FIELDS = 9correctly representing the count of actual data fields (0-8). TheMoeA2ADataOffsetstype alias correctly uses this count for the array size.
45-58: LGTM! Useful name-to-index mapping function.The inline
getMoeA2AMetaInfoIndexPairs()function provides a clean way to expose metadata field names and their corresponding indices, which is consumed by the TVM FFI interface incsrc/trtllm_moe_a2a.cu.csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (4)
114-116: Timeout calculation looks reasonable.The timeout of
300ll * 2000ll * 1000ll * 1000llcycles (~600 billion) translates to approximately 300 seconds at 2 GHz, which provides a generous upper bound to prevent infinite hangs while allowing ample time for legitimate synchronization delays. The calculation avoids overflow by usinglong longliterals.
272-276: LGTM! Single-threaded flag increment is safe.The
flag_valincrement occurs only whenidx == 0, ensuring single-threaded access. Since this kernel runs sequentially in the stream before the dispatch kernel, there's no race condition.
844-872: LGTM! Sanitize kernel implementation is correct.The kernel correctly identifies invalid tokens (where
token_idx >= recv_counters[source_rank]) and sets their expert IDs toinvalid_id. Each thread operates on disjoint memory locations, avoiding any race conditions.
315-344: Based on my investigation, I cannot access the repository or find public documentation for thekMaxRanksconstant definition. The repository clone failed, and web searches returned no results for this internal NVIDIA TensorRT-LLM code.However, the core concern raised in the review comment remains valid and cannot be conclusively verified without access to:
- The header file containing
kMaxRanksdefinition- The actual value of
kMaxRanks- Runtime validation constraints on
ep_size(ensemble parallel size)The potential undefined behavior is legitimate: if
target_rankcan be 64 or greater, the bit shift1ULL << target_rankon auint64_twould indeed cause undefined behavior in C++.
Verify
kMaxRanksdoes not exceed 64 to avoid undefined behavior.The
already_copiedbitmask usesuint64_twith bit operations1ULL << target_rank. Iftarget_rankcan be 64 or greater, this causes undefined behavior (shifting by >= width of type). The code validatesparams.ep_size <= kMaxRanksat line 448, so ensurekMaxRanksis defined as ≤ 64 in the header.flashinfer/aot.py (1)
515-522: LGTM! MNNVL A2A module integration follows existing patterns.The new
gen_mnnvl_a2a_moduleis correctly imported within theadd_commblock and added under thehas_sm100condition, consistent with the existinggen_trtllm_comm_moduleandgen_trtllm_mnnvl_comm_moduleplacement.tests/comm/test_mnnvl_memory.py (1)
125-125: LGTM! Correct device selection for multi-node scenarios.Using
self.local_rankinstead ofself.rankis the correct fix for multi-node setups where the global rank may exceed the local GPU count. This aligns with the setup fixture (line 51) and theMapping.local_rankproperty shown in the relevant snippet.scripts/task_test_multi_node_comm_kernels.sh (1)
17-19: LGTM - new MoE A2A test coverage added.The addition of
test_mnnvl_moe_alltoall.pyaligns with the new MoE A2A functionality introduced in this PR.flashinfer/jit/__init__.py (1)
80-80: LGTM!The re-export follows the established pattern for module generators in this file.
flashinfer/comm/__init__.py (1)
42-52: LGTM!The new MoE A2A exports follow the established import patterns and properly expose the public API surface for the throughput backend.
csrc/trtllm_moe_a2a.cu (6)
53-99: LGTM - offset calculations and workspace sizing are well-structured.The alignment logic using cache-line boundaries (128 bytes) is appropriate for GPU memory access patterns. The offset calculations correctly account for metadata fields and payload regions.
101-123: LGTM - initialization logic is correct.Input validation is comprehensive. The
cudaStreamSynchronizeat line 119 ensures the memset completes before returning, which is necessary for correctness.
132-262: LGTM - dispatch implementation is solid.The function has comprehensive input validation, proper payload descriptor setup, workspace size verification, and appropriate CUDA error checking after kernel launch.
264-277: LGTM!The dtype conversion correctly handles the common floating-point types used in MoE computations. The unreachable return after the throw silences compiler warnings.
279-361: LGTM - combine implementation is well-validated.The function properly validates all inputs including the optional
payloadInWorkspaceoptimization for zero-copy scenarios. Error handling is comprehensive.
393-416: LGTM!The metainfo accessor correctly wraps the C++ pairs for Python accessibility, and all FFI exports are properly registered.
flashinfer/jit/comm.py (1)
83-109: I'll verify whether architecture-specific NVCC flags are required for thegen_mnnvl_a2a_modulefunction. Let me search the codebase for relevant information.
<function_calls>
#!/bin/bashFind the moeAlltoAllKernels.cu file
find . -name "moeAlltoAllKernels.cu" -type f 2>/dev/null
</function_calls>
<function_calls>
#!/bin/bashCheck if the file exists and search for architecture-specific patterns
if [ -f "csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu" ]; then
echo "=== Checking for SM guards and architecture macros ==="
rg -n 'sm|SM[0-9]|CUDA_ARCH|CUDA_SM' "csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu" || echo "No explicit SM guards found"echo ""
echo "=== Checking for tensor core/MMA operations ==="
rg -n 'mma|wmma|tensorop|__mma|__hmma' "csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu" || echo "No tensor core intrinsics found"
else
echo "File not found at expected path"
fi
</function_calls><function_calls>
#!/bin/bashLook at the comm.py file to understand nvcc_flags definition and usage
echo "=== Current comm.py around nvcc_flags definition ==="
rg -n 'nvcc_flags' flashinfer/jit/comm.py -B3 -A3echo ""
echo "=== All functions in comm.py to see pattern ==="
rg -n '^def gen_' flashinfer/jit/comm.py
</function_calls><function_calls>
#!/bin/bashLet's see the full context of comm.py to understand nvcc_flags and module patterns
wc -l flashinfer/jit/comm.py
</function_calls><function_calls>
#!/bin/bashRead the entire comm.py file to understand structure and nvcc_flags
cat -n flashinfer/jit/comm.py
</function_calls><function_calls>
#!/bin/bashCheck what nvcc_flags is and where it comes from
rg -n 'nvcc_flags\s*=' flashinfer/jit/comm.py -B2 -A2
echo ""
echo "=== Check for imports in comm.py ==="
head -30 flashinfer/jit/comm.py
</function_calls><function_calls>
#!/bin/bashCheck the jit module structure
ls -la flashinfer/jit/
</function_calls><function_calls>
#!/bin/bashCheck jit_env for nvcc_flags
rg -n 'nvcc_flags' flashinfer/jit/ -A2 -B2
</function_calls>tests/comm/test_trtllm_moe_alltoall.py (6)
25-29: Session-scoped fixture withtorch.manual_seedmay not reset seed between tests.The fixture sets the seed once at session start but individual tests may want reproducible seeds. Consider if this is intentional or if test-level seeding is needed.
62-69: Good defensive check for SM resources.This helper appropriately skips tests when insufficient SMs are available for parallel kernel execution. The check prevents hangs on systems with limited GPU resources.
72-84: LGTM!The
make_payloadhelper correctly distinguishes between integer and floating-point types for random tensor generation.
138-141: Sorting approach for validation is reasonable but fragile.Sorting both input and output tensors to compare them works for this test case but relies on unique values. If there are duplicate values, the sort order could differ. Consider documenting this assumption or using a more robust comparison.
388-429: Reference implementation for fake_moe looks correct.The fake_moe function properly handles expert parallelism filtering and accumulation. The tree reduction comment on line 423 correctly explains why results are summed after collection.
530-536: Relatively loose tolerance for numerical comparison.Using
atol=1.5e-2andrtol=1.5e-2is quite loose for bf16/fp16. This may mask precision issues. Verify this tolerance is intentional given the accumulation order differences mentioned elsewhere.tests/comm/test_mnnvl_moe_alltoall.py (3)
293-293: Direct modification of class variable_WORKSPACEis concerning.Setting
MoeAlltoAll._WORKSPACE = Nonedirectly before instantiation suggests test isolation concerns. This should be documented or handled via a proper reset method.Consider whether
_reset_workspace()method fromMoeAlltoAllshould be used instead, or if this pattern is intentional for test setup.
800-813: Good documentation of tolerance rationale.The comment on line 809 clearly explains why a 99% match threshold is used instead of exact comparison due to bf16 accumulation order differences. This is helpful for future maintainers.
836-838: Helpful run instructions in docstring.The comment showing how to run with
mpirunis useful for developers unfamiliar with MPI testing.flashinfer/comm/trtllm_moe_alltoall.py (3)
353-383: Workspace caching strategy looks correct.The caching by
(workspace_size_per_rank, ep_rank, ep_size, max_num_tokens)tuple allows reusing workspaces across instances with compatible configurations. This addresses the past review comment about supporting different shaped communicators.
470-482:_reset_workspacemethod deletes from class cache without thread safety.If multiple threads could access this class simultaneously, the
deloperation on_WORKSPACE_CACHEcould cause issues. Document that this method is not thread-safe.def _reset_workspace(self): - """Reset the workspace to free up its state. This is mainly used for testing. Use this with caution. This object is no longer usable after this.""" + """Reset the workspace to free up its state. + + Warning: This method is not thread-safe and is mainly used for testing. + This object is no longer usable after calling this method. + """
505-508: Good use of state machine pattern for dispatch/combine sequencing.The phase checking prevents calling dispatch twice without combine and ensures proper operation ordering. This is a clean design.
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (4)
17-19: Static analysis reports missingcuda_bf16.h- this is a false positive.The
cuda_bf16.handcuda_fp16.hheaders are provided by the CUDA toolkit and will be available during compilation with nvcc. Static analysis tools without CUDA environment cannot find these headers.
23-27: Configuration constants are well-documented and reasonable.The limits (256 experts, 8 top-k, 8 payloads, 64 ranks) provide good flexibility while keeping fixed-size arrays manageable. Consider whether these should be configurable at runtime if larger deployments are anticipated.
173-179: Function declarations are clean and match the implementation.The kernel launch function declarations align with the implementations shown in the relevant code snippets from
moeAlltoAllKernels.cu.
148-148: Unable to verify include configuration due to repository access failure.The repository clone failed, preventing me from examining the file's include structure, verifying whether
nvinfer1::DataTypeis actually used, or confirming if the necessary headers are already present. Manual verification is required to confirm:
- Whether
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.hcurrently includes NvInfer headers- Whether
nvinfer1::DataTypeis actually declared in the file or included transitively- Whether the code compiles successfully without the suggested include
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (3)
flashinfer/comm/trtllm_moe_alltoall.py (3)
379-379: Annotate mutable class attribute withClassVar.Per Python best practices, mutable class attributes should be annotated with
ClassVarto make clear they are shared across instances.+from typing import ClassVar + class MoeAlltoAll: ... - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {}
638-638: Consider usingtorch.finfoortorch.iinfofor element size.Creating an empty tensor just to get element size has minor overhead. Consider using dtype introspection directly.
- element_size = torch.tensor([], dtype=dtype).element_size() + element_size = torch._utils._element_size(dtype)Alternatively, keep the current approach if you prefer avoiding private APIs.
649-656: Consider addingmoe_a2a_wrap_payload_tensor_in_workspaceto__all__.This function is used in tests and appears to be part of the public API. Also consider sorting
__all__for consistency.__all__ = [ "MoeAlltoAll", + "moe_a2a_combine", "moe_a2a_initialize", "moe_a2a_dispatch", - "moe_a2a_combine", + "moe_a2a_get_workspace_size_per_rank", "moe_a2a_sanitize_expert_ids", - "moe_a2a_get_workspace_size_per_rank", + "moe_a2a_wrap_payload_tensor_in_workspace", ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
csrc/trtllm_moe_alltoall.cu(1 hunks)flashinfer/aot.py(1 hunks)flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)flashinfer/jit/__init__.py(1 hunks)flashinfer/jit/comm.py(1 hunks)tests/comm/test_trtllm_moe_alltoall.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- flashinfer/aot.py
- flashinfer/jit/comm.py
🧰 Additional context used
🧬 Code graph analysis (2)
csrc/trtllm_moe_alltoall.cu (2)
csrc/tvm_ffi_utils.h (3)
Tensor(282-284)get_current_stream(266-270)encode_dlpack_dtype(29-31)flashinfer/comm/trtllm_moe_alltoall.py (6)
moe_a2a_get_workspace_size_per_rank(175-198)moe_a2a_get_workspace_size_per_rank(350-361)moe_a2a_initialize(41-47)moe_a2a_initialize(210-218)moe_a2a_dispatch(53-93)moe_a2a_dispatch(251-309)
tests/comm/test_trtllm_moe_alltoall.py (7)
flashinfer/comm/mapping.py (1)
Mapping(21-475)tests/test_helpers/test_helpers.py (1)
get_device_properties(10-11)include/flashinfer/trtllm/fused_moe/runner.h (1)
num_experts(263-263)flashinfer/comm/trtllm_moe_alltoall.py (13)
moe_a2a_get_workspace_size_per_rank(175-198)moe_a2a_get_workspace_size_per_rank(350-361)MoeAlltoAll(364-646)dispatch(512-569)get_combine_payload_tensor_in_workspace(613-646)combine(571-611)moe_a2a_initialize(41-47)moe_a2a_initialize(210-218)moe_a2a_dispatch(53-93)moe_a2a_dispatch(251-309)moe_a2a_sanitize_expert_ids(146-155)moe_a2a_sanitize_expert_ids(338-347)moe_a2a_wrap_payload_tensor_in_workspace(221-248)flashinfer/fused_moe/utils.py (1)
_(157-163)csrc/xqa/mha.cu (1)
any(157-157)tests/comm/test_mnnvl_moe_alltoall.py (1)
fake_moe(121-181)
🪛 Ruff (0.14.6)
flashinfer/comm/trtllm_moe_alltoall.py
379-379: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
471-471: Avoid specifying long messages outside the exception class
(TRY003)
473-473: Avoid specifying long messages outside the exception class
(TRY003)
634-636: Avoid specifying long messages outside the exception class
(TRY003)
649-656: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🔇 Additional comments (16)
flashinfer/jit/__init__.py (1)
80-80: LGTM!The new import follows the established pattern for re-exporting JIT module generators from the comm submodule.
csrc/trtllm_moe_alltoall.cu (5)
53-88: LGTM!The offset calculation logic correctly aligns data structures to cache-line boundaries where needed, following a clear sequential layout pattern.
101-123: LGTM!The initialization correctly validates inputs, zeros the workspace region, and returns metadata offsets. The stream synchronization ensures the workspace is properly initialized before returning.
278-360: LGTM!The combine operation has thorough input validation, properly handles the workspace-backed payload case, and includes appropriate error checking after kernel launch.
362-394: LGTM!The sanitization operation correctly validates inputs and launches the kernel with proper error checking.
396-419: LGTM!The metainfo index pairs helper provides a clean mechanism to expose C++ constants to Python, and all required functions are properly exported.
tests/comm/test_trtllm_moe_alltoall.py (6)
74-86: LGTM!The payload generator correctly handles both integer and floating-point dtypes for test data generation.
93-161: LGTM!Comprehensive single-GPU test covering multiple payload dtypes, dispatch/combine workflow, and workspace-backed tensor operations.
164-240: LGTM!The helper correctly simulates multi-rank dispatch on a single GPU using separate CUDA streams, with proper synchronization.
302-344: LGTM!The multi-rank test correctly validates token routing across simulated ranks with proper verification of payload delivery.
390-431: LGTM!The reference MoE implementation provides a deterministic baseline for verifying combine correctness, with appropriate handling of expert-parallel scenarios.
434-551: LGTM!Comprehensive combine test covering multiple dtypes, workspace configurations, and ranks with appropriate numerical tolerances for reduced-precision arithmetic.
flashinfer/comm/trtllm_moe_alltoall.py (4)
32-207: LGTM!The JIT module getter follows the established pattern with proper caching and custom op registration.
221-248: LGTM!The function correctly creates a workspace-backed tensor view with properly documented parameters.
470-473: LGTM!The validation logic is appropriate and the exception messages are concise.
498-510: LGTM!The reset method appropriately handles workspace cleanup for testing scenarios, with clear documentation about post-call state.
6e9bed5 to
a51b1ea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
csrc/trtllm_moe_alltoall.cu (1)
263-276: Consider extending dtype support for future flexibility.The
toNvDataTypefunction currently supports half, bfloat16, and float32. Consider documenting supported types or adding int8/fp8 support if those are common in MoE workloads.tests/comm/test_mnnvl_moe_alltoall.py (3)
711-712: Unused variable is intentional; consider underscore prefix per Ruff hint.The
expert_id_payload_indexis returned by the helper but not used in this test. Consider renaming to_expert_id_payload_indexto signal intentional discard.- payloads, expert_id_payload_index = make_bfloat16_payloads( + payloads, _expert_id_payload_index = make_bfloat16_payloads(
293-294: Setting class attribute_WORKSPACE = Nonemay conflict with class-level cache.Assigning
MoeAlltoAll._WORKSPACE = Noneresets a non-existent instance attribute. The class uses_WORKSPACE_CACHEfor caching. This assignment has no effect but is misleading.Consider removing this line or using
MoeAlltoAll._WORKSPACE_CACHE.clear()if the intent is to reset the cache:- MoeAlltoAll._WORKSPACE = None + MoeAlltoAll._WORKSPACE_CACHE.clear()
742-742: Same issue:_WORKSPACE = Noneassignment is ineffective.This line also sets a non-existent attribute. Consider removing or using
_WORKSPACE_CACHE.clear().- MoeAlltoAll._WORKSPACE = Noneflashinfer/comm/trtllm_moe_alltoall.py (2)
375-377: Annotate mutable class attribute withClassVarper Ruff hint.The
_WORKSPACE_CACHEis a mutable class-level attribute that should be annotated withClassVarto make the intent clear.+from typing import ClassVar + class MoeAlltoAll: ... - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {}
496-508: Document that_reset_workspaceinvalidates the instance.The docstring mentions this but it's critical: after calling
_reset_workspace, the object is unusable. Consider adding a stronger warning or raising an exception on subsequent method calls.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
csrc/trtllm_moe_alltoall.cu(1 hunks)flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)tests/comm/test_mnnvl_moe_alltoall.py(1 hunks)tests/comm/test_trtllm_moe_alltoall.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
csrc/trtllm_moe_alltoall.cu (2)
csrc/tvm_ffi_utils.h (3)
Tensor(282-284)get_current_stream(266-270)encode_dlpack_dtype(29-31)flashinfer/comm/trtllm_moe_alltoall.py (11)
moe_a2a_get_workspace_size_per_rank(173-196)moe_a2a_get_workspace_size_per_rank(348-359)moe_a2a_initialize(39-45)moe_a2a_initialize(208-216)moe_a2a_dispatch(51-91)moe_a2a_dispatch(249-307)moe_a2a_combine(97-138)moe_a2a_combine(310-333)moe_a2a_sanitize_expert_ids(144-153)moe_a2a_sanitize_expert_ids(336-345)moe_a2a_get_metainfo_index_pairs(159-167)
flashinfer/comm/trtllm_moe_alltoall.py (3)
flashinfer/comm/mnnvl.py (5)
MnnvlMemory(232-551)MnnvlConfig(224-229)as_torch_strided_tensor(264-273)initialize(276-285)set_comm_from_config(288-293)flashinfer/comm/mapping.py (2)
Mapping(21-475)moe_ep_rank(349-350)flashinfer/jit/comm.py (1)
gen_mnnvl_moe_alltoall_module(83-109)
tests/comm/test_mnnvl_moe_alltoall.py (3)
flashinfer/comm/trtllm_moe_alltoall.py (4)
MoeAlltoAll(362-644)dispatch(510-567)get_combine_payload_tensor_in_workspace(611-644)combine(569-609)flashinfer/comm/mapping.py (1)
Mapping(21-475)flashinfer/comm/mnnvl.py (3)
MnnvlMemory(232-551)initialize(276-285)supports_mnnvl(545-551)
🪛 Ruff (0.14.6)
flashinfer/comm/trtllm_moe_alltoall.py
377-377: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
469-469: Avoid specifying long messages outside the exception class
(TRY003)
471-471: Avoid specifying long messages outside the exception class
(TRY003)
632-634: Avoid specifying long messages outside the exception class
(TRY003)
tests/comm/test_mnnvl_moe_alltoall.py
34-34: Avoid specifying long messages outside the exception class
(TRY003)
576-576: Do not catch blind exception: Exception
(BLE001)
677-677: Do not catch blind exception: Exception
(BLE001)
711-711: Unpacked variable expert_id_payload_index is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🔇 Additional comments (44)
csrc/trtllm_moe_alltoall.cu (8)
1-51: LGTM! Well-structured header and utility functions.The license header, includes, and utility functions are appropriately organized. The
alignOffsetfunction correctly implements cache-line alignment using bitwise operations.
53-88: LGTM! Offset calculation is well-structured.The
calculateOffsetsfunction properly calculates memory offsets for various MoE A2A data structures with appropriate alignment for cache-line boundaries.
90-99: LGTM! Workspace size calculation.The workspace size calculation correctly accounts for metadata, payload, and combine regions with proper alignment.
101-123: LGTM! Initialize operation with proper validation.Good input validation for workspace dimensions, rank bounds, and proper error checking for CUDA operations. The synchronization before returning metainfo is appropriate.
125-261: LGTM! Dispatch operation is well-implemented.The dispatch function has comprehensive input validation, proper payload descriptor handling, and correct workspace pointer arithmetic. Error checking after kernel launch is appropriate.
278-360: LGTM! Combine operation with proper validation.The combine function correctly validates payload dimensions, workspace pointer alignment, and handles the
payloadInWorkspaceflag appropriately. Error checking after kernel launch is proper.
362-394: LGTM! Sanitize operation is correctly implemented.Proper input validation and error checking for the sanitize expert IDs kernel.
396-419: LGTM! Metainfo export and FFI registration.The metainfo index pairs function and TVM FFI exports are correctly implemented, providing clean Python interoperability.
tests/comm/test_trtllm_moe_alltoall.py (12)
25-29: LGTM! Docstring has been corrected.The fixture docstring now accurately describes that it sets the torch seed for deterministic tests.
32-60: Good test parameter coverage.The test parameters cover a good range of configurations (small, medium, large) for both single-GPU and multi-rank scenarios, with various dtypes and payload configurations.
63-72: Good resource-aware skip logic.The SM count check appropriately skips tests when hardware resources are insufficient, preventing false failures on less capable GPUs.
74-86: LGTM! Payload generation helper.The
make_payloadfunction correctly handles both integer and floating-point dtypes with appropriate random value generation.
89-162: Comprehensive single-GPU test with proper verification.The test covers dispatch and combine flows with multiple dtypes, validates output via sorting and exact comparison, and tests the workspace-backed combine path.
164-240: LGTM! Multi-rank dispatch helper is well-structured.The helper properly manages workspaces, initializes per-rank metadata, uses separate CUDA streams for parallel execution, and synchronizes appropriately.
243-259: LGTM! Sanitize helper function.Simple and correct delegation to the underlying sanitize function for each rank.
262-299: LGTM! Combine helper with parallel execution.The combine helper correctly uses separate streams per rank and synchronizes before returning results.
302-345: LGTM! Multi-rank test with proper verification.Good verification logic that filters non-zero tensors and compares sorted outputs against the reference filtered by expert assignment.
347-388: LGTM! Sanitize test with comprehensive verification.The test properly clones tensors before sanitization to enable before/after comparison and correctly verifies the sanitization logic.
390-431: LGTM! Reference MoE implementation for verification.The
fake_moefunction provides a clear reference implementation for verifying the distributed MoE behavior, with proper EP-rank filtering logic.
434-555: Good end-to-end combine test with tolerance handling.The test covers the full dispatch-process-combine cycle with both in-workspace and external payload paths. The tolerance values for bf16 are reasonable.
tests/comm/test_mnnvl_moe_alltoall.py (11)
27-46: MPI error handling utilities are well-designed.The
MPIExitexception,check_any_rank_failed, andsafe_runpattern provide robust MPI coordination for test failures across ranks, ensuring clean error propagation.
49-52: LGTM! Test fixture for deterministic seeding.
55-88: LGTM! Helper functions for expert routing and token generation.
compute_target_rank_idcorrectly implements contiguous expert partitioning, andgenerate_token_selected_expertsproperly generates random expert assignments.
91-119: LGTM! Expert weight creation with reproducible seeding.Using
ep_rank * 1000 + ias a seed ensures reproducibility across runs while differentiating experts per rank.
122-182: LGTM! Comprehensive fake MoE reference implementation.The function correctly handles both EP-rank and global modes with proper local expert ID conversion.
185-258: LGTM! Payload creation helpers.Both NV FP4 and BFloat16 payload creators are well-structured with appropriate rank-specific patterns for verification.
261-383: LGTM! Single-rank dispatch worker function.Comprehensive workspace setup, dispatch execution, and metadata extraction for MPI-based testing.
386-556: LGTM! Thorough dispatch verification.The
verify_dispatchfunction provides exhaustive validation of shapes, dtypes, counters, routing, and payload content. This is excellent for catching regressions.
572-577: Bare Exception catch is intentional for MNNVL availability check.The broad exception catch here is acceptable since it's used to detect MNNVL support availability across various failure modes (driver issues, missing hardware, etc.).
673-678: Bare Exception catch is acceptable for capability detection.Same as above - this is intentional for gracefully skipping tests on systems without MNNVL support.
657-818: LGTM! Full dispatch+combine cycle test.The test properly verifies the complete MoE A2A workflow with appropriate tolerance for bf16 accumulation order differences. The 99% match threshold is reasonable given the expected numerical variations.
flashinfer/comm/trtllm_moe_alltoall.py (13)
1-19: LGTM! Module header and imports are well-organized.Clean module docstring and appropriate imports for the MoE A2A functionality.
21-28: LGTM! State dataclass is appropriate.The
_A2AStatedataclass cleanly tracks the dispatch/combine lifecycle with appropriate fields.
30-205: LGTM! JIT module initialization with custom op registration.The lazy module loading with
@functools.cacheand custom op registration is well-structured. The returnedSimpleNamespaceprovides a clean API surface.
208-246: LGTM! Public wrapper functions are clean delegations.The top-level
moe_a2a_*functions provide clean interfaces to the JIT module, with appropriate docstrings where needed.
249-307: LGTM! Dispatch wrapper with tensor wrapping.The dispatch function correctly wraps the raw offsets into workspace-backed tensors for each payload.
310-359: LGTM! Combine, sanitize, and workspace size wrappers.Clean delegation to the underlying JIT module.
379-409: LGTM! Workspace caching with proper key management.The
get_workspaceclassmethod correctly caches workspaces by configuration tuple, preventing redundant allocations.
411-432: LGTM! Lazy metainfo constant initialization.The
_init_constantsmethod properly strips prefixes for a cleaner Python API.
434-494: LGTM! Constructor with proper validation and MNNVL configuration.Good input validation for
top_kandnum_experts, with optional MnnvlConfig support as discussed in past reviews.
510-567: LGTM! Dispatch method with proper state management.Good state assertions, lifecycle tracking, and optional sanitization flow.
569-609: LGTM! Combine method with state reset.Proper state validation and reset after combine completes, enabling the next dispatch/combine cycle.
611-645: LGTM! Workspace-backed tensor accessor.The
get_combine_payload_tensor_in_workspacemethod correctly computes slice bounds and validates state.
647-654: LGTM! Clean__all__export list.Explicitly defines the public API surface.
|
/bot run |
7167939 to
b25cb7c
Compare
|
/bot run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (1)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
748-786: Unify combine prepare block size with env-configurable combine block sizemoe_a2a_prepare_combine_launch currently hardcodes
kBlockSize = 256and deriveskWarpsPerBlockfrom that, while moe_a2a_combine_launch usesgetEnvMoeA2ACombineBlockSize()and a possibly different block size. IfTLLM_MOE_A2A_COMBINE_BLOCK_SIZEis set, the prepare and combine phases will run with different block sizes/warps, which is confusing and can make tuning harder.To keep the two phases consistent, consider:
-void moe_a2a_prepare_combine_launch(MoeA2ACombineParams const& params) { - constexpr int kBlockSize = 256; - constexpr int kWarpsPerBlock = kBlockSize / 32; // 8 warps per block +void moe_a2a_prepare_combine_launch(MoeA2ACombineParams const& params) { + int const kBlockSize = tensorrt_llm::common::getEnvMoeA2ACombineBlockSize(); + int const kWarpsPerBlock = kBlockSize / 32; // warpSizeand keeping the rest of the grid/block calculations unchanged.
🧹 Nitpick comments (3)
flashinfer/comm/trtllm_moe_alltoall.py (3)
210-237: Verify workspace contiguity assumptions in moe_a2a_wrap_payload_tensor_in_workspaceThis helper relies on
workspace.view(-1).view(dtype=torch.uint8)and absoluteslice_start/slice_endindices. That’s fine if the underlying tensor is logically contiguous in memory (as it likely is for a plain[ep_size, size_per_rank]uint8workspace), but it can break or raise if the tensor has nontrivial strides (e.g., from MnnvlMemory/pack_strided_memory).Given the MNNVL integration, please double‑check that:
- The workspace tensor passed here is effectively contiguous over the region
[0, workspace.numel()), and- The absolute offsets coming back from C++ (including rankWorkspaceOffset) are computed in the same units as
workspace.numel()(bytes foruint8).If either assumption doesn’t hold, we may need to adjust this helper (e.g., operate on a per‑rank row only, or derive slice indices from the underlying strides rather than flattening with view).
487-563: Consider tracking runtime_max_tokens_per_rank in state to guard misuseMoeAlltoAll.dispatch() and combine() both take runtime_max_tokens_per_rank as an argument, but the value is not stored in _A2AState. If a caller accidentally uses a different runtime_max_tokens_per_rank in combine (or get_combine_payload_tensor_in_workspace) than was used for dispatch, the workspace region calculations and C++ expectations could diverge.
A lightweight guard would be:
- Add
runtime_max_tokens_per_rank: Optional[int]to _A2AState.- Set it in dispatch(), and in combine() / get_combine_payload_tensor_in_workspace() assert that the incoming value matches the stored one.
This gives an early, clear error if the API is misused, without affecting the hot path beyond a couple of integer stores/checks.
373-388: Optional: align with Ruff hints for class-level cache and Optional typingRuff’s feedback (RUF012, RUF013) suggests:
- Annotating
_WORKSPACE_CACHEas aClassVar[...]to make its class-level nature explicit.- Using
MnnvlConfig | None/int | Noneinstead of implicitOptional[...]where desired (or being consistent across the module).These are non-functional style tweaks but may reduce lint noise if you’re enforcing Ruff in CI.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu(1 hunks)csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h(1 hunks)csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h(1 hunks)csrc/trtllm_moe_alltoall.cu(1 hunks)flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h (1)
csrc/trtllm_moe_alltoall.cu (2)
getMoeA2AMetaInfoIndexPairs(392-404)getMoeA2AMetaInfoIndexPairs(392-392)
flashinfer/comm/trtllm_moe_alltoall.py (4)
flashinfer/comm/mnnvl.py (5)
MnnvlMemory(232-551)MnnvlConfig(224-229)as_torch_strided_tensor(264-273)initialize(276-285)set_comm_from_config(288-293)flashinfer/comm/mapping.py (2)
Mapping(21-475)moe_ep_rank(349-350)flashinfer/jit/comm.py (1)
gen_moe_alltoall_module(83-112)include/flashinfer/trtllm/fused_moe/runner.h (2)
num_experts(263-263)hidden_size(265-265)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
csrc/nv_internal/cpp/common/envUtils.cpp (4)
getEnvMoeA2ADispatchBlockSize(347-350)getEnvMoeA2ADispatchBlockSize(347-347)getEnvMoeA2ACombineBlockSize(352-355)getEnvMoeA2ACombineBlockSize(352-352)
🪛 Clang (14.0.6)
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h
[error] 19-19: 'array' file not found
(clang-diagnostic-error)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
[error] 18-18: 'cuda_bf16.h' file not found
(clang-diagnostic-error)
🪛 Ruff (0.14.7)
flashinfer/comm/trtllm_moe_alltoall.py
388-388: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
493-493: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
494-494: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
533-533: Avoid specifying long messages outside the exception class
(TRY003)
535-535: Avoid specifying long messages outside the exception class
(TRY003)
696-698: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (8)
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h (1)
27-57: Metainfo enum and mapping look consistent and self-containedEnum values, NUM_METAINFO_FIELDS, the MoeA2ADataOffsets alias, and getMoeA2AMetaInfoIndexPairs() are coherent and match the usage pattern on the C++/Python side; no issues from a correctness or API perspective.
csrc/trtllm_moe_alltoall.cu (5)
53-93: Aux-data layout and size computation look correctcalculateOffsets()’s layout and getMoeA2AAuxDataSize() returning PAYLOAD_DATA_OFFSET_INDEX align with the Python-side workspace sizing (aux region + payload regions), and the 128‑byte alignment strategy is internally consistent.
95-117: Initialization helper is straightforward and consistent with layoutType/dimension checks on workspace, per‑rank memset, metainfo construction from calculateOffsets(), and final stream sync are all appropriate; this matches expectations for pre‑zeroing aux data and exporting offsets.
119-255: Dispatch host wrapper is careful about shapes, payload layout, and alignmentThe validation of token_selected_experts, payload list, metainfo, workspace, ep topology, and top_k, plus the computation of payloadByteSizes/totalBytesNeeded and the per‑rank recv buffer wiring, all look correct. The alignment check on totalBytesNeeded against each payload’s element size ensures aligned starts for each payload region, which is a nice touch.
257-354: Combine host wrapper and DLPack→TensorRT dtype mapping look correcttoNvDataType() covers the expected dtypes (FP16/BF16/FP32) and fails fast otherwise, and moeA2ACombineOp enforces payload/workspace/metainfo invariants, sets up MoeA2ACombineParams with the correct pointers (including the payload_in_workspace fast path), and performs a robust bounds check on the combine payload region before launching prepare+combine.
390-404: Metainfo index exposure helper cleanly bridges C++ and PythonForwarding fi_throughput::getMoeA2AMetaInfoIndexPairs() into TVM Arrays and returning (names, values) matches the Python-side expectations and keeps the metainfo constants centrally defined.
flashinfer/comm/trtllm_moe_alltoall.py (1)
675-708: Combine payload workspace view matches C++ expectationsget_combine_payload_tensor_in_workspace() validates state, computes element_size, and uses moe_a2a_wrap_payload_tensor_in_workspace with
[ep_size, runtime_max_tokens_per_rank]leading_shape and combine_payload_offset. This matches the C++ convention where the combine workspace region is laid out as[ep_size, max_tokens_per_rank, hidden_size]for the current rank, and ensures the payload_in_workspace fast path in moeA2ACombineOp sees the expected pointer.csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (1)
23-179: Header types and API surface are consistent with the CUDA implementationThe configuration constants, descriptor/params structs, and function declarations in this header match the usage in moeAlltoAllKernels.cu (including kMaxTopK, kMaxPayloads, and the various pointer layouts), so the public MoE A2A kernel interface looks coherent and ready for consumption by the higher‑level C++/Python bindings.
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (8)
flashinfer/comm/__init__.py (1)
42-55: Missing export formoe_a2a_get_metainfo_index_pairs.The
moe_a2a_get_metainfo_index_pairsfunction is used internally byMoeAlltoAll._init_constants()and exposed via the JIT module, but is not exported from this module. If this is intentional (internal-only), the function could be clarified as private. If it should be public for debugging/testing purposes, add the export.csrc/trtllm_moe_alltoall.cu (1)
356-388: Addep_rankand workspace dimension validation in sanitize op.Unlike
moeA2ADispatchOpandmoeA2ACombineOp, this function does not validate thatepRankis within bounds or thatworkspace.size(0)matchesepSize. IfepRankis out of range,rankWorkspacePtrcomputation at line 376 will result in out-of-bounds access.CHECK_INPUT_TYPE(workspace, dl_uint8); TVM_FFI_ICHECK_EQ(workspace.ndim(), 2); + TVM_FFI_ICHECK_EQ(workspace.size(0), epSize); + TVM_FFI_ICHECK(epRank >= 0 && epRank < epSize) << "epRank out of range"; auto* workspaceBase = static_cast<uint8_t*>(workspace.data_ptr());flashinfer/comm/trtllm_moe_alltoall.py (2)
140-142: Declaremoe_a2a_sanitize_expert_idsas mutating its input.This function modifies
expert_idsin-place, but is registered withmutates_args=[]. This can mislead alias analysis and downstream tooling about side effects.@register_custom_op( "flashinfer::moe_a2a_sanitize_expert_ids", - mutates_args=[], + mutates_args=["expert_ids"], )
434-439: Docstring parameter name mismatch.The docstring refers to
extra_payload_size_per_tokenbut the actual parameter isextra_payload_bytes_per_token.Args: ep_size: Total expert parallel size top_k: Number of experts per token max_num_tokens: Maximum number of tokens across all ranks hidden_size: Hidden dimension size - extra_payload_size_per_token: Extra size per token in the payload + extra_payload_bytes_per_token: Extra size per token in the payload (in bytes) Returns:csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (2)
122-134: Consider adding defensive bounds check tocompute_target_rank_id.If an invalid
expert_idreaches this function (e.g., due to upstream gating bug), the computed rank could exceedep_size, causing out-of-bounds memory access. Consider adding an assertion or documenting the invariant.
748-751: Block size inconsistency between prepare and combine launch functions.
moe_a2a_prepare_combine_launchuses hardcodedkBlockSize = 256, whilemoe_a2a_combine_launch(line 800) usesgetEnvMoeA2ACombineBlockSize(). If the environment variable is set to a different value,kWarpsPerBlockcalculations will diverge between the two phases.void moe_a2a_prepare_combine_launch(MoeA2ACombineParams const& params) { - constexpr int kBlockSize = 256; - constexpr int kWarpsPerBlock = kBlockSize / 32; // 8 warps per block + int const kBlockSize = tensorrt_llm::common::getEnvMoeA2ACombineBlockSize(); + int const kWarpsPerBlock = kBlockSize / 32;tests/comm/test_mnnvl_moe_alltoall.py (2)
31-34: Critical: Failure detection logic is still broken.The condition
if any(comm.allgather(False))always evaluates toFalsebecause every rank contributesFalse, socheck_any_rank_failedcan never detect failures from other ranks.Apply this diff to fix the failure detection:
def check_any_rank_failed(): comm = MPI.COMM_WORLD - if any(comm.allgather(False)): + failed = False + if any(comm.allgather(failed)): raise MPIExit("Another rank failed")This allows
safe_run(line 45) to signal failure viacomm.allgather(True), which would then be detected by other ranks callingcheck_any_rank_failed.
586-590: Major: Allgather verification skips rank 0.The assertion only checks elements
[1:], skipping rank 0. If rank 0 has a differentall_num_tokensvalue, it won't be detected.Apply this diff to check all ranks:
# Check all ranks have the same all_num_tokens gathered_all_num_tokens = comm.allgather(all_num_tokens) - assert all(i == all_num_tokens for i in gathered_all_num_tokens[1:]), ( + assert all(i == gathered_all_num_tokens[0] for i in gathered_all_num_tokens), ( "all_num_tokens should be the same" )
🧹 Nitpick comments (6)
csrc/nv_internal/tensorrt_llm/common/envUtils.h (1)
95-102: Consider removing or updating the TODO comment before merge.The TODO at line 98 indicates this is "For DEV purpose temporarily." If these configuration options are intended to be permanent, consider updating or removing this comment. If they are truly temporary, consider tracking this as a follow-up issue.
flashinfer/comm/trtllm_moe_alltoall.py (2)
386-388: Annotate mutable class attribute withClassVar.Per static analysis (RUF012), mutable class attributes should be annotated with
typing.ClassVarto clarify they are class-level state.+from typing import ClassVar + class MoeAlltoAll: # ... - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {}
493-494: Use explicitOptionaltype annotation.PEP 484 prohibits implicit
Optional. UseOptional[int]orint | Nonefor clarity.def __init__( self, mapping: Mapping, max_num_tokens: int, top_k: int, num_experts: int, - workspace_size_per_rank: int = None, - hidden_size: int = None, + workspace_size_per_rank: Optional[int] = None, + hidden_size: Optional[int] = None, mnnvl_config: Optional[MnnvlConfig] = None, ):csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
272-276: Non-atomic flag_val increment is safe but worth documenting.The increment
*flag_val_ptr = *flag_val_ptr + 1at line 275 is not atomic, but it's protected by theidx == 0guard ensuring only one thread executes it. A brief comment clarifying this single-thread guarantee would improve readability.tests/comm/test_trtllm_moe_alltoall.py (2)
242-245: Redundant synchronization after stream synchronization.Lines 242-243 already synchronize all CUDA streams individually, making the
torch.cuda.synchronize()on line 245 redundant.Apply this diff to remove the redundant synchronization:
for rank in range(world_size): cuda_streams_all_ranks[rank].synchronize() - - torch.cuda.synchronize()
301-304: Redundant synchronization after stream synchronization.Lines 301-302 already synchronize all CUDA streams individually, making the
torch.cuda.synchronize()on line 304 redundant.Apply this diff to remove the redundant synchronization:
for rank in range(world_size): cuda_streams_all_ranks[rank].synchronize() - - torch.cuda.synchronize()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
csrc/nv_internal/cpp/common/envUtils.cpp(2 hunks)csrc/nv_internal/tensorrt_llm/common/envUtils.h(2 hunks)csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu(1 hunks)csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h(1 hunks)csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h(1 hunks)csrc/trtllm_moe_alltoall.cu(1 hunks)docs/api/comm.rst(1 hunks)flashinfer/aot.py(1 hunks)flashinfer/comm/__init__.py(1 hunks)flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)flashinfer/jit/__init__.py(1 hunks)flashinfer/jit/comm.py(1 hunks)pyproject.toml(1 hunks)scripts/task_test_multi_node_comm_kernels.sh(1 hunks)scripts/task_test_single_node_comm_kernels.sh(1 hunks)tests/comm/test_mnnvl_memory.py(1 hunks)tests/comm/test_mnnvl_moe_alltoall.py(1 hunks)tests/comm/test_trtllm_moe_alltoall.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- flashinfer/jit/init.py
- docs/api/comm.rst
- pyproject.toml
- flashinfer/aot.py
- tests/comm/test_mnnvl_memory.py
- csrc/nv_internal/cpp/common/envUtils.cpp
- scripts/task_test_multi_node_comm_kernels.sh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
csrc/trtllm_moe_alltoall.cucsrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
🧬 Code graph analysis (8)
csrc/trtllm_moe_alltoall.cu (2)
csrc/tvm_ffi_utils.h (2)
Tensor(287-289)get_current_stream(271-275)flashinfer/comm/cuda_ipc.py (1)
cudaGetErrorString(146-147)
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h (2)
include/flashinfer/trtllm/gemm/trtllmGen_gemm_export/KernelParams.h (1)
- `` (287-336)
csrc/trtllm_moe_alltoall.cu (2)
getMoeA2AMetaInfoIndexPairs(392-404)getMoeA2AMetaInfoIndexPairs(392-392)
flashinfer/comm/__init__.py (1)
flashinfer/comm/trtllm_moe_alltoall.py (11)
MoeAlltoAll(373-708)moe_a2a_combine(97-138)moe_a2a_combine(301-324)moe_a2a_dispatch(51-91)moe_a2a_dispatch(240-298)moe_a2a_initialize(39-45)moe_a2a_initialize(199-207)moe_a2a_get_workspace_size_per_rank(339-370)moe_a2a_sanitize_expert_ids(144-153)moe_a2a_sanitize_expert_ids(327-336)moe_a2a_wrap_payload_tensor_in_workspace(210-237)
csrc/nv_internal/tensorrt_llm/common/envUtils.h (1)
csrc/nv_internal/cpp/common/envUtils.cpp (8)
getEnvKVCacheTimeOutputPath(275-278)getEnvKVCacheTimeOutputPath(275-275)getEnvMoeA2AOneBlockPerToken(326-333)getEnvMoeA2AOneBlockPerToken(326-326)getEnvMoeA2ADispatchBlockSize(347-350)getEnvMoeA2ADispatchBlockSize(347-347)getEnvMoeA2ACombineBlockSize(352-355)getEnvMoeA2ACombineBlockSize(352-352)
flashinfer/jit/comm.py (1)
flashinfer/jit/core.py (2)
JitSpec(213-312)gen_jit_spec(315-381)
tests/comm/test_trtllm_moe_alltoall.py (2)
flashinfer/comm/mnnvl.py (2)
MnnvlMemory(232-551)supports_mnnvl(545-551)flashinfer/comm/trtllm_moe_alltoall.py (10)
moe_a2a_get_workspace_size_per_rank(339-370)MoeAlltoAll(373-708)dispatch(574-631)combine(633-673)moe_a2a_initialize(39-45)moe_a2a_initialize(199-207)moe_a2a_dispatch(51-91)moe_a2a_dispatch(240-298)moe_a2a_wrap_payload_tensor_in_workspace(210-237)get_moe_workspace_size_per_rank(423-462)
tests/comm/test_mnnvl_moe_alltoall.py (2)
flashinfer/comm/trtllm_moe_alltoall.py (3)
MoeAlltoAll(373-708)dispatch(574-631)combine(633-673)flashinfer/comm/mnnvl.py (3)
MnnvlMemory(232-551)initialize(276-285)supports_mnnvl(545-551)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
csrc/nv_internal/cpp/common/envUtils.cpp (4)
getEnvMoeA2ADispatchBlockSize(347-350)getEnvMoeA2ADispatchBlockSize(347-347)getEnvMoeA2ACombineBlockSize(352-355)getEnvMoeA2ACombineBlockSize(352-352)
🪛 Clang (14.0.6)
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h
[error] 19-19: 'array' file not found
(clang-diagnostic-error)
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
[error] 18-18: 'cuda_bf16.h' file not found
(clang-diagnostic-error)
🪛 Ruff (0.14.7)
tests/comm/test_mnnvl_moe_alltoall.py
34-34: Avoid specifying long messages outside the exception class
(TRY003)
576-576: Do not catch blind exception: Exception
(BLE001)
677-677: Do not catch blind exception: Exception
(BLE001)
711-711: Unpacked variable expert_id_payload_index is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
flashinfer/comm/trtllm_moe_alltoall.py
388-388: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
493-493: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
494-494: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
533-533: Avoid specifying long messages outside the exception class
(TRY003)
535-535: Avoid specifying long messages outside the exception class
(TRY003)
696-698: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (8)
scripts/task_test_single_node_comm_kernels.sh (1)
25-25: LGTM!The new test invocation follows the existing pattern and appropriately expands test coverage for the new MOE A2A throughput backend.
csrc/nv_internal/tensorrt_llm/thop/moeAlltoAllMeta.h (1)
1-61: LGTM!The meta-info index enum and helper function are well-structured. The enum values are properly sequenced (0-8) with
NUM_METAINFO_FIELDS = 9correctly counting all fields. The static analysis error about<array>not found is a false positive from the analysis environment lacking proper include paths.csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu (1)
1-874: LGTM on the overall kernel implementation.The dispatch/combine kernel design with WarpPolicy/BlockPolicy threading abstractions, vectorized memory operations, and flag-based synchronization is well-structured. The timeout mechanism provides safety against hangs during development/debugging.
tests/comm/test_trtllm_moe_alltoall.py (2)
28-32: Docstring accurately reflects implementation.The fixture docstring has been updated to correctly describe its purpose of setting up deterministic seeding, addressing the previous review feedback.
561-614: Comprehensive workspace sizing validation.The test effectively validates workspace size calculations across multiple scenarios including empty workspaces, payload-based sizing, and MoeAlltoAll integration.
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h (2)
23-27: Well-defined configuration constants.The limits are appropriate for MoE workloads and clearly documented with inline comments.
36-76: Good const-correctness in kernel pointer structures.The distinction between dispatch (const source, mutable receive) and combine (mutable output in src[0], const receive buffers) correctly reflects the data flow in each phase.
tests/comm/test_mnnvl_moe_alltoall.py (1)
122-182: Well-implemented fake MoE reference.The function correctly handles both global and EP-partitioned modes, with proper expert ID mapping and precision-aware accumulation using float32.
csrc/nv_internal/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.h
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
flashinfer/comm/trtllm_moe_alltoall.py (2)
388-388: AddClassVarannotation for mutable class attribute.The
_WORKSPACE_CACHEclass variable should be annotated withtyping.ClassVarto properly indicate it's shared across all instances.+from typing import ClassVar + class MoeAlltoAll: """ Manages MoE All-to-All operations with proper workspace allocation and synchronization. This class provides the throughput-optimized backend that supports multiple payloads per collective operation, explicit dispatch/combine phases, and workspace-backed tensors. Example: >>> moe_a2a = MoeAlltoAll(mapping, max_num_tokens=2048, top_k=2, num_experts=8) >>> recv = moe_a2a.dispatch(experts, [hidden, ids, scales], batch_size) >>> output = moe_a2a.combine(processed, batch_size) """ # Single shared workspace across the process # _WORKSPACE: Optional[dict] = None - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {}Based on static analysis hints (Ruff RUF012).
487-535: Use explicit Optional annotations for clarity.The
workspace_size_per_rankandhidden_sizeparameters use implicitOptional(defaulting toNone), which PEP 484 discourages.def __init__( self, mapping: Mapping, max_num_tokens: int, top_k: int, num_experts: int, - workspace_size_per_rank: int = None, - hidden_size: int = None, + workspace_size_per_rank: Optional[int] = None, + hidden_size: Optional[int] = None, mnnvl_config: Optional[MnnvlConfig] = None, ):Based on static analysis hints (Ruff RUF013).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)flashinfer/jit/comm.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- flashinfer/jit/comm.py
🧰 Additional context used
🧬 Code graph analysis (1)
flashinfer/comm/trtllm_moe_alltoall.py (2)
flashinfer/comm/mapping.py (1)
moe_ep_rank(349-350)flashinfer/jit/comm.py (1)
gen_moe_alltoall_module(83-109)
🪛 Ruff (0.14.7)
flashinfer/comm/trtllm_moe_alltoall.py
388-388: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
493-493: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
494-494: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
533-533: Avoid specifying long messages outside the exception class
(TRY003)
535-535: Avoid specifying long messages outside the exception class
(TRY003)
696-698: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (8)
flashinfer/comm/trtllm_moe_alltoall.py (8)
1-28: LGTM - Clean module structure and state management.The module docstring clearly describes the purpose, and the
_A2AStatedataclass provides straightforward phase tracking for dispatch/combine workflows.
30-196: LGTM - Well-structured JIT module integration.The cached module builder with custom op registration is implemented correctly. The
mutates_argsdeclarations properly reflect in-place mutations, and the docstrings accurately document the operations.
199-237: LGTM - Correct workspace tensor wrapping.The zero-copy tensor creation via dtype views is correctly implemented, and the defensive bounds check at line 231 guards against out-of-bounds access.
240-298: LGTM - Robust dispatch implementation.The use of
strict=Truein the zip at line 285 is excellent defensive programming, ensuring payload/offset/size lists match in length.
301-336: LGTM - Clean wrapper functions.These provide a clean Python API surface for the underlying C++ operations.
339-370: LGTM - Correct workspace size calculation.The 128-byte alignment padding (line 365-369) correctly mirrors the C++ implementation, and the calculation properly accounts for all workspace regions.
422-485: LGTM - Helpful convenience methods.The
get_moe_workspace_size_per_rankwrapper with sensible defaults simplifies the common case, and the note at line 432 about potential overallocation is appropriately documented. The lazy constant initialization pattern in_init_constantsis clean.
574-709: LGTM - Robust state management and API.The dispatch/combine state machine is well-implemented with proper phase transitions and thorough validation. The zero-copy
get_combine_payload_tensor_in_workspacemethod provides an efficient optimization path. The__all__export list appropriately exposes the public API.
|
[FAILED] Pipeline #39780933: 12/20 passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/comm/test_mnnvl_moe_alltoall.py (2)
31-34:check_any_rank_failednever detects failures.The condition
if any(comm.allgather(False))always evaluates toFalsebecause every rank contributesFalse. The MPI allgather calls incheck_any_rank_failedandsafe_runare separate collective operations that don't synchronize with each other.The current design expects:
- A failing rank calls
comm.allgather(True)insafe_run- Other ranks call
comm.allgather(False)incheck_any_rank_failedBut these are independent collectives—they don't pair up. This means failure propagation across ranks is broken.
Consider a different approach using a shared state or exception-safe barrier:
def check_any_rank_failed(): comm = MPI.COMM_WORLD - if any(comm.allgather(False)): - raise MPIExit("Another rank failed") + # This function should be called at synchronization points + # where all ranks are expected to reach. Use a barrier with + # error checking or a separate mechanism for failure detection. + pass # Placeholder - needs redesign for proper MPI error coordinationAlternatively, if you want a simpler pattern, you could use MPI error handlers or a dedicated abort mechanism.
583-587: Allgather verification skips rank 0.The assertion
all(i == all_num_tokens for i in gathered_all_num_tokens[1:])compares ranks 1+ against the localall_num_tokens, but doesn't verify rank 0's value is correct. If rank 0 has incorrect data, this check passes.# Check all ranks have the same all_num_tokens gathered_all_num_tokens = comm.allgather(all_num_tokens) - assert all(i == all_num_tokens for i in gathered_all_num_tokens[1:]), ( + assert all(i == gathered_all_num_tokens[0] for i in gathered_all_num_tokens), ( "all_num_tokens should be the same" )
🧹 Nitpick comments (5)
flashinfer/comm/trtllm_moe_alltoall.py (3)
400-400: Annotate class-level mutable attribute withClassVar.The mutable class attribute
_WORKSPACE_CACHEshould be annotated withtyping.ClassVarfor clarity and to satisfy type checkers.+from typing import ClassVar, Optional + class MoeAlltoAll: # ... - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {}
505-506: Use explicitOptionaltype hints.Per PEP 484, use explicit
Optional[T]orT | Noneinstead of implicitOptionalvia= None.def __init__( self, mapping: Mapping, max_num_tokens: int, top_k: int, num_experts: int, - workspace_size_per_rank: int = None, - hidden_size: int = None, + workspace_size_per_rank: Optional[int] = None, + hidden_size: Optional[int] = None, mnnvl_config: Optional[MnnvlConfig] = None, ):
712-712: Minor: Avoid allocating tensor just forelement_size().Creating an empty tensor to get element size is unnecessary. Use
torch.finfoortorch.iinfofor the dtype's element size, or access it from an existing constant.- element_size = torch.tensor([], dtype=dtype).element_size() + element_size = torch.empty(0, dtype=dtype).element_size()Alternatively, consider a helper function or lookup table to avoid tensor allocation entirely:
element_size = {torch.float16: 2, torch.bfloat16: 2, torch.float32: 4}.get(dtype, torch.empty(0, dtype=dtype).element_size())tests/comm/test_mnnvl_moe_alltoall.py (2)
570-575: Consider catching a more specific exception.Catching bare
Exceptionis broad. If MNNVL initialization has known failure modes, catch those specifically.try: MnnvlMemory.initialize() if not MnnvlMemory.supports_mnnvl(): pytest.skip("MNNVL not supported on this system") - except Exception: + except (RuntimeError, OSError) as e: + pytest.skip(f"MNNVL not supported on this system: {e}") - pytest.skip("MNNVL not supported on this system")
706-708: Unused variable from unpacking.
expert_id_payload_indexis unpacked but never used. Prefix with underscore to indicate intentional discard.- payloads, expert_id_payload_index = make_bfloat16_payloads( + payloads, _expert_id_payload_index = make_bfloat16_payloads( local_num_tokens, hidden_size, top_k, rank, token_selected_experts )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)tests/comm/test_mnnvl_moe_alltoall.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/comm/test_mnnvl_moe_alltoall.py (2)
flashinfer/comm/mapping.py (1)
Mapping(21-475)flashinfer/comm/mnnvl.py (3)
MnnvlMemory(232-551)initialize(276-285)supports_mnnvl(545-551)
🪛 Ruff (0.14.8)
tests/comm/test_mnnvl_moe_alltoall.py
34-34: Avoid specifying long messages outside the exception class
(TRY003)
574-574: Do not catch blind exception: Exception
(BLE001)
673-673: Do not catch blind exception: Exception
(BLE001)
706-706: Unpacked variable expert_id_payload_index is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
flashinfer/comm/trtllm_moe_alltoall.py
400-400: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
505-505: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
506-506: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
545-545: Avoid specifying long messages outside the exception class
(TRY003)
547-547: Avoid specifying long messages outside the exception class
(TRY003)
708-710: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (11)
flashinfer/comm/trtllm_moe_alltoall.py (5)
1-28: LGTM!Module docstring, imports, and
_A2AStatedataclass are well-structured. The state tracking design with explicit phase transitions is clean.
30-196: LGTM!The JIT module wrapper pattern using
@functools.cachefor lazy singleton loading is appropriate. Custom op registrations correctly specify mutation behavior, and docstrings are informative.
210-249: LGTM!The workspace slicing logic correctly handles byte-level offset calculations and supports both 1D and 2D workspace tensors. The assertions provide good validation for offset bounds.
252-383: LGTM!The dispatch wrapper correctly creates workspace-backed output tensors, and the workspace size calculation with 128-byte alignment matching the C++ implementation is well-documented.
723-731: LGTM!The
__all__exports cover the public API surface appropriately, keeping internal helpers likeget_moe_alltoall_moduleprivate.tests/comm/test_mnnvl_moe_alltoall.py (6)
55-183: LGTM!Helper functions are well-documented and implement correct logic for test data generation and reference MoE computation. The expert seeding strategy ensures reproducibility across ranks.
185-259: LGTM!Payload creation helpers correctly construct test data structures. The TODO comment on line 206 about FP8 is appropriately documented for future work.
261-381: LGTM!The worker function correctly sets up the MPI environment, performs dispatch, and extracts workspace metadata for verification. The use of
Split_type(MPI.COMM_TYPE_SHARED)for node-local rank is the standard MPI pattern.
384-554: LGTM!The verification function is comprehensive, checking dimensions, dtypes, counter symmetry (send ↔ recv), routing correctness, and payload content integrity. The invalid token sanitization check at the end ensures edge cases are handled.
796-808: LGTM!The 99% match rate threshold with explanation for bf16 accumulation order differences is a pragmatic approach. The tolerance is well-documented in the comment.
683-687: Correct allgather verification here.This verification correctly checks all elements without skipping rank 0, unlike the similar check in
moe_a2a_dispatch_test_impl. Good.
|
/bot run |
| ) | ||
|
|
||
|
|
||
| def moe_a2a_initialize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe we started marking these public apis with a decorator @flashinfer_api it adds logging and helps readability for users and code reviewers
good job having already added to the documentation too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done I decorated this and the class methods, that will mean that it gets logged twice if users use the helper class. Let me know if you would prefer something different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it's ok. the outer level logged as MoeAlltoAll and the inner level logged as moe_a2a_*. as a user looking at the logging, i would grep either the class or the underlying api functions that are also considered public api.
cudnn has also the precedence of allowing public api calling public api to be logged, even tracking the depth and tweak formatting. overall i think this is fine & no concerns for now
cc @bkryu for vis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be okay; there are similar examples within the library as well.
One example can be found in fp4_quantization.py where fp4_quantize() and mxfp4_quantize() are both decorated functions, but mxfp4_quantize()'s implementation calls fp4_quantizeso there is double-logging being done
| } | ||
|
|
||
| int getEnvMoeA2ADispatchBlockSize() { | ||
| static int const kBlock = sanitizeBlockSize(getIntEnv("TLLM_MOE_A2A_DISPATCH_BLOCK_SIZE")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we document env vars too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bobboli can you advise? Should we remove these or do you know the cases these should be enabled
aleozlx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
flashinfer/comm/trtllm_moe_alltoall.py (2)
213-253: Strengthen safety checks around workspace slicing
moe_a2a_wrap_payload_tensor_in_workspacecurrently assumesslice_start/slice_endare valid and only checks that the slice length fits within a single rank. If offsets ever drift (e.g., from a metainfo bug), this could either index the wrong region or raise hard-to-debug errors.You can add a few cheap assertions to validate the byte range and alignment before computing
slice_rank:if workspace.ndim == 1: workspace = workspace.unsqueeze(0) workspace_base = workspace.view(dtype=torch.uint8) assert workspace.ndim == 2, "workspace must be shape [ep_size, size_per_rank]" - assert slice_end - slice_start <= workspace_base.shape[1], ( - "slice_end - slice_start must belong to a single rank" - ) + assert 0 <= slice_start <= slice_end, "invalid slice range" + slice_length = slice_end - slice_start + assert slice_length <= workspace_base.shape[1], ( + "slice_end - slice_start must belong to a single rank" + ) + element_size = torch.tensor([], dtype=dtype).element_size() + assert slice_length % element_size == 0, ( + "slice length must be a multiple of dtype element size" + ) - slice_rank = slice_start // workspace_base.stride(0) - local_slice_start = slice_start % workspace_base.stride(0) - slice_length = slice_end - slice_start + slice_rank = slice_start // workspace_base.stride(0) + local_slice_start = slice_start % workspace_base.stride(0)This keeps the fast path but catches misaligned or out‑of‑bounds slices early.
408-408: Tighten type hints for workspace cache and optional argsRuff’s warnings here are reasonable and easy to address:
_WORKSPACE_CACHEis a mutable class attribute; annotating it asClassVarmakes intent clear and avoids dataclass-style pitfalls.
workspace_size_per_rank: int = None/hidden_size: int = Nonerely on implicitOptional, which PEP 484 discourages. UsingOptional[int]is clearer and keeps type-checkers happy.Suggested changes:
-from typing import Optional +from typing import Optional, ClassVar ... - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {} ... - workspace_size_per_rank: int = None, - hidden_size: int = None, + workspace_size_per_rank: Optional[int] = None, + hidden_size: Optional[int] = None,Also applies to: 514-516
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
flashinfer/comm/trtllm_moe_alltoall.py (3)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)flashinfer/comm/mapping.py (1)
moe_ep_rank(349-350)flashinfer/jit/comm.py (1)
gen_moe_alltoall_module(83-109)
🪛 Ruff (0.14.8)
flashinfer/comm/trtllm_moe_alltoall.py
408-408: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
514-514: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
515-515: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
554-554: Avoid specifying long messages outside the exception class
(TRY003)
556-556: Avoid specifying long messages outside the exception class
(TRY003)
720-722: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
flashinfer/comm/trtllm_moe_alltoall.py (2)
595-696: Dispatch/combine state handling looks solidThe
MoeAlltoAll.dispatch/combinepair has a clear state machine (idle→dispatched→ reset) with assertions on phase andruntime_max_tokens_per_rank. That should make misuse (double-dispatch, combine-before-dispatch, exceeding configured max tokens) fail loudly and predictably, which is great for debugging and integration.
15-15: Fix relative import forflashinfer_apiThe relative import path for
flashinfer_apineeds verification. Ifapi_logging.pyis located atflashinfer/api_logging.py(parent directory), then the import should use a parent-relative path. The current importfrom .api_logging import flashinfer_apisearches inflashinfer/comm/api_logging.py, which would cause an import failure if the module is not there.-from .api_logging import flashinfer_api +from ..api_logging import flashinfer_api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
flashinfer/comm/trtllm_moe_alltoall.py (2)
408-408: Add ClassVar annotation for type safety.Mutable class attributes should be annotated with
typing.ClassVarto make their scope explicit and avoid confusion with instance attributes.Apply this diff:
+from typing import ClassVar + class MoeAlltoAll: """ Manages MoE All-to-All operations with proper workspace allocation and synchronization. This class provides the throughput-optimized backend that supports multiple payloads per collective operation, explicit dispatch/combine phases, and workspace-backed tensors. Example: >>> moe_a2a = MoeAlltoAll(mapping, max_num_tokens=2048, top_k=2, num_experts=8) >>> recv = moe_a2a.dispatch(experts, [hidden, ids, scales], batch_size) >>> output = moe_a2a.combine(processed, batch_size) """ # Single shared workspace across the process # _WORKSPACE: Optional[dict] = None - _WORKSPACE_CACHE: dict[tuple[int, int, int, int], dict] = {} + _WORKSPACE_CACHE: ClassVar[dict[tuple[int, int, int, int], dict]] = {}
514-515: Use explicit Optional type annotations.PEP 484 prohibits implicit
Optional. Useint | NoneorOptional[int]for clarity.Apply this diff:
def __init__( self, mapping: Mapping, max_num_tokens: int, top_k: int, num_experts: int, - workspace_size_per_rank: int = None, - hidden_size: int = None, + workspace_size_per_rank: int | None = None, + hidden_size: int | None = None, mnnvl_config: Optional[MnnvlConfig] = None, ):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
flashinfer/comm/trtllm_moe_alltoall.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
flashinfer/comm/trtllm_moe_alltoall.py
408-408: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
514-514: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
515-515: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
554-554: Avoid specifying long messages outside the exception class
(TRY003)
556-556: Avoid specifying long messages outside the exception class
(TRY003)
720-722: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (9)
flashinfer/comm/trtllm_moe_alltoall.py (9)
1-30: LGTM!The module structure, imports, and internal
_A2AStatedataclass are well organized. The docstring clearly describes the module's purpose.
32-198: LGTM!The JIT module generation and custom op registration is well structured. The use of
@functools.cacheensures the module is built once and reused, and themutates_argsannotations correctly reflect the side effects of each operation.
214-253: LGTM!The workspace slicing logic correctly handles absolute byte offsets across ranks using stride calculations. The assertions provide robust validation of boundary conditions.
257-315: LGTM!The dispatch wrapper properly delegates to the JIT module and correctly wraps output payloads using workspace-backed tensors. The use of
strict=Truein the zip ensures list length consistency.
318-390: LGTM!The public API wrappers are clean and straightforward. The workspace size calculation correctly accounts for auxiliary data and applies 128-byte alignment padding to match the C++ backend implementation.
596-653: LGTM!The dispatch method properly validates state transitions, enforces runtime constraints, and correctly handles optional expert ID sanitization. The state machine ensures dispatch is not called twice without an intervening combine.
656-696: LGTM!The combine method properly validates the state machine, uses the stored dispatch metadata, and correctly resets state for the next round. The phase assertions ensure correct usage patterns.
699-732: LGTM!The method correctly provides workspace-backed tensors for zero-copy expert processing. The phase validation ensures it's called at the right time, and the offset calculations properly account for tensor dimensions and data types.
735-743: LGTM!The
__all__export list appropriately exposes the public API surface, providing both the high-levelMoeAlltoAllclass and the lower-level operation functions.
|
[FAILED] Pipeline #39926660: 15/20 passed |
📌 Description
This ports the latest MNNVL A2A communication implementation from TRT-LLM
🔍 Related Issues
#2094
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
unittest, etc.).Reviewer Notes
Summary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.