1 change: 1 addition & 0 deletions docs/references/environment_variables.md
@@ -66,6 +66,7 @@ SGLang supports various environment variables that can be used to configure its
| `SGLANG_MOE_PADDING` | Enable MoE padding (sets padding size to 128 if value is `1`, often set to `1` in Docker builds) | `0` |
| `SGLANG_FORCE_FP8_MARLIN` | Force using FP8 MARLIN kernels even if other FP8 kernels are available | `false` |
| `SGLANG_ENABLE_FLASHINFER_GEMM` | Use flashinfer kernels when running blockwise fp8 GEMM on Blackwell GPUs | `false` |
| `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` | Select the backend for `mm_fp4` GEMM on Blackwell GPUs (see the sketch below the table) | `` |
| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` | Use Cutlass kernels when running blockwise fp8 GEMM on Hopper or Blackwell GPUs | `false` |
| `SGLANG_CUTLASS_MOE` | Use Cutlass FP8 MoE kernel on Blackwell GPUs | `false` |
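
For context, flags like `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` are plain environment variables, so they are typically read once and used to choose a kernel path. A minimal sketch of how the new variable could be consumed, where the helper name `get_fp4_gemm_backend` is an assumption for illustration, not SGLang's actual API:

```python
import os

# Hypothetical helper -- not SGLang's actual API. An empty value (the
# documented default) is treated as "no explicit backend selected".
def get_fp4_gemm_backend() -> str | None:
    backend = os.environ.get("SGLANG_FLASHINFER_FP4_GEMM_BACKEND", "")
    return backend or None
```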

8 changes: 8 additions & 0 deletions python/sglang/srt/distributed/communication_op.py
@@ -7,9 +7,17 @@

from .parallel_state import get_tp_group

use_flashinfer_mnnvl_allreduce = True  # Will be overwritten at startup in server_args.py
# TODO(shuw): make it configurable


def tensor_model_parallel_all_reduce(input_: torch.Tensor) -> torch.Tensor:
    """All-reduce the input tensor across the model parallel group."""
    # Imported inside the function to avoid an import cycle at module load time.
    from sglang.srt.layers.flashinfer_allreduce import (
        run_flashinfer_mnnvl_allreduce,
    )

    if use_flashinfer_mnnvl_allreduce:
        return run_flashinfer_mnnvl_allreduce(input_)

    return get_tp_group().all_reduce(input_)
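
Because `use_flashinfer_mnnvl_allreduce` is module-level state, the startup path can rebind it in place, which is what the "overwritten at startup in server_args.py" comment points at. A hedged sketch of that wiring, where the argument name `enable_flashinfer_mnnvl_allreduce` is hypothetical:

```python
# Hypothetical wiring sketch -- the server-arg name below is an assumption.
import sglang.srt.distributed.communication_op as communication_op


def configure_allreduce(server_args) -> None:
    # Rebind the module-level flag so tensor_model_parallel_all_reduce
    # picks the FlashInfer MNNVL path only when explicitly enabled.
    communication_op.use_flashinfer_mnnvl_allreduce = getattr(
        server_args, "enable_flashinfer_mnnvl_allreduce", False
    )
```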

