Skip to content

Make NVTE tensor handle pool size configurable#3090

Draft
lhb8125 wants to merge 4 commits into
NVIDIA:mainfrom
lhb8125:codex/nvte-tensor-pool-env-var
Draft

Make NVTE tensor handle pool size configurable#3090
lhb8125 wants to merge 4 commits into
NVIDIA:mainfrom
lhb8125:codex/nvte-tensor-pool-env-var

Conversation

@lhb8125

@lhb8125 lhb8125 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add runtime environment variables to configure the internal NVTETensor and NVTEGroupedTensor handle pool sizes.
  • Keep the default pool size at 20 MiB to preserve existing behavior.
  • Improve exhaustion errors to mention the configured pool size and the env var to increase.

Motivation

Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.

Testing

  • git diff --check

Signed-off-by: hongbinl <hongbinl@nvidia.com>
@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Jun 5, 2026
@greptile-apps

greptile-apps Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR exposes two new runtime environment variables (NVTE_TENSOR_HANDLE_POOL_SIZE_MB and NVTE_GROUPED_TENSOR_HANDLE_POOL_SIZE_MB) to make the internal handle pool sizes configurable, preserving the existing 20 MiB default and improving exhaustion error messages to include the configured size and the relevant env var name.

  • Adds GetTensorHandlePoolSizeMB and GetTensorHandlePoolCapacity helpers in an anonymous namespace that read, validate, and convert env var values to pool capacities; the computed constants replace the previously hardcoded 20 * 1024 * 1024 / sizeof(T) expressions as member initializers in both allocator singletons.
  • Updates docs/envvars.rst with documentation entries for the two new variables.

Confidence Score: 4/5

Safe to merge; the change is backward-compatible (default pool size unchanged) and the new code paths are straightforward.

The core logic is correct — validation guards against zero and overflow, the member-initializer ordering is right, and the singleton construction is safe. Two minor defects exist: partial string inputs like "50bad" are silently accepted as 50 due to no eof check in the shared getenv helper, and negative env var values wrap to SIZE_MAX and trigger the "too large" error rather than the intended "must be a positive integer" message. Both are input-validation edge cases with no correctness impact under documented usage.

transformer_engine/common/transformer_engine.cpp — specifically the env-var parsing and validation in GetTensorHandlePoolSizeMB

Important Files Changed

Filename Overview
transformer_engine/common/transformer_engine.cpp Adds anonymous-namespace helpers to read and validate pool-size env vars; replaces hardcoded constants in TensorAllocator and GroupedTensorAllocator with env-var-driven member initializers. Logic is correct but partial string parsing (no eof check) can silently accept values like "50bad" as 50, and negative inputs produce a confusing "too large" error due to size_t unsigned wrapping.
docs/envvars.rst Adds documentation for both new env vars with correct defaults and description; consistent with the surrounding RST style.

Sequence Diagram

sequenceDiagram
    participant Env as Environment Variable
    participant Helper as GetTensorHandlePoolSizeMB
    participant Cap as GetTensorHandlePoolCapacity
    participant Alloc as TensorAllocator (singleton)
    participant App as Application Code

    App->>Alloc: first call to instance()
    activate Alloc
    Alloc->>Env: std::getenv("NVTE_TENSOR_HANDLE_POOL_SIZE_MB")
    Env-->>Helper: raw string value (or null default 20)
    Helper->>Helper: parse via getenv size_t
    Helper->>Helper: "NVTE_CHECK pool_size_mb > 0"
    Helper->>Helper: NVTE_CHECK pool_size_mb within bounds
    Helper-->>Cap: pool_size_mb
    Cap->>Cap: compute pool_size_bytes
    Cap->>Cap: "NVTE_CHECK pool_size_bytes >= sizeof(Tensor)"
    Cap-->>Alloc: MAX_TENSOR_NUM
    Alloc->>Alloc: memory.reserve(MAX_TENSOR_NUM)
    Alloc-->>App: singleton ready
    deactivate Alloc

    App->>Alloc: Allocate(mode, out, N)
    Alloc->>Alloc: "check available >= N"
    alt pool exhausted
        Alloc-->>App: NVTE_CHECK error with pool_size_mb and env var hint
    else space available
        Alloc-->>App: NVTETensor handles
    end
Loading

Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile

Comment thread transformer_engine/common/transformer_engine.cpp Outdated
Comment thread transformer_engine/common/transformer_engine.cpp
@lhb8125 lhb8125 marked this pull request as draft June 5, 2026 08:17
lhb8125 and others added 2 commits June 8, 2026 08:22
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant