@Fan-Yunfan
Contributor

@Fan-Yunfan Fan-Yunfan commented Oct 4, 2025

Problem

  1. The allocate and deallocate function templates know the value of T at compile time, so the exception macro TLLM_THROW in the final else branch can never be invoked at runtime. It should be replaced with a compile-time check such as static_assert.

  2. The primary MemoryTypeString<T> template has no value member and is specialized only for the supported types. For an unsupported T, MemoryTypeString<T>::value therefore fails to compile with error: 'value' is not a member of 'MemoryTypeString<T>'.

  3. The lines auto const sizeDiff = static_cast<DiffType>(size); and auto const sizeDiff = -static_cast<DiffType>(size); can overflow because SizeType32 is an alias for std::size_t while DiffType is std::ptrdiff_t.
    On a 32-bit platform, for example, std::size_t spans [0 … 4 294 967 295] (2³²–1) but std::ptrdiff_t only covers [–2 147 483 648 … 2 147 483 647] (–2³¹ … 2³¹–1).
    Any size value larger than PTRDIFF_MAX is therefore not representable in DiffType, and the cast yields an incorrect (typically negative) signed result.

  4. The current MemoryCounters singleton does not explicitly forbid copy and assignment operations, which is unsafe.
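
To make point 3 concrete, here is a minimal sketch of a checked conversion. The helper name toDiffChecked is hypothetical and not part of TensorRT-LLM; the 32-bit integer types stand in for std::size_t and std::ptrdiff_t on a 32-bit platform so the boundary can be reproduced on any host. It assumes C++17.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not from the PR): converts an unsigned size to a signed
// difference only when the value is representable in the signed type.
template <typename Diff = std::ptrdiff_t, typename Size = std::size_t>
constexpr std::optional<Diff> toDiffChecked(Size size)
{
    if (size > static_cast<Size>(std::numeric_limits<Diff>::max()))
    {
        return std::nullopt; // does not fit; the caller decides how to report it
    }
    return static_cast<Diff>(size);
}

// 32-bit stand-ins for std::size_t / std::ptrdiff_t on a 32-bit platform.
static_assert(toDiffChecked<std::int32_t, std::uint32_t>(2147483647u).has_value());  // 2^31 - 1 fits
static_assert(!toDiffChecked<std::int32_t, std::uint32_t>(2147483648u).has_value()); // 2^31 does not
```

Returning std::optional keeps the sketch free of project macros; the PR itself proposes throwing via TLLM_THROW at the same point.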

Current Implementation

cpp/include/tensorrt_llm/runtime/memoryCounters.h

class MemoryCounters
{
public:
    using SizeType32 = std::size_t;
    using DiffType = std::ptrdiff_t;
    ......
};
template <MemoryType T>
void allocate(SizeType32 size)
{
    auto const sizeDiff = static_cast<DiffType>(size);
    if constexpr (T == MemoryType::kGPU)
    {
        mGpu += size;
        mGpuDiff = sizeDiff;
    }
    else if constexpr (T == MemoryType::kCPU)
    {
        mCpu += size;
        mCpuDiff = sizeDiff;
    }
    ......
    else
    {
        TLLM_THROW("Unknown memory type: %s", MemoryTypeString<T>::value);
    }
}
template <MemoryType T>
void deallocate(SizeType32 size)
{
    auto const sizeDiff = -static_cast<DiffType>(size);
    if constexpr (T == MemoryType::kGPU)
    {
        mGpu -= size;
        mGpuDiff = sizeDiff;
    }
    else if constexpr (T == MemoryType::kCPU)
    {
        mCpu -= size;
        mCpuDiff = sizeDiff;
    }
    ......
    else
    {
        TLLM_THROW("Unknown memory type: %s", MemoryTypeString<T>::value);
    }
}

cpp/include/tensorrt_llm/runtime/iBuffer.h

enum class MemoryType : std::int32_t
{
    kGPU = 0,
    kCPU = 1,
    kPINNED = 2,
    kUVM = 3,
    kPINNEDPOOL = 4
};

template <MemoryType T>
struct MemoryTypeString
{
};

template <>
struct MemoryTypeString<MemoryType::kGPU>
{
    static auto constexpr value = "GPU";
};

template <>
struct MemoryTypeString<MemoryType::kCPU>
{
    static auto constexpr value = "CPU";
};

template <>
struct MemoryTypeString<MemoryType::kPINNED>
{
    static auto constexpr value = "PINNED";
};

template <>
struct MemoryTypeString<MemoryType::kUVM>
{
    static auto constexpr value = "UVM";
};

template <>
struct MemoryTypeString<MemoryType::kPINNEDPOOL>
{
    static auto constexpr value = "PINNEDPOOL";
};
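
The missing value member in the primary template can be observed without triggering a hard error. The following is a sketch under assumptions: the HasValue trait and the kUnknown constant are hypothetical names introduced only for illustration, and the enum is a trimmed copy of the one above.

```cpp
#include <cstdint>
#include <type_traits>

enum class MemoryType : std::int32_t
{
    kGPU = 0,
    kCPU = 1
};

// Primary template: intentionally empty, so it has no `value` member.
template <MemoryType T>
struct MemoryTypeString
{
};

template <>
struct MemoryTypeString<MemoryType::kGPU>
{
    static auto constexpr value = "GPU";
};

template <>
struct MemoryTypeString<MemoryType::kCPU>
{
    static auto constexpr value = "CPU";
};

// Hypothetical detection trait: true only when S::value is a valid expression.
template <typename S, typename = void>
struct HasValue : std::false_type
{
};

template <typename S>
struct HasValue<S, std::void_t<decltype(S::value)>> : std::true_type
{
};

// Hypothetical enumerator value with no specialization.
inline constexpr auto kUnknown = static_cast<MemoryType>(42);

static_assert(HasValue<MemoryTypeString<MemoryType::kGPU>>::value);
static_assert(!HasValue<MemoryTypeString<kUnknown>>::value);
```

Naming MemoryTypeString<kUnknown>::value directly, as the else branch of allocate does, is what produces the "'value' is not a member" error.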

Solution

  1. Replace TLLM_THROW with static_assert and remove the use of MemoryTypeString<T>::value.
  2. Add a bounds check before invoking static_cast<DiffType>(size).
  3. Explicitly delete all copy and move operations for the MemoryCounters singleton.
template <MemoryType T>
void allocate(SizeType32 size)
{
    if (size > static_cast<SizeType32>(std::numeric_limits<DiffType>::max()))
    {
        TLLM_THROW("Memory size too large for diff type: %zu", size);
    }
    auto const sizeDiff = static_cast<DiffType>(size);
    if constexpr (T == MemoryType::kGPU)
    {
        mGpu += size;
        mGpuDiff = sizeDiff;
    }
    ......
    else
    {
        static_assert(!std::is_same_v<T, T>, "Unknown memory type!");
    }
}

template <MemoryType T>
void deallocate(SizeType32 size)
{
    if (size > static_cast<SizeType32>(std::numeric_limits<DiffType>::max()))
    {
        TLLM_THROW("Memory size too large for diff type: %zu", size);
    }
    auto const sizeDiff = -static_cast<DiffType>(size);
    if constexpr (T == MemoryType::kGPU)
    {
        mGpu -= size;
        mGpuDiff = sizeDiff;
    }
    ......
    else
    {
        static_assert(!std::is_same_v<T, T>, "Unknown memory type!");
    }
}

......

MemoryCounters(MemoryCounters const&) = delete;
MemoryCounters& operator=(MemoryCounters const&) = delete;
MemoryCounters(MemoryCounters&&) = delete;
MemoryCounters& operator=(MemoryCounters&&) = delete;
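
As a standalone illustration of the deleted special members, the sketch below uses a hypothetical Counters class rather than the real MemoryCounters, whose full definition is not shown here. The type traits confirm at compile time that no copy or move is possible.

```cpp
#include <type_traits>

// Hypothetical stand-in for a singleton such as MemoryCounters.
class Counters
{
public:
    static Counters& getInstance()
    {
        static Counters instance; // function-local static, initialized once
        return instance;
    }

    // Copy and move operations are explicitly deleted.
    Counters(Counters const&) = delete;
    Counters& operator=(Counters const&) = delete;
    Counters(Counters&&) = delete;
    Counters& operator=(Counters&&) = delete;

private:
    Counters() = default; // only getInstance() can construct the object
};

static_assert(!std::is_copy_constructible_v<Counters>);
static_assert(!std::is_move_assignable_v<Counters>);
```

Note that declaring the copy operations, even as deleted, already suppresses the implicitly declared move operations, so the two move declarations mainly document intent.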

@coderabbitai
Contributor

coderabbitai bot commented Oct 4, 2025

Walkthrough

Added overflow checks in MemoryCounters::allocate and ::deallocate that throw when size exceeds DiffType limits. Replaced a runtime error for unknown MemoryType with a compile-time static_assert, removing the runtime fallback path. No public API signatures changed. Changes are confined to cpp/include/tensorrt_llm/runtime/memoryCounters.h.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Memory counters overflow and type handling: cpp/include/tensorrt_llm/runtime/memoryCounters.h | Added runtime overflow checks in allocate/deallocate; throw on size > max DiffType. Replaced unknown MemoryType runtime branch with compile-time static_assert. Removed the corresponding runtime fallback logic. No exported signature changes. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant MemoryCounters

  Caller->>MemoryCounters: allocate(size)
  alt size exceeds DiffType max
    MemoryCounters-->>Caller: throw overflow_error
  else valid size
    MemoryCounters-->>Caller: update counters
  end

  Caller->>MemoryCounters: deallocate(size)
  alt size exceeds DiffType max
    MemoryCounters-->>Caller: throw overflow_error
  else valid size
    MemoryCounters-->>Caller: update counters
  end

  note over MemoryCounters: Unknown MemoryType<T> now fails at compile time via static_assert

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Description Check | ⚠️ Warning | The PR description does not adhere to the required repository template, as it omits the initial @coderabbitai summary marker and the standardized title format. It also lacks the mandatory ## Description, ## Test Coverage, and ## PR Checklist sections. Instead, it uses custom "Problem", "Current Implementation", and "Solution" headings that do not align with the prescribed structure. | Please restructure the PR description to match the repository’s template by adding the @coderabbitai summary marker and a properly formatted title at the top. Include a ## Description section to explain the issue and solution, a ## Test Coverage section listing relevant tests, and a ## PR Checklist section completed according to the guidelines. Ensuring these sections are present will align the PR with repository standards and facilitate review. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title clearly conveys the primary change by specifying enhancements to the MemoryCounters singleton with compile-time safety and bounds checking, which matches the main adjustments in the PR, and it is concise and specific enough for a quick history scan. |

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Oct 4, 2025
@Fan-Yunfan Fan-Yunfan changed the title [None][fix] Fix and enhance memory counters with compile-time safety and bounds checking [None][fix] Fix and enhance MemoryCounters Singleton with compile-time safety and bounds checking Oct 4, 2025
@Fan-Yunfan
Contributor Author

Fan-Yunfan commented Oct 14, 2025

Dear @karljang, could you help me review this PR when you have time?

@karljang
Collaborator

@Fan-Yunfan ,
Thank you for your contribution!
Thanks to you, I got a chance to refresh my C++ memory a bit 😊
Please take a look at my review comments when you get a chance.

@Fan-Yunfan Fan-Yunfan force-pushed the fyf_enhance_memory_counters branch from b3b1f2f to 4511c59 Compare October 15, 2025 03:24
@Fan-Yunfan
Contributor Author

Fan-Yunfan commented Oct 15, 2025

> @Fan-Yunfan, thank you for your contribution! Thanks to you, I got a chance to refresh my C++ memory a bit 😊 Please take a look at my review comments when you get a chance.

Thanks for your correction! Using static_assert(!std::is_same_v<T, T>, "") does indeed trigger a compile-time error. An always_false helper (based on std::false_type) is clearly the better solution.

I have updated the commits.

Thanks for helping me learn about the concept and usage of std::false_type in C++.
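
For readers following the thread, the dependent-false pattern mentioned above can be sketched as follows. This is an illustration under assumptions, not the code from the updated commits: AlwaysFalse and counterIndex are hypothetical names, and the enum is a trimmed copy of MemoryType. Because T is a non-type template parameter (an enumerator, not a type), a type trait such as std::is_same cannot take it as an argument, whereas a helper templated on the value can.

```cpp
#include <cstdint>
#include <type_traits>

enum class MemoryType : std::int32_t
{
    kGPU = 0,
    kCPU = 1,
    kPINNED = 2
};

// Hypothetical helper: always false, but dependent on T, so the assertion
// is only evaluated when the enclosing branch is actually instantiated.
template <MemoryType T>
struct AlwaysFalse : std::false_type
{
};

// Hypothetical function: maps a supported memory type to a counter slot.
template <MemoryType T>
constexpr int counterIndex()
{
    if constexpr (T == MemoryType::kGPU)
    {
        return 0;
    }
    else if constexpr (T == MemoryType::kCPU)
    {
        return 1;
    }
    else
    {
        // Reached only for unsupported T; fails the build with this message.
        static_assert(AlwaysFalse<T>::value, "Unknown memory type!");
        return -1;
    }
}

static_assert(counterIndex<MemoryType::kGPU>() == 0);
```

Instantiating counterIndex<MemoryType::kPINNED>() in this sketch stops compilation with "Unknown memory type!", while the supported types compile because the else branch is discarded for them.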

@karljang
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21433 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21433 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16188 completed with status: 'FAILURE'

@Fan-Yunfan
Contributor Author

Dear @karljang, it seems that the pipeline failed. I have encountered this issue many times when triggering blossom-ci. Do you know the underlying cause of this error, such as insufficient machine memory or not having merged the latest changes from the main branch?

@karljang
Collaborator

/bot run

@karljang
Collaborator

Just running it again; the errors do not look related to this change.

@tensorrt-cicd
Collaborator

PR_Github #21718 [ run ] triggered by Bot. Commit: 2d60e56

@tensorrt-cicd
Collaborator

PR_Github #21718 [ run ] completed with state SUCCESS. Commit: 2d60e56
/LLM/main/L0_MergeRequest_PR pipeline #16365 completed with status: 'FAILURE'

@Fan-Yunfan
Contributor Author

> Just running it again; the errors do not look related to this change.

Got it!

@Fan-Yunfan
Contributor Author

/bot run

@Fan-Yunfan Fan-Yunfan requested a review from karljang October 31, 2025 07:21
@karljang
Collaborator

karljang commented Nov 3, 2025

Oops, this slipped my mind, I'm rerunning the tests now

@karljang
Collaborator

karljang commented Nov 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23425 [ run ] triggered by Bot. Commit: 7924c7f

@tensorrt-cicd
Collaborator

PR_Github #23425 [ run ] completed with state SUCCESS. Commit: 7924c7f
/LLM/main/L0_MergeRequest_PR pipeline #17640 completed with status: 'SUCCESS'

Collaborator

@karljang karljang left a comment

LGTM;

@Fan-Yunfan
Contributor Author

> Oops, this slipped my mind, I'm rerunning the tests now

Haha, no worries~ I just dropped by when it crossed my mind. It doesn’t matter whether it’s early or late—just feel free to take a look whenever you have a moment. If you’re busy, just focus on your work first. I don’t have any specific requests~

[image: dancing bear]

Collaborator

@MartinMarciniszyn MartinMarciniszyn left a comment

@Fan-Yunfan , thank you for your suggestions. I agree with the static_assert, but I am not convinced about the other changes. Please revert these.

Since you are editing this file, I suggest renaming SizeType32 to SizeType since the 32 is wrong and misleading. Many thanks for your help.

@Fan-Yunfan
Contributor Author

> @Fan-Yunfan , thank you for your suggestions. I agree with the static_assert, but I am not convinced about the other changes. Please revert these.
>
> Since you are editing this file, I suggest renaming SizeType32 to SizeType since the 32 is wrong and misleading. Many thanks for your help.

Thank you for your review—these were very helpful suggestions! I have already made the corresponding revisions based on your advice.

Additionally, if the systems in focus are all 64-bit systems, I was wondering whether the 32-bit system check in another PR related to ITensor at #8855 might also be unnecessary? (I believe so, but it might require your confirmation~)

@Fan-Yunfan Fan-Yunfan force-pushed the fyf_enhance_memory_counters branch from 1509d03 to 8c43e55 Compare November 6, 2025 01:43