
Conversation


@clintg6 clintg6 commented Nov 8, 2025

Description

This PR fixes a distributed profiling issue where multiple ranks could overwrite the same Chrome trace file when TLLM_TORCH_PROFILE_TRACE was set to a single filename (e.g. trace.json).

Previously, all ranks wrote to the same file path, resulting in trace collisions and data loss during multi-rank runs.
This update ensures each rank writes its own trace file by appending the process rank to the filename (e.g. trace_0.json, trace_1.json, etc.), or by placing ranked traces within the specified directory.


Changes

  • Added a helper method _get_ranked_trace_path() to centralize per-rank trace path generation (see the sketch after this list).
    • Supports both file and directory paths:
      • TLLM_TORCH_PROFILE_TRACE=trace.json → trace_0.json
      • TLLM_TORCH_PROFILE_TRACE=/tmp/traces → /tmp/traces/trace_0.json
  • Promoted torch_trace_path to an instance attribute (self.torch_trace_path) since _profiler() is a generator whose inner closure (profile_step) cannot cleanly access local variables from the enclosing scope after yielding.
    Making it an attribute ensures the trace path remains available for both normal and cleanup (finally) code paths.
  • Preserved all existing profiler behavior, start/stop semantics, and environment variable names for backward compatibility.
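
For illustration, a minimal standalone sketch of the path logic described above (the function name is hypothetical; the actual helper is a method on PyExecutor in py_executor.py):

```python
import os

def ranked_trace_path(trace_path: str, rank: int) -> str:
    """Illustrative sketch: derive a per-rank Chrome trace path from TLLM_TORCH_PROFILE_TRACE."""
    if os.path.isdir(trace_path):
        # Directory input: place one trace per rank inside it.
        return os.path.join(trace_path, f"trace_{rank}.json")
    # File input: append the rank before the extension, defaulting to .json.
    base, ext = os.path.splitext(trace_path)
    return f"{base}_{rank}{ext or '.json'}"
```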

Example Behavior

Env var                                 Rank   Resulting file
TLLM_TORCH_PROFILE_TRACE=trace.json     0      trace_0.json
TLLM_TORCH_PROFILE_TRACE=trace.json     1      trace_1.json
TLLM_TORCH_PROFILE_TRACE=/tmp/traces    2      /tmp/traces/trace_2.json
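
For illustration only, the same mappings expressed with the sketch above (assumes /tmp/traces already exists as a directory, so the directory branch is taken):

```python
print(ranked_trace_path("trace.json", 0))    # trace_0.json
print(ranked_trace_path("trace.json", 1))    # trace_1.json
print(ranked_trace_path("/tmp/traces", 2))   # /tmp/traces/trace_2.json
```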

Testing

  • Verified that multi-rank profiling via trtllm-bench produces a unique trace file for each rank with tp=8, and a single file with tp=1.
  • Confirmed identical output for single-rank runs.
  • Verified cleanup behavior and logging remain consistent.
  • Confirmed all traces export successfully even on early exit or exception.

Backward Compatibility

  • No breaking API changes.
  • Profiling workflow and environment variables remain unchanged.
  • Only affects cases where multiple ranks are active and profiling is enabled.

Summary

This PR:

  • Fixes rank-collision issues in torch profiler trace export.
  • Adds a helper method to generate per-rank trace paths cleanly.
  • Improves robustness and code maintainability.
  • Preserves all existing profiler behavior.

Summary by CodeRabbit

  • Improvements
    • Enhanced profiling to generate per-rank trace files, enabling rank-specific performance analysis during distributed execution.

@clintg6 clintg6 requested a review from a team as a code owner November 8, 2025 01:37
@clintg6 clintg6 requested a review from Naveassaf November 8, 2025 01:37
Contributor

coderabbitai bot commented Nov 8, 2025

📝 Walkthrough

The PyExecutor class in the torch module is refactored to improve rank-specific trace file handling. A new private helper method computes per-rank trace file paths, the trace path is converted from a local variable to an instance attribute, and profiling logic is updated to use this new approach when exporting Chrome traces.

Changes

Cohort / File(s): PyExecutor trace path handling
  tensorrt_llm/_torch/pyexecutor/py_executor.py
Summary: Added _get_ranked_trace_path() helper method to compute rank-specific trace file paths. Converted local torch_trace_path variable to instance attribute self.torch_trace_path. Updated profiling initialization and Chrome trace export logic to use the instance attribute and derive ranked paths dynamically.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • The changes are localized to a single file and class
  • The refactoring is straightforward: moving a variable from local scope to instance scope and extracting path computation into a helper method
  • Logic updates are consistent and follow a clear pattern with no complex conditionals or state management

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 25.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately describes the main change: fixing trace file overwriting across ranks in multi-GPU torch profiler scenarios.
  • Description check (✅ Passed): The description follows the template structure with clear sections for Description, Changes, Testing, and Backward Compatibility, providing comprehensive context for the changes.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

507-514: Add defensive None check for trace_path.

While self.torch_trace_path is only set when the environment variable is present and the method is only called when enable_torch_trace is True, adding an explicit None check would improve defensive programming and prevent potential TypeError if the code is refactored later.

Apply this diff to add a defensive check:

 def _get_ranked_trace_path(self):
     """Return a per-rank torch trace path based on TLLM_TORCH_PROFILE_TRACE."""
     rank = getattr(self.dist, "rank", 0)
     trace_path = self.torch_trace_path
+    if trace_path is None:
+        raise ValueError("torch_trace_path must be set before calling _get_ranked_trace_path")
     if os.path.isdir(trace_path):
         return os.path.join(trace_path, f"trace_{rank}.json")
     base, ext = os.path.splitext(trace_path)
     return f"{base}_{rank}{ext or '.json'}"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff82ea and 550c480.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/mapping.py (2)
  • rank (183-184)
  • rank (187-194)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • rank (39-40)
  • rank (435-436)

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Nov 8, 2025