
Conversation


@clintg6 clintg6 commented Nov 8, 2025

Description

This PR fixes a distributed profiling issue where multiple ranks could overwrite the same Chrome trace file when TLLM_TORCH_PROFILE_TRACE was set to a single filename (e.g. trace.json).

Previously, all ranks wrote to the same file path, resulting in trace collisions and data loss during multi-rank runs.
This update ensures each rank writes its own trace file by appending the process rank to the filename (e.g. trace_0.json, trace_1.json, etc.), or by placing ranked traces within the specified directory.


Changes

  • Added a helper method _get_ranked_trace_path() to centralize per-rank trace path generation (see the sketch after this list).
    • Supports both file and directory paths:
      • TLLM_TORCH_PROFILE_TRACE=trace.json → trace_0.json
      • TLLM_TORCH_PROFILE_TRACE=/tmp/traces → /tmp/traces/trace_0.json
  • Promoted torch_trace_path to an instance attribute (self.torch_trace_path) since _profiler() is a generator whose inner closure (profile_step) cannot cleanly access local variables from the enclosing scope after yielding.
    Making it an attribute ensures the trace path remains available for both normal and cleanup (finally) code paths.
  • Preserved all existing profiler behavior, start/stop semantics, and environment variable names for backward compatibility.
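
For illustration, a minimal standalone sketch of the path logic described above (the function name is hypothetical; the actual helper is a method on PyExecutor in py_executor.py):

```python
import os

def ranked_trace_path(trace_path: str, rank: int) -> str:
    """Illustrative sketch: derive a per-rank Chrome trace path from TLLM_TORCH_PROFILE_TRACE."""
    if os.path.isdir(trace_path):
        # Directory input: place one trace per rank inside it.
        return os.path.join(trace_path, f"trace_{rank}.json")
    # File input: append the rank before the extension, defaulting to .json.
    base, ext = os.path.splitext(trace_path)
    return f"{base}_{rank}{ext or '.json'}"
```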

Example Behavior

Env var                                 Rank   Resulting file
TLLM_TORCH_PROFILE_TRACE=trace.json     0      trace_0.json
TLLM_TORCH_PROFILE_TRACE=trace.json     1      trace_1.json
TLLM_TORCH_PROFILE_TRACE=/tmp/traces    2      /tmp/traces/trace_2.json
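
For illustration only, the same mappings expressed with the sketch above (assumes /tmp/traces already exists as a directory, so the directory branch is taken):

```python
print(ranked_trace_path("trace.json", 0))    # trace_0.json
print(ranked_trace_path("trace.json", 1))    # trace_1.json
print(ranked_trace_path("/tmp/traces", 2))   # /tmp/traces/trace_2.json
```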

Testing

  • Verified that multi-rank profiling via trtllm-bench produces a unique trace file for each rank with tp=8, and a single file with tp=1.
  • Confirmed identical output for single-rank runs.
  • Verified cleanup behavior and logging remain consistent.
  • Confirmed all traces export successfully even on early exit or exception.

Backward Compatibility

  • No breaking API changes.
  • Profiling workflow and environment variables remain unchanged.
  • Only affects cases where multiple ranks are active and profiling is enabled.

Summary

This PR:

  • Fixes rank-collision issues in torch profiler trace export.
  • Adds a helper method to generate per-rank trace paths cleanly.
  • Improves robustness and code maintainability.
  • Preserves all existing profiler behavior.

Summary by CodeRabbit

  • Improvements
    • Enhanced profiling to generate per-rank trace files, enabling rank-specific performance analysis during distributed execution.

@clintg6 clintg6 requested a review from a team as a code owner November 8, 2025 01:37
@clintg6 clintg6 requested a review from Naveassaf November 8, 2025 01:37
Contributor

coderabbitai bot commented Nov 8, 2025

📝 Walkthrough

The PyExecutor class in the torch module is refactored to improve rank-specific trace file handling. A new private helper method computes per-rank trace file paths, the trace path is converted from a local variable to an instance attribute, and profiling logic is updated to use this new approach when exporting Chrome traces.

Changes

Cohort / File(s): PyExecutor trace path handling
  tensorrt_llm/_torch/pyexecutor/py_executor.py
Summary: Added _get_ranked_trace_path() helper method to compute rank-specific trace file paths. Converted local torch_trace_path variable to instance attribute self.torch_trace_path. Updated profiling initialization and Chrome trace export logic to use the instance attribute and derive ranked paths dynamically.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • The changes are localized to a single file and class
  • The refactoring is straightforward: moving a variable from local scope to instance scope and extracting path computation into a helper method
  • Logic updates are consistent and follow a clear pattern with no complex conditionals or state management

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 25.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately describes the main change: fixing trace file overwriting across ranks in multi-GPU torch profiler scenarios.
  • Description check (✅ Passed): The description follows the template structure with clear sections for Description, Changes, Testing, and Backward Compatibility, providing comprehensive context for the changes.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

507-514: Add defensive None check for trace_path.

While self.torch_trace_path is only set when the environment variable is present and the method is only called when enable_torch_trace is True, adding an explicit None check would improve defensive programming and prevent potential TypeError if the code is refactored later.

Apply this diff to add a defensive check:

 def _get_ranked_trace_path(self):
     """Return a per-rank torch trace path based on TLLM_TORCH_PROFILE_TRACE."""
     rank = getattr(self.dist, "rank", 0)
     trace_path = self.torch_trace_path
+    if trace_path is None:
+        raise ValueError("torch_trace_path must be set before calling _get_ranked_trace_path")
     if os.path.isdir(trace_path):
         return os.path.join(trace_path, f"trace_{rank}.json")
     base, ext = os.path.splitext(trace_path)
     return f"{base}_{rank}{ext or '.json'}"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff82ea and 550c480.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/mapping.py (2)
  • rank (183-184)
  • rank (187-194)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • rank (39-40)
  • rank (435-436)

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Nov 8, 2025