Fix: Prevent trace file overwrite across ranks in torch profiler when using multiple GPUs #9022
base: main
Conversation
Signed-off-by: Clint <[email protected]>
Signed-off-by: Clint <[email protected]>
Signed-off-by: Clint <[email protected]>
📝 Walkthrough

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
507-514: Add defensive None check for `trace_path`.

While `self.torch_trace_path` is only set when the environment variable is present and the method is only called when `enable_torch_trace` is True, adding an explicit None check would improve defensive programming and prevent a potential `TypeError` if the code is refactored later. Apply this diff to add a defensive check:
```diff
 def _get_ranked_trace_path(self):
     """Return a per-rank torch trace path based on TLLM_TORCH_PROFILE_TRACE."""
     rank = getattr(self.dist, "rank", 0)
     trace_path = self.torch_trace_path
+    if trace_path is None:
+        raise ValueError("torch_trace_path must be set before calling _get_ranked_trace_path")
     if os.path.isdir(trace_path):
         return os.path.join(trace_path, f"trace_{rank}.json")
     base, ext = os.path.splitext(trace_path)
     return f"{base}_{rank}{ext or '.json'}"
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/mapping.py (2)
  rank (183-184)
  rank (187-194)
tensorrt_llm/_torch/distributed/communicator.py (2)
  rank (39-40)
  rank (435-436)
Description
This PR fixes a distributed profiling issue where multiple ranks could overwrite the same Chrome trace file when `TLLM_TORCH_PROFILE_TRACE` was set to a single filename (e.g. `trace.json`). Previously, all ranks wrote to the same file path, resulting in trace collisions and data loss during multi-rank runs.
This update ensures each rank writes its own trace file by appending the process rank to the filename (e.g. `trace_0.json`, `trace_1.json`, etc.), or by placing ranked traces within the specified directory.

Changes

- Added `_get_ranked_trace_path()` to centralize per-rank trace path generation:
  - `TLLM_TORCH_PROFILE_TRACE=trace.json` → `trace_0.json`
  - `TLLM_TORCH_PROFILE_TRACE=/tmp/traces` → `/tmp/traces/trace_0.json`
- Moved `torch_trace_path` to a class attribute (`self.torch_trace_path`), since `_profiler()` is a generator whose inner closure (`profile_step`) cannot cleanly access local variables from the enclosing scope after yielding. Making it an attribute ensures the trace path remains available for both the normal and the cleanup (`finally`) code paths; see the sketch below.
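To make the second bullet concrete, here is a minimal sketch of the pattern described there: the trace path and rank live on the instance, so both the stepping loop and the generator's `finally` cleanup can reach them. The class, method, and file names are simplified stand-ins for illustration, not the actual `py_executor.py` code, and the rank suffixing here is a shortened form of what `_get_ranked_trace_path()` does.

```python
from torch.profiler import ProfilerActivity, profile


class ProfilerScopeSketch:
    """Simplified stand-in for the executor's profiling plumbing (illustrative only)."""

    def __init__(self, trace_path: str, rank: int):
        # Stored as attributes so every code path of the generator below,
        # including the `finally` cleanup, reads the same values.
        self.torch_trace_path = trace_path
        self.rank = rank

    def _profiler(self):
        """Generator used as a per-step profiling scope."""
        prof = profile(activities=[ProfilerActivity.CPU])
        prof.start()
        try:
            while True:
                yield  # one iteration per executor step
        finally:
            prof.stop()
            # The attribute is still reachable here, so the per-rank trace is
            # written on both normal shutdown and early termination. The PR
            # derives the name via _get_ranked_trace_path(); this is a
            # shortened equivalent for the plain-filename case.
            base = self.torch_trace_path.rsplit(".json", 1)[0]
            prof.export_chrome_trace(f"{base}_{self.rank}.json")


if __name__ == "__main__":
    # Hypothetical usage: advance once per step, then close on shutdown so the
    # finally block runs and trace_0.json is written.
    scope = ProfilerScopeSketch("trace.json", rank=0)._profiler()
    next(scope)
    scope.close()
```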
Example Behavior

| Rank | `TLLM_TORCH_PROFILE_TRACE` | Resulting trace file |
| --- | --- | --- |
| 0 | `trace.json` | `trace_0.json` |
| 1 | `trace.json` | `trace_1.json` |
| 2 | `/tmp/traces` | `/tmp/traces/trace_2.json` |
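As a quick sanity check of the table above, here is a standalone snippet that mirrors the mapping (illustrative only; the actual logic lives in `_get_ranked_trace_path()` in `py_executor.py`):

```python
import os


def ranked_trace_path(trace_path: str, rank: int) -> str:
    """Standalone mirror of the per-rank mapping shown above (for illustration)."""
    if os.path.isdir(trace_path):
        # Directory case: put one file per rank inside it.
        return os.path.join(trace_path, f"trace_{rank}.json")
    # Filename case: suffix the stem with the rank, keeping the extension.
    base, ext = os.path.splitext(trace_path)
    return f"{base}_{rank}{ext or '.json'}"


print(ranked_trace_path("trace.json", 0))   # trace_0.json
print(ranked_trace_path("trace.json", 1))   # trace_1.json
print(ranked_trace_path("/tmp/traces", 2))  # /tmp/traces/trace_2.json (when that directory exists)
```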
Testing

- `trtllm-bench` produces unique trace files for each rank when tp=8 and only one file when tp=1.
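One illustrative way to verify the outcome after such a run, assuming the environment variable pointed at `/tmp/traces` (hypothetical snippet, not part of the PR):

```python
import glob
import os

# Directory passed via TLLM_TORCH_PROFILE_TRACE for the benchmark run (assumed).
trace_dir = "/tmp/traces"

traces = sorted(glob.glob(os.path.join(trace_dir, "trace_*.json")))
print(traces)
# Expected with tp=8: trace_0.json ... trace_7.json (one file per rank).
# Expected with tp=1: a single trace_0.json.
```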
Backward Compatibility

Summary
This PR:
Summary by CodeRabbit