Skip to content

Perf script utility to lock gpu frequency.#2977

Open
dingqingy-nv wants to merge 8 commits intoNVIDIA-NeMo:mainfrom
dingqingy-nv:lock_freq
Open

Perf script utility to lock gpu frequency.#2977
dingqingy-nv wants to merge 8 commits intoNVIDIA-NeMo:mainfrom
dingqingy-nv:lock_freq

Conversation

@dingqingy-nv
Copy link
Copy Markdown
Contributor

@dingqingy-nv dingqingy-nv commented Mar 24, 2026

What does this PR do ?

provide a command line option in perf script for locking gpu frequency. This can be helpful for correlation study.

Summary by CodeRabbit

  • New Features
    • Added -lgc/--lock_gpu_freq CLI option to lock GPU graphics clock frequency (MHz) during training on compute clusters, with automatic reset after execution.

@dingqingy-nv dingqingy-nv added this to the 26.04 milestone Mar 24, 2026
@dingqingy-nv dingqingy-nv added performance performance/release Performance items related with NeMo release labels Mar 24, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@dingqingy-nv dingqingy-nv marked this pull request as ready for review March 25, 2026 01:29
@dingqingy-nv dingqingy-nv requested review from a team, erhoo82 and malay-nagda as code owners March 25, 2026 01:29
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

These changes add GPU frequency locking capability to the performance experiment setup system. A new CLI argument accepts an optional GPU graphics clock frequency in MHz, which is threaded through the setup pipeline and applied via Slurm srun commands to lock GPU clocks before training.

Changes

Cohort / File(s) Summary
CLI Argument Definition
scripts/performance/argument_parser.py
Added new -lgc/--lock_gpu_freq option to "Performance arguments" group accepting an optional integer value for GPU graphics clock frequency (MHz), with execution details for sudo nvidia-smi -lgc per node.
Plugin Implementation
scripts/performance/perf_plugins.py
Added `lock_gpu_freq: int
Experiment Setup
scripts/performance/setup_experiment.py
Added lock_gpu_freq: Optional[int] parameter to main() function and threaded it into PerfEnvPlugin instantiation to wire CLI arguments to the plugin.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR adds GPU frequency locking feature but provides no test results, performance measurements, or validation evidence in the description. Include test results, performance measurements, sample usage output, or documentation confirming the feature works correctly and causes no regressions.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Perf script utility to lock gpu frequency' directly and clearly summarizes the main change: adding a utility feature to the perf script for locking GPU frequency, which is confirmed by the changeset additions across argument parser, plugin implementation, and setup modules.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
scripts/performance/perf_plugins.py (1)

428-432: Consider distinct command labels when both features are enabled.

Both _set_vboost and _set_lock_gpu_freq use "# Command 0:" in their comments. If a user enables both features simultaneously, the setup script will have two commands labeled "Command 0", which could be confusing when debugging.

💡 Optional: Use descriptive labels instead of numeric indices
             lock_freq_cmd = "\n".join(
                 [
                     "",
-                    "# Command 0: lock GPU graphics clock",
+                    "# Setup: lock GPU graphics clock",
                     " ".join(

Or consider implementing a shared counter if command ordering matters for documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/perf_plugins.py` around lines 428 - 432, Both _set_vboost
and _set_lock_gpu_freq generate script comments using the same literal label "#
Command 0:", which can produce duplicate command labels when both features are
enabled; update the comment labels in the lock_freq_cmd and the vboost_cmd
generation to use distinct, descriptive labels (e.g., "# Command: lock GPU
graphics clock" and "# Command: set vboost") or implement a shared incremental
counter used by both functions to produce unique "Command N" labels so the
produced setup script has non‑conflicting command identifiers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@scripts/performance/perf_plugins.py`:
- Around line 428-432: Both _set_vboost and _set_lock_gpu_freq generate script
comments using the same literal label "# Command 0:", which can produce
duplicate command labels when both features are enabled; update the comment
labels in the lock_freq_cmd and the vboost_cmd generation to use distinct,
descriptive labels (e.g., "# Command: lock GPU graphics clock" and "# Command:
set vboost") or implement a shared incremental counter used by both functions to
produce unique "Command N" labels so the produced setup script has
non‑conflicting command identifiers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6473137f-4a78-46da-9ee4-380569f9ad03

📥 Commits

Reviewing files that changed from the base of the PR and between 80562fe and c261baa.

📒 Files selected for processing (3)
  • scripts/performance/argument_parser.py
  • scripts/performance/perf_plugins.py
  • scripts/performance/setup_experiment.py

yaoyu-33
yaoyu-33 previously approved these changes Mar 25, 2026
@yaoyu-33 yaoyu-33 added the ready-to-merge PR is approved, current, and only waiting for CI to pass before merge label Mar 25, 2026
@suiyoubi suiyoubi added the needs-author Author action is required before review or merge can continue label Mar 30, 2026
@suiyoubi suiyoubi removed the ready-to-merge PR is approved, current, and only waiting for CI to pass before merge label Mar 30, 2026
@malay-nagda
Copy link
Copy Markdown
Contributor

@dingqingy-nv can you add note for the new arg in README and doc string?

dingqingy-nv and others added 2 commits April 6, 2026 21:39
Address review feedback: document when/why to use GPU clock locking
and add the new arg to the performance README.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Dingqing Yang <[email protected]>
Address review feedback: document the correlation study use case
for GPU clock locking and add the new arg to the performance README.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Dingqing Yang <[email protected]>
@github-actions github-actions bot removed the needs-author Author action is required before review or merge can continue label Apr 7, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Dingqing Yang <[email protected]>
@dingqingy-nv
Copy link
Copy Markdown
Contributor Author

@malay-nagda fixed, please review and help approve, thanks!

@dingqingy-nv dingqingy-nv enabled auto-merge (squash) April 10, 2026 01:19
@dingqingy-nv dingqingy-nv added r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. 26.04.01 labels Apr 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

26.04.01 performance/release Performance items related with NeMo release performance r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants