Perf script utility to lock gpu frequency. by dingqingy-nv · Pull Request #2977 · NVIDIA-NeMo/Megatron-Bridge

dingqingy-nv · 2026-03-24T23:43:06Z

What does this PR do ?

provide a command line option in perf script for locking gpu frequency. This can be helpful for correlation study.

Summary by CodeRabbit

New Features
- Added -lgc/--lock_gpu_freq CLI option to lock GPU graphics clock frequency (MHz) during training on compute clusters, with automatic reset after execution.

Signed-off-by: Dingqing Yang <[email protected]>

copy-pr-bot · 2026-03-24T23:43:10Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-03-25T01:33:44Z

📝 Walkthrough

Walkthrough

These changes add GPU frequency locking capability to the performance experiment setup system. A new CLI argument accepts an optional GPU graphics clock frequency in MHz, which is threaded through the setup pipeline and applied via Slurm srun commands to lock GPU clocks before training.

Changes

Cohort / File(s)	Summary
CLI Argument Definition `scripts/performance/argument_parser.py`	Added new `-lgc/--lock_gpu_freq` option to "Performance arguments" group accepting an optional integer value for GPU graphics clock frequency (MHz), with execution details for `sudo nvidia-smi -lgc` per node.
Plugin Implementation `scripts/performance/perf_plugins.py`	Added `lock_gpu_freq: int
Experiment Setup `scripts/performance/setup_experiment.py`	Added `lock_gpu_freq: Optional[int]` parameter to `main()` function and threaded it into `PerfEnvPlugin` instantiation to wire CLI arguments to the plugin.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes	⚠️ Warning	PR adds GPU frequency locking feature but provides no test results, performance measurements, or validation evidence in the description.	Include test results, performance measurements, sample usage output, or documentation confirming the feature works correctly and causes no regressions.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Perf script utility to lock gpu frequency' directly and clearly summarizes the main change: adding a utility feature to the perf script for locking GPU frequency, which is confirmed by the changeset additions across argument parser, plugin implementation, and setup modules.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

scripts/performance/perf_plugins.py (1)
428-432: Consider distinct command labels when both features are enabled.

Both _set_vboost and _set_lock_gpu_freq use "# Command 0:" in their comments. If a user enables both features simultaneously, the setup script will have two commands labeled "Command 0", which could be confusing when debugging.
💡 Optional: Use descriptive labels instead of numeric indices
             lock_freq_cmd = "\n".join(
                 [
                     "",
-                    "# Command 0: lock GPU graphics clock",
+                    "# Setup: lock GPU graphics clock",
                     " ".join(
Or consider implementing a shared counter if command ordering matters for documentation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/perf_plugins.py` around lines 428 - 432, Both _set_vboost
and _set_lock_gpu_freq generate script comments using the same literal label "#
Command 0:", which can produce duplicate command labels when both features are
enabled; update the comment labels in the lock_freq_cmd and the vboost_cmd
generation to use distinct, descriptive labels (e.g., "# Command: lock GPU
graphics clock" and "# Command: set vboost") or implement a shared incremental
counter used by both functions to produce unique "Command N" labels so the
produced setup script has non‑conflicting command identifiers.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@scripts/performance/perf_plugins.py`:
- Around line 428-432: Both _set_vboost and _set_lock_gpu_freq generate script
comments using the same literal label "# Command 0:", which can produce
duplicate command labels when both features are enabled; update the comment
labels in the lock_freq_cmd and the vboost_cmd generation to use distinct,
descriptive labels (e.g., "# Command: lock GPU graphics clock" and "# Command:
set vboost") or implement a shared incremental counter used by both functions to
produce unique "Command N" labels so the produced setup script has
non‑conflicting command identifiers.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6473137f-4a78-46da-9ee4-380569f9ad03

📥 Commits

Reviewing files that changed from the base of the PR and between 80562fe and c261baa.

📒 Files selected for processing (3)

scripts/performance/argument_parser.py
scripts/performance/perf_plugins.py
scripts/performance/setup_experiment.py

scripts/performance/perf_plugins.py

scripts/performance/argument_parser.py

malay-nagda · 2026-04-06T14:07:01Z

@dingqingy-nv can you add note for the new arg in README and doc string?

Address review feedback: document when/why to use GPU clock locking and add the new arg to the performance README. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Dingqing Yang <[email protected]>

Address review feedback: document the correlation study use case for GPU clock locking and add the new arg to the performance README. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Dingqing Yang <[email protected]>

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Dingqing Yang <[email protected]>

dingqingy-nv · 2026-04-07T04:50:02Z

@malay-nagda fixed, please review and help approve, thanks!

provide perf script util for lock gpu frequency. Helpful for correlation

c261baa

Signed-off-by: Dingqing Yang <[email protected]>

dingqingy-nv added this to the 26.04 milestone Mar 24, 2026

dingqingy-nv added performance performance/release Performance items related with NeMo release labels Mar 24, 2026

dingqingy-nv marked this pull request as ready for review March 25, 2026 01:29

dingqingy-nv requested review from a team, erhoo82 and malay-nagda as code owners March 25, 2026 01:29

copy-pr-bot bot temporarily deployed to test March 25, 2026 01:29 Inactive

coderabbitai bot reviewed Mar 25, 2026

View reviewed changes

copy-pr-bot bot temporarily deployed to nemo-ci March 25, 2026 01:35 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 25, 2026 01:42 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci March 25, 2026 01:48 Inactive

yaoyu-33 previously approved these changes Mar 25, 2026

View reviewed changes

yaoyu-33 added the ready-to-merge PR is approved, current, and only waiting for CI to pass before merge label Mar 25, 2026

malay-nagda reviewed Mar 30, 2026

View reviewed changes

scripts/performance/perf_plugins.py Show resolved Hide resolved

malay-nagda reviewed Mar 30, 2026

View reviewed changes

scripts/performance/argument_parser.py Show resolved Hide resolved

suiyoubi added the needs-author Author action is required before review or merge can continue label Mar 30, 2026

suiyoubi removed the ready-to-merge PR is approved, current, and only waiting for CI to pass before merge label Mar 30, 2026

malay-nagda requested changes Apr 6, 2026

View reviewed changes

dingqingy-nv and others added 2 commits April 6, 2026 21:39

dingqingy-nv dismissed yaoyu-33’s stale review via bf84232 April 7, 2026 04:44

github-actions bot removed the needs-author Author action is required before review or merge can continue label Apr 7, 2026

[perf] fix: lint blank line after docstring

6882c4a

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> Signed-off-by: Dingqing Yang <[email protected]>

dingqingy-nv requested a review from malay-nagda April 7, 2026 04:48

copy-pr-bot bot temporarily deployed to test April 7, 2026 04:48 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci April 7, 2026 05:21 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci April 7, 2026 05:50 Failure

malay-nagda approved these changes Apr 7, 2026

View reviewed changes

Merge branch 'main' into lock_freq

40f6b55

copy-pr-bot bot temporarily deployed to test April 7, 2026 22:08 Inactive

Merge branch 'main' into lock_freq

f10edd4

copy-pr-bot bot temporarily deployed to test April 9, 2026 22:40 Inactive

dingqingy-nv enabled auto-merge (squash) April 10, 2026 01:19

Merge branch 'main' into lock_freq

c0702f5

copy-pr-bot bot temporarily deployed to test April 10, 2026 01:20 Inactive

Merge branch 'main' into lock_freq

1269588

copy-pr-bot bot temporarily deployed to test April 11, 2026 19:02 Inactive

dingqingy-nv added r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. 26.04.01 labels Apr 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perf script utility to lock gpu frequency.#2977

Perf script utility to lock gpu frequency.#2977
dingqingy-nv wants to merge 8 commits intoNVIDIA-NeMo:mainfrom
dingqingy-nv:lock_freq

dingqingy-nv commented Mar 24, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

copy-pr-bot bot commented Mar 24, 2026

Uh oh!

coderabbitai bot commented Mar 25, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

malay-nagda commented Apr 6, 2026

Uh oh!

dingqingy-nv commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

dingqingy-nv commented Mar 24, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Mar 24, 2026

Uh oh!

coderabbitai bot commented Mar 25, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

malay-nagda commented Apr 6, 2026

Uh oh!

dingqingy-nv commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dingqingy-nv commented Mar 24, 2026 •

edited by coderabbitai bot

Loading