Removed examples from TB and added a tag to build TB from that repo t…#240

Merged
urtiwari merged 2 commits into main from urtiwari/TBerrors on Jun 26, 2026

Conversation

@urtiwari

Collaborator

Removed the TransferBench example tests (1–6) from the test code, health config, and docs.

Pinned TransferBench installs to v1.67.00 via a new git_tag config field and post-clone checkout in install_transferbench.py.

Added GFX_TEMPORAL=3 and GFX_UNROLL=32 to the scaling test command.
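One way the two environment variables could be prepended to a test command is sketched below. This is an illustration only, not the code changed in this PR: `with_env` is a hypothetical helper, and the command string passed to it is a placeholder rather than the real scaling test invocation.

```python
import shlex


def with_env(env, command):
    """Prefix a shell command with VAR=value assignments.

    env is a dict of environment variable names to values; each value is
    shell-quoted so that spaces or special characters survive intact.
    """
    prefix = " ".join(f"{name}={shlex.quote(str(value))}" for name, value in env.items())
    return f"{prefix} {command}" if prefix else command


# Values taken from the PR description; the command is a placeholder.
scaling_env = {"GFX_TEMPORAL": 3, "GFX_UNROLL": 32}
print(with_env(scaling_env, "./TransferBench <scaling test arguments>"))
```

Keeping the variables in a dict rather than hard-coding them into the command string would make it easier to move them into the health config later, if that is ever wanted.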

…o resolve Tb errors

Signed-off-by: Urvashi Tiwari <urtiwari.com>

@solaiys solaiys left a comment

Collaborator


Reviewed the TransferBench changes. The example-test removal is clean (no dangling references to example_tests_path / example_results / the removed functions), and the scaling-test env-var change looks fine.

One blocking correctness issue on the new version-pin checkout, plus a minor robustness nit on the git_tag lookup — both inline.

Comment on lines +220 to +223
out_dict = orch.exec(
    f"bash -c 'cd {tb_src} && git checkout {git_tag}'",
    timeout=120,
)

Collaborator


Blocking (correctness): the git checkout result isn't verified, so the version pin can silently fail open.

  • The out_dict returned by this git checkout is discarded — there's no per-node error check.
  • The only post-install verification is ls -l {git_install_path}/TransferBench (line 239), which confirms the clone directory exists, not that the tag was actually checked out.
  • So if git checkout {git_tag} fails (tag missing, dirty/detached tree, fetch issue), the build below proceeds on the default branch, the test still passes, and you get a false-green that looks like v1.67.00 but isn't.
  • Pinning to v1.67.00 is the whole purpose of this PR, so an unverified checkout makes the pin a no-op exactly when it matters.

Suggested fix — scan the checkout output the same way the install step does:

Suggested change
- out_dict = orch.exec(
-     f"bash -c 'cd {tb_src} && git checkout {git_tag}'",
-     timeout=120,
- )
+ out_dict = orch.exec(
+     f"bash -c 'cd {tb_src} && git checkout {git_tag}'",
+     timeout=120,
+ )
+ # Fail loudly if the tag checkout did not succeed; otherwise the build below
+ # silently falls back to the default branch and the version pin is a no-op.
+ for node in out_dict.keys():
+     if re.search(r'error:|fatal:|did not match any', out_dict[node] or '', re.I):
+         fail_test(f'TransferBench checkout of tag {git_tag} failed on node {node}: {out_dict[node]}')
  • Even stronger (optional): assert the resolved HEAD matches the tag — e.g. run git describe --tags/git rev-parse HEAD after checkout and compare against {git_tag} — so a wrong-but-non-erroring checkout is also caught.
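
The optional stronger check in the last bullet could look roughly like the sketch below. It is not the code merged in this PR. It assumes a second `orch.exec` call runs `git describe --tags --exact-match` inside the clone and returns the same node-to-output dict as the checkout call, and that the output contains nothing but git's own text. `find_unpinned_nodes` is a hypothetical helper name, and the sample data is made up.

```python
def find_unpinned_nodes(describe_out, git_tag):
    """Return {node: resolved_text} for every node whose HEAD is not exactly at git_tag.

    describe_out is assumed to map node name -> output text of
    `git describe --tags --exact-match` run inside the TransferBench clone.
    """
    bad = {}
    for node, output in describe_out.items():
        # A missing output (None) is treated the same as an empty one.
        resolved = (output or "").strip()
        if resolved != git_tag:
            bad[node] = resolved
    return bad


# Illustrative data only; node names and outputs are invented.
sample = {
    "node-a": "v1.67.00\n",
    "node-b": "fatal: no tag exactly matches 'abc1234'\n",
    "node-c": None,
}
for node, resolved in find_unpinned_nodes(sample, "v1.67.00").items():
    print(f"{node}: expected v1.67.00, got {resolved!r}")
```

An exact string comparison is stricter than the regex scan: it also catches a checkout that exits cleanly but lands on a different ref. One caveat is that when several tags point at the same commit, `git describe` may print a different tag name than the one requested, so comparing `git rev-parse` output for `HEAD` and the tag would be the more robust variant.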

)

tb_src = f'{git_install_path}/TransferBench'
git_tag = config_dict['git_tag']

Collaborator


Minor (robustness): git_tag is read with a bare subscript.

  • This is inconsistent with the nearby config_dict.get('rocm_path', '') and config_dict.get(...) usage.
  • git_tag is now effectively required for every TransferBench health config; an older or copied config without it will crash here with an opaque KeyError instead of a clear, actionable message.
  • Only mi300_health_config.json exists today (and it does define git_tag), so this isn't a current crash — it's future-proofing for other configs.

Suggested fix — fail cleanly with a clear message when it's absent:

Suggested change
- git_tag = config_dict['git_tag']
+ git_tag = config_dict.get('git_tag')
+ if not git_tag:
+     fail_test('TransferBench config is missing the required "git_tag" field')
+     update_test_result()
+     return
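
If more required fields are added later, the same pattern could be factored into a small helper. The sketch below is an assumption about how that might look, not part of this PR: `require_config_field` is a hypothetical name, and the caller would still be responsible for invoking `fail_test` and `update_test_result` as in the suggestion above.

```python
def require_config_field(config_dict, key):
    """Return (value, None) if key is present and non-empty, else (None, message)."""
    value = config_dict.get(key)
    if not value:
        return None, f'TransferBench config is missing the required "{key}" field'
    return value, None


# Illustrative config without git_tag; the rocm_path value is made up.
git_tag, error = require_config_field({"rocm_path": "/opt/rocm"}, "git_tag")
if error:
    print(error)
```

Returning the message instead of raising keeps the helper independent of the test framework, so an empty string and a missing key are both reported the same way.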

Signed-off-by: Urvashi Tiwari <urtiwari.com>

@solaiys solaiys left a comment

Collaborator


Both comments are addressed.
LGTM.

@urtiwari urtiwari merged commit baa6c48 into main Jun 26, 2026
2 checks passed