Removed examples from TB and added a tag to build TB from that repo t…#240
Conversation
…o resolve Tb errors Signed-off-by: Urvashi Tiwari <urtiwari.com>
solaiys
left a comment
There was a problem hiding this comment.
Reviewed the TransferBench changes. The example-test removal is clean (no dangling references to example_tests_path / example_results / the removed functions), and the scaling-test env-var change looks fine.
One blocking correctness issue on the new version-pin checkout, plus a minor robustness nit on the git_tag lookup — both inline.
| out_dict = orch.exec( | ||
| f"bash -c 'cd {tb_src} && git checkout {git_tag}'", | ||
| timeout=120, | ||
| ) |
There was a problem hiding this comment.
Blocking (correctness): the git checkout result isn't verified, so the version pin can silently fail open.
- The
out_dictreturned by thisgit checkoutis discarded — there's no per-node error check. - The only post-install verification is
ls -l {git_install_path}/TransferBench(line 239), which confirms the clone directory exists, not that the tag was actually checked out. - So if
git checkout {git_tag}fails (tag missing, dirty/detached tree, fetch issue), the build below proceeds on the default branch, the test still passes, and you get a false-green that looks likev1.67.00but isn't. - Pinning to
v1.67.00is the whole purpose of this PR, so an unverified checkout makes the pin a no-op exactly when it matters.
Suggested fix — scan the checkout output the same way the install step does:
| out_dict = orch.exec( | |
| f"bash -c 'cd {tb_src} && git checkout {git_tag}'", | |
| timeout=120, | |
| ) | |
| out_dict = orch.exec( | |
| f"bash -c 'cd {tb_src} && git checkout {git_tag}'", | |
| timeout=120, | |
| ) | |
| # Fail loudly if the tag checkout did not succeed; otherwise the build below | |
| # silently falls back to the default branch and the version pin is a no-op. | |
| for node in out_dict.keys(): | |
| if re.search(r'error:|fatal:|did not match any', out_dict[node] or '', re.I): | |
| fail_test(f'TransferBench checkout of tag {git_tag} failed on node {node}: {out_dict[node]}') |
- Even stronger (optional): assert the resolved HEAD matches the tag — e.g. run
git describe --tags/git rev-parse HEADafter checkout and compare against{git_tag}— so a wrong-but-non-erroring checkout is also caught.
| ) | ||
|
|
||
| tb_src = f'{git_install_path}/TransferBench' | ||
| git_tag = config_dict['git_tag'] |
There was a problem hiding this comment.
Minor (robustness): git_tag is read with a bare subscript.
- This is inconsistent with the nearby
config_dict.get('rocm_path', '')andconfig_dict.get(...)usage. git_tagis now effectively required for every TransferBench health config; an older or copied config without it will crash here with an opaqueKeyErrorinstead of a clear, actionable message.- Only
mi300_health_config.jsonexists today (and it does definegit_tag), so this isn't a current crash — it's future-proofing for other configs.
Suggested fix — fail cleanly with a clear message when it's absent:
| git_tag = config_dict['git_tag'] | |
| git_tag = config_dict.get('git_tag') | |
| if not git_tag: | |
| fail_test('TransferBench config is missing the required "git_tag" field') | |
| update_test_result() | |
| return |
Signed-off-by: Urvashi Tiwari <urtiwari.com>
solaiys
left a comment
There was a problem hiding this comment.
Both comments are addressed.
LGTM.
Removed the TransferBench example tests (1–6) from the test code, health config, and docs.
Pinned TransferBench installs to v1.67.00 via a new git_tag config field and post-clone checkout in install_transferbench.py.
Added GFX_TEMPORAL=3 and GFX_UNROLL=32 to the scaling test command.