Skip to content

fix(test_cases): replace hardcoded user paths with overridable env vars#1092

Open
Zhenye-Na wants to merge 1 commit into
awslabs:mainfrom
Zhenye-Na:fix/portable-data-paths
Open

fix(test_cases): replace hardcoded user paths with overridable env vars#1092
Zhenye-Na wants to merge 1 commit into
awslabs:mainfrom
Zhenye-Na:fix/portable-data-paths

Conversation

@Zhenye-Na

Copy link
Copy Markdown
Contributor

Introduce DATA_HOME_DIR, ESM2_DATA_DIR, and NSYS_OUTPUT_DIR environment variables that default to the original values. Fix bionemo enroot.sh removing wrong path for bionemo.sqsh.

Purpose

Fixes portability issues where scripts assumed a specific user's filesystem layout, making them unusable without manual edits.

Changes

  • Replace hardcoded data paths in SMHP-esm2 and bionemo test cases with DATA_HOME_DIR and ESM2_DATA_DIR env vars (defaulting to original values)
  • Replace hardcoded nsight output path with NSYS_OUTPUT_DIR env var
  • Fix bionemo enroot.sh removing wrong path for bionemo.sqsh (was targeting source instead of enroot import output)
  • Normalize nemotron shebang to #!/bin/bash

Test Plan

N/A - shell variable substitution only; no behavioral change when env vars are unset (defaults match original hardcoded values).

Environment:

  • AWS Service: N/A
  • Instance type: N/A
  • Number of nodes: N/A

Test commands:

# Verify defaults match original behavior
shellcheck 3.test_cases/23.SMHP-esm2/*.sh
shellcheck 3.test_cases/megatron/bionemo/bionemo_2.5/*.sh

Test Results

No runtime changes when env vars are unset. Scripts produce identical behavior to before.

Directory Structure

No new directories or files added. Changes are to existing scripts only.

Checklist

  • I have read the contributing guidelines.
  • I am working against the latest main branch.
  • I have searched existing open and recently merged PRs to confirm this is not a duplicate.
  • The contribution is self-contained with documentation and scripts.
  • External dependencies are pinned to a specific version or tag (no latest).
  • A README is included or updated with prerequisites, instructions, and known issues.
  • New test cases follow the expected directory structure.

Introduce DATA_HOME_DIR, ESM2_DATA_DIR, and NSYS_OUTPUT_DIR environment
variables that default to the original values. Fix bionemo enroot.sh
removing wrong path for bionemo.sqsh.

@KeitaW KeitaW left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 1/2 — Deployment Pipeline & Operational Correctness

Thanks for this portability fix — it's small, consistent, and well-targeted. One note below worth confirming before merge.

@@ -1,12 +1,17 @@
#! /bin/bash -x
#!/bin/bash

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shebang change silently drops set -x tracing

I noticed the shebang changed from #! /bin/bash -x to #!/bin/bash. Normalizing the stray space after #! is a good cleanup, but the same edit also removes the -x flag, which was turning on execution tracing for the whole script. For an Nsight profiling helper, that trace is often deliberately there so the exact nsys profile ... invocation shows up in the Slurm logs — losing it makes failed profiling runs harder to debug. Could you confirm dropping -x is intentional? If the trace was load-bearing, I'd keep it as #!/bin/bash -x (or add an explicit set -x after the shebang) so the normalization doesn't change observable behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed explanation. I will send out a second revision to get this comment resolved

@KeitaW KeitaW left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 2/2 — Documentation Consistency

New override env vars aren't documented anywhere

Files: 3.test_cases/23.SMHP-esm2/, 3.test_cases/megatron/bionemo/bionemo_2.5/, 4.validation_and_observability/5.nsight/nemotron/ (READMEs)

The three new knobs — ESM2_DATA_DIR, DATA_HOME_DIR, NSYS_OUTPUT_DIR — are the user-facing payoff of this PR, but a reader can only discover them by opening the scripts. I checked the corresponding READMEs and none mention them (and the SMHP-esm2 README's prerequisites section would be the natural home). The PR checklist also leaves "README updated" unchecked. I'd suggest a one-line note per test case along the lines of "Set DATA_HOME_DIR to your shared-filesystem mount before running; it defaults to /fsxl/$USER/bionemo." so the override is discoverable without reading source. Not blocking, but it's the difference between a knob people use and one they never find.


Things That Look Great

  • The bionemo enroot.sh bug fix is real and correct. The original removed /fsxl/awsankur/bionemo.sqsh (parent dir) but imported to /fsxl/awsankur/bionemo/bionemo.sqsh (subdir) — so the cleanup never actually deleted the stale image. The fix now removes and imports to the same ${DATA_HOME_DIR}/bionemo.sqsh, and I verified that path matches the IMAGE default in train-esm.sbatch. The inline comment explaining the path match is a nice touch.
  • Consistent, idiomatic env-var defaulting. Using : "${VAR:=default}" (and export VAR separately in the sbatch) is exactly the right pattern, and the defaults are kept consistent across sibling scripts (ESM2_DATA_DIR in SMHP-esm2's enroot.sh and 4.train_docker_dpp.sh; DATA_HOME_DIR across all three bionemo scripts).
  • Quoting was tightened along the way. The SMHP-esm2 enroot.sh went from unquoted $file_name to "$file_name" and gained a trailing newline — small but correct hygiene improvements bundled sensibly with the path change.
  • Helpful override comments. Every introduced default is preceded by a short "Override … if your shared filesystem mount differs" comment, which makes the intent obvious at the point of use.
  • Tightly scoped. Six files, all related to the same portability theme, no drive-by refactors. Easy to review.

KeitaW

This comment was marked as duplicate.

@KeitaW KeitaW left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants