9 changes: 6 additions & 3 deletions 3.test_cases/23.SMHP-esm2/4.train_docker_dpp.sh
Original file line number Diff line number Diff line change
@@ -14,13 +14,16 @@ set -ex;
###### User Variables #####
###########################

# Override ESM2_DATA_DIR if your shared filesystem mount differs.
: "${ESM2_DATA_DIR:=/fsxl/${USER}/esm2}"

GPUS_PER_NODE=8 # 4 for G5.12x, 8 for P4/P5

DATASET_DIR="/fsxl/awsankur/esm2/processed/arrow"
DATASET_DIR="${ESM2_DATA_DIR}/processed/arrow"

OUTPUT_DIR="/fsxl/awsankur/esm2/output"
OUTPUT_DIR="${ESM2_DATA_DIR}/output"

IMAGE="/fsxl/awsankur/esm2/esm.sqsh"
IMAGE="${ESM2_DATA_DIR}/esm.sqsh"
###########################
## Environment Variables ##
###########################
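The `: "${VAR:=default}"` line in this hunk relies on Bash parameter expansion: the `:` builtin does nothing, but expanding its argument assigns the default when the variable is unset or empty. A minimal sketch of that behavior follows. The function name `resolve_data_dir`, the user `alice`, and the override path `/mnt/shared/esm2` are illustrative assumptions and do not appear in the PR.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR: resolves the data directory the
# same way the script does, using a ":=" default.
resolve_data_dir() {
    # Run in a subshell so the caller's variables are left untouched.
    (
        : "${ESM2_DATA_DIR:=/fsxl/${USER}/esm2}"
        printf '%s\n' "${ESM2_DATA_DIR}"
    )
}

# Variable unset: the per-user default is used.
(unset ESM2_DATA_DIR; USER=alice; resolve_data_dir)      # prints /fsxl/alice/esm2

# Variable set by the caller: the default is ignored.
(ESM2_DATA_DIR=/mnt/shared/esm2; resolve_data_dir)       # prints /mnt/shared/esm2
```

Because `:=` (with the colon) also treats an empty value as unset, `ESM2_DATA_DIR=""` falls back to the default as well.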
10 changes: 7 additions & 3 deletions 3.test_cases/23.SMHP-esm2/enroot.sh
@@ -1,6 +1,10 @@
#!/bin/bash
# Override ESM2_DATA_DIR if your shared filesystem mount differs.
: "${ESM2_DATA_DIR:=/fsxl/${USER}/esm2}"

file_name=/fsxl/awsankur/esm2/esm.sqsh
[ -f $file_name ] && rm $file_name
mkdir -p "${ESM2_DATA_DIR}"

enroot import -o $file_name dockerd://esm:aws
file_name="${ESM2_DATA_DIR}/esm.sqsh"
[ -f "$file_name" ] && rm "$file_name"

enroot import -o "$file_name" dockerd://esm:aws
10 changes: 8 additions & 2 deletions 3.test_cases/megatron/bionemo/bionemo_2.5/enroot.sh
@@ -1,5 +1,11 @@
#!/bin/bash
# Override DATA_HOME_DIR if your shared filesystem mount differs.
: "${DATA_HOME_DIR:=/fsxl/${USER}/bionemo}"

rm /fsxl/awsankur/bionemo.sqsh
mkdir -p "${DATA_HOME_DIR}"

enroot import -o /fsxl/awsankur/bionemo/bionemo.sqsh dockerd://bionemo:aws
# Remove any prior squash image so the import below can write fresh.
# Path matches the IMAGE default used by train-esm.sbatch.
rm -f "${DATA_HOME_DIR}/bionemo.sqsh"

enroot import -o "${DATA_HOME_DIR}/bionemo.sqsh" dockerd://bionemo:aws
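The switch from a bare `rm` to `mkdir -p` followed by `rm -f` matters on a first run: plain `rm` exits non-zero when the file does not exist, while `rm -f` succeeds silently, and `mkdir -p` ensures the target directory is present before the import writes to it. The sketch below shows that sequence against a temporary directory. The helper name `clear_image` and the temporary paths are assumptions made for illustration, and the `enroot import` step itself is left out.

```shell
#!/bin/bash
# Hypothetical helper for illustration: makes sure the image directory exists
# and removes a previous image file if one is there.
clear_image() {
    local image_path="$1"
    mkdir -p "$(dirname "$image_path")"   # no error if the directory exists
    rm -f "$image_path"                   # no error if the file is missing
}

workdir="$(mktemp -d)"

# First run: nothing to delete, but the call still succeeds.
clear_image "${workdir}/bionemo/bionemo.sqsh" && echo "first run ok"

# Later run: a stale image exists and is removed.
touch "${workdir}/bionemo/bionemo.sqsh"
clear_image "${workdir}/bionemo/bionemo.sqsh" && echo "stale image removed"

rm -rf "${workdir}"
```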
6 changes: 5 additions & 1 deletion 3.test_cases/megatron/bionemo/bionemo_2.5/get-data.sh
@@ -1,3 +1,7 @@
#!/bin/bash
# Override DATA_HOME_DIR if your shared filesystem mount differs.
: "${DATA_HOME_DIR:=/fsxl/${USER}/bionemo}"

docker run --rm -v /fsxl/awsankur/bionemo:/root/.cache/bionemo bionemo:aws download_bionemo_data esm2/testdata_esm2_pretrain:2.0
mkdir -p "${DATA_HOME_DIR}"
docker run --rm -v "${DATA_HOME_DIR}:/root/.cache/bionemo" bionemo:aws \
download_bionemo_data esm2/testdata_esm2_pretrain:2.0
6 changes: 4 additions & 2 deletions 3.test_cases/megatron/bionemo/bionemo_2.5/train-esm.sbatch
@@ -9,8 +9,10 @@
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO

#Path to store data and checkpoints
export DATA_HOME_DIR=/fsxl/awsankur/bionemo
# Path to store data and checkpoints. Override DATA_HOME_DIR to match your
# shared filesystem mount.
: "${DATA_HOME_DIR:=/fsxl/${USER}/bionemo}"
export DATA_HOME_DIR

###########################
###### User Variables #####
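The separate `export DATA_HOME_DIR` line in this hunk is needed because a default assigned through `:=` creates an ordinary shell variable, which child processes do not see unless it is exported. The following sketch illustrates the difference. The function `show_child_view` and the user name `alice` are hypothetical and used only for demonstration.

```shell
#!/bin/bash
# Hypothetical demo, not part of the PR: shows what a child process sees
# with and without the export.
show_child_view() {
    (
        unset DATA_HOME_DIR            # start from a clean slate
        USER=alice
        : "${DATA_HOME_DIR:=/fsxl/${USER}/bionemo}"
        if [ "$1" = "export" ]; then
            export DATA_HOME_DIR
        fi
        # The child shell only inherits exported variables.
        bash -c 'printf "%s\n" "${DATA_HOME_DIR:-<unset>}"'
    )
}

show_child_view no-export   # prints <unset>
show_child_view export      # prints /fsxl/alice/bionemo
```

When the caller already exported `DATA_HOME_DIR` before submitting the job, the variable keeps its export attribute and the extra `export` is harmless.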
Original file line number Diff line number Diff line change
@@ -1,12 +1,17 @@
#! /bin/bash -x
#!/bin/bash

Collaborator

Shebang change silently drops set -x tracing

I noticed the shebang changed from #! /bin/bash -x to #!/bin/bash. Normalizing the stray space after #! is a good cleanup, but the same edit also removes the -x flag, which turned on execution tracing for the whole script. For an Nsight profiling helper, that trace is often there deliberately so that the exact nsys profile ... invocation shows up in the Slurm logs; losing it makes failed profiling runs harder to debug. Could you confirm that dropping -x is intentional? If the trace is needed, I'd keep it as #!/bin/bash -x (or add an explicit set -x after the shebang) so the normalization doesn't change observable behavior.
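To make the explicit `set -x` option concrete, here is a minimal sketch of a wrapper that traces only the wrapped command. The function name `run_traced` is hypothetical and `echo` stands in for the real `nsys profile` invocation. One relevant detail: flags on the shebang line apply only when the script is executed directly, so running it as `bash script.sh` ignores `-x`, whereas `set -x` inside the script takes effect either way.

```shell
#!/bin/bash
# Hypothetical wrapper for illustration: traces the wrapped command to stderr
# and preserves its exit status.
run_traced() {
    local status
    set -x                                  # start tracing
    "$@"
    { status=$?; set +x; } 2>/dev/null      # stop tracing without logging this line
    return "$status"
}

# The command line is written to stderr with the PS4 prefix ("+ " by default),
# and the command's own output still goes to stdout.
run_traced echo "profiling rank 0"
```

Scoping the trace this way keeps the rest of the script's log output quiet while still recording the exact command that ran.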

Contributor Author

Thank you for the detailed explanation. I will send a second revision to resolve this comment.


# Override NSYS_OUTPUT_DIR to point at your shared filesystem location for
# Nsight profile reports.
: "${NSYS_OUTPUT_DIR:=/fsx/${USER}/nemotron/results/nemotron4--15B-16g/profile_logs}"

NSYS_EXTRAS=""
if [ "$SLURM_LOCALID" == "0" ]; then
NSYS_EXTRAS="--enable efa_metrics"
fi

if [ "$SLURM_PROCID" == "0" ]; then
/fsx/nsight-efa-latest/target-linux-x64/nsys profile $NSYS_EXTRAS --sample none --delay 330 --duration 50 -o /fsx/awsankur/nemotron/results/nemotron4--15B-16g/profile_logs/profile_%q{SLURM_JOB_ID}_node_%q{SLURM_NODEID}_rank_%q{SLURM_PROCID}_on_%q{HOSTNAME}.nsys-rep --force-overwrite true \
mkdir -p "${NSYS_OUTPUT_DIR}"
/fsx/nsight-efa-latest/target-linux-x64/nsys profile $NSYS_EXTRAS --sample none --delay 330 --duration 50 -o "${NSYS_OUTPUT_DIR}/profile_%q{SLURM_JOB_ID}_node_%q{SLURM_NODEID}_rank_%q{SLURM_PROCID}_on_%q{HOSTNAME}.nsys-rep" --force-overwrite true \
"$@"
else
"$@"