
Commit b68b29b

mauri-melato authored and committed
Fix integration test for Slurm: get num of slots
After fixing the configuration of the compute nodes in a Slurm cluster and setting the CPU as a consumable resource, we should also fix job submission in the integration tests. In order to properly test the scale-up, a single job submission should allocate all the slots available on a compute node.

The fix has been tested.

Stage 1: two jobs submitted (one running and one pending):

    JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
        3   compute  job2.sh   centos PD  0:00     1 (Resources)
        2   compute  job1.sh   centos  R  5:18     1 ip-10-0-82-245

One node with the 2 CPUs allocated:

    [centos@ip-10-0-235-160 ~]$ scontrol show nodes --all
    NodeName=ip-10-0-82-245 Arch=x86_64 CoresPerSocket=1 CPUAlloc=2 CPUErr=0 CPUTot=2 CPULoad=0.11
       AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null)
       NodeAddr=ip-10-0-82-245 NodeHostName=ip-10-0-82-245 Version=16.05 OS=Linux
       RealMemory=3711 AllocMem=0 FreeMem=3022 Sockets=2 Boards=1
       State=ALLOCATED ThreadsPerCore=1 TmpDisk=14989 Weight=1 Owner=N/A MCS_label=N/A
       BootTime=2018-07-31T14:37:49 SlurmdStartTime=2018-07-31T14:41:31
       CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Stage 2: the second compute node joins the cluster and the two jobs are both running on two different hosts:

    [centos@ip-10-0-235-160 ~]$ squeue --states=all
    JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
        2   compute  job1.sh   centos  R  6:14     1 ip-10-0-82-245
        3   compute  job2.sh   centos  R  0:34     1 ip-10-0-121-16

    [centos@ip-10-0-235-160 ~]$ scontrol show nodes --all
    NodeName=ip-10-0-82-245 Arch=x86_64 CoresPerSocket=1 CPUAlloc=2 CPUErr=0 CPUTot=2 CPULoad=0.11
       AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null)
       NodeAddr=ip-10-0-82-245 NodeHostName=ip-10-0-82-245 Version=16.05 OS=Linux
       RealMemory=3711 AllocMem=0 FreeMem=3022 Sockets=2 Boards=1
       State=ALLOCATED ThreadsPerCore=1 TmpDisk=14989 Weight=1 Owner=N/A MCS_label=N/A
       BootTime=2018-07-31T14:37:49 SlurmdStartTime=2018-07-31T14:41:31
       CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    NodeName=ip-10-0-121-16 Arch=x86_64 CoresPerSocket=1 CPUAlloc=2 CPUErr=0 CPUTot=2 CPULoad=0.37
       AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null)
       NodeAddr=ip-10-0-121-16 NodeHostName=ip-10-0-121-16 Version=(null) OS=Linux
       RealMemory=3711 AllocMem=0 FreeMem=3035 Sockets=2 Boards=1
       State=ALLOCATED ThreadsPerCore=1 TmpDisk=14989 Weight=1 Owner=N/A MCS_label=N/A
       BootTime=2018-07-31T14:43:46 SlurmdStartTime=2018-07-31T14:47:35
       CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Signed-off-by: Maurizio Melato <[email protected]>
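The message above refers to CPUs being configured as the consumable resource on the compute nodes. As a quick sanity check, not part of this commit, the relevant scheduler settings can be inspected on the head node; the expected values in the comment are an assumption based on the commit description, not something this page confirms:

    # Check which resources Slurm treats as consumable (assumed expected output:
    # SelectType = select/cons_res and SelectTypeParameters = CR_CPU).
    scontrol show config | grep -E '^Select'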
1 parent 7eddac1 commit b68b29b

File tree

1 file changed: +21 -2 lines changed


tests/cluster-check.sh

Lines changed: 21 additions & 2 deletions
@@ -43,6 +43,19 @@ sge_get_slots() {
     echo ${ppn}
 }
 
+slurm_get_slots() {
+    local -- ppn i=0
+    ppn=$(scontrol show nodes -o | head -n 1 | sed -n -e 's/^.* CPUTot=\([0-9]\+\) .*$/\1/p')
+    # wait 15 secs before giving up retrieving the slots per host
+    while [ -z "${ppn}" -a $i -lt 15 ]; do
+        sleep 1
+        i=$((i+1))
+        ppn=$(scontrol show nodes -o | head -n 1 | sed -n -e 's/^.* CPUTot=\([0-9]\+\) .*$/\1/p')
+    done
+
+    echo ${ppn}
+}
+
 torque_get_slots() {
     local -- chost ppn i=0
 
@@ -76,6 +89,12 @@ set -e
 # less than 8 minutes in order for the test to succeed.
 
 if test "$scheduler" = "slurm" ; then
+    _ppn=$(slurm_get_slots)
+    if [ -z "${_ppn}" ]; then
+        >&2 echo "The number of slots per instance couldn't be retrieved, no compute nodes available in Slurm cluster"
+        exit 1
+    fi
+
     cat > job1.sh <<EOF
 #!/bin/bash
 srun sleep ${_sleepjob1}
@@ -90,8 +109,8 @@ EOF
     chmod +x job1.sh job2.sh
     rm -f job1.done job2.done
 
-    sbatch -N 1 ./job1.sh
-    sbatch -N 1 ./job2.sh
+    sbatch -N 1 -n ${_ppn} ./job1.sh
+    sbatch -N 1 -n ${_ppn} ./job2.sh
 
 elif test "$scheduler" = "sge" ; then
     # get the slots per node count of the first real node (one with a
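For reference, here is a minimal standalone sketch (not part of the diff) showing how the sed expression used by the new slurm_get_slots() helper extracts the slot count from a one-line node record; the sample line is adapted from the scontrol output quoted in the commit message:

    # Hypothetical standalone check: extract CPUTot from a sample node record
    # in the single-line format produced by `scontrol show nodes -o`.
    sample='NodeName=ip-10-0-82-245 Arch=x86_64 CoresPerSocket=1 CPUAlloc=2 CPUErr=0 CPUTot=2 CPULoad=0.11 Sockets=2 Boards=1 State=ALLOCATED ThreadsPerCore=1'
    echo "${sample}" | sed -n -e 's/^.* CPUTot=\([0-9]\+\) .*$/\1/p'
    # prints: 2

With that value in hand, each test job is submitted with sbatch -N 1 -n ${_ppn}, so a single job fills all the CPUs of one compute node and the second job has to wait for the scale-up to add another node, which is exactly the behaviour the integration test is meant to exercise.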
