Merge pull request chapel-lang#5316 from ronawho/slurm-aware-paratest
Add a simple slurm-aware paratest wrapper for chapcs

[reviewed by @mppf and @ben-albrecht]

Add a basic wrapper for paratest that does an salloc and runs paratest. It
essentially does:

    salloc --nodes=${num_free_nodes} --partition=chapel --share paratest.server

You use it just like paratest.server, except that you don't set -nodefile or
-nodepara. The wrapper automatically determines how many nodes to use by
calculating how many nodes are not reserved exclusively. It also takes a few
steps to make paratests play nicer with other paratests in order to avoid
timeouts.

This should allow us to run a std paratest in ~20 minutes without interfering
with nightly testing and without causing timeouts for ourselves or other
developers. It should also make it easy for devs to grab a node exclusively
during the day to run performance tests. They'll have to wait for existing
paratests to finish, but paratests started after an exclusive reservation will
leave nodes open for that job.

With regular paratest I often see timeouts even if just one other person is
running paratest. With this, I was able to run 6 concurrent paratests (max
allowed by slurm) without getting any timeouts. I tested with both gasnet and
std configuration paratests.


Some details about the wrapper:
-------------------------------

To calculate the number of nodes to use, we use `sinfo` to determine how many
nodes are online and `squeue` to determine how many nodes are reserved by
non-shared jobs. This allows us to run on all nodes not being used exclusively
(for example by nightly testing, or by a developer doing performance testing).
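
The node calculation can be sketched without a live cluster. This is a minimal
illustration, not the wrapper itself: `count_exclusive_nodes` is a hypothetical
helper that reads "SHARED,NODES" lines (the `squeue --format="%h,%D"` shape the
script uses) from stdin, and the sample job list and the online node count of 4
are made up.

```shell
#!/usr/bin/env bash

# Hypothetical helper: sum the node counts of exclusive (SHARED=no) jobs.
# Expects one "SHARED,NODES" pair per line on stdin.
count_exclusive_nodes() {
  local total=0 shared nodes
  while IFS=, read -r shared nodes; do
    if [[ "${shared}" == "no" ]]; then
      total=$((total + nodes))
    fi
  done
  echo "${total}"
}

# Made-up sample: 4 nodes online, one exclusive 1-node job, two shared jobs.
online_nodes=4
exclusive_nodes=$(printf 'no,1\nyes,3\nyes,3\n' | count_exclusive_nodes)
echo "usable nodes: $((online_nodes - exclusive_nodes))"
```

With this sample input the sketch prints `usable nodes: 3`. Shared jobs are
ignored, so only the exclusive job reduces the node count.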

The wrapper does a few other things:
 - automatically determines a "good" nodepara so that testing runs faster
 - throws `--share --nice` to `salloc` to share slurm resources
 - sets `CHPL_TEST_LIMIT_RUNNING_EXECUTABLES=yes`, `QT_AFFINITY=no`, and
   `QT_SPINCOUNT=300` to limit timeouts
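
As a rough illustration of how the settings listed above end up on the command
line, the sketch below assembles the command and prints it instead of running
it. `build_paratest_cmd` is a hypothetical name, the node count and nodepara
are passed in by hand, and a bash array is used here only to keep quoting
simple. The wrapper in this commit builds a plain string instead.

```shell
#!/usr/bin/env bash

# Hypothetical dry-run helper: print the salloc + paratest.server command
# for a given node count and nodepara. Nothing is executed.
build_paratest_cmd() {
  local num_nodes=$1 nodepara=$2
  shift 2
  local cmd=(salloc "--nodes=${num_nodes}" --partition=chapel --share --nice)
  cmd+=(paratest.server)
  local setting
  for setting in CHPL_TEST_LIMIT_RUNNING_EXECUTABLES=yes \
                 QT_AFFINITY=no \
                 QT_SPINCOUNT=300; do
    cmd+=(-env "${setting}")
  done
  # Remaining arguments are forwarded to paratest.server unchanged.
  cmd+=(-nodepara "${nodepara}" "$@")
  echo "${cmd[*]}"
}

build_paratest_cmd 3 1
```

Any extra arguments given after the first two are appended after `-nodepara`,
which matches how the wrapper forwards its own arguments to paratest.server.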
ronawho authored Feb 8, 2017
2 parents 468f6e1 + f5b4c27 commit 56a5d64
Showing 1 changed file with 81 additions and 0 deletions.
81 changes: 81 additions & 0 deletions util/test/paratest.chapcs
@@ -0,0 +1,81 @@
#!/usr/bin/env bash

#
# Simple slurm-aware paratest wrapper for testing on the chapcs cluster. This
# script aims to run parallel testing as quickly as possible without
# interfering with nightly testing or other exclusive reservations (e.g.
# developers running performance experiments.) It also tries to interfere as
# little as possible with other paratests in order to avoid timeouts.
#

# Get the number of nodes reserved exclusively on the chapel partition
get_num_exclusive_nodes() {
  local delim=","
  # Grab the "SHARED,NODES" info for all jobs. For each exclusive (SHARED=no)
  # job add up the number of nodes it's using
  local squeue_output=$(squeue --partition=chapel --noheader --format="%h${delim}%D")
  local num_exclusive_nodes=0
  for job_info in ${squeue_output}; do
    IFS=${delim} read -r -a split_job_info <<< "${job_info}"
    local shared=${split_job_info[0]}
    local num_nodes=${split_job_info[1]}
    if [ "${shared}" == "no" ]; then
      num_exclusive_nodes=$((${num_exclusive_nodes} + ${num_nodes}))
    fi
  done
  echo ${num_exclusive_nodes}
}

# Get the number of nodes available for testing on the chapel partition (total - exclusive)
get_num_non_exclusive_nodes() {
  local num_online_nodes=$(sinfo --partition=chapel --noheader --responding --format="%D")
  local num_exclusive_nodes=$(get_num_exclusive_nodes)
  local num_non_exclusive_nodes=$((${num_online_nodes} - ${num_exclusive_nodes}))
  echo ${num_non_exclusive_nodes}
}

# Get the number of shared jobs running on the chapel partition
get_num_shared_jobs_running() {
  local num_jobs=$(squeue --partition=chapel --noheader --format="%h" | grep "yes" | wc -l)
  echo ${num_jobs}
}

# Get a "good" nodepara value: use up to 3 for comm=none testing, but limit to
# 1 for comm!=none since that's already oversubscribed.
#
# TODO: Consider increasing nodepara if no other shared jobs are running. Note
# that we should wait to do this until everybody is using slurm on chapcs.
get_good_nodepara() {
  local nodepara=3
  if [[ -n ${CHPL_COMM} && "${CHPL_COMM}" != "none" ]]; then
    nodepara=1
  fi
  #if [[ $(get_num_shared_jobs_running) -eq 0 ]]; then
  #  nodepara=$((${nodepara} + 1))
  #fi
  echo ${nodepara}
}

# Run paratest inside an salloc using all nodes that are not reserved
# exclusively on the chapel partition. Throw `--share --nice` and turn off
# affinity and limit how many executables can run at once so we play nice with
# other testing going on.
run_paratest() {
  local nodepara=$(get_good_nodepara)
  local num_free_nodes=$(get_num_non_exclusive_nodes)
  local para_env="-env CHPL_TEST_LIMIT_RUNNING_EXECUTABLES=yes"
  para_env="${para_env} -env QT_AFFINITY=no"
  para_env="${para_env} -env QT_SPINCOUNT=300"

  local salloc_cmd="salloc --nodes=${num_free_nodes} --immediate --partition=chapel --share --nice"
  local paratest_cmd="${CHPL_HOME}/util/test/paratest.server ${para_env} -nodepara ${nodepara} ${@}"
  local test_cmd="${salloc_cmd} ${paratest_cmd}"
  echo "running: '${test_cmd}'"

  local start_time=${SECONDS}
  ${test_cmd}
  local duration=$((${SECONDS} - ${start_time}))
  echo "paratest took $((${duration} / 60)) minutes and $((${duration} % 60)) seconds"
}

run_paratest "${@}"
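
The nodepara rule from the script's comments (3 for comm=none, 1 otherwise) can
be tried in isolation. This is a small sketch under one assumption:
`pick_nodepara` is a hypothetical stand-in that takes the comm setting as an
argument rather than reading `CHPL_COMM` from the environment.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for get_good_nodepara: the comm layer is passed as
# $1, and an empty or missing value is treated as "none".
pick_nodepara() {
  local comm=${1:-none}
  if [[ "${comm}" != "none" ]]; then
    echo 1   # comm!=none testing is already oversubscribed
  else
    echo 3   # comm=none can run up to 3 tests per node
  fi
}

for comm in none gasnet ""; do
  echo "comm='${comm}' -> nodepara $(pick_nodepara "${comm}")"
done
```

An unset or empty comm value falls through to the comm=none case, which is the
same outcome as the `-n ${CHPL_COMM}` check in the script.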
