Dev add bert automated test #203
Open: luqiang-guo wants to merge 26 commits into test_oneflow_release from dev_add_bert_automated_test
Changes from all commits (26 commits):
- 05ddf09 (ouyangyu): bert pretraining adam squad sh
- de98fac (ouyangyu): args num_accumulation_steps
- 89b4e05 (ouyangyu): fix args
- ed66b48 (ouyangyu): fix batch_size
- 0e4d46b (ouyangyu): bert lamb
- 52adf9f (ouyangyu): lamb weight decay
- 647dbd6 (luqiang-guo): run_pretraining.py Add some parameters.
- a4299c0 (luqiang-guo): Revert "run_pretraining.py Add some parameters."
- 1ec5a65 (luqiang-guo): run_pretraining.py Add some parameters.
- 728dd68 (luqiang-guo): Organize files and move them to the tools file directory
- 7f98272 (luqiang-guo): Modify path
- e0093dd (luqiang-guo): conda environment multi-machine automatic car market
- 494da78 (luqiang-guo): Add multi-machine bert automated test
- 389af5d (luqiang-guo): Add automated test script
- 8f09974 (luqiang-guo): Modify the automatic test script
- a4cc1eb (luqiang-guo): Add multi-machine docker script
- 8c32fa4 (luqiang-guo): Add close multi-machine docker script
- 1a00c7d (luqiang-guo): megre test_oneflow_release
- 3ac9581 (luqiang-guo): merge master
- b859111 (luqiang-guo): Merge branch 'test_oneflow_release' into dev_add_bert_automated_test
- 2d36b9a (luqiang-guo): Modify small details
- 638d576 (luqiang-guo): oneflow_auto_bert.sh can replace train_perbert_list.sh
- 4b217e9 (luqiang-guo): modified oneflow_auto_bert.sh ip
- 45d9ace (luqiang-guo): Fix merge errors
- bb3d0bf (luqiang-guo): Modify the bert automatic test script
- d81c54a (luqiang-guo): Fix merge errors
New file (+200 lines):
```bash
#!/bin/bash

BENCH_ROOT=$1
PYTHON_WHL=$2
CMP_OLD=$3

# Example values:
# PYTHON_WHL=oneflow-0.3.5+cu112.git.325160b-cp38-cp38-linux_x86_64.whl
# CMP_OLD=325160bcfb786b166b063e669aea345fadee2da7

BERT_OSSDIR=oss://oneflow-staging/branch/master/bert/
DOWN_FILE="wget https://oneflow-staging.oss-cn-beijing.aliyuncs.com/branch/master/bert/${CMP_OLD}/out.tar.gz"
# DOWN_FILE="ossutil64 cp ${BERT_OSSDIR}$CMP_OLD/out.tar.gz .; "
ENABLE_FP32=0
GPU_NUM_PER_NODE=8
BSZ=64

PORT=57520

PYTHON="python3.8"
DOCKER_USER=root

multi_machine()
{
    NUM_NODES=$1    # param 1: number of nodes
    RUN_CMD=$2      # param 2: command to run on each node
    OUTPUT_FILE=$3  # param 3: output directory name
    PYTHON=$4       # param 4: python interpreter
    IS_F32=$5       # param 5: fp32 flag passed to result analysis

    declare -a host_list=("10.11.0.2" "10.11.0.3" "10.11.0.4" "10.11.0.5")

    if [ $NUM_NODES -gt ${#host_list[@]} ]
    then
        echo "num_nodes should be less than or equal to the length of host_list."
        exit
    fi

    hosts=("${host_list[@]:0:${NUM_NODES}}")
    echo "Working on hosts: ${hosts[@]}"

    ips=${hosts[0]}
    for host in "${hosts[@]:1}"
    do
        ips+=",${host}"
    done

    # start training on the worker hosts
    for host in "${hosts[@]:1}"
    do
        echo "start training on ${host}"
        echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
            nohup $RUN_CMD 0 $PYTHON >/dev/null 2>&1 &"
        ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
            nohup $RUN_CMD 0 $PYTHON >/dev/null 2>&1 &"
    done

    # copy files to the master host and start work
    host=${hosts[0]}
    echo "start training on ${host}"
    echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
        $RUN_CMD 1 $PYTHON"
    ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
        $RUN_CMD 1 $PYTHON"

    # collect logs on every host
    for host in "${hosts[@]}"
    do
        echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
            mkdir -p out/${OUTPUT_FILE}; mv -f log out/${OUTPUT_FILE}/log_1"
        ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
            mkdir -p out/${OUTPUT_FILE}; mv -f log out/${OUTPUT_FILE}/log_1"
    done

    # result analysis on the master host
    host=${hosts[0]}
    echo "start result analysis on ${host}"
    echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
        $PYTHON tools/result_analysis.py $IS_F32 \
        --cmp1_file=./old/$OUTPUT_FILE/log_1/out.json \
        --cmp2_file=./out/$OUTPUT_FILE/log_1/out.json \
        --out=./pic/$OUTPUT_FILE.png"
    ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
        $PYTHON tools/result_analysis.py $IS_F32 \
        --cmp1_file=./old/$OUTPUT_FILE/log_1/out.json \
        --cmp2_file=./out/$OUTPUT_FILE/log_1/out.json \
        --out=./pic/$OUTPUT_FILE.png"

    echo "multi_machine done"
}

#######################################################################################
# 0 prepare the host list ips for training
#######################################################################################
ALL_NODES=4

declare -a host_list=("10.11.0.2" "10.11.0.3" "10.11.0.4" "10.11.0.5")
```

Review comment on the `host_list` declaration (translated): "Isn't this already defined above?"

```bash
if [ $ALL_NODES -gt ${#host_list[@]} ]
then
    echo "num_nodes should be less than or equal to the length of host_list."
    exit
fi

hosts=("${host_list[@]:0:${ALL_NODES}}")
echo "Working on hosts: ${hosts[@]}"

ips=${hosts[0]}
for host in "${hosts[@]:1}"
do
    ips+=",${host}"
done

#######################################################################################
# 1 prepare oneflow_temp folder on each host
#######################################################################################

for host in "${hosts[@]}"
do
    ssh -p $PORT $DOCKER_USER@$host "rm -rf ~/oneflow_temp; mkdir -p ~/oneflow_temp"
    scp -P $PORT -r $BENCH_ROOT $DOCKER_USER@$host:~/oneflow_temp/
    echo "test--->"
    # scp -P $PORT -r $PYTHON_WHL $DOCKER_USER@$host:~/oneflow_temp/
    # ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/; \
    #     $PYTHON -m pip install $PYTHON_WHL; "

    ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
        mkdir -p pic; rm -rf pic/*; mkdir -p out; rm -rf out/*"
done

# download the baseline output onto the master host for comparison
host=${hosts[0]}
ssh -p $PORT $DOCKER_USER@$host "cd ~; rm -rf ~/out; \
    ${DOWN_FILE}; \
    tar xvf out.tar.gz; \
    cp -rf ~/out ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT/old;"

#######################################################################################
# 2 run single
#######################################################################################
```

Review comment on this section (translated): "This could probably also be handled by passing parameters."

```bash
if [ "$ENABLE_FP32" = 1 ]; then
    float_types=(0 1)
else
    float_types=(0)
fi
num_nodes=(1 4)
optimizers=(adam lamb)
accumulations=(1 2)
FLOAT_STR=(f16 f32)
NUM_NODE_STR=(null single multi multi multi)

for ftype in ${float_types[@]}
do
    for num_node in ${num_nodes[@]}
    do
        for optimizer in ${optimizers[@]}
        do
            for accumulation in ${accumulations[@]}
            do
                name=${NUM_NODE_STR[$num_node]}_bert_${FLOAT_STR[$ftype]}_pretraining_
                multi_machine ${num_node} "sh train_perbert.sh 1 1 ${BSZ} 1 ${optimizer} ${GPU_NUM_PER_NODE} $num_node " \
                    "${name}${GPU_NUM_PER_NODE}gpu_${BSZ}bs_accumulation-${accumulation}_${optimizer}_debug" \
                    $PYTHON "--f32=${ftype}"
            done # end accumulations
        done # end optimizers
    done # end num_nodes
done # end float_types

host=${hosts[0]}
echo "start tar on ${host}"

ssh $USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
    tar -zcvf out.tar.gz out; \
    $PYTHON tools/stitching_pic.py --dir=pic --out_file=./pic/all.png"

echo "multi_machine done"
```
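The four nested loops above enumerate a test matrix of float type, node count, optimizer, and accumulation steps. A minimal Python sketch of the same enumeration, using `itertools.product`; the names `FLOAT_STR`, `NUM_NODE_STR`, and the run-name template mirror the shell script, and nothing here invokes the actual training entry point:

```python
# Sketch: build the same run names as the nested shell loops above.
from itertools import product

FLOAT_STR = {0: "f16", 1: "f32"}
NUM_NODE_STR = {1: "single", 4: "multi"}

def test_matrix(enable_fp32=False, gpu_num_per_node=8, bsz=64):
    """Return the list of run names the script would generate."""
    float_types = [0, 1] if enable_fp32 else [0]
    names = []
    for ftype, num_node, opt, acc in product(
            float_types, [1, 4], ["adam", "lamb"], [1, 2]):
        names.append(
            f"{NUM_NODE_STR[num_node]}_bert_{FLOAT_STR[ftype]}_pretraining_"
            f"{gpu_num_per_node}gpu_{bsz}bs_accumulation-{acc}_{opt}_debug")
    return names
```

With `ENABLE_FP32` off this yields 1 × 2 × 2 × 2 = 8 configurations, matching the shell loops; enabling fp32 doubles it to 16.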
New file (+50 lines):
```bash
#!/bin/bash
NUM_NODES=${1-4}

#######################################################################################
# 0 prepare the host list ips for training
#######################################################################################
declare -a host_list=("10.11.0.2" "10.11.0.3" "10.11.0.4" "10.11.0.5")

if [ $NUM_NODES -gt ${#host_list[@]} ]
then
    echo "num_nodes should be less than or equal to the length of host_list."
    exit
fi

hosts=("${host_list[@]:0:${NUM_NODES}}")
echo "Working on hosts: ${hosts[@]}"

ips=${hosts[0]}
for host in "${hosts[@]:1}"
do
    ips+=",${host}"
done

#######################################################################################
# 1 prepare docker image
#######################################################################################
WORK_PATH=$(pwd)

wget https://oneflow-staging.oss-cn-beijing.aliyuncs.com/branch/master/bert/docker_image/oneflow_autobert.tar

for host in "${hosts[@]}"
do
    ssh $USER@$host "mkdir -p ~/oneflow_docker_temp; rm -rf ~/oneflow_docker_temp/*"
    scp -r oneflow_autobert.tar $USER@$host:~/oneflow_docker_temp
    ssh $USER@$host "docker load --input ~/oneflow_docker_temp/oneflow_autobert.tar"

    echo "test--->"
    ssh $USER@$host " \
        docker run --runtime=nvidia --rm -i -d --privileged --shm-size=16g \
            --ulimit memlock=-1 --net=host \
            --name oneflow-auto-test \
            --cap-add=IPC_LOCK --device=/dev/infiniband \
            -v /data/bert/:/data/bert/ \
            -v /datasets/bert/:/datasets/bert/ \
            -v /datasets/ImageNet/OneFlow/:/datasets/ImageNet/OneFlow/ \
            -v /data/imagenet/ofrecord:/data/imagenet/ofrecord \
            -v ${WORK_PATH}:/workspace/oneflow-test \
            -w /workspace/oneflow-test \
            oneflow:cu11.2-ubuntu18.04 bash -c \"/usr/sbin/sshd -p 57520 && bash\""
done
```
New file (+74 lines):
```bash
#!/bin/bash
BENCH_ROOT_DIR=/path/to/

# pretrained model dir
PRETRAINED_MODEL=/DATA/disk1/of_output/uncased_L-12_H-768_A-12_oneflow

# squad ofrecord dataset dir
DATA_ROOT=/DATA/disk1/of_output/bert/of_squad

# `vocab.txt` dir
REF_ROOT_DIR=/DATA/disk1/of_output/uncased_L-12_H-768_A-12

# `evaluate-v*.py` and `dev-v*.json` dir
SQUAD_TOOL_DIR=/DATA/disk1/of_output/bert/of_squad

db_version=${1:-"v2.0"}
if [ $db_version = "v1.1" ]; then
    train_example_num=88614
    eval_example_num=10833
    version_2_with_negative="False"
elif [ $db_version = "v2.0" ]; then
    train_example_num=131944
    eval_example_num=12232
    version_2_with_negative="True"
else
    echo "db_version must be 'v1.1' or 'v2.0'"
    exit
fi

train_data_dir=$DATA_ROOT/train-$db_version
eval_data_dir=$DATA_ROOT/dev-$db_version
LOGFILE=./bert_fp_training.log
export PYTHONUNBUFFERED=1
export ONEFLOW_DEBUG_MODE=True
export CUDA_VISIBLE_DEVICES=7

# finetune and eval SQuAD;
# `predictions.json` will be saved to folder `./squad_output`
python3 $BENCH_ROOT_DIR/run_squad.py \
    --model=SQuAD \
    --do_train=True \
    --do_eval=True \
    --gpu_num_per_node=1 \
    --learning_rate=3e-5 \
    --batch_size_per_device=16 \
    --eval_batch_size_per_device=16 \
    --num_epoch=3 \
    --use_fp16 \
    --version_2_with_negative=$version_2_with_negative \
    --loss_print_every_n_iter=20 \
    --do_lower_case=True \
    --seq_length=384 \
    --num_hidden_layers=12 \
    --num_attention_heads=12 \
    --max_position_embeddings=512 \
    --type_vocab_size=2 \
    --vocab_size=30522 \
    --attention_probs_dropout_prob=0.1 \
    --hidden_dropout_prob=0.1 \
    --hidden_size_per_head=64 \
    --train_data_dir=$train_data_dir \
    --train_example_num=$train_example_num \
    --eval_data_dir=$eval_data_dir \
    --eval_example_num=$eval_example_num \
    --log_dir=./log \
    --model_load_dir=${PRETRAINED_MODEL} \
    --save_last_snapshot=True \
    --model_save_dir=./squad_snapshots \
    --vocab_file=$REF_ROOT_DIR/vocab.txt \
    --predict_file=$SQUAD_TOOL_DIR/dev-${db_version}.json \
    --output_dir=./squad_output 2>&1 | tee ${LOGFILE}

# evaluate predictions.json to get metrics
python3 $SQUAD_TOOL_DIR/evaluate-${db_version}.py \
    $SQUAD_TOOL_DIR/dev-${db_version}.json \
    ./squad_output/predictions.json
```
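The `evaluate-v*.py` step prints a JSON dictionary of metrics. For an automated test it can be useful to compare those metrics against a stored baseline; a minimal sketch, assuming the metrics JSON contains numeric scores keyed by name (the key names, e.g. `exact_match` vs `exact`, differ between the v1.1 and v2.0 evaluate scripts, so adjust to the version actually used):

```python
# Sketch of a regression check on the metrics JSON printed by evaluate-v*.py.
# The key names and tolerance are assumptions, not part of this PR.
import json

def check_regression(metrics_json, baseline, tolerance=0.5):
    """Return the names of metrics that dropped more than `tolerance`
    points below the baseline; an empty list means no regression."""
    metrics = json.loads(metrics_json)
    failed = []
    for key, base_val in baseline.items():
        if metrics.get(key, 0.0) < base_val - tolerance:
            failed.append(key)
    return failed
```

A run could then fail the automated test whenever `check_regression(...)` returns a non-empty list.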
Review comment (translated): "Maybe handle this with subprocess or async."
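The reviewer's suggestion refers to the scripts' pattern of backgrounding `ssh` calls with `nohup ... &`, which loses exit codes and output. A minimal sketch of the subprocess alternative, running the per-host commands concurrently with a thread pool; the ssh/host details in the commented usage line are placeholders matching the scripts above, and the pattern is the point:

```python
# Sketch: run one command per host concurrently via subprocess, capturing
# exit codes and output instead of fire-and-forget `nohup ... &` over ssh.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_hosts(hosts, make_cmd):
    """Run make_cmd(host) on every host concurrently.
    Returns {host: (returncode, stdout)}."""
    def run(host):
        proc = subprocess.run(make_cmd(host), shell=True,
                              capture_output=True, text=True)
        return host, (proc.returncode, proc.stdout.strip())
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(pool.map(run, hosts))

# In the scripts above, real usage would look roughly like:
# results = run_on_hosts(hosts,
#     lambda h: f"ssh -p 57520 root@{h} 'cd ~/oneflow_temp/...; {RUN_CMD}'")
```

Checking `returncode` per host would let the driver abort the test matrix early instead of silently discarding failures to `/dev/null`.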