Dev add bert automated test #203

Status: Open. Wants to merge 26 commits into base: test_oneflow_release from dev_add_bert_automated_test.

Changes from all commits (26)
05ddf09 - bert pretraining adam squad sh (ouyangyu, Jan 12, 2021)
de98fac - args num_accumulation_steps (ouyangyu, Jan 14, 2021)
89b4e05 - fix args (ouyangyu, Jan 14, 2021)
ed66b48 - fix batch_size (ouyangyu, Jan 19, 2021)
0e4d46b - bert lamb (ouyangyu, Jan 20, 2021)
52adf9f - lamb weight decay (ouyangyu, Jan 20, 2021)
647dbd6 - run_pretraining.py Add some parameters. (luqiang-guo, May 19, 2021)
a4299c0 - Revert "run_pretraining.py Add some parameters." (luqiang-guo, May 19, 2021)
1ec5a65 - run_pretraining.py Add some parameters. (luqiang-guo, May 19, 2021)
728dd68 - Organize files and move them to the tools file directory (luqiang-guo, May 20, 2021)
7f98272 - Modify path (luqiang-guo, May 20, 2021)
e0093dd - conda environment multi-machine automatic car market (luqiang-guo, May 31, 2021)
494da78 - Add multi-machine bert automated test (luqiang-guo, Jun 21, 2021)
389af5d - Add automated test script (luqiang-guo, Jun 21, 2021)
8f09974 - Modify the automatic test script (luqiang-guo, Jun 22, 2021)
a4cc1eb - Add multi-machine docker script (luqiang-guo, Jun 22, 2021)
8c32fa4 - Add close multi-machine docker script (luqiang-guo, Jun 22, 2021)
1a00c7d - megre test_oneflow_release (luqiang-guo, Jun 22, 2021)
3ac9581 - merge master (luqiang-guo, Jun 22, 2021)
b859111 - Merge branch 'test_oneflow_release' into dev_add_bert_automated_test (luqiang-guo, Jun 22, 2021)
2d36b9a - Modify small details (luqiang-guo, Jun 22, 2021)
638d576 - oneflow_auto_bert.sh can replace train_perbert_list.sh (luqiang-guo, Jun 22, 2021)
4b217e9 - modified oneflow_auto_bert.sh ip (luqiang-guo, Jun 22, 2021)
45d9ace - Fix merge errors (luqiang-guo, Jun 22, 2021)
bb3d0bf - Modify the bert automatic test script (luqiang-guo, Jun 23, 2021)
d81c54a - Fix merge errors (luqiang-guo, Jun 23, 2021)
3 changes: 2 additions & 1 deletion LanguageModeling/BERT/config.py
@@ -58,7 +58,8 @@ def get_parser(parser=None):
help='use fp16 or not')
parser.add_argument('--use_xla', type=str2bool, nargs='?', const=True,
help='Whether to use xla')

parser.add_argument("--num_accumulation_steps", type=int, default=1,
help='Number of accumulation steps before gradient update; global batch size = num_accumulation_steps * train_batch_size')
parser.add_argument("--optimizer_type", type=str, default="adam",
help="Optimizer used for training - LAMB or ADAM")

200 changes: 200 additions & 0 deletions LanguageModeling/BERT/oneflow_auto_bert.sh
@@ -0,0 +1,200 @@
#!/bin/bash

BENCH_ROOT=$1
PYTHON_WHL=$2
CMP_OLD=$3

# PYTHON_WHL=oneflow-0.3.5+cu112.git.325160b-cp38-cp38-linux_x86_64.whl
# CMP_OLD=325160bcfb786b166b063e669aea345fadee2da7
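# Hypothetical example invocation (paths and values are illustrative, inferred from how
# the three positional arguments are used below):
#   bash oneflow_auto_bert.sh /path/to/OneFlow-Benchmark \
#       oneflow-0.3.5+cu112.git.325160b-cp38-cp38-linux_x86_64.whl \
#       325160bcfb786b166b063e669aea345fadee2da7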

BERT_OSSDIR=oss://oneflow-staging/branch/master/bert/
DOWN_FILE="wget https://oneflow-staging.oss-cn-beijing.aliyuncs.com/branch/master/bert/${CMP_OLD}/out.tar.gz"
# DOWN_FILE="ossutil64 cp ${BERT_OSSDIR}$CMP_OLD/out.tar.gz .; "
ENABLE_FP32=0
GPU_NUM_PER_NODE=8
BSZ=64

PORT=57520

PYTHON="python3.8"
DOCKER_USER=root

multi_machine()
Review comment (Contributor): Please handle this per-host fan-out with subprocess or async. (A minimal Python sketch of this suggestion appears after this file's diff.)

{
# param 1 node
NUM_NODES=$1

# param 2 run cmd
RUN_CMD=$2

# param 3 output file
OUTPUT_FILE=$3

# param 4 python
PYTHON=$4

# param 5
IS_F32=$5

declare -a host_list=("10.11.0.2" "10.11.0.3" "10.11.0.4" "10.11.0.5")

if [ $NUM_NODES -gt ${#host_list[@]} ]
then
echo num_nodes should be less than or equal to length of host_list.
exit
fi

hosts=("${host_list[@]:0:${NUM_NODES}}")
echo "Working on hosts:${hosts[@]}"

ips=${hosts[0]}
for host in "${hosts[@]:1}"
do
ips+=",${host}"
done

for host in "${hosts[@]:1}"
do
echo "start training on ${host}"

echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
nohup $RUN_CMD 0 $PYTHON >/dev/null 2>&1 &"

ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
nohup $RUN_CMD 0 $PYTHON >/dev/null 2>&1 &"

done

# copy files to master host and start work
host=${hosts[0]}
echo "start training on ${host}"

echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
$RUN_CMD 1 $PYTHON "
ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
$RUN_CMD 1 $PYTHON "


for host in "${hosts[@]}"
do
echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
mkdir -p out/${OUTPUT_FILE}; mv -f log out/${OUTPUT_FILE}/log_1 "
ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
mkdir -p out/${OUTPUT_FILE}; mv -f log out/${OUTPUT_FILE}/log_1 "
done

# Result analysis

host=${hosts[0]}
echo "start training on ${host}"

echo ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
$PYTHON tools/result_analysis.py $IS_F32 \
--cmp1_file=./old/$OUTPUT_FILE/log_1/out.json \
--cmp2_file=./out/$OUTPUT_FILE/log_1/out.json \
--out=./pic/$OUTPUT_FILE.png "


ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
$PYTHON tools/result_analysis.py $IS_F32 \
--cmp1_file=./old/$OUTPUT_FILE/log_1/out.json \
--cmp2_file=./out/$OUTPUT_FILE/log_1/out.json \
--out=./pic/$OUTPUT_FILE.png "

echo "multi_machine done"

}


#######################################################################################
# 0 prepare the host list ips for training
########################################################################################
ALL_NODES=4

declare -a host_list=("10.11.0.2" "10.11.0.3" "10.11.0.4" "10.11.0.5")
Review comment (Contributor): Isn't this already defined above?


if [ $ALL_NODES -gt ${#host_list[@]} ]
then
echo num_nodes should be less than or equal to length of host_list.
exit
fi

hosts=("${host_list[@]:0:${ALL_NODES}}")
echo "Working on hosts:${hosts[@]}"

ips=${hosts[0]}
for host in "${hosts[@]:1}"
do
ips+=",${host}"
done

# #######################################################################################
# # 1 prepare oneflow_temp folder on each host
# ########################################################################################

for host in "${hosts[@]}"
do
ssh -p $PORT $DOCKER_USER@$host " rm -rf ~/oneflow_temp ; mkdir -p ~/oneflow_temp"
scp -P $PORT -r $BENCH_ROOT $DOCKER_USER@$host:~/oneflow_temp/
echo "test--->"
# scp -P $PORT -r $PYTHON_WHL $DOCKER_USER@$host:~/oneflow_temp/
# ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/; \
# $PYTHON -m pip install $PYTHON_WHL; "

ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
mkdir -p pic; rm -rf pic/*; mkdir -p out; rm -rf out/* "


done

#_______________________________________________________________________________________________
host=${hosts[0]}
ssh -p $PORT $DOCKER_USER@$host "cd ~; rm -rf ~/out; \
${DOWN_FILE}; \
tar xvf out.tar.gz; \
cp -rf ~/out ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT/old;"


#######################################################################################
# 2 run single
########################################################################################
Review comment (Contributor): This should also be solvable by passing it in as a parameter. (A hedged sketch of that option follows the if-block below.)


if [ "$ENABLE_FP32" = 1 ];then
float_types=(0 1)
else
float_types=(0 )
fi
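# Hypothetical alternative hinted at by the review comment above: take the fp32 switch from
# the command line instead of the hard-coded ENABLE_FP32 at the top of this script, e.g.
#   ENABLE_FP32=${4:-0}   # optional 4th argument, defaults to the fp16-only sweep
# so a full fp16+fp32 run could be requested with:
#   bash oneflow_auto_bert.sh <bench_root> <whl> <old_commit> 1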
num_nodes=(1 4)
optimizers=(adam lamb)
accumulations=(1 2)
FLOAT_STR=(f16 f32)
NUM_NODE_STR=(null single multi multi multi)
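# FLOAT_STR maps ftype 0/1 to the f16/f32 name fragments; NUM_NODE_STR is indexed by the
# node count (1 -> single, 4 -> multi), so index 0 is a placeholder and never used.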



for ftype in ${float_types[@]}
do
for num_node in ${num_nodes[@]}
do
for optimizer in ${optimizers[@]}
do
for accumulation in ${accumulations[@]}
do
name=${NUM_NODE_STR[$num_node]}_bert_${FLOAT_STR[$ftype]}_pretraining_
multi_machine ${num_node} "sh train_perbert.sh 1 1 ${BSZ} 1 ${optimizer} ${GPU_NUM_PER_NODE} $num_node " \
"${name}${GPU_NUM_PER_NODE}gpu_${BSZ}bs_accumulation-${accumulation}_${optimizer}_debug" \
$PYTHON "--f32=${ftype}"
done #end accumulations
done #end optimizer
done #end num_node
done #float_types


host=${hosts[0]}
echo "start tar on ${host}"

ssh -p $PORT $DOCKER_USER@$host "cd ~/oneflow_temp/OneFlow-Benchmark/LanguageModeling/BERT; \
tar -zcvf out.tar.gz out; \
$PYTHON tools/stitching_pic.py --dir=pic --out_file=./pic/all.png "

echo "multi_machine done"
50 changes: 50 additions & 0 deletions LanguageModeling/BERT/prepare_auto_docker.sh
@@ -0,0 +1,50 @@
NUM_NODES=${1-4}
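# Example invocation (assumption: run from the directory that should be mounted as
# /workspace/oneflow-test in the docker run below): bash prepare_auto_docker.sh 4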


#######################################################################################
# 0 prepare the host list ips for training
########################################################################################
declare -a host_list=("10.11.0.2" "10.11.0.3" "10.11.0.4" "10.11.0.5")

if [ $NUM_NODES -gt ${#host_list[@]} ]
then
echo num_nodes should be less than or equal to length of host_list.
exit
fi

hosts=("${host_list[@]:0:${NUM_NODES}}")
echo "Working on hosts:${hosts[@]}"

ips=${hosts[0]}
for host in "${hosts[@]:1}"
do
ips+=",${host}"
done

#######################################################################################
# 1 prepare docker image
########################################################################################
WORK_PATH=`pwd`

wget https://oneflow-staging.oss-cn-beijing.aliyuncs.com/branch/master/bert/docker_image/oneflow_autobert.tar

for host in "${hosts[@]}"
do
ssh $USER@$host "mkdir -p ~/oneflow_docker_temp; rm -rf ~/oneflow_docker_temp/*"
scp -r oneflow_autobert.tar $USER@$host:~/oneflow_docker_temp
ssh $USER@$host " docker load --input ~/oneflow_docker_temp/oneflow_autobert.tar; "

echo "tesst--->"
ssh $USER@$host " \
docker run --runtime=nvidia --rm -i -d --privileged --shm-size=16g \
--ulimit memlock=-1 --net=host \
--name oneflow-auto-test \
--cap-add=IPC_LOCK --device=/dev/infiniband \
-v /data/bert/:/data/bert/ \
-v /datasets/bert/:/datasets/bert/ \
-v /datasets/ImageNet/OneFlow/:/datasets/ImageNet/OneFlow/ \
-v /data/imagenet/ofrecord:/data/imagenet/ofrecord \
-v ${WORK_PATH}:/workspace/oneflow-test \
-w /workspace/oneflow-test \
oneflow:cu11.2-ubuntu18.04 bash -c \"/usr/sbin/sshd -p 57520 && bash\" "
done
25 changes: 22 additions & 3 deletions LanguageModeling/BERT/run_pretraining.py
@@ -29,18 +29,33 @@
parser.add_argument("--data_part_num", type=int, default=32, help="data part number in dataset")
parser.add_argument("--iter_num", type=int, default=1144000, help="total iterations to run")
parser.add_argument("--batch_size_per_device", type=int, default=64)
parser.add_argument("--debug", type=int, default=0)
parser.add_argument("--data_load_random", type=int, default=1)

args = parser.parse_args()
configs.print_args(args)


if args.debug == 1:
flow.config.enable_debug_mode(True)
print('Enable Debug !!!!!!!')

if args.data_load_random == 1:
random_tmp=True
print('Enable random loading of data !!!!!!!')
else:
random_tmp=False
print('Disable random loading of data !!!!!!!')

batch_size = args.num_nodes * args.gpu_num_per_node * args.batch_size_per_device


def BertDecoder(data_dir, batch_size, data_part_num, seq_length, max_predictions_per_seq):
ofrecord = flow.data.ofrecord_reader(data_dir,
batch_size=batch_size,
data_part_num=data_part_num,
random_shuffle = False,
shuffle_after_epoch=False)
random_shuffle = random_tmp,
shuffle_after_epoch=random_tmp)
blob_confs = {}
def _blob_conf(name, shape, dtype=flow.int32):
blob_confs[name] = flow.data.OFRecordRawDecoder(ofrecord, name, shape=shape, dtype=dtype)
@@ -101,12 +116,16 @@ def main():
flow.env.log_dir(args.log_dir)
flow.config.enable_debug_mode(True)

flow.config.enable_legacy_model_io()
flow.config.enable_model_io_v2(True)

InitNodes(args)

snapshot = Snapshot(args.model_save_dir, args.model_load_dir)

print('num_accumulation_steps:', args.num_accumulation_steps)
metric = Metric(desc='train', print_steps=args.loss_print_every_n_iter,
batch_size=batch_size, keys=['total_loss', 'mlm_loss', 'nsp_loss'])
batch_size=batch_size * args.num_accumulation_steps, keys=['total_loss', 'mlm_loss', 'nsp_loss'])
for step in range(args.iter_num):
PretrainJob().async_get(metric.metric_cb(step))
#PretrainJob().async_get(metric.metric_cb(step, epoch=3))
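
For reference, the new --num_accumulation_steps flag scales the effective global batch size, which is why the Metric above is now built with batch_size * args.num_accumulation_steps. A small worked example in Python, using values that oneflow_auto_bert.sh sweeps over (illustrative numbers only):

# Effective batch sizes with gradient accumulation (illustrative numbers).
num_nodes = 4
gpu_num_per_node = 8
batch_size_per_device = 64   # BSZ in oneflow_auto_bert.sh
num_accumulation_steps = 2

batch_size = num_nodes * gpu_num_per_node * batch_size_per_device   # 2048 samples per step
global_batch_size = batch_size * num_accumulation_steps             # 4096 samples per parameter update
print(batch_size, global_batch_size)
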
74 changes: 74 additions & 0 deletions LanguageModeling/BERT/run_squad.sh
@@ -0,0 +1,74 @@
BENCH_ROOT_DIR=/path/to/
# pretrained model dir
PRETRAINED_MODEL=/DATA/disk1/of_output/uncased_L-12_H-768_A-12_oneflow

# squad ofrecord dataset dir
DATA_ROOT=/DATA/disk1/of_output/bert/of_squad

# `vocab.txt` dir
REF_ROOT_DIR=/DATA/disk1/of_output/uncased_L-12_H-768_A-12

# `evaluate-v*.py` and `dev-v*.json` dir
SQUAD_TOOL_DIR=/DATA/disk1/of_output/bert/of_squad
db_version=${1:-"v2.0"}
if [ $db_version = "v1.1" ]; then
train_example_num=88614
eval_example_num=10833
version_2_with_negative="False"
elif [ $db_version = "v2.0" ]; then
train_example_num=131944
eval_example_num=12232
version_2_with_negative="True"
else
echo "db_version must be 'v1.1' or 'v2.0'"
exit
fi
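# Example invocation (assuming the directories configured above exist):
#   bash run_squad.sh v2.0    # or: bash run_squad.sh v1.1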

train_data_dir=$DATA_ROOT/train-$db_version
eval_data_dir=$DATA_ROOT/dev-$db_version
LOGFILE=./bert_fp_training.log
export PYTHONUNBUFFERED=1
export ONEFLOW_DEBUG_MODE=True
export CUDA_VISIBLE_DEVICES=7
# finetune and eval SQuAD,
# `predictions.json` will be saved to folder `./squad_output`
python3 $BENCH_ROOT_DIR/run_squad.py \
--model=SQuAD \
--do_train=True \
--do_eval=True \
--gpu_num_per_node=1 \
--learning_rate=3e-5 \
--batch_size_per_device=16 \
--eval_batch_size_per_device=16 \
--num_epoch=3 \
--use_fp16 \
--version_2_with_negative=$version_2_with_negative \
--loss_print_every_n_iter=20 \
--do_lower_case=True \
--seq_length=384 \
--num_hidden_layers=12 \
--num_attention_heads=12 \
--max_position_embeddings=512 \
--type_vocab_size=2 \
--vocab_size=30522 \
--attention_probs_dropout_prob=0.1 \
--hidden_dropout_prob=0.1 \
--hidden_size_per_head=64 \
--train_data_dir=$train_data_dir \
--train_example_num=$train_example_num \
--eval_data_dir=$eval_data_dir \
--eval_example_num=$eval_example_num \
--log_dir=./log \
--model_load_dir=${PRETRAINED_MODEL} \
--save_last_snapshot=True \
--model_save_dir=./squad_snapshots \
--vocab_file=$REF_ROOT_DIR/vocab.txt \
--predict_file=$SQUAD_TOOL_DIR/dev-${db_version}.json \
--output_dir=./squad_output 2>&1 | tee ${LOGFILE}


# evaluate predictions.json to get metrics
python3 $SQUAD_TOOL_DIR/evaluate-${db_version}.py \
$SQUAD_TOOL_DIR/dev-${db_version}.json \
./squad_output/predictions.json
