This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Problem with BERT pre-training: how to train on multiple GPUs #1508

Open
yangshuo0323 opened this issue Jan 28, 2021 · 13 comments
Labels
enhancement New feature or request

Comments

@yangshuo0323

Description

  • I want to train a BERT model on multiple GPUs, but I ran into problems. My configuration:
    • Software environment: Python 3.7.7, CUDA 10.2
    • MXNet installed with: pip install mxnet-cu102 (version 1.7.0)
    • Model scripts downloaded from https://github.com/dmlc/gluon-nlp, branch 2.0.
  • I ran the script gluon-nlp/scripts/bert/run_pretraining.py:
    • Following the instructions at https://nlp.gluon.ai/model_zoo/bert/index.html#bert-model-zoo
    • The dataset was also downloaded from the page above.
      $  mpirun -np 8 -H localhost:8 -mca pml ob1 -mca btl ^openib \
           -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket \
           --mca plm_rsh_agent 'ssh -q -o StrictHostKeyChecking=no' \
           -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=INFO -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
           -x MXNET_SAFE_ACCUMULATION=1 --tag-output \
      python run_pretraining.py --verbose --model="bert_12_768_12" --warmup_ratio=1 --comm_backend="horovod" \
      	--accumulate=1 --max_seq_length=128 --raw --max_predictions_per_seq=20 --log_interval=1 --ckpt_interval=1000 \
      	--no_compute_acc --data=/home/yangshuo/mxnet/Dataset/pre-train-datasets/enwiki-feb-doc-split/*.train \
      	--num_steps=1000 --total_batch_size=128 --dtype="float16"
    
  • Resulting error:

[image: screenshot of the error]

Request for help:

I have read the guide, but I still don't know how to run this.
Please help, or could you give me the correct instructions or a suggestion? Thanks.
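
For reference, the batch and warmup flags in the command above can be sanity-checked with a little arithmetic. The sketch below assumes that `--total_batch_size` is the global batch, split evenly across workers and accumulation steps, and that the number of warmup steps is `num_steps * warmup_ratio`. Both formulas are assumptions about how the script behaves, and the helper name `per_worker_settings` is hypothetical.

```python
def per_worker_settings(total_batch_size, num_workers, accumulate,
                        num_steps, warmup_ratio):
    """Derive per-worker batch size and warmup steps.

    The formulas are assumptions about the training script, not taken
    from its source.
    """
    denom = num_workers * accumulate
    if total_batch_size % denom != 0:
        raise ValueError(
            "total_batch_size must be divisible by num_workers * accumulate")
    return {
        "batch_size_per_worker": total_batch_size // denom,
        "warmup_steps": int(num_steps * warmup_ratio),
    }


# Values from the mpirun command in this issue: -np 8,
# --total_batch_size=128, --accumulate=1, --num_steps=1000, --warmup_ratio=1
settings = per_worker_settings(128, 8, 1, 1000, 1)
print(settings)
```

Under these assumptions each GPU would process 16 samples per step and the learning rate would warm up for the entire run, so `--warmup_ratio=1` may be worth double-checking.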

@yangshuo0323 yangshuo0323 added the enhancement New feature or request label Jan 28, 2021
@leezu
Contributor

leezu commented Jan 28, 2021

Please provide the complete error message

@yangshuo0323
Author

Please provide the complete error message

the whole message:

[1,5]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,4]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,7]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,6]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,2]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,1]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,0]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,3]<stderr>:[21:43:11] src/storage/storage.cc:[1,3]<stderr>:110: Using GPUPooledRoundedStorageManager.
[1,7]<stderr>:INFO:root:Model created
[1,7]<stderr>:DEBUG:root:Random seed set to 91
[1,7]<stderr>:INFO:root:Begin process dataset......
[1,7]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 7
[1,7]<stderr>:INFO:root:400 files are found.
[1,4]<stderr>:INFO:root:Model created
[1,4]<stderr>:DEBUG:root:Random seed set to 580
[1,4]<stderr>:INFO:root:Begin process dataset......
[1,4]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 4
[1,4]<stderr>:INFO:root:400 files are found.
[1,6]<stderr>:INFO:root:Model created
[1,6]<stderr>:DEBUG:root:Random seed set to 555
[1,6]<stderr>:INFO:root:Begin process dataset......
[1,6]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 6
[1,6]<stderr>:INFO:root:400 files are found.
[1,5]<stderr>:INFO:root:Model created
[1,5]<stderr>:DEBUG:root:Random seed set to 185
[1,5]<stderr>:INFO:root:Begin process dataset......
[1,5]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 5
[1,5]<stderr>:INFO:root:400 files are found.
[1,7]<stderr>:[node106:26504:0:26504] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,7]<stderr>:==== backtrace ====
[1,7]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f5c21681cec]
[1,7]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f5c21681f64]
[1,7]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f5e1fe55d44]
[1,7]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f5dc2100564]
[1,7]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f5dc2103790]
[1,7]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f5dc20fbed1]
[1,7]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f5dc20d69d4]
[1,7]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f5c3750818f]
[1,7]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f5c374ffd84]
[1,7]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f5e1ee829dd]
[1,7]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f5e1ee82067]
[1,7]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f5e200b327e]
[1,7]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f5e200b3cb4]
[1,7]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x55d260c8500b]
[1,7]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x55d260ce99a1]
[1,7]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x55d260c7d497]
[1,7]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x55d260ce5cba]
[1,7]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x55d260c7d497]
[1,7]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x55d260ce5cba]
[1,7]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x55d260c7d20b]
[1,7]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x55d260ce4be6]
[1,7]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x55d260c2e1d4]
[1,7]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x55d260c2e1fc]
[1,7]<stderr>:   26  python(+0x22bf44) [0x55d260d43f44]
[1,7]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x55d260d4e2b1]
[1,7]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x55d260d4e4a3]
[1,7]<stderr>:   29  python(+0x2375d5) [0x55d260d4f5d5]
[1,7]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x55d260d4f6fc]
[1,7]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f5e1faa2840]
[1,7]<stderr>:   32  python(+0x1dc3c0) [0x55d260cf43c0]
[1,7]<stderr>:===================
[1,4]<stderr>:[node106:26501:0:26501] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,4]<stderr>:==== backtrace ====
[1,4]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f5fb1eb6cec]
[1,4]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f5fb1eb6f64]
[1,4]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f61b05e1d44]
[1,4]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f615288c564]
[1,4]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f615288f790]
[1,4]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f6152887ed1]
[1,4]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f61528629d4]
[1,4]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f5fc7ca718f]
[1,4]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f5fc7c9ed84]
[1,4]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f61af60e9dd]
[1,4]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f61af60e067]
[1,4]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f61b083f27e]
[1,4]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f61b083fcb4]
[1,4]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x55e8922d700b]
[1,4]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x55e89233b9a1]
[1,4]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x55e8922cf497]
[1,4]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x55e892337cba]
[1,4]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x55e8922cf497]
[1,4]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x55e892337cba]
[1,4]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x55e8922cf20b]
[1,4]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x55e892336be6]
[1,4]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x55e8922801d4]
[1,4]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x55e8922801fc]
[1,4]<stderr>:   26  python(+0x22bf44) [0x55e892395f44]
[1,4]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x55e8923a02b1]
[1,4]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x55e8923a04a3]
[1,4]<stderr>:   29  python(+0x2375d5) [0x55e8923a15d5]
[1,4]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x55e8923a16fc]
[1,4]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f61b022e840]
[1,4]<stderr>:   32  python(+0x1dc3c0) [0x55e8923463c0]
[1,4]<stderr>:===================
[1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,5]<stderr>:==== backtrace ====
[1,6]<stderr>:[node106:26503:0:26503] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,6]<stderr>:==== backtrace ====
[1,5]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]
[1,5]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f40f065bf64]
[1,5]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f42ead77d44]
[1,5]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f428d022564]
[1,5]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f428d025790]
[1,5]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f428d01ded1]
[1,5]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f428cff89d4]
[1,5]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f410243a18f]
[1,5]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f4102431d84]
[1,5]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f42e9da49dd]
[1,5]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f42e9da4067]
[1,5]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f42eafd527e]
[1,5]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f42eafd5cb4]
[1,5]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x564d0453c00b]
[1,5]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x564d045a09a1]
[1,5]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x564d0453420b]
[1,5]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x564d0459bbe6]
[1,5]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x564d044e51d4]
[1,5]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x564d044e51fc]
[1,5]<stderr>:   26  python(+0x22bf44) [0x564d045faf44]
[1,5]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x564d046052b1]
[1,5]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x564d046054a3]
[1,5]<stderr>:   29  python(+0x2375d5) [0x564d046065d5]
[1,5]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x564d046066fc]
[1,5]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f42ea9c4840]
[1,5]<stderr>:   32  python(+0x1dc3c0) [0x564d045ab3c0]
[1,5]<stderr>:===================
[1,6]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]
[1,6]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f1a6c25bf64]
[1,6]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f1c66a2ad44]
[1,6]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f1c08cd5564]
[1,6]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f1c08cd8790]
[1,6]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f1c08cd0ed1]
[1,6]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f1c08cab9d4]
[1,6]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f1a7e0e118f]
[1,6]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f1a7e0d8d84]
[1,6]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1c65a579dd]
[1,6]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1c65a57067]
[1,6]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f1c66c8827e]
[1,6]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f1c66c88cb4]
[1,6]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x562df52e800b]
[1,6]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x562df534c9a1]
[1,6]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x562df52e020b]
[1,6]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x562df5347be6]
[1,6]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x562df52911d4]
[1,6]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x562df52911fc]
[1,6]<stderr>:   26  python(+0x22bf44) [0x562df53a6f44]
[1,6]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x562df53b12b1]
[1,6]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x562df53b14a3]
[1,6]<stderr>:   29  python(+0x2375d5) [0x562df53b25d5]
[1,6]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x562df53b26fc]
[1,6]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f1c66677840]
[1,6]<stderr>:   32  python(+0x1dc3c0) [0x562df53573c0]
[1,6]<stderr>:===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node node106 exited on signal 11 (Segmentation fault).

@yangshuo0323
Author

yangshuo0323 commented Jan 28, 2021

First, I want to make sure: is my approach correct for pre-training a BERT model on multiple GPUs?
@leezu

@leezu
Contributor

leezu commented Jan 28, 2021

Software environment: Python: 3.7.7, Cuda: 10.2
Install MXNet: pip install mxnet-cu102 , version is 1.7.0
Download Model script: https://github.com/dmlc/gluon-nlp, which branch is 2.0.

Do you mean that you are using the gluon-nlp master branch with MXNet 1.7? That is not supported. To use the GluonNLP master branch you need the MXNet 2 alpha release, https://github.com/apache/incubator-mxnet/releases/v2.0.0-alpha. If you would rather not compile MXNet from source, you can simply follow https://github.com/dmlc/gluon-nlp#installation

@yangshuo0323
Author

I am using the gluon-nlp 2.0 branch with MXNet 1.7. Is that also unsupported?
I will try what you suggest. Thank you.

@yangshuo0323
Author

I think my 'mpirun' setup may be wrong, for example the optional parameters:

mpirun -np 8 -H localhost:8 -mca pml ob1 -mca btl ^openib \
       -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket \
       --mca plm_rsh_agent 'ssh -q -o StrictHostKeyChecking=no' \
       -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=INFO -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
       -x MXNET_SAFE_ACCUMULATION=1 --tag-output \

These may cause problems with inter-process communication. So, which parameters need to be set for multi-GPU training?
@leezu
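
One way to test whether the launch options are involved is to start from a smaller single-node command and add the tuning flags back one at a time. The sketch below assembles such a command using only flags that already appear in the command above (`-np`, `-H`, `-x`, `--tag-output`). Dropping the `-mca` and `--map-by` options is a debugging assumption, not a confirmed fix, and the helper name `build_mpirun_command` is hypothetical.

```python
import shlex


def build_mpirun_command(num_gpus, script_args, env=None, host="localhost"):
    """Assemble an Open MPI launch command with one process per GPU."""
    env = env or {}
    # Keep -np and the slot count in -H consistent with the GPU count.
    cmd = ["mpirun", "-np", str(num_gpus), "-H", f"{host}:{num_gpus}",
           "--tag-output"]
    for key, value in sorted(env.items()):
        cmd += ["-x", f"{key}={value}"]  # -x exports a variable to every rank
    cmd += ["python", "run_pretraining.py", *script_args]
    return cmd


cmd = build_mpirun_command(
    8,
    ["--comm_backend=horovod", "--total_batch_size=128", "--dtype=float16"],
    env={"NCCL_DEBUG": "INFO", "MXNET_SAFE_ACCUMULATION": "1"},
)
# Print a copy-pasteable shell line.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps the process count and the host slot count in one place, which avoids the two drifting apart when the number of GPUs changes.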

@leezu
Contributor

leezu commented Jan 29, 2021

I use gluon-nlp branch 2.0 with MXNet 1.7. Is it also not supported?

I don't know how this branch was created, but there is actually no gluon-nlp 2.0. cc @szha @sxjscience let's delete the branch?
The branch contains commits of GluonNLP 0.x, so yes, it should work with MXNet 1.7
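
The pairing described in this thread (GluonNLP 0.x with MXNet 1.x, GluonNLP master with MXNet 2) can be written down as a small pre-flight check. This is only a sketch of that rule: the labels `"0.x"` and `"master"` and both function names are hypothetical, and the check looks at nothing beyond the MXNet major version.

```python
def expected_mxnet_major(gluonnlp_line):
    """Map a GluonNLP line to the MXNet major version it targets."""
    pairing = {"0.x": 1, "master": 2}  # as stated in this thread
    return pairing[gluonnlp_line]


def is_supported(gluonnlp_line, mxnet_version):
    """Return True when the MXNet major version matches the GluonNLP line."""
    major = int(mxnet_version.split(".")[0])
    return major == expected_mxnet_major(gluonnlp_line)


print(is_supported("0.x", "1.7.0"))              # GluonNLP 0.x with MXNet 1.7
print(is_supported("master", "1.7.0"))           # master with MXNet 1.7
print(is_supported("master", "2.0.0b20210121"))  # master with an MXNet 2 build
```

The version string could be read from `mxnet.__version__` in a real environment. Here it is passed in explicitly so the check runs without MXNet installed.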

@sxjscience
Member

I have no idea about the 2.0 branch. We may just delete it.

@yangshuo0323 Feel free to try out the BERT pretraining code in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert

@yangshuo0323
Author

I have no idea about the 2.0 branch. We may just delete it.

@yangshuo0323 Feel free to try out the BERT pretraining code in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert

I tried the gluon-nlp 0.10.0 branch and the same error occurred. So gluon-nlp 0.10.0 and MXNet (1.6.0 or 1.7.0) are compatible, right? I will check the rest of my software environment...

@sxjscience
Member

sxjscience commented Jan 30, 2021 via email

That should work. In fact, is it feasible to try out our new version with the custom version of MXNet 2.0 and the GluonNLP master branch?

@yangshuo0323
Author

Ok, I will try out the new version of MXNet and GluonNLP. Thank you so much!

@sxjscience
Member

@yangshuo0323 Thanks! I would encourage you to try our new version, and we can help if you run into any problems training the model. To try the new MXNet, you can install it with one of the following commands:

# Install the version with CUDA 10.1
python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the version with CUDA 11
python3 -m pip install -U --pre "mxnet-cu110>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0b20210121" -f https://dist.mxnet.io/python

Also, you can just clone gluonnlp/master and install via the following command:

python3 -m pip install -U -e ."[extras]"

This will give the nlp_data and nlp_process CLI. You can use nlp_data to download corpus like wikipedia and bookcorpus and

We also recommend installing Horovod via

HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_MPI=1 HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_TENSORFLOW=1 python3 -m pip install --no-cache-dir horovod

After that, feel free to try out the example in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert. We will try to help with any issues you encounter.
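
After installing, it can help to confirm that Horovod's MXNet extension actually imports before launching a multi-GPU job. The probe below is a minimal sketch: `probe_horovod_mxnet` is a hypothetical helper, and it assumes that `mpi_built`, `nccl_built`, and `gloo_built` are exposed by `horovod.mxnet`, which may not hold for every Horovod release, so they are looked up defensively. It only detects a missing or unimportable extension and will not catch every broken build.

```python
import importlib


def probe_horovod_mxnet():
    """Report whether horovod.mxnet imports and which backends it reports."""
    report = {"importable": False, "error": None, "built": {}}
    try:
        hvd = importlib.import_module("horovod.mxnet")
    except Exception as exc:  # missing package or extension not built
        report["error"] = f"{type(exc).__name__}: {exc}"
        return report
    report["importable"] = True
    # Assumed helper names; skipped silently if this release lacks them.
    for name in ("mpi_built", "nccl_built", "gloo_built"):
        fn = getattr(hvd, name, None)
        if callable(fn):
            try:
                report["built"][name] = bool(fn())
            except Exception as exc:
                report["built"][name] = f"error: {exc}"
    return report


report = probe_horovod_mxnet()
print(report)
```

If `importable` is false, reinstalling with `HOROVOD_WITH_MXNET=1` set, as in the command above, is the first thing to try. Horovod also ships a `horovodrun --check-build` command that lists the frameworks and controllers it was built with.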

@yangshuo0323
Author

The previous error was caused by an incorrect installation of Horovod, which may not have been built with the HOROVOD_WITH_MXNET environment variable set.
Thanks to everyone who gave me advice above.
I look forward to trying the new version as you advised.
