[MISC] add decorator for logging exceptions #1512

szha · 2021-01-30T02:46:45Z

Description

add decorator for logging exceptions

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

add decorator for logging exceptions

Comments

This is useful for debugging worker-side errors that can result in the KeyError mentioned in "Issues in ELECTRA-base pre-training and fine-tuning" (#1505).
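
For readers without the diff at hand, here is a minimal sketch of what an exception-logging decorator can look like. It is not the code added in this PR: the name `log_exceptions`, its optional `logger` argument, and the `load_sample` example are all hypothetical and only illustrate the general pattern of logging the traceback where the error happens and then re-raising.

```python
import functools
import logging


def log_exceptions(logger=None):
    """Return a decorator that logs any exception raised by the wrapped function.

    Hypothetical helper for illustration; not the API introduced by this PR.
    """
    def decorator(func):
        # Fall back to a logger named after the module that defines the function.
        log = logger if logger is not None else logging.getLogger(func.__module__)

        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Logger.exception logs at ERROR level and appends the traceback.
                log.exception('Exception raised in %s', func.__qualname__)
                raise  # re-raise so the caller still sees the original error
        return wrapper
    return decorator


@log_exceptions()
def load_sample(table, key):
    # Hypothetical worker-side function; a missing key raises KeyError.
    return table[key]
```

Because the wrapper re-raises, behaviour for the caller is unchanged; the only difference is that the traceback is written to the log in the process where the failure occurred, which is what helps when an error inside a worker would otherwise surface as an unrelated exception in the parent.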

cc @dmlc/gluon-nlp-team

Signed-off-by: Sheng Zha <[email protected]>

github-actions · 2021-01-30T02:58:50Z

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1512/5ce7c5f1b8e212a853a4d08717e0ccf875b7822a/index.html

github-actions · 2021-01-30T03:29:23Z

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1512/f9a5fb71925c75e9ca484b7a0e908756319460bf/index.html

codecov · 2021-01-30T03:34:08Z

Codecov Report

Merging #1512 (0a41311) into master (8d31297) will decrease coverage by 0.62%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
- Coverage   86.49%   85.87%   -0.63%     
==========================================
  Files          55       55              
  Lines        7502     7396     -106     
==========================================
- Hits         6489     6351     -138     
- Misses       1013     1045      +32

| Impacted Files | Coverage Δ | |
|---|---|---|
| setup.py | `0.00% <ø> (ø)` | |
| src/gluonnlp/utils/misc.py | `54.86% <100.00%> (+0.21%)` | ⬆️ |
| conftest.py | `76.31% <0.00%> (-9.94%)` | ⬇️ |
| src/gluonnlp/data/loading.py | `75.75% <0.00%> (-7.64%)` | ⬇️ |
| src/gluonnlp/utils/lazy_imports.py | `58.42% <0.00%> (-2.25%)` | ⬇️ |
| src/gluonnlp/data/tokenizers/spacy.py | `65.33% <0.00%> (-0.91%)` | ⬇️ |
| src/gluonnlp/data/tokenizers/huggingface.py | `71.06% <0.00%> (-0.49%)` | ⬇️ |
| src/gluonnlp/data/tokenizers/jieba.py | `73.13% <0.00%> (-0.40%)` | ⬇️ |
| src/gluonnlp/models/transformer_xl.py | `80.48% <0.00%> (-0.39%)` | ⬇️ |
| ... and 19 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

szha · 2021-01-31T21:59:44Z

tools/docker/ubuntu18.04-gpu.Dockerfile

@@ -49,7 +49,7 @@ RUN cd ${WORKDIR} \
   && git clone https://github.com/dmlc/gluon-nlp \
   && cd gluon-nlp \
   && git checkout master \
-   && python3 -m pip install -U -e ."[extras]"
+   && python3 -m pip install -U -e ."[extras,dev]"


@leezu the Docker build for GPU keeps failing at the horovod build step.

szha · 2021-01-31T22:00:12Z

horovod build error

Building wheels for collected packages: horovod
  Building wheel for horovod (setup.py): started
  Building wheel for horovod (setup.py): finished with status 'error'
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-hp3mvl1_
       cwd: /tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/
  Complete output (226 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.6
  creating build/lib.linux-x86_64-3.6/horovod
  copying horovod/__init__.py -> build/lib.linux-x86_64-3.6/horovod
  creating build/lib.linux-x86_64-3.6/horovod/keras
  copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
  copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/keras
  copying horovod/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/keras
  creating build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/functions.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  creating build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/basics.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/exceptions.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/util.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/elastic.py -> build/lib.linux-x86_64-3.6/horovod/common
  creating build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/launch.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/js_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/run_task.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/task_fn.py -> build/lib.linux-x86_64-3.6/horovod/runner
  creating build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark
  creating build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/runner.py -> build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/__init__.py -> build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/elastic.py -> build/lib.linux-x86_64-3.6/horovod/ray
  creating build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/functions.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch
  creating build/lib.linux-x86_64-3.6/horovod/_keras
  copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/_keras
  copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/_keras
  copying horovod/_keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/_keras
  creating build/lib.linux-x86_64-3.6/horovod/mxnet
  copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
  copying horovod/mxnet/functions.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
  copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
  creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  creating build/lib.linux-x86_64-3.6/horovod/runner/common
  copying horovod/runner/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common
  creating build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/lsf.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/remote.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/threads.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/cache.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  creating build/lib.linux-x86_64-3.6/horovod/runner/task
  copying horovod/runner/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
  copying horovod/runner/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
  creating build/lib.linux-x86_64-3.6/horovod/runner/http
  copying horovod/runner/http/http_server.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
  copying horovod/runner/http/http_client.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
  copying horovod/runner/http/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
  creating build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/registration.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/discovery.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/constants.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/driver.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/worker.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  creating build/lib.linux-x86_64-3.6/horovod/runner/driver
  copying horovod/runner/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
  copying horovod/runner/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
  creating build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/codec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/timeout.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/env.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/tiny_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/hosts.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/config_parser.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/host_hash.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  creating build/lib.linux-x86_64-3.6/horovod/runner/common/service
  copying horovod/runner/common/service/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
  copying horovod/runner/common/service/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
  copying horovod/runner/common/service/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
  creating build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  creating build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  creating build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/gloo_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  creating build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/host_discovery.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  creating build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
  copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
  creating build/lib.linux-x86_64-3.6/horovod/torch/elastic
  copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
  copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
  copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
  creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
  copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
  warning: build_py: byte-compiling is disabled, skipping.

  running build_ext
  -- The CXX compiler identification is GNU 7.5.0
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Check for working CXX compiler: /usr/bin/c++ - skipped
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Build architecture flags: -mf16c -mavx -mfma
  -- Using command /usr/bin/python3
  -- Found MPI_CXX: /usr/local/lib/libmpi.so (found version "3.1")
  -- Found MPI: TRUE (found version "3.1")
  -- Found CUDA: /usr/local/cuda (found version "10.2")
  -- Linking against static NCCL library
  -- Found NCCL: /usr/include
  -- Determining NCCL version from the header file: /usr/include/nccl.h
  -- NCCL_MAJOR_VERSION: 2
  -- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl_static.a)
  -- Found Pytorch: 1.7.1 (found suitable version "1.7.1", minimum required is "1.2.0")
  /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
    return torch._C._cuda_getDeviceCount() > 0
  /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
    return torch._C._cuda_getDeviceCount() > 0
  /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
    return torch._C._cuda_getDeviceCount() > 0
  CMake Error at /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:165 (message):
    Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
    version "1.4.0")
  Call Stack (most recent call first):
    /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:458 (_FPHSA_FAILURE_MESSAGE)
    cmake/Modules/FindMxnet.cmake:54 (find_package_handle_standard_args)
    horovod/mxnet/CMakeLists.txt:12 (find_package)


  -- Configuring incomplete, errors occurred!
  See also "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/temp.linux-x86_64-3.6/CMakeFiles/CMakeOutput.log".
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 191, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/usr/local/lib/python3.6/dist-packages/setuptools/__init__.py", line 153, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
      self.run_command(cmd)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.6/dist-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/root/.local/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 89, in build_extensions
      cwd=self.build_temp)
    File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/lib.linux-x86_64-3.6', '-DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python3']' returned non-zero exit status 1.
  ----------------------------------------
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
Failed to build horovod
Installing collected packages: cloudpickle, horovod
    Running setup.py install for horovod: started
    Running setup.py install for horovod: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-3e_m045u/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /root/.local/include/python3.6m/horovod
         cwd: /tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/
    Complete output (228 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.6
    creating build/lib.linux-x86_64-3.6/horovod
    copying horovod/__init__.py -> build/lib.linux-x86_64-3.6/horovod
    creating build/lib.linux-x86_64-3.6/horovod/keras
    copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
    copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/keras
    copying horovod/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/keras
    creating build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/functions.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    creating build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/basics.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/exceptions.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/util.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/elastic.py -> build/lib.linux-x86_64-3.6/horovod/common
    creating build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/launch.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/js_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/run_task.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/task_fn.py -> build/lib.linux-x86_64-3.6/horovod/runner
    creating build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark
    creating build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/runner.py -> build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/__init__.py -> build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/elastic.py -> build/lib.linux-x86_64-3.6/horovod/ray
    creating build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/functions.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch
    creating build/lib.linux-x86_64-3.6/horovod/_keras
    copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/_keras
    copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/_keras
    copying horovod/_keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/_keras
    creating build/lib.linux-x86_64-3.6/horovod/mxnet
    copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
    copying horovod/mxnet/functions.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
    copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
    creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    copying horovod/tensorflow/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    creating build/lib.linux-x86_64-3.6/horovod/runner/common
    copying horovod/runner/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common
    creating build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/lsf.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/remote.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/threads.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/cache.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    creating build/lib.linux-x86_64-3.6/horovod/runner/task
    copying horovod/runner/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
    copying horovod/runner/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
    creating build/lib.linux-x86_64-3.6/horovod/runner/http
    copying horovod/runner/http/http_server.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
    copying horovod/runner/http/http_client.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
    copying horovod/runner/http/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
    creating build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/registration.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/discovery.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/constants.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/driver.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/worker.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    creating build/lib.linux-x86_64-3.6/horovod/runner/driver
    copying horovod/runner/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
    copying horovod/runner/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
    creating build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/codec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/timeout.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/env.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/tiny_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/hosts.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/config_parser.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/host_hash.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    creating build/lib.linux-x86_64-3.6/horovod/runner/common/service
    copying horovod/runner/common/service/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
    copying horovod/runner/common/service/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
    copying horovod/runner/common/service/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
    creating build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    creating build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    creating build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/gloo_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    creating build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/host_discovery.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    creating build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
    copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
    creating build/lib.linux-x86_64-3.6/horovod/torch/elastic
    copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
    copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
    copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
    creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
    copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
    warning: build_py: byte-compiling is disabled, skipping.

    running build_ext
    -- The CXX compiler identification is GNU 7.5.0
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Build architecture flags: -mf16c -mavx -mfma
    -- Using command /usr/bin/python3
    -- Found MPI_CXX: /usr/local/lib/libmpi.so (found version "3.1")
    -- Found MPI: TRUE (found version "3.1")
    -- Found CUDA: /usr/local/cuda (found version "10.2")
    -- Linking against static NCCL library
    -- Found NCCL: /usr/include
    -- Determining NCCL version from the header file: /usr/include/nccl.h
    -- NCCL_MAJOR_VERSION: 2
    -- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl_static.a)
    -- Found Pytorch: 1.7.1 (found suitable version "1.7.1", minimum required is "1.2.0")
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    CMake Error at /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:165 (message):
      Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
      version "1.4.0")
    Call Stack (most recent call first):
      /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:458 (_FPHSA_FAILURE_MESSAGE)
      cmake/Modules/FindMxnet.cmake:54 (find_package_handle_standard_args)
      horovod/mxnet/CMakeLists.txt:12 (find_package)


    -- Configuring incomplete, errors occurred!
    See also "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/temp.linux-x86_64-3.6/CMakeFiles/CMakeOutput.log".
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 191, in <module>
        'horovodrun = horovod.runner.launch:run_commandline'
      File "/usr/local/lib/python3.6/dist-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/usr/lib/python3.6/distutils/command/install.py", line 589, in run
        self.run_command('build')
      File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/root/.local/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
        self.build_extensions()
      File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 89, in build_extensions
        cwd=self.build_temp)
      File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/lib.linux-x86_64-3.6', '-DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python3']' returned non-zero exit status 1.
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-3e_m045u/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /root/.local/include/python3.6m/horovod Check the logs for full command output.
The command '/bin/sh -c bash /install/install_horovod.sh' returned a non-zero code: 1

sxjscience · 2021-01-31T22:03:51Z

@szha From the error message, this seems to be related to how the MXNet integration is written. Currently, horovod calls the MXNet API to determine some GPU-related flags, and it fails if the instance has no GPU or is not configured appropriately. You may follow the guide at https://github.com/dmlc/gluon-nlp/tree/master/tools/docker#build-by-yourself and try again (you need to edit /etc/docker/daemon.json).

szha · 2021-01-31T22:12:07Z

@sxjscience thanks. I think my system already has nvidia-docker2 installed and the config entry added. I think you are right that this has to do with how horovod integration is written. It's having trouble finding mxnet for some reason.

sxjscience · 2021-01-31T22:18:15Z

OK. I saw the following warnings in the log, so I thought the GPU was not being used.

    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0

szha · 2021-01-31T22:26:22Z

@sxjscience using nvidia-docker instead of docker command (i.e. to turn on nvidia docker runtime) should resolve that particular warning.

sxjscience · 2021-01-31T22:27:45Z

I think we may try to automate our docker pipeline.

leezu · 2021-02-01T14:14:16Z

> horovod build error

@szha do you mean this error occurs when you rebuild the container?

> @sxjscience using nvidia-docker instead of docker command (i.e. to turn on nvidia docker runtime) should resolve that particular warning.

That's not correct, because nvidia-docker only takes effect at run time, not at build time. You need to follow the steps in https://github.com/dmlc/gluon-nlp/tree/master/tools/docker#build-by-yourself

szha · 2021-02-01T14:45:39Z

@leezu @sxjscience thanks for helping. I noticed that I had previously missed the "default-runtime" entry in the config. Sorry for the miss. I was able to complete the build after adding that entry, and I'm pushing the GPU docker image now.
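
For anyone hitting the same build failure, the "default-runtime" entry can be checked with a few lines of Python. This is only a sketch: `has_nvidia_default_runtime` is a hypothetical helper, and the example config assumes the usual nvidia-docker2 layout of /etc/docker/daemon.json, which may differ from the one in the linked guide.

```python
import json


def has_nvidia_default_runtime(config_text):
    """Return True if the daemon config both selects and defines the "nvidia" runtime."""
    config = json.loads(config_text)
    return (config.get("default-runtime") == "nvidia"
            and "nvidia" in config.get("runtimes", {}))


# Example config in the usual nvidia-docker2 layout (an assumption, not copied from the guide).
EXAMPLE_DAEMON_JSON = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
    }
}
"""

if __name__ == "__main__":
    print(has_nvidia_default_runtime(EXAMPLE_DAEMON_JSON))
```

To check a real machine, read /etc/docker/daemon.json and pass its contents to the helper. The Docker daemon has to be restarted after editing the file for the change to apply to `docker build`.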

szha · 2021-02-01T21:57:14Z

It looks like there might be an upstream change, as tests/test_data_tokenizers.py::test_spacy_tokenizer failed.

Signed-off-by: Sheng Zha <[email protected]>

github-actions · 2021-02-02T01:14:46Z

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1512/0a41311da7e394cce3459f93c90beef34c55f767/index.html

add decorator for logging exceptions

5ce7c5f

Signed-off-by: Sheng Zha <[email protected]>
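
The general idea behind a decorator for logging exceptions can be sketched as follows. This is not the code merged in this PR; `log_exceptions` and its `logger` parameter are hypothetical names chosen for illustration, and only the standard library is used.

```python
import functools
import logging


def log_exceptions(logger=None):
    """Build a decorator that logs any exception raised by the wrapped function.

    The exception is re-raised after logging, so callers see the same error.
    """
    def decorator(func):
        # Fall back to a logger named after the function's module.
        log = logger or logging.getLogger(func.__module__)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception records the message together with the traceback.
                log.exception("Exception in %s", func.__qualname__)
                raise
        return wrapper
    return decorator


@log_exceptions()
def load_sample(samples, key):
    """Hypothetical worker-side function that may raise KeyError."""
    return samples[key]
```

Logging inside the worker before re-raising keeps the original traceback in the worker's log, even if the parent process only reports a less informative error.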

szha requested a review from leezu January 30, 2021 02:46

szha requested a review from a team as a code owner January 30, 2021 02:46

make dev dependencies available in CI

f9a5fb7

Signed-off-by: Sheng Zha <[email protected]>

szha commented Jan 31, 2021

leezu approved these changes Feb 1, 2021

limit spacy<3

0a41311

Signed-off-by: Sheng Zha <[email protected]>

szha merged commit 302865c into dmlc:master Feb 2, 2021

szha deleted the except branch February 2, 2021 02:49
