This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[MISC] add decorator for logging exceptions #1512

Merged 3 commits on Feb 2, 2021

Conversation


@szha commented Jan 30, 2021

Description

Add a decorator for logging exceptions.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Add a decorator for logging exceptions
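
For readers unfamiliar with the pattern, here is a minimal sketch of what a decorator for logging exceptions can look like. It is not the code merged in this PR: the name `log_exceptions`, its `logger` and `reraise` parameters, and the `divide` example are all hypothetical and chosen only for illustration.

```python
import functools
import logging


def log_exceptions(logger=None, reraise=True):
    """Hypothetical decorator factory: log any exception raised by the wrapped function."""

    def decorator(func):
        # Fall back to a logger named after the module that defines the function.
        log = logger or logging.getLogger(func.__module__)

        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Logger.exception logs at ERROR level and appends the traceback.
                log.exception("Exception raised in %s", func.__qualname__)
                if reraise:
                    raise
                return None

        return wrapper

    return decorator


@log_exceptions()
def divide(a, b):
    return a / b


@log_exceptions(reraise=False)
def divide_quietly(a, b):
    return a / b
```

With the default `reraise=True`, callers still see the original exception, so the decorator only adds a log record. With `reraise=False`, the wrapper swallows the exception after logging it and returns `None`.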

Comments

cc @dmlc/gluon-nlp-team

@szha requested a review from leezu on January 30, 2021 02:46
@szha requested a review from a team as a code owner on January 30, 2021 02:46

codecov bot commented Jan 30, 2021

Codecov Report

Merging #1512 (0a41311) into master (8d31297) will decrease coverage by 0.62%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
- Coverage   86.49%   85.87%   -0.63%     
==========================================
  Files          55       55              
  Lines        7502     7396     -106     
==========================================
- Hits         6489     6351     -138     
- Misses       1013     1045      +32     
Impacted Files                                 Coverage Δ
setup.py                                       0.00% <ø> (ø)
src/gluonnlp/utils/misc.py                     54.86% <100.00%> (+0.21%) ⬆️
conftest.py                                    76.31% <0.00%> (-9.94%) ⬇️
src/gluonnlp/data/loading.py                   75.75% <0.00%> (-7.64%) ⬇️
src/gluonnlp/utils/lazy_imports.py             58.42% <0.00%> (-2.25%) ⬇️
src/gluonnlp/data/tokenizers/spacy.py          65.33% <0.00%> (-0.91%) ⬇️
src/gluonnlp/data/tokenizers/huggingface.py    71.06% <0.00%> (-0.49%) ⬇️
src/gluonnlp/data/tokenizers/jieba.py          73.13% <0.00%> (-0.40%) ⬇️
src/gluonnlp/models/transformer_xl.py          80.48% <0.00%> (-0.39%) ⬇️
... and 19 more
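
The percentages in the coverage diff can be re-derived from the hit and line counts it reports. The sketch below is only a worked check of that arithmetic; the helper name `coverage_percent` is made up, and how Codecov rounds its displayed values is not something this snippet assumes.

```python
def coverage_percent(hits, lines):
    """Return line coverage as a percentage of total lines."""
    return 100.0 * hits / lines


# Counts taken from the coverage diff: hits and lines on master and on #1512.
before = coverage_percent(6489, 7502)
after = coverage_percent(6351, 7396)
delta = after - before

print(f"master: {before:.4f}%  #1512: {after:.4f}%  delta: {delta:+.4f}%")
```

The unrounded delta falls between the 0.62% quoted in the summary sentence and the 0.63% shown in the diff, so the two figures most likely differ only in how they were rounded.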

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -49,7 +49,7 @@ RUN cd ${WORKDIR} \
  && git clone https://github.com/dmlc/gluon-nlp \
  && cd gluon-nlp \
  && git checkout master \
- && python3 -m pip install -U -e ."[extras]"
+ && python3 -m pip install -U -e ."[extras,dev]"
@szha commented Jan 31, 2021

@leezu the Docker build for GPU keeps failing at the horovod build step.

szha commented Jan 31, 2021

horovod build error
Building wheels for collected packages: horovod
  Building wheel for horovod (setup.py): started
  Building wheel for horovod (setup.py): finished with status 'error'
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-hp3mvl1_
       cwd: /tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/
  Complete output (226 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.6
  creating build/lib.linux-x86_64-3.6/horovod
  copying horovod/__init__.py -> build/lib.linux-x86_64-3.6/horovod
  creating build/lib.linux-x86_64-3.6/horovod/keras
  copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
  copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/keras
  copying horovod/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/keras
  creating build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/functions.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  copying horovod/tensorflow/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
  creating build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/basics.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/exceptions.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/util.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/common
  copying horovod/common/elastic.py -> build/lib.linux-x86_64-3.6/horovod/common
  creating build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/launch.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/js_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/run_task.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner
  copying horovod/runner/task_fn.py -> build/lib.linux-x86_64-3.6/horovod/runner
  creating build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
  copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark
  creating build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/runner.py -> build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/__init__.py -> build/lib.linux-x86_64-3.6/horovod/ray
  copying horovod/ray/elastic.py -> build/lib.linux-x86_64-3.6/horovod/ray
  creating build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/functions.py -> build/lib.linux-x86_64-3.6/horovod/torch
  copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch
  creating build/lib.linux-x86_64-3.6/horovod/_keras
  copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/_keras
  copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/_keras
  copying horovod/_keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/_keras
  creating build/lib.linux-x86_64-3.6/horovod/mxnet
  copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
  copying horovod/mxnet/functions.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
  copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
  creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
  creating build/lib.linux-x86_64-3.6/horovod/runner/common
  copying horovod/runner/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common
  creating build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/lsf.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/remote.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/threads.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/cache.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  copying horovod/runner/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
  creating build/lib.linux-x86_64-3.6/horovod/runner/task
  copying horovod/runner/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
  copying horovod/runner/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
  creating build/lib.linux-x86_64-3.6/horovod/runner/http
  copying horovod/runner/http/http_server.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
  copying horovod/runner/http/http_client.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
  copying horovod/runner/http/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
  creating build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/registration.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/discovery.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/constants.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/driver.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/worker.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  copying horovod/runner/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
  creating build/lib.linux-x86_64-3.6/horovod/runner/driver
  copying horovod/runner/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
  copying horovod/runner/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
  creating build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/codec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/timeout.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/env.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/tiny_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/hosts.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/config_parser.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  copying horovod/runner/common/util/host_hash.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
  creating build/lib.linux-x86_64-3.6/horovod/runner/common/service
  copying horovod/runner/common/service/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
  copying horovod/runner/common/service/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
  copying horovod/runner/common/service/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
  creating build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
  creating build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  copying horovod/spark/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
  creating build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/gloo_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  copying horovod/spark/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
  creating build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/host_discovery.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  copying horovod/spark/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
  creating build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  copying horovod/spark/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
  creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
  copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
  creating build/lib.linux-x86_64-3.6/horovod/torch/elastic
  copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
  copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
  copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
  creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
  copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
  warning: build_py: byte-compiling is disabled, skipping.

  running build_ext
  -- The CXX compiler identification is GNU 7.5.0
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Check for working CXX compiler: /usr/bin/c++ - skipped
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Build architecture flags: -mf16c -mavx -mfma
  -- Using command /usr/bin/python3
  -- Found MPI_CXX: /usr/local/lib/libmpi.so (found version "3.1")
  -- Found MPI: TRUE (found version "3.1")
  -- Found CUDA: /usr/local/cuda (found version "10.2")
  -- Linking against static NCCL library
  -- Found NCCL: /usr/include
  -- Determining NCCL version from the header file: /usr/include/nccl.h
  -- NCCL_MAJOR_VERSION: 2
  -- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl_static.a)
  -- Found Pytorch: 1.7.1 (found suitable version "1.7.1", minimum required is "1.2.0")
  /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
    return torch._C._cuda_getDeviceCount() > 0
  /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
    return torch._C._cuda_getDeviceCount() > 0
  /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
    return torch._C._cuda_getDeviceCount() > 0
  CMake Error at /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:165 (message):
    Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
    version "1.4.0")
  Call Stack (most recent call first):
    /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:458 (_FPHSA_FAILURE_MESSAGE)
    cmake/Modules/FindMxnet.cmake:54 (find_package_handle_standard_args)
    horovod/mxnet/CMakeLists.txt:12 (find_package)


  -- Configuring incomplete, errors occurred!
  See also "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/temp.linux-x86_64-3.6/CMakeFiles/CMakeOutput.log".
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 191, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/usr/local/lib/python3.6/dist-packages/setuptools/__init__.py", line 153, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
      self.run_command(cmd)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.6/dist-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/root/.local/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 89, in build_extensions
      cwd=self.build_temp)
    File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/lib.linux-x86_64-3.6', '-DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python3']' returned non-zero exit status 1.
  ----------------------------------------
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
Failed to build horovod
Installing collected packages: cloudpickle, horovod
    Running setup.py install for horovod: started
    Running setup.py install for horovod: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-3e_m045u/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /root/.local/include/python3.6m/horovod
         cwd: /tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/
    Complete output (228 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.6
    creating build/lib.linux-x86_64-3.6/horovod
    copying horovod/__init__.py -> build/lib.linux-x86_64-3.6/horovod
    creating build/lib.linux-x86_64-3.6/horovod/keras
    copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
    copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/keras
    copying horovod/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/keras
    creating build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/functions.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    copying horovod/tensorflow/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
    creating build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/basics.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/exceptions.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/util.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/common
    copying horovod/common/elastic.py -> build/lib.linux-x86_64-3.6/horovod/common
    creating build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/launch.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/js_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/run_task.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner
    copying horovod/runner/task_fn.py -> build/lib.linux-x86_64-3.6/horovod/runner
    creating build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/spark
    copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark
    creating build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/runner.py -> build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/__init__.py -> build/lib.linux-x86_64-3.6/horovod/ray
    copying horovod/ray/elastic.py -> build/lib.linux-x86_64-3.6/horovod/ray
    creating build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/sync_batch_norm.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/functions.py -> build/lib.linux-x86_64-3.6/horovod/torch
    copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch
    creating build/lib.linux-x86_64-3.6/horovod/_keras
    copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/_keras
    copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/_keras
    copying horovod/_keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/_keras
    creating build/lib.linux-x86_64-3.6/horovod/mxnet
    copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
    copying horovod/mxnet/functions.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
    copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
    creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    copying horovod/tensorflow/keras/elastic.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
    creating build/lib.linux-x86_64-3.6/horovod/runner/common
    copying horovod/runner/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common
    creating build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/lsf.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/remote.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/threads.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/cache.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    copying horovod/runner/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/util
    creating build/lib.linux-x86_64-3.6/horovod/runner/task
    copying horovod/runner/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
    copying horovod/runner/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/task
    creating build/lib.linux-x86_64-3.6/horovod/runner/http
    copying horovod/runner/http/http_server.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
    copying horovod/runner/http/http_client.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
    copying horovod/runner/http/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/http
    creating build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/registration.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/discovery.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/constants.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/driver.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/worker.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    copying horovod/runner/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/elastic
    creating build/lib.linux-x86_64-3.6/horovod/runner/driver
    copying horovod/runner/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
    copying horovod/runner/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/driver
    creating build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/codec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/timeout.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/network.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/env.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/tiny_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/hosts.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/config_parser.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    copying horovod/runner/common/util/host_hash.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/util
    creating build/lib.linux-x86_64-3.6/horovod/runner/common/service
    copying horovod/runner/common/service/task_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
    copying horovod/runner/common/service/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
    copying horovod/runner/common/service/__init__.py -> build/lib.linux-x86_64-3.6/horovod/runner/common/service
    creating build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
    creating build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    copying horovod/spark/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
    creating build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/gloo_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    copying horovod/spark/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
    creating build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/host_discovery.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/rendezvous.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    copying horovod/spark/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
    creating build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    copying horovod/spark/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
    creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
    copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
    creating build/lib.linux-x86_64-3.6/horovod/torch/elastic
    copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
    copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
    copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/elastic
    creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
    copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
    warning: build_py: byte-compiling is disabled, skipping.

    running build_ext
    -- The CXX compiler identification is GNU 7.5.0
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Build architecture flags: -mf16c -mavx -mfma
    -- Using command /usr/bin/python3
    -- Found MPI_CXX: /usr/local/lib/libmpi.so (found version "3.1")
    -- Found MPI: TRUE (found version "3.1")
    -- Found CUDA: /usr/local/cuda (found version "10.2")
    -- Linking against static NCCL library
    -- Found NCCL: /usr/include
    -- Determining NCCL version from the header file: /usr/include/nccl.h
    -- NCCL_MAJOR_VERSION: 2
    -- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl_static.a)
    -- Found Pytorch: 1.7.1 (found suitable version "1.7.1", minimum required is "1.2.0")
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    CMake Error at /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:165 (message):
      Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
      version "1.4.0")
    Call Stack (most recent call first):
      /root/.local/lib/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/FindPackageHandleStandardArgs.cmake:458 (_FPHSA_FAILURE_MESSAGE)
      cmake/Modules/FindMxnet.cmake:54 (find_package_handle_standard_args)
      horovod/mxnet/CMakeLists.txt:12 (find_package)


    -- Configuring incomplete, errors occurred!
    See also "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/temp.linux-x86_64-3.6/CMakeFiles/CMakeOutput.log".
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 191, in <module>
        'horovodrun = horovod.runner.launch:run_commandline'
      File "/usr/local/lib/python3.6/dist-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/usr/lib/python3.6/distutils/command/install.py", line 589, in run
        self.run_command('build')
      File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/root/.local/lib/python3.6/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
        self.build_extensions()
      File "/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py", line 89, in build_extensions
        cwd=self.build_temp)
      File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/build/lib.linux-x86_64-3.6', '-DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python3']' returned non-zero exit status 1.
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-8lyz893x/horovod_397f02fc95974407b7cd47817a76cb0f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-3e_m045u/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /root/.local/include/python3.6m/horovod Check the logs for full command output.
The command '/bin/sh -c bash /install/install_horovod.sh' returned a non-zero code: 1

@sxjscience
Member

@szha From the error message, this seems to be related to how the MXNet integration is written. Currently, Horovod calls the MXNet API to determine some GPU-related flags, and the build fails if the instance has no GPU or is not configured appropriately. You may follow the guide at https://github.com/dmlc/gluon-nlp/tree/master/tools/docker#build-by-yourself and try again (you need to edit /etc/docker/daemon.json).

@szha
Member Author

szha commented Jan 31, 2021

@sxjscience thanks. I think my system already has nvidia-docker2 installed and the config entry added. I think you are right that this has to do with how the Horovod integration is written. It's having trouble finding MXNet for some reason.
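
One way to narrow this down is to check whether MXNet can even be imported by the interpreter that the build uses (`/usr/bin/python3` in the log above). The following is only a minimal sketch: the helper name `can_import` is hypothetical, and it assumes that a failing import is what leaves `Mxnet_LIBRARIES` unset, which the log does not confirm.

```python
import subprocess
import sys


def can_import(module_name, python=sys.executable):
    """Try `import <module_name>` in a fresh interpreter.

    Returns (ok, detail), where detail is the last line written to stderr
    (usually the exception message when the import fails).
    """
    result = subprocess.run(
        [python, "-c", "import " + module_name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on Python 3.6
    )
    lines = result.stderr.strip().splitlines()
    detail = lines[-1] if lines else ""
    return result.returncode == 0, detail


if __name__ == "__main__":
    # Assumption: run this inside the image being built, with the same python3.
    ok, detail = can_import("mxnet")
    print("mxnet importable:", ok)
    if not ok:
        print("reason:", detail)
```

Running the import in a separate process keeps a crash while loading a native library from taking down the calling script.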

@sxjscience
Member

OK. I saw the following warnings in the log, so I thought the GPU was not being used.

    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0

@szha
Member Author

szha commented Jan 31, 2021

@sxjscience using nvidia-docker instead of docker command (i.e. to turn on nvidia docker runtime) should resolve that particular warning.

@sxjscience
Member

I think we may try to automate our docker pipeline.

@leezu
Contributor

leezu commented Feb 1, 2021

> horovod build error

@szha do you mean this error occurs when you rebuild the container?

> @sxjscience using nvidia-docker instead of docker command (i.e. to turn on nvidia docker runtime) should resolve that particular warning.

That's not correct: nvidia-docker only takes effect at run time, not at build time. You need to follow the steps in https://github.com/dmlc/gluon-nlp/tree/master/tools/docker#build-by-yourself

@szha
Member Author

szha commented Feb 1, 2021

@leezu @sxjscience thanks for helping. I noticed that I had previously missed the "default-runtime" entry in the config. Sorry for the oversight. I was able to complete the build after adding that entry, and I'm pushing the GPU Docker image now.
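
For anyone hitting the same problem, here is a rough sketch of a check for that entry. It assumes the config has the shape commonly used with nvidia-docker2 (a `runtimes.nvidia` block plus `"default-runtime": "nvidia"`); the function name and the two sample configs are illustrative only and are not taken from this repository.

```python
import json


def nvidia_is_default_runtime(daemon_json_text):
    """Return True if the config registers an 'nvidia' runtime and makes it the default."""
    config = json.loads(daemon_json_text)
    has_runtime = "nvidia" in config.get("runtimes", {})
    return has_runtime and config.get("default-runtime") == "nvidia"


# Assumed shape of a working config.
WITH_DEFAULT = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

# Same config without the "default-runtime" entry: the runtime is registered,
# but `docker build` would not use it.
WITHOUT_DEFAULT = """
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

if __name__ == "__main__":
    print(nvidia_is_default_runtime(WITH_DEFAULT))
    print(nvidia_is_default_runtime(WITHOUT_DEFAULT))
```

To check a real machine, read `/etc/docker/daemon.json` and pass its text to the function. The Docker daemon typically needs a restart before a change to this file takes effect.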

@szha
Member Author

szha commented Feb 1, 2021

Looks like there might be an upstream change, as tests/test_data_tokenizers.py::test_spacy_tokenizer failed.

Signed-off-by: Sheng Zha <[email protected]>

@szha szha merged commit 302865c into dmlc:master Feb 2, 2021
@szha szha deleted the except branch February 2, 2021 02:49