Commit 42905f4

1. Update the installation instructions in the README.
2. Add the patches to PyTorch 1.5.0-rc3 to enable the third-party c10d.
1 parent bf3b028 commit 42905f4

3 files changed, with 954 additions and 15 deletions.

CMakeLists.txt

Lines changed: 3 additions & 2 deletions
@@ -25,14 +25,15 @@ add_subdirectory(third_party/pybind11)
 pybind11_add_module(${PLUGIN_NAME} SHARED src/ProcessGroupCCL.cpp)
 include_directories(${PROJECT_SOURCE_DIR}/build/third_party/pybind11/include)
 include_directories(${PROJECT_SOURCE_DIR}/src)
-include_directories(${PROJECT_SOURCE_DIR}/build/third_party/oneCCL/include)
+include_directories(${PROJECT_SOURCE_DIR}/third_party/oneCCL/include)
 
 add_dependencies(${PLUGIN_NAME} ccl)
 add_dependencies(${PLUGIN_NAME} pybind11)
 
-target_link_libraries(${PLUGIN_NAME} PUBLIC ccl)
+
 link_directories(${PYTORCH_INSTALL_DIR}/lib)
 target_link_libraries(${PLUGIN_NAME} PUBLIC ${PYTORCH_INSTALL_DIR}/lib/libtorch_python.so)
 target_link_libraries(${PLUGIN_NAME} PUBLIC ${PYTORCH_INSTALL_DIR}/lib/libtorch_cpu.so)
 target_link_libraries(${PLUGIN_NAME} PUBLIC ${PYTORCH_INSTALL_DIR}/lib/libc10.so)
+target_link_libraries(${PLUGIN_NAME} PUBLIC ${PROJECT_SOURCE_DIR}/build/third_party/oneCCL/src/libccl.a)
 target_compile_options(${PLUGIN_NAME} PRIVATE "-DC10_BUILD_MAIN_LIB")

README.md

Lines changed: 44 additions & 13 deletions
@@ -9,32 +9,64 @@ This repository holds PyTorch bindings maintained by Intel for the Intel® oneAP
 
 [Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the [oneCCL documentation](https://oneapi-src.github.io/oneCCL).
 
-`torch-ccl` module implements PyTorch C10D ProcessGroup API and can be dynamically loaded as external ProcessGroup.
+`torch-ccl` module implements PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux.
 
+# Pytorch API Align
+We recommend Anaconda as the Python package management system. The following are the corresponding branches (tags) of torch-ccl and the supported PyTorch versions.
 
-# Requirements
+| ``torch`` | ``torch-ccl`` |
+| :-----:| :---: |
+| ``master`` | ``master`` |
+| [v1.6.0](https://github.com/pytorch/pytorch/tree/v1.6.0) | [ccl_torch1.6](https://github.com/intel/torch-ccl/tree/ccl_torch1.6) |
+| [v1.5.0-rc3](https://github.com/pytorch/pytorch/tree/v1.5.0-rc3) | [2021.1-beta09](https://github.com/intel/torch-ccl/tree/2021.1-beta09) |
+
+The usage details can be found in the README of the corresponding branch. The rest of this document covers the 2021.1-beta09 tag; to use another version of torch-ccl, check out that branch (tag). For pytorch-1.5.0-rc3, [#PR28068](https://github.com/pytorch/pytorch/pull/28068) and [#PR32361](https://github.com/pytorch/pytorch/pull/32361) are needed to dynamically register an external ProcessGroup and to enable the ``alltoall`` collective communication primitive. The patch file for these two PRs is in the ``patches`` directory and can be applied directly.
 
-PyTorch (1.5.0 or higher).
+# Requirements
 
-Intel® oneAPI Collective Communications Library (2021.1-beta05 or higher).
+Python 3.6 or later and a C++14 compiler.
 
+PyTorch v1.5.0-rc3.
 
 # Installation
 
 To install `torch-ccl`:
 
-1. Install PyTorch.
+1. Clone the [PyTorch](https://github.com/pytorch/pytorch) source code.
 
-2. Install the `torch-ccl`.
-
-```
-$ python setup.py install
+```bash
+git clone https://github.com/pytorch/pytorch.git
+cd pytorch
+git checkout v1.5.0-rc3
+cd ../
 ```
+2. Clone `torch-ccl`.
 
-3. Source the oneCCL environment.
+```bash
+git clone https://github.com/intel/torch-ccl.git && cd torch-ccl
+git submodule sync
+git submodule update --init --recursive
 
 ```
-$ source <torch_ccl_install_path>/ccl/env/setvars.sh
+3. Install PyTorch and torch-ccl.
+
+```bash
+cd ../pytorch
+git apply ../torch-ccl/patches/enable_torch_ccl_for_pytorch1.5.0-rc3.diff
+git submodule sync
+git submodule update --init --recursive
+python setup.py install
+cd ../torch-ccl
+python setup.py install
+```
+4. oneCCL is used as a third-party repo of torch-ccl, but you need to source the oneCCL environment before running.
+
+```bash
+source <torch_ccl_path>/ccl/env/setvars.sh
+
+# for example:
+torch_ccl_path=$CONDA_PREFIX/lib/python3.7/site-packages/torch_ccl-1.0.1-py3.7-linux-x86_64.egg/
+source "$torch_ccl_path/ccl/env/setvars.sh"
 ```
 
 
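
Reviewer note on the hunk above: the example path hard-codes a Python version and an egg name, so it is fragile across environments. A minimal sketch of how the script could be located instead, assuming the installed module is importable as `torch_ccl` and that `ccl/env/setvars.sh` sits either inside the module directory or one level above it (both are assumptions inferred from the example path, not confirmed by this commit). The helper name `find_setvars` is hypothetical.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_setvars(package: str = "torch_ccl") -> Optional[Path]:
    """Return <install dir>/ccl/env/setvars.sh for `package`, or None.

    Hypothetical helper: the module name and the directory layout are
    assumptions taken from the README example path.
    """
    try:
        # find_spec locates a top-level module without importing it.
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None

    # A package lists its directories; a single-file module only has `origin`.
    if spec.submodule_search_locations:
        roots = [Path(p) for p in spec.submodule_search_locations]
    elif spec.origin:
        roots = [Path(spec.origin).parent]
    else:
        roots = []

    for root in roots:
        # Try the module directory itself, then its parent (the egg directory).
        for base in (root, root.parent):
            candidate = base / "ccl" / "env" / "setvars.sh"
            if candidate.is_file():
                return candidate
    return None
```

Because `source` has to run in the calling shell, the returned path would be printed by a small wrapper script and then passed to `source`, rather than executed from Python.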

@@ -70,7 +102,7 @@ model = torch.nn.parallel.DistributedDataParallel(model, ...)
 ```
 
 ```
-$ source <ccl_install_path>/env/setvars.sh
+$ source <torch_ccl_path>/ccl/env/setvars.sh
 $ mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py
 ```
 
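
Reviewer note on the hunk above: `example.py` is launched through `mpirun`, so each process has to learn its rank and the world size from the launcher. A minimal sketch of that mapping, assuming the launcher exports `PMI_RANK` and `PMI_SIZE` (an assumption about the MPI process manager, not something this diff shows) and that the script uses the `env://` initialization of `torch.distributed`, which reads `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE`. The helper name `c10d_env_from_mpi` and the default address and port are hypothetical.

```python
from typing import Dict, Mapping


def c10d_env_from_mpi(environ: Mapping[str, str],
                      master_addr: str = "127.0.0.1",
                      master_port: str = "29500") -> Dict[str, str]:
    """Translate launcher variables into the names env:// initialization reads.

    PMI_RANK / PMI_SIZE are assumed to be set by the MPI launcher; when they
    are missing, fall back to a single-process run (rank 0 of 1).
    """
    return {
        "MASTER_ADDR": environ.get("MASTER_ADDR", master_addr),
        "MASTER_PORT": environ.get("MASTER_PORT", master_port),
        "RANK": environ.get("PMI_RANK", "0"),
        "WORLD_SIZE": environ.get("PMI_SIZE", "1"),
    }


# Intended use inside example.py (left as comments because it needs the
# patched PyTorch build and torch-ccl; the module name `torch_ccl` and the
# backend name "ccl" are assumptions):
#   import os
#   os.environ.update(c10d_env_from_mpi(os.environ))
#   import torch.distributed as dist
#   import torch_ccl
#   dist.init_process_group(backend="ccl")
```

Keeping the mapping in a pure function makes it easy to check without an MPI launcher or a PyTorch build.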

@@ -108,7 +140,6 @@ print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_
 ```
 
 ```
-$ source <ccl_install_path>/env/setvars.sh
 $ mpirun -n 2 -l python profiling.py
 ```
 
