TRNS CPU GPU #12

Open
trinayan opened this issue Jun 26, 2017 · 5 comments

Comments

@trinayan

Hi,
The Chai paper mentions that SC, PAD, and TRNS support only GPU execution in the OpenCL-D benchmarks. I did not try the OpenCL-D ones, but the CUDA-D versions of SC and PAD run on the CPU and GPU together. I was wondering whether this is also possible for the CUDA-D version of TRNS, or whether it is not possible at all. Thanks.

Best,
Trinayan

@el1goluj
Member

Hi Trinayan,

As stated in the Chai paper, inter-worker synchronization between CPU and GPU workers is not possible without system-wide atomics. For this reason, the paper uses GPU-only versions of PAD, SC, and TRNS when comparing the OpenCL-D and OpenCL-U versions (Figure 2).
The -D versions of PAD and SC in the repository are CPU+GPU out-of-place implementations that do not require CPU-GPU inter-worker synchronization. The input arrays are divided into two parts, one assigned to the CPU and the other to the GPU. These versions are not directly comparable to the -U versions of PAD and SC, which are in-place.
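To illustrate the idea of splitting an out-of-place operation into two independent parts, here is a minimal C++ sketch of row padding. It is not the Chai code: `pad_rows`, `split_row`, `pad_out_of_place`, and the `alpha` fraction are hypothetical names chosen for this example, and both parts run on the host one after the other. The point is only that the two row ranges are disjoint and the output buffer is separate from the input, so the two parts never need to synchronize with each other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy rows [row_begin, row_end) of a row-major matrix with `cols` columns
// into an output matrix whose rows have `padded_cols` columns.
void pad_rows(const std::vector<int>& in, std::vector<int>& out,
              std::size_t cols, std::size_t padded_cols,
              std::size_t row_begin, std::size_t row_end) {
    assert(padded_cols >= cols);
    for (std::size_t r = row_begin; r < row_end; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[r * padded_cols + c] = in[r * cols + c];
}

// Hypothetical helper: number of rows given to the first worker,
// where alpha in [0, 1] is the fraction of the input it receives.
std::size_t split_row(std::size_t rows, double alpha) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    return static_cast<std::size_t>(alpha * static_cast<double>(rows));
}

// Out-of-place padding in two independent parts.
// The padding cells keep the zero written when `out` is allocated.
std::vector<int> pad_out_of_place(const std::vector<int>& in,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t padded_cols, double alpha) {
    std::vector<int> out(rows * padded_cols, 0);
    const std::size_t cut = split_row(rows, alpha);
    pad_rows(in, out, cols, padded_cols, 0, cut);     // first part (e.g. the CPU)
    pad_rows(in, out, cols, padded_cols, cut, rows);  // second part (e.g. the GPU)
    return out;
}
```

For a 2 x 3 input padded to 4 columns with `alpha = 0.5`, each part handles one row and writes only to its own rows of the output.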
TRNS relies on multiple dependency cycles for concurrency (see Sung et al., Innovative Parallel Computing, 2012). The elements of a cycle are scattered across the whole matrix, which makes them hard to gather in the -D version. Thus, the input array cannot easily be split into two parts as in PAD and SC.
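A small C++ sketch may help show why the cycles are scattered. It uses the textbook index mapping for transposing a row-major M x N matrix in place, where the element at linear index k moves to (k * M) mod (M * N - 1) and the last element stays where it is. This is only an illustration under that assumption, not the TRNS kernel; `transposed_index` and `cycle_from` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Destination of the element at row-major index k when an M x N matrix
// (M rows, N columns) is transposed in place.
std::size_t transposed_index(std::size_t k, std::size_t M, std::size_t N) {
    const std::size_t last = M * N - 1;
    assert(k <= last);
    return (k == last) ? last : (k * M) % last;
}

// Follow the moves starting at `start` until they return to `start`,
// collecting every index visited along the cycle.
std::vector<std::size_t> cycle_from(std::size_t start,
                                    std::size_t M, std::size_t N) {
    std::vector<std::size_t> cycle;
    std::size_t k = start;
    do {
        cycle.push_back(k);
        k = transposed_index(k, M, N);
    } while (k != start);
    return cycle;
}
```

For a 3 x 5 matrix, `cycle_from(1, 3, 5)` visits indices 1, 3, 9, 13, 11, 5, which span almost the entire 15-element array. A contiguous half of the array therefore contains only fragments of each cycle, so it cannot be handed to one device as an independent piece of work.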

Juan

@trinayan
Author

Hi,

Thanks a lot for this information. Now I understand clearly. Is it also possible to generate larger input sets for Bezier Surface (BS), in a manner similar to the other benchmarks?

Best,
Trinayan

@el1goluj
Member

Yes, for BS you can use:
-m : input size in both dimensions (default=3)
-n : output resolution in both dimensions (default=300)

@robers97

> As stated in the Chai paper, inter-worker synchronization between CPU and GPU workers is not possible without system-wide atomics. For this reason, in the paper we use GPU-only versions for PAD, SC and TRNS when we compare OpenCL-D and OpenCL-U versions (Figure 2).

TRNS appears to be a good benchmark for testing implementations of system-wide atomics, whether in software (https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html) or in hardware (TBD). I see you have the CUDA_8_0 flag in the code, but I haven't gotten it to compile yet, even though I'm running CUDA 9.2.

Can we revisit this together? I'm willing to contribute back.

@el1goluj
Member

Hi robers97,

Could you be more specific about your question?
The CUDA-U version uses system-wide atomics. It was tested with CUDA 8.0 (the first CUDA version with system-wide atomics), so it should work with CUDA 9.2. You will need a Pascal or Volta GPU.
Yes, TRNS is a good benchmark for testing implementations of system-wide atomics. In fact, all CUDA-U benchmarks use them, so they might be useful to you as well.

Thanks,
Juan
