diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..7b65501
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,21 @@
+.ipynb_checkpoints
+*/.ipynb_checkpoints/*
+alk.traj.dcd
+*.simg
+*.so*
+*.a
+*.la
+mgpm
+*.o
+*.out
+*/.ses/*
+*/.log/*
+*/not repo/*
+*/.nsys-rep/*
+*/.sqlite/*
+*/.ncu-rep/*
+not repo/
+_test/
+_advanced/
+README_.md
+
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..0c8078b
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,87 @@
+Contributing
+------------
+
+Please use the following guidelines when contributing to this project.
+
+Before contributing significant changes, please begin a discussion of the desired changes via a GitHub Issue to prevent doing unnecessary or overlapping work.
+
+## License
+
+The preferred license for source code contributed to this project is the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0), and for documentation, including Jupyter notebooks and text documentation, it is the Creative Commons Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/). Contributions under other, compatible licenses will be considered on a case-by-case basis.
+
+## Styling
+
+Please use the following style guidelines when making contributions.
+
+### Source Code
+* Two-space indentation, no tabs
+* To the extent possible, variable names should be descriptive
+* Code should be documented with details such as what a function does and what it returns, making the code readable. Each file should also carry the proper license header at its beginning.
+* Fortran codes should use free-form source files
+* Fortran codes should not use implicit variable names and should use implicit none
+* The following file extensions should be used appropriately
+  * C - .c
+  * C++ - .cpp
+  * CUDA C/C++ - .cu
+  * CUDA Fortran - .cuf
+  * Fortran - .F90
+  * Python - .py
+
+### Jupyter Notebooks & Markdown
+* When they appear inline with the text, directive names, clauses, function or subroutine names, variable names, file names, commands, and command-line arguments should appear between two backticks.
+* Code blocks should begin with three backticks and either 'cpp' or 'fortran' to enable appropriate source formatting, and end with three backticks (see the illustrative example at the end of this document).
+* Leave an empty line before and after the code block.
+* Emphasis, including quotes made for emphasis and the introduction of new terms, should be highlighted between a single pair of asterisks.
+* A level 1 heading should appear at the top of the notebook as the title of the notebook.
+* A horizontal rule should appear between sections that begin with a level 2 heading.
+
+Please refer to the Jupyter notebook styling template in the GitHub repository: misc/jupyter_lab_template
+
+## Contributing Labs/Modules
+
+### Directory structure for GitHub
+
+Before starting work on a new lab, it is important to follow the recommended git structure shown below to avoid any reformatting:
+
+![Github Directory Structure](misc/images/git_repo_structure.jpg)
+
+The top-level directories consist of 3 folders:
+* hpc: This directory contains labs related to HPC (simulation) and parallel computing
+* hpc_ai: This directory contains labs related to the usage of AI/ML/DL for Science/HPC simulations
+* ai: This directory contains labs related to traditional AI/ML/DL techniques and frameworks
+
+Each lab will have the following files/directories containing the training material for the lab:
+* jupyter_notebook folder: Consists of Jupyter notebooks and their corresponding images.
+* source_code folder: Source codes are stored in a separate directory because sometimes not all clusters may support Jupyter notebooks; during such bootcamps, we should be able to use the source codes directly from this directory. The source code folder may optionally contain a Makefile, especially for HPC labs.
+* presentations: Consists of presentations for the labs (pdf format is preferred)
+* Dockerfile and Singularity: Each lab should have both Docker and Singularity recipes.
+
+The lab may optionally also add a custom license in case of any deviation from the top-level directory license (Apache 2.0). The base of the module contains individual subdirectories with versions of the module for each programming language (C/C++/Fortran…). Each of these directories should contain a directory for each language translation provided (English, for instance). Each lab translation and programming language combination should have a solutions directory containing correct solutions.
+
+Additionally, there are two folders, "experimental" and "archived", for labs covering features which are in the early access phase (not stable) or deprecated, respectively.
+
+### Git Branching
+
+Adding a new feature/lab will follow a forking workflow. This means that feature branch development happens on a forked repo, which later gets merged into our original project (GPUHackathons.org) repository.
+
+![Git Branching Workflow](misc/images/git_branching.jpg)
+
+The 5 main steps depicted in the image above are as follows:
+1. Fork: To create a new lab/feature, the GPUHackathons.org repository must be forked. Forking creates a snapshot of the GPUHackathons.org repository at the time it was forked. Any new feature/lab should be developed based on the develop branch of the repository.
+2. Clone: The developer can then clone this new repository to a local machine.
+3. Create Feature Branch: Create a new branch with a feature name in which your changes will be made. The recommended naming convention for feature branches is: hpc-*, hpc-ai-*, ai-*. The new changes that the developer makes can then be added and committed.
+4. Push: After the changes are committed, the developer pushes the changes to the remote branch. The push command uploads the local changes to the GitHub repository.
+5. Pull: Submit a pull request. Upon receiving the pull request, a Hackathon team reviewer/owner will review the changes, and upon acceptance they can be merged into the develop branch of GPUHackathons.org.
+
+Git branch details are as follows:
+
+* master branch: Consists of the stable branch.
+  * origin/master is the main branch where the source code of HEAD always reflects a production-ready state
+  * Merge request is possible through: develop branch
+* develop branch: branched from the master branch
+  * Must branch from: master branch
+  * Must merge back into: master branch
+  * It is the main development branch where the source code of HEAD always reflects a state with the latest delivered development changes for the next release.
+  * When the source code in the develop branch reaches a stable point and is ready to be released, all of the changes should be merged back into master and then tagged with a release number.
+  * All feature development should happen by forking GPUHackathons.org and branching from the develop branch only.
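+
+As a minimal sketch, the forking workflow above might look like the following on the command line (the fork URL, lab name, and branch name are illustrative):
+
+```bash
+# Steps 1 and 2: fork the repository on GitHub, then clone your fork
+git clone https://github.com/<your-username>/HPC_Bootcamp.git
+cd HPC_Bootcamp
+
+# Step 3: create a feature branch based on develop, following the naming convention
+git checkout develop
+git checkout -b hpc-my-new-lab
+
+# Stage and commit your changes on the feature branch
+git add .
+git commit -m "Add new HPC lab"
+
+# Step 4: push the feature branch to your fork
+git push origin hpc-my-new-lab
+
+# Step 5: open a pull request against the develop branch of GPUHackathons.org on GitHub
+```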
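+
+Finally, as referenced in the Styling section above, here is an illustrative example of a conforming notebook code block (a hypothetical C++ snippet, using the two-space indentation required for source code):
+
+```cpp
+// saxpy: computes y = a*x + y over n elements
+void saxpy(int n, float a, const float *x, float *y) {
+  for (int i = 0; i < n; ++i) {
+    y[i] = a * x[i] + y[i];
+  }
+}
+```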
diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..261eeb9 --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. 
Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. 
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..44d5b79
--- /dev/null
+++ b/README.md
@@ -0,0 +1,39 @@
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+
+# HPC_Bootcamp
+
+This repository contains the training materials for the HPC_Bootcamp. The initial two levels of its file structure are as follows:
+
+```
+├── _advanced
+│   ├──
+├── _basic
+│   ├── cuda_basic
+│   ├── iso
+│   ├── openacc_basic
+│   └── openmp
+├── LICENSE
+├── README.md
+├── _scripts
+└── start_notebook
+```
+
+- The `_advanced` directory contains all of the advanced training materials for CUDA, OpenACC, and multi-GPU programming.
+- The `_basic` directory contains all of the introductory training materials for CUDA, Standard Languages, OpenMP Offloading, and OpenACC.
+- The `_scripts` directory contains container definition files for each bootcamp type.
+
+Please note that there is a container definition file, found inside the `_scripts` folder, for each content module in the `_advanced` and `_basic` directories; each can be used on its own without mixing with other content.
+
+### Building the container using the definition files inside the `_scripts` folder
+
+To build the Singularity container, run:
+`sudo singularity build miniapp.simg {Name of the content}_Singularity`. Alternatively, you can use `singularity build --fakeroot miniapp.simg {Name of the content}_Singularity` if you do not have `sudo` rights.
+
+Next, copy the files to a local directory to make sure changes are stored locally:
+`singularity run miniapp.simg cp -rT /labs ~/labs`
+
+Then, run the container:
+`singularity run --nv miniapp.simg jupyter-lab --notebook-dir=~/labs`
+
+Once the container is running, open Jupyter Lab in a browser at http://localhost:8888 and start the lab by clicking on the `_start_{Name of the content}.ipynb` notebook.
+
diff --git a/_basic/LICENSE b/_basic/LICENSE
new file mode 100644
index 0000000..9210285
--- /dev/null
+++ b/_basic/LICENSE
@@ -0,0 +1,23 @@
+Copyright (c) 2018, National Center for Computational Sciences, Oak Ridge National Laboratory
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file
diff --git a/_basic/README.md b/_basic/README.md
new file mode 100644
index 0000000..f2f3105
--- /dev/null
+++ b/_basic/README.md
@@ -0,0 +1,95 @@
+# N Ways to GPU Programming
+This repository contains mini applications for GPU Bootcamps (**Tested on NVIDIA driver 440.82**). This bootcamp comprises the N ways to GPU programming, implemented with the following programming approaches:
+
+**C programming language**
+  - std::par
+  - OpenACC
+  - OpenMP
+  - CUDA
+
+**Fortran programming language**
+  - do-concurrent
+  - OpenACC
+  - OpenMP
+  - CUDA
+
+**Python programming language**
+  - CuPy
+  - Numba
+
+We showcase the above approaches using mini applications in domains like Molecular Dynamics, Computational Fluid Dynamics, etc.
+
+## Target Audience:
+
+The target audience for this bootcamp is researchers/graduate students and developers who are interested in learning about various ways of GPU programming to accelerate their scientific applications. Basic experience with C/C++, Python, or Fortran programming is needed. No GPU programming knowledge is required.
+
+## Tutorial Duration
+
+The N-Ways bootcamp is designed to be modular, and participants can choose one of the following ways to go through the contents of this bootcamp:
+
+- Depth Learning: Choose one GPU programming approach and dive deep into its optimization techniques. This approach is recommended for developers who have already decided on a programming approach and want to learn the best practices for it, e.g., learn the different features of OpenACC C and apply the best programming practices to an application.
+- Breadth Learning: Cover all the N ways to GPU programming at a high level. This approach is recommended for developers starting with GPU programming who are yet to converge on the best available option for porting to GPUs.
+
+Individual labs in the bootcamp take 1 hour each; depending on the path chosen, the total bootcamp can take approximately 8 hours.
+
+## Prerequisites:
+To run this tutorial you will need a machine with an NVIDIA GPU.
+
+- Install [Docker](https://docs.docker.com/get-docker/) or [Singularity](https://sylabs.io/docs/).
+- Install the NVIDIA tools [Nsight Systems (latest version)](https://developer.nvidia.com/nsight-systems) and [Nsight Compute (latest version)](https://developer.nvidia.com/nsight-compute).
+
+## Creating containers
+To start with, you will have to build a Docker or Singularity container.
+
+### Docker Container
+To build a Docker container, specify the dockerfile name using the `-f` flag:
+
+`sudo docker build -f <dockerfile name> -t <imagename>:<tag> .`
+
+For example:
+
+`sudo docker build -f nways_Dockerfile_python -t myimage:1.0 .` to build the container for the Python version of the nways content,
+
+or
+
+`sudo docker build -f nways_Dockerfile -t myimage:1.0 .` to build the container for the C/Fortran version of the nways content.
+
+For C, Fortran, and Python, the code labs have been written using Jupyter Lab, and a Dockerfile has been built to simplify deployment.
+In order to serve the Docker instance for a student, it is necessary to expose port 8888 from the container. For example, the following command would expose port 8888 inside the container as port 8888 on the lab machine:
+
+`sudo docker run --rm -it --gpus=all -p 8888:8888 myimage:1.0`
+
+When this command is run, you can browse to the serving machine on port 8888 using any web browser to access the labs. For instance, if it is running on the local machine, the web browser should be pointed to http://localhost:8888. The `--gpus` flag is used to enable `all` NVIDIA GPUs during container runtime. The `--rm` flag is used to clean up temporary images created while running the container. The `-it` flag enables killing the Jupyter server with `ctrl-c`. This command may be customized for your hosting environment.
+
+Then, inside the container, launch Jupyter Lab, assigning the port you opened:
+
+`jupyter-lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root`
+
+Once inside the container, open Jupyter Lab in a browser at http://localhost:8888 and start the lab by clicking on the `_start_nways.ipynb` notebook.
+
+### Singularity Container
+
+To build the Singularity container for **C & Fortran**, run:
+
+`singularity build --fakeroot nways.simg nways_Singularity`
+
+While in the case of **Python**, run:
+
+`singularity build --fakeroot nways.simg nways_Singularity_python`
+
+Thereafter, for C, Fortran, and Python, copy the files to your local machine to make sure changes are stored locally:
+
+`singularity run nways.simg cp -rT /labs ~/labs`
+
+Then, run the container:
+
+`singularity run --nv nways.simg jupyter-lab --notebook-dir=~/labs`
+
+Once inside the container, open Jupyter Lab in a browser at http://localhost:8888 and start the lab by clicking on the `_start_nways.ipynb` notebook.
+
+
+## Known issues
+- Please go through the list of existing bugs/issues or file a new issue at [GitHub](https://github.com/openhackathons-org/HPC_Bootcamp/issues).
diff --git a/_basic/_common/_start_nways_C_Fortran.ipynb b/_basic/_common/_start_nways_C_Fortran.ipynb
new file mode 100644
index 0000000..f72dcac
--- /dev/null
+++ b/_basic/_common/_start_nways_C_Fortran.ipynb
@@ -0,0 +1,125 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## N Ways to GPU Programming\n",
+    "\n",
+    "## Learning objectives\n",
+    "With the release of CUDA in 2007, different approaches to programming GPUs have evolved. Each approach has its own advantages and disadvantages. By the end of this bootcamp session, students will have a broader perspective on GPU programming approaches to help them select a programming model that better fits their applications' needs and constraints. The bootcamp will teach how to accelerate a popular algorithm, the Radial Distribution Function (RDF), using the following methods:\n",
+    "* Standard: C++ stdpar, Fortran Do-Concurrent\n",
+    "* Directives: OpenACC, OpenMP\n",
+    "* Programming Language Extension: CUDA C, CUDA Fortran\n",
+    "\n",
+    "Let's start with testing the CUDA driver and the GPU you are running the code on in this lab:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-smi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "\n",
+    "### Bootcamp Outline\n",
+    "\n",
+    "We will be following the Analysis - Parallelization - Optimization cycle throughout.\n",
+    "To start with, let us understand the Nsight tools ecosystem:\n",
+    "\n",
+    "- [Nsight Systems](jupyter_notebook/nsight_systems.ipynb)\n",
+    "    - Overview of Nsight profiler tools\n",
+    "    - Introduction to Nsight Systems\n",
+    "    - How to view the report\n",
+    "    - How to use NVTX APIs\n",
+    "    - Optimization steps for parallel programming\n",
+    "\n",
+    "- [Nsight Compute](jupyter_notebook/nsight_compute.ipynb)\n",
+    "    - Introduction to Nsight Compute\n",
+    "    - Overview of sections\n",
+    "    - Roofline Charts\n",
+    "    - Memory Charts\n",
+    "    - Profiling a kernel using the CLI\n",
+    "    - How to view the report\n",
+    "\n",
+    "We will be working on porting a radial distribution function (RDF) to GPUs. Please choose one of the programming languages to proceed working on RDF. Note: Learn about all the terminology used throughout the notebooks in the [GPU Architecture Terminologies](jupyter_notebook/GPU_Architecture_Terminologies.ipynb) notebook.\n",
+    "\n",
+    "Please read the [RDF Overview](jupyter_notebook/rdf_overview.ipynb) to get familiar with how this application works.\n",
+    "\n",
+    "Below is the list of GPU programming approaches we will be covering during this course; click on a link below to start exploring:\n",
+    "\n",
+    "1. [ISO C++ and ISO Fortran](../iso/jupyter_notebook/nways_iso.ipynb)\n",
+    "2. [OpenACC](../openacc/jupyter_notebook/nways_openacc.ipynb)\n",
+    "3. [OpenMP](../openmp/jupyter_notebook/nways_openmp.ipynb)\n",
+    "4. [CUDA](../cuda/jupyter_notebook/nways_cuda.ipynb)\n",
+    "\n",
+    "To finish the lab, let us go through some final [remarks](jupyter_notebook/Final_Remarks.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "### Bootcamp Duration\n",
+    "The lab material will be presented in an 8-hour session. A link to the material is available for download at the end of the lab.\n",
+    "\n",
+    "### Content Level\n",
+    "Beginner, Intermediate\n",
+    "\n",
+    "### Target Audience and Prerequisites\n",
+    "The target audience for this lab is researchers/graduate students and developers who are interested in learning about various ways of programming GPUs to accelerate their scientific applications.\n",
+    "\n",
+    "Basic experience with C/C++ or Fortran programming is needed. No GPU programming knowledge is required.\n",
+    "\n",
+    "-----\n",
+    "\n",
+    "-----\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/_basic/_common/_start_nways_python.ipynb b/_basic/_common/_start_nways_python.ipynb
new file mode 100644
index 0000000..475a5d2
--- /dev/null
+++ b/_basic/_common/_start_nways_python.ipynb
@@ -0,0 +1,125 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## N Ways to GPU Programming\n",
+    "\n",
+    "## Learning Objectives\n",
+    "With the release of NVIDIA CUDA in 2007, different approaches to GPU programming have evolved. Each approach has its own advantages and disadvantages. By the end of this bootcamp session, participants will have a broader perspective on GPU programming approaches to help them select a programming model that better fits their application's needs and constraints. The bootcamp will teach how to accelerate a popular algorithm, the Radial Distribution Function (RDF), using the following methods:\n",
+    "\n",
+    "* Programming Language Extension: CuPy and Numba\n",
+    "\n",
+    "Let's start by testing the CUDA driver and the GPU to be used for running the code in this lab:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-smi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "\n",
+    "### Bootcamp Outline\n",
+    "\n",
+    "Throughout the tutorial we will be following the Analysis - Parallelization - Optimization cycle. Let us begin by understanding the NVIDIA Nsight tools ecosystem:\n",
+    "\n",
+    "- [Nsight Systems](jupyter_notebook/nsight_systems.ipynb)\n",
+    "    - Overview of Nsight profiler tools\n",
+    "    - Introduction to Nsight Systems\n",
+    "    - How to view the report\n",
+    "    - How to use NVTX APIs\n",
+    "    - Optimization steps for parallel programming\n",
+    "\n",
+    "- [Nsight Compute](jupyter_notebook/nsight_compute.ipynb)\n",
+    "    - Introduction to Nsight Compute\n",
+    "    - Overview of sections\n",
+    "    - Roofline Charts\n",
+    "    - Memory Charts\n",
+    "    - Profiling a kernel using the CLI\n",
+    "    - How to view the report\n",
+    "\n",
+    "Note: Learn about all the terminology used throughout the notebooks in the [GPU Architecture Terminologies](jupyter_notebook/GPU_Architecture_Terminologies.ipynb) notebook.\n",
+    "\n",
+    "\n",
+    "We will be working on porting a radial distribution function (RDF) to GPUs. Please choose one approach within the Python programming language to proceed working on RDF.\n",
+    "\n",
+    "\n",
+    "\n",
+    "#### Python Programming Language\n",
+    "\n",
+    "Please read the [RDF Overview](jupyter_notebook/rdf_overview_python.ipynb) to get familiar with how this application works.\n",
+    "\n",
+    "To get started, click on the following GPU programming approaches in Python:\n",
+    "\n",
+    "1. [CuPy](../python/jupyter_notebook/cupy/cupy_guide.ipynb)\n",
+    "2. [Numba](../python/jupyter_notebook/numba/numba_guide.ipynb)\n",
+    "\n",
+    "To round up this tutorial, see some final [remarks on Python](jupyter_notebook/Final_Remarks_python.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Tutorial Duration\n",
+    "The lab material will be presented in an 8-hour session. A link to the material is available for download at the end of the lab.\n",
+    "\n",
+    "### Content Level\n",
+    "Beginner, Intermediate\n",
+    "\n",
+    "### Target Audience and Prerequisites\n",
+    "The target audience for this lab is researchers, graduate students and developers who are interested in learning about various ways of GPU programming to accelerate scientific applications.\n",
+    "\n",
+    "Basic experience with Python programming is needed. No GPU programming knowledge is required.\n",
+    "\n",
+    "-----\n",
+    "\n",
+    "-----\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/dataset.py b/_basic/_common/dataset.py new file mode 100644 index 0000000..f744533 --- /dev/null +++ b/_basic/_common/dataset.py @@ -0,0 +1,10 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +import gdown +import os + +## alk.traj.dcd input file +#url = 'https://drive.google.com/uc?id=1WZ0rtXZ-uMLfy7htT0gaU4EQ_Rq61QTF&export=download' +url = 'https://drive.google.com/u/0/uc?export=download&confirm=jDXw&id=1WZ0rtXZ-uMLfy7htT0gaU4EQ_Rq61QTF' +output_ = '/labs/_common/input/alk.traj.dcd' +gdown.download(url, output_, quiet=False,proxy=None) diff --git a/_basic/_common/dataset_python.py b/_basic/_common/dataset_python.py new file mode 100644 index 0000000..59c90ca --- /dev/null +++ b/_basic/_common/dataset_python.py @@ -0,0 +1,10 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +import gdown +import os + +## alk.traj.dcd input file +#url = 'https://drive.google.com/uc?id=1WZ0rtXZ-uMLfy7htT0gaU4EQ_Rq61QTF&export=download' +url = 'https://drive.google.com/u/0/uc?export=download&confirm=jDXw&id=1WZ0rtXZ-uMLfy7htT0gaU4EQ_Rq61QTF' +output = '/labs/python/source_code/input/alk.traj.dcd' +gdown.download(url, output, quiet=False, proxy=None) diff --git a/_basic/_common/images/2d_array.png b/_basic/_common/images/2d_array.png new file mode 100644 index 0000000..12346c3 Binary files /dev/null and b/_basic/_common/images/2d_array.png differ diff --git a/_basic/_common/images/2d_col_mult.png b/_basic/_common/images/2d_col_mult.png new file mode 100644 index 0000000..a79d207 Binary files /dev/null and b/_basic/_common/images/2d_col_mult.png differ diff --git a/_basic/_common/images/Nsight Diagram.png b/_basic/_common/images/Nsight Diagram.png new file mode 100644 index 0000000..6647175 Binary files /dev/null and b/_basic/_common/images/Nsight Diagram.png differ diff --git a/_basic/_common/images/Optimization_Cycle.jpg b/_basic/_common/images/Optimization_Cycle.jpg new file mode 100644 index 0000000..e41a2e3 Binary files /dev/null and b/_basic/_common/images/Optimization_Cycle.jpg differ diff --git a/_basic/_common/images/SOL-compute.png b/_basic/_common/images/SOL-compute.png new file mode 100644 index 0000000..99a3784 Binary files /dev/null and b/_basic/_common/images/SOL-compute.png differ diff --git a/_basic/_common/images/UM.png b/_basic/_common/images/UM.png new file mode 100644 index 0000000..b68d09a Binary files /dev/null and b/_basic/_common/images/UM.png differ diff --git a/_basic/_common/images/allsection-compute.png b/_basic/_common/images/allsection-compute.png new file mode 100644 index 0000000..5693e3c Binary files /dev/null and b/_basic/_common/images/allsection-compute.png differ diff --git a/_basic/_common/images/baseline-compute.png b/_basic/_common/images/baseline-compute.png new file mode 100644 index 0000000..941f38e Binary files /dev/null and b/_basic/_common/images/baseline-compute.png differ diff --git a/_basic/_common/images/baseline1-compute.png b/_basic/_common/images/baseline1-compute.png new file mode 100644 index 0000000..941f38e Binary files /dev/null and 
b/_basic/_common/images/baseline1-compute.png differ diff --git a/_basic/_common/images/charts-compute.png b/_basic/_common/images/charts-compute.png new file mode 100644 index 0000000..ff9ee5d Binary files /dev/null and b/_basic/_common/images/charts-compute.png differ diff --git a/_basic/_common/images/cli-out.png b/_basic/_common/images/cli-out.png new file mode 100644 index 0000000..690fc32 Binary files /dev/null and b/_basic/_common/images/cli-out.png differ diff --git a/_basic/_common/images/coalesced_mem.png b/_basic/_common/images/coalesced_mem.png new file mode 100644 index 0000000..a652ad5 Binary files /dev/null and b/_basic/_common/images/coalesced_mem.png differ diff --git a/_basic/_common/images/collapse_feedback.png b/_basic/_common/images/collapse_feedback.png new file mode 100644 index 0000000..7e9514d Binary files /dev/null and b/_basic/_common/images/collapse_feedback.png differ diff --git a/_basic/_common/images/collapse_pre.png b/_basic/_common/images/collapse_pre.png new file mode 100644 index 0000000..e9a41b2 Binary files /dev/null and b/_basic/_common/images/collapse_pre.png differ diff --git a/_basic/_common/images/collapse_thread.png b/_basic/_common/images/collapse_thread.png new file mode 100644 index 0000000..f083058 Binary files /dev/null and b/_basic/_common/images/collapse_thread.png differ diff --git a/_basic/_common/images/compute-cli-1.png b/_basic/_common/images/compute-cli-1.png new file mode 100644 index 0000000..7793ba0 Binary files /dev/null and b/_basic/_common/images/compute-cli-1.png differ diff --git a/_basic/_common/images/compute-cli-2.png b/_basic/_common/images/compute-cli-2.png new file mode 100644 index 0000000..ebfb167 Binary files /dev/null and b/_basic/_common/images/compute-cli-2.png differ diff --git a/_basic/_common/images/compute-memory.png b/_basic/_common/images/compute-memory.png new file mode 100644 index 0000000..f549c4f Binary files /dev/null and b/_basic/_common/images/compute-memory.png differ diff --git a/_basic/_common/images/compute-memtable.png b/_basic/_common/images/compute-memtable.png new file mode 100644 index 0000000..a32c4d5 Binary files /dev/null and b/_basic/_common/images/compute-memtable.png differ diff --git a/_basic/_common/images/compute-open.png b/_basic/_common/images/compute-open.png new file mode 100644 index 0000000..761d10d Binary files /dev/null and b/_basic/_common/images/compute-open.png differ diff --git a/_basic/_common/images/compute-sections.png b/_basic/_common/images/compute-sections.png new file mode 100644 index 0000000..a5dbb7d Binary files /dev/null and b/_basic/_common/images/compute-sections.png differ diff --git a/_basic/_common/images/compute-sets.png b/_basic/_common/images/compute-sets.png new file mode 100644 index 0000000..7aed109 Binary files /dev/null and b/_basic/_common/images/compute-sets.png differ diff --git a/_basic/_common/images/compute.png b/_basic/_common/images/compute.png new file mode 100644 index 0000000..12aa606 Binary files /dev/null and b/_basic/_common/images/compute.png differ diff --git a/_basic/_common/images/compute_analyz.png b/_basic/_common/images/compute_analyz.png new file mode 100644 index 0000000..c5608ec Binary files /dev/null and b/_basic/_common/images/compute_analyz.png differ diff --git a/_basic/_common/images/compute_collapse.png b/_basic/_common/images/compute_collapse.png new file mode 100644 index 0000000..01b7c59 Binary files /dev/null and b/_basic/_common/images/compute_collapse.png differ diff --git 
a/_basic/_common/images/compute_collapse_roofline.png b/_basic/_common/images/compute_collapse_roofline.png new file mode 100644 index 0000000..3233669 Binary files /dev/null and b/_basic/_common/images/compute_collapse_roofline.png differ diff --git a/_basic/_common/images/compute_command.png b/_basic/_common/images/compute_command.png new file mode 100644 index 0000000..92af4d6 Binary files /dev/null and b/_basic/_common/images/compute_command.png differ diff --git a/_basic/_common/images/compute_command_line.png b/_basic/_common/images/compute_command_line.png new file mode 100644 index 0000000..724ab78 Binary files /dev/null and b/_basic/_common/images/compute_command_line.png differ diff --git a/_basic/_common/images/compute_open.png b/_basic/_common/images/compute_open.png new file mode 100644 index 0000000..761d10d Binary files /dev/null and b/_basic/_common/images/compute_open.png differ diff --git a/_basic/_common/images/compute_split.png b/_basic/_common/images/compute_split.png new file mode 100644 index 0000000..c849daf Binary files /dev/null and b/_basic/_common/images/compute_split.png differ diff --git a/_basic/_common/images/cpu.png b/_basic/_common/images/cpu.png new file mode 100644 index 0000000..5028832 Binary files /dev/null and b/_basic/_common/images/cpu.png differ diff --git a/_basic/_common/images/cuda.png b/_basic/_common/images/cuda.png new file mode 100644 index 0000000..40fef84 Binary files /dev/null and b/_basic/_common/images/cuda.png differ diff --git a/_basic/_common/images/cuda_cupy.png b/_basic/_common/images/cuda_cupy.png new file mode 100644 index 0000000..ea3919c Binary files /dev/null and b/_basic/_common/images/cuda_cupy.png differ diff --git a/_basic/_common/images/cuda_hw_sw.png b/_basic/_common/images/cuda_hw_sw.png new file mode 100644 index 0000000..b9bc962 Binary files /dev/null and b/_basic/_common/images/cuda_hw_sw.png differ diff --git a/_basic/_common/images/cuda_indexing.png b/_basic/_common/images/cuda_indexing.png new file mode 100644 index 0000000..b58746e Binary files /dev/null and b/_basic/_common/images/cuda_indexing.png differ diff --git a/_basic/_common/images/cuda_profile.png b/_basic/_common/images/cuda_profile.png new file mode 100644 index 0000000..45ed65e Binary files /dev/null and b/_basic/_common/images/cuda_profile.png differ diff --git a/_basic/_common/images/cuda_profile_api.png b/_basic/_common/images/cuda_profile_api.png new file mode 100644 index 0000000..f6b4d50 Binary files /dev/null and b/_basic/_common/images/cuda_profile_api.png differ diff --git a/_basic/_common/images/cuda_profile_timeline.jpg b/_basic/_common/images/cuda_profile_timeline.jpg new file mode 100644 index 0000000..2de4a42 Binary files /dev/null and b/_basic/_common/images/cuda_profile_timeline.jpg differ diff --git a/_basic/_common/images/cuda_profile_timeline.png b/_basic/_common/images/cuda_profile_timeline.png new file mode 100644 index 0000000..dfb0640 Binary files /dev/null and b/_basic/_common/images/cuda_profile_timeline.png differ diff --git a/_basic/_common/images/cuda_vec_add.png b/_basic/_common/images/cuda_vec_add.png new file mode 100644 index 0000000..c7dfcfa Binary files /dev/null and b/_basic/_common/images/cuda_vec_add.png differ diff --git a/_basic/_common/images/cuda_vec_add2.png b/_basic/_common/images/cuda_vec_add2.png new file mode 100644 index 0000000..e528086 Binary files /dev/null and b/_basic/_common/images/cuda_vec_add2.png differ diff --git a/_basic/_common/images/cupy.JPG b/_basic/_common/images/cupy.JPG new file mode 
100644 index 0000000..d78844f Binary files /dev/null and b/_basic/_common/images/cupy.JPG differ diff --git a/_basic/_common/images/cupy_arch.png b/_basic/_common/images/cupy_arch.png new file mode 100644 index 0000000..e8a1f31 Binary files /dev/null and b/_basic/_common/images/cupy_arch.png differ diff --git a/_basic/_common/images/cupy_intro.png b/_basic/_common/images/cupy_intro.png new file mode 100644 index 0000000..395e504 Binary files /dev/null and b/_basic/_common/images/cupy_intro.png differ diff --git a/_basic/_common/images/cupy_kernel_memory.png b/_basic/_common/images/cupy_kernel_memory.png new file mode 100644 index 0000000..8dfe11d Binary files /dev/null and b/_basic/_common/images/cupy_kernel_memory.png differ diff --git a/_basic/_common/images/cupy_nsys1.png b/_basic/_common/images/cupy_nsys1.png new file mode 100644 index 0000000..883f419 Binary files /dev/null and b/_basic/_common/images/cupy_nsys1.png differ diff --git a/_basic/_common/images/cupy_nsys2.png b/_basic/_common/images/cupy_nsys2.png new file mode 100644 index 0000000..09f6aef Binary files /dev/null and b/_basic/_common/images/cupy_nsys2.png differ diff --git a/_basic/_common/images/cupy_nsys3.png b/_basic/_common/images/cupy_nsys3.png new file mode 100644 index 0000000..a263dcb Binary files /dev/null and b/_basic/_common/images/cupy_nsys3.png differ diff --git a/_basic/_common/images/cupy_summary.png b/_basic/_common/images/cupy_summary.png new file mode 100644 index 0000000..ce26ee5 Binary files /dev/null and b/_basic/_common/images/cupy_summary.png differ diff --git a/_basic/_common/images/data_feedback.png b/_basic/_common/images/data_feedback.png new file mode 100644 index 0000000..1ae96e8 Binary files /dev/null and b/_basic/_common/images/data_feedback.png differ diff --git a/_basic/_common/images/data_thread.png b/_basic/_common/images/data_thread.png new file mode 100644 index 0000000..dada530 Binary files /dev/null and b/_basic/_common/images/data_thread.png differ diff --git a/_basic/_common/images/dcdfile.png b/_basic/_common/images/dcdfile.png new file mode 100644 index 0000000..92df10f Binary files /dev/null and b/_basic/_common/images/dcdfile.png differ diff --git a/_basic/_common/images/diagram.png b/_basic/_common/images/diagram.png new file mode 100644 index 0000000..8901c65 Binary files /dev/null and b/_basic/_common/images/diagram.png differ diff --git a/_basic/_common/images/do_concurrent_gpu.jpg b/_basic/_common/images/do_concurrent_gpu.jpg new file mode 100644 index 0000000..d01a024 Binary files /dev/null and b/_basic/_common/images/do_concurrent_gpu.jpg differ diff --git a/_basic/_common/images/do_concurrent_multicore.jpg b/_basic/_common/images/do_concurrent_multicore.jpg new file mode 100644 index 0000000..0085d79 Binary files /dev/null and b/_basic/_common/images/do_concurrent_multicore.jpg differ diff --git a/_basic/_common/images/expand-compute.png b/_basic/_common/images/expand-compute.png new file mode 100644 index 0000000..1e00c24 Binary files /dev/null and b/_basic/_common/images/expand-compute.png differ diff --git a/_basic/_common/images/f_collapse_feedback.png b/_basic/_common/images/f_collapse_feedback.png new file mode 100644 index 0000000..3c59e3f Binary files /dev/null and b/_basic/_common/images/f_collapse_feedback.png differ diff --git a/_basic/_common/images/f_collapse_thread.png b/_basic/_common/images/f_collapse_thread.png new file mode 100644 index 0000000..00d666e Binary files /dev/null and b/_basic/_common/images/f_collapse_thread.png differ diff --git 
a/_basic/_common/images/f_compute_analyz.png b/_basic/_common/images/f_compute_analyz.png new file mode 100644 index 0000000..dbf18f4 Binary files /dev/null and b/_basic/_common/images/f_compute_analyz.png differ diff --git a/_basic/_common/images/f_compute_command.png b/_basic/_common/images/f_compute_command.png new file mode 100644 index 0000000..5988e89 Binary files /dev/null and b/_basic/_common/images/f_compute_command.png differ diff --git a/_basic/_common/images/f_data_thread.png b/_basic/_common/images/f_data_thread.png new file mode 100644 index 0000000..8abafa8 Binary files /dev/null and b/_basic/_common/images/f_data_thread.png differ diff --git a/_basic/_common/images/f_gang_128.png b/_basic/_common/images/f_gang_128.png new file mode 100644 index 0000000..c2b1088 Binary files /dev/null and b/_basic/_common/images/f_gang_128.png differ diff --git a/_basic/_common/images/f_gang_32.png b/_basic/_common/images/f_gang_32.png new file mode 100644 index 0000000..4b99ee9 Binary files /dev/null and b/_basic/_common/images/f_gang_32.png differ diff --git a/_basic/_common/images/f_gang_vector.png b/_basic/_common/images/f_gang_vector.png new file mode 100644 index 0000000..bb0748c Binary files /dev/null and b/_basic/_common/images/f_gang_vector.png differ diff --git a/_basic/_common/images/f_memory_collapse.png b/_basic/_common/images/f_memory_collapse.png new file mode 100644 index 0000000..b055d93 Binary files /dev/null and b/_basic/_common/images/f_memory_collapse.png differ diff --git a/_basic/_common/images/f_memory_sec.png b/_basic/_common/images/f_memory_sec.png new file mode 100644 index 0000000..c263bd7 Binary files /dev/null and b/_basic/_common/images/f_memory_sec.png differ diff --git a/_basic/_common/images/f_offload_compare_nvtx.png b/_basic/_common/images/f_offload_compare_nvtx.png new file mode 100644 index 0000000..cab4feb Binary files /dev/null and b/_basic/_common/images/f_offload_compare_nvtx.png differ diff --git a/_basic/_common/images/f_offload_grid.png b/_basic/_common/images/f_offload_grid.png new file mode 100644 index 0000000..37ec2f5 Binary files /dev/null and b/_basic/_common/images/f_offload_grid.png differ diff --git a/_basic/_common/images/f_openacc_data_directive.png b/_basic/_common/images/f_openacc_data_directive.png new file mode 100644 index 0000000..165587d Binary files /dev/null and b/_basic/_common/images/f_openacc_data_directive.png differ diff --git a/_basic/_common/images/f_openmp_collapse_baseline.png b/_basic/_common/images/f_openmp_collapse_baseline.png new file mode 100644 index 0000000..2971bbb Binary files /dev/null and b/_basic/_common/images/f_openmp_collapse_baseline.png differ diff --git a/_basic/_common/images/f_openmp_collapse_reg.png b/_basic/_common/images/f_openmp_collapse_reg.png new file mode 100644 index 0000000..edb8aef Binary files /dev/null and b/_basic/_common/images/f_openmp_collapse_reg.png differ diff --git a/_basic/_common/images/f_openmp_collapse_reg_memory.png b/_basic/_common/images/f_openmp_collapse_reg_memory.png new file mode 100644 index 0000000..f7cc506 Binary files /dev/null and b/_basic/_common/images/f_openmp_collapse_reg_memory.png differ diff --git a/_basic/_common/images/f_openmp_collapse_reg_occupancy.png b/_basic/_common/images/f_openmp_collapse_reg_occupancy.png new file mode 100644 index 0000000..d718628 Binary files /dev/null and b/_basic/_common/images/f_openmp_collapse_reg_occupancy.png differ diff --git a/_basic/_common/images/f_openmp_collapse_reg_roofline.png 
b/_basic/_common/images/f_openmp_collapse_reg_roofline.png new file mode 100644 index 0000000..4b1bfea Binary files /dev/null and b/_basic/_common/images/f_openmp_collapse_reg_roofline.png differ diff --git a/_basic/_common/images/f_openmp_feedback_offload_split.png b/_basic/_common/images/f_openmp_feedback_offload_split.png new file mode 100644 index 0000000..4f95060 Binary files /dev/null and b/_basic/_common/images/f_openmp_feedback_offload_split.png differ diff --git a/_basic/_common/images/f_openmp_gpu.png b/_basic/_common/images/f_openmp_gpu.png new file mode 100644 index 0000000..70e11d4 Binary files /dev/null and b/_basic/_common/images/f_openmp_gpu.png differ diff --git a/_basic/_common/images/f_openmp_gpu_collapse.png b/_basic/_common/images/f_openmp_gpu_collapse.png new file mode 100644 index 0000000..14dd684 Binary files /dev/null and b/_basic/_common/images/f_openmp_gpu_collapse.png differ diff --git a/_basic/_common/images/f_openmp_multicore.png b/_basic/_common/images/f_openmp_multicore.png new file mode 100644 index 0000000..230a605 Binary files /dev/null and b/_basic/_common/images/f_openmp_multicore.png differ diff --git a/_basic/_common/images/f_openmp_offload_collapse.png b/_basic/_common/images/f_openmp_offload_collapse.png new file mode 100644 index 0000000..c6aa3e8 Binary files /dev/null and b/_basic/_common/images/f_openmp_offload_collapse.png differ diff --git a/_basic/_common/images/f_openmp_offload_occupancy.png b/_basic/_common/images/f_openmp_offload_occupancy.png new file mode 100644 index 0000000..57fc0fe Binary files /dev/null and b/_basic/_common/images/f_openmp_offload_occupancy.png differ diff --git a/_basic/_common/images/f_openmp_offload_roofline.png b/_basic/_common/images/f_openmp_offload_roofline.png new file mode 100644 index 0000000..454b081 Binary files /dev/null and b/_basic/_common/images/f_openmp_offload_roofline.png differ diff --git a/_basic/_common/images/f_openmp_offload_split_cmp.png b/_basic/_common/images/f_openmp_offload_split_cmp.png new file mode 100644 index 0000000..1c2cb82 Binary files /dev/null and b/_basic/_common/images/f_openmp_offload_split_cmp.png differ diff --git a/_basic/_common/images/f_openmp_offload_split_cmp2.png b/_basic/_common/images/f_openmp_offload_split_cmp2.png new file mode 100644 index 0000000..74c3cef Binary files /dev/null and b/_basic/_common/images/f_openmp_offload_split_cmp2.png differ diff --git a/_basic/_common/images/f_openmp_offload_split_grid.png b/_basic/_common/images/f_openmp_offload_split_grid.png new file mode 100644 index 0000000..8fb24f3 Binary files /dev/null and b/_basic/_common/images/f_openmp_offload_split_grid.png differ diff --git a/_basic/_common/images/f_openmp_warp_cmp.png b/_basic/_common/images/f_openmp_warp_cmp.png new file mode 100644 index 0000000..7ed2faa Binary files /dev/null and b/_basic/_common/images/f_openmp_warp_cmp.png differ diff --git a/_basic/_common/images/f_roofline_collapse.png b/_basic/_common/images/f_roofline_collapse.png new file mode 100644 index 0000000..16d4880 Binary files /dev/null and b/_basic/_common/images/f_roofline_collapse.png differ diff --git a/_basic/_common/images/f_sol.png b/_basic/_common/images/f_sol.png new file mode 100644 index 0000000..4b1a844 Binary files /dev/null and b/_basic/_common/images/f_sol.png differ diff --git a/_basic/_common/images/f_sol_baseline.png b/_basic/_common/images/f_sol_baseline.png new file mode 100644 index 0000000..15af4ee Binary files /dev/null and b/_basic/_common/images/f_sol_baseline.png differ diff --git 
a/_basic/_common/images/f_source_hover.png b/_basic/_common/images/f_source_hover.png new file mode 100644 index 0000000..8d923e3 Binary files /dev/null and b/_basic/_common/images/f_source_hover.png differ diff --git a/_basic/_common/images/f_source_loc.png b/_basic/_common/images/f_source_loc.png new file mode 100644 index 0000000..ef5f4ac Binary files /dev/null and b/_basic/_common/images/f_source_loc.png differ diff --git a/_basic/_common/images/f_source_sass.png b/_basic/_common/images/f_source_sass.png new file mode 100644 index 0000000..6ea98ef Binary files /dev/null and b/_basic/_common/images/f_source_sass.png differ diff --git a/_basic/_common/images/gang_128.png b/_basic/_common/images/gang_128.png new file mode 100644 index 0000000..356fff1 Binary files /dev/null and b/_basic/_common/images/gang_128.png differ diff --git a/_basic/_common/images/gang_256.png b/_basic/_common/images/gang_256.png new file mode 100644 index 0000000..d8ba8ad Binary files /dev/null and b/_basic/_common/images/gang_256.png differ diff --git a/_basic/_common/images/gang_32.png b/_basic/_common/images/gang_32.png new file mode 100644 index 0000000..9f7d137 Binary files /dev/null and b/_basic/_common/images/gang_32.png differ diff --git a/_basic/_common/images/gang_vector.png b/_basic/_common/images/gang_vector.png new file mode 100644 index 0000000..09814ee Binary files /dev/null and b/_basic/_common/images/gang_vector.png differ diff --git a/_basic/_common/images/gpu_feedback.png b/_basic/_common/images/gpu_feedback.png new file mode 100644 index 0000000..97ef3db Binary files /dev/null and b/_basic/_common/images/gpu_feedback.png differ diff --git a/_basic/_common/images/grid.png b/_basic/_common/images/grid.png new file mode 100644 index 0000000..49447ba Binary files /dev/null and b/_basic/_common/images/grid.png differ diff --git a/_basic/_common/images/header-compute.png b/_basic/_common/images/header-compute.png new file mode 100644 index 0000000..ee808ac Binary files /dev/null and b/_basic/_common/images/header-compute.png differ diff --git a/_basic/_common/images/kernel_feedback.png b/_basic/_common/images/kernel_feedback.png new file mode 100644 index 0000000..1040558 Binary files /dev/null and b/_basic/_common/images/kernel_feedback.png differ diff --git a/_basic/_common/images/kernel_indep_feedback.png b/_basic/_common/images/kernel_indep_feedback.png new file mode 100644 index 0000000..1a637a4 Binary files /dev/null and b/_basic/_common/images/kernel_indep_feedback.png differ diff --git a/_basic/_common/images/kokkos_abstraction.png b/_basic/_common/images/kokkos_abstraction.png new file mode 100644 index 0000000..6326813 Binary files /dev/null and b/_basic/_common/images/kokkos_abstraction.png differ diff --git a/_basic/_common/images/kokkos_ecosystem.png b/_basic/_common/images/kokkos_ecosystem.png new file mode 100644 index 0000000..6b0f996 Binary files /dev/null and b/_basic/_common/images/kokkos_ecosystem.png differ diff --git a/_basic/_common/images/kokkos_mirror_view.png b/_basic/_common/images/kokkos_mirror_view.png new file mode 100644 index 0000000..9058ecb Binary files /dev/null and b/_basic/_common/images/kokkos_mirror_view.png differ diff --git a/_basic/_common/images/laplas3.png b/_basic/_common/images/laplas3.png new file mode 100644 index 0000000..4d66974 Binary files /dev/null and b/_basic/_common/images/laplas3.png differ diff --git a/_basic/_common/images/launch-compute.png b/_basic/_common/images/launch-compute.png new file mode 100644 index 0000000..d322a55 Binary files 
/dev/null and b/_basic/_common/images/launch-compute.png differ diff --git a/_basic/_common/images/mapping.png b/_basic/_common/images/mapping.png new file mode 100644 index 0000000..0fbff9d Binary files /dev/null and b/_basic/_common/images/mapping.png differ diff --git a/_basic/_common/images/matrix.png b/_basic/_common/images/matrix.png new file mode 100644 index 0000000..fd38e0e Binary files /dev/null and b/_basic/_common/images/matrix.png differ diff --git a/_basic/_common/images/matrix_block.png b/_basic/_common/images/matrix_block.png new file mode 100644 index 0000000..b851331 Binary files /dev/null and b/_basic/_common/images/matrix_block.png differ diff --git a/_basic/_common/images/matrix_grid.png b/_basic/_common/images/matrix_grid.png new file mode 100644 index 0000000..4160a44 Binary files /dev/null and b/_basic/_common/images/matrix_grid.png differ diff --git a/_basic/_common/images/memory-compute.png b/_basic/_common/images/memory-compute.png new file mode 100644 index 0000000..6c5e152 Binary files /dev/null and b/_basic/_common/images/memory-compute.png differ diff --git a/_basic/_common/images/memory.png b/_basic/_common/images/memory.png new file mode 100644 index 0000000..67d85b6 Binary files /dev/null and b/_basic/_common/images/memory.png differ diff --git a/_basic/_common/images/memory_architecture.png b/_basic/_common/images/memory_architecture.png new file mode 100644 index 0000000..f16a08f Binary files /dev/null and b/_basic/_common/images/memory_architecture.png differ diff --git a/_basic/_common/images/ngc_error.PNG b/_basic/_common/images/ngc_error.PNG new file mode 100644 index 0000000..e01b696 Binary files /dev/null and b/_basic/_common/images/ngc_error.PNG differ diff --git a/_basic/_common/images/nsight_open.png b/_basic/_common/images/nsight_open.png new file mode 100644 index 0000000..f5f780b Binary files /dev/null and b/_basic/_common/images/nsight_open.png differ diff --git a/_basic/_common/images/nsys-compute-command.png b/_basic/_common/images/nsys-compute-command.png new file mode 100644 index 0000000..5977c0b Binary files /dev/null and b/_basic/_common/images/nsys-compute-command.png differ diff --git a/_basic/_common/images/nsys-compute-command1.png b/_basic/_common/images/nsys-compute-command1.png new file mode 100644 index 0000000..69e1efa Binary files /dev/null and b/_basic/_common/images/nsys-compute-command1.png differ diff --git a/_basic/_common/images/nsys-compute-command2.png b/_basic/_common/images/nsys-compute-command2.png new file mode 100644 index 0000000..1953b6b Binary files /dev/null and b/_basic/_common/images/nsys-compute-command2.png differ diff --git a/_basic/_common/images/numba_nsys1.png b/_basic/_common/images/numba_nsys1.png new file mode 100644 index 0000000..cd10883 Binary files /dev/null and b/_basic/_common/images/numba_nsys1.png differ diff --git a/_basic/_common/images/numba_nsys2.png b/_basic/_common/images/numba_nsys2.png new file mode 100644 index 0000000..0a8f677 Binary files /dev/null and b/_basic/_common/images/numba_nsys2.png differ diff --git a/_basic/_common/images/numba_output_files.png b/_basic/_common/images/numba_output_files.png new file mode 100644 index 0000000..22930c9 Binary files /dev/null and b/_basic/_common/images/numba_output_files.png differ diff --git a/_basic/_common/images/numba_summary.png b/_basic/_common/images/numba_summary.png new file mode 100644 index 0000000..03fcc4b Binary files /dev/null and b/_basic/_common/images/numba_summary.png differ diff --git 
a/_basic/_common/images/numba_summary1.png b/_basic/_common/images/numba_summary1.png new file mode 100644 index 0000000..1f4df1b Binary files /dev/null and b/_basic/_common/images/numba_summary1.png differ diff --git a/_basic/_common/images/nvtx.PNG b/_basic/_common/images/nvtx.PNG new file mode 100644 index 0000000..1d01e85 Binary files /dev/null and b/_basic/_common/images/nvtx.PNG differ diff --git a/_basic/_common/images/nvtx_gpu.png b/_basic/_common/images/nvtx_gpu.png new file mode 100644 index 0000000..e092221 Binary files /dev/null and b/_basic/_common/images/nvtx_gpu.png differ diff --git a/_basic/_common/images/nvtx_multicore (copy).png b/_basic/_common/images/nvtx_multicore (copy).png new file mode 100644 index 0000000..1b3322e Binary files /dev/null and b/_basic/_common/images/nvtx_multicore (copy).png differ diff --git a/_basic/_common/images/nvtx_multicore.jpg b/_basic/_common/images/nvtx_multicore.jpg new file mode 100644 index 0000000..4943732 Binary files /dev/null and b/_basic/_common/images/nvtx_multicore.jpg differ diff --git a/_basic/_common/images/nvtx_multicore.png b/_basic/_common/images/nvtx_multicore.png new file mode 100644 index 0000000..1b3322e Binary files /dev/null and b/_basic/_common/images/nvtx_multicore.png differ diff --git a/_basic/_common/images/nvtx_serial.jpg b/_basic/_common/images/nvtx_serial.jpg new file mode 100644 index 0000000..9829c7c Binary files /dev/null and b/_basic/_common/images/nvtx_serial.jpg differ diff --git a/_basic/_common/images/nvtx_serial.png b/_basic/_common/images/nvtx_serial.png new file mode 100644 index 0000000..e962aeb Binary files /dev/null and b/_basic/_common/images/nvtx_serial.png differ diff --git a/_basic/_common/images/openacc correlation.jpg b/_basic/_common/images/openacc correlation.jpg new file mode 100644 index 0000000..78e306f Binary files /dev/null and b/_basic/_common/images/openacc correlation.jpg differ diff --git a/_basic/_common/images/openacc correlation.png b/_basic/_common/images/openacc correlation.png new file mode 100644 index 0000000..dd3c03c Binary files /dev/null and b/_basic/_common/images/openacc correlation.png differ diff --git a/_basic/_common/images/openacc_3_directives.png b/_basic/_common/images/openacc_3_directives.png new file mode 100644 index 0000000..294eca0 Binary files /dev/null and b/_basic/_common/images/openacc_3_directives.png differ diff --git a/_basic/_common/images/openacc_construct.jpg b/_basic/_common/images/openacc_construct.jpg new file mode 100644 index 0000000..101c043 Binary files /dev/null and b/_basic/_common/images/openacc_construct.jpg differ diff --git a/_basic/_common/images/openacc_construct.png b/_basic/_common/images/openacc_construct.png new file mode 100644 index 0000000..075a05c Binary files /dev/null and b/_basic/_common/images/openacc_construct.png differ diff --git a/_basic/_common/images/openacc_copyclause.png b/_basic/_common/images/openacc_copyclause.png new file mode 100644 index 0000000..8349ac3 Binary files /dev/null and b/_basic/_common/images/openacc_copyclause.png differ diff --git a/_basic/_common/images/openacc_multicore_feedback.png b/_basic/_common/images/openacc_multicore_feedback.png new file mode 100644 index 0000000..e8919ea Binary files /dev/null and b/_basic/_common/images/openacc_multicore_feedback.png differ diff --git a/_basic/_common/images/openacc_parallel.png b/_basic/_common/images/openacc_parallel.png new file mode 100644 index 0000000..3adf566 Binary files /dev/null and b/_basic/_common/images/openacc_parallel.png differ 
diff --git a/_basic/_common/images/openacc_parallel2.png b/_basic/_common/images/openacc_parallel2.png new file mode 100644 index 0000000..9737870 Binary files /dev/null and b/_basic/_common/images/openacc_parallel2.png differ diff --git a/_basic/_common/images/openacc_parallel_loop.png b/_basic/_common/images/openacc_parallel_loop.png new file mode 100644 index 0000000..eb40581 Binary files /dev/null and b/_basic/_common/images/openacc_parallel_loop.png differ diff --git a/_basic/_common/images/openmp_collapse_baseline.png b/_basic/_common/images/openmp_collapse_baseline.png new file mode 100644 index 0000000..f162f0b Binary files /dev/null and b/_basic/_common/images/openmp_collapse_baseline.png differ diff --git a/_basic/_common/images/openmp_collapse_reg.png b/_basic/_common/images/openmp_collapse_reg.png new file mode 100644 index 0000000..81dab1f Binary files /dev/null and b/_basic/_common/images/openmp_collapse_reg.png differ diff --git a/_basic/_common/images/openmp_collapse_reg_memory.png b/_basic/_common/images/openmp_collapse_reg_memory.png new file mode 100644 index 0000000..baef230 Binary files /dev/null and b/_basic/_common/images/openmp_collapse_reg_memory.png differ diff --git a/_basic/_common/images/openmp_collapse_reg_occupancy.png b/_basic/_common/images/openmp_collapse_reg_occupancy.png new file mode 100644 index 0000000..9d8ea09 Binary files /dev/null and b/_basic/_common/images/openmp_collapse_reg_occupancy.png differ diff --git a/_basic/_common/images/openmp_collapse_reg_roofline.png b/_basic/_common/images/openmp_collapse_reg_roofline.png new file mode 100644 index 0000000..cbc2a1e Binary files /dev/null and b/_basic/_common/images/openmp_collapse_reg_roofline.png differ diff --git a/_basic/_common/images/openmp_feedback.png b/_basic/_common/images/openmp_feedback.png new file mode 100644 index 0000000..0846a67 Binary files /dev/null and b/_basic/_common/images/openmp_feedback.png differ diff --git a/_basic/_common/images/openmp_feedback_collapse.png b/_basic/_common/images/openmp_feedback_collapse.png new file mode 100644 index 0000000..5fcbc1c Binary files /dev/null and b/_basic/_common/images/openmp_feedback_collapse.png differ diff --git a/_basic/_common/images/openmp_feedback_multicore.png b/_basic/_common/images/openmp_feedback_multicore.png new file mode 100644 index 0000000..d354d64 Binary files /dev/null and b/_basic/_common/images/openmp_feedback_multicore.png differ diff --git a/_basic/_common/images/openmp_feedback_offload_split.png b/_basic/_common/images/openmp_feedback_offload_split.png new file mode 100644 index 0000000..70a075c Binary files /dev/null and b/_basic/_common/images/openmp_feedback_offload_split.png differ diff --git a/_basic/_common/images/openmp_fork_join.png b/_basic/_common/images/openmp_fork_join.png new file mode 100644 index 0000000..0d15aa2 Binary files /dev/null and b/_basic/_common/images/openmp_fork_join.png differ diff --git a/_basic/_common/images/openmp_gpu.png b/_basic/_common/images/openmp_gpu.png new file mode 100644 index 0000000..9e40673 Binary files /dev/null and b/_basic/_common/images/openmp_gpu.png differ diff --git a/_basic/_common/images/openmp_gpu_collapse.png b/_basic/_common/images/openmp_gpu_collapse.png new file mode 100644 index 0000000..cb3f61c Binary files /dev/null and b/_basic/_common/images/openmp_gpu_collapse.png differ diff --git a/_basic/_common/images/openmp_multicore.png b/_basic/_common/images/openmp_multicore.png new file mode 100644 index 0000000..63f3f41 Binary files /dev/null and 
b/_basic/_common/images/openmp_multicore.png differ diff --git a/_basic/_common/images/openmp_offload_collapse.png b/_basic/_common/images/openmp_offload_collapse.png new file mode 100644 index 0000000..595a098 Binary files /dev/null and b/_basic/_common/images/openmp_offload_collapse.png differ diff --git a/_basic/_common/images/openmp_offload_occupancy.png b/_basic/_common/images/openmp_offload_occupancy.png new file mode 100644 index 0000000..9a7f9bc Binary files /dev/null and b/_basic/_common/images/openmp_offload_occupancy.png differ diff --git a/_basic/_common/images/openmp_offload_roofline.png b/_basic/_common/images/openmp_offload_roofline.png new file mode 100644 index 0000000..4fa4dd3 Binary files /dev/null and b/_basic/_common/images/openmp_offload_roofline.png differ diff --git a/_basic/_common/images/openmp_offload_split_cmp.png b/_basic/_common/images/openmp_offload_split_cmp.png new file mode 100644 index 0000000..1234189 Binary files /dev/null and b/_basic/_common/images/openmp_offload_split_cmp.png differ diff --git a/_basic/_common/images/openmp_offload_split_cmp2.png b/_basic/_common/images/openmp_offload_split_cmp2.png new file mode 100644 index 0000000..ec1e3bf Binary files /dev/null and b/_basic/_common/images/openmp_offload_split_cmp2.png differ diff --git a/_basic/_common/images/openmp_offload_split_grid.png b/_basic/_common/images/openmp_offload_split_grid.png new file mode 100644 index 0000000..8367846 Binary files /dev/null and b/_basic/_common/images/openmp_offload_split_grid.png differ diff --git a/_basic/_common/images/openmp_parallel_construct.png b/_basic/_common/images/openmp_parallel_construct.png new file mode 100644 index 0000000..2b64c5c Binary files /dev/null and b/_basic/_common/images/openmp_parallel_construct.png differ diff --git a/_basic/_common/images/openmp_parallelfor_construct.png b/_basic/_common/images/openmp_parallelfor_construct.png new file mode 100644 index 0000000..17f39ee Binary files /dev/null and b/_basic/_common/images/openmp_parallelfor_construct.png differ diff --git a/_basic/_common/images/openmp_target_distribute.png b/_basic/_common/images/openmp_target_distribute.png new file mode 100644 index 0000000..49ac0ce Binary files /dev/null and b/_basic/_common/images/openmp_target_distribute.png differ diff --git a/_basic/_common/images/openmp_target_teams.png b/_basic/_common/images/openmp_target_teams.png new file mode 100644 index 0000000..94a23a4 Binary files /dev/null and b/_basic/_common/images/openmp_target_teams.png differ diff --git a/_basic/_common/images/openmp_teams.png b/_basic/_common/images/openmp_teams.png new file mode 100644 index 0000000..53faac5 Binary files /dev/null and b/_basic/_common/images/openmp_teams.png differ diff --git a/_basic/_common/images/openmp_teams_for.png b/_basic/_common/images/openmp_teams_for.png new file mode 100644 index 0000000..2ac1424 Binary files /dev/null and b/_basic/_common/images/openmp_teams_for.png differ diff --git a/_basic/_common/images/openmp_warp_cmp.png b/_basic/_common/images/openmp_warp_cmp.png new file mode 100644 index 0000000..e725e26 Binary files /dev/null and b/_basic/_common/images/openmp_warp_cmp.png differ diff --git a/_basic/_common/images/output_files.png b/_basic/_common/images/output_files.png new file mode 100644 index 0000000..df6158e Binary files /dev/null and b/_basic/_common/images/output_files.png differ diff --git a/_basic/_common/images/page-compute.png b/_basic/_common/images/page-compute.png new file mode 100644 index 0000000..74040a6 Binary files 
/dev/null and b/_basic/_common/images/page-compute.png differ diff --git a/_basic/_common/images/pair_gpu.png b/_basic/_common/images/pair_gpu.png new file mode 100644 index 0000000..c4c152b Binary files /dev/null and b/_basic/_common/images/pair_gpu.png differ diff --git a/_basic/_common/images/pair_gpu_analysis.png b/_basic/_common/images/pair_gpu_analysis.png new file mode 100644 index 0000000..ec40841 Binary files /dev/null and b/_basic/_common/images/pair_gpu_analysis.png differ diff --git a/_basic/_common/images/parallel1f.png b/_basic/_common/images/parallel1f.png new file mode 100644 index 0000000..7d5edff Binary files /dev/null and b/_basic/_common/images/parallel1f.png differ diff --git a/_basic/_common/images/parallel2f.png b/_basic/_common/images/parallel2f.png new file mode 100644 index 0000000..3a563c6 Binary files /dev/null and b/_basic/_common/images/parallel2f.png differ diff --git a/_basic/_common/images/parallel3f.png b/_basic/_common/images/parallel3f.png new file mode 100644 index 0000000..b6d8021 Binary files /dev/null and b/_basic/_common/images/parallel3f.png differ diff --git a/_basic/_common/images/parallel_data.jpg b/_basic/_common/images/parallel_data.jpg new file mode 100644 index 0000000..13956b2 Binary files /dev/null and b/_basic/_common/images/parallel_data.jpg differ diff --git a/_basic/_common/images/parallel_data.png b/_basic/_common/images/parallel_data.png new file mode 100644 index 0000000..2ff3445 Binary files /dev/null and b/_basic/_common/images/parallel_data.png differ diff --git a/_basic/_common/images/parallel_data_feedback.png b/_basic/_common/images/parallel_data_feedback.png new file mode 100644 index 0000000..5adcb1a Binary files /dev/null and b/_basic/_common/images/parallel_data_feedback.png differ diff --git a/_basic/_common/images/parallel_detailed.png b/_basic/_common/images/parallel_detailed.png new file mode 100644 index 0000000..51ffad6 Binary files /dev/null and b/_basic/_common/images/parallel_detailed.png differ diff --git a/_basic/_common/images/parallel_expand.jpg b/_basic/_common/images/parallel_expand.jpg new file mode 100644 index 0000000..24c1948 Binary files /dev/null and b/_basic/_common/images/parallel_expand.jpg differ diff --git a/_basic/_common/images/parallel_expand.png b/_basic/_common/images/parallel_expand.png new file mode 100644 index 0000000..7782f68 Binary files /dev/null and b/_basic/_common/images/parallel_expand.png differ diff --git a/_basic/_common/images/parallel_loop.png b/_basic/_common/images/parallel_loop.png new file mode 100644 index 0000000..0fe83b9 Binary files /dev/null and b/_basic/_common/images/parallel_loop.png differ diff --git a/_basic/_common/images/parallel_timeline.jpg b/_basic/_common/images/parallel_timeline.jpg new file mode 100644 index 0000000..9c1f6fc Binary files /dev/null and b/_basic/_common/images/parallel_timeline.jpg differ diff --git a/_basic/_common/images/parallel_timeline.png b/_basic/_common/images/parallel_timeline.png new file mode 100644 index 0000000..3d10d2c Binary files /dev/null and b/_basic/_common/images/parallel_timeline.png differ diff --git a/_basic/_common/images/parallel_unified.jpg b/_basic/_common/images/parallel_unified.jpg new file mode 100644 index 0000000..d8c0a8f Binary files /dev/null and b/_basic/_common/images/parallel_unified.jpg differ diff --git a/_basic/_common/images/parallel_unified.png b/_basic/_common/images/parallel_unified.png new file mode 100644 index 0000000..b64265e Binary files /dev/null and 
b/_basic/_common/images/parallel_unified.png differ diff --git a/_basic/_common/images/rapids_package.png b/_basic/_common/images/rapids_package.png new file mode 100644 index 0000000..f4e6332 Binary files /dev/null and b/_basic/_common/images/rapids_package.png differ diff --git a/_basic/_common/images/raw_kernel.png b/_basic/_common/images/raw_kernel.png new file mode 100644 index 0000000..3ef0e29 Binary files /dev/null and b/_basic/_common/images/raw_kernel.png differ diff --git a/_basic/_common/images/rdf.png b/_basic/_common/images/rdf.png new file mode 100644 index 0000000..5db03ae Binary files /dev/null and b/_basic/_common/images/rdf.png differ diff --git a/_basic/_common/images/roofline-achieved.png b/_basic/_common/images/roofline-achieved.png new file mode 100644 index 0000000..0140d27 Binary files /dev/null and b/_basic/_common/images/roofline-achieved.png differ diff --git a/_basic/_common/images/roofline-analysis.png b/_basic/_common/images/roofline-analysis.png new file mode 100644 index 0000000..acf1930 Binary files /dev/null and b/_basic/_common/images/roofline-analysis.png differ diff --git a/_basic/_common/images/roofline-baseline.png b/_basic/_common/images/roofline-baseline.png new file mode 100644 index 0000000..c048125 Binary files /dev/null and b/_basic/_common/images/roofline-baseline.png differ diff --git a/_basic/_common/images/roofline-compute.png b/_basic/_common/images/roofline-compute.png new file mode 100644 index 0000000..9b7457a Binary files /dev/null and b/_basic/_common/images/roofline-compute.png differ diff --git a/_basic/_common/images/roofline-overview.png b/_basic/_common/images/roofline-overview.png new file mode 100644 index 0000000..7b89142 Binary files /dev/null and b/_basic/_common/images/roofline-overview.png differ diff --git a/_basic/_common/images/roofline_collapse.png b/_basic/_common/images/roofline_collapse.png new file mode 100644 index 0000000..3ceec82 Binary files /dev/null and b/_basic/_common/images/roofline_collapse.png differ diff --git a/_basic/_common/images/rule-compute.png b/_basic/_common/images/rule-compute.png new file mode 100644 index 0000000..7344ef3 Binary files /dev/null and b/_basic/_common/images/rule-compute.png differ diff --git a/_basic/_common/images/sass-compute.png b/_basic/_common/images/sass-compute.png new file mode 100644 index 0000000..666ae43 Binary files /dev/null and b/_basic/_common/images/sass-compute.png differ diff --git a/_basic/_common/images/scheduler_collapse.png b/_basic/_common/images/scheduler_collapse.png new file mode 100644 index 0000000..5505a79 Binary files /dev/null and b/_basic/_common/images/scheduler_collapse.png differ diff --git a/_basic/_common/images/sections-compute.png b/_basic/_common/images/sections-compute.png new file mode 100644 index 0000000..f6c2bd7 Binary files /dev/null and b/_basic/_common/images/sections-compute.png differ diff --git a/_basic/_common/images/serial.jpg b/_basic/_common/images/serial.jpg new file mode 100644 index 0000000..e4a7c94 Binary files /dev/null and b/_basic/_common/images/serial.jpg differ diff --git a/_basic/_common/images/serial.png b/_basic/_common/images/serial.png new file mode 100644 index 0000000..6893e4f Binary files /dev/null and b/_basic/_common/images/serial.png differ diff --git a/_basic/_common/images/serial_cpu_rdf1.png b/_basic/_common/images/serial_cpu_rdf1.png new file mode 100644 index 0000000..1da6440 Binary files /dev/null and b/_basic/_common/images/serial_cpu_rdf1.png differ diff --git 
a/_basic/_common/images/serial_cpu_rdf2.png b/_basic/_common/images/serial_cpu_rdf2.png new file mode 100644 index 0000000..442d86a Binary files /dev/null and b/_basic/_common/images/serial_cpu_rdf2.png differ diff --git a/_basic/_common/images/serial_cupy_profile.png b/_basic/_common/images/serial_cupy_profile.png new file mode 100644 index 0000000..b756cac Binary files /dev/null and b/_basic/_common/images/serial_cupy_profile.png differ diff --git a/_basic/_common/images/serial_numba_profile.png b/_basic/_common/images/serial_numba_profile.png new file mode 100644 index 0000000..e47cfe5 Binary files /dev/null and b/_basic/_common/images/serial_numba_profile.png differ diff --git a/_basic/_common/images/serial_output1.png b/_basic/_common/images/serial_output1.png new file mode 100644 index 0000000..d0f87a4 Binary files /dev/null and b/_basic/_common/images/serial_output1.png differ diff --git a/_basic/_common/images/serial_output_file.png b/_basic/_common/images/serial_output_file.png new file mode 100644 index 0000000..0e850ec Binary files /dev/null and b/_basic/_common/images/serial_output_file.png differ diff --git a/_basic/_common/images/serial_profile.png b/_basic/_common/images/serial_profile.png new file mode 100644 index 0000000..5f1bc6b Binary files /dev/null and b/_basic/_common/images/serial_profile.png differ diff --git a/_basic/_common/images/serial_profiler1.png b/_basic/_common/images/serial_profiler1.png new file mode 100644 index 0000000..1e14d9d Binary files /dev/null and b/_basic/_common/images/serial_profiler1.png differ diff --git a/_basic/_common/images/sol.png b/_basic/_common/images/sol.png new file mode 100644 index 0000000..ccdc2a1 Binary files /dev/null and b/_basic/_common/images/sol.png differ diff --git a/_basic/_common/images/sol_baseline.png b/_basic/_common/images/sol_baseline.png new file mode 100644 index 0000000..f0cd07b Binary files /dev/null and b/_basic/_common/images/sol_baseline.png differ diff --git a/_basic/_common/images/source-compute.png b/_basic/_common/images/source-compute.png new file mode 100644 index 0000000..0db7eab Binary files /dev/null and b/_basic/_common/images/source-compute.png differ diff --git a/_basic/_common/images/source_collapse.png b/_basic/_common/images/source_collapse.png new file mode 100644 index 0000000..6dead9a Binary files /dev/null and b/_basic/_common/images/source_collapse.png differ diff --git a/_basic/_common/images/source_hover.png b/_basic/_common/images/source_hover.png new file mode 100644 index 0000000..c5e6563 Binary files /dev/null and b/_basic/_common/images/source_hover.png differ diff --git a/_basic/_common/images/source_loc.png b/_basic/_common/images/source_loc.png new file mode 100644 index 0000000..41243f8 Binary files /dev/null and b/_basic/_common/images/source_loc.png differ diff --git a/_basic/_common/images/source_sass.png b/_basic/_common/images/source_sass.png new file mode 100644 index 0000000..2830aaa Binary files /dev/null and b/_basic/_common/images/source_sass.png differ diff --git a/_basic/_common/images/source_sass_collapse.png b/_basic/_common/images/source_sass_collapse.png new file mode 100644 index 0000000..4a7c091 Binary files /dev/null and b/_basic/_common/images/source_sass_collapse.png differ diff --git a/_basic/_common/images/stdpar_gpu.png b/_basic/_common/images/stdpar_gpu.png new file mode 100644 index 0000000..4fde201 Binary files /dev/null and b/_basic/_common/images/stdpar_gpu.png differ diff --git a/_basic/_common/images/stdpar_multicore.png 
b/_basic/_common/images/stdpar_multicore.png new file mode 100644 index 0000000..705ccb7 Binary files /dev/null and b/_basic/_common/images/stdpar_multicore.png differ diff --git a/_basic/_common/images/stdpar_um.png b/_basic/_common/images/stdpar_um.png new file mode 100644 index 0000000..cddf295 Binary files /dev/null and b/_basic/_common/images/stdpar_um.png differ diff --git a/_basic/_common/images/summary-compute.png b/_basic/_common/images/summary-compute.png new file mode 100644 index 0000000..e9e67e1 Binary files /dev/null and b/_basic/_common/images/summary-compute.png differ diff --git a/_basic/_common/images/thread.png b/_basic/_common/images/thread.png new file mode 100644 index 0000000..953d3c8 Binary files /dev/null and b/_basic/_common/images/thread.png differ diff --git a/_basic/_common/images/thread_blocks.JPG b/_basic/_common/images/thread_blocks.JPG new file mode 100644 index 0000000..384b706 Binary files /dev/null and b/_basic/_common/images/thread_blocks.JPG differ diff --git a/_basic/_common/images/thread_blocks.png b/_basic/_common/images/thread_blocks.png new file mode 100644 index 0000000..37d0be0 Binary files /dev/null and b/_basic/_common/images/thread_blocks.png differ diff --git a/_basic/_common/images/thread_position.png b/_basic/_common/images/thread_position.png new file mode 100644 index 0000000..a4054c2 Binary files /dev/null and b/_basic/_common/images/thread_position.png differ diff --git a/_basic/_common/images/ufunc.png b/_basic/_common/images/ufunc.png new file mode 100644 index 0000000..e10ab85 Binary files /dev/null and b/_basic/_common/images/ufunc.png differ diff --git a/_basic/_common/images/uncoalesced_hint.png b/_basic/_common/images/uncoalesced_hint.png new file mode 100644 index 0000000..6b87ca1 Binary files /dev/null and b/_basic/_common/images/uncoalesced_hint.png differ diff --git a/_basic/_common/images/unified_memory.png b/_basic/_common/images/unified_memory.png new file mode 100644 index 0000000..27610f7 Binary files /dev/null and b/_basic/_common/images/unified_memory.png differ diff --git a/_basic/_common/images/warning-compute.png b/_basic/_common/images/warning-compute.png new file mode 100644 index 0000000..b458e4b Binary files /dev/null and b/_basic/_common/images/warning-compute.png differ diff --git a/_basic/_common/images/warp_collapse.png b/_basic/_common/images/warp_collapse.png new file mode 100644 index 0000000..f8192f5 Binary files /dev/null and b/_basic/_common/images/warp_collapse.png differ diff --git a/_basic/_common/images/workflow.png b/_basic/_common/images/workflow.png new file mode 100644 index 0000000..e2cb99b Binary files /dev/null and b/_basic/_common/images/workflow.png differ diff --git a/_basic/_common/input/.gitignore b/_basic/_common/input/.gitignore new file mode 100644 index 0000000..86d0cb2 --- /dev/null +++ b/_basic/_common/input/.gitignore @@ -0,0 +1,4 @@ +# Ignore everything in this directory +* +# Except this file +!.gitignore \ No newline at end of file diff --git a/_basic/_common/jupyter_notebook/Final_Remarks.ipynb b/_basic/_common/jupyter_notebook/Final_Remarks.ipynb new file mode 100644 index 0000000..64a98a8 --- /dev/null +++ b/_basic/_common/jupyter_notebook/Final_Remarks.ipynb @@ -0,0 +1,138 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Final Remarks\n", + "\n", + "In this tutorial we took an approach where the same algorithm was ported to the GPU using different popular methods. Each method has its strengths and serves the purpose for which it was created. 
From a developer's point of view, the following key parameters are crucial to any development exercise: \n", + "\n", + "- **Ease of Programming**: How much in-depth knowledge of the processor architecture does a developer need before starting to convert the code to the GPU?\n", + "- **Performance**: How much effort is required to reach the desired performance on a particular architecture?\n", + "- **Portability**: To what extent does the same code run on multiple architectures? What provisions does the programming approach provide to target different platforms?\n", + "- **Support**: The overall ecosystem and support by the community.\n", + " - Which compilers implement the standard?\n", + " - Which languages are supported?\n", + " - Which applications make use of it?\n", + " - How easy or difficult is it to profile/debug the application?\n", + " \n", + "Let us try to create high-level buckets for each of the parameters above, with a limited scope of GPU support:\n", + "\n", + "| Parameter | | | |\n", + "| :--- | :--- | :--- | :--- |\n", + "| Ease of Programming | High: Minimal architecture-specific knowledge needed | Intermediate: Minimal changes expected in code design. Using these along with architecture knowledge helps achieve better performance | Low: In-depth GPU architecture knowledge is a must |\n", + "| Performance | Depends: Based on the complexity/type of application the performance may vary | High: Exposes methods to get good performance. These methods are an integral part of the design and exposed to the programmer at various granularities | Best: Full control for developers over parallelism and memory access |\n", + "| Portability | Integral: Part of the key objective | Limited: Works only on a specific platform | | \n", + "| Support | Established: Proven over years and supported by multiple vendors for GPU | Emerging: Gaining traction with multiple vendors for GPU | |\n", + "\n", + "There is a very thin line between these categories, and within that limited scope and view we could categorize the different approaches as follows:\n", + "\n", + " \n", + "| | OpenACC | OpenMP | stdpar and DO CONCURRENT | Kokkos | CUDA Languages |\n", + "| --- | --- | --- | --- | --- | --- |\n", + "| Ease | High | High | High | Intermediate | Low |\n", + "| Performance | Depends | Depends | Depends | High | Best |\n", + "| Portability | Integral | Integral | Integral | Integral | Limited |\n", + "| Support | Established | Emerging | Emerging | Established | Established |\n", + "\n", + "\n", + "\n", + "The following points should help users, as there is no one programming model that fits all needs.\n", + "\n", + "## Ease of Programming\n", + "- The directive-based OpenMP and OpenACC programming models are generally the least intrusive when applied to loops, as the sketch after this list illustrates. \n", + "\n", + "- CUDA required a comparable amount of rewriting effort, in particular to map the loops onto a CUDA grid of threads and thread blocks.\n", + "- stdpar also required us to change the constructs to make use of C++17 parallel algorithms, and may be preferred for new developments that follow a C++ template style of coding. \n", + "- DO CONCURRENT also required only a minimal change, replacing the *do* loop with *do concurrent*. \n", + "- The overhead for OpenMP and OpenACC in terms of lines of code is the smallest, followed by stdpar and Kokkos.\n", + "\n",
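To make the contrast concrete, here is a minimal, hedged sketch (a hypothetical vector-scaling loop, not one of the lab codes) of the directive style versus the stdpar style described above; with the NVIDIA HPC SDK these would typically be built with `nvc++ -acc` and `nvc++ -stdpar` respectively:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Directive style (OpenACC): the serial loop body is left untouched;
// a single pragma asks the compiler to parallelize and offload it.
void scale_openacc(double *v, int n, double a) {
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    v[i] *= a;
}

// stdpar style: the loop is rewritten against a C++17 parallel algorithm,
// and an execution policy requests the parallel version.
void scale_stdpar(std::vector<double> &v, double a) {
  std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                [a](double &x) { x *= a; });
}
```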
\n", + "- The overhead for OpenMP and OpenACC in terms of lines of code is the smallest, followed by stdpar and Kokkos\n", + "\n", + "## Performance\n", + "While we have not gone into the details of optimization for any of these programming model the analysis provided here is based on the general design of the programming model itself.\n", + "\n", + "\n", + "- OpenACC and OpenMP abstract model defines a least common denominator for accelerator devices, but cannot represent architectural specifics of these devices without making the language less portable.\n", + "- stdpar and DO-CONCURRENT on the other hand are more abstract and gives less control to developers to optimize the code\n", + "\n", + "## Portability\n", + "\n", + "We observed the same code being run on both multicore and GPU using OpenMP, OpenACC, stdpar and DO-CONCURRENT. The point we highlight here is how a programming model supports the divergent cases where developers may choose to use different directive variant to get more performance. In a real application the tolerance for this portability/performance trade-off will vary according to the needs of the programmer and application \n", + "- OpenMP supports [Metadirective](https://www.openmp.org/spec-html/5.0/openmpsu28.html) where the developer can choose to activate different directive variant based on the condition selected.\n", + "- In OpenACC when using ```kernel``` construct, the compiler is responsible for mapping and partitioning the program to the underlying hardware. Since the compiler will mostly take care of the parallelization issues, the descriptive approach may generate performance code for specific architecture. The downside is the quality of the generated accelerated code depends significantly on the capability of the compiler used and hence the term \"may\".\n", + "\n", + "## Support\n", + "\n", + "- OpenACC implementation is present in most popular compilers like NVIDIA HPC SDK, PGI, GCC, Clang and CRAY. \n", + "- OpenMP GPU support is currently available on limited compilers but being the most supported programming model for multicore it is matter of time when it comes at par with other models for GPU support.\n", + "- stdpar being part of the C++ standard is bound to become integral part of most compiler supporting parallelism. \n", + "- DO-CONCURRENT being part of the ISO Fortran standard is bound to become integral part of most compiler supporting parallelism. \n", + "\n", + "Parallel Computing in general has been a difficult task and requires developers not just to know a programming approach but also think in parallel. 
Parallel computing in general has been a difficult task and requires developers not just to know a programming approach but also to think in parallel. While this tutorial provides you a good start, it is highly recommended to go through the Profiling and Optimization bootcamps as next steps.\n", + "\n", + "-----\n", + "\n", + "-----\n", + "\n", + "# Links and Resources\n", + "[OpenACC API guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)\n", + "\n", + "[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- \n", + "\n", + "### Other Bootcamps\n", + "The contents of this Bootcamp originate from the [OpenACC GPU Bootcamp Github](https://github.com/gpuhackathons-org/gpubootcamp). Here are some additional Bootcamps which might be of interest: \n", + "\n", + "- [AI for HPC](https://github.com/gpuhackathons-org/gpubootcamp/tree/master/hpc_ai/ai_science_climate)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/jupyter_notebook/Final_Remarks_python.ipynb b/_basic/_common/jupyter_notebook/Final_Remarks_python.ipynb new file mode 100644 index 0000000..4c7871a --- /dev/null +++ b/_basic/_common/jupyter_notebook/Final_Remarks_python.ipynb @@ -0,0 +1,117 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Final Remarks\n", + "\n", + "In this Bootcamp we saw how a single algorithm was ported to GPUs using two well-known approaches (Numba and CuPy). Each approach has its strengths and a purpose. From a developer perspective the following metrics are crucial to any development exercise: \n", + "\n", + " 1. **Ease of Programming**: Determine what level of processor architectural knowledge a developer needs before starting to convert the serial code to a GPU version.\n", + " 2. **Performance**: Measure how much effort is required to reach desirable performance on a particular architecture.\n", + " 3. **Portability**: To what extent does the same code run on multiple architectures? What provisions are provided by the chosen programming approach to target different platforms?\n", + " 4. 
**Support**: Consider the overall ecosystem and support available from the community.\n", + " \n", + "Let us try to create high-level buckets for each of the above parameters with a limited scope of GPU support:\n", + "\n", + "| Parameter | | | |\n", + "| :--- | :--- | :--- | :--- |\n", + "| Ease of Programming | High: Minimal architecture-specific knowledge needed | Intermediate: Minimal changes expected in code design. Using these along with architecture knowledge helps achieve better performance | Low: In-depth GPU architecture knowledge is a must |\n", + "| Performance | Depends: Based on the complexity/type of application the performance may vary | High: Exposes methods to get good performance. These methods are an integral part of the design and exposed to the programmer at various granularities | Best: Full control for developers over parallelism and memory access |\n", + "| Portability | Integral: Part of the key objective | Limited: Works only on a specific platform | | \n", + "| Support | Established: Proven over years and supported by multiple vendors for GPU | Emerging: Gaining traction with multiple vendors for GPU | |\n", + "\n", + "There is a very thin line between these categories, and within that limited scope and view we could categorize the different approaches as follows:\n", + "\n", + "\n", + "\n", + " \n", + "| Metrics | Python CuPy | Python Numba | CUDA Languages |\n", + "| --- | --- | --- | --- |\n", + "| **Ease** | Depends | Intermediate | Low |\n", + "| **Performance** | High | High | Best |\n", + "| **Portability** | Integral | Integral | Limited |\n", + "| **Support** | Emerging | Established | Established |\n", + "\n", + "The following points broaden the user's understanding, as there is no \"one-size-fits-all\" programming model.\n", + "\n", + "## Ease of Programming\n", + "- The Python CuPy programming model is problem-based and depends on the type of task engaged. A major challenge in Python CuPy is that the raw kernel must be written in CUDA C form; however, performance is increased and access to thread IDs (which is not possible with the other kernel classes in CuPy) becomes available. Because of non-sequential index access within the array, access to thread IDs is required in the serial code task, and therefore the CuPy raw kernel was used (a sketch of such a kernel appears after the *Portability* section below). A comparable amount of rewriting effort to map the loops onto the CUDA grid of blocks and thread blocks was performed.\n", + "\n", + "- The Python Numba programming model uses the CUDA C programming paradigm with Python semantics. Moderate effort is required to map the loops onto a grid of blocks and thread blocks, since the Numba syntax is in Python form. \n", + "\n", + "## Performance\n", + "While we have not gone into the details of optimization for any of these programming models, the analysis provided here is based on the general design of the programming model itself.\n", + "- Python CuPy and Numba code optimization is dependent on the logic used in terms of data movement, thread block management and shared memory. However, with emphasis on the lab task, the CuPy approach is expected to have better performance than the Numba approach. \n", + "\n", + "## Portability\n", + "We observed the same code being run on multicore and GPU using CuPy and Numba. The point we highlight here is how a programming model supports the divergent cases where developers may choose to use different kernel classes or function decorators to get more performance. In a real application, the tolerance for this portability/performance trade-off will vary according to the needs of the programmer and the application. \n", + "\n",
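To illustrate the raw-kernel point above, here is a minimal, hedged sketch (the kernel name `scale` and the scaling operation are hypothetical, not the lab's RDF code) of the CUDA C source that a CuPy raw kernel wraps; thread and block IDs are directly available here, unlike in CuPy's elementwise kernel classes:

```cpp
// CUDA C source for a hypothetical CuPy RawKernel: each thread derives a
// global index from its block and thread IDs and scales one element.
extern "C" __global__
void scale(const double *in, double *out, double a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
  if (i < n)                                      // guard the last block
    out[i] = a * in[i];
}
```

On the Python side this string would be passed to `cupy.RawKernel(source, "scale")` and launched with an explicit grid and block configuration, which is exactly the mapping effort described above.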
\n", + "\n", + "\n", + "## Support\n", + "- The CuPy and Numba libraries are well documented and the developer support on GitHub is excellent. \n", + "- A CuPy implementation is present in the RAPIDS package via conda [here](https://rapids.ai/start.html). \n", + "- Numba is well supported in the Anaconda package.\n", + "- **CUDA Python Ecosystem**: NVIDIA has recently shown support towards simplifying the developer experience with improved Python code portability and compatibility. The goal is to help unify the Python CUDA ecosystem with a single standard set of low-level interfaces, providing full coverage of, and access to, the CUDA host APIs from Python. The ecosystem allows interoperability among different accelerated libraries and ease of use for Python developers interested in NVIDIA GPUs. The initial release of CUDA Python includes Cython and Python wrappers for the CUDA Driver and Runtime APIs. You can read more here:\n", + " - [Python Ecosystem](https://developer.nvidia.com/blog/unifying-the-cuda-python-ecosystem/)\n", + " - [CUDA Python Public Preview](https://developer.nvidia.com/cuda-python)\n", + " - [GPU-Accelerated Computing with Python Numba](https://developer.nvidia.com/how-to-cuda-python)\n", + "\n", + "\n", + "In general, parallel computing has been a difficult task that requires developers to not only know a programming approach but to also think in parallel. While this Bootcamp provides you a good start, we recommend participating in the Profiling and Optimization Labs as next steps.\n", + "\n", + "-----\n", + "\n", + "#
[HOME](../_start_nways_python.ipynb)
\n", + "\n", + "-----\n", + "\n", + "# Links and Resources\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)\n", + "\n", + "[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download the latest version of Nsight System from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "--- \n", + "\n", + "\n", + "### Other Bootcamps\n", + "The contents of this Bootcamp originates from [OpenACC GPU Bootcamp Github](https://github.com/gpuhackathons-org/gpubootcamp). Here are some additional Bootcamp which might of interest : \n", + "\n", + "- [AI for HPC](https://github.com/gpuhackathons-org/gpubootcamp/tree/master/hpc_ai/ai_science_climate)\n", + "\n", + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/jupyter_notebook/GPU_Architecture_Terminologies.ipynb b/_basic/_common/jupyter_notebook/GPU_Architecture_Terminologies.ipynb new file mode 100644 index 0000000..5a504be --- /dev/null +++ b/_basic/_common/jupyter_notebook/GPU_Architecture_Terminologies.ipynb @@ -0,0 +1,99 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Thread\n", + "A basic element of the data to be processed on the GPU.\n", + "\n", + "### CUDA Blocks\n", + "A collection or group of threads which can communicate within their own block.\n", + "### Grid\n", + "CUDA blocks are grouped into a grid. Blocks are independent of each other.\n", + "\n", + "### Kernel\n", + "A kernel is executed as a grid of blocks of threads.\n", + "\n", + "\n", + "\n", + "### Streaming Multiprocessor (SM) \n", + "Streaming multi-processors with multiple processing cores. Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU. One SM can run several concurrent CUDA blocks depending on the resources needed by CUDA blocks. Each kernel is executed on one device and CUDA supports running multiple kernels on a device at one time. Below figure shows the kernel execution and mapping on hardware resources available in GPU.\n", + "\n", + "\n", + "\n", + "### Warp\n", + "32 threads form a warp.The SM has a maximum number of warps that can be active at once. \n", + "\n", + "### Memory Hierarchy\n", + "CUDA-capable GPUs have a memory hierarchy as shown below:\n", + "\n", + "\n", + "\n", + "The following memories are exposed by the GPU architecture:\n", + "\n", + "- **Registers** : These are private to each thread, which means that registers assigned to a thread are not visible to other threads. The compiler makes decisions about register utilization.\n", + "- **L1/Shared memory (SMEM)** : Every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and shared memory. 
+ "### Memory Hierarchy\n", + "CUDA-capable GPUs have a memory hierarchy as shown below:\n", + "\n", + "\n", + "\n", + "The following memories are exposed by the GPU architecture:\n", + "\n", + "- **Registers**: These are private to each thread, which means that registers assigned to a thread are not visible to other threads. The compiler makes decisions about register utilization.\n", + "- **L1/Shared memory (SMEM)**: Every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and shared memory. All threads in a CUDA block can share shared memory, and all CUDA blocks running on a given SM can share the physical memory resource provided by the SM.\n", + "- **Read-only memory**: Each SM has an instruction cache, constant memory, texture memory and RO cache, which is read-only to kernel code.\n", + "- **L2 cache**: The L2 cache is shared across all SMs, so every thread in every CUDA block can access this memory. The NVIDIA A100 GPU has increased the L2 cache size to 40 MB, compared to 6 MB in V100 GPUs.\n", + "- **Global memory**: This is the DRAM sitting on the GPU, reported as the framebuffer size of the GPU.\n", + "\n", + "To learn more, please check out the CUDA Refresher series at https://developer.nvidia.com/blog/tag/cuda-refresher/ .\n", + "\n", + "\n", + "### Occupancy\n", + "The Streaming Multiprocessor (SM) has a maximum number of warps that can be active at once. Occupancy is the ratio of active warps to the maximum number of supported active warps. Occupancy is 100% if the number of active warps equals the maximum. If this factor is limiting active blocks, occupancy cannot be increased. \n", + "\n", + "The Streaming Multiprocessor (SM) also has a maximum number of blocks that can be active at once. If occupancy is below 100% and this factor is limiting active blocks, it means each block does not contain enough warps to reach 100% occupancy when the device's active block limit is reached. Occupancy can be increased by increasing the block size. \n", + "\n", + "To learn more about occupancy, check out https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Unified Memory\n", + "\n", + "With every new CUDA and GPU architecture release, new features are added. These new features provide more performance and ease of programming or allow developers to implement new algorithms that otherwise weren't possible to port to GPUs using CUDA.\n", + "One such important feature, available from CUDA 6.0 onward and first implemented with the Kepler GPU architecture, is unified memory (UM). \n", + "\n", + "In simpler words, UM provides the user with a view of a single memory space that's accessible by all GPUs and CPUs in the system. This is illustrated in the following diagram:\n", + "\n", + "\n", + "\n", + "UM simplifies the programming effort for beginners to CUDA, as developers need not explicitly manage copying data to and from the GPU. We will be using this feature of the latest CUDA releases and GPU architectures in the labs." + ] + },
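A minimal, hedged sketch of what UM looks like in practice (a hypothetical initialization example, not one of the lab codes): a single pointer is allocated once and used from both the CPU and the GPU, with no explicit copies:

```cpp
#include <cuda_runtime.h>

int main() {
  int n = 1 << 20;
  float *data = nullptr;

  // One allocation visible to both host and device (unified memory).
  cudaMallocManaged(&data, n * sizeof(float));

  // Initialize on the CPU; a kernel could then read and update the same
  // pointer on the GPU with no cudaMemcpy in either direction.
  for (int i = 0; i < n; i++) data[i] = 1.0f;

  cudaDeviceSynchronize();  // wait for any device work touching `data`
  cudaFree(data);
  return 0;
}
```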
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/jupyter_notebook/nsight_compute.ipynb b/_basic/_common/jupyter_notebook/nsight_compute.ipynb new file mode 100644 index 0000000..a544de8 --- /dev/null +++ b/_basic/_common/jupyter_notebook/nsight_compute.ipynb @@ -0,0 +1,327 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This lab gives an overview of NVIDIA Nsight Compute and the steps to profile a kernel with the Nsight Compute command line interface.\n", + "\n", + "Let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Introduction to Nsight Compute\n", + "NVIDIA Nsight Compute is an interactive kernel profiler for GPU applications which provides detailed performance metrics and API debugging via a user interface and a command line tool. The NVIDIA Nsight Compute CLI (`ncu`) provides a non-interactive way to profile applications from the command line and can print the results directly on the command line or store them in a report file. \n", + "\n", + "Results can then be imported into the GUI version for inspection. With the command line profiler, you can instrument the target API and collect profile results for the specified kernels or for all of them.\n", + "\n", + "\n", + "- **Navigating the report via GUI**\n", + "The Nsight Compute UI consists of a header with general information, as well as controls to switch between report pages or individual collected kernel launches. By default, the profile report comes up on the *Details* page. You can easily switch between the different pages of the report with the dropdown labeled *Page* on the top-left of the page. \n", + "\n", + "\n", + "A report can contain any number of results from kernel launches. The *Launch* dropdown allows switching between the different results in the report.\n", + "\n", + "\n", + "\n", + "\n", + "- **Sections and Sets**\n", + "Nsight Compute uses section sets to decide the amount of metrics to be collected. By default, a relatively small number of metrics is collected, such as SOL (speed of light – comparison against best possible behavior), launch statistics, and occupancy analysis. You can optionally select which of these sections are collected and displayed with command-line parameters. If you are profiling from the command line, use the flag `--set detailed` or `--set full`. In the later sections, you will learn how to collect these metrics. 
Read more at https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules.\n", + "\n", + "\n", + "\n", + "The screenshots below show close-up views of example sections in the Nsight Compute profiler. You can expand a section by clicking on it. Under each section there is a description explaining what it shows (some of these sections are not collected by default). \n", + "\n", + "\n", + "\n", + "Various sections have a triangle with an exclamation mark in front of them. Follow this warning icon and it tells you what the bottleneck is and gives you guidance on how you can improve it.\n", + "\n", + "\n", + "\n", + "Some sections have one or more bodies with additional charts or tables. You can click on the triangle expander icon in the top-left corner of each section to show or hide those. If a section has multiple bodies, a dropdown in their top-right corner allows you to switch between them. As shown in the example screenshot below, you can switch between different bodies in the SOL section and choose to view *SOL Chart*, *SOL breakdown*, *SOL Rooflines*, or all together.\n", + "\n", + "\n", + "\n", + "Let's have a look at some of these sections:\n", + "\n", + "The _**GPU Speed Of Light Roofline**_ Chart section contains a Roofline chart that is helpful for visualizing kernel performance. More information on how to use and read this chart can be found in the [*Roofline Charts*](#roofline) section.\n", + "\n", + "\n", + "\n", + "The _**Memory Workload Analysis**_ section contains a Memory chart that visualizes data transfers, cache hit rates, instructions and memory requests. More information on how to use and read this chart can be found in the [*Memory Charts*](#memory) section.\n", + "\n", + "\n", + "\n", + "_**Source Counters**_ can contain source hotspot tables that indicate the N highest or lowest values of one or more metrics in the kernel source code. In other words, it depicts performance problems in the source code.\n", + "\n", + "\n", + "\n", + "You can select the location links to navigate directly to this location in the *Source Page* (it displays metrics that can be correlated with source code). Please note that for the correlation of SASS and source code to work, the source code needs to be compiled with the `-lineinfo` flag.\n", + " \n", + "\n", + "\n", + "To read more about the different sections in NVIDIA Nsight Compute, check out the documentation: http://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules\n", + "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "\n", + "- **Comparing multiple results**\n", + "With the Nsight Compute GUI, you can create a baseline and compare results against each other. On the *Details* page, press the *Add Baseline* button to make the current report/result the baseline for all other results from this report and any other report opened in the same instance of Nsight Compute. When a baseline is set, every element on the *Details* page shows two values: the current value of the result in focus and the corresponding value of the baseline, or the percentage of change from the corresponding baseline value.\n", + "\n", + "\n", + "\n", + "- **Applying Rules**\n", + "Sections on the *Details* page may provide rules. By pressing the *Apply Rules* button at the top of the page, all available rules for the current report are executed. 
\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Roofline Charts \n", + "\n", + "Once you wrote the high performance software code, you need to understand how well the application performs on the the available hardware. Different platforms, whether they are CPUs, GPUs, or something else, will have different hardware limitations such as available memory bandwidth and theoretical compute limits. The Roofline performance model visualizes achieved performance and helps you understand how well your application is using the available hardware resources and find the performance limiters. \n", + "\n", + "Kernel performance is not only dependent on the operational speed of the GPU. Since a kernel requires data to work on, performance is also dependent on the rate at which the GPU can feed data to the kernel. A typical roofline chart combines the peak performance and memory bandwidth of the GPU, with a metric called *Arithmetic Intensity* (a ratio between Work and Memory Traffic), into a single chart, to more realistically represent the achieved performance of the profiled kernel.\n", + "\n", + "With *Arithmetic intensity* and *FLOP/s*, you can plot a kernel on a graph that includes rooflines and ceilings of performance limits and visualize how your kernel is affected by them.\n", + "\n", + "- *Arithmetic intensity* The ratio between compute work (FLOPs) and data movement (bytes)\n", + "- *FLOP/s*: Floating-point operations per second\n", + "\n", + "\n", + "Nsight compute collects and displays roofline analysis data in the roofline chart. This chart is part of the Speed of Light (SOl) section. \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "This chart actually shows two different rooflines. However, the following components can be identified for each:\n", + "\n", + "- *Vertical Axis* represents Floating Point Operations per Second (FLOPS) (Note: For GPUs this number can get quite large and to better accommodate the range, this axis is rendered using a logarithmic scale.)\n", + "- *Horizontal Axis* - The horizontal axis represents Arithmetic Intensity, which is the ratio between Work (expressed in floating point operations per second), and Memory Traffic (expressed in bytes per second). The resulting unit is in floating point operations per byte. This axis is also shown using a logarithmic scale.\n", + "- *Memory Bandwidth Boundary* is the sloped part of the roofline. By default, this slope is determined entirely by the memory transfer rate of the GPU but it can be customized too.\n", + "- *Peak Performance Boundary* - The peak performance boundary is the flat part of the roofline By default, this value is determined entirely by the peak performance of the GPU but but it can be customized too.\n", + "- *Ridge Point* is the point at which the memory bandwidth boundary meets the peak performance boundary (a useful reference when analyzing kernel performance).\n", + "- *Achieved Value* represents the performance of the profiled kernel.\n", + "\n", + "To learn more about customizing NVIDIA Nsight Compute tools, read the Nsight Compute Customization Guide: https://docs.nvidia.com/nsight-compute/2021.2/CustomizationGuide/index.html#abstract\n", + "\n", + "#### Roofline Analysis\n", + "\n", + "The roofline chart can be very helpful in guiding performance optimization efforts for a particular kernel.\n", + "\n", + "\n", + "\n", + "As shown here, the ridge point partitions the roofline chart into two regions. 
 + "\n", + "The distance from the achieved value to the respective roofline boundary (shown in this figure as a dotted white line) represents the opportunity for performance improvement. The closer the achieved value is to the roofline boundary, the more optimal its performance. An achieved value that lies on the *Memory Bandwidth Boundary* but is not yet at the height of the ridge point would indicate that any further improvements in overall FLOP/s are only possible if the *Arithmetic Intensity* is increased at the same time. \n", + "\n", + "If you hover your mouse over the achieved value, you can see the achieved performance (FLOP/s) (see the example below).\n", + "\n", + "\n", + "\n", + "Using the baseline feature in combination with roofline charts is a good way to track optimization progress over a number of kernel executions. As shown in the example below, the roofline chart also contains an achieved value for each baseline. The outline color of the plotted achieved value point can be used to determine from which baseline the point came. In this example, the outline colors are light blue and green, showing the achieved value points.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Memory Charts \n", + "\n", + "The Memory Workload Analysis section shows a detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Depending on the limiting factor, the memory chart and tables allow you to identify the exact bottleneck in the memory system.\n", + "\n", + "Below is a memory chart of an NVIDIA V100 GPU:\n", + "\n", + "\n", + "\n", + "*Logical units* (e.g., Kernel, Global memory) are shown in green and *physical units* (e.g., L2 Cache, Device Memory, System Memory) are shown in blue. Since not all GPUs have all units, the exact set of shown units may vary for a specific GPU architecture.\n", + "\n", + "*Links* between *Kernel* and other logical units represent the number of executed instructions (Inst) targeting the respective unit. For example, the link between Kernel and Global represents the instructions loading from or storing to the global memory space. \n", + "\n", + "Links between logical units (green) and physical units (blue) represent the number of requests (Req) issued as a result of their respective instructions. For example, the link going from L1/TEX Cache to Global shows the number of requests generated due to global load instructions.\n", + "\n", + "The color of each link represents the percentage of peak utilization of the corresponding communication path. The color legend to the right of the chart shows the applied color gradient from unused (0%) to operating at peak performance (100%). Triangle markers to the left of the legend correspond to the links in the chart. \n", + "\n", + "\n", + "Colored rectangles inside the units located at the incoming and outgoing links represent port utilization. 
Units often share a common data port for incoming and outgoing traffic. Ports use the same color gradient as the data links. An example of the correlation between the peak values reported in the memory tables and the ports in the memory chart is shown below:\n", + "\n", + "\n", + "\n", + "\n", + "The memory tables show detailed metrics for the various memory hardware units, such as device memory. To learn more, please read the profiling guide: https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html#memory-tables" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Profiling using command line interface \n", + "To profile the application, you can either use the Graphical User Interface (GUI) or the Command Line Interface (CLI). During this lab, we will profile the applications using the CLI. The Nsight Compute command line executable is named `ncu`. To collect the default set of data for all kernel launches in the application, run:\n", + "\n", + "`ncu -o output ./rdf`\n", + "\n", + "For all kernel invocations in the application code, the *Details* page data is gathered and the results are written to `output.ncu-rep`. \n", + "\n", + "\n", + "\n", + "As seen in the above screenshot, each output line from the compute profiler starts with `==PROF==`. The other lines are output from the application itself. For each profiled kernel, the name of the kernel function and the progress of data collection are shown. In the example screenshot, the kernel function name starts with `_Z16pair_gpu_183_gpuPKdS0_S0_...`.\n", + "\n", + "\n", + "\n", + "\n", + "The example screenshot shows major sections (annotated in green) for SOL (speed of light, a comparison against the best possible behavior), launch statistics, and occupancy analysis for the example kernel function `pair_gpu`. You can optionally select which of these sections are collected and displayed with command-line parameters. Simply run `ncu --list-sets` from the terminal to see the list of available sets. \n", + "\n", + "\n", + " \n", + "\n", + "\n", + "To see the list of currently available sections, use `--list-sections`.\n", + "\n", + "\n", + " \n", + "\n", + "To collect all sections and sets when profiling your application with Nsight Compute, add `--set=full` to the command line. This additionally collects the Memory and Compute Workload Analysis sections as well as scheduler, warp state, and instruction statistics, all of which are added to the profiling report. \n", + "\n", + "**Note**: Collecting more sections and metrics increases profiling time and the size of the output.\n",
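 + "\n", + "For example, building on the first command above (the report name `output_full` is arbitrary):\n", + "\n", + "`ncu --set=full -o output_full ./rdf`\n",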
 + "\n", + "\n", + "There are also options available to specify for which kernels data should be collected. Below is a typical command line invocation that narrows collection down to a single launch of a specific kernel in the target application:\n", + "\n", + "`ncu -k _Z16pair_gpu_183_gpuPKdS0_S0_Pyiidddi --launch-skip 1 --launch-count 1 -f -o output ./rdf`\n", + "\n", + "where the command switch options used for this lab are:\n", + "- `-c` or `--launch-count`: specifies the number of kernel launches to collect\n", + "- `-s` or `--launch-skip`: specifies the number of kernel launches to skip before collection starts\n", + "- `-k` or `--kernel-name`: specifies the matching kernel name\n", + "- `-f`: overwrites the existing generated report\n", + "- `-o`: name for the intermediate result file, created at the end of the collection (.nsight-cuprof-report or .ncu-rep filename)\n", + "\n", + "**Customizing data collection**: How do you decide how many kernel launches to skip and how many to collect? Since data is collected per kernel launch, it makes sense to collect more than one launch when launches differ in behavior or performance characteristics. In short, skip the launches you are not interested in (for example, warm-up iterations) and collect only those whose performance metrics you actually need.\n", + "\n", + "You can also profile a kernel from inside Nsight Systems, or copy the command line options for the specific kernel you want to profile. To achieve this, right click on the kernel in the timeline view inside Nsight Systems. \n", + "\n", + "\n", + "\n", + "Then click on \"Analyze the selected Kernel with NVIDIA Nsight Compute\". \n", + "\n", + "\n", + "\n", + "Then choose \"Display the command line to use NVIDIA Nsight Compute CLI\". Finally, copy the command and run it on the target system to analyze the selected kernel.\n", + "\n", + "\n", + "\n", + "\n", + "**Note**: You do not need to memorize the profiler options. You can always run `ncu --help` from the command line and use the necessary options or profiler arguments. For more info on the Nsight Compute profiler, please read the __[documentation](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)__.\n", + "\n", + "\n", + "### How to view the report\n", + "The profiler report contains all the information collected during profiling for each kernel launch. When using the CLI to profile the application, there are two ways to view the profiler's report. \n", + "\n", + "1) On the Terminal: By default, a temporary file is used to store profiling results, and data is printed to the command line. You can also use the `--print-summary per-kernel` option to view the summary of each kernel type on the terminal. To read more about console output options, check out the guide at https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options-console-output.\n", + "\n", + "\n",
 + "2) NVIDIA Nsight Compute UI: To permanently store the profiler report, use `-o` to specify the output filename. After the profiling session ends, a `*.nsight-cuprof-report` or `*.ncu-rep` file is created. This file can be loaded into the Nsight Compute UI using *File -> Open*. If you would like to view this on your local machine, the local system needs the same version of the CUDA Toolkit installed, and the Nsight Compute UI version should match the CLI version. More details on where to download the CUDA Toolkit can be found in the “Links and Resources” section at the end of this page.\n", + "\n", + "To view the profiler report, simply open the file from the GUI (File > Open).\n", + "\n", + "\n", + "\n", + "**NOTE**: Example screenshots are for reference only and you may not get an identical profiler report." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "-----\n", + "\n", + "\n", + "\n", + "-----" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Links and Resources\n", + "\n", + "\n", + "[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)\n", + "\n", + "\n", + "**NOTE**: To be able to see the Nsight Compute profiler output, please download Nsight Compute's latest version from [here](https://developer.nvidia.com/nsight-compute).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- \n", + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/jupyter_notebook/nsight_systems.ipynb b/_basic/_common/jupyter_notebook/nsight_systems.ipynb new file mode 100644 index 0000000..a77353a --- /dev/null +++ b/_basic/_common/jupyter_notebook/nsight_systems.ipynb @@ -0,0 +1,226 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This lab gives an overview of the NVIDIA Nsight tools and the steps to profile an application with the Nsight Systems command line interface and the NVTX API. You will learn how to integrate NVTX markers in your application to trace CPU events when profiling using Nsight tools. \n", + "\n", + "Let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# NVIDIA Profiler\n", + "\n", + "### What is profiling\n", + "Profiling is the first step in optimizing and tuning your application. Profiling an application helps us understand where most of the execution time is spent. You will gain an understanding of its performance characteristics and can easily identify parts of the code that present opportunities for improvement. 
Finding hotspots and bottlenecks in your application can help you decide where to focus your optimization efforts.\n", + "\n", + "### NVIDIA Nsight Tools\n", + "NVIDIA offers the Nsight tools (Nsight Systems, Nsight Compute, Nsight Graphics), a collection of applications that enable developers to debug and profile the performance of CUDA, OpenACC, or OpenMP applications. \n", + "\n", + "Your profiling workflow will move between the individual Nsight tools. Start with Nsight Systems to get a system-level overview of the workload and eliminate any system level bottlenecks, such as unnecessary thread synchronization or data movement, and improve the system level parallelism of your algorithms. Once you have done that, proceed to Nsight Compute or Nsight Graphics to optimize the most significant CUDA kernels or graphics workloads, respectively. Periodically return to Nsight Systems to ensure that you remain focused on the largest bottleneck; otherwise the bottleneck may have shifted and your kernel level optimizations may not achieve as high an improvement as expected.\n", + "\n", + "- **Nsight Systems**: analyze application algorithms system-wide\n", + "- **Nsight Compute**: debug and optimize CUDA kernels \n", + "- **Nsight Graphics**: debug and optimize graphics workloads\n", + "\n", + "\n", + "*The data flows between the NVIDIA Nsight tools.*\n", + "\n", + "In this lab, we only focus on Nsight Systems to get system-wide actionable insights and eliminate bottlenecks.\n", + "\n", + "### Introduction to Nsight Systems \n", + "The Nsight Systems tool offers system-wide performance analysis in order to visualize an application’s algorithms, help identify optimization opportunities, and improve the performance of applications running on a system consisting of multiple CPUs and GPUs.\n", + "\n", + "#### Nsight Systems Timeline\n", + "- CPU rows help locate idle times on the CPU cores. Each row shows how the process' threads utilize the CPU cores.\n", + "\n", + "\n", + "- Thread rows show a detailed view of each thread's activity, including OS runtime library usage, CUDA API calls, and NVTX time ranges and events (if integrated in the application).\n", + "\n", + "\n", + "- CUDA Workloads rows display kernel and memory transfer activities. \n", + "\n", + "\n", + "### Profiling using command line interface \n", + "To profile your application, you can either use the Graphical User Interface (GUI) or the Command Line Interface (CLI). During this lab, we will profile the mini application using the CLI.\n", + "\n", + "The Nsight Systems command line interface is named `nsys`. Below is a typical command line invocation:\n", + "\n", + "`nsys profile -t openacc,nvtx --stats=true --force-overwrite true -o laplace ./laplace`\n", + "\n", + "where the command switch options used for this lab are:\n", + "- `profile`: start a profiling session\n", + "- `-t`: selects the APIs to be traced (nvtx and openacc in this example)\n", + "- `--stats`: if true, generates a summary of statistics after the collection\n", + "- `--force-overwrite`: if true, overwrites the existing generated report\n", + "- `-o`: name for the intermediate result file, created at the end of the collection (.qdrep filename)\n",
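 + "\n", + "As a further hypothetical example (ours, not used in this lab), `cuda` and `osrt` are also valid `-t` values, so tracing CUDA API calls and OS runtime calls as well would look like:\n", + "\n", + "`nsys profile -t openacc,nvtx,cuda,osrt --stats=true --force-overwrite true -o laplace ./laplace`\n",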
 + "\n", + "**Note**: You do not need to memorize the profiler options. You can always run `nsys --help` or `nsys [specific command] --help` from the command line and use the necessary options or profiler arguments.\n", + "For more info on the Nsight profiler and NVTX, please see the __[Profiler documentation](https://docs.nvidia.com/nsight-systems/)__.\n", + "\n", + "### How to view the report\n", + "\n", + "When using the CLI to profile the application, there are two ways to view the profiler's report. \n", + "\n", + "1) On the Terminal using the `--stats` option: By using the `--stats` switch option, profiling results are displayed on the console terminal after the profiling data is collected.\n", + "\n", + "\n", + "\n", + "2) NVIDIA Nsight Systems GUI: After the profiling session ends, a `*.qdrep` file is created. This file can be loaded into the Nsight Systems GUI using *File -> Open*. If you would like to view this on your local machine, the local system needs the same version of the CUDA Toolkit installed, and the Nsight Systems GUI version should match the CLI version. More details on where to download the CUDA Toolkit can be found in the “Links and Resources” section at the end of this page.\n", + "\n", + "To view the profiler report, simply open the file from the GUI (File > Open).\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using NVIDIA Tools Extension (NVTX) \n", + "\n", + "NVIDIA Tools Extension (NVTX) is a C-based Application Programming Interface (API) for annotating events, time ranges and resources in applications. NVTX brings the profiled application’s logic into the profiler, making the profiler’s displayed data easier to analyze and enabling correlation of the displayed data with the profiled application’s actions. \n", + "\n", + "During this lab, we profile the application using the Nsight Systems command line interface and collect the timeline. We will also be tracing NVTX APIs (already integrated into the application). NVTX is a powerful mechanism that allows users to manually instrument their application. NVIDIA Nsight Systems can then collect the information and present it on the timeline. It is particularly useful for tracing CPU events and time ranges, and greatly improves the timeline's readability. \n", + "\n", + "**How to use NVTX**: Add `#include \"nvtx3/nvToolsExt.h\"` in your source code and wrap the parts of your code for which you want to capture events with calls to the NVTX API functions. For example, try adding `nvtxRangePush(\"main\")` at the beginning of your `main()` function, and `nvtxRangePop()` just before the return statement at the end.\n",
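 + "\n", + "Since every push must be matched by a pop, a common C++ convenience is a scoped range object that pops automatically. The helper below is our own sketch (it is not part of this lab's sources):\n", + "\n", + "```cpp\n", + "#include \"nvtx3/nvToolsExt.h\"\n", + "\n", + "// Scoped NVTX range: pushes on construction, pops on destruction, so\n", + "// early returns cannot leave the push/pop pairs unbalanced.\n", + "struct ScopedNvtxRange {\n", + "    explicit ScopedNvtxRange(const char *name) { nvtxRangePushA(name); }\n", + "    ~ScopedNvtxRange() { nvtxRangePop(); }\n", + "};\n", + "\n", + "void initialize_data() {            // hypothetical function to annotate\n", + "    ScopedNvtxRange range(\"init\"); // range ends when 'range' goes out of scope\n", + "    // ... work to be measured ...\n", + "}\n", + "```\n",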
\n", + "\n", + "```cpp\n", + " nvtxRangePushA(\"init\");\n", + " initialize(A, Anew, m, n);\n", + " nvtxRangePop();\n", + "\n", + " printf(\"Jacobi relaxation Calculation: %d x %d mesh\\n\", n, m);\n", + "\n", + " double st = omp_get_wtime();\n", + " int iter = 0;\n", + "\n", + " nvtxRangePushA(\"while\");\n", + " while ( error > tol && iter < iter_max )\n", + " {\n", + " nvtxRangePushA(\"calc\");\n", + " error = calcNext(A, Anew, m, n);\n", + " nvtxRangePop();\n", + "\n", + " nvtxRangePushA(\"swap\");\n", + " swap(A, Anew, m, n);\n", + " nvtxRangePop();\n", + "\n", + " if(iter % 100 == 0) printf(\"%5d, %0.6f\\n\", iter, error);\n", + "\n", + " iter++;\n", + " }\n", + " nvtxRangePop();\n", + " \n", + "```\n", + "\n", + "\n", + "\n", + "**Using NVTX with Fortran** Being a C API in order to use NVTX in Fortran, wrappers must be written calling C API and exposed as a module. As part of this tutorial the file `nvtx.f90` consists of wrapper sub routines for NVTX API. \n", + "\n", + "Detailed NVTX documentation can be found under the __[CUDA Profiler user guide](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvtx)__." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Steps to follow\n", + "To obtain the best performance from GPU and utilize the hardware, one should follow the cyclical process (analyze, parallelize, optimize). \n", + "\n", + "- **Analyze**: In this step, you first identify the portion of your code that includes most of the computation and most of the execution time is spent. From here, you find the hotspots, evaluate the bottlenecks and start investigating GPU acceleration.\n", + "\n", + "- **Parallelize**: Now that we have identified the bottlenecks, we use use the techniques to paralellise the routines where most of the time is spent.\n", + "\n", + "- **Optimize**: To further improve the performance, one can implement optimization strategies step by step in an iterative process including: identify optimization opportunity, apply and test the optimization method, verify and repeat the process.\n", + "\n", + "Note: The above optimization is done incrementally after investigating the profiler output.\n", + "\n", + "We will follow the optimization cycle for porting and improving the code performance.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Getting Started \n", + "In the following sections, we parallelise and optimize the serial [RDF](../../../nways_MD/English/C/jupyter_notebook/serial/rdf_overview.ipynb) using different approaches to the GPU programming following the above steps. For each section, inspect the code, compile, and profile it. Then, investigate the profiler’s report to identify the bottlenecks and spot the optimization opportunities. At each step, locate problem areas in the application and make improvements iteratively to increase performance.\n", + "\n", + "This lab comprises of multiple exercises, each follows the optimization cycle method. For each exercise, compile the code, validate the output (more instruction in the labs) and profile it. You will profile the code with Nsight Systems (`nsys`), identify certain areas/kernels in the code, where they don't behave as expected. \n", + "\n", + "**NOTE**: Example screenshots are for reference only and you may not get identical profiler report." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "-----\n", + "\n", + "\n", + "\n", + "-----" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Links and Resources\n", + "\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download Nsight System's latest version from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- \n", + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/jupyter_notebook/rdf_overview.ipynb b/_basic/_common/jupyter_notebook/rdf_overview.ipynb new file mode 100644 index 0000000..93744ae --- /dev/null +++ b/_basic/_common/jupyter_notebook/rdf_overview.ipynb @@ -0,0 +1,151 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RDF\n", + "The radial distribution function (RDF), denoted in equations by g(r), defines the probability of finding a particle at a distance r from another tagged particle. The RDF is strongly dependent on the type of matter and so will vary greatly for solids, gases and liquids.\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you might have observed, the computational complexity of the algorithm is $N^{2}$. Let us get into the details of the sequential code. **Understand and analyze** the code present at:\n", + "\n", + "
\n", + " C/C++ code\n", + " \n", + "[RDF Serial Code](../source_code/rdf.cpp)\n", + "\n", + "[File Reader](../source_code/dcdread.h)\n", + " \n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran code\n", + "\n", + "[RDF Serial Code](../source_code/rdf.f90)\n", + " \n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd ../source_code && make clean && make rdf_c rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We plan to follow the typical optimization cycle that every code needs to go through\n", + "\n", + "\n", + "In order analyze the application we we will make use of profiler \"nsys\" and add \"nvtx\" marking into the code to get more information out of the serial code. Before running the below cells, let's first start by divining into the profiler lab to learn more about the tools. Using Profiler gives us the hotspots and helps to understand which function is important to be made parallel.\n", + "\n", + "-----\n", + "\n", + "#
[Nsight Systems Overview](nsight_systems.ipynb)
\n", + "\n", + "#
[Nsight Compute Overview](nsight_compute.ipynb)
\n", + "\n", + "-----\n", + "\n", + "Now, that we are familiar with the Nsight Profiler and know how to [NVTX](nsight_systems.ipynb#nvtx), let's profile the serial code and checkout the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# profiling C/C++ code\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_serial ./rdf_c" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you run the above cell, you should see the following in the terminal.\n", + "\n", + "\n", + "\n", + "To view the profiler report, download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_serial.nsys-rep) and choosing save Link As Once done, open it via the GUI. For more information on how to open the report via the GUI, please checkout the section on [How to view the report](nsight_systems.ipynb#gui-report). \n", + "\n", + "From the timeline view, right click on the nvtx row and click the \"show in events view\". Now you can see the nvtx statistic at the bottom of the window which shows the duration of each range. In the following labs, we will look in to the profiler report in more detail. \n", + "\n", + "\n", + "\n", + "The obvious next step is to make **Pair Calculation** algorithm parallel using different approaches to GPU Programming. Please follow the below link and choose one of the approaches to parallelise th serial code.\n", + "\n", + "-----\n", + "\n", + "#
[HOME](../_start_nways_C_Fortran.ipynb)
\n", + "\n", + "-----\n", + "\n", + "\n", + "# Links and Resources\n", + "\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA Nsight Compute](https://docs.nvidia.com/nsight-compute/)\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "[Profiling timelines with NVTX](https://devblogs.nvidia.com/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download Nsight System latest version from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/jupyter_notebook/rdf_overview_python.ipynb b/_basic/_common/jupyter_notebook/rdf_overview_python.ipynb new file mode 100644 index 0000000..aba104d --- /dev/null +++ b/_basic/_common/jupyter_notebook/rdf_overview_python.ipynb @@ -0,0 +1,138 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## RDF\n", + "The radial distribution function (RDF) denoted as g(r) defines the probability of finding a particle at a distance r from another tagged particle. The RDF is strongly dependent on the type of matter so will vary greatly for solids, gases and liquids. You can read more [here](https://en.wikibooks.org/wiki/Molecular_Simulation/Radial_Distribution_Functions).\n", + "\n", + " \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The code complexity of the algorithm is $N^{2}$ . Let us get into details of the serial code by clicking on the link below:\n", + "\n", + "[RDF Serial Code](../../python/source_code/serial/nways_serial_overview.py)\n", + "\n", + "\n", + "Open the downloaded file, analyze and understand the code if possible, and run the cell below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%run ../../python/source_code/serial/nways_serial_overview.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We plan to follow a typical optimization cycle that every code need to go through\n", + "\n", + "\n", + "In order to analyze the application, we will make use of the NVIDIA Nsight System profiler \"nsys\" and add NVIDIA Tools Extension SDK for annotation \"nvtx\" marking within the code to get more information out of the serial code. Before running the cell below, let's first start by diving into the profiler lab to learn more about the tools. 
Using the profiler identifies the hotspots and helps us understand which function(s) are most important to parallelize.\n", + "\n", + "-----\n", + "\n", + "#
[Nsight Systems Overview](nsight_systems.ipynb)
\n", + "\n", + "#
[Nsight Compute Overview](nsight_compute.ipynb)
\n", + "\n", + "-----\n", + "\n", + "Now, that we are familiar with the Nsight Profiler and know how to use [NVTX](nsight_systems.ipynb#nvtx), let's profile the serial code and evaluate the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd ../../python/source_code/serial&& nsys profile --stats=true --force-overwrite true -o serial_cpu_rdf python3 nways_serial_overview.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you run the above cell, you should see the following in the terminal:\n", + "\n", + "\n", + "\n", + "\n", + "To view the profiler report, you need to download and save the report file by holding down Shift and Right-Clicking [Here](../../python/source_code/serial/serial_cpu_rdf.nsys-rep) and choosing save Link As Once done, open it via the GUI. For more information on how to open the report via the GUI, please check out the section on [how to view the report](jupyter_notebook/nsight_systems.ipynb#gui-report). \n", + "\n", + "From the timeline view, right click on the nvtx row and click the \"show in events view\". You can see the nvtx statistic at the bottom of the window which shows the duration of each range. In the following labs, we will explore the profiler report in more detail. \n", + "\n", + "\n", + "\n", + "The next step is to make the **Pair Calculation** algorithm parallel using existing approaches within GPU Programming. Please follow the link below and choose one approach to parallelize the serial code.\n", + "\n", + "-----\n", + "\n", + "#
[HOME](../_start_nways_python.ipynb)
\n", + "-----\n", + "\n", + "\n", + "# Links and Resources\n", + "\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "[Profiling timelines with NVTX](https://devblogs.nvidia.com/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download the latest version of NVIDIA Nsight System from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/_common/source_code/Makefile b/_basic/_common/source_code/Makefile new file mode 100644 index 0000000..e0b792b --- /dev/null +++ b/_basic/_common/source_code/Makefile @@ -0,0 +1,20 @@ +# Copyright (c) 2020 NVIDIA Corporation. All rights reserved. + +CC := nvc++ +CFLAGS := -O3 -w -ldl +ACCFLAGS := -Minfo=accel +NVTXLIB_c := -I/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/lib64 -lnvToolsExt + +FC := nvfortran +FLAGS := -O3 -w +NVTXLIB_f := -lnvhpcwrapnvtx + +rdf_f:rdf.f90 + ${FC} ${FLAGS} ${ACCFLAGS} rdf.f90 -o rdf_f ${NVTXLIB_f} + + +rdf_c: rdf.cpp + ${CC} ${CFLAGS} ${ACCFLAGS} -o rdf_c rdf.cpp ${NVTXLIB_c} + +clean: + rm -f *.o rdf_c rdf_f \ No newline at end of file diff --git a/_basic/_common/source_code/dcdread.h b/_basic/_common/source_code/dcdread.h new file mode 100644 index 0000000..5acbc25 --- /dev/null +++ b/_basic/_common/source_code/dcdread.h @@ -0,0 +1,53 @@ +////////////////////////////////////////////////////////////////////////////////////////// +// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +////////////////////////////////////////////////////////////////////////////////////////// + +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+#include <fstream> // NOTE: reconstructed include; needed for std::istream and the ios seek flags used below + +using namespace std; + +void dcdreadhead(int *natom, int *nframes, std::istream &infile) +{ + + infile.seekg(8, ios::beg); + infile.read((char *)nframes, sizeof(int)); + infile.seekg(64 * 4, ios::cur); + infile.read((char *)natom, sizeof(int)); + infile.seekg(1 * 8, ios::cur); + return; +} + +void dcdreadframe(double *x, double *y, double *z, std::istream &infile, + int natom, double &xbox, double &ybox, double &zbox) +{ + + double d[6]; + for (int i = 0; i < 6; i++) + { + infile.read((char *)&d[i], sizeof(double)); + } + xbox = d[0]; + ybox = d[2]; + zbox = d[5]; + float a, b, c; + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&a, sizeof(float)); + x[i] = a; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&b, sizeof(float)); + y[i] = b; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&c, sizeof(float)); + z[i] = c; + } + infile.seekg(1 * 8, ios::cur); + + return; +} diff --git a/_basic/_common/source_code/rdf.cpp b/_basic/_common/source_code/rdf.cpp new file mode 100644 index 0000000..11bc98a --- /dev/null +++ b/_basic/_common/source_code/rdf.cpp @@ -0,0 +1,204 @@ +////////////////////////////////////////////////////////////////////////////////////////// +// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +////////////////////////////////////////////////////////////////////////////////////////// + +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. +// NOTE: the original header names were lost in this diff; the list below is reconstructed from usage in this file +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <math.h> +#include <assert.h> +#include <iostream> +#include <fstream> +#include <string> +#include "dcdread.h" +#include <algorithm> +#include <nvtx3/nvToolsExt.h> + +//Function declaration where distances are calculated +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned long long int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned long long int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////Input Details////////////////////////////////////////////// + inconf = 10; + nbin = 2000; + file = "../input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + //Output file storing the RDF values + pairfile.open("RDF.dat"); + //Output file storing the entropy. 
This value will be used to check the correctness of output + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned long long int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned long long int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; + /////////////////////This is where we will concentrate////////////////// + nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + //////////////////////////////////////////////////////////////////////// + + double pi = acos(-1.0l); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + // Loop to calculate entropy + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} + +int round(float num) +{ + return num < 0 ? 
num - 0.5 : num + 0.5; +} + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned long long int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + int myround; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + + printf("\n %d %d ", nconf, numatm); + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); + for (int id1 = 0; id1 < numatm; id1++) + { + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + } +} diff --git a/_basic/_common/source_code/rdf.f90 b/_basic/_common/source_code/rdf.f90 new file mode 100644 index 0000000..f16c13c --- /dev/null +++ b/_basic/_common/source_code/rdf.f90 @@ -0,0 +1,161 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + 
vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + do i=1,natoms + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r\n", + "* Programming Language Extension: CUDA C, CUDA Fortran, Python CuPy, Python Numba\n", + "\n", + "Let's start by testing the CUDA Driver and GPU you are running the code on in this lab:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Bootcamp Outline\n", + "\n", + "During this lab, we will be working on porting mini applications in Molecular Simulation (MD) domain to GPUs. You can choose to work with either version of this application. Please click on one of the below links to start N Ways to GPU Programming in **MD** for:\n", + "\n", + "- [C and Fortran](_common/_start_nways_C_Fortran.ipynb) \n", + "- [Python](_common/_start_nways_python.ipynb) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Bootcamp Duration\n", + "The lab material will be presented in an 8-hour session. A Link to the material is available for download at the end of the lab.\n", + "\n", + "### Content Level\n", + "Beginner, Intermediate\n", + "\n", + "### Target Audience and Prerequisites\n", + "The target audience for this lab are researchers/graduate students and developers who are interested in learning about various ways of GPU programming to accelerate their scientific applications.\n", + "\n", + "Basic experience with C/C++ or Python or Fortran programming is needed. No GPU programming knowledge is required. \n", + "\n", + "--- \n", + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/cuda/Presentations/README.md b/_basic/cuda/Presentations/README.md new file mode 100644 index 0000000..929ea89 --- /dev/null +++ b/_basic/cuda/Presentations/README.md @@ -0,0 +1,5 @@ +For Partners who are interested in delivering the critical hands-on skills needed to advance science in form of Bootcamp can reach out to us at [Open Hackathons Partner](https://www.openhackathons.org/s/about-open-hackathons) website. In addition to current bootcamp material the Partners will be provided with the following: + +- Presentation: All the Bootcamps are accompanied with training material presentations which can be used during the Bootcamp session. +- Mini challenge : To test the knowledge gained during this Bootcamp a mini application challenge is provided along with sample Solution. 
+- Additional Support: On a case-by-case basis, the Partners can also be trained on how to effectively deliver the Bootcamp with maximal impact. \ No newline at end of file diff --git a/_basic/cuda/jupyter_notebook/nways_cuda.ipynb b/_basic/cuda/jupyter_notebook/nways_cuda.ipynb new file mode 100644 index 0000000..d616afe --- /dev/null +++ b/_basic/cuda/jupyter_notebook/nways_cuda.ipynb @@ -0,0 +1,785 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we begin, let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning objectives\n", + "The **goal** of this lab is to:\n", + "- Learn how to use CUDA C and CUDA Fortran to parallelize our code.\n", + "- Understand the basic terms and steps involved in making a sequential code parallel.\n", + "\n", + "We do not intend to cover:\n", + "- Optimization techniques like memory access patterns and memory hierarchy.\n", + "\n", + "# Introduction\n", + "Graphics Processing Units (GPUs) were initially designed to accelerate graphics processing, but in 2007 the release of CUDA introduced GPUs as General Purpose Processors. CUDA is a parallel computing platform and programming model that makes using a GPU for general purpose computing simple and elegant. The developer still programs in the familiar C, C++, Fortran, or an ever expanding list of supported languages, and incorporates extensions of these languages in the form of a few basic keywords.\n", + "\n", + "- CUDA C/C++ is based on standard C/C++ and CUDA Fortran is based on standard Fortran\n", + "- CUDA is a set of extensions to enable heterogeneous programming\n", + "- CUDA is a straightforward API to manage devices, memory, etc.\n", + "\n", + "\n", + "# CUDA \n", + "\n", + "\n", + "**Heterogeneous Computing:** CUDA is a heterogeneous programming model that includes provisions for both a CPU and a GPU. The CUDA C/C++ programming interface consists of C language extensions and the CUDA Fortran programming interface consists of Fortran language extensions. These enable you to target portions of source code for parallel execution on the device (GPU). CUDA provides a library of C/Fortran functions that can be\n", + "executed on the host (CPU) so that it can interact with the device. The two processors that work with each other are: \n", + "\n", + "- Host: CPU and its memory (Host Memory)\n", + "- Device: GPU and its memory (Device Memory)\n", + "\n", + "\n", + "Let us look at a Hello World example in C and Fortran: \n", + "\n", + "\n", + "
\n", + " CUDA C/C++\n", + " \n", + "```cpp\n", + "_global__ void print_from_gpu(void) {\n", + " printf(\"Hello World! from thread [%d,%d] From device\\n\", threadIdx.x,blockIdx.x);\n", + "}\n", + "\n", + "int main(void) {\n", + " printf(\"Hello World from host!\\n\");\n", + " print_from_gpu<<<1,1>>>();\n", + " cudaDeviceSynchronize();\n", + " return 0;\n", + "}\n", + "\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " CUDA Fortran\n", + " \n", + "\n", + "```fortran\n", + "module printgpu\n", + "contains\n", + " attributes(global) subroutine print_form_gpu()\n", + " implicit none\n", + " integer :: i\n", + " i = blockDim%x * (blockIdx%x - 1) + threadIdx%x\n", + " print *, i\n", + " end subroutine saxpy \n", + "end module printgpu\n", + "\n", + "program testPrint\n", + " use printgpu\n", + " use cudafor\n", + " implicit none\n", + "\n", + " call print_form_gpu<<<1, 1>>>()\n", + " cudaDeviceSynchronize()\n", + "end program testPrint\n", + "\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "So you might have already observed that CUDA C is nothing but extensions/constructs to existing language. Let us look at what those additional constructs we introduced above:\n", + "\n", + "- ```__global__``` :This keyword, when added before the function, tells the compiler that this is a function that will run on the device and not on the host. \n", + "- ``` <<<,>>> ``` : This keyword tells the compiler that this is a call to the device function and not the host function. Additionally, the 1,1 parameter basically dictates the number of threads to launch in the kernel. We will cover the parameters inside angle brackets later. \n", + "- ``` threadIdx.x, blockIdx.x ``` : This is a unique ID that's given to all threads. \n", + "- ``` cudaDeviceSynchronize() ``` : All of the kernel(Function that runs on GPU) calls in CUDA are asynchronous in nature. This API will make sure that host does not proceed until all device calls are over.\n", + "\n", + "\n", + "## GPU Architecture\n", + " \n", + "In this section will take an approach of describing the CUDA programming model by showing relationship between the software programming concepts and how do they get mapped to GPU hardware.\n", + "\n", + "The diagram below shows a higher level of abstraction of components of GPU hardware and its respective programming model mapping. \n", + "\n", + "\n", + "\n", + "As shown in the diagram above CUDA programming model is tightly coupled with hardware design. This makes CUDA one of the most efficient parallel programming model for shared memory systems. Another way to look at the diagram shown above is given below: \n", + "\n", + "| Software | Executes | Hardware |\n", + "| --- | --- | --- |\n", + "| CUDA thread | on/as | CUDA Core | \n", + "| CUDA block | on/as | Streaming Multiprocessor |\n", + "| GRID/Kernel | on/as | GPU Device |\n", + "\n", + "We will get into the concept of _blocks_ and _threads_ in upcoming section. But let us first look at steps involved in writing CUDA code.\n", + "\n", + "\n", + "## Steps in CUDA Programming\n", + "\n", + "The below table highlights the typical steps which are required to convert sequential code to CUDA code:\n", + "\n", + "| Sequential code | CUDA Code |\n", + "| --- | --- |\n", + "| **Step 1** Allocate memory on the CPU ( _malloc new_ ) | **Step 1** : Allocate memory on the CPU (_malloc, new_ )|\n", + "| **Step 2** Populate/initialize the CPU data | **Step 2** Allocate memory on the GPU, using API like _cudaMalloc()_ |\n", + "| **Step 3** Call the CPU function that has the crunching of data. | **Step 3** Populate/initialize the CPU |\n", + "| **Step 4** Consume the crunched data on Host | **Step 4** Transfer the data from the host to the device with _cudaMemcpy()_ |\n", + "| | **Step 5** Call the GPU function with _<<<,>>>_ brackets |\n", + "| | **Step 6** Synchronize the device and host with _cudaDeviceSynchronize()_ |\n", + "| | **Step 7** Transfer data from the device to the host with _cudaMemcpy()_ |\n", + "| | **Step 8** Consume the crunched data on Host |\n", + "\n", + "CPU and GPU memory are different and developer needs to use additional CUDA API to allocate and free memory on GPU. Only device memory can be consumed inside GPU function call (kernel). 
\n", + " \n", + "In CUDA C/C++, linear memory on Device is typically allocated using ```cudaMalloc()``` and freed using ```cudaFree()``` and data transfer between host memory and device memory are typically done using ```cudaMemcpy()```.\n", + "\n", + "In CUDA Fortran, linear memory on Device is typically allocated by defining array as ```allocatable, device``` type and data transfer between host memory and device memory are typically done using ```cudaMemcpy()```.\n", + " \n", + "\n", + "The API definition of these are as follows: \n", + "\n", + "**cudaError_t cudaMalloc (void ∗∗ devPtr, size_t size)** in CUDA C/C++ and **integer function cudaMalloc(devptr, size)** in CUDA Fortran, allocate size bytes of linear memory on the device and returns a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. `cudaMalloc()` returns ```cudaErrorMemoryAllocation``` in case of failure or ```cudaSuccess```.\n", + " \n", + "**cudaError_t cudaMemcpy (void ∗ dst, const void ∗ src, size_t count, enum cudaMemcpyKind kind)** in CUDA C/C++ and **integer function cudaMemcpy(dst, src, count, kind)** in CUDA Fortran, copies count bytes from the memory area pointed to by `src` to the memory area pointed to by `dst`. `dst` and `src` may be any device or host, scalar or array. `kind` is one of the defined enums `cudaMemcpyHostToDevice`, `cudaMemcpyDeviceToHost`, `cudaMemcpyDeviceToDevice` or `cudaMemcpyHostToHost` (this specifies the direction of the copy).\n", + "\n", + "Please note, calling `cudaMemcpy()` with `dst` and `src` pointers that do not match the direction of the copy results in an undefined behavior.\n", + "\n", + "**cudaError_t cudaFree (void ∗ devPtr)** Frees the memory space pointed to by `devPtr`, which must have been returned by a previous call to `cudaMalloc()` or other equivalent API. \n", + " \n", + "Let us look at these steps in more detail for a simple vector addition code:\n", + "\n", + " \n", + "
\n", + " CUDA C/C++\n", + " \n", + "```cpp\n", + "int main(void) {\n", + "\tint *a, *b, *c;\n", + " int *d_a, *d_b, *d_c; // device copies of a, b, c\n", + "\n", + "\tint size = N * sizeof(int);\n", + "\n", + "\t// Alloc space for host copies of a, b, c and setup input values\n", + "\ta = (int *)malloc(size); fill_array(a);\n", + "\tb = (int *)malloc(size); fill_array(b);\n", + "\tc = (int *)malloc(size);\n", + "\n", + " // Alloc space for device copies of a, b, c\n", + " cudaMalloc((void **)&d_a, size);\n", + " cudaMalloc((void **)&d_b, size);\n", + " cudaMalloc((void **)&d_c, size);\n", + "\n", + " // Copy inputs to device\n", + " cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);\n", + " cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);\n", + "\n", + "\n", + "\tdevice_add<<>>(d_a,d_b,d_c);\n", + "\n", + " // Copy result back to host\n", + " cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);\n", + "\n", + "\tprint_output(a,b,c);\n", + "\n", + "\tfree(a); free(b); free(c);\n", + " cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);\n", + "\n", + "\n", + "\n", + "\treturn 0;\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " CUDA Fortran\n", + " \n", + "\n", + "```fortran\n", + "module kernel\n", + " contains\n", + " ! CUDA kernel. Each thread takes care of one element of c\n", + " attributes(global) subroutine vecAdd_kernel(n, a, b, c)\n", + " integer, value :: n\n", + " real(8), device :: a(n), b(n), c(n)\n", + " integer :: id\n", + " \n", + " ! Get our global thread ID\n", + " id = (blockidx%x-1)*blockdim%x + threadidx%x\n", + " \n", + " ! Make sure we do not go out of bounds\n", + " if (id <= n) then\n", + " c(id) = a(id) + b(id)\n", + " endif\n", + " end subroutine vecAdd_kernel\n", + "end module kernel\n", + " \n", + "program main\n", + " use cudafor\n", + " use kernel\n", + " \n", + " type(dim3) :: blockSize, gridSize\n", + " real(8) :: sum\n", + " integer :: i\n", + " \n", + " ! Size of vectors\n", + " integer :: n = 1\n", + " \n", + " ! Host input vectors\n", + " real(8),dimension(:),allocatable :: h_a\n", + " real(8),dimension(:),allocatable :: h_b\n", + " !Host output vector\n", + " real(8),dimension(:),allocatable :: h_c\n", + " \n", + " ! Device input vectors\n", + " real(8),device,dimension(:),allocatable :: d_a\n", + " real(8),device,dimension(:),allocatable :: d_b\n", + " !Host output vector\n", + " real(8),device,dimension(:),allocatable :: d_c\n", + " \n", + " ! Allocate memory for each vector on host\n", + " allocate(h_a(n))\n", + " allocate(h_b(n))\n", + " allocate(h_c(n))\n", + " \n", + " ! Allocate memory for each vector on GPU\n", + " allocate(d_a(n))\n", + " allocate(d_b(n))\n", + " allocate(d_c(n))\n", + " \n", + " ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2\n", + " do i=1,n\n", + " h_a(i) = sin(i*1D0)*sin(i*1D0)\n", + " h_b(i) = cos(i*1D0)*cos(i*1D0)\n", + " enddo\n", + " \n", + " ! Implicit copy of host vectors to device\n", + " d_a = h_a(1:n)\n", + " d_b = h_b(1:n)\n", + " \n", + "\n", + " ! Execute the kernel\n", + " call vecAdd_kernel<<<1, 1>>>(n, d_a, d_b, d_c)\n", + " \n", + " ! Implicit copy of device array to host\n", + " h_c = d_c(1:n)\n", + " \n", + " ! Sum up vector c and print result divided by n, this should equal 1 within error\n", + " sum = 0.0;\n", + " do i=1,n\n", + " sum = sum + h_c(i)\n", + " enddo\n", + " sum = sum/real(n)\n", + " print *, 'final result: ', sum\n", + " \n", + " ! Release device memory\n", + " deallocate(d_a)\n", + " deallocate(d_b)\n", + " deallocate(d_c)\n", + " \n", + " ! Release host memory\n", + " deallocate(h_a)\n", + " deallocate(h_b)\n", + " deallocate(h_c)\n", + " \n", + "end program main\n", + "```\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "### Unified Memory\n", + "An easier way to allocate memory accessible by the GPU is to use *Unified Memory*. It provides a single memory space accessible by all GPUs and CPUs in the system. To allocate data in unified memory, we call `cudaMallocManaged()`, which returns a pointer that you can access from host (CPU) code or device (GPU) code. To free the data, just pass the pointer to `cudaFree()`. To read more about unified memory, please checkout the blog on [Unified Memory for CUDA beginners](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/).\n", + "\n", + "\n", + "\n", + "Below is the example usage of how to use managed memory in the CUDA code:\n", + "\n", + "
\n", + " CUDA C/C++\n", + "\n", + "```cpp\n", + " // Allocate Unified Memory -- accessible from CPU or GPU\n", + " int *a, *b, *c;\n", + " cudaMallocManaged(&a, N*sizeof(int));\n", + " cudaMallocManaged(&b, N*sizeof(int));\n", + " cudaMallocManaged(&c, N*sizeof(int));\n", + " ...\n", + "\n", + " // Free memory\n", + " cudaFree(a);\n", + " cudaFree(b);\n", + " cudaFree(c);\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " CUDA Fortran\n", + " \n", + "```fortran\n", + "!matrix data\n", + "real, managed, allocatable, dimension(:,:) :: A, B, C\n", + "```\n", + "
\n", + "
\n", + "\n", + "## Understanding Threads and Blocks\n", + "We will be looking at understanding _thread_ and _block_ level parallelism in this section.The number of threads and blocks to be launched is passed as parameter to ```<<<,>>>``` brackets in a kernel call.\n", + "\n", + "### Creating multiple blocks\n", + "\n", + "In order to create multiple blocks for vector addition code above you need to change two things:\n", + "1. Change _<<<1,1>>>_ to <<>>_ which basically launches N number of blocks\n", + "2. Access the array with block index using private variable passed by default to CUDA kernel: _blockIdx.x_\n", + "\n", + "\n", + "
\n", + " CUDA C/C++\n", + " \n", + "```cpp\n", + "//changing from device_add<<<1,1>>> to\n", + "device_add<<>>\n", + "//access the array using blockIdx.x private variable\n", + "__global__ void device_add(int *a, int *b, int *c) {\n", + " c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " CUDA Fortran\n", + "\n", + "```fortran\n", + "attributes(global) subroutine vecAdd_kernel(n, a, b, c)\n", + " integer, value :: n\n", + " real(8), device :: a(n), b(n), c(n)\n", + " integer :: id\n", + " \n", + " ! Get our global thread ID\n", + " id = blockidx%x\n", + " \n", + " ! Make sure we do not go out of bounds\n", + " if (id <= n) then\n", + " c(id) = a(id) + b(id)\n", + " endif\n", + " end subroutine vecAdd_kernel\n", + "}\n", + "``` \n", + "\n", + "
\n", + "
\n", + "\n", + "By using `blockIdx.x` to index the array, each block handles a different element of the array and may execute in parallel to each other.\n", + "\n", + "| Block Id | Performs |\n", + "| --- | --- |\n", + "| Block 0 | _c\\[0\\]=b\\[0\\]+a\\[0\\]_ |\n", + "| Block 1 | _c\\[1\\]=b\\[1\\]+a\\[1\\]_ |\n", + "| Block 2 | _c\\[2\\]=b\\[2\\]+a\\[2\\]_ |\n", + "\n", + "**Understand and analyze** the sample vector addition code [vector_addition_block.cu](../source_code/vector_addition_gpu_block_only.cu).Open the downloaded files for inspection. \n", + "\n", + "\n", + "\n", + "### Creating multiple threads\n", + "\n", + "In order to create multiple threads for vector addition code above. You need to change two things:\n", + "1. change _<<<1,1>>>_ to <<<1,N>>>_ which basically launches N number of threads inside 1 block\n", + "2. Access the array with thread index using private variable passed by default to CUDA kernel: _threadIdx.x_\n", + "\n", + "\n", + "
\n", + " CUDA C/C++\n", + " \n", + "```cpp\n", + "//changing from device_add<<<1,1>>> to\n", + "device_add<<<1,N>>>\n", + "//access the array using threadIdx.x private variable\n", + "__global__ void device_add(int *a, int *b, int *c) {\n", + " c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " CUDA Fortran\n", + " \n", + " ```fortran\n", + "attributes(global) subroutine vecAdd_kernel(n, a, b, c)\n", + " integer, value :: n\n", + " real(8), device :: a(n), b(n), c(n)\n", + " integer :: id\n", + " \n", + " ! Get our global thread ID\n", + " id = threadidx%x\n", + " \n", + " ! Make sure we do not go out of bounds\n", + " if (id <= n) then\n", + " c(id) = a(id) + b(id)\n", + " endif\n", + " end subroutine vecAdd_kernel\n", + "``` \n", + "\n", + "
\n", + "
\n", + " \n", + "By using `threadIdx.x` to index the array, each thread handles a different element of the array and can execute in parallel.\n", + "\n", + "| thread Id | Performs |\n", + "| --- | --- |\n", + "| Thread 0 | _c\\[0\\]=b\\[0\\]+a\\[0\\]_ |\n", + "| Thread 1 | _c\\[1\\]=b\\[1\\]+a\\[1\\]_ |\n", + "| Thread 2 | _c\\[2\\]=b\\[2\\]+a\\[2\\]_ |\n", + "\n", + "**Understand and analyze** the sample vector addition code [vector_addition_thread.cu](../source_code/vector_addition_gpu_thread_only.cu).\n", + " \n", + "### Creating multiple blocks each having many threads\n", + "\n", + "So far, we've looked at parallel vector addition through the use of several blocks with one thread and one block with several\n", + "threads. Now let us look at creating multiple blocks, each block containing multiple threads.\n", + "\n", + "To understand it lets take a scenario where the total number of vector elements is 32 which needs to be added in parallel. Total number of parallel execution unit required is 32. As a first step let us define that each block contains eight threads(we are not saying this is optimal configuration and is just for explanation purpose). Next we define the number of blocks. The simplest calculation is No_Of_Blocks = 32/8 where 8 is number of threads per blocks. The code changes required to launch 4 blocks with 8 thread each is as shown below: \n", + "1. Change _<<<1,1>>>_ to <<<4,8>>>_ which basically launches 4 number of threads per block and 8 total blocks\n", + "2. Access the array with both thread index and block index using private variable passed by default to call CUDA kernel: _threadIdx.x_ and _blockIdx.x_ and _bloxkDim.x_ which tells how many threads are allocated per block. \n", + "\n", + " \n", + "
\n", + " CUDA C/C++\n", + "\n", + "```cpp\n", + "threads_per_block = 8;\n", + "no_of_blocks = N/threads_per_block;\n", + "device_add<<>>(d_a,d_b,d_c);\n", + "\n", + "__global__ void device_add(int *a, int *b, int *c) {\n", + " int index = threadIdx.x + blockIdx.x * blockDim.x;\n", + " c[index] = a[index] + b[index];\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "
\n", + "CUDA Fortran\n", + " \n", + "```fortran\n", + "! Number of threads in each thread block\n", + " blockSize = dim3(8,1,1)\n", + " ! Number of thread blocks in grid\n", + " gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1)\n", + " call vecAdd_kernel<<>>(n, d_a, d_b, d_c)\n", + "\n", + " ! CUDA kernel. Each thread takes care of one element of c\n", + " attributes(global) subroutine vecAdd_kernel(n, a, b, c)\n", + " integer, value :: n\n", + " real(8), device :: a(n), b(n), c(n)\n", + " integer :: id\n", + " \n", + " ! Get our global thread ID\n", + " id = (blockidx%x-1)*blockdim%x + threadidx%x\n", + " \n", + " ! Make sure we do not go out of bounds\n", + " if (id <= n) then\n", + " c(id) = a(id) + b(id)\n", + " endif\n", + " end subroutine vecAdd_kernel\n", + "```\n", + "
\n", + "
\n", + " \n", + "The diagram below shows the launch configuration that we discussed so far:\n", + "\n", + "\n", + "\n", + "Modern GPU Architectures consists of multiple SM, each consisting of number of cores. In order to utilize whole GPU it is important to make use of both threads and blocks. \n", + "\n", + "**Understand and analyze** the sample vector addition code [vector_addition_block_thread.cu](../source_code/vector_addition_gpu_thread_block.cu).Open the downloaded files for inspection. \n", + "\n", + "\n", + "The more important question which may arise is why bother with threads altogether? What do we gain by adding additional level of parallelism? Short answer is CUDA programming model defines that unlike parallel blocks, threads have mechanisms to efficiently communicate and synchronize.\n", + "\n", + "This is necessary to implement certain algorithms where threads needs to communicate with each other.We do not require synchronization across threads in **Pair Calculation** so we will not be going into details of concept of synchronization across threads and usage of specialized memory like _shared_ memory in this tutorial. \n", + "\n", + "# Atomic Construct\n", + "\n", + "In the code you will also require one more construct which will help you in getting the right results. OpenACC atomic construct ensures that a particular variable is accessed and/or updated atomically to prevent indeterminate results and race conditions. In other words, it prevents one thread from stepping on the toes of other threads due to accessing a variable simultaneously, resulting in different results run-to-run. For example, if I want to count the number of elements that have a value greater than zero, we could write the following:\n", + "\n", + "\n", + "
\n", + " CUDA C/C++\n", + " \n", + "```cpp\n", + "__global__ void countMoreThanZero( ... )\n", + "{\n", + " if ( val > 0 )\n", + " {\n", + " atomicAdd(&cnt[0],1);\n", + " }\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "
\n", + " CUDA Fortran\n", + "\n", + "```fortran\n", + "if(r\n", + "
\n", + " \n", + "# A Quick Recap\n", + "We saw the definition of CUDA and briefly covered CUDA architecture and introduced CUDA C and CUDA Fortran constructs. We also played with block and thread configurations for a simple vector addition code. All this was done under the following restrictions:\n", + "1. **Multiple Dimension**: We launched threads and blocks in one dimension. We have been using `threadIdx.x` and `blockIdx.x`, so what is `.x` ? This statement basically says that we are launching threads and blocks in one dimension only. CUDA allows to launch threads in 3 dimensions. You can also have `.y` and `.z` for index calculation. For example you can launch threads and blocks in 2 dimensions for dividing work for a 2D image. Also the maximum number of threads per block and number of blocks allowed per dimension is restricted based on the GPU that the code is run on.\n", + "2. **GPU Memory**: What we have not covered is that GPU has different hierarchy of memory, e.g. GPU has a read only memory which provides high bandwidth for 2D and 3D locality access called _texture_. Also GPU provides a scratch pad limited memory called as _shared memory_\n", + "3. **Optimization** : What we did not cover so far is the right way to access the compute and memory to get max performance. \n", + "\n", + "**One key characteristic about CUDA is that a user can control access pattern of data for each thread. The user can decide which part of memory the data can sits on. While we are covering some part of this in this lab, which is required for us to port our code, we do not intend to cover all optimizations**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Compile and Run for NVIDIA GPU\n", + "Now, lets start modifying the original code and add add the CUDA constructs. You can either explicitly transfer the allocated data between CPU and GPU or use unified memory which creates a pool of managed memory that is shared between the CPU and GPU.\n", + "\n", + "Click on the [C/C++ version](../source_code/rdf.cu) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after changes, before running below cells.\n", + "\n", + "**Note:** When `-arch=native` compiled option is used, `nvcc` detects the visible GPUs on the system and generates codes for them. It is a warning if no visible supported GPU on the system, and the default architecture will be used.\n", + "\n", + "Moreover, for the CUDA Fortran version, we are targeting the NVTX v3 API, a header-only C library, and added Fortran-callable wrappers to the code, we add `-lnvhpcwrapnvtx` at the compile time to do the link to the library." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (C/C++)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvcc -arch=native -o rdf_c rdf.cu && echo \"Running the executable and validating the output\" && ./rdf_c && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Output\n", + " \n", + " \n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + "```\n", + "
\n", + "
\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -cuda -o rdf_f rdf.f90 -lnvhpcwrapnvtx && echo \"Running the executable and validating the output\" && ./rdf_f && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
\n", + " Output\n", + " \n", + " \n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 : -2.452690945278331 \n", + "s2bond : -24.37502820694527 \n", + "```\n", + "
\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's profile the code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (C/C++)\n", + "!cd ../source_code && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_cuda_c ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (Fortran)\n", + "!cd ../source_code && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_cuda_f ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's checkout the profiler's report. Download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_cuda_c.nsys-rep) or the [Fortran version](../source_code/rdf_cuda_f.nsys-rep) and choosing save Link As Once done, open it via the GUI. Have a look at the example expected profiler report below:\n", + "\n", + "\n", + "
\n", + " Example screenshot (C/C++ code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "Nsight systems is capable of capturing information about CUDA execution in the profiled process.CUDA API row in the timeline view shows traces of CUDA Runtime and Driver calls made by application. If you hover your mouse over it, you will see more information about the calls.\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "Near the bottom of the timeline row tree, the GPU node will appear and contain a CUDA node. Within the CUDA node, each CUDA context used within the process will be shown along with its corresponding CUDA streams. Streams will contain memory operations and kernel launches on the GPU. In the example screenshot below, you can see Kernel launches are represented by blue, while memory transfers are displayed in red and green. In this example screenshot, unified memory was used rather than explicitly transferring data between CPU and GPU.\n", + "\n", + "\n", + "\n", + "\n", + "Feel free to checkout the solutions for [C/C++ solution (with managed memory)](../source_code/SOLUTION/rdf_unified_memory.cu), [C/C++ solution (without managed memory)](../source_code/SOLUTION/rdf_malloc.cu) and [Fortran solution (with managed memory)](../source_code/SOLUTION/rdf_unified_memory.f90) versions to help you understand better.\n", + "\n", + "# Analysis\n", + "\n", + "**Usage Scenarios**\n", + "\n", + "Using launguage extensions like CUDA C, CUDA Fortran helps developers get the best performance out of their code on an NVIDIA GPU. CUDA C and other language construct exposes the GPU architecture and programming model which gives more control to developers with respect to memory storage, access and thread control. Based on the type of application it can provide many fold improvement over say compiler generated codes with help of directives. \n", + "\n", + "**How is CUDA different from other GPU progamming models like OpenACC and OpenMP?**\n", + "\n", + "CUDA should not be considered an alternative to OpenMP or OpenACC. In fact CUDA complements directive-based programming models and there are defined interoperability strategies between them. You can always start accelerating your code with OpenACC and use CUDA to optimize the most performance critical kernels. For example use OpenACC for data transfer and then pass a device pointer to one of critical CUDA kernels which is written in CUDA. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, it is recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd ..\n", + "rm -f _files.zip\n", + "zip -r _files.zip *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../_files.zip) and choosing save Link As.\n", + "\n", + "-----\n", + "\n", + "\n", + "# Links and Resources\n", + "[Introduction to CUDA](https://devblogs.nvidia.com/even-easier-introduction-cuda/)\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download Nsight System latest version from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/cuda/source_code/SOLUTION/dcdread.h b/_basic/cuda/source_code/SOLUTION/dcdread.h new file mode 100644 index 0000000..66ddba0 --- /dev/null +++ b/_basic/cuda/source_code/SOLUTION/dcdread.h @@ -0,0 +1,49 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+using namespace std; + +void dcdreadhead(int *natom, int *nframes, std::istream &infile) +{ + + infile.seekg(8, ios::beg); + infile.read((char *)nframes, sizeof(int)); + infile.seekg(64 * 4, ios::cur); + infile.read((char *)natom, sizeof(int)); + infile.seekg(1 * 8, ios::cur); + return; +} + +void dcdreadframe(double *x, double *y, double *z, std::istream &infile, + int natom, double &xbox, double &ybox, double &zbox) +{ + + double d[6]; + for (int i = 0; i < 6; i++) + { + infile.read((char *)&d[i], sizeof(double)); + } + xbox = d[0]; + ybox = d[2]; + zbox = d[5]; + float a, b, c; + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&a, sizeof(float)); + x[i] = a; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&b, sizeof(float)); + y[i] = b; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&c, sizeof(float)); + z[i] = c; + } + infile.seekg(1 * 8, ios::cur); + + return; +} diff --git a/_basic/cuda/source_code/SOLUTION/rdf_malloc.cu b/_basic/cuda/source_code/SOLUTION/rdf_malloc.cu new file mode 100644 index 0000000..5db6416 --- /dev/null +++ b/_basic/cuda/source_code/SOLUTION/rdf_malloc.cu @@ -0,0 +1,221 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +using namespace std; +//additional error handling code +static void HandleError(cudaError_t err, + const char *file, int line) { + if (err != cudaSuccess) { + printf( "%s in %s at line %d\n", cudaGetErrorString( err ), + file, line ); + exit( EXIT_FAILURE ); + } +} +#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ )) + +//declaration of GPU function +__global__ void pair_gpu(const double* d_x, const double* d_y, const double* d_z, + unsigned long long int *d_g2, int numatm, int nconf, + double xbox, double ybox, double zbox, int d_bin); + +int main(int argc , char* argv[]) +{ + double xbox,ybox,zbox; + double* h_x,*h_y,*h_z; + double* d_x,*d_y,*d_z; + unsigned long long int *h_g2,*d_g2; + int nbin; + int device; + int numatm,nconf,inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + device = 0; + HANDLE_ERROR (cudaSetDevice(device));//pick the device to use + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if(!infile){ + cout<<"file "<nconf) cout << "nconf is reset to "<< nconf <>> + (d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, nbin); + + HANDLE_ERROR (cudaPeekAtLastError()); + HANDLE_ERROR(cudaDeviceSynchronize()); + + HANDLE_ERROR(cudaMemcpy(h_g2, d_g2, sizebin, cudaMemcpyDeviceToHost)); + + nvtxRangePop(); //Pop for Pair Calculation + + double pi=acos(-1.0l); + double rho=(numatm)/(xbox*ybox*zbox); + double norm=(4.0l*pi*rho)/3.0l; + double rl,ru,nideal; + double g2[nbin]; + double r,gr,lngr,lngrbond,s2=0.0l,s2bond=0.0l; + double box=min(xbox,ybox); + box=min(box,zbox); + double del=box/(2.0l*nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i=0;i= numatm || id2 >= numatm) return; + if (id1 > id2) return; + + for (int frame = 0; frame < nconf; ++frame) { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox 
* (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) { + ig2 = (int)(r / del); + atomicAdd(&d_g2[ig2], 2); + } + } +} diff --git a/_basic/cuda/source_code/SOLUTION/rdf_orig.f90 b/_basic/cuda/source_code/SOLUTION/rdf_orig.f90 new file mode 100644 index 0000000..5bc16e8 --- /dev/null +++ b/_basic/cuda/source_code/SOLUTION/rdf_orig.f90 @@ -0,0 +1,208 @@ + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + use cudafor + implicit none + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, managed, allocatable :: x(:,:) + real*4, managed, allocatable :: y(:,:) + real*4, managed, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + xbox=d(1) + ybox=d(3) + zbox=d(6) + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + + attributes(global) subroutine pair_calculation( x,y,z,g,natoms,nframes,xbox,ybox,zbox,del,cut) + use cudafor + implicit none + real*4 :: x(:,:) + real*4 :: y(:,:) + real*4 :: z(:,:) + double precision,intent(inout) :: g(:) + integer, value :: nframes,natoms,ind + double precision, value :: xbox,ybox,zbox,del,cut + integer i,j,iconf + double precision dx,dy,dz,r,oldvalue + + i = (blockIdx%x-1)*blockDim%x+threadIdx%x + j = (blockIdx%y-1)*blockDim%y+threadIdx%y + + if ( i == 1 .and. j == 1) then + print *, natoms,nframes,xbox,ybox,zbox,del,cut, x(1,1), y(1,1), z(1,1), g(1) + end if + + do iconf=1,nframes + if(i<=natoms .and. j<=natoms) then + + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + if(r>>(x,y,z,g,natoms,nframes,xbox,ybox,zbox,del,cut) + + istat = cudaDeviceSynchronize() + if(istat /= 0) then + print *, "Error" + end if + !do iconf=1,nframes + ! do i=1,natoms + ! do j=1,natoms + ! if ( i == 1 .and. j == 1) then + ! print *, natoms,nframes,xbox,ybox,zbox,del,cut, x(1,1), y(1,1), z(1,1), g(1) + ! end if + + + ! dx=x(iconf,i)-x(iconf,j) + ! dy=y(iconf,i)-y(iconf,j) + ! dz=z(iconf,i)-z(iconf,j) + + ! dx=dx-nint(dx/xbox)*xbox + ! dy=dy-nint(dy/ybox)*ybox + ! dz=dz-nint(dz/zbox)*zbox + + ! r=dsqrt(dx**2+dy**2+dz**2) + ! ind=int(r/del)+1 + ! !if (ind.le.nbin) then + ! 
if(r +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +using namespace std; +//additional error handling code +static void HandleError(cudaError_t err, + const char *file, int line) { + if (err != cudaSuccess) { + printf( "%s in %s at line %d\n", cudaGetErrorString( err ), + file, line ); + exit( EXIT_FAILURE ); + } +} +#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ )) + +//declaration of GPU function +__global__ void pair_gpu(const double* d_x, const double* d_y, const double* d_z, + unsigned long long int *d_g2, int numatm, int nconf, + double xbox, double ybox, double zbox, int d_bin); + +int main(int argc , char* argv[]) +{ + double xbox,ybox,zbox; + double* d_x,*d_y,*d_z; + unsigned long long int *d_g2; + int nbin; + int device; + int numatm,nconf,inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + device = 0; + HANDLE_ERROR (cudaSetDevice(device));//pick the device to use + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if(!infile){ + cout<<"file "<nconf) cout << "nconf is reset to "<< nconf <>> + (d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, nbin); + + HANDLE_ERROR (cudaPeekAtLastError()); + HANDLE_ERROR(cudaDeviceSynchronize()); + + + nvtxRangePop(); //Pop for Pair Calculation + + double pi=acos(-1.0l); + double rho=(numatm)/(xbox*ybox*zbox); + double norm=(4.0l*pi*rho)/3.0l; + double rl,ru,nideal; + double g2[nbin]; + double r,gr,lngr,lngrbond,s2=0.0l,s2bond=0.0l; + double box=min(xbox,ybox); + box=min(box,zbox); + double del=box/(2.0l*nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i=0;i= numatm || id2 >= numatm) return; + if (id1 > id2) return; + + for (int frame = 0; frame < nconf; ++frame) { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) { + ig2 = (int)(r / del); + atomicAdd(&d_g2[ig2], 2); + } + } +} diff --git a/_basic/cuda/source_code/SOLUTION/rdf_unified_memory.f90 b/_basic/cuda/source_code/SOLUTION/rdf_unified_memory.f90 new file mode 100644 index 0000000..c5e1080 --- /dev/null +++ b/_basic/cuda/source_code/SOLUTION/rdf_unified_memory.f90 @@ -0,0 +1,176 @@ + +module readdata + contains + + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + use cudafor + implicit none + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, managed, allocatable :: x(:,:) + real*4, managed, allocatable :: y(:,:) + real*4, managed, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + 
xbox=d(1) + ybox=d(3) + zbox=d(6) + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + + attributes(global) subroutine pair_calculation( x,y,z,g,natoms,nframes,xbox,ybox,zbox,del,cut) + use cudafor + implicit none + real*4 :: x(:,:) + real*4 :: y(:,:) + real*4 :: z(:,:) + double precision,intent(inout) :: g(:) + integer, value :: nframes,natoms,ind + double precision, value :: xbox,ybox,zbox,del,cut + integer i,j,iconf + double precision dx,dy,dz,r,oldvalue + + i = (blockIdx%x-1)*blockDim%x+threadIdx%x + j = (blockIdx%y-1)*blockDim%y+threadIdx%y + + do iconf=1,nframes + if(i<=natoms .and. j<=natoms) then + + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + if(r>>(x,y,z,g,natoms,nframes,xbox,ybox,zbox,del,cut) + + istat = cudaDeviceSynchronize() + if(istat /= 0) then + print *, "Error" + end if + + s2=0.01d0 + s2bond=0.01d0 + const=(4.0d0/3.0d0)*pi*rho + + do i=1,nbin + rlower=dble((i-1)*del) + rupper=rlower+del + nideal=const*(rupper**3-rlower**3) + g(i)=g(i)/(dble(nframes)*dble(natoms)*nideal) + r=dble(i)*del + if (r.lt.2.0) then + gr=0.0 + else + gr=g(i) + endif + + if (gr.lt.1e-5) then + lngr=0.0 + else + lngr=dlog(gr) + endif + if (g(i).lt.1e-6) then + lngrbond=0.01 + else + lngrbond=dlog(g(i)) + endif + + s2=s2-2*pi*rho*((gr*lngr)-gr+1)*del*r**2.0 + s2bond=s2bond-2*pi*rho*((g(i)*lngrbond)-g(i)+1)*del*r*r + + + rf=dble(i-.5)*del + write(23,*) rf,g(i) + enddo + write(24,*)"s2 : ",s2 + write(24,*)"s2bond : ",s2bond + call cpu_time(finish) + print*,"starting at time",start,"and ending at",finish + stop + deallocate(x,y,z,g) +end diff --git a/_basic/cuda/source_code/dcdread.h b/_basic/cuda/source_code/dcdread.h new file mode 100644 index 0000000..66ddba0 --- /dev/null +++ b/_basic/cuda/source_code/dcdread.h @@ -0,0 +1,49 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+using namespace std; + +void dcdreadhead(int *natom, int *nframes, std::istream &infile) +{ + + infile.seekg(8, ios::beg); + infile.read((char *)nframes, sizeof(int)); + infile.seekg(64 * 4, ios::cur); + infile.read((char *)natom, sizeof(int)); + infile.seekg(1 * 8, ios::cur); + return; +} + +void dcdreadframe(double *x, double *y, double *z, std::istream &infile, + int natom, double &xbox, double &ybox, double &zbox) +{ + + double d[6]; + for (int i = 0; i < 6; i++) + { + infile.read((char *)&d[i], sizeof(double)); + } + xbox = d[0]; + ybox = d[2]; + zbox = d[5]; + float a, b, c; + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&a, sizeof(float)); + x[i] = a; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&b, sizeof(float)); + y[i] = b; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&c, sizeof(float)); + z[i] = c; + } + infile.seekg(1 * 8, ios::cur); + + return; +} diff --git a/_basic/cuda/source_code/rdf.cu b/_basic/cuda/source_code/rdf.cu new file mode 100644 index 0000000..ea884b5 --- /dev/null +++ b/_basic/cuda/source_code/rdf.cu @@ -0,0 +1,206 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +using namespace std; +//additional error handling code +static void HandleError(cudaError_t err, + const char *file, int line) { + if (err != cudaSuccess) { + printf( "%s in %s at line %d\n", cudaGetErrorString( err ), + file, line ); + exit( EXIT_FAILURE ); + } +} +#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ )) + +//declaration of GPU function +__global__ void pair_gpu(const double* d_x, const double* d_y, const double* d_z, + unsigned long long int *d_g2, int numatm, int nconf, + double xbox, double ybox, double zbox, int d_bin); + +int main(int argc , char* argv[]) +{ + double xbox,ybox,zbox; + double* d_x,*d_y,*d_z; + unsigned long long int *d_g2; + int nbin; + int device; + int numatm,nconf,inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + device = 0; + HANDLE_ERROR (cudaSetDevice(device));//pick the device to use + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if(!infile){ + cout<<"file "<nconf) cout << "nconf is reset to "<< nconf <>> + (, ,, , numatm, nconf, xbox, ybox, zbox, nbin); + + HANDLE_ERROR (cudaPeekAtLastError()); + HANDLE_ERROR(cudaDeviceSynchronize()); + + + nvtxRangePop(); //Pop for Pair Calculation + + double pi=acos(-1.0l); + double rho=(numatm)/(xbox*ybox*zbox); + double norm=(4.0l*pi*rho)/3.0l; + double rl,ru,nideal; + double g2[nbin]; + double r,gr,lngr,lngrbond,s2=0.0l,s2bond=0.0l; + double box=min(xbox,ybox); + box=min(box,zbox); + double del=box/(2.0l*nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i=0;i= numatm || id2 >= numatm) return; + if (id1 > id2) return; + + for (int frame = 0; frame < nconf; ++frame) { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) { + ig2 = (int)(r / del); + //Note: Usage of atomic function + atomicAdd(&d_g2[ig2], 
2); + } + } +} diff --git a/_basic/cuda/source_code/rdf.f90 b/_basic/cuda/source_code/rdf.f90 new file mode 100644 index 0000000..e0d4195 --- /dev/null +++ b/_basic/cuda/source_code/rdf.f90 @@ -0,0 +1,187 @@ + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + use cudafor + implicit none + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, managed, allocatable :: x(:,:) + real*4, managed, allocatable :: y(:,:) + real*4, managed, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + xbox=d(1) + ybox=d(3) + zbox=d(6) + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd +! Todo: Add global attribute + attributes() subroutine pair_calculation( x,y,z,g,natoms,nframes,xbox,ybox,zbox,del,cut) + use cudafor + implicit none + real*4 :: x(:,:) + real*4 :: y(:,:) + real*4 :: z(:,:) + double precision,intent(inout) :: g(:) + integer, value :: nframes,natoms,ind + double precision, value :: xbox,ybox,zbox,del,cut + integer i,j,iconf + double precision dx,dy,dz,r,oldvalue + !Todo: Add indexing + i = + j = + + do iconf=1,nframes + if(i<=natoms .and. j<=natoms) then + + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + if(r>>(x,y,z,g,natoms,nframes,xbox,ybox,zbox,del,cut) + + istat = cudaDeviceSynchronize() + if(istat /= 0) then + print *, "Error" + end if + call nvtxEndRange + + s2=0.01d0 + s2bond=0.01d0 + const=(4.0d0/3.0d0)*pi*rho + call nvtxStartRange("Entropy Calculation") + do i=1,nbin + rlower=dble((i-1)*del) + rupper=rlower+del + nideal=const*(rupper**3-rlower**3) + g(i)=g(i)/(dble(nframes)*dble(natoms)*nideal) + r=dble(i)*del + if (r.lt.2.0) then + gr=0.0 + else + gr=g(i) + endif + + if (gr.lt.1e-5) then + lngr=0.0 + else + lngr=dlog(gr) + endif + if (g(i).lt.1e-6) then + lngrbond=0.01 + else + lngrbond=dlog(g(i)) + endif + + s2=s2-2*pi*rho*((gr*lngr)-gr+1)*del*r**2.0 + s2bond=s2bond-2*pi*rho*((g(i)*lngrbond)-g(i)+1)*del*r*r + + rf=dble(i-.5)*del + write(23,*) rf,g(i) + enddo + write(24,*)"s2 : ",s2 + write(24,*)"s2bond : ",s2bond + call cpu_time(finish) + print*,"starting at time",start,"and ending at",finish + stop + call nvtxEndRange + + deallocate(x,y,z,g) +end + diff --git a/_basic/cuda/source_code/rdf_obsolete.cu b/_basic/cuda/source_code/rdf_obsolete.cu new file mode 100644 index 0000000..c064b74 --- /dev/null +++ b/_basic/cuda/source_code/rdf_obsolete.cu @@ -0,0 +1,259 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. +#include +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +using namespace std; +//Note: CUDA error handling code. 
This is required as many CUDA calls are async in nature +static void HandleError( cudaError_t err, + const char *file, + int line ) { + if (err != cudaSuccess) { + printf( "%s in %s at line %d\n", cudaGetErrorString( err ), + file, line ); + exit( EXIT_FAILURE ); + } +} +#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ )) + +//Note declaration of GPU function by using keyword __global__ +__global__ void pair_gpu(const double* d_x, const double* d_y, const double* d_z, unsigned long long int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin, unsigned long long int bl); + +int main(int argc , char* argv[] ) +{ + double xbox,ybox,zbox; + double* h_x,*h_y,*h_z; + double* d_x,*d_y,*d_z; + unsigned long long int *h_g2,*d_g2; + int nbin; + int nthreads,device; + int numatm,nconf,inconf; + unsigned long long int near2; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin=2000; + file = "../../_common/input/alk.traj.dcd"; + device = 0; + nthreads = 128; + HANDLE_ERROR (cudaSetDevice(device));//pick the device to use + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if(!infile){ + cout<<"file "<nconf) cout << "nconf is reset to "<< nconf <>> (, , , , numatm, nconf, xbox, ybox, zbox, nbin, bl); + + HANDLE_ERROR (cudaPeekAtLastError()); + HANDLE_ERROR(cudaDeviceSynchronize()); + } + + //Todo: Copy d_ge back from Device to Host + HANDLE_ERROR(cudaMemcpy(dest, source, , )); + + nvtxRangePop(); //Pop for Pair Calculation + + double pi=acos(-1.0l); + double rho=(numatm)/(xbox*ybox*zbox); + double norm=(4.0l*pi*rho)/3.0l; + double rl,ru,nideal; + double g2[nbin]; + double r,gr,lngr,lngrbond,s2=0.0l,s2bond=0.0l; + double box=min(xbox,ybox); + box=min(box,zbox); + double del=box/(2.0l*nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i=0;i +#include + +#define N 512 + +void host_add(int *a, int *b, int *c) { + for(int idx=0;idx>>(d_a,d_b,d_c); + + // Copy result back to host + cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); + + print_output(a,b,c); + + free(a); free(b); free(c); + cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); + + + + return 0; +} diff --git a/_basic/cuda/source_code/vector_addition_gpu_thread_block.cu b/_basic/cuda/source_code/vector_addition_gpu_thread_block.cu new file mode 100644 index 0000000..3ee8081 --- /dev/null +++ b/_basic/cuda/source_code/vector_addition_gpu_thread_block.cu @@ -0,0 +1,65 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. +#include +#include + +#define N 512 + +void host_add(int *a, int *b, int *c) { + for(int idx=0;idx>>(d_a,d_b,d_c); + + // Copy result back to host + cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); + + print_output(a,b,c); + + free(a); free(b); free(c); + cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); + + + + return 0; +} diff --git a/_basic/cuda/source_code/vector_addition_gpu_thread_only.cu b/_basic/cuda/source_code/vector_addition_gpu_thread_only.cu new file mode 100644 index 0000000..7d2cc89 --- /dev/null +++ b/_basic/cuda/source_code/vector_addition_gpu_thread_only.cu @@ -0,0 +1,62 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+#include +#include + +#define N 512 + +void host_add(int *a, int *b, int *c) { + for(int idx=0;idx>>(d_a,d_b,d_c); + + // Copy result back to host + cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); + + print_output(a,b,c); + + free(a); free(b); free(c); + cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); + + + + return 0; +} diff --git a/_basic/iso/Presentations/README.md b/_basic/iso/Presentations/README.md new file mode 100644 index 0000000..929ea89 --- /dev/null +++ b/_basic/iso/Presentations/README.md @@ -0,0 +1,5 @@ +For Partners who are interested in delivering the critical hands-on skills needed to advance science in form of Bootcamp can reach out to us at [Open Hackathons Partner](https://www.openhackathons.org/s/about-open-hackathons) website. In addition to current bootcamp material the Partners will be provided with the following: + +- Presentation: All the Bootcamps are accompanied with training material presentations which can be used during the Bootcamp session. +- Mini challenge : To test the knowledge gained during this Bootcamp a mini application challenge is provided along with sample Solution. +- Additional Support: On case to case basis the Partners can also be trained on how to effectively deliver the Bootcamp with maximal impact. \ No newline at end of file diff --git a/_basic/iso/jupyter_notebook/nways_iso.ipynb b/_basic/iso/jupyter_notebook/nways_iso.ipynb new file mode 100644 index 0000000..8631092 --- /dev/null +++ b/_basic/iso/jupyter_notebook/nways_iso.ipynb @@ -0,0 +1,606 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we begin, let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning objectives\n", + "The **goal** of this lab is to:\n", + "\n", + "- Learn how to run the same code on both a multicore CPU and a GPU using standard language parallelism\n", + "- Understand steps required to make a sequential code parallel using ISO Fortran and ISO C++\n", + "\n", + "We do not intend to cover:\n", + "- Detailed optimization techniques and mapping of standard constructs to CUDA C\n", + "\n", + "\n", + "
\n", + " ISO C++\n", + " \n", + "# Standard Template Library (STL)\n", + " \n", + "If you are not familiar with STL (Standard Template Library), this section will give you a brief introduction that would be required to understand the usage of STL library for our code.\n", + "\n", + "The C++ STL (Standard Template Library) is a powerful set of C++ template classes to provide general-purpose classes and functions with templates that implement many popular and commonly used algorithms and data structures like vectors, lists, queues, and stacks.\n", + "\n", + "At the core of the C++ Standard Template Library are following three well-structured components \n", + "\n", + "- Containers: Containers are used to manage collections of objects of a certain kind. There are several different types of containers like dequeue, list, vector, map etc.\n", + "\n", + "- Algorithms: Algorithms act on containers. They provide the means by which you will perform initialization, sorting, searching, and transforming of the contents of containers.\n", + "\n", + "- Iterators: Iterators are used to step through the elements of collections of objects. These collections may be containers or subsets of containers.\n", + "\n", + "For our code to make *Pair Calculation* we will be making use of ```vector``` container. The example below will introduce you to the container and how to use iterator to step through elements of vector. ```vector``` container (a C++ Standard Template) which is similar to an array with an exception that it automatically handles its own storage requirements in case it grows\n", + "\n", + "For our code we will be making use of ```std::for_each``` algorithm and its sample usage is also shown in code below:\n", + "\n", + "```cpp\n", + "#include \n", + "#include \n", + "#include \n", + " \n", + "//Using functor\n", + "struct Sum\n", + "{\n", + " void operator()(int n) { sum += n; }\n", + " int sum{0};\n", + "};\n", + " \n", + "int main()\n", + "{\n", + " std::vector nums{3, 4, 2, 8, 15, 267};\n", + " \n", + " auto print = [](const int& n) { std::cout << \" \" << n; };\n", + " \n", + " std::cout << \"before:\";\n", + " std::for_each(nums.cbegin(), nums.cend(), print);\n", + " std::cout << '\\n';\n", + " \n", + " std::for_each(nums.begin(), nums.end(), [](int &n){ n++; });\n", + " \n", + " // calls Sum::operator() for each number\n", + " Sum s = std::for_each(nums.begin(), nums.end(), Sum());\n", + " \n", + " std::cout << \"after: \";\n", + " std::for_each(nums.cbegin(), nums.cend(), print);\n", + " std::cout << '\\n';\n", + " std::cout << \"sum: \" << s.sum << '\\n';\n", + "}\n", + "```\n", + "\n", + "To learn more about STL you can read and execute sample codes [here](https://www.tutorialspoint.com/cplusplus/cpp_stl_tutorial.htm).\n", + "\n", + "\n", + "# Parallel STL\n", + "Starting with C++17, parallelism has become an integral part of the standard itself. Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, commonly called C++17.\n", + "\n", + "C++17 Parallel Standard Library (stdpar) introduces parallel and vector concurrency for standard algorithms. It is important to note that stdpar is a library and not a language extension.\n", + "\n", + "\n", + "## `std::par` Execution Policies\n", + "\n", + "\n", + "Execution Policies define the kind of parallelism that will be applied to parallel algorithms. Most standard algorithms included in STL support execution policies. 
Defined below are the execution policies:\n", + "\n", + "- `std::execution::seq` = sequential\n", + " - This execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm’s execution may not be parallelized.\n", + "- `std::execution::par` = parallel\n", + " - This execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized\n", + "- `std::execution::par_unseq` = parallel + vectorized\n", + " - This execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized and vectorized\n", + "\n", + "Implementation of execution policies is provided by different compilers from specific vendors. For GPU parallel execution policy we will be making use of NVIDIA compiler. \n", + "\n", + "\n", + "## Historical Perspective\n", + "\n", + "Changes to how the call to _stl_ algorithms changed the new version of C++ standard to incorporate execution policies:\n", + "\n", + "**C++98:** \n", + "```cpp\n", + "std::sort(c.begin(), c.end()); \n", + "```\n", + "**C++17:** \n", + "```cpp\n", + "std::sort(std::execution::par, c.begin(), c.end());\n", + "```\n", + "\n", + "We will be using the NVIDIA HPC C++ compiler, NVC++. It supports C++17, C++ Standard Parallelism (stdpar) for NVIDIA GPUs, OpenACC for multicore CPUs and NVIDIA GPUs, and OpenMP for multicore CPUs. No language extensions or non-standard libraries are required to enable GPU acceleration. All data movement between host memory and GPU device memory is performed implicitly and automatically under the control of CUDA [Unified Memory](../GPU_Architecture_Terminologies.ipynb), which means that heap memory is automatically shared between a CPU(Host) and GPU(Device). Stack memory and global memory are not shared. Below given example shows the right allocation and usage of the stdpar.\n", + "\n", + "```cpp\n", + "std::vector v = ...;\n", + "std::sort(std::execution::par, v.begin(), v.end()); // OK, vector allocates on heap\n", + "\n", + "std::array a = ...;\n", + "std::sort(std::execution::par, a.begin(), a.end()); // Fails, array stored on the stack\n", + "```\n", + "\n", + "For our code we will be making use of ```std::for_each``` algorithm with support for ```std::execution::par``` execution policy\n", + "\n", + "**Counting Iterator**: In our code we will also be using a special iterator ```counting_iterator```. This iterator which represents a pointer into a range of sequentially changing values. This iterator is useful for creating a range filled with a sequence without explicitly storing it in memory. Using ```counting_iterator``` saves memory capacity and bandwidth\n", + "
\n", + "
\n", + " \n", + "
\n", + " ISO Fortran\n", + " \n", + "# Fortran Standard Parallelism\n", + "\n", + "ISO Standard Fortran 2008 introduced the DO CONCURRENT construct to allow you to express loop-level parallelism, one of the various mechanisms for expressing parallelism directly in the Fortran language. \n", + "\n", + "Fortran developers have been able to accelerate their programs using CUDA Fortran, OpenACC or OpenMP. Now with the support of DO CONCURRENT on GPU with NVIDIA HPC SDK, compiler automatically accelerates loops using DO CONCURRENT, allowing developers to get the benefit of acclerating on NVIDIA GPUs using ISO Standard Fortran without any extensions, directives, or non-standard libraries. You can now write standard Fortran, remaining fully portable to other compilers and systems, and still benefit from the full power of NVIDIA GPUs\n", + "\n", + "For our code to make *Pair Calculation* all that’s required is expressing loops with DO CONCURRENT. The example below will introduce you to the syntax of DO CONCURRENT \n", + "\n", + "Sample vector addition codeis shown in code below:\n", + "\n", + "```fortran\n", + " subroutine vec_addition(x,y,n)\n", + " real :: x(:), y(:)\n", + " integer :: n, i \n", + " do i = 1, n \n", + " y(i) = x(i)+y(i)\n", + " enddo \n", + " end subroutine vec_addition\n", + "```\n", + "\n", + "In order to make use of ISO Fortran DO CONCURRENT we need to replace the `do` loop with `do concurrent` as shown in code below\n", + "\n", + "```fortran\n", + " subroutine vec_addition(x,y,n)\n", + " real :: x(:), y(:)\n", + " integer :: n, i \n", + " do concurrent (i = 1: n) \n", + " y(i) = x(i)+y(i)\n", + " enddo \n", + " end subroutine vec_addition\n", + "```\n", + "\n", + "By changing the DO loop to DO CONCURRENT, you are telling the compiler that there are no data dependencies between the n loop iterations. This leaves the compiler free to generate instructions that the iterations can be executed in any order and simultaneously. The compiler parallelizes the loop even if there are data dependencies, resulting in race conditions and likely incorrect results. It’s your responsibility to ensure that the loop is safe to be parallelized.\n", + "\n", + "## Nested Loop Parallelism\n", + "\n", + "Nested loops are a common code pattern encountered in HPC applications. A simple example might look like the following:\n", + "\n", + "```fortran\n", + " do i=2, n-1\n", + " do j=2, m-1\n", + " a(i,j) = w0 * b(i,j) \n", + " enddo\n", + " enddo\n", + "```\n", + "\n", + "It is straightforward to write such patterns with a single DO CONCURRENT statement, as in the following example. It is easier to read, and the compiler has more information available for optimization.\n", + "\n", + "```fortran\n", + " do concurrent(i=2 : n-1, j=2 : m-1)\n", + " a(i,j) = w0 * b(i,j) \n", + " enddo\n", + "```\n", + "
\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, lets start modifying the original code and add the necessary changes to parallelise the code. Click on the [C++ version](../source_code/rdf.cpp) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after changes, before running below cells." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compile and Run for Multicore\n", + "\n", + "Now, let's compile the code. We will be using NVIDIA HPC SDK for this exercise. The flags used for enabling standard parallelism for target offloading are as follows:\n", + "\n", + "- `-stdpar` : This flag enables standard parallelism for the target architecture\n", + "- `-stdpar=multicore` will allow us to compile our code for a multicore\n", + "- `-stdpar` will allow us to compile our code for a NVIDIA GPU (Default is NVIDIA)\n", + "\n", + "After running the cells, you can inspect part of the compiler feedback for C++ or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for multicore (C++)\n", + "!cd ../source_code && echo \"compiling C++ version .. \" && nvc++ -std=c++17 -stdpar=multicore -Minfo \\\n", + "-I/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/include \\\n", + "-o rdf_c rdf.cpp -fopenmp \\\n", + "-L/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/lib64 -lnvToolsExt && ./rdf_c && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback: C++ code\n", + "The output should be the following:\n", + "\n", + " \n", + "```\n", + " \n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + " \n", + "```\n", + " \n", + "and the compiler feedback would look similar as below:\n", + " \n", + "```\n", + "main:\n", + " 79, stdpar: Generating Multicore code\n", + " 79, std::fill with std::execution::par policy parallelized on CPU\n", + "pair_gpu(double *, double *, double *, std::atomic *, int, int, double, double, double, int):\n", + " 191, stdpar: Generating Multicore code\n", + " 191, std::for_each with std::execution::par policy parallelized on CPU\n", + "main:\n", + " 117, FMA (fused multiply-add) instruction(s) generated\n", + " 146, FMA (fused multiply-add) instruction(s) generated\n", + " 147, FMA (fused multiply-add) instruction(s) generated\n", + "```\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for multicore (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -stdpar=multicore -Minfo -o rdf_f rdf.f90 -lnvhpcwrapnvtx && ./rdf_f && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note:** Since we are targeting the NVTX v3 API, a header-only C library, and added Fortran-callable wrappers to the code, we add `-lnvhpcwrapnvtx` at the compile time to do the link to the library.\n", + "\n", + "
\n", + " Compiler Feedback: Fortran code\n", + "The output should be the following:\n", + "\n", + "```\n", + " \n", + "s2 : -2.452690945278331 \n", + "s2bond : -24.37502820694527 \n", + " \n", + "```\n", + "and the compiler feedback would look similar as below:\n", + " \n", + "```\n", + "rdf:\n", + " 80, Memory zero idiom, loop replaced by call to __c_mzero8\n", + " 92, Generating Multicore code\n", + " 92, Loop parallelized across CPU threads\n", + "```\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (C++ version)\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_stdpar_multicore ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (Fortran version)\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_doconcurrent_multicore ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's checkout the profiler's report. Download and save the report file by holding down Shift and Right-Clicking the [C++ version](../source_code/rdf_stdpar_multicore.nsys-rep) or the [Fortran version](../source_code/rdf_doconcurrent_multicore.nsys-rep) and choosing save Link As Once done, open it via the GUI and have a look at the example expected profiler report below:\n", + "\n", + "
\n", + " Example screenshot (C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "### Compile and run for NVIDIA GPU\n", + "\n", + "Without changing the code now let us try to recompile the code for NVIDIA GPU and rerun. GPU acceleration of standard parallel algorithms is enabled with the `-⁠stdpar` command-line option when using NVIDIA HPC Fortran or C++ compiler. If `-⁠stdpar `is specified, almost all algorithms that use a parallel execution policy are compiled for offloading to run in parallel on an NVIDIA GPU.\n", + "\n", + "**Understand and analyze** the solutions for [C++](../source_code/SOLUTION/rdf.cpp) and [Fortran](../source_code/SOLUTION/rdf.f90) versions and compare with your versions. Once done, compile your code by running below cells." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (ISO C++)\n", + "!cd ../source_code && echo \"compiling C++ version .. \" && nvc++ -std=c++17 -DUSE_COUNTING_ITERATOR -stdpar=gpu -Minfo -o rdf_c rdf.cpp " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback: C++ code\n", + "The output should be the following:\n", + "\n", + " \n", + "```\n", + " \n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + " \n", + "```\n", + " \n", + "and the compiler feedback would look similar as below:\n", + " \n", + "```\n", + "main:\n", + " 79, stdpar: Generating NVIDIA GPU code\n", + " 79, std::fill with std::execution::par policy parallelized on GPU\n", + "pair_gpu(double *, double *, double *, std::atomic *, int, int, double, double, double, int):\n", + " 188, stdpar: Generating NVIDIA GPU code\n", + " 188, std::for_each with std::execution::par policy parallelized on GPU\n", + "main:\n", + " 16, include \"nvToolsExt.h\"\n", + " 1494, include \"nvtxImpl.h\"\n", + " 119, FMA (fused multiply-add) instruction(s) generated\n", + " 148, FMA (fused multiply-add) instruction(s) generated\n", + " 149, FMA (fused multiply-add) instruction(s) generated\n", + "pair_gpu(double *, double *, double *, std::atomic *, int, int, double, double, double, int)::[lambda(unsigned int) (instance 1)]::operator ()(unsigned int) const:\n", + " 16, include \"nvToolsExt.h\"\n", + " 1494, include \"nvtxImpl.h\"\n", + " 200, FMA (fused multiply-add) instruction(s) generated\n", + " 201, FMA (fused multiply-add) instruction(s) generated\n", + " 202, FMA (fused multiply-add) instruction(s) generated\n", + " 204, FMA (fused multiply-add) instruction(s) generated\n", + "```\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (ISO Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -stdpar=gpu -Minfo -acc -o rdf_f rdf.f90 -lnvhpcwrapnvtx" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note:** Since we are targeting the NVTX v3 API, a header-only C library, and added Fortran-callable wrappers to the code, we add `-lnvhpcwrapnvtx` at the compile time to do the link to the library.\n", + "\n", + "
\n", + " Compiler Feedback: Fortran code\n", + "The output should be the following:\n", + "\n", + "```\n", + " \n", + "s2 : -2.452690945278331 \n", + "s2bond : -24.37502820694527 \n", + " \n", + "```\n", + "and the compiler feedback would look similar as below:\n", + " \n", + "```rdf:\n", + " 80, Memory zero idiom, loop replaced by call to __c_mzero8\n", + " 92, Generating Tesla code\n", + " 92, Loop parallelized across CUDA thread blocks, CUDA threads(4) blockidx%y threadidx%y\n", + " Loop parallelized across CUDA thread blocks, CUDA threads(32) ! blockidx%x threadidx%x\n", + "```\n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Make sure to validate the output by running the executable and validate the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (ISO C++ version)\n", + "!cd ../source_code && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_stdpar_gpu ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (ISO Fortran version)\n", + "!cd ../source_code && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_doconcurrent_gpu ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's checkout the profiler's report. Download and save the report file by holding down Shift and Right-Clicking the [C++ version](../source_code/rdf_stdpar_gpu.nsys-rep) or the [Fortran version](../source_code/rdf_doconcurrent_gpu.nsys-rep) and choosing save Link As Once done, open it via the GUI and have a look at the example expected profiler report below:\n", + "\n", + "\n", + "
\n", + " Example screenshot (ISO C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "
\n", + " Example screenshot (ISO Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "If you inspect the output of the profiler closer, you can see the *Unified Memory* usage. Moreover, if you compare the NVTX marker `Pair_Calculation` (from the NVTX row) in both multicore and GPU version, you can see how much improvement you achieved. \n", + "\n", + "Feel free to checkout the solutions for [ISO C++](../source_code/SOLUTION/rdf.cpp) and [ISO Fortran](../source_code/SOLUTION/rdf.f90) versions to help you understand better." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# stdpar Analysis\n", + "\n", + "**Usage Scenarios**\n", + "- stdpar is part of the standard language and provides a good start for accelerating code on accelerators like GPU and multicores.\n", + "- DO-CONCURRENT is part of the standard language and provides a good start for accelerating code on accelerators like GPU and multicores.\n", + "\n", + "**Limitations/Constraints**\n", + "1. It is key to understand that it is not an alternative to CUDA. *std:par* and *DO CONCURRENT* provide highest portability and can be seen as the first step to porting on GPU. The general abstraction limits the optimization functionalities. For example, the implementations are currently dependent on Unified memory. Moreover, one does not have control over thread management and that will limit the performance improvement.\n", + "2. C++ constructs can only be used in the code using C++17 features and may not work for legacy codes.\n", + "\n", + "**Which Compilers Support stdpar on GPUs and Multicore?**\n", + "1. NVIDIA GPU: As of Jan 2021, the HPC SDK compiler from NVIDIA supports std::par and DO-CONCURRENT on NVIDIA GPU.\n", + "2. x86 Multicore: \n", + " - stdpar: GCC has an implementation on a multicore CPU which is based on Intel TBB in the backend\n", + " - DO CONCURRENT: Other compilers like intel compiler has an implementation on a multicore CPU" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, it is recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "cd ..\n",
+    "rm -f _files.zip\n",
+    "zip -r _files.zip *"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and right-clicking [Here](../_files.zip) and choosing *Save Link As*.\n",
+    "\n",
+    "\n",
+    "\n",
+    "# Links and Resources\n",
+    "[Blog post on Developing Accelerated Code with Standard Language Parallelism](https://developer.nvidia.com/blog/developing-accelerated-code-with-standard-language-parallelism/)\n",
+    "\n",
+    "[Blog post on Accelerating Standard C++ with GPUs Using stdpar](https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/)\n",
+    "\n",
+    "[Blog post on Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK](https://developer.nvidia.com/blog/accelerating-fortran-do-concurrent-with-gpus-and-the-nvidia-hpc-sdk/)\n",
+    "\n",
+    "[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)\n",
+    "\n",
+    "\n",
+    "**NOTE**: To be able to view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).\n",
+    "\n",
+    "Don't forget to check out the additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n",
+    "\n",
+    "--- "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/iso/source_code/Pair_entropy.dat b/_basic/iso/source_code/Pair_entropy.dat new file mode 100644 index 0000000..0998d26 --- /dev/null +++ b/_basic/iso/source_code/Pair_entropy.dat @@ -0,0 +1,2 @@ +s2 value is -2.43191 +s2bond value is -3.87014 diff --git a/_basic/iso/source_code/RDF.dat b/_basic/iso/source_code/RDF.dat new file mode 100644 index 0000000..ea809b6 --- /dev/null +++ b/_basic/iso/source_code/RDF.dat @@ -0,0 +1,2000 @@ +0.00761401 2.27364e+06 +0.022842 0 +0.03807 0 +0.0532981 0 +0.0685261 0 +0.0837541 0 +0.0989821 0 +0.11421 0 +0.129438 0 +0.144666 0 +0.159894 0 +0.175122 0 +0.19035 0 +0.205578 0 +0.220806 0 +0.236034 0 +0.251262 0 +0.26649 0 +0.281718 0 +0.296946 0 +0.312174 0 +0.327402 0 +0.34263 0 +0.357858 0 +0.373086 0 +0.388314 0 +0.403543 0 +0.418771 0 +0.433999 0 +0.449227 0 +0.464455 0 +0.479683 0 +0.494911 0 +0.510139 0 +0.525367 0 +0.540595 0 +0.555823 0 +0.571051 0 +0.586279 0 +0.601507 0 +0.616735 0 +0.631963 0 +0.647191 0 +0.662419 0 +0.677647 0 +0.692875 0 +0.708103 0 +0.723331 0 +0.738559 0 +0.753787 0 +0.769015 0 +0.784243 0 +0.799471 0 +0.814699 0 +0.829927 0 +0.845155 0 +0.860383 0 +0.875611 0 +0.890839 0 +0.906067 0 +0.921295 0 +0.936523 0 +0.951751 0 +0.966979 0 +0.982207 0 +0.997435 0 +1.01266 0 +1.02789 0 +1.04312 0 +1.05835 0 +1.07358 0 +1.0888 0 +1.10403 0 +1.11926 0 +1.13449 0 +1.14972 0 +1.16494 0 +1.18017 0 +1.1954 0 +1.21063 0 +1.22586 0 +1.24108 0 +1.25631 0 +1.27154 0 +1.28677 0 +1.302 0 +1.31722 0 +1.33245 0 +1.34768 0 +1.36291 0 +1.37814 0 +1.39336 0 +1.40859 0.015817 +1.42382 0.0490216 +1.43905 0.272783 +1.45428 0.976895 +1.4695 2.98168 +1.48473 6.70771 +1.49996 12.3749 +1.51519 18.7254 +1.53042 23.1225 +1.54564 22.5968 +1.56087 18.7488 +1.5761 12.2483 +1.59133 6.82648 +1.60656 2.88983 +1.62178 1.03012 +1.63701 0.322052 +1.65224 0.0498164 +1.66747 0.0112871 +1.6827 0.00184729 +1.69792 0 +1.71315 0 +1.72838 0 +1.74361 0 +1.75884 0 +1.77406 0 +1.78929 0 +1.80452 0 +1.81975 0 +1.83498 0 +1.8502 0 +1.86543 0 +1.88066 0 +1.89589 0 +1.91112 0 +1.92634 0 +1.94157 0 +1.9568 0 +1.97203 0 +1.98726 0 +2.00248 0 +2.01771 0 +2.03294 0 +2.04817 0 +2.0634 0 +2.07862 0 +2.09385 0 +2.10908 0 +2.12431 0 +2.13954 0 +2.15476 0 +2.16999 0 +2.18522 0 +2.20045 0 +2.21568 0 +2.2309 0 +2.24613 0 +2.26136 0 +2.27659 0.0030276 +2.29182 0.00199167 +2.30704 0.00294819 +2.32227 0.00969885 +2.3375 0.0124448 +2.35273 0.0349627 +2.36796 0.0522381 +2.38319 0.0976197 +2.39841 0.160034 +2.41364 0.269354 +2.42887 0.425579 +2.4441 0.670716 +2.45933 0.911498 +2.47455 1.21124 +2.48978 1.65042 +2.50501 2.03301 +2.52024 2.55944 +2.53547 2.96246 +2.55069 3.34284 +2.56592 3.37557 +2.58115 3.46854 +2.59638 3.22468 +2.61161 2.9924 +2.62683 2.6076 +2.64206 2.14228 +2.65729 1.75705 +2.67252 1.3138 +2.68775 0.920997 +2.70297 0.61569 +2.7182 0.445989 +2.73343 0.277221 +2.74866 0.151618 +2.76389 0.087643 +2.77911 0.0650139 +2.79434 0.0515797 +2.80957 0.0404201 +2.8248 0.044574 +2.84003 0.0460427 +2.85525 0.05646 +2.87048 0.0628453 +2.88571 0.0722337 +2.90094 0.0795574 +2.91617 0.0897999 +2.93139 0.0967824 +2.94662 0.110243 +2.96185 0.114478 +2.97708 
0.122753 +2.99231 0.141952 +3.00753 0.146879 +3.02276 0.174026 +3.03799 0.179086 +3.05322 0.177865 +3.06845 0.187215 +3.08367 0.187021 +3.0989 0.199348 +3.11413 0.201179 +3.12936 0.224864 +3.14459 0.234857 +3.15981 0.195928 +3.17504 0.200279 +3.19027 0.21122 +3.2055 0.206672 +3.22073 0.18657 +3.23595 0.185318 +3.25118 0.179132 +3.26641 0.175995 +3.28164 0.165137 +3.29687 0.157359 +3.31209 0.14924 +3.32732 0.152602 +3.34255 0.140915 +3.35778 0.134073 +3.37301 0.14206 +3.38823 0.123472 +3.40346 0.13682 +3.41869 0.12889 +3.43392 0.132186 +3.44915 0.128383 +3.46437 0.140767 +3.4796 0.150338 +3.49483 0.166589 +3.51006 0.163873 +3.52529 0.16709 +3.54051 0.18944 +3.55574 0.199404 +3.57097 0.220267 +3.5862 0.226128 +3.60143 0.256885 +3.61665 0.285117 +3.63188 0.294627 +3.64711 0.314193 +3.66234 0.337714 +3.67757 0.317906 +3.69279 0.367454 +3.70802 0.402104 +3.72325 0.410141 +3.73848 0.459575 +3.75371 0.471445 +3.76893 0.532818 +3.78416 0.551185 +3.79939 0.61236 +3.81462 0.708128 +3.82985 0.8209 +3.84507 0.899673 +3.8603 1.03509 +3.87553 1.12309 +3.89076 1.25771 +3.90599 1.32643 +3.92122 1.42024 +3.93644 1.44843 +3.95167 1.43294 +3.9669 1.43725 +3.98213 1.39131 +3.99736 1.28907 +4.01258 1.21239 +4.02781 1.13327 +4.04304 1.04412 +4.05827 0.933079 +4.0735 0.908779 +4.08872 0.886065 +4.10395 0.854347 +4.11918 0.791012 +4.13441 0.809676 +4.14964 0.824096 +4.16486 0.817478 +4.18009 0.852244 +4.19532 0.826752 +4.21055 0.882149 +4.22578 0.871116 +4.241 0.886682 +4.25623 0.897962 +4.27146 0.886124 +4.28669 0.937907 +4.30192 0.93891 +4.31714 0.948575 +4.33237 0.958081 +4.3476 0.954426 +4.36283 1.01977 +4.37806 1.0075 +4.39328 1.02357 +4.40851 1.04907 +4.42374 1.08543 +4.43897 1.05384 +4.4542 1.09278 +4.46942 1.08954 +4.48465 1.10686 +4.49988 1.15182 +4.51511 1.14201 +4.53034 1.15983 +4.54556 1.21004 +4.56079 1.16451 +4.57602 1.14853 +4.59125 1.17492 +4.60648 1.20512 +4.6217 1.18421 +4.63693 1.16939 +4.65216 1.17456 +4.66739 1.20412 +4.68262 1.20966 +4.69784 1.19235 +4.71307 1.16135 +4.7283 1.20676 +4.74353 1.18414 +4.75876 1.14747 +4.77398 1.20006 +4.78921 1.18652 +4.80444 1.14887 +4.81967 1.12856 +4.8349 1.16577 +4.85012 1.17247 +4.86535 1.18259 +4.88058 1.16666 +4.89581 1.18255 +4.91104 1.15506 +4.92626 1.17788 +4.94149 1.18884 +4.95672 1.21348 +4.97195 1.25832 +4.98718 1.23719 +5.0024 1.28005 +5.01763 1.30657 +5.03286 1.30136 +5.04809 1.32287 +5.06332 1.35981 +5.07854 1.37905 +5.09377 1.37706 +5.109 1.37668 +5.12423 1.40795 +5.13946 1.37903 +5.15468 1.38684 +5.16991 1.39805 +5.18514 1.37254 +5.20037 1.37805 +5.2156 1.31291 +5.23082 1.33662 +5.24605 1.29124 +5.26128 1.26413 +5.27651 1.24613 +5.29174 1.22851 +5.30696 1.22742 +5.32219 1.24201 +5.33742 1.21069 +5.35265 1.19761 +5.36788 1.1861 +5.3831 1.21622 +5.39833 1.21314 +5.41356 1.17866 +5.42879 1.20542 +5.44402 1.20081 +5.45925 1.17165 +5.47447 1.18033 +5.4897 1.18507 +5.50493 1.17473 +5.52016 1.17255 +5.53539 1.18727 +5.55061 1.17567 +5.56584 1.17516 +5.58107 1.16607 +5.5963 1.15188 +5.61153 1.15743 +5.62675 1.15927 +5.64198 1.15631 +5.65721 1.13522 +5.67244 1.16262 +5.68767 1.14039 +5.70289 1.14959 +5.71812 1.14235 +5.73335 1.11688 +5.74858 1.14009 +5.76381 1.12526 +5.77903 1.12106 +5.79426 1.09929 +5.80949 1.1143 +5.82472 1.09692 +5.83995 1.1162 +5.85517 1.12154 +5.8704 1.10374 +5.88563 1.09139 +5.90086 1.09283 +5.91609 1.07704 +5.93131 1.09308 +5.94654 1.08128 +5.96177 1.08003 +5.977 1.06941 +5.99223 1.09034 +6.00745 1.05511 +6.02268 1.03392 +6.03791 1.09069 +6.05314 1.0578 +6.06837 1.05833 +6.08359 1.06067 +6.09882 1.07141 +6.11405 1.08944 
+6.12928 1.01957 +6.14451 1.03711 +6.15973 1.03791 +6.17496 1.04213 +6.19019 1.03319 +6.20542 1.03763 +6.22065 1.01133 +6.23587 1.02833 +6.2511 1.04327 +6.26633 1.02901 +6.28156 1.03675 +6.29679 1.01697 +6.31201 1.00327 +6.32724 1.02745 +6.34247 1.0159 +6.3577 1.02553 +6.37293 1.03145 +6.38815 1.03436 +6.40338 1.05381 +6.41861 1.04716 +6.43384 1.03905 +6.44907 1.0334 +6.46429 1.02941 +6.47952 1.03156 +6.49475 1.03193 +6.50998 1.02662 +6.52521 1.02023 +6.54043 1.01708 +6.55566 0.998726 +6.57089 0.990953 +6.58612 0.972268 +6.60135 0.965387 +6.61657 0.9901 +6.6318 0.967006 +6.64703 0.971459 +6.66226 0.952646 +6.67749 0.956869 +6.69271 0.94131 +6.70794 0.947619 +6.72317 0.95768 +6.7384 0.950131 +6.75363 0.956057 +6.76885 0.939089 +6.78408 0.935446 +6.79931 0.956264 +6.81454 0.940506 +6.82977 0.943382 +6.84499 0.924565 +6.86022 0.948027 +6.87545 0.930665 +6.89068 0.965112 +6.90591 0.932345 +6.92113 0.933379 +6.93636 0.91787 +6.95159 0.959205 +6.96682 0.931632 +6.98205 0.951177 +6.99728 0.925249 +7.0125 0.933041 +7.02773 0.916823 +7.04296 0.936061 +7.05819 0.929926 +7.07342 0.911918 +7.08864 0.910086 +7.10387 0.916035 +7.1191 0.892408 +7.13433 0.918404 +7.14956 0.913575 +7.16478 0.915096 +7.18001 0.909697 +7.19524 0.919288 +7.21047 0.908366 +7.2257 0.906946 +7.24092 0.909819 +7.25615 0.892395 +7.27138 0.900532 +7.28661 0.90091 +7.30184 0.881557 +7.31706 0.901241 +7.33229 0.921629 +7.34752 0.911516 +7.36275 0.898872 +7.37798 0.885941 +7.3932 0.904305 +7.40843 0.904212 +7.42366 0.890826 +7.43889 0.898525 +7.45412 0.908131 +7.46934 0.893931 +7.48457 0.902622 +7.4998 0.900913 +7.51503 0.891802 +7.53026 0.9072 +7.54548 0.90538 +7.56071 0.916834 +7.57594 0.90978 +7.59117 0.902775 +7.6064 0.915618 +7.62162 0.921957 +7.63685 0.930392 +7.65208 0.914901 +7.66731 0.928264 +7.68254 0.945414 +7.69776 0.933115 +7.71299 0.928379 +7.72822 0.930241 +7.74345 0.926761 +7.75868 0.913742 +7.7739 0.920898 +7.78913 0.918335 +7.80436 0.918705 +7.81959 0.926679 +7.83482 0.91882 +7.85004 0.924425 +7.86527 0.92347 +7.8805 0.929338 +7.89573 0.920387 +7.91096 0.910245 +7.92618 0.929063 +7.94141 0.915551 +7.95664 0.929648 +7.97187 0.909968 +7.9871 0.924868 +8.00232 0.938585 +8.01755 0.938848 +8.03278 0.92832 +8.04801 0.933774 +8.06324 0.943364 +8.07846 0.940132 +8.09369 0.93021 +8.10892 0.943742 +8.12415 0.946944 +8.13938 0.948457 +8.1546 0.950266 +8.16983 0.951507 +8.18506 0.940788 +8.20029 0.95192 +8.21552 0.951572 +8.23074 0.951606 +8.24597 0.948094 +8.2612 0.94169 +8.27643 0.971444 +8.29166 0.972215 +8.30688 0.966304 +8.32211 0.96179 +8.33734 0.95399 +8.35257 0.967459 +8.3678 0.956247 +8.38302 0.973691 +8.39825 0.967715 +8.41348 0.967023 +8.42871 0.989375 +8.44394 0.959033 +8.45916 0.992278 +8.47439 0.964024 +8.48962 0.973777 +8.50485 0.973258 +8.52008 0.971151 +8.5353 0.98413 +8.55053 0.978553 +8.56576 1.00359 +8.58099 0.997689 +8.59622 0.989557 +8.61145 0.987965 +8.62667 0.995936 +8.6419 0.99299 +8.65713 0.997037 +8.67236 0.998128 +8.68759 0.990266 +8.70281 0.995851 +8.71804 1.00834 +8.73327 1.01223 +8.7485 1.00208 +8.76373 0.998809 +8.77895 1.01231 +8.79418 1.00293 +8.80941 1.00526 +8.82464 1.003 +8.83987 1.02445 +8.85509 1.01159 +8.87032 1.01443 +8.88555 1.01672 +8.90078 1.01338 +8.91601 1.02644 +8.93123 1.00937 +8.94646 1.02096 +8.96169 1.02023 +8.97692 1.02391 +8.99215 1.03992 +9.00737 1.02596 +9.0226 1.03413 +9.03783 1.01874 +9.05306 1.01531 +9.06829 1.03073 +9.08351 1.03628 +9.09874 1.02732 +9.11397 1.0465 +9.1292 1.02776 +9.14443 1.02446 +9.15965 1.0465 +9.17488 1.03725 +9.19011 1.03839 +9.20534 1.02212 +9.22057 
1.02933 +9.23579 1.04286 +9.25102 1.03864 +9.26625 1.0345 +9.28148 1.03948 +9.29671 1.03959 +9.31193 1.03402 +9.32716 1.0339 +9.34239 1.04221 +9.35762 1.04486 +9.37285 1.05873 +9.38807 1.04438 +9.4033 1.03402 +9.41853 1.04866 +9.43376 1.04222 +9.44899 1.04402 +9.46421 1.04534 +9.47944 1.03581 +9.49467 1.04015 +9.5099 1.01982 +9.52513 1.05334 +9.54035 1.04228 +9.55558 1.04641 +9.57081 1.03052 +9.58604 1.04205 +9.60127 1.03761 +9.61649 1.05254 +9.63172 1.04555 +9.64695 1.04298 +9.66218 1.04681 +9.67741 1.04089 +9.69263 1.04815 +9.70786 1.04381 +9.72309 1.03833 +9.73832 1.04771 +9.75355 1.03961 +9.76877 1.03818 +9.784 1.0415 +9.79923 1.04519 +9.81446 1.04151 +9.82969 1.03856 +9.84491 1.03918 +9.86014 1.05077 +9.87537 1.03063 +9.8906 1.03794 +9.90583 1.04232 +9.92105 1.0421 +9.93628 1.05062 +9.95151 1.03346 +9.96674 1.03362 +9.98197 1.03882 +9.99719 1.02744 +10.0124 1.02787 +10.0277 1.02953 +10.0429 1.022 +10.0581 1.03116 +10.0733 1.03738 +10.0886 1.03841 +10.1038 1.04333 +10.119 1.0361 +10.1342 1.02148 +10.1495 1.0205 +10.1647 1.04144 +10.1799 1.03162 +10.1952 1.03372 +10.2104 1.03164 +10.2256 1.04078 +10.2408 1.04197 +10.2561 1.02764 +10.2713 1.02886 +10.2865 1.02384 +10.3018 1.01692 +10.317 1.02414 +10.3322 1.02132 +10.3474 1.01519 +10.3627 1.02152 +10.3779 1.02537 +10.3931 1.03965 +10.4084 1.02686 +10.4236 1.01534 +10.4388 1.02145 +10.454 0.999813 +10.4693 1.00545 +10.4845 1.0188 +10.4997 1.0196 +10.5149 1.01774 +10.5302 1.01729 +10.5454 1.01704 +10.5606 1.00468 +10.5759 1.00492 +10.5911 1.01043 +10.6063 1.01423 +10.6215 1.02653 +10.6368 1.01957 +10.652 1.0155 +10.6672 1.015 +10.6825 1.00683 +10.6977 1.01174 +10.7129 1.01652 +10.7281 1.00918 +10.7434 1.00501 +10.7586 1.00538 +10.7738 1.00632 +10.7891 1.01332 +10.8043 1.00993 +10.8195 1.0083 +10.8347 0.992809 +10.85 1.00651 +10.8652 1.00201 +10.8804 1.01025 +10.8956 1.0169 +10.9109 1.00374 +10.9261 1.00002 +10.9413 0.997109 +10.9566 1.01582 +10.9718 0.992798 +10.987 1.00283 +11.0022 0.982988 +11.0175 0.998629 +11.0327 0.993983 +11.0479 0.99613 +11.0632 1.00155 +11.0784 0.994582 +11.0936 0.990791 +11.1088 1.01194 +11.1241 0.992389 +11.1393 0.989425 +11.1545 0.988659 +11.1698 0.992673 +11.185 0.992648 +11.2002 0.994037 +11.2154 0.995913 +11.2307 0.991514 +11.2459 0.988707 +11.2611 0.997336 +11.2763 0.99666 +11.2916 0.993194 +11.3068 0.987206 +11.322 0.994386 +11.3373 0.999408 +11.3525 0.984634 +11.3677 0.988231 +11.3829 0.988132 +11.3982 1.00011 +11.4134 0.98672 +11.4286 0.983892 +11.4439 0.98491 +11.4591 0.982931 +11.4743 0.986641 +11.4895 0.980461 +11.5048 0.968936 +11.52 0.973432 +11.5352 0.978253 +11.5505 0.980106 +11.5657 0.979677 +11.5809 0.989192 +11.5961 0.983523 +11.6114 0.984165 +11.6266 0.98658 +11.6418 0.975048 +11.657 0.971116 +11.6723 0.979756 +11.6875 0.977166 +11.7027 0.99796 +11.718 0.985845 +11.7332 0.977474 +11.7484 0.978997 +11.7636 0.9858 +11.7789 0.980538 +11.7941 0.986055 +11.8093 0.978488 +11.8246 0.979635 +11.8398 0.990661 +11.855 0.967723 +11.8702 0.987626 +11.8855 0.974989 +11.9007 0.984461 +11.9159 0.976494 +11.9312 0.971652 +11.9464 0.976983 +11.9616 0.985866 +11.9768 0.976214 +11.9921 0.976609 +12.0073 0.971812 +12.0225 0.982596 +12.0377 0.974517 +12.053 0.979185 +12.0682 0.977757 +12.0834 0.986757 +12.0987 0.972054 +12.1139 0.993208 +12.1291 0.998182 +12.1443 0.990041 +12.1596 0.98307 +12.1748 0.979413 +12.19 0.981015 +12.2053 0.980816 +12.2205 0.989581 +12.2357 0.97856 +12.2509 0.981112 +12.2662 0.993313 +12.2814 0.974518 +12.2966 0.984698 +12.3119 0.986853 +12.3271 0.983314 +12.3423 0.984083 +12.3575 0.979056 
+12.3728 0.977399 +12.388 0.992448 +12.4032 0.982261 +12.4184 0.981787 +12.4337 0.983105 +12.4489 0.989612 +12.4641 0.990494 +12.4794 0.980085 +12.4946 0.996661 +12.5098 0.985045 +12.525 0.986719 +12.5403 0.990078 +12.5555 0.990133 +12.5707 1.00157 +12.586 0.987559 +12.6012 0.984581 +12.6164 0.995218 +12.6316 0.992296 +12.6469 0.991052 +12.6621 0.980677 +12.6773 1.00429 +12.6926 0.991009 +12.7078 0.992101 +12.723 0.991408 +12.7382 0.983333 +12.7535 0.99848 +12.7687 0.985898 +12.7839 0.987423 +12.7992 0.99989 +12.8144 0.995572 +12.8296 0.994036 +12.8448 0.993298 +12.8601 0.995881 +12.8753 0.997597 +12.8905 0.983185 +12.9057 0.98454 +12.921 0.982503 +12.9362 1.00241 +12.9514 0.998406 +12.9667 0.999609 +12.9819 0.984912 +12.9971 1.00075 +13.0123 1.0002 +13.0276 0.998851 +13.0428 0.993752 +13.058 1.00349 +13.0733 1.0003 +13.0885 0.994094 +13.1037 0.995654 +13.1189 1.00459 +13.1342 1.00538 +13.1494 1.00399 +13.1646 0.99627 +13.1799 1.00068 +13.1951 1.00159 +13.2103 1.00441 +13.2255 1.00293 +13.2408 0.988753 +13.256 1.00169 +13.2712 0.993724 +13.2864 1.01453 +13.3017 1.00304 +13.3169 0.995058 +13.3321 1.0002 +13.3474 1.0015 +13.3626 1.00121 +13.3778 1.00273 +13.393 0.99926 +13.4083 1.00325 +13.4235 0.995718 +13.4387 1.02358 +13.454 0.999682 +13.4692 0.999181 +13.4844 1.00656 +13.4996 1.01058 +13.5149 1.00097 +13.5301 1.00441 +13.5453 1.00263 +13.5606 1.00886 +13.5758 0.999558 +13.591 1.0046 +13.6062 1.01246 +13.6215 1.00696 +13.6367 1.01056 +13.6519 1.01443 +13.6671 1.00576 +13.6824 1.00771 +13.6976 1.01537 +13.7128 0.998899 +13.7281 1.00204 +13.7433 1.00561 +13.7585 1.00836 +13.7737 1.01148 +13.789 1.00545 +13.8042 1.01114 +13.8194 1.00184 +13.8347 1.01227 +13.8499 1.01124 +13.8651 1.00312 +13.8803 1.00778 +13.8956 0.995527 +13.9108 1.00581 +13.926 1.0065 +13.9413 1.00925 +13.9565 1.00606 +13.9717 1.01702 +13.9869 1.00879 +14.0022 1.00281 +14.0174 1.01072 +14.0326 1.01095 +14.0478 1.01793 +14.0631 1.0039 +14.0783 1.00854 +14.0935 1.00036 +14.1088 1.0044 +14.124 1.00764 +14.1392 1.0119 +14.1544 1.00936 +14.1697 1.00897 +14.1849 1.00069 +14.2001 1.00905 +14.2154 1.00775 +14.2306 1.01383 +14.2458 1.00666 +14.261 1.01318 +14.2763 1.00309 +14.2915 1.00106 +14.3067 1.00887 +14.322 1.00323 +14.3372 1.0128 +14.3524 1.00842 +14.3676 1.00231 +14.3829 1.00777 +14.3981 1.00567 +14.4133 1.00274 +14.4285 1.00474 +14.4438 1.00987 +14.459 1.00369 +14.4742 1.0025 +14.4895 1.00707 +14.5047 1.00872 +14.5199 1.01052 +14.5351 1.00823 +14.5504 1.00837 +14.5656 1.0055 +14.5808 1.00475 +14.5961 1.01167 +14.6113 0.998363 +14.6265 1.01088 +14.6417 1.01227 +14.657 1.00522 +14.6722 1.00732 +14.6874 1.00862 +14.7027 1.01171 +14.7179 1.00598 +14.7331 1.00163 +14.7483 1.00471 +14.7636 1.00259 +14.7788 1.0033 +14.794 1.00509 +14.8092 1.002 +14.8245 0.995416 +14.8397 1.00489 +14.8549 1.00615 +14.8702 1.00987 +14.8854 1.01387 +14.9006 1.01067 +14.9158 1.00599 +14.9311 1.00329 +14.9463 1.00082 +14.9615 0.999999 +14.9768 1.00249 +14.992 0.999873 +15.0072 0.999657 +15.0224 1.00289 +15.0377 1.00172 +15.0529 0.9996 +15.0681 1.00539 +15.0834 1.00327 +15.0986 1.00432 +15.1138 1.00793 +15.129 1.00771 +15.1443 1.00527 +15.1595 0.998929 +15.1747 1.00272 +15.1899 0.996197 +15.2052 1.00656 +15.2204 1.00242 +15.2356 1.0088 +15.2509 0.999342 +15.2661 1.00018 +15.2813 1.00681 +15.2965 1.00264 +15.3118 1.00544 +15.327 1.00505 +15.3422 0.99463 +15.3575 1.0069 +15.3727 1.00132 +15.3879 1.00035 +15.4031 1.00622 +15.4184 1.00362 +15.4336 0.999379 +15.4488 1.00344 +15.4641 1.00498 +15.4793 1.00211 +15.4945 0.993869 +15.5097 1.0079 +15.525 1.0023 
+15.5402 0.999838 +15.5554 1.0038 +15.5707 1.00247 +15.5859 0.997689 +15.6011 1.00086 +15.6163 0.99927 +15.6316 0.995719 +15.6468 0.999059 +15.662 1.00104 +15.6772 0.995755 +15.6925 0.992995 +15.7077 0.990329 +15.7229 1.00235 +15.7382 0.993933 +15.7534 1.00392 +15.7686 1.00331 +15.7838 1.00066 +15.7991 0.990642 +15.8143 0.995261 +15.8295 1.00088 +15.8448 0.996147 +15.86 1.00313 +15.8752 0.999966 +15.8904 0.997574 +15.9057 1.00813 +15.9209 0.997475 +15.9361 1.00762 +15.9514 0.991799 +15.9666 1.00476 +15.9818 0.990112 +15.997 1.00309 +16.0123 0.999303 +16.0275 0.988324 +16.0427 0.996407 +16.0579 0.988229 +16.0732 0.99393 +16.0884 0.995404 +16.1036 0.998645 +16.1189 1.00401 +16.1341 0.994678 +16.1493 0.989233 +16.1645 0.996978 +16.1798 0.998579 +16.195 0.990041 +16.2102 0.998632 +16.2255 0.995467 +16.2407 0.995069 +16.2559 0.993938 +16.2711 0.991545 +16.2864 0.998999 +16.3016 1.00148 +16.3168 0.988161 +16.3321 0.995281 +16.3473 0.995091 +16.3625 0.992361 +16.3777 0.992252 +16.393 0.988035 +16.4082 1.00054 +16.4234 0.998491 +16.4386 0.99308 +16.4539 0.996208 +16.4691 0.997626 +16.4843 0.992839 +16.4996 0.994888 +16.5148 0.986361 +16.53 0.997887 +16.5452 0.998764 +16.5605 1.0012 +16.5757 0.994202 +16.5909 0.999865 +16.6062 1.00454 +16.6214 0.994557 +16.6366 1.00695 +16.6518 1.00094 +16.6671 0.994949 +16.6823 0.996254 +16.6975 0.996389 +16.7128 0.996278 +16.728 0.996072 +16.7432 0.995959 +16.7584 0.994522 +16.7737 0.991472 +16.7889 1.00073 +16.8041 1.00351 +16.8193 0.992454 +16.8346 0.997322 +16.8498 0.996165 +16.865 0.9949 +16.8803 0.996043 +16.8955 1.00158 +16.9107 0.998896 +16.9259 1.00849 +16.9412 1.00116 +16.9564 0.997542 +16.9716 0.998985 +16.9869 1.00118 +17.0021 0.997707 +17.0173 0.993592 +17.0325 0.996901 +17.0478 0.999206 +17.063 0.993668 +17.0782 0.994641 +17.0935 0.999475 +17.1087 1.0005 +17.1239 1.00145 +17.1391 0.990611 +17.1544 0.999606 +17.1696 0.99645 +17.1848 1.00269 +17.2 0.993685 +17.2153 0.996516 +17.2305 1.00043 +17.2457 0.998152 +17.261 0.998165 +17.2762 0.998825 +17.2914 0.994162 +17.3066 0.996814 +17.3219 0.995219 +17.3371 1.00938 +17.3523 1.00068 +17.3676 0.998782 +17.3828 0.990905 +17.398 1.00403 +17.4132 0.999689 +17.4285 0.999786 +17.4437 0.996597 +17.4589 0.996798 +17.4742 0.999653 +17.4894 0.99952 +17.5046 0.994026 +17.5198 0.99402 +17.5351 0.99175 +17.5503 0.997179 +17.5655 1.00362 +17.5807 0.997315 +17.596 0.999813 +17.6112 1.00139 +17.6264 1.00496 +17.6417 0.994087 +17.6569 0.997725 +17.6721 0.989625 +17.6873 1.00586 +17.7026 0.993616 +17.7178 0.997124 +17.733 0.998224 +17.7483 0.998886 +17.7635 0.999942 +17.7787 0.992025 +17.7939 1.00268 +17.8092 1.00168 +17.8244 0.994124 +17.8396 0.998065 +17.8549 0.999267 +17.8701 1.00212 +17.8853 0.994297 +17.9005 0.998222 +17.9158 1.00228 +17.931 1.00492 +17.9462 1.00299 +17.9614 0.998531 +17.9767 0.999786 +17.9919 0.999468 +18.0071 1.00673 +18.0224 1.00307 +18.0376 1.00901 +18.0528 1.00716 +18.068 1.00407 +18.0833 0.998623 +18.0985 1.00284 +18.1137 1.00499 +18.129 1.00515 +18.1442 0.998237 +18.1594 0.994327 +18.1746 0.991965 +18.1899 0.991822 +18.2051 0.998133 +18.2203 1.00003 +18.2356 1.00119 +18.2508 0.999832 +18.266 1.00296 +18.2812 1.002 +18.2965 1.00263 +18.3117 1.00378 +18.3269 0.999873 +18.3421 1.00774 +18.3574 0.999073 +18.3726 0.997758 +18.3878 1.0035 +18.4031 0.999818 +18.4183 1.00951 +18.4335 1.00258 +18.4487 1.00002 +18.464 0.999983 +18.4792 1.00451 +18.4944 1.00087 +18.5097 1.00218 +18.5249 1.0007 +18.5401 1.0021 +18.5553 0.995278 +18.5706 1.00548 +18.5858 0.997546 +18.601 1.00442 +18.6163 0.999869 +18.6315 
1.00298 +18.6467 0.994695 +18.6619 0.996902 +18.6772 1.00595 +18.6924 1.00114 +18.7076 1.00758 +18.7229 1.00893 +18.7381 1.00742 +18.7533 1.00933 +18.7685 0.99939 +18.7838 0.997533 +18.799 0.998567 +18.8142 1.00338 +18.8294 0.99885 +18.8447 1.00266 +18.8599 1.00529 +18.8751 1.00437 +18.8904 1.00419 +18.9056 0.99647 +18.9208 1.00137 +18.936 1.00081 +18.9513 1.00005 +18.9665 0.995256 +18.9817 1.00025 +18.997 1.00585 +19.0122 1.00214 +19.0274 1.00047 +19.0426 1.0036 +19.0579 1.00061 +19.0731 1.00344 +19.0883 1.00321 +19.1036 0.997911 +19.1188 0.996393 +19.134 0.999137 +19.1492 1.00116 +19.1645 1.00109 +19.1797 1.00899 +19.1949 1.00364 +19.2101 1.0008 +19.2254 0.999186 +19.2406 0.995161 +19.2558 1.00288 +19.2711 1.00665 +19.2863 1.00518 +19.3015 1.00567 +19.3167 0.999429 +19.332 1.00221 +19.3472 1.00257 +19.3624 1.00423 +19.3777 1.00342 +19.3929 0.999955 +19.4081 1.00227 +19.4233 1.00342 +19.4386 1.00159 +19.4538 1.00671 +19.469 0.999214 +19.4843 0.998714 +19.4995 1.00289 +19.5147 0.996272 +19.5299 0.997873 +19.5452 1.00403 +19.5604 1.00298 +19.5756 1.00327 +19.5908 1.00311 +19.6061 0.999254 +19.6213 0.992405 +19.6365 1.0052 +19.6518 0.999625 +19.667 0.999106 +19.6822 0.999221 +19.6974 1.00876 +19.7127 0.99767 +19.7279 1.00024 +19.7431 0.999935 +19.7584 1.00093 +19.7736 0.999372 +19.7888 0.99678 +19.804 0.999702 +19.8193 0.999524 +19.8345 1.00203 +19.8497 1.00823 +19.865 0.998975 +19.8802 0.998319 +19.8954 1.00046 +19.9106 1.00297 +19.9259 1.00094 +19.9411 1.00732 +19.9563 0.996415 +19.9715 1.0019 +19.9868 1.00273 +20.002 1.00234 +20.0172 1.00121 +20.0325 1.00486 +20.0477 1.00002 +20.0629 0.999032 +20.0781 0.998205 +20.0934 0.997586 +20.1086 1.00454 +20.1238 1.00902 +20.1391 0.998134 +20.1543 0.997283 +20.1695 0.995032 +20.1847 0.998024 +20.2 1.00387 +20.2152 0.998808 +20.2304 1.00055 +20.2457 0.999123 +20.2609 1.0006 +20.2761 1.00313 +20.2913 1.00018 +20.3066 1.00623 +20.3218 1.00005 +20.337 0.99549 +20.3522 1.00021 +20.3675 0.996335 +20.3827 0.999291 +20.3979 0.99702 +20.4132 0.994805 +20.4284 0.999715 +20.4436 1.00425 +20.4588 0.993117 +20.4741 1.00318 +20.4893 0.9952 +20.5045 1.002 +20.5198 0.999938 +20.535 1.00127 +20.5502 0.993471 +20.5654 1.00144 +20.5807 0.997991 +20.5959 1.00089 +20.6111 0.998701 +20.6264 0.999329 +20.6416 0.997229 +20.6568 1.0031 +20.672 0.996631 +20.6873 0.991754 +20.7025 1.00318 +20.7177 0.998443 +20.7329 1.00324 +20.7482 0.996948 +20.7634 0.997379 +20.7786 0.996814 +20.7939 0.993044 +20.8091 0.996725 +20.8243 0.995521 +20.8395 1.00014 +20.8548 1.00374 +20.87 1.00182 +20.8852 0.9985 +20.9005 0.995106 +20.9157 0.998739 +20.9309 0.997573 +20.9461 1.00022 +20.9614 0.998283 +20.9766 0.994195 +20.9918 0.99083 +21.0071 0.991978 +21.0223 0.993855 +21.0375 1.00076 +21.0527 0.994345 +21.068 0.997493 +21.0832 0.994287 +21.0984 1.0028 +21.1136 1.00074 +21.1289 0.994069 +21.1441 0.998804 +21.1593 1.00031 +21.1746 1.00117 +21.1898 0.996983 +21.205 1.00341 +21.2202 0.996365 +21.2355 1.00166 +21.2507 1.00267 +21.2659 1.00139 +21.2812 0.998731 +21.2964 1.00212 +21.3116 0.99536 +21.3268 1.00053 +21.3421 1.00149 +21.3573 0.999913 +21.3725 0.990863 +21.3878 1.00145 +21.403 0.996825 +21.4182 0.999125 +21.4334 0.995099 +21.4487 0.989809 +21.4639 0.999203 +21.4791 1.00341 +21.4943 1.00554 +21.5096 0.998038 +21.5248 0.999866 +21.54 0.998002 +21.5553 0.998078 +21.5705 0.995973 +21.5857 0.998171 +21.6009 0.998334 +21.6162 0.995652 +21.6314 0.996363 +21.6466 0.999092 +21.6619 0.99693 +21.6771 1.00172 +21.6923 0.998679 +21.7075 0.992616 +21.7228 1.00376 +21.738 1.00014 +21.7532 0.99862 
+21.7685 0.999707 +21.7837 1.00186 +21.7989 0.997257 +21.8141 1.00062 +21.8294 0.998933 +21.8446 1.00082 +21.8598 0.999632 +21.8751 0.995114 +21.8903 0.994505 +21.9055 0.996186 +21.9207 1.00017 +21.936 1.00553 +21.9512 1.00206 +21.9664 1.00012 +21.9816 1.00268 +21.9969 0.991884 +22.0121 0.997788 +22.0273 0.99504 +22.0426 0.996637 +22.0578 0.997132 +22.073 1.00624 +22.0882 1.0025 +22.1035 1.00008 +22.1187 0.997431 +22.1339 1.00277 +22.1492 1.0005 +22.1644 1.00028 +22.1796 1.0045 +22.1948 1.0003 +22.2101 0.998891 +22.2253 0.993679 +22.2405 0.996379 +22.2558 0.999768 +22.271 1.00507 +22.2862 0.995205 +22.3014 1.00067 +22.3167 1.00208 +22.3319 0.995291 +22.3471 0.993997 +22.3623 0.995531 +22.3776 0.997707 +22.3928 1.00108 +22.408 1.00152 +22.4233 0.989714 +22.4385 1.00307 +22.4537 1.0033 +22.4689 1.0017 +22.4842 0.99486 +22.4994 0.99589 +22.5146 0.995885 +22.5299 1.00418 +22.5451 1.00161 +22.5603 1.00012 +22.5755 0.999632 +22.5908 0.999197 +22.606 0.997626 +22.6212 1.00919 +22.6365 0.999312 +22.6517 0.999223 +22.6669 0.999326 +22.6821 0.993461 +22.6974 1.00269 +22.7126 1.00336 +22.7278 1.00175 +22.743 1.0048 +22.7583 0.999003 +22.7735 0.999332 +22.7887 1.00336 +22.804 1.00058 +22.8192 1.00452 +22.8344 1.00517 +22.8496 0.995398 +22.8649 1.00111 +22.8801 0.998855 +22.8953 1.00567 +22.9106 0.999649 +22.9258 1.00115 +22.941 1.00112 +22.9562 1.00368 +22.9715 1.00887 +22.9867 1.00259 +23.0019 1.00518 +23.0172 1.00248 +23.0324 0.998291 +23.0476 1.00694 +23.0628 0.999678 +23.0781 1.00071 +23.0933 1.00532 +23.1085 1.00449 +23.1237 1.00214 +23.139 1.00212 +23.1542 0.999775 +23.1694 0.997448 +23.1847 0.99859 +23.1999 1.00529 +23.2151 1.00113 +23.2303 0.996896 +23.2456 1.00051 +23.2608 0.999121 +23.276 1.00335 +23.2913 1.00349 +23.3065 0.99828 +23.3217 1.0034 +23.3369 0.999027 +23.3522 1.00479 +23.3674 1.00243 +23.3826 1.00185 +23.3979 1.00283 +23.4131 1.00216 +23.4283 0.99977 +23.4435 0.997196 +23.4588 1.00571 +23.474 1.00058 +23.4892 0.995492 +23.5044 1.00218 +23.5197 1.00056 +23.5349 1.00404 +23.5501 1.00583 +23.5654 1.00327 +23.5806 1.00159 +23.5958 1.00094 +23.611 0.997837 +23.6263 0.999812 +23.6415 0.997139 +23.6567 1.00263 +23.672 1.00012 +23.6872 0.99408 +23.7024 1.00205 +23.7176 0.991222 +23.7329 0.999255 +23.7481 0.998864 +23.7633 0.999196 +23.7786 1.00195 +23.7938 1.00254 +23.809 1.00246 +23.8242 1.00131 +23.8395 0.995023 +23.8547 0.998827 +23.8699 0.999389 +23.8851 0.998125 +23.9004 0.996176 +23.9156 1.00486 +23.9308 1.00284 +23.9461 0.998025 +23.9613 0.995171 +23.9765 0.992233 +23.9917 0.999198 +24.007 0.996588 +24.0222 1.00457 +24.0374 0.997793 +24.0527 0.996594 +24.0679 1.00008 +24.0831 1.00188 +24.0983 0.9997 +24.1136 1.0018 +24.1288 0.999109 +24.144 1.00026 +24.1593 0.997703 +24.1745 0.998935 +24.1897 1.00307 +24.2049 1.00645 +24.2202 0.99525 +24.2354 0.996948 +24.2506 1.00168 +24.2658 1.00102 +24.2811 1.00543 +24.2963 0.995561 +24.3115 1.00096 +24.3268 0.997224 +24.342 1.00214 +24.3572 1.00126 +24.3724 1.00095 +24.3877 1.00089 +24.4029 0.999522 +24.4181 0.998522 +24.4334 0.999231 +24.4486 0.998617 +24.4638 1.00612 +24.479 0.996587 +24.4943 1.00999 +24.5095 1.00448 +24.5247 1.00117 +24.54 0.999463 +24.5552 1.00132 +24.5704 1.00034 +24.5856 1.00558 +24.6009 1.00302 +24.6161 0.99512 +24.6313 0.993028 +24.6465 1.00115 +24.6618 0.997173 +24.677 0.998631 +24.6922 1.00296 +24.7075 1 +24.7227 0.998358 +24.7379 0.999976 +24.7531 0.997337 +24.7684 0.998541 +24.7836 1.00185 +24.7988 0.994389 +24.8141 0.995106 +24.8293 0.998645 +24.8445 0.99787 +24.8597 1.00206 +24.875 0.996485 +24.8902 0.996423 
+24.9054 1.00047 +24.9207 1.00072 +24.9359 0.999698 +24.9511 1.00333 +24.9663 0.996883 +24.9816 0.996674 +24.9968 1.0036 +25.012 0.995694 +25.0273 1.00058 +25.0425 0.995443 +25.0577 1.00488 +25.0729 1.00547 +25.0882 0.997571 +25.1034 0.998793 +25.1186 1.00479 +25.1338 0.995745 +25.1491 0.998277 +25.1643 1.00117 +25.1795 1.00319 +25.1948 0.995023 +25.21 0.998002 +25.2252 0.996658 +25.2404 0.99796 +25.2557 0.996297 +25.2709 0.999635 +25.2861 1.0053 +25.3014 1.00365 +25.3166 0.999025 +25.3318 0.996284 +25.347 1.00439 +25.3623 1.00172 +25.3775 0.997191 +25.3927 0.997091 +25.408 0.996893 +25.4232 1.00106 +25.4384 1.00067 +25.4536 0.999308 +25.4689 1.0026 +25.4841 0.99746 +25.4993 0.998779 +25.5145 0.999765 +25.5298 1.0032 +25.545 0.999851 +25.5602 0.996915 +25.5755 0.99947 +25.5907 1.00142 +25.6059 1.00072 +25.6211 1.00004 +25.6364 0.996358 +25.6516 0.99946 +25.6668 0.998759 +25.6821 0.99698 +25.6973 1.00646 +25.7125 0.996471 +25.7277 0.999867 +25.743 1.00194 +25.7582 0.994721 +25.7734 1.00047 +25.7887 0.996856 +25.8039 0.998782 +25.8191 1.00011 +25.8343 0.99849 +25.8496 1.00178 +25.8648 0.999791 +25.88 0.99774 +25.8952 1.00005 +25.9105 1.0004 +25.9257 0.999846 +25.9409 0.998851 +25.9562 0.998269 +25.9714 1.00212 +25.9866 0.9982 +26.0018 1.00176 +26.0171 0.999898 +26.0323 0.995255 +26.0475 0.99585 +26.0628 0.998067 +26.078 0.996079 +26.0932 1.00135 +26.1084 1.0028 +26.1237 0.996254 +26.1389 0.997658 +26.1541 0.994967 +26.1694 0.995345 +26.1846 1.00178 +26.1998 1.00647 +26.215 1.00109 +26.2303 1.00039 +26.2455 0.998728 +26.2607 0.997047 +26.2759 1.00197 +26.2912 0.999937 +26.3064 0.998281 +26.3216 0.994619 +26.3369 0.998552 +26.3521 1.00381 +26.3673 0.999926 +26.3825 1.00236 +26.3978 1.00152 +26.413 1.00115 +26.4282 1.00156 +26.4435 0.999225 +26.4587 0.999091 +26.4739 0.999024 +26.4891 0.999396 +26.5044 0.998598 +26.5196 0.999244 +26.5348 0.99708 +26.5501 1.00276 +26.5653 1.00152 +26.5805 1.00352 +26.5957 1.00357 +26.611 1.00525 +26.6262 0.997804 +26.6414 0.998594 +26.6566 0.995068 +26.6719 0.999998 +26.6871 1.00021 +26.7023 1.0027 +26.7176 0.9993 +26.7328 1.00225 +26.748 0.998992 +26.7632 0.996468 +26.7785 1.00607 +26.7937 1.00099 +26.8089 0.998228 +26.8242 0.998621 +26.8394 1.00451 +26.8546 0.996466 +26.8698 0.999503 +26.8851 1.0021 +26.9003 0.99847 +26.9155 0.99851 +26.9308 0.996003 +26.946 0.9992 +26.9612 0.998245 +26.9764 0.998857 +26.9917 1.00017 +27.0069 1.00556 +27.0221 1.00187 +27.0373 1.0026 +27.0526 0.997995 +27.0678 1.0002 +27.083 1.00025 +27.0983 1.00026 +27.1135 1.00449 +27.1287 1.00232 +27.1439 1.00199 +27.1592 0.997055 +27.1744 0.999161 +27.1896 1.00601 +27.2049 0.996989 +27.2201 1.00062 +27.2353 1.00093 +27.2505 1.00147 +27.2658 0.998069 +27.281 0.998241 +27.2962 1.00216 +27.3115 0.99927 +27.3267 0.998969 +27.3419 1.00493 +27.3571 0.998514 +27.3724 0.998109 +27.3876 1.00217 +27.4028 1.00006 +27.418 0.997144 +27.4333 1.0037 +27.4485 0.995328 +27.4637 0.999169 +27.479 0.998242 +27.4942 1.00106 +27.5094 1.00297 +27.5246 0.999329 +27.5399 1.00397 +27.5551 0.996963 +27.5703 1.00288 +27.5856 0.997821 +27.6008 1.00177 +27.616 1.0011 +27.6312 1.0005 +27.6465 1.00079 +27.6617 0.99898 +27.6769 0.999213 +27.6922 1.00073 +27.7074 0.995655 +27.7226 1.00389 +27.7378 1.00037 +27.7531 1.00112 +27.7683 1.00657 +27.7835 0.999549 +27.7987 1.00039 +27.814 0.999822 +27.8292 1.004 +27.8444 1.00202 +27.8597 1.00219 +27.8749 0.998042 +27.8901 1.00101 +27.9053 1.00378 +27.9206 0.999355 +27.9358 0.996409 +27.951 0.998745 +27.9663 0.998534 +27.9815 0.999825 +27.9967 0.998525 +28.0119 1.00447 +28.0272 
1.00324 +28.0424 0.999877 +28.0576 1.00488 +28.0729 1.00332 +28.0881 1.00015 +28.1033 1.00115 +28.1185 1.00392 +28.1338 1.0026 +28.149 1.00011 +28.1642 1.00127 +28.1795 0.999719 +28.1947 0.998751 +28.2099 0.998337 +28.2251 0.994791 +28.2404 0.998638 +28.2556 0.998289 +28.2708 1.00161 +28.286 0.997075 +28.3013 1.00081 +28.3165 1.0002 +28.3317 0.992318 +28.347 1.00077 +28.3622 1.00315 +28.3774 1.00343 +28.3926 1.00571 +28.4079 0.998393 +28.4231 0.998101 +28.4383 0.999632 +28.4536 1.00026 +28.4688 0.999534 +28.484 0.99882 +28.4992 1.00634 +28.5145 0.997968 +28.5297 0.996157 +28.5449 1.00133 +28.5602 0.997157 +28.5754 0.997503 +28.5906 0.9979 +28.6058 0.998794 +28.6211 0.998178 +28.6363 1.00237 +28.6515 0.999492 +28.6667 1.00073 +28.682 0.996741 +28.6972 0.998567 +28.7124 0.997349 +28.7277 1.00626 +28.7429 0.999865 +28.7581 1.00237 +28.7733 0.999891 +28.7886 0.996883 +28.8038 1.0007 +28.819 1.00378 +28.8343 1.00169 +28.8495 0.99922 +28.8647 0.999077 +28.8799 0.999014 +28.8952 1.00091 +28.9104 0.997305 +28.9256 1.00105 +28.9409 0.999335 +28.9561 1.00103 +28.9713 1.00067 +28.9865 1.00059 +29.0018 0.994208 +29.017 0.995141 +29.0322 0.999291 +29.0474 1.00169 +29.0627 0.998696 +29.0779 1.00203 +29.0931 0.998819 +29.1084 0.99799 +29.1236 0.998192 +29.1388 1.00433 +29.154 1.00138 +29.1693 0.997828 +29.1845 0.997426 +29.1997 0.997582 +29.215 1.00067 +29.2302 0.999649 +29.2454 1.00444 +29.2606 1.00105 +29.2759 1.00134 +29.2911 1.00334 +29.3063 1.00224 +29.3216 1.00076 +29.3368 1.00436 +29.352 0.997439 +29.3672 1.00314 +29.3825 0.995197 +29.3977 0.999704 +29.4129 0.997496 +29.4281 0.997932 +29.4434 1.00134 +29.4586 1.0079 +29.4738 0.999905 +29.4891 0.999492 +29.5043 1.00292 +29.5195 0.99782 +29.5347 0.999292 +29.55 0.999365 +29.5652 1.00185 +29.5804 1.00363 +29.5957 1.00107 +29.6109 0.999141 +29.6261 1.00139 +29.6413 0.997065 +29.6566 1.00339 +29.6718 0.99591 +29.687 0.996254 +29.7023 0.997663 +29.7175 1.00078 +29.7327 0.99736 +29.7479 0.997492 +29.7632 1.00173 +29.7784 0.998626 +29.7936 0.998902 +29.8088 0.997452 +29.8241 0.998709 +29.8393 1.00049 +29.8545 0.998527 +29.8698 0.999086 +29.885 0.998812 +29.9002 1.00077 +29.9154 0.999024 +29.9307 1.00053 +29.9459 0.994986 +29.9611 0.996982 +29.9764 1.00171 +29.9916 1.00101 +30.0068 1.0003 +30.022 0.999753 +30.0373 1.00202 +30.0525 1.00256 +30.0677 1.00769 +30.083 1.00259 +30.0982 0.996561 +30.1134 1.00201 +30.1286 0.999462 +30.1439 0.996334 +30.1591 0.999031 +30.1743 1.00018 +30.1895 1.00502 +30.2048 1.00052 +30.22 0.999378 +30.2352 0.996158 +30.2505 1.00432 +30.2657 0.999367 +30.2809 1.00263 +30.2961 1.00223 +30.3114 1.00106 +30.3266 1.00262 +30.3418 1.00204 +30.3571 0.999411 +30.3723 1.00333 +30.3875 0.999748 +30.4027 0.999109 +30.418 0.999573 +30.4332 0.996037 +30.4484 0.998088 diff --git a/_basic/iso/source_code/SOLUTION/dcdread.h b/_basic/iso/source_code/SOLUTION/dcdread.h new file mode 100644 index 0000000..aba2945 --- /dev/null +++ b/_basic/iso/source_code/SOLUTION/dcdread.h @@ -0,0 +1,48 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+ +void dcdreadhead(int *natom, int *nframes, std::istream &infile) +{ + + infile.seekg(8, std::ios::beg); + infile.read((char *)nframes, sizeof(int)); + infile.seekg(64 * 4, std::ios::cur); + infile.read((char *)natom, sizeof(int)); + infile.seekg(1 * 8, std::ios::cur); + return; +} + +void dcdreadframe(double *x, double *y, double *z, std::istream &infile, + int natom, double &xbox, double &ybox, double &zbox) +{ + + double d[6]; + for (int i = 0; i < 6; i++) + { + infile.read((char *)&d[i], sizeof(double)); + } + xbox = d[0]; + ybox = d[2]; + zbox = d[5]; + float a, b, c; + infile.seekg(1 * 8, std::ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&a, sizeof(float)); + x[i] = a; + } + infile.seekg(1 * 8, std::ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&b, sizeof(float)); + y[i] = b; + } + infile.seekg(1 * 8, std::ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&c, sizeof(float)); + z[i] = c; + } + infile.seekg(1 * 8, std::ios::cur); + + return; +} diff --git a/_basic/iso/source_code/SOLUTION/rdf.cpp b/_basic/iso/source_code/SOLUTION/rdf.cpp new file mode 100644 index 0000000..84295be --- /dev/null +++ b/_basic/iso/source_code/SOLUTION/rdf.cpp @@ -0,0 +1,212 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include + +#include +#include +#include +#include +#include + + +#ifdef USE_COUNTING_ITERATOR +#include +#endif + + +void pair_gpu(double *d_x, double *d_y, double *d_z, + std::atomic *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + + int nbin; + int numatm, nconf, inconf; + std::string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + std::cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + std::ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + + dcdreadhead(&numatm, &nconf, infile); + std::cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << std::endl; + if (inconf > nconf) + std::cout << "nconf is reset to " << nconf << std::endl; + else + { + nconf = inconf; + } + std::cout << "Calculating RDF for " << nconf << " frames" << std::endl; + //////////////////////////////////////////////////////// + + std::vector h_x(nconf * numatm); + std::vector h_y(nconf * numatm); + std::vector h_z(nconf * numatm); + + double *x = &h_x[0]; + double *y = &h_y[0]; + double *z = &h_z[0]; + + + //Note + std::atomic *h_g2 = new std::atomic[nbin]; + std::fill(std::execution::par, h_g2, h_g2 + nbin, 0); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + x[i * numatm + j] = ax[j]; + y[i * numatm + j] = ay[j]; + z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for Reading file + std::cout << "Reading of input file is completed" << std::endl; + 
////////////////////////////////////////////////////////////////////////// + nvtxRangePush("Pair_Calculation"); + pair_gpu(x, y, z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + + double g2[nbin]; + double s2 = 0.0l, s2bond = 0.0l; + + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + + double pi = acos(-1.0l); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double r, gr, lngr, lngrbond; + double box = std::min(xbox, ybox); + box = std::min(box, zbox); + double del = box / (2.0l * nbin); + + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << std::endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << std::endl; + stwo << "s2bond value is " << s2bond << std::endl; + + std::cout << "\n#Freeing Host memory" << std::endl; + + delete[] h_g2; + + std::cout << "#Number of atoms processed: " << numatm << std::endl + << std::endl; + std::cout << "#Number of confs processed: " << nconf << std::endl + << std::endl; + return 0; +} + +void pair_gpu(double *d_x, double *d_y, double *d_z, + std::atomic *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double cut; + double box; + box = std::min(xbox, ybox); + box = std::min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + +#ifndef USE_COUNTING_ITERATOR + std::vector indices(numatm * numatm); + std::generate(indices.begin(), indices.end(), [n = 0]() mutable { return n++; }); +#endif + + std::cout << "\n" << nconf << " "<< numatm; + for (int frame = 0; frame < nconf; frame++) + { + std::cout << "\n" << frame; +#ifdef USE_COUNTING_ITERATOR + std::for_each(std::execution::par, thrust::counting_iterator(0u), thrust::counting_iterator(numatm * numatm), +#else + std::for_each(std::execution::par, indices.begin(), indices.end(), +#endif + [d_x, d_y, d_z, d_g2, numatm, frame, xbox, ybox, zbox, cut, del](unsigned int index) { + int id1 = index / numatm; + int id2 = index % numatm; + + double dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + double dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + double dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (std::round(dx / xbox)); + dy = dy - ybox * (std::round(dy / ybox)); + dz = dz - zbox * (std::round(dz / zbox)); + + double r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + int ig2 = (int)(r / del); + ++d_g2[ig2]; + } + }); + } +} diff --git a/_basic/iso/source_code/SOLUTION/rdf.f90 b/_basic/iso/source_code/SOLUTION/rdf.f90 new file mode 100644 index 0000000..a668208 --- /dev/null +++ b/_basic/iso/source_code/SOLUTION/rdf.f90 @@ -0,0 +1,158 @@ + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: 
x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + xbox=d(1) + ybox=d(3) + zbox=d(6) + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + !use cudafor + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + + nframes=10 + + call cpu_time(start) + + !x=0;y=0;z=0 + !g=0 + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + do concurrent(i=1 : natoms, j=1:natoms) + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include + +#include +#include +#include +#include + +//Note: The addition of execution header file +#include + +#ifdef USE_COUNTING_ITERATOR +#include +#endif + +void pair_gpu(double *d_x, double *d_y, double *d_z, + std::atomic *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + + int nbin; + int numatm, nconf, inconf; + std::string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + std::cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + std::ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + + dcdreadhead(&numatm, &nconf, infile); + std::cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << std::endl; + if (inconf > nconf) + std::cout << 
"nconf is reset to " << nconf << std::endl; + else + { + nconf = inconf; + } + std::cout << "Calculating RDF for " << nconf << " frames" << std::endl; + //////////////////////////////////////////////////////// + + std::vector h_x(nconf * numatm); + std::vector h_y(nconf * numatm); + std::vector h_z(nconf * numatm); + + double *x = &h_x[0]; + double *y = &h_y[0]; + double *z = &h_z[0]; + + //Note: We are using standard std atomic which gets mapped to respective atomic operations on GPU + + std::atomic *h_g2 = new std::atomic[nbin]; + std::fill(std::execution::par, h_g2, h_g2 + nbin, 0); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for Reading file + std::cout << "Reading of input file is completed" << std::endl; + ////////////////////////////////////////////////////////////////////////// + nvtxRangePush("Pair_Calculation"); + pair_gpu(x, y, z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + + double g2[nbin]; + double s2 = 0.0l, s2bond = 0.0l; + + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + double pi = acos(-1.0l); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double r, gr, lngr, lngrbond; + double box = std::min(xbox, ybox); + box = std::min(box, zbox); + double del = box / (2.0l * nbin); + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << std::endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << std::endl; + stwo << "s2bond value is " << s2bond << std::endl; + + std::cout << "\n#Freeing Host memory" << std::endl; + + delete[] h_g2; + + std::cout << "#Number of atoms processed: " << numatm << std::endl + << std::endl; + std::cout << "#Number of confs processed: " << nconf << std::endl + << std::endl; + return 0; +} + + +void pair_gpu(double *d_x, double *d_y, double *d_z, + std::atomic *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double cut; + double box; + box = std::min(xbox, ybox); + box = std::min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + +#ifndef USE_COUNTING_ITERATOR + std::vector indices(numatm * numatm); + std::generate(indices.begin(), indices.end(), [n = 0]() mutable { return n++; }); +#endif + + std::cout << "\n" << nconf << " "<< numatm; + for (int frame = 0; frame < nconf; frame++) + { + std::cout << "\n" << frame; +#ifdef USE_COUNTING_ITERATOR + // Todo : Use the right parallel execution policy and algorithm + std::Fill parallel algorithm Here(Fill execution policy here, 
thrust::counting_iterator(0u), thrust::counting_iterator(numatm * numatm), +#else + //std::Fill parallel algorithm Here(execution policy here, indices.begin(), indices.end(), +#endif + [d_x, d_y, d_z, d_g2, numatm, frame, xbox, ybox, zbox, cut, del](unsigned int index) { + int id1 = index / numatm; + int id2 = index % numatm; + + double dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + double dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + double dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (std::round(dx / xbox)); + dy = dy - ybox * (std::round(dy / ybox)); + dz = dz - zbox * (std::round(dz / zbox)); + + double r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + int ig2 = (int)(r / del); + ++d_g2[ig2]; + } + }); + } +} diff --git a/_basic/iso/source_code/rdf.f90 b/_basic/iso/source_code/rdf.f90 new file mode 100644 index 0000000..e181995 --- /dev/null +++ b/_basic/iso/source_code/rdf.f90 @@ -0,0 +1,161 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + do 
i=1,natoms + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r\n", + "\n", + "For more information, please checkout this [Kokkos Tutorial](http://on-demand.gputechconf.com/gtc/2017/presentation/s7344-christian-trott-Kokkos.pdf).\n", + "\n", + "The Kokkos programming model is characterized by 6 core abstractions: Execution Spaces, Execution Patterns, Execution Policies, Memory Spaces, Memory Layout and Memory Traits. These abstraction concepts allow the formulation of generic algorithms\n", + "and data structures which can then be mapped to different types of architectures.\n", + "\n", + "We will be providing brief introduction to Kokkos abstraction necessary for us to get started and run our code on GPU using Kokkos library in this tutorial. We do not intend to cover Memory Layout and Memory Traits in this tutorial which may be essential to get best performance on our code. Participants are recommended to go through reference section below and can optimize the code further.\n", + "\n", + "\n", + "## Concepts of threaded data parallelism\n", + "\n", + "In this section we will introduce you to core abstractions and concepts in Kokkos based on which we will port the **Pair Calculation algorithm**\n", + "\n", + "**Pattern**: In Kokkos, *Execution Patterns* are the fundamental parallel algorithms in which an application has to be expressed. To list a few:\n", + "- parallel_for: Dispatches a parallel \"for loop\" with independent iterations. We will be using this pattern during our porting exercise\n", + "- parallel_reduce: Combines a parallel_for execution with a reduction operation,\n", + "- parallel_scan: Combines a parallel_for operation with a prefix or postfix scan on output values of each operation, and\n", + "\n", + "\n", + "**Execution Policy:** An Execution Policy defines how computations are executed (static scheduling, dynamic scheduling etc.) For example the most simple form of execution policies are _Range Policies_. They are used to execute an operation once for each element in a range. We will be using this policy in our code. On the other hand _Team policies_ are used to implement hierarchical parallelism. You will see that CUDA Programming model is hierarchical in nature. In fact CUDA programming model inspired Kokkos' thread team model. \n", + "\n", + "**Computational Body:** Consists of code which performs each unit of work (e.g. the loop body)\n", + "\n", + "Pattern and policy together drive the computational body.\n", + "\n", + "The diagram below shows these three concepts in a sequential code. In our code we will choosing one among the available patterns and execution policies to run in parallel on GPU. \n", + "\n", + "\n", + "\n", + "\n", + "## Core Capabilities\n", + "\n", + "Kokkos supports multiple patterns and Execution policies. The capabilities listed below will be required for you to be familiar with in order to move a serial code to a GPU.\n", + "\n", + "### Parallel Loops:\n", + "Parallel loop pattern maps the work to computation unit core where \n", + "- Each iteration of a computational body is a unit of work.\n", + "- An iteration index identifies a particular unit of work.\n", + "- An iteration range identifies a total amount of work.\n", + "\n", + "We will be using Kokkos::parallel_for to map the work to cores: ```parallel_for ( ... 
);```\n", + "\n", + "```parallel_for``` is the most common parallel dispatch operation. It corresponds to the OpenMP construct ```#pragma omp parallel for```. Parallel_for splits the index range over the available hardware resources and executes the loop body in parallel. Each iteration is executed independently. Kokkos promises nothing about the loop order or the amount of work which actually runs concurrently. This means in particular that not all loop iterations are active at the same time \n", + "\n", + "The two key ways of writing computational bodies are: \n", + "\n", + "**Functor Based** : Functor is a common pattern used in C++ and sample code below demonstrates the use of functor. If you are new to functors don't worry. They are nothing but structures with function which overloads the () operator.\n", + "\n", + "```cpp\n", + "struct ParallelFunctor {\n", + "...\n", + " void operator ()( a work assignment based on index) const {\n", + " /* ... computational body ... */\n", + "...\n", + "};\n", + "```\n", + "A sample code of usage of functor for daxpy (```y = a*x + y```) operation is as follows:\n", + "\n", + "```cpp\n", + "//Define functor with member variable\n", + "struct Functor {\n", + " double *_x , *_y , _a;\n", + " Functor (x, y, a) :\n", + " _x(x), _y(y), _a(a) {}\n", + " void operator ()( const size_t i) {\n", + " _y [i] = _a * _x [i] + _y [i ];\n", + " }\n", + "};\n", + "\n", + "//Call functor by creating an object and calling parallel_for\n", + "Functor functor ( x , y, a );\n", + "Kokkos :: parallel_for ( vector_size , functor );\n", + "\n", + "```\n", + "\n", + "**C++11 Lambda** : Lambdas were first introduced in C++11 and are a very concise way of writing code. Basically Lambdas are compiler generated functors you can use.\n", + "\n", + "Sample usage of Lambdas for our example is shown above:\n", + "\n", + "```cpp \n", + "double * x = new double [N ]; // also y\n", + "parallel_for (N , [=] ( const size_t i) {\n", + " y[i ] = a * x[i] + y[i ];\n", + "});\n", + "\n", + "```\n", + "\n", + "Kokkos lets users choose whether to use a functor or a lambda. Lambdas are convenient for short loop bodies. For a much more complicated loop body, you might find it easier for testing to separate it out and name it as a functor\n", + "\n", + "### Execution Space\n", + "\n", + "Execution space defines where the parallel code will run. Types of Execution spaces include: Serial, Threads, OpenMP, CUDA, ROCm. Execution space can be defined either at compile time or run time as part of the policy. If none specified it will run in a default execution policy set during compilation of the Kokkos Core library. \n", + "\n", + "\n", + "### Memory Space\n", + "\n", + "Memory Spaces are the places where data resides. They specify physical location of data as well as certain access characteristics. Different physical locations correspond to things such as high bandwidth memory, on die scratch memory or non-volatile bulk storage. Different logical memory spaces allow for concepts such as UVM memory in the CUDA programming model, which is accessible from Host and GPU. \n", + "\n", + "In the code sample above both x,y arrays reside in a CPU memory.\n", + "\n", + "We need a way of storing data (multidimensional arrays) which can be communicated to an accelerator (GPU). This is done via _views_.\n", + "\n", + "**Views:** Views are a lightweight C++ class with a pointer to array data and some meta-data. 
\n", + "- Or a simple definition could be that Views are *like pointers* and needs to be copied inside functor.\n", + "- Views are multi dimensions and the dimensions are fixed at compile time\n", + "\n", + "For the daxpy code here is how the Views will get created.\n", + "```cpp\n", + "View < double *, ... > x (...) , y (...);\n", + " //... populate x , y ...\n", + " \n", + "parallel_for (N , [=] ( const size_t i) {\n", + " // Views x and y are captured by value ( copy )\n", + " y(i ) = a * x(i) + y(i );\n", + "});\n", + "```\n", + "\n", + "### Data Transfer\n", + "\n", + "Every view stores its data in a memory space set at compile time.\n", + "\n", + "```cpp \n", + "View data(...); \n", + "```\n", + "\n", + "If none specified it will chose the default execution policy. Since views are similar to pointers, we need to perform deep copies explicitly (unless we are making use of UVM: [Unified Virtual Memory Space](../GPU_Architecture_Terminologies.ipynb) supported by CUDA). \n", + "\n", + "In this example we intend to use explicit copies. The example below demonstrates that *view* resides in the Default Execution space while we create a mirror which resides in the host execution space. Then we can copy data back and forth between two views using the *deep_copy* API.\n", + "\n", + "```cpp\n", + "//Define a View pointing to default execution space.\n", + "typedef Kokkos :: View < double ** > ViewType ;\n", + "ViewType view (...);\n", + "\n", + "//Create a Host Mirror of View\n", + "ViewType :: HostMirror hostView = Kokkos :: createmirrorview ( view );\n", + "\n", + "// copying from host to device\n", + "Kokkos::deep_copy(view, host_view); \n", + "...\n", + "\n", + "// copying from device to host\n", + "Kokkos::deep_copy(host_view,view); \n", + "```\n", + "\n", + "\n", + "\n", + "ref: [Kokkos Tutorial](http://on-demand.gputechconf.com/gtc/2017/presentation/s7344-christian-trott-Kokkos.pdf)\n", + "\n", + "## Kokkos Initialization and Finalize\n", + "\n", + "In order to use Kokkos an initialization call is required. That call is responsible for acquiring hardware resources such as threads. Typically, this call should be placed right at the start of a program\n", + "\n", + "The simplest way to initialize Kokkos is by calling the following function:\n", + "```cpp\n", + "Kokkos::initialize(int& argc, char* argv[]); \n", + "```\n", + "\n", + "At the end of each program, Kokkos needs to be shut down in order to free resources; do this by calling \n", + "```cpp\n", + "Kokkos::finalize()\n", + "```\n", + "\n", + "## Atomic Construct\n", + "\n", + "In the code you will also require one more construct which will help you in getting the right results. Kokkos atomic construct ensures that a particular variable is accessed and/or updated atomically to prevent indeterminate results and race conditions. In other words, it prevents one thread from stepping on the toes of other threads due to accessing a variable simultaneously, resulting in different results run-to-run. For example, if I want to count the number of elements that have a value greater than zero, we could write the following:\n", + "\n", + "\n", + "```cpp\n", + "if ( val > 0 )\n", + "{\n", + " Kokkos::atomic_increment(&cnt));\n", + "}\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, lets start modifying the original code and add the Kokkos contracts. From the top menu, click on *File*, and *Open* `rdf.cpp` and `dcdread.h` from the current directory at `C/source_code/kokkos` directory. 
Remember to **SAVE** your code after changes, before running the cells below.\n", + "\n", + "**Note**: Look at the *Todo*s in your code and fill in the right execution pattern and copy directions. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compile and Run for NVIDIA GPU\n", + "\n", + "Having added the Kokkos API calls, let us compile the code. We will be using the _nvcc_wrapper_ script, which comes as part of the Kokkos source code, for compilation. We link the code against a pre-compiled Kokkos library, libkokkoscore.a.\n", + "\n", + "Also, in order to enable lambdas, we will add two more compilation flags: ```--expt-extended-lambda``` and ```-std=c++11``` " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for the default execution space (GPU)\n", + "!cd ../source_code && /opt/kokkos/kokkos-master/build/install/bin/nvcc_wrapper -I/opt/kokkos/kokkos-master/build/install/include -L/opt/kokkos/kokkos-master/build/install/lib -lkokkoscore --expt-extended-lambda -std=c++11 -lnvToolsExt rdf.cpp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Make sure to validate the output by running the executable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Run the code on the default execution space\n", + "!cd ../source_code && ./rdf && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output entropy value should be the following:\n", + "\n", + "```\n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Profile and see the nvtx output\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_kokkos ./rdf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check out the profiler's report. [Download the profiler output](../source_code/rdf_kokkos.nsys-rep) and open it via the GUI. Have a look at the example expected profiler report below:\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Kokkos Analysis\n", + "\n", + "**Usage Scenarios**\n", + "- Kokkos was developed keeping 3 aspects core to its design: Performance, Portability and Productivity through abstraction.\n", + " - It compiles and runs on multiple architectures\n", + " - Obtains performant memory access patterns across architectures, hence providing performance portability\n", + " - Allows developers to utilize architecture-specific features where possible\n", + "- Kokkos has proven itself to provide good performance on various architectures. It has active community support and is developed by Sandia National Laboratories. LAMMPS, one of the most widely used MD packages, has a branch which makes use of Kokkos.\n", + "\n", + "**Limitations/Constraints**\n", + "1. Kokkos is only for C++11-onwards development. \n", + "2. Using Kokkos is invasive: a significant part of the data structures needs to be taken over (turned into Kokkos Views) to get performance out of the code. 
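\n", + "\n", + "To make the second point concrete, below is a minimal sketch based on the daxpy example from earlier (the names `d_x` and `d_y` are illustrative, not from the lab code): the plain heap arrays have to be replaced by Kokkos Views before the loop can be dispatched.\n", + "\n", + "```cpp\n", + "// Before: plain C++ arrays, loop owned by the application\n", + "double *x = new double[N], *y = new double[N];\n", + "for (size_t i = 0; i < N; i++) y[i] = a * x[i] + y[i];\n", + "\n", + "// After: the data has to live in Views (allocated in the default\n", + "// memory space) and the loop body becomes a lambda run by Kokkos\n", + "Kokkos::View<double *> d_x(\"x\", N), d_y(\"y\", N);\n", + "Kokkos::parallel_for(N, KOKKOS_LAMBDA(const size_t i) {\n", + "  d_y(i) = a * d_x(i) + d_y(i);\n", + "});\n", + "```\n", + "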
\n", + "\n", + "**How is Kokkos different from other directive based methods like OpenMP or OpenACC?**\n", + "\n", + "- Kokkos uses C++ templates, rather then compiler pragmas, to generate parallel code for the GPU.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# Optional Exercise\n", + "\n", + "## Run on a multicore execution space\n", + "\n", + "Try using the multicore execution space and run the code on a multicore.\n", + "You can refer to [Kokkos Documetation](https://github.com/kokkos/kokkos/wiki/The-Kokkos-Programming-Guide) for more information.\n", + "\n", + "**Understand and analyze** the code present at:\n", + "\n", + "[RDF Code](../source_code/rdf.cpp)\n", + "\n", + "[File Reader](../source_code/dcdread.h)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for multicore execution space:: GPU\n", + "! cd ../source_code && /opt/kokkos/kokkos-master/build/install/bin/nvcc_wrapper -I/opt/kokkos/kokkos-master/build/install/include -L/opt/kokkos/kokkos-master/build/install/lib -lkokkoscore --expt-extended-lambda -std=c++11 -lnvToolsExt rdf.cpp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output using nvptx\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_kokkos_multicore ./rdf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's checkout the profiler's report. [Download the profiler output](../source_code/rdf_kokkos_multicore.nsys-rep) and open it via the GUI. Have a look at the example expected profiler report below:\n", + "\n", + "\n", + "\n", + "Feel free to checkout the [solution](../source_code/SOLUTION/rdf.cpp) to help you understand better or compare your implementation with the sample solution." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, it is recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd ..\n", + "rm -f nways_files.zip\n", + "zip -r nways_files.zip *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip). Let us now go back to parallelizing our code using other approaches.\n", + "\n", + "**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n", + "\n", + "-----\n", + "\n", + "#

HOME

\n", + "\n", + "-----\n", + "\n", + "\n", + "# Links and Resources\n", + "[Kokkos Download](https://github.com/kokkos/kokkos)\n", + "\n", + "[Kokkos Sample Codes](https://github.com/kokkos/kokkos-tutorials)\n", + "\n", + "[Kokkos Tutorial](http://on-demand.gputechconf.com/gtc/2017/presentation/s7344-christian-trott-Kokkos.pdf)\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download Nsight System latest version from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- \n", + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/kokkos/source_code/SOLUTION/rdf.cpp b/_basic/kokkos/source_code/SOLUTION/rdf.cpp new file mode 100644 index 0000000..faa29dd --- /dev/null +++ b/_basic/kokkos/source_code/SOLUTION/rdf.cpp @@ -0,0 +1,220 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include +#include + +int l_round(float num); + +typedef Kokkos::View view_type_double; +typedef Kokkos::View view_type_long; +typedef view_type_double::HostMirror host_view_type_double; +typedef view_type_long::HostMirror host_view_type_long; + +void pair_gpu(view_type_double d_x, view_type_double d_y, view_type_double d_z, + view_type_long d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + Kokkos::initialize(argc, argv); + { + + printf("Default Kokkos execution space %s\n", + typeid(Kokkos::DefaultExecutionSpace).name()); + + double xbox, ybox, zbox; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + view_type_double x("x", nconf * numatm); + view_type_double y("y", nconf * numatm); + view_type_double z("z", nconf * numatm); + view_type_long g2("g2", nbin); + + host_view_type_double h_x = Kokkos::create_mirror_view(x); + host_view_type_double h_y = Kokkos::create_mirror_view(y); + host_view_type_double h_z = Kokkos::create_mirror_view(z); + host_view_type_long h_g2 = Kokkos::create_mirror_view(g2); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x(i * numatm + j) = ax[j]; + h_y(i * numatm + j) = ay[j]; + h_z(i * numatm + j) = az[j]; + } + } + for (int i = 0; i < nbin; i++) + h_g2(0) = 0; + + nvtxRangePop(); //pop for Reading file + cout << "Reading of input file is completed" << endl; + + nvtxRangePush("Pair_Calculation"); + Kokkos::deep_copy(x, h_x); + Kokkos::deep_copy(y, h_y); + Kokkos::deep_copy(z, h_z); + Kokkos::deep_copy(g2, h_g2); + ////////////////////////////////////////////////////////////////////////// + pair_gpu(x, y, z, g2, numatm, nconf, xbox, ybox, zbox, nbin); + Kokkos::deep_copy(h_g2, g2); + nvtxRangePop(); //Pop for Pair Calculation + double pi = acos(-1.0l); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double t_g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + t_g2[i] = (double)h_g2(i) / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile 
<< (i + 0.5l) * del << " " << t_g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = t_g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (t_g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(t_g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((t_g2[i] * lngrbond) - t_g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + + } // Kokkos Initialize ends here + Kokkos::finalize(); + return 0; +} +int l_round(float num) +{ + return num < 0 ? num - 0.5 : num + 0.5; +} + +void pair_gpu(view_type_double d_x, view_type_double d_y, view_type_double d_z, + view_type_long d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin) +{ + + printf("\n %d %d ", nconf, numatm); + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); + Kokkos::parallel_for((numatm * numatm), KOKKOS_LAMBDA(const int index) { + int id1 = index / numatm; + int id2 = index % numatm; + double r, cut, dx, dy, dz; + int ig2; + double box; + int myround; + float num; + box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0 * d_bin); + cut = box * 0.5; + + dx = d_x(frame * numatm + id1) - d_x(frame * numatm + id2); + dy = d_y(frame * numatm + id1) - d_y(frame * numatm + id2); + dz = d_z(frame * numatm + id1) - d_z(frame * numatm + id2); + + num = dx / xbox; + myround = num < 0 ? num - 0.5 : num + 0.5; + dx = dx - xbox * myround; + + num = dy / ybox; + myround = num < 0 ? num - 0.5 : num + 0.5; + dy = dy - ybox * myround; + + num = dz / zbox; + myround = num < 0 ? num - 0.5 : num + 0.5; + dz = dz - zbox * myround; + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); + Kokkos::atomic_increment(&d_g2(ig2)); + } + }); + } +} diff --git a/_basic/kokkos/source_code/dcdread.h b/_basic/kokkos/source_code/dcdread.h new file mode 100644 index 0000000..66ddba0 --- /dev/null +++ b/_basic/kokkos/source_code/dcdread.h @@ -0,0 +1,49 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
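+ +// Helpers for the DCD trajectory format used in this lab: dcdreadhead() skips +// the header records and pulls out the frame and atom counts, while +// dcdreadframe() reads the six unit-cell doubles followed by the x, y and z +// coordinate records (single-precision floats) for one frame.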
+using namespace std; + +void dcdreadhead(int *natom, int *nframes, std::istream &infile) +{ + + infile.seekg(8, ios::beg); + infile.read((char *)nframes, sizeof(int)); + infile.seekg(64 * 4, ios::cur); + infile.read((char *)natom, sizeof(int)); + infile.seekg(1 * 8, ios::cur); + return; +} + +void dcdreadframe(double *x, double *y, double *z, std::istream &infile, + int natom, double &xbox, double &ybox, double &zbox) +{ + + double d[6]; + for (int i = 0; i < 6; i++) + { + infile.read((char *)&d[i], sizeof(double)); + } + xbox = d[0]; + ybox = d[2]; + zbox = d[5]; + float a, b, c; + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&a, sizeof(float)); + x[i] = a; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&b, sizeof(float)); + y[i] = b; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&c, sizeof(float)); + z[i] = c; + } + infile.seekg(1 * 8, ios::cur); + + return; +} diff --git a/_basic/kokkos/source_code/rdf.cpp b/_basic/kokkos/source_code/rdf.cpp new file mode 100644 index 0000000..cc2e29a --- /dev/null +++ b/_basic/kokkos/source_code/rdf.cpp @@ -0,0 +1,232 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include // Note:: Included the Kokkos core library +#include + +int l_round(float num); + +//Todo: Fill the correct data type and dimensions in the code +typedef Kokkos::View view_type_double; +typedef Kokkos::View view_type_long; + +typedef view_type_double::HostMirror host_view_type_double; +typedef view_type_long::HostMirror host_view_type_long; + +void pair_gpu(view_type_double d_x, view_type_double d_y, view_type_double d_z, + view_type_long d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + //Note:: We are initailizing the Kokkos library before calling any Kokkos API + Kokkos::initialize(argc, argv); + { + + //Note: This will print the default execution space with which Kokkos library was built + printf("Default Kokkos execution space %s\n", + typeid(Kokkos::DefaultExecutionSpace).name()); + + double xbox, ybox, zbox; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + //Todo: Fill the correct dimension is view type. 
This is where the allocation on default Memory space will occur + view_type_double x("x", Fill here); + view_type_double y("y", Fill here); + view_type_double z("z", Fill here); + view_type_long g2("g2", Fill here); + + //Todo : Fill the right mirror image variabe here + host_view_type_double h_x = Kokkos::create_mirror_view(x); + host_view_type_double h_y = Kokkos::create_mirror_view(Fill here); + host_view_type_double h_z = Kokkos::create_mirror_view(Fill here); + host_view_type_long h_g2 = Kokkos::create_mirror_view(Fill here); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x(i * numatm + j) = ax[j]; + h_y(i * numatm + j) = ay[j]; + h_z(i * numatm + j) = az[j]; + } + } + for (int i = 0; i < nbin; i++) + h_g2(0) = 0; + + nvtxRangePop(); //pop for Reading file + cout << "Reading of input file is completed" << endl; + + nvtxRangePush("Pair_Calculation"); + //Todo: Copy from Host to device h_x->x,h_y->y, h_z-> z and h_g2->g2 + Kokkos::deep_copy(Fill Destination View, Fill Source View); + Kokkos::deep_copy(Fill Destination View, Fill Source View); + Kokkos::deep_copy(Fill Destination View, Fill Source View); + Kokkos::deep_copy(Fill Destination View, Fill Source View); + ////////////////////////////////////////////////////////////////////////// + pair_gpu(x, y, z, g2, numatm, nconf, xbox, ybox, zbox, nbin); + //Todo: Copy from Device to host g2 -> h_g2 before being used on host + Kokkos::deep_copy(Fill Destination View, Fill Source View); + nvtxRangePop(); //Pop for Pair Calculation + double pi = acos(-1.0l); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double t_g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + t_g2[i] = (double)h_g2(i) / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << t_g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = t_g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (t_g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(t_g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((t_g2[i] * lngrbond) - t_g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + + } // Kokkos Initialize ends here + //Note:: Free up the memory + Kokkos::finalize(); + return 0; +} +int l_round(float num) +{ + return num < 0 ? 
num - 0.5 : num + 0.5; +} + +void pair_gpu(view_type_double d_x, view_type_double d_y, view_type_double d_z, + view_type_long d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin) +{ + + printf("\n %d %d ", nconf, numatm); + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); + //Fill here the pattern we intend to use along with loop size + Kokkos::Fill_Here( + Fill the loop size here, KOKKOS_LAMBDA(const int index) { + int id1 = index / numatm; + int id2 = index % numatm; + double r, cut, dx, dy, dz; + int ig2; + double box; + int myround; + float num; + box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0 * d_bin); + cut = box * 0.5; + + dx = d_x(frame * numatm + id1) - d_x(frame * numatm + id2); + dy = d_y(frame * numatm + id1) - d_y(frame * numatm + id2); + dz = d_z(frame * numatm + id1) - d_z(frame * numatm + id2); + + num = dx / xbox; + myround = num < 0 ? num - 0.5 : num + 0.5; + dx = dx - xbox * myround; + + num = dy / ybox; + myround = num < 0 ? num - 0.5 : num + 0.5; + dy = dy - ybox * myround; + + num = dz / zbox; + myround = num < 0 ? num - 0.5 : num + 0.5; + dz = dz - zbox * myround; + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); + //Note: We are using a atomic increment here + Kokkos::atomic_increment(&d_g2(ig2)); + } + }); + } +} diff --git a/_basic/nways_Dockerfile b/_basic/nways_Dockerfile new file mode 100644 index 0000000..cf0dfa1 --- /dev/null +++ b/_basic/nways_Dockerfile @@ -0,0 +1,40 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +# To build the docker container, run: $ sudo docker build -f nways_Dockerfile -t nways:c . +# To run: $ sudo docker run --rm -it --runtime nvidia -p 8888:8888 nways:c +# Finally, open http://localhost:8888/ + +FROM nvcr.io/nvidia/nvhpc:22.7-devel-cuda_multi-ubuntu20.04 + +RUN apt-get -y update && \ + DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends python3-pip python3-setuptools nginx zip make build-essential libtbb-dev && \ + rm -rf /var/lib/apt/lists/* && \ + pip3 install --upgrade pip &&\ + pip3 install gdown + +RUN apt-get update -y +RUN apt-get install -y git nvidia-modprobe +RUN pip3 install jupyterlab +# Install required python packages +RUN pip3 install ipywidgets + +############################################ +RUN apt-get update -y + +# TO COPY the data +COPY openacc/ /labs/openacc +COPY openmp/ /labs/openmp +COPY _common/ /labs/_common +COPY iso/ /labs/iso +COPY cuda/ /labs/cuda +COPY _start_nways.ipynb /labs + +RUN python3 /labs/_common/dataset.py + +################################################# +ENV PATH="/usr/local/bin:/opt/anaconda3/bin:/usr/bin:$PATH" +################################################# + +#ADD nways_labs/ /labs +WORKDIR /labs +CMD jupyter-lab --no-browser --allow-root --ip=0.0.0.0 --port=8888 --NotebookApp.token="" --notebook-dir=/labs diff --git a/_basic/nways_Dockerfile_python b/_basic/nways_Dockerfile_python new file mode 100644 index 0000000..540acdf --- /dev/null +++ b/_basic/nways_Dockerfile_python @@ -0,0 +1,65 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +# To build the docker container, run: $ sudo docker build -f nways_Dockerfile_python -t nways:p . 
+# To run: $ sudo docker run --rm -it --runtime nvidia -p 8888:8888 nways:p +# Finally, open http://localhost:8888/ + +#FROM nvidia/cuda:11.2.2-devel-ubuntu20.04 +FROM nvidia/cuda:11.4.2-devel-ubuntu20.04 + +##### +# Read https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772 +RUN apt-key del 7fa2af80 +RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub +##### + +RUN apt-get -y update && \ + DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends \ + python3-dev \ + python3-pip python3-setuptools nginx zip make build-essential libtbb-dev && \ + rm -rf /var/lib/apt/lists/* + +RUN pip3 install --no-cache-dir -U install setuptools pip +RUN pip3 install gdown +RUN apt-get update -y +RUN apt-get install -y git nvidia-modprobe +# Install required python packages +RUN pip3 install jupyterlab +RUN pip3 install ipywidgets +#RUN pip3 install --upgrade numpy==1.19.5 +RUN pip3 install --upgrade numpy==1.21.1 +#RUN pip3 install --no-cache-dir "cupy-cuda112==9.0.0" \ +RUN pip3 install --no-cache-dir "cupy-cuda114==10.3.1" \ + numba==0.53.1 scipy + +############################################ +# NVIDIA nsight-systems-cli-2022.1.1, nsight-compute-2022.1.1 +RUN apt-get update -y && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + apt-transport-https \ + ca-certificates \ + gnupg \ + wget && \ + #apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys F60F4B3D7FA2AF80 && \ + wget -qO - https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/nvidia.pub | apt-key add - &&\ + echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" >> /etc/apt/sources.list.d/nsight.list &&\ + apt-get update -y + +RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends nsight-systems-cli-2022.1.1 nsight-compute-2022.1.1 + +# TO COPY the data +COPY python/ /labs/python +COPY _common/ /labs/_common +COPY _start_nways.ipynb /labs + +RUN python3 /labs/_common/dataset_python.py + +################################################# +ENV LD_LIBRARY_PATH="/usr/local/lib:/usr/local/lib/python3.8/dist-packages:/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" +ENV PATH="/opt/nvidia/nsight-systems/2022.1.1/bin:/opt/nvidia/nsight-compute/2022.1.1:/usr/local/bin:/bin:/usr/local/cuda/bin:/usr/bin${PATH:+:${PATH}}" + +RUN pip3 install --no-cache-dir MDAnalysis + +#ADD nways_labs/ /labs +WORKDIR /labs +CMD jupyter-lab --no-browser --allow-root --ip=0.0.0.0 --port=8888 --NotebookApp.token="" --notebook-dir=/labs diff --git a/_basic/nways_Singularity b/_basic/nways_Singularity new file mode 100644 index 0000000..b4ddb5f --- /dev/null +++ b/_basic/nways_Singularity @@ -0,0 +1,60 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+ +# To build the singularity container, run: $ singularity build --fakeroot nways_c.simg nways_Singularity +# To copy the content of the container: $ singularity run nways_c.simg cp -rT /labs ~/labs +# To run: $ singularity run --nv nways_c.simg jupyter-lab --notebook-dir=~/labs +# Finally, open http://localhost:8888/ + +Bootstrap: docker +FROM: nvcr.io/nvidia/nvhpc:22.7-devel-cuda_multi-ubuntu20.04 + +%environment + export XDG_RUNTIME_DIR= + export PATH="$PATH:/usr/local/bin:/opt/anaconda3/bin:/usr/bin" + +%post + build_tmp=$(mktemp -d) && cd ${build_tmp} + + apt-get -y update + apt-get -y dist-upgrade + DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends \ + m4 vim-nox emacs-nox nano zip\ + python3-pip python3-setuptools git-core inotify-tools \ + curl git-lfs \ + build-essential libtbb-dev + rm -rf /var/lib/apt/cache/* + + pip3 install --upgrade pip + pip3 install gdown + apt-get update -y + apt-get -y install git nvidia-modprobe + pip3 install jupyterlab + pip3 install ipywidgets + + apt-get install --no-install-recommends -y build-essential + + python3 /labs/_common/dataset.py + + apt-get update -y + apt-get install --no-install-recommends -y build-essential + + wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh + bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/anaconda3 + rm Miniconda3-latest-Linux-x86_64.sh + + cd / + rm -rf ${build_tmp} + +%files + ../_basic/openacc/ /labs/openacc + ../_basic/openmp/ /labs/openmp + ../_basic/_common/ /labs/_common + ../_basic/iso/ /labs/iso + ../_basic/cuda/ /labs/cuda + ../_basic/_start_nways.ipynb /labs + +%runscript + "$@" + +%labels + AUTHOR mozhgank diff --git a/_basic/nways_Singularity_python b/_basic/nways_Singularity_python new file mode 100644 index 0000000..a5df837 --- /dev/null +++ b/_basic/nways_Singularity_python @@ -0,0 +1,86 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+ +# To build the singularity container, run: $ singularity build --fakeroot nways_p.simg nways_Singularity_python +# To copy the content of the container: $ singularity run nways_p.simg cp -rT /labs ~/labs +# To run: $ singularity run --nv nways_p.simg jupyter-lab --notebook-dir=~/labs +# Finally, open http://localhost:8888/ + +Bootstrap: docker +#FROM: nvidia/cuda:11.2.2-devel-ubuntu20.04 +FROM: nvidia/cuda:11.4.2-devel-ubuntu20.04 + +%environment + export XDG_RUNTIME_DIR= + export PATH="$PATH:/usr/local/bin:/usr/bin" + export PATH=/opt/nvidia/nsight-systems/2022.1.1/bin:/opt/nvidia/nsight-compute/2022.1.1:/bin:/usr/local/cuda/bin$PATH + export LD_LIBRARY_PATH="/usr/include/python3.8:/usr/local/lib:/usr/local/lib/python3.8/dist-packages:/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" + + +%post + build_tmp=$(mktemp -d) && cd ${build_tmp} + +##### +# Read https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772 + apt-key del 7fa2af80 + apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub +##### + + apt-get -y update + apt-get -y dist-upgrade + DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends python3-dev \ + m4 vim-nox emacs-nox nano zip \ + python3-pip python3-setuptools nginx zip make build-essential libtbb-dev + rm -rf /var/lib/apt/cache/* + + pip3 install --no-cache-dir -U install setuptools pip + apt-get -y update + apt-get -y install git nvidia-modprobe + pip3 install 'chardet>=3.0.2,<3.1.0' 'idna>=2.5,<2.8' 'urllib3>=1.21.1,<1.24' 'certifi>=2017.4.17' + pip3 install jupyterlab + pip3 install ipywidgets + pip3 install gdown + pip3 install --upgrade numpy==1.21.1 + # pip3 install --upgrade numpy==1.19.5 + #pip3 install --no-cache-dir "cupy-cuda112==9.0.0" \ + pip3 install --no-cache-dir "cupy-cuda114==10.3.1" \ + numba==0.53.1 scipy + + #apt-get install --no-install-recommends -y build-essential + + python3 /labs/_common/dataset_python.py + + touch /labs/python/jupyter_notebook/cupy/RDF.dat + touch /labs/python/jupyter_notebook/cupy/Pair_entropy.dat + +# NVIDIA nsight-systems-cli-2022.1.1, nsight-compute-2022.1.1 + apt-get update -y + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends apt-transport-https ca-certificates gnupg wget + # apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys F60F4B3D7FA2AF80 + wget -qO - https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/nvidia.pub | apt-key add - + echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" >> /etc/apt/sources.list.d/nsight.list + apt-get update -y + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends nsight-systems-cli-2022.1.1 nsight-compute-2022.1.1 + #rm -rf /var/lib/apt/lists/* + + + + apt-get install --no-install-recommends -y build-essential + + pip3 install --no-cache-dir MDAnalysis + + chmod -R 777 /labs/python/jupyter_notebook/cupy/RDF.dat + chmod -R 777 /labs/python/jupyter_notebook/cupy/Pair_entropy.dat + + cd / + rm -rf ${build_tmp} + +%files + ../_basic/python/ /labs/python + ../_basic/_common/ /labs/_common + ../_basic/_start_nways.ipynb /labs + +%runscript + "$@" + +%labels + AUTHOR Tosin, Mozhgan diff --git a/_basic/openacc/Presentations/README.md b/_basic/openacc/Presentations/README.md new file mode 100644 index 0000000..929ea89 --- /dev/null +++ b/_basic/openacc/Presentations/README.md @@ -0,0 +1,5 @@ +For Partners who are interested in delivering the critical hands-on skills 
needed to advance science in the form of a Bootcamp, please reach out to us via the [Open Hackathons Partner](https://www.openhackathons.org/s/about-open-hackathons) website. In addition to the current bootcamp material, Partners will be provided with the following: + +- Presentation: All the Bootcamps are accompanied by training material presentations which can be used during the Bootcamp session. +- Mini challenge: To test the knowledge gained during this Bootcamp, a mini application challenge is provided along with a sample solution. +- Additional Support: On a case-by-case basis, Partners can also be trained on how to effectively deliver the Bootcamp with maximal impact. \ No newline at end of file diff --git a/_basic/openacc/jupyter_notebook/nways_openacc.ipynb b/_basic/openacc/jupyter_notebook/nways_openacc.ipynb new file mode 100644 index 0000000..5f09810 --- /dev/null +++ b/_basic/openacc/jupyter_notebook/nways_openacc.ipynb @@ -0,0 +1,1283 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we begin, let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learning objectives\n", + "The **goal** of this lab is to:\n", + "\n", + "- Learn how to run the same code on both a multicore CPU and a GPU using the OpenACC programming model\n", + "- Understand the key directives and steps involved in making a sequential code parallel\n", + "- Learn how to interpret the compiler feedback\n", + "- Learn and understand the Nsight Systems profiler report\n", + "\n", + "We do not intend to cover:\n", + "- Optimization techniques in detail\n", + "\n", + "\n", + "# OpenACC Directives\n", + "Using OpenACC directives allows us to parallelize our code without explicitly altering it. What this means is that, by using OpenACC directives, we can have a single code that will function as both a sequential code and a parallel code.\n", + "\n", + "### OpenACC Syntax\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "```#pragma acc ```\n", + "
\n", + "
\n", + " \n", + "
\n", + " Fortran syntax\n", + " \n", + "```!$acc ```\n", + "
\n", + "
\n", + "\n", + "\n", + "**#pragma** in C/C++ and **!$acc** in Fortran are what's known as a \"compiler hint.\" These are very similar to programmer comments, however, the compiler will actually read our pragmas. Pragmas are a way for the programmer to \"guide\" the compiler, without running the chance damaging the code. If the compiler does not understand the pragma, it can ignore it, rather than throw a syntax error.\n", + "\n", + "**acc** specifies that this is an OpenACC related directive that will follow. Any non-OpenACC compiler will ignore this pragma. \n", + "\n", + "**directives** are commands in OpenACC that will tell the compiler to do some action. For now, we will only use directives that allow the compiler to parallelize our code.\n", + "\n", + "**clauses** are additions/alterations to our directives. These include (but are not limited to) optimizations. One way to think about it: directives describe a general action for our compiler to do (such as, paralellize our code), and clauses allow the programmer to be more specific (such as, how we specifically want the code to be parallelized).\n", + "\n", + "## 3 Key Directives\n", + "\n", + "OpenACC consists of 3 key types of directives responsible for **parallel execution**, **managing data movement** and **optimization** as shown in diagram below (example uses C/C++ syntax):\n", + "\n", + "\n", + "\n", + "We will be covering the parallel execution directive in this lab. The data directive is part of the additional section and can be tried out in the end.\n", + "\n", + "### Parallel and Loop Directives\n", + "\n", + "\n", + "There are three directives we will cover in this lab: `parallel`, `loop`, and `parallel loop`. Once we understand all three of them, you will be tasked with parallelizing **Pair Calculation** with your preferred directive \n", + "\n", + "The parallel directive may be the most straight-forward of the directives. It will mark a region of the code for parallelization (this usually only includes parallelizing a single **for** loop.) Let's take a look:\n", + "\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "#pragma acc parallel loop\n", + "for (int i = 0; i < N; i++ )\n", + "{\n", + " < loop code >\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "!$acc parallel loop\n", + " do i=1,N\n", + " < loop code >\n", + " enddo\n", + "```\n", + "
\n", + "
\n", + " \n", + " \n", + "\n", + "\n", + "We may also define a \"parallel region\". The parallel region may have multiple loops (though this is often not recommended!) The parallel region is everything contained within the outer-most curly braces.\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + " \n", + "```cpp\n", + "#pragma acc parallel\n", + "{\n", + " #pragma acc loop\n", + " for (int i = 0; i < N; i++ )\n", + " {\n", + " < loop code >\n", + " }\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "!$acc parallel\n", + " !$acc loop\n", + " do i=1,N\n", + " < loop code >\n", + " enddo\n", + "!$acc end parallel\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "`#pragma acc parallel loop` in C/C++ or `!$acc parallel loop` in Fortran will mark the next loop for parallelization. It is extremely important to include the `loop`, otherwise you will not be parallelizing the loop properly. The parallel directive tells the compiler to \"redundantly parallelize\" the code. The `loop` directive specifically tells the compiler that we want the loop parallelized. Let's look at an example of why the loop directive is so important. The `parallel` directive tells the compiler to create somewhere to run parallel code. OpenACC calls that somewhere a `gang`, which might be a thread on the CPU or maying a CUDA threadblock or OpenCL workgroup. It will choose how many gangs to create based on where you're running, only a few on a CPU (like 1 per CPU core) or lots on a GPU (1000's possibly). Gangs allow OpenACC code to scale from small CPUs to large GPUs because each one works completely independently of each other gang. That's why there's a space between gangs in the images below.\n", + "\n", + "\n", + "
\n", + " Example diagram (C/C++ syntax)\n", + "\n", + "\n", + "---\n", + "\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example diagram (Fortran syntax)\n", + "\n", + "\n", + "---\n", + "\n", + "\n", + "
\n", + "
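\n", + "To make the distinction concrete, here is a minimal sketch (C/C++ syntax) of what the diagrams above illustrate when `parallel` is used *without* `loop`:\n", + "\n", + "```cpp\n", + "#pragma acc parallel\n", + "{\n", + " // No loop directive: the compiler creates gangs, and every gang\n", + " // redundantly executes all N iterations of this loop.\n", + " for (int i = 0; i < N; i++ )\n", + " {\n", + " < loop code >\n", + " }\n", + "}\n", + "```\n", + "\n", + "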
\n", + " \n", + "\n", + "\n", + "There's a good chance that we don't want our loop to be run redundantly in every gang though; that seems wasteful and potentially dangerous. Instead, we want to instruct the compiler to break up the iterations of our loop and run them in parallel on the gangs. To do that, we can simply add a `loop` directive to the interesting loops. This instructs the compiler that we want the loop to be parallelized and promises to the compiler that it's safe to do so. Now that we have both `parallel` and `loop`, things look a lot better (and run a lot faster). Now the compiler is spreading our loop iterations across all of the gangs, but also running multiple iterations of the loop at the same time within each gang as a *vector*. Think of a vector like this: we have 10 numbers that we want to add to 10 other numbers (in pairs). Rather than looking up each pair of numbers, adding them together, storing the result, and then moving on to the next pair in order, modern computer hardware allows us to add all 10 pairs together at once, which is a lot more efficient. \n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "\n", + "
\n", + "
\n", + " \n", + "
\n", + " Fortran syntax\n", + " \n", + "\n", + "
\n", + "
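\n", + "If you want to be explicit about the levels of parallelism, OpenACC also lets you state them with clauses. This is a minimal sketch (C/C++ syntax); the compiler picks sensible values for the target if you omit these clauses:\n", + "\n", + "```cpp\n", + "// Distribute iterations across gangs, with 128 vector lanes per gang;\n", + "// without these clauses the compiler chooses values for the target.\n", + "#pragma acc parallel loop gang vector vector_length(128)\n", + "for (int i = 0; i < N; i++ )\n", + "{\n", + " < loop code >\n", + "}\n", + "```\n", + "\n", + "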
\n", + "\n", + "\n", + "The `acc parallel loop` directive is both a promise and a request to the compiler. The programmer is promising that the loop can safely be parallelized and requesting that the compiler do so in a way that makes sense for the machine being targeted. The compiler may make completely different decisions when compiling for a multicore CPU than it would for a GPU, and that's the idea. OpenACC enables programmers to parallelize their codes without having to worry about the details of how best to do so for every possible machine. \n", + "\n", + "\n", + "\n", + "### Atomic Construct\n", + "\n", + "In this code you will also need one more construct to get correct results. The OpenACC atomic construct ensures that a particular variable is accessed and/or updated atomically to prevent indeterminate results and race conditions. In other words, it prevents threads from stepping on each other's toes by accessing the same variable simultaneously, which would otherwise produce different results from run to run. For example, if we want to count the number of elements that have a value greater than zero, we could write the following:\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "if ( val > 0 )\n", + "{\n", + " #pragma acc atomic\n", + " cnt++;\n", + "}\n", + "```\n", + "
\n", + "
\n", + " \n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "if ( val > 0 ) then\n", + "!$acc atomic\n", + " cnt = cnt + 1\n", + "end if\n", + "```\n", + "
\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's start modifying the original code and add the OpenACC directives. Click on the [C/C++ version](../source_code/rdf.cpp) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after making changes, before running the cells below." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compile and Run for Multicore\n", + "\n", + "Now that we have added OpenACC directives, let us try to compile the code. For compiling we will be making use of these additional flags:\n", + "\n", + "**-Minfo** : This flag will give us feedback from the compiler about code optimizations and restrictions.\n", + "\n", + "**-Minfo=accel** will only give us feedback regarding our OpenACC parallelizations/optimizations.\n", + "\n", + "**-Minfo=all** will give us all possible feedback, including our parallelizations/optimizations, sequential code optimizations, and sequential code restrictions.\n", + "\n", + "**-ta** : This flag allows us to compile our code for specific target parallel hardware. Without this flag, the code will be compiled for sequential execution.\n", + "\n", + " -ta=multicore will allow us to compile our code for a multicore CPU.\n", + " \n", + " -ta=tesla will allow us to compile our code for an NVIDIA GPU.\n", + "\n", + "After running the cells, you can inspect part of the compiler feedback for the C or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for multicore (C/C++)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvc++ -acc -ta=multicore -Minfo=all -o rdf_c rdf.cpp -I/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/include -L/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/lib64 -lnvToolsExt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback (C/C++ version) \n", + "\n", + "You can see from *Line 177* that it is generating multicore code: `177, Generating Multicore code`. It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it. \n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for multicore (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -acc -ta=multicore -Minfo=all -o rdf_f rdf.f90 -lnvhpcwrapnvtx" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback (Fortran version)\n", + " \n", + "```\n", + "\trdf:\n", + " 97, Generating Multicore code\n", + " 98, !$acc loop gang\n", + " 99, Loop carried dependence of g prevents parallelization\n", + " Loop carried backward dependence of g prevents vectorization\n", + "```\n", + "\n", + "You can see from *Line 97* that it is generating multicore code: `97, Generating Multicore code`. It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it. \n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's run the executable and validate the output first. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the output (C/C++ version)\n", + "!cd ../source_code && ./rdf_c && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " C/C++ version\n", + " \n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + "```\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the output (Fortran version)\n", + "!cd ../source_code && ./rdf_f && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Fortran version\n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 : -2.452690945278331 \n", + "s2bond : -24.37502820694527 \n", + "```\n", + "
\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's profile the code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvtx (C/C++ version)\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_multicore_c ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvtx (Fortran version)\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_multicore_f ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check out the profiler's report. Download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_multicore_c.nsys-rep) or the [Fortran version](../source_code/rdf_multicore_f.nsys-rep) and choosing *Save Link As*. Once done, open it via the GUI. From the timeline view, check out the NVTX markers displayed as part of the threads. **Why are we using NVTX?** Please see the section on [Using NVIDIA Tools Extension (NVTX)](../../_common/jupyter_notebook/nsight_systems.ipynb#Using-NVIDIA-Tools-Extension-(NVTX)).\n", + "\n", + "From the timeline view, right click on the NVTX row and click \"Show in Events View\". Now you can see the NVTX statistics at the bottom of the window, which show the duration of each range. \n", + "\n", + "
\n", + " Example screenshot (C/C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
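\n", + "As a side note, the named NVTX ranges you see in the timeline and the events view come from NVTX annotations in the RDF source. A minimal sketch of how such a range is opened and closed with the NVTX C API (the library that the `-lnvToolsExt` flag links against):\n", + "\n", + "```cpp\n", + "#include \"nvToolsExt.h\"\n", + "\n", + "void compute()\n", + "{\n", + " // Open a named range that appears on the NVTX row of the timeline\n", + " nvtxRangePushA(\"Pair_Calculation\");\n", + " /* ... computation being profiled ... */\n", + " // Close the innermost open range\n", + " nvtxRangePop();\n", + "}\n", + "```\n", + "\n", + "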
\n", + "\n", + "\n", + "\n", + "You can also check out the NVTX statistics from the terminal console once the profiling session has ended. From the NVTX statistics, you can see that most of the execution time is spent in `Pair_Calculation`. This is a function worth checking out.\n", + "\n", + "You can also compare the NVTX ranges with the serial version (see [screenshot](../../_common/jupyter_notebook/rdf_overview.ipynb))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compile and Run on a GPU\n", + "\n", + "Without changing the code, let us now try to recompile it for an NVIDIA GPU and rerun. The only difference is that we now set **-ta=tesla:managed** instead of **-ta=multicore**. **Understand and analyze** the code present at [C/C++ version](../source_code/rdf.cpp) and/or the [Fortran version](../source_code/rdf.f90).\n", + "\n", + "Open the downloaded files for inspection. Once done, compile the code by running the below cell. View the compiler feedback (enabled by adding the `-Minfo=accel` flag) and investigate it for the OpenACC code. The compiler feedback provides useful information about applied optimizations.\n", + "\n", + "After running the cells, make sure to check the output first. You can inspect part of the compiler feedback for the C or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (C/C++)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvc++ -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf_c rdf.cpp " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback (C/C++ version) \n", + "\n", + "\n", + "- The `-ta=tesla:managed` option instructs the compiler to build for an NVIDIA Tesla GPU using \"CUDA Managed Memory\".\n", + "- The `-Minfo` command-line option shows us output from the compiler. In this example, we use `-Minfo=accel` to only see the output corresponding to the accelerator (in this case an NVIDIA GPU).\n", + "- The first line of the output, `round(float)`, tells us which function the following information is in reference to.\n", + "- The line starting with 157 shows that the function is built for the GPU and will be called by each thread sequentially. When `#pragma acc routine` is used, the compiler generates a device copy of the function.\n", + "- The line starting with 177 shows we created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA terminology) and vector parallelism (threads in CUDA terminology) with the vector size being 128 per gang: `179, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */`\n", + "- The rest of the information concerns data movement. The compiler detected a possible need to move data and handled it for us. We will get into this later in this lab.\n", + "
\n", + "
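\n", + "\n", + "As an aside, the `routine` directive mentioned above is what allows a function called inside a parallel loop to execute on the device. A minimal sketch (with a hypothetical function) looks like this:\n", + "\n", + "```cpp\n", + "// Ask the compiler to also generate a device version of this function\n", + "// so it can be called from within OpenACC compute regions; seq means\n", + "// each calling thread runs it sequentially.\n", + "#pragma acc routine seq\n", + "float square(float x)\n", + "{\n", + " return x * x;\n", + "}\n", + "```\n", + "\n", + "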
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf_f rdf.f90 -lnvhpcwrapnvtx" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback: OpenACC Fortran code\n", + " \n", + "```\n", + "\trdf:\n", + " 97, Generating Tesla code\n", + " 98, !$acc loop gang, vector(128) ! blockidx%x threadidx%x\n", + " 99, !$acc loop seq\n", + " 97, Generating implicit copyin(y(iconf,1:natoms),z(iconf,1:natoms),x(iconf,1:natoms)) [if not already present]\n", + " Generating implicit copy(g(:)) [if not already present]\n", + " 99, Complex loop carried dependence of g prevents parallelization\n", + " Loop carried dependence of g prevents parallelization\n", + " Loop carried backward dependence of g prevents vectorization \n", + "```\n", + "\n", + "- The `-ta=tesla:managed` option instructs the compiler to build for an NVIDIA Tesla GPU using \"CUDA Managed Memory\".\n", + "- The `-Minfo` command-line option shows us output from the compiler. In this example, we use `-Minfo=accel` to only see the output corresponding to the accelerator (in this case an NVIDIA GPU).\n", + "- The line starting with 97 shows we created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA terminology) and vector parallelism (threads in CUDA terminology) with the vector size being 128 per gang: `98, !$acc loop gang, vector(128) ! blockidx%x threadidx%x`\n", + "- The rest of the information concerns data movement. The compiler detected a possible need to move data and handled it for us. We will get into this later in this lab." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it. Now, let's profile the code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvtx (C/C++ version)\n", + "!cd ../source_code && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_gpu_c ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvtx (Fortran version)\n", + "!cd ../source_code && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_gpu_f ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check out the profiler's report. Download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_gpu_c.nsys-rep) or the [Fortran version](../source_code/rdf_gpu_f.nsys-rep) and choosing *Save Link As*. Once done, open it via the GUI. \n", + "\n", + "From the \"timeline view\" on the top pane, double click on \"CUDA\" in the function table on the left and expand it. Zoom in on the timeline and you can see a pattern similar to the screenshot below. The blue boxes are the compute kernels, and each of these groupings of kernels is surrounded by green and red/purple boxes (annotated with purple color) representing data movements.\n", + "\n", + "
\n", + " Example screenshot (C/C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "Hover your mouse over the CUDA row (underlined in blue in the below screenshot) and expand it until you see both the kernels and memory rows. In the below screenshot you can see the NVTX ranges in the \"Events View\" at the bottom of the timeline view window. You can right click on each row from the function table on the left (top window), click on \"Show in Events View\", and check out the details related to that row (similar to the NVTX example in the below screenshot).\n", + "\n", + "
\n", + " Example screenshot (C/C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "Nsight Systems captures information about OpenACC execution in the profiled process. In the timeline tree, each thread that uses OpenACC shows the OpenACC trace information. To view this, click on an OpenACC API call to see the correlation with the underlying CUDA API calls. If the OpenACC API call results in GPU work, that will also be highlighted.\n", + "\n", + "
\n", + " Example screenshot (C/C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "Moreover, if you hover over a particular OpenACC construct, you can see details about that construct.\n", + "\n", + "
\n", + " Example screenshot (C/C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "Feel free to check out the solutions for the [C/C++](../source_code/SOLUTION/rdf_parallel_directive.cpp) and [Fortran](../source_code/SOLUTION/rdf_parallel_directive.f90) versions to help you understand better." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## OpenACC Analysis\n", + "\n", + "**Usage Scenarios**\n", + "\n", + "There are multiple reasons to make use of directive-based programming, but from an application developer's point of view the key motivation is that it keeps the code readable/maintainable. Below are some usage scenarios under which OpenACC can be preferred:\n", + "- Legacy codes with a sizeable codebase need to be ported to GPUs with minimal changes to the sequential code.\n", + "- Developers want to see if the code structure favors the GPU SIMD/SIMT style, or, as we say, to test the waters before moving a large piece of code to a GPU.\n", + "- Portable performance is an important feature of the directive-based programming approach, and the OpenACC specification has rich features to achieve it for target accelerators like GPUs.\n", + "\n", + "Applications like Ansys Fluent, Gaussian, and VASP make use of OpenACC for adding parallelism. These applications are listed among the top 5 applications that consume the most compute clock cycles on supercomputers worldwide, according to a report by [Intersect 360](http://www.intersect360.com/features-1/new-reports-on-gpu-and-accelerated-computing-from-intersect360).\n", + "\n", + "**Limitations/Constraints**\n", + "\n", + "A directive-based programming model like OpenACC depends on a compiler to understand and convert your sequential code to CUDA constructs. While OpenACC compilers have evolved over time, they cannot match the best performance that using CUDA C constructs directly can give. Minimizing register pressure and using specialized memory such as texture memory are some examples. \n", + "\n", + "It is key to understand that OpenACC is not an alternative to CUDA. In fact, OpenACC can be seen as the first step in GPU porting, with the opportunity to port only the most critical kernels to CUDA. Developers can make use of interoperability techniques for combining OpenACC and CUDA in their codes. For more details you can refer to [Interoperability](https://devblogs.nvidia.com/3-versatile-openacc-interoperability-techniques/).\n", + "\n", + "**Compiler Support for OpenACC**\n", + "\n", + "At the time of writing, here are the compilers that support OpenACC:\n", + "\n", + "| Compiler | Latest Version | Maintained by | Full or Partial Support |\n", + "| --- | --- | --- | --- |\n", + "| HPC SDK | 22.11 | NVIDIA HPC SDK | Full 2.6 spec, Partial 2.7 spec |\n", + "| GCC | 12 | Mentor Graphics, SUSE | 2.6 spec, Limited Kernel directive support, No Unified Memory |\n", + "| CCE | latest | Cray | 2.7 Spec | \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Optional Exercise\n", + "\n", + "### Kernel Directive \n", + "\n", + "The parallel directive leaves a lot of decisions up to the programmer. The programmer decides what is and isn't parallelizable, and will also have to provide all of the optimizations - the compiler assumes nothing. If any mistakes happen while parallelizing the code (ignoring data races, etc.), it is up to the programmer to identify and correct them.\n", + "\n", + "Another directive, \"kernels\", is the exact opposite in all of these regards. The key difference between the two is as follows:\n", + "\n", + "The **parallel directive** gives a lot of control to the programmer. The programmer decides what to parallelize and how it will be parallelized. Any mistakes made in the parallelization are the fault of the programmer. It is recommended to use a parallel directive for each loop you want to parallelize.\n", + "\n", + "The **kernels directive** leaves the majority of the control to the compiler. The compiler will analyze the loops and decide which ones to parallelize. It may refuse to parallelize certain loops, but the programmer can override this decision. You may use the kernels directive to parallelize large portions of code, and these portions may include multiple loops.\n", + "We do not plan to cover this directive in detail in the current lab.\n", + "\n", + "Use the kernels directive and observe any performance difference between the **parallel** and **kernels** directives.\n", + "Sample usage of the kernels directive is given as follows:\n", + "\n", + "\n", + "
\n", + " Example OpenACC C/C++ code\n", + " \n", + "```cpp\n", + "#pragma acc kernels\n", + "for (int i = 0; i < N; i++ )\n", + "{\n", + " for (int j = 0; j < N; j++ )\n", + " {\n", + " < loop code >\n", + " }\n", + "} \n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example OpenACC Fortran code\n", + " \n", + "```fortran\n", + "!$acc kernels\n", + " do i=1,N\n", + " < loop code >\n", + " enddo\n", + "!$acc end kernels\n", + "```\n", + "
\n", + "
\n", + "\n", + "Now, let's start modifying the original code and add the OpenACC directives. Click on the [C/C++ version](../source_code/rdf.cpp) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after making changes, before running the cells below.\n", + "\n", + "\n", + "After running the cells, make sure to check the output first. You can inspect part of the compiler feedback for the C or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (C/C++)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvc++ -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf_c rdf.cpp && echo \"Running the executable and validating the output\" && ./rdf_c && cat Pair_entropy.dat " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback: OpenACC C/C++ code\n", + "\n", + "The output should be the following:\n", + "\n", + " \n", + "```\n", + " \n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + " \n", + "```\n", + "\n", + " \n", + "If you only replaced the parallel directive with kernels (meaning only wrapping the loop with `#pragma acc kernels`), then the compiler feedback will look similar to the below:\n", + "\n", + " \n", + "\n", + "\n", + "The line starting with 179 shows that a serial kernel was created and the following loops will run serially. When we use the kernels directive, we let the compiler make decisions for us. In this case, the compiler thinks the loops are not safe to parallelize due to a dependency.\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf_f rdf.f90 -lnvhpcwrapnvtx && echo \"Running the executable and validating the output\" && ./rdf_f && cat Pair_entropy.dat " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback: OpenACC Fortran code\n", + "\n", + "The output should be the following:\n", + "\n", + "```\n", + " \n", + "s2 : -2.452690945278331 \n", + "s2bond : -24.37502820694527 \n", + " \n", + "```\n", + "\n", + "If you only replaced the parallel directive with kernels (meaning only wrapping the loop with `!$acc kernels`), then the compiler feedback will look similar to the below:\n", + "\n", + "```\n", + "rdf:\n", + " 97, Generating implicit copyin(y(iconf,:),z(iconf,:),x(iconf,:)) [if not already present]\n", + " Generating implicit copy(g(:)) [if not already present]\n", + " 99, Loop carried dependence due to exposed use of g(:) prevents parallelization\n", + " Accelerator serial kernel generated\n", + " Generating Tesla code\n", + " 99, !$acc loop seq\n", + " 101, !$acc loop seq\n", + " 101, Loop carried dependence due to exposed use of g(:) prevents parallelization\n", + "```\n", + "\n", + "The line starting with 99 shows that a serial kernel was created and the following loops will run serially. When we use the kernels directive, we let the compiler make decisions for us. In this case, the compiler thinks the loops are not safe to parallelize due to a dependency.\n", + "
\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### OpenACC Independent Clause\n", + "\n", + "In cases like this, we need to inform the compiler that the loop is safe to parallelize so that it can generate kernels. To specify that the loop iterations are data independent, we need to override the compiler's dependency analysis (Note: this is implied for *parallel loop*).\n", + "\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "#pragma acc kernels\n", + "for (int i = 0; i < N; i++ )\n", + "{\n", + " #pragma acc loop independent\n", + " for (int j = 0; j < N; j++ )\n", + " {\n", + " < loop code >\n", + " }\n", + "} \n", + "```\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "\n", + "!$acc kernels\n", + " do i=1,N\n", + " !$acc loop independent\n", + " do j=1,N\n", + " < loop code >\n", + " end do\n", + " enddo\n", + "!$acc end kernels\n", + "```\n", + "
\n", + "
\n", + "\n", + "Now, let's start modifying the original code and add the OpenACC directives. Click on the [C/C++ version](../source_code/rdf.cpp) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after making changes, before running the cells below.\n", + " \n", + "After running the cells, you can inspect part of the compiler feedback for the C or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (C/C++)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvc++ -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf_c rdf.cpp && echo \"Running the executable and validating the output\" && ./rdf_c && cat Pair_entropy.dat " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
\n", + " Compiler Feedback: OpenACC C/C++ code\n", + "\n", + "\n", + "\n", + "We can see that the compiler knows that the loop is parallelizable (`182, Loop is parallelizable`). Note that the loop is parallelized using vector(128), which means that the compiler generated instructions for chunks of data of length 128 (the vector size being 128 per gang): `182, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */`\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -acc -ta=tesla:managed,lineinfo -Minfo=accel -o rdf_f rdf.f90 -lnvhpcwrapnvtx && echo \"Running the executable and validating the output\" && ./rdf_f && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
\n", + " Compiler Feedback: OpenACC Fortran code\n", + "\n", + "```\n", + "rdf:\n", + " 97, Generating implicit copyin(y(iconf,:),z(iconf,:),x(iconf,:)) [if not already present]\n", + " Generating implicit copy(g(:)) [if not already present]\n", + " 99, Loop is parallelizable\n", + " 101, Loop is parallelizable\n", + " Generating Tesla code\n", + " 99, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x\n", + " 101, ! blockidx%x threadidx%x auto-collapsed\n", + "```\n", + "\n", + "We can see that the compiler knows that the loop is parallelizable (`99, Loop is parallelizable`). Note that the loop is parallelized using vector(128), which means that the compiler generated instructions for chunks of data of length 128 (the vector size being 128 per gang): `99, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x`\n", + "
\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's profile the code now." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvtx (C/C++ version)\n", + "!cd ../source_code && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_kernel_c ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvtx (Fortran version)\n", + "!cd ../source_code && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_kernel_f ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check out the profiler's report. Download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_kernel_c.nsys-rep) or the [Fortran version](../source_code/rdf_kernel_f.nsys-rep) and choosing *Save Link As*. Once done, open it via the GUI. \n", + " \n", + "Check out the OpenACC row and hover over the OpenACC constructs to see if the detail looks different from when you use the parallel directive. Compare the profiler report with the previous section.\n", + "\n", + "Feel free to check out the solutions for the [C/C++](../source_code/SOLUTION/rdf_kernel_directive.cpp) and [Fortran](../source_code/SOLUTION/rdf_kernel_directive.f90) versions to help you understand better." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Directive \n", + "\n", + "In this lab so far, we added OpenACC parallel and loop directives and relied on a feature called [CUDA Managed Memory](../../_common/jupyter_notebook/GPU_Architecture_Terminologies.ipynb) to deal with the separate CPU & GPU memories for us. Just by adding OpenACC to our loops, we achieved a considerable performance boost. However, managed memory is not compatible with all GPUs or all compilers, and it sometimes performs worse than programmer-defined memory management. Also, when programming for a GPU, depending on the application type, handling data management between the CPU and GPU explicitly may result in better performance.\n", + "\n", + "Let's inspect the profiler report from the previous section when we used managed memory with parallel directives. From the \"timeline view\" on the top pane, double click on \"CUDA\" in the function table on the left and expand it. Zoom in on the timeline and you can see a pattern similar to the screenshot below. The blue boxes are the compute kernels and each of these groupings of kernels is surrounded by purple and teal boxes (annotated with green color) representing data movements.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "What this graph is showing us is that we're doing a lot of data movement between the GPU and CPU. The compiler feedback we collected earlier tells us quite a bit about data movement too. If we look again at the compiler feedback from earlier, we see the following.\n", + "\n", + "\n", + "\n", + "The compiler feedback is telling us that the compiler has inserted data movement around our parallel region at line 177, which copies the `d_g2` array in and out of the GPU memory and also copies `d_x`, `d_y` and `d_z` to the GPU memory. \n", + "\n", + "The compiler can only work with the information we provide. 
It knows we need all those arrays on the GPU for the accelerated section within the `pair_gpu` function, but we didn't tell the compiler anything about what happens to the data outside of those sections. Without this knowledge, the compiler has to copy the full arrays to the GPU and back to the CPU for each accelerated section. This is a lot of unnecessary data transfer. \n", + "\n", + "Ideally, we would want to move the data to the GPU at the beginning, and only transfer it back to the CPU at the end (if needed). If we do not need to copy any data back to the CPU, then we only need to create space on the device (GPU) for an array. \n", + "\n", + "We need to give the compiler information about how to reduce the extra and unnecessary data movement. By adding the OpenACC `data` directive to a structured code block, the compiler will know how to manage data according to the clauses. The following sections explain how to use data clauses in your program. For information on the data directive clauses, please visit the [OpenACC 3.0 Specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf).\n", + "\n", + "\n", + "**Using OpenACC Data Clauses**\n", + "\n", + "Data clauses allow the programmer to specify data transfers between the host and device (or in our case, the CPU and the GPU). Let's look at an example where we do not use a data clause.\n", + "\n", + "\n", + "
\n", + " Example OpenACC C/C++ code\n", + " \n", + "```cpp\n", + "int *A = (int*) malloc(N * sizeof(int));\n", + "\n", + "#pragma acc parallel loop\n", + "for( int i = 0; i < N; i++ )\n", + "{\n", + " A[i] = 0;\n", + "} \n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example OpenACC Fortran code\n", + " \n", + "```fortran\n", + "allocate(A(N))\n", + "\n", + " !$acc parallel loop\n", + " do i=1,100\n", + " A(i) = 0\n", + " enddo\n", + "```\n", + "
\n", + "
\n", + "\n", + "We have allocated an array A outside of our parallel region. This means that A is allocated in the CPU memory. However, we access A inside of our loop, and that loop is contained within a parallel region. Within that parallel region, A[i] is attempting to access a memory location within the GPU memory. We didn't explicitly allocate A on the GPU, so one of two things will happen.\n", + "\n", + "1. The compiler will understand what we are trying to do, and automatically copy A from the CPU to the GPU.\n", + "2. The program will check for an array A in GPU memory, it won't find it, and it will throw an error.\n", + "\n", + "Instead of hoping that we have a compiler that can figure this out, we could instead use a data clause.\n", + "\n", + "
\n", + " Example OpenACC C/C++ code\n", + " \n", + "```cpp\n", + "int *A = (int*) malloc(N * sizeof(int));\n", + "\n", + "#pragma acc parallel loop copy(A[0:N])\n", + "for( int i = 0; i < N; i++ )\n", + "{\n", + " A[i] = 0;\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example OpenACC Fortran code\n", + " \n", + "```fortran\n", + "allocate(A(N))\n", + "\n", + " !$acc parallel loop copy(A(1:N))\n", + " do i=1,100\n", + " A(i) = 0\n", + " enddo\n", + "```\n", + "
\n", + "
\n", + " \n", + "The image below offers a step-by-step example of using the copy clause.\n", + "\n", + "\n", + "\n", + "Of course, we might not want to copy our data both to and from the GPU memory. Maybe we only need the array's values as inputs to the GPU region, or maybe it's only the final results we care about, or perhaps the array is only used temporarily on the GPU and we don't want to copy it in either direction. The following OpenACC data clauses provide a bit more control than just the `copy` clause.\n", + "\n", + "* `copyin` - Create space for the array and copy the input values of the array to the device. At the end of the region, the array is deleted without copying anything back to the host.\n", + "* `copyout` - Create space for the array on the device, but don't initialize it to anything. At the end of the region, copy the results back and then delete the device array.\n", + "* `create` - Create space for the array on the device, but do not copy anything to the device at the beginning of the region, nor back to the host at the end. The array will be deleted from the device at the end of the region.\n", + "* `present` - Don't do anything with these variables. I've put them on the device somewhere else, so just assume they're available.\n", + "\n", + "You may also use them to operate on multiple arrays at once, by including those arrays as a comma separated list.\n", + "\n", + "
\n", + " Example OpenACC C/C++ code\n", + " \n", + "```cpp\n", + "#pragma acc parallel loop copy( A[0:N], B[0:M], C[0:Q] )\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example OpenACC Fortran code\n", + " \n", + "```fortran\n", + "!$acc parallel loop copy( A(1:N), B(1:M), C(1:Q) )\n", + "```\n", + "
\n", + "
\n", + "\n", + "You may also use more than one data clause at a time.\n", + "\n", + "
\n", + " Example OpenACC C/C++ code\n", + " \n", + "```cpp\n", + "#pragma acc parallel loop create( A[0:N] ) copyin( B[0:M] ) copyout( C[0:Q] )\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example OpenACC Fortran code\n", + " \n", + "```fortran\n", + "!$acc parallel loop create( A(1:N) ) copyin( B(1:M) ) copyout( C(1:Q) )\n", + "``` \n", + "
\n", + "
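\n", + "The same clauses can also be placed on a structured `data` region that encloses several compute regions, so each array moves once instead of at every loop. A minimal sketch (with hypothetical arrays `A` and `B`):\n", + "\n", + "```cpp\n", + "// A is copied in once and B is copied out once; between the braces\n", + "// both arrays stay resident on the device across the two loops.\n", + "#pragma acc data copyin(A[0:N]) copyout(B[0:N])\n", + "{\n", + " #pragma acc parallel loop\n", + " for (int i = 0; i < N; i++ )\n", + " {\n", + " B[i] = 2 * A[i];\n", + " }\n", + "\n", + " #pragma acc parallel loop\n", + " for (int i = 0; i < N; i++ )\n", + " {\n", + " B[i] += A[i];\n", + " }\n", + "}\n", + "```\n", + "\n", + "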
\n", + "\n", + "Let us try adding data clauses to our code and observe any performance differences between the two approaches. **Note: We have removed the `managed` option in order to handle data management explicitly.**\n", + "\n", + " \n", + "Now, let's start modifying the original code and add the OpenACC directives. Click on the [C/C++ version](../source_code/rdf.cpp) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after making changes, before running the cells below.\n", + " \n", + "\n", + "After running the cells, make sure to check the output first. You can inspect part of the compiler feedback for the C or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU without managed memory (C/C++)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvc++ -acc -ta=tesla,lineinfo -Minfo=accel -o rdf_c rdf.cpp && echo \"Running the executable and validating the output\" && ./rdf_c && cat Pair_entropy.dat " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback: OpenACC C/C++ code\n", + "\n", + "\n", + "\n", + "You can see that on line 182, the compiler is generating a default present check for the `d_g2`, `d_x`, `d_z`, and `d_y` arrays. In other words, it assumes the data is already present on the GPU and copies data to the GPU only if it is not.\n", + "\n", + "
\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU without managed memory (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -acc -ta=tesla,lineinfo -Minfo=accel -o rdf_f rdf.f90 -lnvhpcwrapnvtx && echo \"Running the executable and validating the output\" && ./rdf_f && cat Pair_entropy.dat " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback: OpenACC Fortran code\n", + "\n", + "\n", + "```\n", + "rdf:\n", + " 95, Generating copy(g(:)) [if not already present]\n", + " Generating copyin(y(y$sd8:(y$sd8-1)+y$sd8,y$sd8:(y$sd8-1)+y$sd8),z(z$sd7:(z$sd7-1)+z$sd7,z$sd7:(z$sd7-1)+z$sd7),x(x$sd9:(x$sd9-1)+x$sd9,x$sd9:(x$sd9-1)+x$sd9)) [if not already present]\n", + " 98, Generating Tesla code\n", + " 99, !$acc loop gang, vector(128) ! blockidx%x threadidx%x\n", + " 100, !$acc loop seq\n", + " 100, Loop carried dependence of g prevents parallelization\n", + " Loop carried backward dependence of g prevents vectorization\n", + "```\n", + "\n", + "You can see that on line 95, the compiler generates the data movement for the `g`, `x`, `z`, and `y` arrays with an \"if not already present\" check. In other words, it assumes the data may already be present on the GPU and copies it there only if it is not. Another key observation is that the *implicit copy* operations are gone now that we have added explicit data clauses. The data sizes here are calculated automatically by the compiler, though we can additionally give them to the compiler if needed. \n", + "
\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's profile the code now." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output without managed memory (C/C++)\n", + "!cd ../source_code && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_no_managed_c ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output without managed memory (Fortran)\n", + "!cd ../source_code && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o rdf_no_managed_f ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_no_managed_c.nsys-rep) or the [Fortran version](../source_code/rdf_no_managed_f.nsys-rep) and choosing *Save Link As*. Once done, open it via the GUI. Have a look at the example expected profiler report below:\n", + "\n", + "\n", + "
\n", + " Example screenshot (C/C++ code)\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Example screenshot (Fortran code)\n", + " \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "Have a look at the data movements annotated with green color and compare them with the previous versions. We have accelerated the application and reduced the execution time by eliminating the unnecessary data transfers between the CPU and GPU.\n", + "\n", + "Feel free to check out the solutions for the [C/C++](../source_code/SOLUTION/rdf_data_directive.cpp) and [Fortran](../source_code/SOLUTION/rdf_data_directive.f90) versions to help you understand better." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, it is recommended you go to your browser's File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd ..\n", + "rm -f _files.zip\n", + "zip -r _files.zip *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../_files.zip) and choosing *Save Link As*.\n", + "\n", + "\n", + "\n", + "# Links and Resources\n", + "[OpenACC API guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)\n", + "\n", + "[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)\n", + "\n", + "[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out the additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/openacc/source_code/Makefile b/_basic/openacc/source_code/Makefile new file mode 100644 index 0000000..34a006f --- /dev/null +++ b/_basic/openacc/source_code/Makefile @@ -0,0 +1,12 @@ +# Copyright (c) 2020 NVIDIA Corporation. All rights reserved. 
+ +FC := nvfortran +FLAGS := -O3 -w +ACCFLAGS := -Minfo=accel +NVTXLIB := -lnvhpcwrapnvtx + +rdf: rdf.f90 + ${FC} ${FLAGS} ${ACCFLAGS} -o rdf rdf.f90 ${NVTXLIB} + +clean: + rm -f *.o rdf diff --git a/_basic/openacc/source_code/Pair_entropy.dat b/_basic/openacc/source_code/Pair_entropy.dat new file mode 100644 index 0000000..0998d26 --- /dev/null +++ b/_basic/openacc/source_code/Pair_entropy.dat @@ -0,0 +1,2 @@ +s2 value is -2.43191 +s2bond value is -3.87014 diff --git a/_basic/openacc/source_code/RDF.dat b/_basic/openacc/source_code/RDF.dat new file mode 100644 index 0000000..ea809b6 --- /dev/null +++ b/_basic/openacc/source_code/RDF.dat @@ -0,0 +1,2000 @@ +0.00761401 2.27364e+06 +0.022842 0 +0.03807 0 +0.0532981 0 +0.0685261 0 +0.0837541 0 +0.0989821 0 +0.11421 0 +0.129438 0 +0.144666 0 +0.159894 0 +0.175122 0 +0.19035 0 +0.205578 0 +0.220806 0 +0.236034 0 +0.251262 0 +0.26649 0 +0.281718 0 +0.296946 0 +0.312174 0 +0.327402 0 +0.34263 0 +0.357858 0 +0.373086 0 +0.388314 0 +0.403543 0 +0.418771 0 +0.433999 0 +0.449227 0 +0.464455 0 +0.479683 0 +0.494911 0 +0.510139 0 +0.525367 0 +0.540595 0 +0.555823 0 +0.571051 0 +0.586279 0 +0.601507 0 +0.616735 0 +0.631963 0 +0.647191 0 +0.662419 0 +0.677647 0 +0.692875 0 +0.708103 0 +0.723331 0 +0.738559 0 +0.753787 0 +0.769015 0 +0.784243 0 +0.799471 0 +0.814699 0 +0.829927 0 +0.845155 0 +0.860383 0 +0.875611 0 +0.890839 0 +0.906067 0 +0.921295 0 +0.936523 0 +0.951751 0 +0.966979 0 +0.982207 0 +0.997435 0 +1.01266 0 +1.02789 0 +1.04312 0 +1.05835 0 +1.07358 0 +1.0888 0 +1.10403 0 +1.11926 0 +1.13449 0 +1.14972 0 +1.16494 0 +1.18017 0 +1.1954 0 +1.21063 0 +1.22586 0 +1.24108 0 +1.25631 0 +1.27154 0 +1.28677 0 +1.302 0 +1.31722 0 +1.33245 0 +1.34768 0 +1.36291 0 +1.37814 0 +1.39336 0 +1.40859 0.015817 +1.42382 0.0490216 +1.43905 0.272783 +1.45428 0.976895 +1.4695 2.98168 +1.48473 6.70771 +1.49996 12.3749 +1.51519 18.7254 +1.53042 23.1225 +1.54564 22.5968 +1.56087 18.7488 +1.5761 12.2483 +1.59133 6.82648 +1.60656 2.88983 +1.62178 1.03012 +1.63701 0.322052 +1.65224 0.0498164 +1.66747 0.0112871 +1.6827 0.00184729 +1.69792 0 +1.71315 0 +1.72838 0 +1.74361 0 +1.75884 0 +1.77406 0 +1.78929 0 +1.80452 0 +1.81975 0 +1.83498 0 +1.8502 0 +1.86543 0 +1.88066 0 +1.89589 0 +1.91112 0 +1.92634 0 +1.94157 0 +1.9568 0 +1.97203 0 +1.98726 0 +2.00248 0 +2.01771 0 +2.03294 0 +2.04817 0 +2.0634 0 +2.07862 0 +2.09385 0 +2.10908 0 +2.12431 0 +2.13954 0 +2.15476 0 +2.16999 0 +2.18522 0 +2.20045 0 +2.21568 0 +2.2309 0 +2.24613 0 +2.26136 0 +2.27659 0.0030276 +2.29182 0.00199167 +2.30704 0.00294819 +2.32227 0.00969885 +2.3375 0.0124448 +2.35273 0.0349627 +2.36796 0.0522381 +2.38319 0.0976197 +2.39841 0.160034 +2.41364 0.269354 +2.42887 0.425579 +2.4441 0.670716 +2.45933 0.911498 +2.47455 1.21124 +2.48978 1.65042 +2.50501 2.03301 +2.52024 2.55944 +2.53547 2.96246 +2.55069 3.34284 +2.56592 3.37557 +2.58115 3.46854 +2.59638 3.22468 +2.61161 2.9924 +2.62683 2.6076 +2.64206 2.14228 +2.65729 1.75705 +2.67252 1.3138 +2.68775 0.920997 +2.70297 0.61569 +2.7182 0.445989 +2.73343 0.277221 +2.74866 0.151618 +2.76389 0.087643 +2.77911 0.0650139 +2.79434 0.0515797 +2.80957 0.0404201 +2.8248 0.044574 +2.84003 0.0460427 +2.85525 0.05646 +2.87048 0.0628453 +2.88571 0.0722337 +2.90094 0.0795574 +2.91617 0.0897999 +2.93139 0.0967824 +2.94662 0.110243 +2.96185 0.114478 +2.97708 0.122753 +2.99231 0.141952 +3.00753 0.146879 +3.02276 0.174026 +3.03799 0.179086 +3.05322 0.177865 +3.06845 0.187215 +3.08367 0.187021 +3.0989 0.199348 +3.11413 0.201179 +3.12936 0.224864 +3.14459 0.234857 +3.15981 0.195928 
+3.17504 0.200279 +3.19027 0.21122 +3.2055 0.206672 +3.22073 0.18657 +3.23595 0.185318 +3.25118 0.179132 +3.26641 0.175995 +3.28164 0.165137 +3.29687 0.157359 +3.31209 0.14924 +3.32732 0.152602 +3.34255 0.140915 +3.35778 0.134073 +3.37301 0.14206 +3.38823 0.123472 +3.40346 0.13682 +3.41869 0.12889 +3.43392 0.132186 +3.44915 0.128383 +3.46437 0.140767 +3.4796 0.150338 +3.49483 0.166589 +3.51006 0.163873 +3.52529 0.16709 +3.54051 0.18944 +3.55574 0.199404 +3.57097 0.220267 +3.5862 0.226128 +3.60143 0.256885 +3.61665 0.285117 +3.63188 0.294627 +3.64711 0.314193 +3.66234 0.337714 +3.67757 0.317906 +3.69279 0.367454 +3.70802 0.402104 +3.72325 0.410141 +3.73848 0.459575 +3.75371 0.471445 +3.76893 0.532818 +3.78416 0.551185 +3.79939 0.61236 +3.81462 0.708128 +3.82985 0.8209 +3.84507 0.899673 +3.8603 1.03509 +3.87553 1.12309 +3.89076 1.25771 +3.90599 1.32643 +3.92122 1.42024 +3.93644 1.44843 +3.95167 1.43294 +3.9669 1.43725 +3.98213 1.39131 +3.99736 1.28907 +4.01258 1.21239 +4.02781 1.13327 +4.04304 1.04412 +4.05827 0.933079 +4.0735 0.908779 +4.08872 0.886065 +4.10395 0.854347 +4.11918 0.791012 +4.13441 0.809676 +4.14964 0.824096 +4.16486 0.817478 +4.18009 0.852244 +4.19532 0.826752 +4.21055 0.882149 +4.22578 0.871116 +4.241 0.886682 +4.25623 0.897962 +4.27146 0.886124 +4.28669 0.937907 +4.30192 0.93891 +4.31714 0.948575 +4.33237 0.958081 +4.3476 0.954426 +4.36283 1.01977 +4.37806 1.0075 +4.39328 1.02357 +4.40851 1.04907 +4.42374 1.08543 +4.43897 1.05384 +4.4542 1.09278 +4.46942 1.08954 +4.48465 1.10686 +4.49988 1.15182 +4.51511 1.14201 +4.53034 1.15983 +4.54556 1.21004 +4.56079 1.16451 +4.57602 1.14853 +4.59125 1.17492 +4.60648 1.20512 +4.6217 1.18421 +4.63693 1.16939 +4.65216 1.17456 +4.66739 1.20412 +4.68262 1.20966 +4.69784 1.19235 +4.71307 1.16135 +4.7283 1.20676 +4.74353 1.18414 +4.75876 1.14747 +4.77398 1.20006 +4.78921 1.18652 +4.80444 1.14887 +4.81967 1.12856 +4.8349 1.16577 +4.85012 1.17247 +4.86535 1.18259 +4.88058 1.16666 +4.89581 1.18255 +4.91104 1.15506 +4.92626 1.17788 +4.94149 1.18884 +4.95672 1.21348 +4.97195 1.25832 +4.98718 1.23719 +5.0024 1.28005 +5.01763 1.30657 +5.03286 1.30136 +5.04809 1.32287 +5.06332 1.35981 +5.07854 1.37905 +5.09377 1.37706 +5.109 1.37668 +5.12423 1.40795 +5.13946 1.37903 +5.15468 1.38684 +5.16991 1.39805 +5.18514 1.37254 +5.20037 1.37805 +5.2156 1.31291 +5.23082 1.33662 +5.24605 1.29124 +5.26128 1.26413 +5.27651 1.24613 +5.29174 1.22851 +5.30696 1.22742 +5.32219 1.24201 +5.33742 1.21069 +5.35265 1.19761 +5.36788 1.1861 +5.3831 1.21622 +5.39833 1.21314 +5.41356 1.17866 +5.42879 1.20542 +5.44402 1.20081 +5.45925 1.17165 +5.47447 1.18033 +5.4897 1.18507 +5.50493 1.17473 +5.52016 1.17255 +5.53539 1.18727 +5.55061 1.17567 +5.56584 1.17516 +5.58107 1.16607 +5.5963 1.15188 +5.61153 1.15743 +5.62675 1.15927 +5.64198 1.15631 +5.65721 1.13522 +5.67244 1.16262 +5.68767 1.14039 +5.70289 1.14959 +5.71812 1.14235 +5.73335 1.11688 +5.74858 1.14009 +5.76381 1.12526 +5.77903 1.12106 +5.79426 1.09929 +5.80949 1.1143 +5.82472 1.09692 +5.83995 1.1162 +5.85517 1.12154 +5.8704 1.10374 +5.88563 1.09139 +5.90086 1.09283 +5.91609 1.07704 +5.93131 1.09308 +5.94654 1.08128 +5.96177 1.08003 +5.977 1.06941 +5.99223 1.09034 +6.00745 1.05511 +6.02268 1.03392 +6.03791 1.09069 +6.05314 1.0578 +6.06837 1.05833 +6.08359 1.06067 +6.09882 1.07141 +6.11405 1.08944 +6.12928 1.01957 +6.14451 1.03711 +6.15973 1.03791 +6.17496 1.04213 +6.19019 1.03319 +6.20542 1.03763 +6.22065 1.01133 +6.23587 1.02833 +6.2511 1.04327 +6.26633 1.02901 +6.28156 1.03675 +6.29679 1.01697 +6.31201 1.00327 +6.32724 
[... reference data omitted: several hundred added lines of "r g(r)" pairs, r running from about 6.34 to 30.45, with g(r) settling to 1.0 at large r ...]
diff --git a/_basic/openacc/source_code/SOLUTION/dcdread.h b/_basic/openacc/source_code/SOLUTION/dcdread.h
new file mode 100644
index 0000000..66ddba0
--- /dev/null
+++ b/_basic/openacc/source_code/SOLUTION/dcdread.h
@@ -0,0 +1,49 @@
+// Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+using namespace std;
+
+void dcdreadhead(int *natom, int *nframes, std::istream &infile)
+{
+  infile.seekg(8, ios::beg);
+  infile.read((char *)nframes, sizeof(int));
+  infile.seekg(64 * 4, ios::cur);
+  infile.read((char *)natom, sizeof(int));
+  infile.seekg(1 * 8, ios::cur);
+  return;
+}
+
+void dcdreadframe(double *x, double *y, double *z, std::istream &infile,
+                  int natom, double &xbox, double &ybox, double &zbox)
+{
+  double d[6];
+  for (int i = 0; i < 6; i++)
+  {
+    infile.read((char *)&d[i], sizeof(double));
+  }
+  xbox = d[0];
+  ybox = d[2];
+  zbox = d[5];
+  float a, b, c;
+  infile.seekg(1 * 8, ios::cur);
+  for (int i = 0; i < natom; i++)
+  {
+    infile.read((char *)&a, sizeof(float));
+    x[i] = a;
+  }
+  infile.seekg(1 * 8, ios::cur);
+  for (int i = 0; i < natom; i++)
+  {
+    infile.read((char *)&b, sizeof(float));
+    y[i] = b;
+  }
+  infile.seekg(1 * 8, ios::cur);
+  for (int i = 0; i < natom; i++)
+  {
+    infile.read((char *)&c, sizeof(float));
+    z[i] = c;
+  }
+  infile.seekg(1 * 8, ios::cur);
+
+  return;
+}
diff --git a/_basic/openacc/source_code/SOLUTION/rdf.cpp b/_basic/openacc/source_code/SOLUTION/rdf.cpp
new file mode 100644
index 0000000..a93f9a9
--- /dev/null
+++ b/_basic/openacc/source_code/SOLUTION/rdf.cpp
@@ -0,0 +1,202 @@
+// Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+#include <stdio.h>
+#include <iostream>
+#include <fstream>
+#include <math.h>
+#include <stdlib.h>
+#include <string>
+#include <cstring>
+#include <iomanip>
+#include "dcdread.h"
+#include <assert.h>
+#include <nvtx3/nvToolsExt.h>
+#include <chrono>
+
+void pair_gpu(const double *d_x, const double *d_y, const double *d_z,
+              unsigned long long int *d_g2, int numatm, int nconf,
+              const double xbox, const double ybox, const double zbox,
+              int d_bin);
+
+int main(int argc, char *argv[])
+{
+  double xbox, ybox, zbox;
+  double *h_x, *h_y, *h_z;
+  unsigned long long int *h_g2;
+  int nbin;
+  int numatm, nconf, inconf;
+  string file;
+  // double start;
+  //double end;
+
+  ///////////////////////////////////////////////////////////////
+
+  inconf = 10;
+  nbin = 2000;
+  file = "../../_common/input/alk.traj.dcd";
+  ///////////////////////////////////////
+  std::ifstream infile;
+  infile.open(file.c_str());
+  if (!infile)
+  {
+    cout << "file " << file.c_str() << " not found\n";
+    return 1;
+  }
+  assert(infile);
+
+  ofstream pairfile, stwo;
+  pairfile.open("RDF.dat");
+  stwo.open("Pair_entropy.dat");
+
+  /////////////////////////////////////////////////////////
+  dcdreadhead(&numatm, &nconf, infile);
+  cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl;
+  if (inconf > nconf)
+    cout << "nconf is reset to " << nconf << endl;
+  else
+  {
+    nconf = inconf;
+  }
+  cout << "Calculating RDF for " << nconf << " frames" << endl;
+  ////////////////////////////////////////////////////////
+
+  unsigned long long int sizef = nconf * numatm * sizeof(double);
+  unsigned long long int sizebin = nbin * sizeof(unsigned long long int);
+
+  h_x = (double *)malloc(sizef);
+  h_y = (double *)malloc(sizef);
+  h_z = (double *)malloc(sizef);
+  h_g2 = (unsigned long long int *)malloc(sizebin);
+
+  memset(h_g2, 0, sizebin);
+
+  /////////reading coordinates//////////////////////////////////////////////
+  nvtxRangePush("Read_File");
+  double ax[numatm], ay[numatm], az[numatm];
+  for (int i = 0; i < nconf; i++)
+  {
+    dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox);
+    for (int j = 0; j < numatm; j++)
+    {
+      h_x[i * numatm + j] = ax[j];
+      h_y[i * numatm + j] = ay[j];
+      h_z[i * numatm + j] = az[j];
+    }
+  }
+  nvtxRangePop(); //pop for Reading file
+  cout << "Reading of input file is completed" << endl;
+  //////////////////////////////////////////////////////////////////////////
+  nvtxRangePush("Pair_Calculation");
+  pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin);
+  nvtxRangePop(); //Pop for Pair Calculation
+  //////////////////////////////////////////////////////////////////////////
+
+  double pi = acos(-1.0l);
+  double rho = (numatm) / (xbox * ybox * zbox);
+  double norm = (4.0l * pi * rho) / 3.0l;
+  double rl, ru, nideal;
+  double g2[nbin];
+  double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l;
+  double box = min(xbox, ybox);
+  box = min(box, zbox);
+  double del = box / (2.0l * nbin);
+  nvtxRangePush("Entropy_Calculation");
+  for (int i = 0; i < nbin; i++)
+  {
+    rl = (i)*del;
+    ru = rl + del;
+    nideal = norm * (ru * ru * ru - rl * rl * rl);
+    g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal);
+    r = (i)*del;
+    pairfile << (i + 0.5l) * del << " " << g2[i] << endl;
+    if (r < 2.0l)
+    {
+      gr = 0.0l;
+    }
+    else
+    {
+      gr = g2[i];
+    }
+    if (gr < 1e-5)
+    {
+      lngr = 0.0l;
+    }
+    else
+    {
+      lngr = log(gr);
+    }
+    if (g2[i] < 1e-6)
+    {
+      lngrbond = 0.0l;
+    }
+    else
+    {
+      lngrbond = log(g2[i]);
+    }
+    s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r;
+    s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r;
+  }
+  nvtxRangePop(); //Pop for Entropy Calculation
+  stwo << "s2 value is " << s2 << endl;
+  stwo << "s2bond value is " << s2bond << endl;
+
+  cout << "#Freeing Host memory" << endl;
+  free(h_x);
+  free(h_y);
+  free(h_z);
+  free(h_g2);
+
+  cout << "#Number of atoms processed: " << numatm << endl
+       << endl;
+  cout << "#Number of confs processed: " << nconf << endl
+       << endl;
+  return 0;
+}
[... the pair_gpu implementation of SOLUTION/rdf.cpp and the diff headers of the files that follow were lost to angle-bracket stripping. Several near-duplicate copies of rdf.cpp appeared here and are omitted; apart from formatting they differ from the complete listing below mainly by wrapping the pair_gpu call in this OpenACC data region: ...]
+#pragma acc data copy(h_g2 [0:nbin]) copyin(h_x [0:nconf * numatm], h_z [0:nconf * numatm], h_y [0:nconf * numatm])
+  {
+    //////////////////////////////////////////////////////////////////////////
+    nvtxRangePush("Pair_Calculation");
+    pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin);
+    nvtxRangePop(); //Pop for Pair Calculation
+    //////////////////////////////////////////////////////////////////////////
+  }
[... the next recoverable file, a further rdf.cpp variant whose diff header was also lost, begins here ...]
+// Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+#include <stdio.h>
+#include <iostream>
+#include <fstream>
+#include <math.h>
+#include <stdlib.h>
+#include <string>
+#include <cstring>
+#include <iomanip>
+#include "dcdread.h"
+#include <assert.h>
+#include <nvtx3/nvToolsExt.h>
+#include <chrono>
+
+void pair_gpu(const double *d_x, const double *d_y, const double *d_z,
+              unsigned long long int *d_g2, int numatm, int nconf,
+              const double xbox, const double ybox, const double zbox,
+              int d_bin);
+
+int main(int argc, char *argv[])
+{
+  double xbox, ybox, zbox;
+  double *h_x, *h_y, *h_z;
+  unsigned long long int *h_g2;
+  int nbin;
+  int numatm, nconf, inconf;
+  string file;
+
+  ///////////////////////////////////////////////////////////////
+
+  inconf = 10;
+  nbin = 2000;
+  file = "../../_common/input/alk.traj.dcd";
+  ///////////////////////////////////////
+  std::ifstream infile;
+  infile.open(file.c_str());
+  if (!infile)
+  {
+    cout << "file " << file.c_str() << " not found\n";
+    return 1;
+  }
+  assert(infile);
+
+  ofstream pairfile, stwo;
+  pairfile.open("RDF.dat");
+  stwo.open("Pair_entropy.dat");
+
+  /////////////////////////////////////////////////////////
+  dcdreadhead(&numatm, &nconf, infile);
+  cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl;
+  if (inconf > nconf)
+    cout << "nconf is reset to " << nconf << endl;
+  else
+  {
+    nconf = inconf;
+  }
+  cout << "Calculating RDF for " << nconf << " frames" << endl;
+  ////////////////////////////////////////////////////////
+
+  unsigned long long int sizef = nconf * numatm * sizeof(double);
+  unsigned long long int sizebin = nbin * sizeof(unsigned long long int);
+
+  h_x = (double *)malloc(sizef);
+  h_y = (double *)malloc(sizef);
+  h_z = (double *)malloc(sizef);
+  h_g2 = (unsigned long long int *)malloc(sizebin);
+
+  memset(h_g2, 0, sizebin);
+
+  /////////reading coordinates//////////////////////////////////////////////
+  nvtxRangePush("Read_File");
+  double ax[numatm], ay[numatm], az[numatm];
+  for (int i = 0; i < nconf; i++)
+  {
+    dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox);
+    for (int j = 0; j < numatm; j++)
+    {
+      h_x[i * numatm + j] = ax[j];
+      h_y[i * numatm + j] = ay[j];
+      h_z[i * numatm + j] = az[j];
+    }
+  }
+  nvtxRangePop(); //pop for Reading file
+  cout << "Reading of input file is completed" << endl;
+  //////////////////////////////////////////////////////////////////////////
+  nvtxRangePush("Pair_Calculation");
+  pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin);
+  nvtxRangePop(); //Pop for Pair Calculation
+  //////////////////////////////////////////////////////////////////////////
+
+  double pi = acos(-1.0l);
+  double rho = (numatm) / (xbox * ybox * zbox);
+  double norm = (4.0l * pi * rho) / 3.0l;
+  double rl, ru, nideal;
+  double g2[nbin];
+  double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l;
+  double box = min(xbox, ybox);
+  box = min(box, zbox);
+  double del = box / (2.0l * nbin);
+  nvtxRangePush("Entropy_Calculation");
+  for (int i = 0; i < nbin; i++)
+  {
+    rl = (i)*del;
+    ru = rl + del;
+    nideal = norm * (ru * ru * ru - rl * rl * rl);
+    g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal);
+    r = (i)*del;
+    pairfile << (i + 0.5l) * del << " " << g2[i] << endl;
+    if (r < 2.0l)
+    {
+      gr = 0.0l;
+    }
+    else
+    {
+      gr = g2[i];
+    }
+    if (gr < 1e-5)
+    {
+      lngr = 0.0l;
+    }
+    else
+    {
+      lngr = log(gr);
+    }
+
+    if (g2[i] < 1e-6)
+    {
+      lngrbond = 0.0l;
+    }
+    else
+    {
+      lngrbond = log(g2[i]);
+    }
+    s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r;
+    s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r;
+  }
+  nvtxRangePop(); //Pop for Entropy Calculation
+  stwo << "s2 value is " << s2 << endl;
+  stwo << "s2bond value is " << s2bond << endl;
+
+  cout << "#Freeing Host memory" << endl;
+  free(h_x);
+  free(h_y);
+  free(h_z);
+  free(h_g2);
+
+  cout << "#Number of atoms processed: " << numatm << endl
+       << endl;
+  cout << "#Number of confs processed: " << nconf << endl
+       << endl;
+  return 0;
+}
+int round(float num)
+{
+  return num < 0 ? num - 0.5 : num + 0.5;
+}
+void pair_gpu(const double *d_x, const double *d_y, const double *d_z,
+              unsigned long long int *d_g2, int numatm, int nconf,
+              const double xbox, const double ybox, const double zbox, int d_bin)
+{
+  double r, cut, dx, dy, dz;
+  int ig2;
+  double box;
+  box = min(xbox, ybox);
+  box = min(box, zbox);
+
+  double del = box / (2.0 * d_bin);
+  cut = box * 0.5;
+
+  printf("\n %d %d ", nconf, numatm);
+  for (int frame = 0; frame < nconf; frame++)
+  {
+    printf("\n %d ", frame);
+    for (int id1 = 0; id1 < numatm; id1++)
+    {
+      for (int id2 = 0; id2 < numatm; id2++)
+      {
+        dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2];
+        dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2];
+        dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2];
+
+        dx = dx - xbox * (round(dx / xbox));
+        dy = dy - ybox * (round(dy / ybox));
+        dz = dz - zbox * (round(dz / zbox));
+
+        r = sqrtf(dx * dx + dy * dy + dz * dz);
+        if (r < cut)
+        {
+          ig2 = (int)(r / del);
+#pragma acc atomic
+          d_g2[ig2] = d_g2[ig2] + 1;
+        }
+      }
+    }
+  }
+}
diff --git a/_basic/openacc/source_code/rdf.f90 b/_basic/openacc/source_code/rdf.f90
new file mode 100644
index 0000000..e181995
--- /dev/null
+++ b/_basic/openacc/source_code/rdf.f90
@@ -0,0 +1,161 @@
+!/////////////////////////////////////////////////////////////////////////////////////////
+!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi
+!/////////////////////////////////////////////////////////////////////////////////////////
+
+! Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+
+module readdata
+contains
+  subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes)
+    implicit none
+    integer i,j
+    integer maxframes,maxatoms
+
+    double precision d(6),xbox,ybox,zbox
+    real*4, allocatable :: x(:,:)
+    real*4, allocatable :: y(:,:)
+    real*4, allocatable :: z(:,:)
+
+    real*4 dummyr
+    integer*4 nset, natoms, dummyi,nframes,tframes
+    character*4 dummyc
+
+    open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted')
+    read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9)
+    read(10) dummyi, dummyr,dummyr
+    read(10) natoms
+    print*,"Total number of frames and atoms are",tframes,natoms
+
+    allocate ( x(maxframes,natoms) )
+    allocate ( y(maxframes,natoms) )
+    allocate ( z(maxframes,natoms) )
+
+    do i = 1,nframes
+      read(10) (d(j),j=1, 6)
+
+      read(10) (x(i,j),j=1,natoms)
+      read(10) (y(i,j),j=1,natoms)
+      read(10) (z(i,j),j=1,natoms)
+    end do
+
+    xbox=d(1)
+    ybox=d(3)
+    zbox=d(6)
+
+    print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox
+    return
+
+  end subroutine readdcd
+end module readdata
+
+program rdf
+  use readdata
+  use nvtx
+  implicit none
+  integer n,i,j,iconf,ind
+  integer natoms,nframes,nbin
+  integer maxframes,maxatoms
+  parameter (maxframes=10,maxatoms=60000,nbin=2000)
+  real*4, allocatable :: x(:,:)
+  real*4, allocatable :: y(:,:)
+  real*4, allocatable :: z(:,:)
+  double precision dx,dy,dz
+  double precision xbox,ybox,zbox,cut
+  double precision vol,r,del,s2,s2bond
+  double precision, allocatable :: g(:)
+  double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf
+  double precision rlower,rupper
+  character atmnm*4
+  real*4 start,finish
+
+  open(23,file='RDF.dat',status='unknown')
+  open(24,file='Pair_entropy.dat',status='unknown')
+
+  nframes=10
+
+  call cpu_time(start)
+
+  print*,"Going to read coordinates"
+  call nvtxStartRange("Read File")
+  call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes)
+  call nvtxEndRange
+
+  allocate ( g(nbin) )
+  g = 0.0d0
+
+  pi=dacos(-1.0d0)
+  vol=xbox*ybox*zbox
+  rho=dble(natoms)/vol
+
+  del=xbox/dble(2.0*nbin)
+  write(*,*) "bin width is : ",del
+  cut = dble(xbox * 0.5)
+
+  !pair calculation
+  call nvtxStartRange("Pair Calculation")
+  do iconf=1,nframes
+    if (mod(iconf,1).eq.0) print*,iconf
+    do i=1,natoms
+      do j=1,natoms
+        dx=x(iconf,i)-x(iconf,j)
+        dy=y(iconf,i)-y(iconf,j)
+        dz=z(iconf,i)-z(iconf,j)
+
+        dx=dx-nint(dx/xbox)*xbox
+        dy=dy-nint(dy/ybox)*ybox
+        dz=dz-nint(dz/zbox)*zbox
+
+        r=dsqrt(dx**2+dy**2+dz**2)
+        ind=int(r/del)+1
+        !if (ind.le.nbin) then
+        if(r<cut)then
+          g(ind)=g(ind)+1.0d0
+        endif
+      end do
+    end do
+  end do
+  call nvtxEndRange
[... the remainder of rdf.f90 (g(r) normalization, entropy calculation, output, timing) and the diff header of the following Jupyter notebook file were lost to angle-bracket stripping; the notebook source resumes below ...]
+    "C/C++ syntax\n",
+    "\n",
+    "```cpp\n",
+    "#pragma omp directive\n",
+    "```\n",
\n", + "
\n", + " \n", + "
\n", + " Fortran syntax\n", + " \n", + "```!$omp directive ``` \n", + "
\n", + "
\n", + "\n", + "**#pragma** in C/C++ and **!$** in Fortran are what's known as a \"compiler hint.\" These are very similar to programmer comments, however, the compiler will actually read our pragmas. Pragmas are a way for the programmer to \"guide\" the compiler, without running the chance damaging the code. If the compiler does not understand the pragma, it can ignore it, rather than throw a syntax error.\n", + "\n", + "**omp** is an addition to our pragma, it is known as the “sentinel”. It specifies that this is an OpenMP pragma. Any non-OpenMP compiler will ignore this pragma. \n", + "\n", + "**directives** are commands in OpenMP that will tell the compiler to do some action. For now, we will only use directives that allow the compiler to parallelize our code.\n", + "\n", + "For beginners who are new to OpenMP directive, we will be introducing some terminologies and concepts before starting to add ```target``` directives to our code to offload onto GPU computation and data. \n", + "\n", + "## OpenMP Fork-Join Model\n", + "\n", + "OpenMP uses the fork-join model of parallel execution. All OpenMP programs begin as a single process: the master thread. The master thread executes sequentially until the first parallel region construct is encountered.\n", + "\n", + "**FORK**: the master thread then creates a team of parallel threads.The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.\n", + "\n", + "**JOIN**: When the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.\n", + "\n", + "\n", + "\n", + "## OpenMP Parallel Region\n", + "\n", + "A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct. When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is a member of that team. Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code redundantly.There is an implied barrier at the end of a parallel region. Only the master thread continues execution past this point\n", + "\n", + "\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + " \n", + "```cpp\n", + "//Include the header file\n", + "#include \n", + "\n", + " main(int argc, char *argv[]) {\n", + "\n", + " int nthreads;\n", + "\n", + " /* Fork a team of threads*/\n", + " #pragma omp parallel\n", + " {\n", + "\n", + " /* Obtain and print thread id */\n", + " printf(\"Hello World from thread = %d\\n\", omp_get_thread_num());\n", + "\n", + " /* Only master thread does this */\n", + " if (omp_get_thread_num() == 0) \n", + " {\n", + " nthreads = omp_get_num_threads();\n", + " printf(\"Number of threads = %d\\n\", nthreads);\n", + " }\n", + "\n", + " } /* All threads join master thread and terminate */\n", + "\n", + " }\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "program hello\n", + " integer :: omp_rank\n", + "!$omp parallel private(omp_rank)\n", + " omp_rank = omp_get_thread_num()\n", + " print *, 'Hello world! by thread ', omp_rank\n", + "!$omp end parallel\n", + "end program hello\n", + " ```\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "## OpenMP Data-sharing\n", + "In OpenMP, several constructs accept clauses that allow the user to control the data sharing. For example, you can use one of below clauses in a *Parallel* construct.\n", + "\n", + "- `private`: Declares variables to be private to each thread in a team. Private copies of the variable are initialized from the original object when entering the region.\n", + "- `shared`: Shares variables among all the threads in a team.\n", + "- `default`: Enables you to affect the data-scope attributes of variables.\n", + "\n", + " \n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "\n", + "#pragma omp parallel for default(shared) private(dx)\n", + "{\n", + " for (int i=0; i < N; i++){\n", + " for (int j=0; j < N; j++){\n", + " dx = a[i] + b[j];\n", + " }\n", + " } \n", + "\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "\n", + "```fortran\n", + "\n", + "program hello\n", + " integer :: omp_rank\n", + "!$omp parallel default(shared) private(omp_rank)\n", + " omp_rank = omp_get_thread_num()\n", + " print *, 'Hello world! by thread ', omp_rank\n", + "!$omp end parallel\n", + "end program hello\n", + "\n", + "```\n", + "
\n", + "
\n", + "\n", + "## OpenMP Work-sharing\n", + "\n", + "As described before ```parallel``` construct creates team of threads and the execution continues redundantly on all threads of team. Ideally we would need all threads within the team to work share i.e. split the work. A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it. Work-sharing constructs do not launch new threads but Divides (“workshares”) the iterations of the loop across the threads in the team . There is no implied barrier upon entry to a work-sharing construct, however there is an implied barrier at the end of a work sharing construct. \n", + "\n", + "There are multiple ways to allow worksharing, the code below makes use of ```for``` to divide the iteration of loop among threads.\n", + "\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "\n", + "//Create a team of threads\n", + "#pragma omp parallel \n", + "{\n", + "//workshare this loop across those threads.\n", + " #pragma omp for\n", + " for (i=0; i < N; i++)\n", + " c[i] = a[i] + b[i];\n", + "\n", + " } /* end of parallel region */\n", + "\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "\n", + "\n", + "```fortran\n", + "\n", + "!Create a team of threads\n", + "!$omp parallel\n", + "!workshare this loop across those threads.\n", + " !$omp for\n", + " do i=1,N\n", + " < loop code >\n", + " end do\n", + "!$omp end parallel\n", + "\n", + "```\n", + "
\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + "\n", + "## OpenMP Target Offloading\n", + "\n", + "By now you should have got familiar with the OpenMP programming model. Now let us start introducing key directives and construct used to add GPU offloading. \n", + "\n", + "\n", + "### ```target ```\n", + "\n", + "```target``` construct consists of a target directive and an execution region. ```target``` directive define a target region, which is a block of computation that operates within a distinct data environment and is intended to be offloaded onto a parallel computation device during execution ( GPU in our case). Data used within the region may be implicitly or explicitly mapped to the device. All of OpenMP is allowed within target regions, but only a subset will run well on GPUs.\n", + "\n", + "The example below shows usage of target directive with implicitly mapped data\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "while (iter < iter_max )\n", + "{\n", + " error = 0.0;\n", + " //Moves this region of code to the GPU and implicitly maps data.\n", + " #pragma omp target\n", + " {\n", + " #pragma omp parallel for reduction(max:error)\n", + " for( int j = 1; j < n-1; j++) {\n", + " ANew[j] = A [j-1] + A[j+1];\n", + " }\n", + " }\n", + " iter++;\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + " !Moves this region of code to the GPU and implicitly maps data.\n", + " !$omp target\n", + " !$omp parallel do\n", + " do i=1,N \n", + " ANew(j) = A (j-1) + A(j+1)\n", + " end do\n", + " !$omp end target \n", + "```\n", + "
\n", + "
\n", + " \n", + "### ```target data``` to explicitly map the data\n", + "\n", + "Map a variable to/from the device.Map directive helps developer to explicitly define and reduce data copies. The ```target data```construct is used to mark such regions\n", + "\n", + "\n", + "
\n", + " C/C++ syntax\n", + "\n", + "```cpp\n", + "#pragma omp target map(map-type: list)\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "!$omp target data map(map-type: list)\n", + "```\n", + "
\n", + "
\n", + "\n", + "Example of mapping data directives are as follows: \n", + "- `to` (list)\n", + " - Allocates memory on the device and copies data in when entering the region, the values are not copied back\n", + "- `from` (list)\n", + " - Allocates memory on the device and copies the data to the host when exiting the region\n", + "- `alloc` (list)\n", + " - Allocates memory on the device. If the data is already present on the device a reference counter is incremented\n", + "\n", + "\n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "while (iter < iter_max )\n", + "{\n", + " error = 0.0;\n", + " //Moves this region of code to the GPU and explicitly maps data.\n", + " #pragma omp target data map(to:A[:n]) map(from:ANew[:n])\n", + " {\n", + " #pragma omp parallel for reduction(max:error)\n", + " for( int j = 1; j < n-1; j++) {\n", + " ANew[j] = A [j-1] + A[j+1];\n", + " }\n", + " }\n", + " iter++;\n", + "}\n", + "```\n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "!Moves this region of code to the GPU and explicitly maps data.\n", + "!$omp target data map(to:A(:)) map(from:ANew(:))\n", + " !$omp parallel do\n", + " do i=1,N \n", + " ANew(j) = A (j-1) + A(j+1)\n", + " end do\n", + "!$omp end target data\n", + "\n", + "\n", + "```\n", + "
\n", + "
\n", + " \n", + "### ```teams``` directive\n", + "```teams``` directve creates a league of thread teams where the master thread of each team executes the region. Each of these master threads executes sequentially. Or in other words teams directive spawn 1 or more thread teams with the same number of threads. The execution continues on the master threads of each team (redundantly). There is no synchronization allowed between teams. \n", + "\n", + "OpenMP calls that somewhere a team, which might be a thread on the CPU or maying a CUDA threadblock or OpenCL workgroup. It will choose how many teams to create based on where you're running, only a few on a CPU (like 1 per CPU core) or lots on a GPU (1000's possibly). ```teams``` allow OpenMP code to scale from small CPUs to large GPUs because each one works completely independently of each other ```teams```.\n", + "\n", + "\n", + "\n", + "### ```distribute``` \n", + "There's a good chance that we don't want the loop to be run redundantly in every master thread of ```teams``` though, that seems wasteful and potentially dangerous. With usage of ```distribute``` construct the iterations of the next loop are broken into groups that are *distributed* to the master threads of the teams. The iterations are distributed statically and there is no guarantee about the order teams will execute. Also it does not generate parallelism/worksharing within the thread teams.\n", + "\n", + "\n", + "\n", + "Th example below of simple stencil code shows the usage of ```distribute``` along with ```team```:\n", + "\n", + " \n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "#pragma omp target teams distribute \n", + " for( int j = 1; j < n-1; j++) {\n", + " for( int i = 1; i < m-1; i++) {\n", + " Anew[j][i] = 0.25 * ( A[j][i+1] + A [j][i-1]\n", + " + A[j-1][i] + A[j+1][i]);\n", + " error = fmax (error, fabs(Anew[j][i] - A[j][i]));\n", + " }}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + "\n", + "```fortran\n", + " !$omp target teams distribute\n", + " do j=1,N \n", + " do i=1,N \n", + " ANew(j,i) = 0.25 * (A(j-1,i) + A(j+1,i) + A(j,i-1) + A(j,i+1))\n", + " enddo\n", + " enddo\n", + " !$omp end target \n", + "```\n", + "
\n", + "
\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "### Work sharing to improve parallelism\n", + "\n", + "As shown in the image only the master thread performs the computation which is not so optimal in case of GPU architecture. To solve this problem we will make use of work-sharing as we did before. When any team encounters a worksharing construct, the work inside the construct is divided among the members of the team, and executed cooperatively instead of being executed by every thread. There are many work sharing constructs defined, one is using `teams distribute` construct: \n", + "\n", + " \n", + "\n", + " \n", + " \n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "#pragma omp target teams distribute parallel for\n", + " for( int j = 1; j < n-1; j++) {\n", + " for( int i = 1; i < m-1; i++) {\n", + " Anew[j][i] = 0.25 * ( A[j][i+1] + A [j][i-1]\n", + " + A[j-1][i] + A[j+1][i]);\n", + " error = fmax (error, fabs(Anew[j][i] - A[j][i]));\n", + " }}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + "\n", + "```fortran\n", + " !$omp target teams distribute parallel do\n", + " do j=1,N \n", + " do i=1,N \n", + " ANew(j,i) = 0.25 * (A(j-1,i) + A(j+1,i) + A(j,i-1) + A(j,i+1))\n", + " enddo\n", + " enddo\n", + " !$omp end target \n", + "```\n", + "
\n", + "
\n", + "\n", + "In the above examples, programmer explicitly requests the steps the compiler should take to map parallelism to the target architecture. Another way to expose more parallelism in a program is to allow a compilerto do the mapping onto the target architectures. The HPC compilers' implementation of loop supports this descriptive model. In the below examples, the programmer specifies the loop regions to be parallelized by the compiler and the compilers parallelize loop across teams and threads using `teams loop` construct which is a shortcut for specifying a teams construct containing a loop construct and no other statements:\n", + "\n", + " \n", + "
\n", + " C/C++ syntax\n", + " \n", + "```cpp\n", + "#pragma omp target teams loop reduction(max:error) \n", + "for( int j = 1; j < n-1; j++) {\n", + " #pragma omp loop reduction(max:error)\n", + " for( int i = 1; i < m-1; i++ ) {\n", + " Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]\n", + " + A[j-1][i] + A[j+1][i]);\n", + " error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));\n", + " }\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "
\n", + " Fortran syntax\n", + "\n", + "```fortran \n", + "!$omp target teams loop \n", + "do n1loc_blk = 1, n1loc_blksize\n", + " do igp = 1, ngpown \n", + " do ig_blk = 1, ig_blksize \n", + " do ig = ig_blk, ncouls, ig_blksize\n", + " do n1_loc = n1loc_blk, ntband_dist, n1loc_blksize\n", + " !expensive computation codes \n", + " enddo \n", + " enddo \n", + " enddo \n", + " enddo \n", + "enddo\n", + "```\n", + "
\n", + "
\n", + "\n", + "Moreover, further tuning when using a `loop` construct can be done with the `bind` clause, where binding can be one of `teams`, `parallel`, `thread`. For more information, please visit (OpenMP documentation)[https://www.openmp.org/spec-html/5.1/openmpsu51.html].\n", + "\n", + "## Atomic Construct\n", + "\n", + "In the code you will also require one more construct which will help you in getting the right results. OpenMP atomic construct ensures that a particular variable is accessed and/or updated atomically to prevent indeterminate results and race conditions. In other words, it prevents one thread from stepping on the toes of other threads due to accessing a variable simultaneously, resulting in different results run-to-run. For example, if we want to count the number of elements that have a value greater than zero, we could write the following:\n", + "\n", + "\n", + "
\n", + " C/C++ syntax\n", + "\n", + "\n", + "```cpp\n", + "if ( val > 0 )\n", + "{\n", + " #pragma omp atomic\n", + " {\n", + " cnt++;\n", + " }\n", + "}\n", + "```\n", + "
\n", + "
\n", + "\n", + "\n", + "
\n", + " Fortran syntax\n", + " \n", + "```fortran\n", + "if(r\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, lets start modifying the original code and add the OpenMP directives. Click on the [C/C++ version](../source_code/rdf.cpp) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after changes, before running below cells." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compile and Run for Multicore\n", + "\n", + "Having added OpenMP directives, let us compile the code. We will be using NVIDIA HPC SDK compiler for this exercise. The flags used for enabling OpenMP target offloading are as follows:\n", + "\n", + "\n", + "\n", + "`-mp=gpu|multicore` : Select the target device for all parallel programming paradigms used (OpenACC, OpenMP, Standard Languages)\n", + "- `gpu` Globally set the target device to an NVIDIA GPU\n", + "- `multicore` Globally set the target device to the host CPU\n", + "\n", + "**NOTE:** `-Minfo=mp` enables OpenMP information.\n", + "\n", + "After running the cells, make sure to check the output first. You can inspect part of the compiler feedback for C or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for multicore (C/C++)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvc++ -mp=multicore -Minfo=mp -I/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/include -o rdf_c rdf.cpp && ./rdf_c && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback (C/C++ version) \n", + "You can see from *Line 174* that it is generating a multicore code `174, Generating Multicore code`.\n", + "\n", + "\n", + " \n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + "```\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Compile the code for multicore (Fortran)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -mp=multicore -Minfo=mp -o rdf_f rdf.f90 -lnvhpcwrapnvtx && ./rdf_f && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback (Fortran version) \n", + "You can see from *Line 98* that it is generating a multicore code `98, Generating Multicore code`.\n", + "\n", + "```\n", + "\tMulticore output\n", + "\t\trdf:\n", + " 98, !$omp target teams distribute parallel do\n", + " 98, Generating Multicore code\n", + " 99, Loop parallelized across teams and threads, schedule(static)\n", + "```\n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 : -2.452690945278331 \n", + "s2bond : -24.37502820694527 \n", + "```\n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's profile the code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (C/C++ version)\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_multicore_c ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (Fortran version)\n", + "!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_multicore_f ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's checkout the profiler's report. Download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_multicore_c.nsys-rep) or the [Fortran version](../source_code/rdf_multicore_f.nsys-rep) and choosing save Link As Once done, open it via the GUI. Have a look at the example expected profiler report below:\n", + "\n", + "
\n", + " Example screenshot (C/C++ code) \n", + "\n", + "
\n", + "
\n", + "
\n", + " Example screenshot (Fortran code) \n", + "\n", + "
\n", + "
\n", + "\n", + "\n", + "Feel free to checkout the solutions for [C/C++](../source_code/SOLUTION/rdf_offload.cpp) and [Fortran](../source_code/SOLUTION/rdf_offload.f90) versions to help you understand better.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compile and Run for an NVIDIA GPU\n", + "\n", + "Without changing the code now let us try to recompile the code for NVIDIA GPU and rerun.\n", + "The only difference is that now we pass `gpu` value to the `-mp` compiler option.`-mp=gpu`. **Understand and analyze** the code present at [C/C++ version](../source_code/SOLUTION/rdf_offload.cpp) and/or the [Fortran version](../source_code/SOLUTION/rdf_offload.f90) .\n", + "\n", + "Open the downloaded files for inspection. Once done, modify your code by using `teams loop` construct. Click on the [C/C++ version](../source_code/rdf.cpp) or the [Fortran version](../source_code/rdf.f90) links, and start modifying the C or Fortran version of the RDF code. Remember to **SAVE** your code after changes, before running below cells. \n", + "\n", + "After running the cells, make sure to check the output first. You can inspect part of the compiler feedback for C or Fortran version and see what it's telling us (your compiler feedback will be similar to the below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (C/C++ version)\n", + "!cd ../source_code && echo \"compiling C/C++ version .. \" && nvc++ -mp=gpu -Minfo=mp -o rdf_c rdf.cpp && ./rdf_c && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback (C/C++ version) \n", + "Inspect the compiler feedback (you should get a similar output as below) and you can see below: \n", + "\n", + "- *Line 86* shows variables mapped to the device\n", + "- *Line 172* shows the GPU kernel is generated `Generating \"nvkernel__Z8pair_gpuPKdS0_S0_Pjiidddi_F1L174_1\" GPU kernel`\n", + "\n", + "\n", + "```\n", + "main:\n", + " 86, Generating map(tofrom:h_g2[:nbin],h_x[:numatm*nconf],h_y[:numatm*nconf],h_z[:numatm*nconf]) \n", + "pair_gpu(const double *, const double *, const double *, unsigned int *, int, int, double, double, double, int):\n", + " 172, #omp target teams loop\n", + " 172, Generating \"nvkernel__Z8pair_gpuPKdS0_S0_Pjiidddi_F1L172_2\" GPU kernel\n", + " Generating NVIDIA GPU code\n", + " 172, Loop parallelized across teams /* blockIdx.x */\n", + " 175, Loop parallelized across threads(128) /* threadIdx.x */\n", + " 172, Generating Multicore code\n", + " 172, Loop parallelized across threads\n", + " 172, Generating implicit map(tofrom:d_g2[:]) \n", + " Generating implicit allocate(d_x[:]) [if not already present]\n", + " Generating implicit copyin(d_x[numatm*frame:*]) [if not already present]\n", + " Generating implicit allocate(d_y[:],d_z[:]) [if not already present]\n", + " Generating implicit copyin(d_y[numatm*frame:*],d_z[numatm*frame:*]) [if not already present]\n", + " 175, Loop is parallelizable\n", + "```\n", + " \n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + "```\n", + "
\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#compile for Tesla GPU (Fortran version)\n", + "!cd ../source_code && echo \"compiling Fortran version .. \" && nvfortran -mp=gpu -Minfo=mp -o rdf_f rdf.f90 -lnvhpcwrapnvtx && ./rdf_f && cat Pair_entropy.dat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Compiler Feedback (Fortran version) \n", + "Inspect the compiler feedback (you should get a similar output as below) and you can see below: \n", + "\n", + "- *Line 94* shows variables mapped to the device\n", + "- *Line 98* shows the GPU kernel is generated\n", + "\n", + "```\n", + "rdf:\n", + " 98, !$omp target teams loop\n", + " 94, Generating map(tofrom:g(:),x(x$sd9:(x$sd9-1)+x$sd9,x$sd9:(x$sd9-1)+x$sd9),\n", + "y(y$sd8:(y$sd8-1)+y$sd8,y$sd8:(y$sd8-1)+y$sd8),z(z$sd7:(z$sd7-1)+z$sd7,z$sd7:(z$sd7-1)+z$sd7)) \n", + " 98, Generating \"nvkernel_MAIN__F1L98_2\" GPU kernel\n", + " Generating NVIDIA GPU code\n", + " 99, Loop parallelized across teams ! blockidx%x\n", + " 101, Loop parallelized across threads(128) ! threadidx%x\n", + " 98, Generating Multicore code\n", + " 99, Loop parallelized across threads\n", + " 98, Generating implicit map(tofrom:g(:),x(:,:),z(:,:),y(:,:)) \n", + " 101, Loop is parallelizable\n", + "\n", + "```\n", + " \n", + "The output should be the following:\n", + "\n", + "```\n", + "s2 : -2.452690945278331 \n", + "s2bond : -24.37502820694527 \n", + "```\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is very important to inspect the feedback to make sure the compiler is doing what you have asked of it. Now, let's profile the code. Now, let's profile the code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (C/C++)\n", + "!cd ../source_code && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_gpu_c ./rdf_c" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#profile and see output of nvptx (Fortran)\n", + "!cd ../source_code && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_gpu_f ./rdf_f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Download and save the report file by holding down Shift and Right-Clicking the [C/C++ version](../source_code/rdf_gpu_c.nsys-rep) or the [Fortran version](../source_code/rdf_gpu_f.nsys-rep) and choosing save Link As Once done, open it via the GUI. Have a look at the example expected profiler report below:\n", + "\n", + "
\n", + " Example screenshot (C/C++ code) \n", + "\n", + "\n", + "
\n", + "
\n", + "
\n", + " Example screenshot (Fortran code) \n", + "\n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you expand the CUDA row (Timeline view), you can see memory movements as well as kernels. Checkout the NVTX row and compare the execution time for the `Pair_Calculation` for the multicore version and the GPU offload version. In the *example screenshot*, we were able to reduce the timing significantly.\n", + "\n", + "Feel free to checkout the solutions for [C/C++](../source_code/SOLUTION/rdf_offload_loop.cpp) and [Fortran](../source_code/SOLUTION/rdf_offload_loop.f90) versions to help you understand better.\n", + "\n", + "# OpenMP Analysis\n", + "\n", + "**Usage Scenarios**\n", + "- Legacy codes with sizeable codebase needs to be ported to GPUs with minimal code changes to sequential code.\n", + "- Developers want to see if the code structure favors GPU SIMD/SIMT style or as we say test the waters before moving a large piece of code to a GPU.\n", + "\n", + "\n", + "**Limitations/Constraints**\n", + "- Directive based programming model like OpenMP depends on a compiler to understand and convert your sequential code to CUDA constructs. OpenMP compiler with target offload support are evloving and they it cannot match the best performance that say using CUDA C constructs directly can give. Things like controlling execution at warp level or limiting the register counts etc are some of the examples\n", + " \n", + "**Which Compilers Support OpenMP on GPU?**\n", + "As of March 2020 here are the compilers that support OpenMP on GPU:\n", + "\n", + "| Compiler | Latest Version | Maintained by | Full or Partial Support |\n", + "| --- | --- | --- | --- |\n", + "| GCC | 12 | Mentor Graphics | 4.5 and 5.0 partial spec supported |\n", + "| CCE| latest | Cray | 4.5 partial spec supported | \n", + "| XL | latest | IBM | 4.5 partial spec supported |\n", + "| Clang | 13.0 | Community | 4.5 partial spec supported |\n", + "| HPC SDK | 22.11 | NVIDIA HPC SDK | 5.0 spec supported |" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, it is recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below." 
+ ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd ..\n", "rm -f _files.zip\n", "zip -r _files.zip *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../_files.zip) and choosing *Save Link As*.\n", "\n", "\n", "\n", "# Links and Resources\n", "[OpenMP Programming Model](https://computing.llnl.gov/tutorials/openMP/)\n", "\n", "[OpenMP Target Directive](https://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf)\n", "\n", "[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)\n", "\n", "\n", "**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).\n", "\n", "Don't forget to check out the additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", "\n", "--- " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 } diff --git a/_basic/openmp/source_code/Makefile b/_basic/openmp/source_code/Makefile new file mode 100644 index 0000000..3b1c2b9 --- /dev/null +++ b/_basic/openmp/source_code/Makefile @@ -0,0 +1,11 @@ +# Copyright (c) 2020 NVIDIA Corporation. All rights reserved. 
+ +FC := nvfortran +FLAGS := -O3 -w +ACCFLAGS := -Minfo=accel +NVTXLIB := -lnvhpcwrapnvtx + +rdf: rdf.f90 + ${FC} ${FLAGS} ${ACCFLAGS} rdf.f90 -o rdf ${NVTXLIB} +clean: + rm -f *.o rdf diff --git a/_basic/openmp/source_code/Pair_entropy.dat b/_basic/openmp/source_code/Pair_entropy.dat new file mode 100644 index 0000000..0998d26 --- /dev/null +++ b/_basic/openmp/source_code/Pair_entropy.dat @@ -0,0 +1,2 @@ +s2 value is -2.43191 +s2bond value is -3.87014 diff --git a/_basic/openmp/source_code/RDF.dat b/_basic/openmp/source_code/RDF.dat new file mode 100644 index 0000000..ea809b6 --- /dev/null +++ b/_basic/openmp/source_code/RDF.dat @@ -0,0 +1,2000 @@ +0.00761401 2.27364e+06 +0.022842 0 +0.03807 0 +0.0532981 0 +0.0685261 0 +0.0837541 0 +0.0989821 0 +0.11421 0 +0.129438 0 +0.144666 0 +0.159894 0 +0.175122 0 +0.19035 0 +0.205578 0 +0.220806 0 +0.236034 0 +0.251262 0 +0.26649 0 +0.281718 0 +0.296946 0 +0.312174 0 +0.327402 0 +0.34263 0 +0.357858 0 +0.373086 0 +0.388314 0 +0.403543 0 +0.418771 0 +0.433999 0 +0.449227 0 +0.464455 0 +0.479683 0 +0.494911 0 +0.510139 0 +0.525367 0 +0.540595 0 +0.555823 0 +0.571051 0 +0.586279 0 +0.601507 0 +0.616735 0 +0.631963 0 +0.647191 0 +0.662419 0 +0.677647 0 +0.692875 0 +0.708103 0 +0.723331 0 +0.738559 0 +0.753787 0 +0.769015 0 +0.784243 0 +0.799471 0 +0.814699 0 +0.829927 0 +0.845155 0 +0.860383 0 +0.875611 0 +0.890839 0 +0.906067 0 +0.921295 0 +0.936523 0 +0.951751 0 +0.966979 0 +0.982207 0 +0.997435 0 +1.01266 0 +1.02789 0 +1.04312 0 +1.05835 0 +1.07358 0 +1.0888 0 +1.10403 0 +1.11926 0 +1.13449 0 +1.14972 0 +1.16494 0 +1.18017 0 +1.1954 0 +1.21063 0 +1.22586 0 +1.24108 0 +1.25631 0 +1.27154 0 +1.28677 0 +1.302 0 +1.31722 0 +1.33245 0 +1.34768 0 +1.36291 0 +1.37814 0 +1.39336 0 +1.40859 0.015817 +1.42382 0.0490216 +1.43905 0.272783 +1.45428 0.976895 +1.4695 2.98168 +1.48473 6.70771 +1.49996 12.3749 +1.51519 18.7254 +1.53042 23.1225 +1.54564 22.5968 +1.56087 18.7488 +1.5761 12.2483 +1.59133 6.82648 +1.60656 2.88983 +1.62178 1.03012 +1.63701 0.322052 +1.65224 0.0498164 +1.66747 0.0112871 +1.6827 0.00184729 +1.69792 0 +1.71315 0 +1.72838 0 +1.74361 0 +1.75884 0 +1.77406 0 +1.78929 0 +1.80452 0 +1.81975 0 +1.83498 0 +1.8502 0 +1.86543 0 +1.88066 0 +1.89589 0 +1.91112 0 +1.92634 0 +1.94157 0 +1.9568 0 +1.97203 0 +1.98726 0 +2.00248 0 +2.01771 0 +2.03294 0 +2.04817 0 +2.0634 0 +2.07862 0 +2.09385 0 +2.10908 0 +2.12431 0 +2.13954 0 +2.15476 0 +2.16999 0 +2.18522 0 +2.20045 0 +2.21568 0 +2.2309 0 +2.24613 0 +2.26136 0 +2.27659 0.0030276 +2.29182 0.00199167 +2.30704 0.00294819 +2.32227 0.00969885 +2.3375 0.0124448 +2.35273 0.0349627 +2.36796 0.0522381 +2.38319 0.0976197 +2.39841 0.160034 +2.41364 0.269354 +2.42887 0.425579 +2.4441 0.670716 +2.45933 0.911498 +2.47455 1.21124 +2.48978 1.65042 +2.50501 2.03301 +2.52024 2.55944 +2.53547 2.96246 +2.55069 3.34284 +2.56592 3.37557 +2.58115 3.46854 +2.59638 3.22468 +2.61161 2.9924 +2.62683 2.6076 +2.64206 2.14228 +2.65729 1.75705 +2.67252 1.3138 +2.68775 0.920997 +2.70297 0.61569 +2.7182 0.445989 +2.73343 0.277221 +2.74866 0.151618 +2.76389 0.087643 +2.77911 0.0650139 +2.79434 0.0515797 +2.80957 0.0404201 +2.8248 0.044574 +2.84003 0.0460427 +2.85525 0.05646 +2.87048 0.0628453 +2.88571 0.0722337 +2.90094 0.0795574 +2.91617 0.0897999 +2.93139 0.0967824 +2.94662 0.110243 +2.96185 0.114478 +2.97708 0.122753 +2.99231 0.141952 +3.00753 0.146879 +3.02276 0.174026 +3.03799 0.179086 +3.05322 0.177865 +3.06845 0.187215 +3.08367 0.187021 +3.0989 0.199348 +3.11413 0.201179 +3.12936 0.224864 +3.14459 0.234857 +3.15981 0.195928 
+3.17504 0.200279 +3.19027 0.21122 +3.2055 0.206672 +3.22073 0.18657 +3.23595 0.185318 +3.25118 0.179132 +3.26641 0.175995 +3.28164 0.165137 +3.29687 0.157359 +3.31209 0.14924 +3.32732 0.152602 +3.34255 0.140915 +3.35778 0.134073 +3.37301 0.14206 +3.38823 0.123472 +3.40346 0.13682 +3.41869 0.12889 +3.43392 0.132186 +3.44915 0.128383 +3.46437 0.140767 +3.4796 0.150338 +3.49483 0.166589 +3.51006 0.163873 +3.52529 0.16709 +3.54051 0.18944 +3.55574 0.199404 +3.57097 0.220267 +3.5862 0.226128 +3.60143 0.256885 +3.61665 0.285117 +3.63188 0.294627 +3.64711 0.314193 +3.66234 0.337714 +3.67757 0.317906 +3.69279 0.367454 +3.70802 0.402104 +3.72325 0.410141 +3.73848 0.459575 +3.75371 0.471445 +3.76893 0.532818 +3.78416 0.551185 +3.79939 0.61236 +3.81462 0.708128 +3.82985 0.8209 +3.84507 0.899673 +3.8603 1.03509 +3.87553 1.12309 +3.89076 1.25771 +3.90599 1.32643 +3.92122 1.42024 +3.93644 1.44843 +3.95167 1.43294 +3.9669 1.43725 +3.98213 1.39131 +3.99736 1.28907 +4.01258 1.21239 +4.02781 1.13327 +4.04304 1.04412 +4.05827 0.933079 +4.0735 0.908779 +4.08872 0.886065 +4.10395 0.854347 +4.11918 0.791012 +4.13441 0.809676 +4.14964 0.824096 +4.16486 0.817478 +4.18009 0.852244 +4.19532 0.826752 +4.21055 0.882149 +4.22578 0.871116 +4.241 0.886682 +4.25623 0.897962 +4.27146 0.886124 +4.28669 0.937907 +4.30192 0.93891 +4.31714 0.948575 +4.33237 0.958081 +4.3476 0.954426 +4.36283 1.01977 +4.37806 1.0075 +4.39328 1.02357 +4.40851 1.04907 +4.42374 1.08543 +4.43897 1.05384 +4.4542 1.09278 +4.46942 1.08954 +4.48465 1.10686 +4.49988 1.15182 +4.51511 1.14201 +4.53034 1.15983 +4.54556 1.21004 +4.56079 1.16451 +4.57602 1.14853 +4.59125 1.17492 +4.60648 1.20512 +4.6217 1.18421 +4.63693 1.16939 +4.65216 1.17456 +4.66739 1.20412 +4.68262 1.20966 +4.69784 1.19235 +4.71307 1.16135 +4.7283 1.20676 +4.74353 1.18414 +4.75876 1.14747 +4.77398 1.20006 +4.78921 1.18652 +4.80444 1.14887 +4.81967 1.12856 +4.8349 1.16577 +4.85012 1.17247 +4.86535 1.18259 +4.88058 1.16666 +4.89581 1.18255 +4.91104 1.15506 +4.92626 1.17788 +4.94149 1.18884 +4.95672 1.21348 +4.97195 1.25832 +4.98718 1.23719 +5.0024 1.28005 +5.01763 1.30657 +5.03286 1.30136 +5.04809 1.32287 +5.06332 1.35981 +5.07854 1.37905 +5.09377 1.37706 +5.109 1.37668 +5.12423 1.40795 +5.13946 1.37903 +5.15468 1.38684 +5.16991 1.39805 +5.18514 1.37254 +5.20037 1.37805 +5.2156 1.31291 +5.23082 1.33662 +5.24605 1.29124 +5.26128 1.26413 +5.27651 1.24613 +5.29174 1.22851 +5.30696 1.22742 +5.32219 1.24201 +5.33742 1.21069 +5.35265 1.19761 +5.36788 1.1861 +5.3831 1.21622 +5.39833 1.21314 +5.41356 1.17866 +5.42879 1.20542 +5.44402 1.20081 +5.45925 1.17165 +5.47447 1.18033 +5.4897 1.18507 +5.50493 1.17473 +5.52016 1.17255 +5.53539 1.18727 +5.55061 1.17567 +5.56584 1.17516 +5.58107 1.16607 +5.5963 1.15188 +5.61153 1.15743 +5.62675 1.15927 +5.64198 1.15631 +5.65721 1.13522 +5.67244 1.16262 +5.68767 1.14039 +5.70289 1.14959 +5.71812 1.14235 +5.73335 1.11688 +5.74858 1.14009 +5.76381 1.12526 +5.77903 1.12106 +5.79426 1.09929 +5.80949 1.1143 +5.82472 1.09692 +5.83995 1.1162 +5.85517 1.12154 +5.8704 1.10374 +5.88563 1.09139 +5.90086 1.09283 +5.91609 1.07704 +5.93131 1.09308 +5.94654 1.08128 +5.96177 1.08003 +5.977 1.06941 +5.99223 1.09034 +6.00745 1.05511 +6.02268 1.03392 +6.03791 1.09069 +6.05314 1.0578 +6.06837 1.05833 +6.08359 1.06067 +6.09882 1.07141 +6.11405 1.08944 +6.12928 1.01957 +6.14451 1.03711 +6.15973 1.03791 +6.17496 1.04213 +6.19019 1.03319 +6.20542 1.03763 +6.22065 1.01133 +6.23587 1.02833 +6.2511 1.04327 +6.26633 1.02901 +6.28156 1.03675 +6.29679 1.01697 +6.31201 1.00327 +6.32724 
1.02745 +6.34247 1.0159 +6.3577 1.02553 +6.37293 1.03145 +6.38815 1.03436 +6.40338 1.05381 +6.41861 1.04716 +6.43384 1.03905 +6.44907 1.0334 +6.46429 1.02941 +6.47952 1.03156 +6.49475 1.03193 +6.50998 1.02662 +6.52521 1.02023 +6.54043 1.01708 +6.55566 0.998726 +6.57089 0.990953 +6.58612 0.972268 +6.60135 0.965387 +6.61657 0.9901 +6.6318 0.967006 +6.64703 0.971459 +6.66226 0.952646 +6.67749 0.956869 +6.69271 0.94131 +6.70794 0.947619 +6.72317 0.95768 +6.7384 0.950131 +6.75363 0.956057 +6.76885 0.939089 +6.78408 0.935446 +6.79931 0.956264 +6.81454 0.940506 +6.82977 0.943382 +6.84499 0.924565 +6.86022 0.948027 +6.87545 0.930665 +6.89068 0.965112 +6.90591 0.932345 +6.92113 0.933379 +6.93636 0.91787 +6.95159 0.959205 +6.96682 0.931632 +6.98205 0.951177 +6.99728 0.925249 +7.0125 0.933041 +7.02773 0.916823 +7.04296 0.936061 +7.05819 0.929926 +7.07342 0.911918 +7.08864 0.910086 +7.10387 0.916035 +7.1191 0.892408 +7.13433 0.918404 +7.14956 0.913575 +7.16478 0.915096 +7.18001 0.909697 +7.19524 0.919288 +7.21047 0.908366 +7.2257 0.906946 +7.24092 0.909819 +7.25615 0.892395 +7.27138 0.900532 +7.28661 0.90091 +7.30184 0.881557 +7.31706 0.901241 +7.33229 0.921629 +7.34752 0.911516 +7.36275 0.898872 +7.37798 0.885941 +7.3932 0.904305 +7.40843 0.904212 +7.42366 0.890826 +7.43889 0.898525 +7.45412 0.908131 +7.46934 0.893931 +7.48457 0.902622 +7.4998 0.900913 +7.51503 0.891802 +7.53026 0.9072 +7.54548 0.90538 +7.56071 0.916834 +7.57594 0.90978 +7.59117 0.902775 +7.6064 0.915618 +7.62162 0.921957 +7.63685 0.930392 +7.65208 0.914901 +7.66731 0.928264 +7.68254 0.945414 +7.69776 0.933115 +7.71299 0.928379 +7.72822 0.930241 +7.74345 0.926761 +7.75868 0.913742 +7.7739 0.920898 +7.78913 0.918335 +7.80436 0.918705 +7.81959 0.926679 +7.83482 0.91882 +7.85004 0.924425 +7.86527 0.92347 +7.8805 0.929338 +7.89573 0.920387 +7.91096 0.910245 +7.92618 0.929063 +7.94141 0.915551 +7.95664 0.929648 +7.97187 0.909968 +7.9871 0.924868 +8.00232 0.938585 +8.01755 0.938848 +8.03278 0.92832 +8.04801 0.933774 +8.06324 0.943364 +8.07846 0.940132 +8.09369 0.93021 +8.10892 0.943742 +8.12415 0.946944 +8.13938 0.948457 +8.1546 0.950266 +8.16983 0.951507 +8.18506 0.940788 +8.20029 0.95192 +8.21552 0.951572 +8.23074 0.951606 +8.24597 0.948094 +8.2612 0.94169 +8.27643 0.971444 +8.29166 0.972215 +8.30688 0.966304 +8.32211 0.96179 +8.33734 0.95399 +8.35257 0.967459 +8.3678 0.956247 +8.38302 0.973691 +8.39825 0.967715 +8.41348 0.967023 +8.42871 0.989375 +8.44394 0.959033 +8.45916 0.992278 +8.47439 0.964024 +8.48962 0.973777 +8.50485 0.973258 +8.52008 0.971151 +8.5353 0.98413 +8.55053 0.978553 +8.56576 1.00359 +8.58099 0.997689 +8.59622 0.989557 +8.61145 0.987965 +8.62667 0.995936 +8.6419 0.99299 +8.65713 0.997037 +8.67236 0.998128 +8.68759 0.990266 +8.70281 0.995851 +8.71804 1.00834 +8.73327 1.01223 +8.7485 1.00208 +8.76373 0.998809 +8.77895 1.01231 +8.79418 1.00293 +8.80941 1.00526 +8.82464 1.003 +8.83987 1.02445 +8.85509 1.01159 +8.87032 1.01443 +8.88555 1.01672 +8.90078 1.01338 +8.91601 1.02644 +8.93123 1.00937 +8.94646 1.02096 +8.96169 1.02023 +8.97692 1.02391 +8.99215 1.03992 +9.00737 1.02596 +9.0226 1.03413 +9.03783 1.01874 +9.05306 1.01531 +9.06829 1.03073 +9.08351 1.03628 +9.09874 1.02732 +9.11397 1.0465 +9.1292 1.02776 +9.14443 1.02446 +9.15965 1.0465 +9.17488 1.03725 +9.19011 1.03839 +9.20534 1.02212 +9.22057 1.02933 +9.23579 1.04286 +9.25102 1.03864 +9.26625 1.0345 +9.28148 1.03948 +9.29671 1.03959 +9.31193 1.03402 +9.32716 1.0339 +9.34239 1.04221 +9.35762 1.04486 +9.37285 1.05873 +9.38807 1.04438 +9.4033 1.03402 +9.41853 1.04866 
+9.43376 1.04222 +9.44899 1.04402 +9.46421 1.04534 +9.47944 1.03581 +9.49467 1.04015 +9.5099 1.01982 +9.52513 1.05334 +9.54035 1.04228 +9.55558 1.04641 +9.57081 1.03052 +9.58604 1.04205 +9.60127 1.03761 +9.61649 1.05254 +9.63172 1.04555 +9.64695 1.04298 +9.66218 1.04681 +9.67741 1.04089 +9.69263 1.04815 +9.70786 1.04381 +9.72309 1.03833 +9.73832 1.04771 +9.75355 1.03961 +9.76877 1.03818 +9.784 1.0415 +9.79923 1.04519 +9.81446 1.04151 +9.82969 1.03856 +9.84491 1.03918 +9.86014 1.05077 +9.87537 1.03063 +9.8906 1.03794 +9.90583 1.04232 +9.92105 1.0421 +9.93628 1.05062 +9.95151 1.03346 +9.96674 1.03362 +9.98197 1.03882 +9.99719 1.02744 +10.0124 1.02787 +10.0277 1.02953 +10.0429 1.022 +10.0581 1.03116 +10.0733 1.03738 +10.0886 1.03841 +10.1038 1.04333 +10.119 1.0361 +10.1342 1.02148 +10.1495 1.0205 +10.1647 1.04144 +10.1799 1.03162 +10.1952 1.03372 +10.2104 1.03164 +10.2256 1.04078 +10.2408 1.04197 +10.2561 1.02764 +10.2713 1.02886 +10.2865 1.02384 +10.3018 1.01692 +10.317 1.02414 +10.3322 1.02132 +10.3474 1.01519 +10.3627 1.02152 +10.3779 1.02537 +10.3931 1.03965 +10.4084 1.02686 +10.4236 1.01534 +10.4388 1.02145 +10.454 0.999813 +10.4693 1.00545 +10.4845 1.0188 +10.4997 1.0196 +10.5149 1.01774 +10.5302 1.01729 +10.5454 1.01704 +10.5606 1.00468 +10.5759 1.00492 +10.5911 1.01043 +10.6063 1.01423 +10.6215 1.02653 +10.6368 1.01957 +10.652 1.0155 +10.6672 1.015 +10.6825 1.00683 +10.6977 1.01174 +10.7129 1.01652 +10.7281 1.00918 +10.7434 1.00501 +10.7586 1.00538 +10.7738 1.00632 +10.7891 1.01332 +10.8043 1.00993 +10.8195 1.0083 +10.8347 0.992809 +10.85 1.00651 +10.8652 1.00201 +10.8804 1.01025 +10.8956 1.0169 +10.9109 1.00374 +10.9261 1.00002 +10.9413 0.997109 +10.9566 1.01582 +10.9718 0.992798 +10.987 1.00283 +11.0022 0.982988 +11.0175 0.998629 +11.0327 0.993983 +11.0479 0.99613 +11.0632 1.00155 +11.0784 0.994582 +11.0936 0.990791 +11.1088 1.01194 +11.1241 0.992389 +11.1393 0.989425 +11.1545 0.988659 +11.1698 0.992673 +11.185 0.992648 +11.2002 0.994037 +11.2154 0.995913 +11.2307 0.991514 +11.2459 0.988707 +11.2611 0.997336 +11.2763 0.99666 +11.2916 0.993194 +11.3068 0.987206 +11.322 0.994386 +11.3373 0.999408 +11.3525 0.984634 +11.3677 0.988231 +11.3829 0.988132 +11.3982 1.00011 +11.4134 0.98672 +11.4286 0.983892 +11.4439 0.98491 +11.4591 0.982931 +11.4743 0.986641 +11.4895 0.980461 +11.5048 0.968936 +11.52 0.973432 +11.5352 0.978253 +11.5505 0.980106 +11.5657 0.979677 +11.5809 0.989192 +11.5961 0.983523 +11.6114 0.984165 +11.6266 0.98658 +11.6418 0.975048 +11.657 0.971116 +11.6723 0.979756 +11.6875 0.977166 +11.7027 0.99796 +11.718 0.985845 +11.7332 0.977474 +11.7484 0.978997 +11.7636 0.9858 +11.7789 0.980538 +11.7941 0.986055 +11.8093 0.978488 +11.8246 0.979635 +11.8398 0.990661 +11.855 0.967723 +11.8702 0.987626 +11.8855 0.974989 +11.9007 0.984461 +11.9159 0.976494 +11.9312 0.971652 +11.9464 0.976983 +11.9616 0.985866 +11.9768 0.976214 +11.9921 0.976609 +12.0073 0.971812 +12.0225 0.982596 +12.0377 0.974517 +12.053 0.979185 +12.0682 0.977757 +12.0834 0.986757 +12.0987 0.972054 +12.1139 0.993208 +12.1291 0.998182 +12.1443 0.990041 +12.1596 0.98307 +12.1748 0.979413 +12.19 0.981015 +12.2053 0.980816 +12.2205 0.989581 +12.2357 0.97856 +12.2509 0.981112 +12.2662 0.993313 +12.2814 0.974518 +12.2966 0.984698 +12.3119 0.986853 +12.3271 0.983314 +12.3423 0.984083 +12.3575 0.979056 +12.3728 0.977399 +12.388 0.992448 +12.4032 0.982261 +12.4184 0.981787 +12.4337 0.983105 +12.4489 0.989612 +12.4641 0.990494 +12.4794 0.980085 +12.4946 0.996661 +12.5098 0.985045 +12.525 0.986719 +12.5403 0.990078 +12.5555 
0.990133 +12.5707 1.00157 +12.586 0.987559 +12.6012 0.984581 +12.6164 0.995218 +12.6316 0.992296 +12.6469 0.991052 +12.6621 0.980677 +12.6773 1.00429 +12.6926 0.991009 +12.7078 0.992101 +12.723 0.991408 +12.7382 0.983333 +12.7535 0.99848 +12.7687 0.985898 +12.7839 0.987423 +12.7992 0.99989 +12.8144 0.995572 +12.8296 0.994036 +12.8448 0.993298 +12.8601 0.995881 +12.8753 0.997597 +12.8905 0.983185 +12.9057 0.98454 +12.921 0.982503 +12.9362 1.00241 +12.9514 0.998406 +12.9667 0.999609 +12.9819 0.984912 +12.9971 1.00075 +13.0123 1.0002 +13.0276 0.998851 +13.0428 0.993752 +13.058 1.00349 +13.0733 1.0003 +13.0885 0.994094 +13.1037 0.995654 +13.1189 1.00459 +13.1342 1.00538 +13.1494 1.00399 +13.1646 0.99627 +13.1799 1.00068 +13.1951 1.00159 +13.2103 1.00441 +13.2255 1.00293 +13.2408 0.988753 +13.256 1.00169 +13.2712 0.993724 +13.2864 1.01453 +13.3017 1.00304 +13.3169 0.995058 +13.3321 1.0002 +13.3474 1.0015 +13.3626 1.00121 +13.3778 1.00273 +13.393 0.99926 +13.4083 1.00325 +13.4235 0.995718 +13.4387 1.02358 +13.454 0.999682 +13.4692 0.999181 +13.4844 1.00656 +13.4996 1.01058 +13.5149 1.00097 +13.5301 1.00441 +13.5453 1.00263 +13.5606 1.00886 +13.5758 0.999558 +13.591 1.0046 +13.6062 1.01246 +13.6215 1.00696 +13.6367 1.01056 +13.6519 1.01443 +13.6671 1.00576 +13.6824 1.00771 +13.6976 1.01537 +13.7128 0.998899 +13.7281 1.00204 +13.7433 1.00561 +13.7585 1.00836 +13.7737 1.01148 +13.789 1.00545 +13.8042 1.01114 +13.8194 1.00184 +13.8347 1.01227 +13.8499 1.01124 +13.8651 1.00312 +13.8803 1.00778 +13.8956 0.995527 +13.9108 1.00581 +13.926 1.0065 +13.9413 1.00925 +13.9565 1.00606 +13.9717 1.01702 +13.9869 1.00879 +14.0022 1.00281 +14.0174 1.01072 +14.0326 1.01095 +14.0478 1.01793 +14.0631 1.0039 +14.0783 1.00854 +14.0935 1.00036 +14.1088 1.0044 +14.124 1.00764 +14.1392 1.0119 +14.1544 1.00936 +14.1697 1.00897 +14.1849 1.00069 +14.2001 1.00905 +14.2154 1.00775 +14.2306 1.01383 +14.2458 1.00666 +14.261 1.01318 +14.2763 1.00309 +14.2915 1.00106 +14.3067 1.00887 +14.322 1.00323 +14.3372 1.0128 +14.3524 1.00842 +14.3676 1.00231 +14.3829 1.00777 +14.3981 1.00567 +14.4133 1.00274 +14.4285 1.00474 +14.4438 1.00987 +14.459 1.00369 +14.4742 1.0025 +14.4895 1.00707 +14.5047 1.00872 +14.5199 1.01052 +14.5351 1.00823 +14.5504 1.00837 +14.5656 1.0055 +14.5808 1.00475 +14.5961 1.01167 +14.6113 0.998363 +14.6265 1.01088 +14.6417 1.01227 +14.657 1.00522 +14.6722 1.00732 +14.6874 1.00862 +14.7027 1.01171 +14.7179 1.00598 +14.7331 1.00163 +14.7483 1.00471 +14.7636 1.00259 +14.7788 1.0033 +14.794 1.00509 +14.8092 1.002 +14.8245 0.995416 +14.8397 1.00489 +14.8549 1.00615 +14.8702 1.00987 +14.8854 1.01387 +14.9006 1.01067 +14.9158 1.00599 +14.9311 1.00329 +14.9463 1.00082 +14.9615 0.999999 +14.9768 1.00249 +14.992 0.999873 +15.0072 0.999657 +15.0224 1.00289 +15.0377 1.00172 +15.0529 0.9996 +15.0681 1.00539 +15.0834 1.00327 +15.0986 1.00432 +15.1138 1.00793 +15.129 1.00771 +15.1443 1.00527 +15.1595 0.998929 +15.1747 1.00272 +15.1899 0.996197 +15.2052 1.00656 +15.2204 1.00242 +15.2356 1.0088 +15.2509 0.999342 +15.2661 1.00018 +15.2813 1.00681 +15.2965 1.00264 +15.3118 1.00544 +15.327 1.00505 +15.3422 0.99463 +15.3575 1.0069 +15.3727 1.00132 +15.3879 1.00035 +15.4031 1.00622 +15.4184 1.00362 +15.4336 0.999379 +15.4488 1.00344 +15.4641 1.00498 +15.4793 1.00211 +15.4945 0.993869 +15.5097 1.0079 +15.525 1.0023 +15.5402 0.999838 +15.5554 1.0038 +15.5707 1.00247 +15.5859 0.997689 +15.6011 1.00086 +15.6163 0.99927 +15.6316 0.995719 +15.6468 0.999059 +15.662 1.00104 +15.6772 0.995755 +15.6925 0.992995 +15.7077 0.990329 +15.7229 1.00235 
+15.7382 0.993933 +15.7534 1.00392 +15.7686 1.00331 +15.7838 1.00066 +15.7991 0.990642 +15.8143 0.995261 +15.8295 1.00088 +15.8448 0.996147 +15.86 1.00313 +15.8752 0.999966 +15.8904 0.997574 +15.9057 1.00813 +15.9209 0.997475 +15.9361 1.00762 +15.9514 0.991799 +15.9666 1.00476 +15.9818 0.990112 +15.997 1.00309 +16.0123 0.999303 +16.0275 0.988324 +16.0427 0.996407 +16.0579 0.988229 +16.0732 0.99393 +16.0884 0.995404 +16.1036 0.998645 +16.1189 1.00401 +16.1341 0.994678 +16.1493 0.989233 +16.1645 0.996978 +16.1798 0.998579 +16.195 0.990041 +16.2102 0.998632 +16.2255 0.995467 +16.2407 0.995069 +16.2559 0.993938 +16.2711 0.991545 +16.2864 0.998999 +16.3016 1.00148 +16.3168 0.988161 +16.3321 0.995281 +16.3473 0.995091 +16.3625 0.992361 +16.3777 0.992252 +16.393 0.988035 +16.4082 1.00054 +16.4234 0.998491 +16.4386 0.99308 +16.4539 0.996208 +16.4691 0.997626 +16.4843 0.992839 +16.4996 0.994888 +16.5148 0.986361 +16.53 0.997887 +16.5452 0.998764 +16.5605 1.0012 +16.5757 0.994202 +16.5909 0.999865 +16.6062 1.00454 +16.6214 0.994557 +16.6366 1.00695 +16.6518 1.00094 +16.6671 0.994949 +16.6823 0.996254 +16.6975 0.996389 +16.7128 0.996278 +16.728 0.996072 +16.7432 0.995959 +16.7584 0.994522 +16.7737 0.991472 +16.7889 1.00073 +16.8041 1.00351 +16.8193 0.992454 +16.8346 0.997322 +16.8498 0.996165 +16.865 0.9949 +16.8803 0.996043 +16.8955 1.00158 +16.9107 0.998896 +16.9259 1.00849 +16.9412 1.00116 +16.9564 0.997542 +16.9716 0.998985 +16.9869 1.00118 +17.0021 0.997707 +17.0173 0.993592 +17.0325 0.996901 +17.0478 0.999206 +17.063 0.993668 +17.0782 0.994641 +17.0935 0.999475 +17.1087 1.0005 +17.1239 1.00145 +17.1391 0.990611 +17.1544 0.999606 +17.1696 0.99645 +17.1848 1.00269 +17.2 0.993685 +17.2153 0.996516 +17.2305 1.00043 +17.2457 0.998152 +17.261 0.998165 +17.2762 0.998825 +17.2914 0.994162 +17.3066 0.996814 +17.3219 0.995219 +17.3371 1.00938 +17.3523 1.00068 +17.3676 0.998782 +17.3828 0.990905 +17.398 1.00403 +17.4132 0.999689 +17.4285 0.999786 +17.4437 0.996597 +17.4589 0.996798 +17.4742 0.999653 +17.4894 0.99952 +17.5046 0.994026 +17.5198 0.99402 +17.5351 0.99175 +17.5503 0.997179 +17.5655 1.00362 +17.5807 0.997315 +17.596 0.999813 +17.6112 1.00139 +17.6264 1.00496 +17.6417 0.994087 +17.6569 0.997725 +17.6721 0.989625 +17.6873 1.00586 +17.7026 0.993616 +17.7178 0.997124 +17.733 0.998224 +17.7483 0.998886 +17.7635 0.999942 +17.7787 0.992025 +17.7939 1.00268 +17.8092 1.00168 +17.8244 0.994124 +17.8396 0.998065 +17.8549 0.999267 +17.8701 1.00212 +17.8853 0.994297 +17.9005 0.998222 +17.9158 1.00228 +17.931 1.00492 +17.9462 1.00299 +17.9614 0.998531 +17.9767 0.999786 +17.9919 0.999468 +18.0071 1.00673 +18.0224 1.00307 +18.0376 1.00901 +18.0528 1.00716 +18.068 1.00407 +18.0833 0.998623 +18.0985 1.00284 +18.1137 1.00499 +18.129 1.00515 +18.1442 0.998237 +18.1594 0.994327 +18.1746 0.991965 +18.1899 0.991822 +18.2051 0.998133 +18.2203 1.00003 +18.2356 1.00119 +18.2508 0.999832 +18.266 1.00296 +18.2812 1.002 +18.2965 1.00263 +18.3117 1.00378 +18.3269 0.999873 +18.3421 1.00774 +18.3574 0.999073 +18.3726 0.997758 +18.3878 1.0035 +18.4031 0.999818 +18.4183 1.00951 +18.4335 1.00258 +18.4487 1.00002 +18.464 0.999983 +18.4792 1.00451 +18.4944 1.00087 +18.5097 1.00218 +18.5249 1.0007 +18.5401 1.0021 +18.5553 0.995278 +18.5706 1.00548 +18.5858 0.997546 +18.601 1.00442 +18.6163 0.999869 +18.6315 1.00298 +18.6467 0.994695 +18.6619 0.996902 +18.6772 1.00595 +18.6924 1.00114 +18.7076 1.00758 +18.7229 1.00893 +18.7381 1.00742 +18.7533 1.00933 +18.7685 0.99939 +18.7838 0.997533 +18.799 0.998567 +18.8142 1.00338 +18.8294 
0.99885 +18.8447 1.00266 +18.8599 1.00529 +18.8751 1.00437 +18.8904 1.00419 +18.9056 0.99647 +18.9208 1.00137 +18.936 1.00081 +18.9513 1.00005 +18.9665 0.995256 +18.9817 1.00025 +18.997 1.00585 +19.0122 1.00214 +19.0274 1.00047 +19.0426 1.0036 +19.0579 1.00061 +19.0731 1.00344 +19.0883 1.00321 +19.1036 0.997911 +19.1188 0.996393 +19.134 0.999137 +19.1492 1.00116 +19.1645 1.00109 +19.1797 1.00899 +19.1949 1.00364 +19.2101 1.0008 +19.2254 0.999186 +19.2406 0.995161 +19.2558 1.00288 +19.2711 1.00665 +19.2863 1.00518 +19.3015 1.00567 +19.3167 0.999429 +19.332 1.00221 +19.3472 1.00257 +19.3624 1.00423 +19.3777 1.00342 +19.3929 0.999955 +19.4081 1.00227 +19.4233 1.00342 +19.4386 1.00159 +19.4538 1.00671 +19.469 0.999214 +19.4843 0.998714 +19.4995 1.00289 +19.5147 0.996272 +19.5299 0.997873 +19.5452 1.00403 +19.5604 1.00298 +19.5756 1.00327 +19.5908 1.00311 +19.6061 0.999254 +19.6213 0.992405 +19.6365 1.0052 +19.6518 0.999625 +19.667 0.999106 +19.6822 0.999221 +19.6974 1.00876 +19.7127 0.99767 +19.7279 1.00024 +19.7431 0.999935 +19.7584 1.00093 +19.7736 0.999372 +19.7888 0.99678 +19.804 0.999702 +19.8193 0.999524 +19.8345 1.00203 +19.8497 1.00823 +19.865 0.998975 +19.8802 0.998319 +19.8954 1.00046 +19.9106 1.00297 +19.9259 1.00094 +19.9411 1.00732 +19.9563 0.996415 +19.9715 1.0019 +19.9868 1.00273 +20.002 1.00234 +20.0172 1.00121 +20.0325 1.00486 +20.0477 1.00002 +20.0629 0.999032 +20.0781 0.998205 +20.0934 0.997586 +20.1086 1.00454 +20.1238 1.00902 +20.1391 0.998134 +20.1543 0.997283 +20.1695 0.995032 +20.1847 0.998024 +20.2 1.00387 +20.2152 0.998808 +20.2304 1.00055 +20.2457 0.999123 +20.2609 1.0006 +20.2761 1.00313 +20.2913 1.00018 +20.3066 1.00623 +20.3218 1.00005 +20.337 0.99549 +20.3522 1.00021 +20.3675 0.996335 +20.3827 0.999291 +20.3979 0.99702 +20.4132 0.994805 +20.4284 0.999715 +20.4436 1.00425 +20.4588 0.993117 +20.4741 1.00318 +20.4893 0.9952 +20.5045 1.002 +20.5198 0.999938 +20.535 1.00127 +20.5502 0.993471 +20.5654 1.00144 +20.5807 0.997991 +20.5959 1.00089 +20.6111 0.998701 +20.6264 0.999329 +20.6416 0.997229 +20.6568 1.0031 +20.672 0.996631 +20.6873 0.991754 +20.7025 1.00318 +20.7177 0.998443 +20.7329 1.00324 +20.7482 0.996948 +20.7634 0.997379 +20.7786 0.996814 +20.7939 0.993044 +20.8091 0.996725 +20.8243 0.995521 +20.8395 1.00014 +20.8548 1.00374 +20.87 1.00182 +20.8852 0.9985 +20.9005 0.995106 +20.9157 0.998739 +20.9309 0.997573 +20.9461 1.00022 +20.9614 0.998283 +20.9766 0.994195 +20.9918 0.99083 +21.0071 0.991978 +21.0223 0.993855 +21.0375 1.00076 +21.0527 0.994345 +21.068 0.997493 +21.0832 0.994287 +21.0984 1.0028 +21.1136 1.00074 +21.1289 0.994069 +21.1441 0.998804 +21.1593 1.00031 +21.1746 1.00117 +21.1898 0.996983 +21.205 1.00341 +21.2202 0.996365 +21.2355 1.00166 +21.2507 1.00267 +21.2659 1.00139 +21.2812 0.998731 +21.2964 1.00212 +21.3116 0.99536 +21.3268 1.00053 +21.3421 1.00149 +21.3573 0.999913 +21.3725 0.990863 +21.3878 1.00145 +21.403 0.996825 +21.4182 0.999125 +21.4334 0.995099 +21.4487 0.989809 +21.4639 0.999203 +21.4791 1.00341 +21.4943 1.00554 +21.5096 0.998038 +21.5248 0.999866 +21.54 0.998002 +21.5553 0.998078 +21.5705 0.995973 +21.5857 0.998171 +21.6009 0.998334 +21.6162 0.995652 +21.6314 0.996363 +21.6466 0.999092 +21.6619 0.99693 +21.6771 1.00172 +21.6923 0.998679 +21.7075 0.992616 +21.7228 1.00376 +21.738 1.00014 +21.7532 0.99862 +21.7685 0.999707 +21.7837 1.00186 +21.7989 0.997257 +21.8141 1.00062 +21.8294 0.998933 +21.8446 1.00082 +21.8598 0.999632 +21.8751 0.995114 +21.8903 0.994505 +21.9055 0.996186 +21.9207 1.00017 +21.936 1.00553 +21.9512 1.00206 
+21.9664 1.00012 +21.9816 1.00268 +21.9969 0.991884 +22.0121 0.997788 +22.0273 0.99504 +22.0426 0.996637 +22.0578 0.997132 +22.073 1.00624 +22.0882 1.0025 +22.1035 1.00008 +22.1187 0.997431 +22.1339 1.00277 +22.1492 1.0005 +22.1644 1.00028 +22.1796 1.0045 +22.1948 1.0003 +22.2101 0.998891 +22.2253 0.993679 +22.2405 0.996379 +22.2558 0.999768 +22.271 1.00507 +22.2862 0.995205 +22.3014 1.00067 +22.3167 1.00208 +22.3319 0.995291 +22.3471 0.993997 +22.3623 0.995531 +22.3776 0.997707 +22.3928 1.00108 +22.408 1.00152 +22.4233 0.989714 +22.4385 1.00307 +22.4537 1.0033 +22.4689 1.0017 +22.4842 0.99486 +22.4994 0.99589 +22.5146 0.995885 +22.5299 1.00418 +22.5451 1.00161 +22.5603 1.00012 +22.5755 0.999632 +22.5908 0.999197 +22.606 0.997626 +22.6212 1.00919 +22.6365 0.999312 +22.6517 0.999223 +22.6669 0.999326 +22.6821 0.993461 +22.6974 1.00269 +22.7126 1.00336 +22.7278 1.00175 +22.743 1.0048 +22.7583 0.999003 +22.7735 0.999332 +22.7887 1.00336 +22.804 1.00058 +22.8192 1.00452 +22.8344 1.00517 +22.8496 0.995398 +22.8649 1.00111 +22.8801 0.998855 +22.8953 1.00567 +22.9106 0.999649 +22.9258 1.00115 +22.941 1.00112 +22.9562 1.00368 +22.9715 1.00887 +22.9867 1.00259 +23.0019 1.00518 +23.0172 1.00248 +23.0324 0.998291 +23.0476 1.00694 +23.0628 0.999678 +23.0781 1.00071 +23.0933 1.00532 +23.1085 1.00449 +23.1237 1.00214 +23.139 1.00212 +23.1542 0.999775 +23.1694 0.997448 +23.1847 0.99859 +23.1999 1.00529 +23.2151 1.00113 +23.2303 0.996896 +23.2456 1.00051 +23.2608 0.999121 +23.276 1.00335 +23.2913 1.00349 +23.3065 0.99828 +23.3217 1.0034 +23.3369 0.999027 +23.3522 1.00479 +23.3674 1.00243 +23.3826 1.00185 +23.3979 1.00283 +23.4131 1.00216 +23.4283 0.99977 +23.4435 0.997196 +23.4588 1.00571 +23.474 1.00058 +23.4892 0.995492 +23.5044 1.00218 +23.5197 1.00056 +23.5349 1.00404 +23.5501 1.00583 +23.5654 1.00327 +23.5806 1.00159 +23.5958 1.00094 +23.611 0.997837 +23.6263 0.999812 +23.6415 0.997139 +23.6567 1.00263 +23.672 1.00012 +23.6872 0.99408 +23.7024 1.00205 +23.7176 0.991222 +23.7329 0.999255 +23.7481 0.998864 +23.7633 0.999196 +23.7786 1.00195 +23.7938 1.00254 +23.809 1.00246 +23.8242 1.00131 +23.8395 0.995023 +23.8547 0.998827 +23.8699 0.999389 +23.8851 0.998125 +23.9004 0.996176 +23.9156 1.00486 +23.9308 1.00284 +23.9461 0.998025 +23.9613 0.995171 +23.9765 0.992233 +23.9917 0.999198 +24.007 0.996588 +24.0222 1.00457 +24.0374 0.997793 +24.0527 0.996594 +24.0679 1.00008 +24.0831 1.00188 +24.0983 0.9997 +24.1136 1.0018 +24.1288 0.999109 +24.144 1.00026 +24.1593 0.997703 +24.1745 0.998935 +24.1897 1.00307 +24.2049 1.00645 +24.2202 0.99525 +24.2354 0.996948 +24.2506 1.00168 +24.2658 1.00102 +24.2811 1.00543 +24.2963 0.995561 +24.3115 1.00096 +24.3268 0.997224 +24.342 1.00214 +24.3572 1.00126 +24.3724 1.00095 +24.3877 1.00089 +24.4029 0.999522 +24.4181 0.998522 +24.4334 0.999231 +24.4486 0.998617 +24.4638 1.00612 +24.479 0.996587 +24.4943 1.00999 +24.5095 1.00448 +24.5247 1.00117 +24.54 0.999463 +24.5552 1.00132 +24.5704 1.00034 +24.5856 1.00558 +24.6009 1.00302 +24.6161 0.99512 +24.6313 0.993028 +24.6465 1.00115 +24.6618 0.997173 +24.677 0.998631 +24.6922 1.00296 +24.7075 1 +24.7227 0.998358 +24.7379 0.999976 +24.7531 0.997337 +24.7684 0.998541 +24.7836 1.00185 +24.7988 0.994389 +24.8141 0.995106 +24.8293 0.998645 +24.8445 0.99787 +24.8597 1.00206 +24.875 0.996485 +24.8902 0.996423 +24.9054 1.00047 +24.9207 1.00072 +24.9359 0.999698 +24.9511 1.00333 +24.9663 0.996883 +24.9816 0.996674 +24.9968 1.0036 +25.012 0.995694 +25.0273 1.00058 +25.0425 0.995443 +25.0577 1.00488 +25.0729 1.00547 +25.0882 0.997571 
+25.1034 0.998793 +25.1186 1.00479 +25.1338 0.995745 +25.1491 0.998277 +25.1643 1.00117 +25.1795 1.00319 +25.1948 0.995023 +25.21 0.998002 +25.2252 0.996658 +25.2404 0.99796 +25.2557 0.996297 +25.2709 0.999635 +25.2861 1.0053 +25.3014 1.00365 +25.3166 0.999025 +25.3318 0.996284 +25.347 1.00439 +25.3623 1.00172 +25.3775 0.997191 +25.3927 0.997091 +25.408 0.996893 +25.4232 1.00106 +25.4384 1.00067 +25.4536 0.999308 +25.4689 1.0026 +25.4841 0.99746 +25.4993 0.998779 +25.5145 0.999765 +25.5298 1.0032 +25.545 0.999851 +25.5602 0.996915 +25.5755 0.99947 +25.5907 1.00142 +25.6059 1.00072 +25.6211 1.00004 +25.6364 0.996358 +25.6516 0.99946 +25.6668 0.998759 +25.6821 0.99698 +25.6973 1.00646 +25.7125 0.996471 +25.7277 0.999867 +25.743 1.00194 +25.7582 0.994721 +25.7734 1.00047 +25.7887 0.996856 +25.8039 0.998782 +25.8191 1.00011 +25.8343 0.99849 +25.8496 1.00178 +25.8648 0.999791 +25.88 0.99774 +25.8952 1.00005 +25.9105 1.0004 +25.9257 0.999846 +25.9409 0.998851 +25.9562 0.998269 +25.9714 1.00212 +25.9866 0.9982 +26.0018 1.00176 +26.0171 0.999898 +26.0323 0.995255 +26.0475 0.99585 +26.0628 0.998067 +26.078 0.996079 +26.0932 1.00135 +26.1084 1.0028 +26.1237 0.996254 +26.1389 0.997658 +26.1541 0.994967 +26.1694 0.995345 +26.1846 1.00178 +26.1998 1.00647 +26.215 1.00109 +26.2303 1.00039 +26.2455 0.998728 +26.2607 0.997047 +26.2759 1.00197 +26.2912 0.999937 +26.3064 0.998281 +26.3216 0.994619 +26.3369 0.998552 +26.3521 1.00381 +26.3673 0.999926 +26.3825 1.00236 +26.3978 1.00152 +26.413 1.00115 +26.4282 1.00156 +26.4435 0.999225 +26.4587 0.999091 +26.4739 0.999024 +26.4891 0.999396 +26.5044 0.998598 +26.5196 0.999244 +26.5348 0.99708 +26.5501 1.00276 +26.5653 1.00152 +26.5805 1.00352 +26.5957 1.00357 +26.611 1.00525 +26.6262 0.997804 +26.6414 0.998594 +26.6566 0.995068 +26.6719 0.999998 +26.6871 1.00021 +26.7023 1.0027 +26.7176 0.9993 +26.7328 1.00225 +26.748 0.998992 +26.7632 0.996468 +26.7785 1.00607 +26.7937 1.00099 +26.8089 0.998228 +26.8242 0.998621 +26.8394 1.00451 +26.8546 0.996466 +26.8698 0.999503 +26.8851 1.0021 +26.9003 0.99847 +26.9155 0.99851 +26.9308 0.996003 +26.946 0.9992 +26.9612 0.998245 +26.9764 0.998857 +26.9917 1.00017 +27.0069 1.00556 +27.0221 1.00187 +27.0373 1.0026 +27.0526 0.997995 +27.0678 1.0002 +27.083 1.00025 +27.0983 1.00026 +27.1135 1.00449 +27.1287 1.00232 +27.1439 1.00199 +27.1592 0.997055 +27.1744 0.999161 +27.1896 1.00601 +27.2049 0.996989 +27.2201 1.00062 +27.2353 1.00093 +27.2505 1.00147 +27.2658 0.998069 +27.281 0.998241 +27.2962 1.00216 +27.3115 0.99927 +27.3267 0.998969 +27.3419 1.00493 +27.3571 0.998514 +27.3724 0.998109 +27.3876 1.00217 +27.4028 1.00006 +27.418 0.997144 +27.4333 1.0037 +27.4485 0.995328 +27.4637 0.999169 +27.479 0.998242 +27.4942 1.00106 +27.5094 1.00297 +27.5246 0.999329 +27.5399 1.00397 +27.5551 0.996963 +27.5703 1.00288 +27.5856 0.997821 +27.6008 1.00177 +27.616 1.0011 +27.6312 1.0005 +27.6465 1.00079 +27.6617 0.99898 +27.6769 0.999213 +27.6922 1.00073 +27.7074 0.995655 +27.7226 1.00389 +27.7378 1.00037 +27.7531 1.00112 +27.7683 1.00657 +27.7835 0.999549 +27.7987 1.00039 +27.814 0.999822 +27.8292 1.004 +27.8444 1.00202 +27.8597 1.00219 +27.8749 0.998042 +27.8901 1.00101 +27.9053 1.00378 +27.9206 0.999355 +27.9358 0.996409 +27.951 0.998745 +27.9663 0.998534 +27.9815 0.999825 +27.9967 0.998525 +28.0119 1.00447 +28.0272 1.00324 +28.0424 0.999877 +28.0576 1.00488 +28.0729 1.00332 +28.0881 1.00015 +28.1033 1.00115 +28.1185 1.00392 +28.1338 1.0026 +28.149 1.00011 +28.1642 1.00127 +28.1795 0.999719 +28.1947 0.998751 +28.2099 0.998337 +28.2251 
0.994791 +28.2404 0.998638 +28.2556 0.998289 +28.2708 1.00161 +28.286 0.997075 +28.3013 1.00081 +28.3165 1.0002 +28.3317 0.992318 +28.347 1.00077 +28.3622 1.00315 +28.3774 1.00343 +28.3926 1.00571 +28.4079 0.998393 +28.4231 0.998101 +28.4383 0.999632 +28.4536 1.00026 +28.4688 0.999534 +28.484 0.99882 +28.4992 1.00634 +28.5145 0.997968 +28.5297 0.996157 +28.5449 1.00133 +28.5602 0.997157 +28.5754 0.997503 +28.5906 0.9979 +28.6058 0.998794 +28.6211 0.998178 +28.6363 1.00237 +28.6515 0.999492 +28.6667 1.00073 +28.682 0.996741 +28.6972 0.998567 +28.7124 0.997349 +28.7277 1.00626 +28.7429 0.999865 +28.7581 1.00237 +28.7733 0.999891 +28.7886 0.996883 +28.8038 1.0007 +28.819 1.00378 +28.8343 1.00169 +28.8495 0.99922 +28.8647 0.999077 +28.8799 0.999014 +28.8952 1.00091 +28.9104 0.997305 +28.9256 1.00105 +28.9409 0.999335 +28.9561 1.00103 +28.9713 1.00067 +28.9865 1.00059 +29.0018 0.994208 +29.017 0.995141 +29.0322 0.999291 +29.0474 1.00169 +29.0627 0.998696 +29.0779 1.00203 +29.0931 0.998819 +29.1084 0.99799 +29.1236 0.998192 +29.1388 1.00433 +29.154 1.00138 +29.1693 0.997828 +29.1845 0.997426 +29.1997 0.997582 +29.215 1.00067 +29.2302 0.999649 +29.2454 1.00444 +29.2606 1.00105 +29.2759 1.00134 +29.2911 1.00334 +29.3063 1.00224 +29.3216 1.00076 +29.3368 1.00436 +29.352 0.997439 +29.3672 1.00314 +29.3825 0.995197 +29.3977 0.999704 +29.4129 0.997496 +29.4281 0.997932 +29.4434 1.00134 +29.4586 1.0079 +29.4738 0.999905 +29.4891 0.999492 +29.5043 1.00292 +29.5195 0.99782 +29.5347 0.999292 +29.55 0.999365 +29.5652 1.00185 +29.5804 1.00363 +29.5957 1.00107 +29.6109 0.999141 +29.6261 1.00139 +29.6413 0.997065 +29.6566 1.00339 +29.6718 0.99591 +29.687 0.996254 +29.7023 0.997663 +29.7175 1.00078 +29.7327 0.99736 +29.7479 0.997492 +29.7632 1.00173 +29.7784 0.998626 +29.7936 0.998902 +29.8088 0.997452 +29.8241 0.998709 +29.8393 1.00049 +29.8545 0.998527 +29.8698 0.999086 +29.885 0.998812 +29.9002 1.00077 +29.9154 0.999024 +29.9307 1.00053 +29.9459 0.994986 +29.9611 0.996982 +29.9764 1.00171 +29.9916 1.00101 +30.0068 1.0003 +30.022 0.999753 +30.0373 1.00202 +30.0525 1.00256 +30.0677 1.00769 +30.083 1.00259 +30.0982 0.996561 +30.1134 1.00201 +30.1286 0.999462 +30.1439 0.996334 +30.1591 0.999031 +30.1743 1.00018 +30.1895 1.00502 +30.2048 1.00052 +30.22 0.999378 +30.2352 0.996158 +30.2505 1.00432 +30.2657 0.999367 +30.2809 1.00263 +30.2961 1.00223 +30.3114 1.00106 +30.3266 1.00262 +30.3418 1.00204 +30.3571 0.999411 +30.3723 1.00333 +30.3875 0.999748 +30.4027 0.999109 +30.418 0.999573 +30.4332 0.996037 +30.4484 0.998088 diff --git a/_basic/openmp/source_code/SOLUTION/dcdread.h b/_basic/openmp/source_code/SOLUTION/dcdread.h new file mode 100644 index 0000000..66ddba0 --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/dcdread.h @@ -0,0 +1,49 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
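
The fixed `seekg` offsets in `dcdreadhead` below are easiest to follow against the on-disk layout of a CHARMM/NAMD-style DCD file, where every Fortran unformatted record is wrapped in 4-byte length markers. This is a minimal standalone sketch of those assumptions; it mirrors the reader below and is tuned to this lab's `alk.traj.dcd` (in particular it assumes a two-line title record), not a general-purpose DCD parser.

```cpp
// Sketch: header layout assumed by dcdreadhead(). Record = [4B length]["CORD"]
// [20 x int ICNTRL][4B length]; nframes is ICNTRL[0], so it sits at byte 8.
// The 64*4-byte skip covers the remaining 19 ICNTRL ints, two record markers,
// a two-line title record, and the marker opening the natom record.
#include <fstream>
#include <iostream>

int main()
{
  std::ifstream f("../../_common/input/alk.traj.dcd", std::ios::binary);
  if (!f)
    return 1;

  int nframes = 0, natom = 0;
  f.seekg(8, std::ios::beg);                                // marker (4B) + "CORD" (4B)
  f.read(reinterpret_cast<char *>(&nframes), sizeof(int));  // ICNTRL[0] = frame count
  f.seekg(64 * 4, std::ios::cur);                           // rest of header + title record
  f.read(reinterpret_cast<char *>(&natom), sizeof(int));
  std::cout << natom << " atoms, " << nframes << " frames\n";
  return 0;
}
```
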
+using namespace std; + +void dcdreadhead(int *natom, int *nframes, std::istream &infile) +{ + + infile.seekg(8, ios::beg); + infile.read((char *)nframes, sizeof(int)); + infile.seekg(64 * 4, ios::cur); + infile.read((char *)natom, sizeof(int)); + infile.seekg(1 * 8, ios::cur); + return; +} + +void dcdreadframe(double *x, double *y, double *z, std::istream &infile, + int natom, double &xbox, double &ybox, double &zbox) +{ + + double d[6]; + for (int i = 0; i < 6; i++) + { + infile.read((char *)&d[i], sizeof(double)); + } + xbox = d[0]; + ybox = d[2]; + zbox = d[5]; + float a, b, c; + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&a, sizeof(float)); + x[i] = a; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&b, sizeof(float)); + y[i] = b; + } + infile.seekg(1 * 8, ios::cur); + for (int i = 0; i < natom; i++) + { + infile.read((char *)&c, sizeof(float)); + z[i] = c; + } + infile.seekg(1 * 8, ios::cur); + + return; +} diff --git a/_basic/openmp/source_code/SOLUTION/rdf.cpp b/_basic/openmp/source_code/SOLUTION/rdf.cpp new file mode 100644 index 0000000..bdbc15b --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/rdf.cpp @@ -0,0 +1,194 @@ +// Copyright (c) 2021 NVIDIA Corporation. All rights reserved. +#include +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; + ////////////////////////////////////////////////////////////////////////// + 
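
For reference, the entropy loop that follows implements the standard RDF normalization and the two-body excess entropy. Writing out what the code computes, with \(h_i\) the raw pair count in bin \(i\) and \(\rho = N_{\text{atom}}/V\):

$$
g(r_i) = \frac{h_i}{N_{\text{conf}}\, N_{\text{atom}}\, n_{\text{ideal},i}},
\qquad
n_{\text{ideal},i} = \frac{4\pi\rho}{3}\left(r_u^3 - r_l^3\right)
$$

$$
s_2 = -2\pi\rho \int_0^{r_c} \bigl[\, g(r)\ln g(r) - g(r) + 1 \,\bigr]\, r^2 \, dr
\;\approx\;
-2\pi\rho \sum_i \bigl[\, g_i \ln g_i - g_i + 1 \,\bigr]\, r_i^2\, \Delta r
$$

Note the code's guards: for `s2` it clamps \(g\) to zero below \(r < 2.0\) and skips the logarithm when \(g\) is tiny, while `s2bond` uses the unclamped \(g_i\); both avoid evaluating \(\ln 0\).
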
nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + //////////////////////////////////////////////////////////////////////// + double pi = acos(-1.0); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + printf("\n %d %d ", nconf, numatm); + + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); +#pragma omp parallel for private(dx, dy, dz, r, ig2) + for (int id1 = 0; id1 < numatm; id1++) + { + + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); +#pragma omp atomic + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + //frame ends + } // end of target map +} diff --git a/_basic/openmp/source_code/SOLUTION/rdf_offload.cpp b/_basic/openmp/source_code/SOLUTION/rdf_offload.cpp new file mode 100644 index 0000000..f8f0c0a --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/rdf_offload.cpp @@ -0,0 +1,194 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin 
= 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; +////////////////////////////////////////////////////////////////////////// +#pragma omp target data map(h_x [0:nconf * numatm], h_y [0:nconf * numatm], h_z [0:nconf * numatm], h_g2 [0:nbin]) + { + nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + } + //////////////////////////////////////////////////////////////////////// + double pi = acos(-1.0); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int 
d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + printf("\n %d %d ", nconf, numatm); + + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); +#pragma omp target teams distribute parallel for private(dx, dy, dz, r, ig2) + for (int id1 = 0; id1 < numatm; id1++) + { + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); +#pragma omp atomic + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + } //frame ends +} diff --git a/_basic/openmp/source_code/SOLUTION/rdf_offload.f90 b/_basic/openmp/source_code/SOLUTION/rdf_offload.f90 new file mode 100644 index 0000000..0fc631c --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/rdf_offload.f90 @@ -0,0 +1,165 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + 
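
The C++ offload version above (`rdf_offload.cpp`) combines two ideas: a `target data` region in `main` keeps the coordinate and histogram arrays resident on the device across all frame kernels, and `target teams distribute parallel for` spreads the outer atom loop over teams and threads. This is a minimal, self-contained sketch of that pattern; the names (`n`, `in`, `out`) are illustrative, not from the lab code.

```cpp
// Minimal sketch of the rdf_offload.cpp pattern: map arrays once with
// `target data`, then launch offloaded loops inside the region.
#include <cstdio>
#include <omp.h>

int main()
{
  const int n = 1 << 16;
  static double in[1 << 16], out[1 << 16];
  for (int i = 0; i < n; i++)
    in[i] = (double)i;

#pragma omp target data map(to : in[0:n]) map(from : out[0:n])
  {
    // iterations are spread across a league of teams, each running threads
#pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
      out[i] = 2.0 * in[i];
  }
  printf("%f\n", out[n - 1]);
  return 0;
}
```

Without a device (or device support in the compiler) this falls back to host execution, which is why the same source also serves as a CPU reference.
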
write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + !$omp target data map(x(:,:), y (:,:), z (:,:), g (:)) + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + !$omp target teams distribute parallel do private(dx,dy,dz,r,ind) + do i=1,natoms + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; +////////////////////////////////////////////////////////////////////////// +#pragma omp target data map(h_x [0:nconf * numatm], h_y [0:nconf * numatm], h_z [0:nconf * numatm], h_g2 [0:nbin]) + { + nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + } + //////////////////////////////////////////////////////////////////////// + double pi = acos(-1.0); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; 
+ nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + printf("\n %d %d ", nconf, numatm); + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); +#pragma omp target teams distribute parallel for collapse(2) private(dx, dy, dz, r, ig2) + for (int id1 = 0; id1 < numatm; id1++) + { + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); +#pragma omp atomic + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + } //frame ends +} diff --git a/_basic/openmp/source_code/SOLUTION/rdf_offload_collapse.f90 b/_basic/openmp/source_code/SOLUTION/rdf_offload_collapse.f90 new file mode 100644 index 0000000..42d8016 --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/rdf_offload_collapse.f90 @@ -0,0 +1,165 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
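
The `_collapse` variants add `collapse(2)` so the `id1`/`id2` (or `i`/`j`) loops are fused into a single iteration space of `numatm * numatm` independent iterations, giving the runtime far more parallelism to schedule than the outer loop alone. A small sketch with illustrative shapes:

```cpp
// Sketch of collapse(2): both loops form one fused iteration space.
#include <cstdio>
#include <omp.h>

int main()
{
  const int n = 512;
  static double a[512][512];

#pragma omp target teams distribute parallel for collapse(2) map(from : a[0:n][0:n])
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      a[i][j] = (double)i * j; // every (i, j) pair is an independent iteration

  printf("%f\n", a[n - 1][n - 1]);
  return 0;
}
```
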
+ +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + !$omp target data map(x(:,:), y (:,:), z (:,:), g (:)) + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + !$omp target teams distribute parallel do private(dx,dy,dz,r,ind) collapse(2) + do i=1,natoms + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + 
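
The `loop`-based variants of this kernel (the `rdf_offload_loop` sources) take the descriptive rather than prescriptive route: instead of spelling out `distribute parallel for`, `loop` asserts that the iterations are independent and leaves the teams/threads mapping to the compiler. A minimal sketch with illustrative names:

```cpp
// Sketch of the descriptive `loop` construct (OpenMP 5.0): the compiler
// chooses how to spread iterations over teams and threads.
#include <cstdio>
#include <omp.h>

int main()
{
  const int n = 1 << 14;
  static double v[1 << 14];

#pragma omp target teams loop map(from : v[0:n])
  for (int i = 0; i < n; i++)
    v[i] = 0.5 * i;

  printf("%f\n", v[n - 1]);
  return 0;
}
```
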
stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; +////////////////////////////////////////////////////////////////////////// +#pragma omp target data map(h_x [0:nconf * numatm], h_y [0:nconf * numatm], h_z [0:nconf * numatm], h_g2 [0:nbin]) + { + nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + } + //////////////////////////////////////////////////////////////////////// + double pi = acos(-1.0); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + printf("\n %d %d ", nconf, numatm); + + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); +#pragma omp target teams 
loop private(dx, dy, dz, r, ig2) + for (int id1 = 0; id1 < numatm; id1++) + { + #pragma omp loop + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); +#pragma omp atomic + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + } //frame ends +} diff --git a/_basic/openmp/source_code/SOLUTION/rdf_offload_loop.f90 b/_basic/openmp/source_code/SOLUTION/rdf_offload_loop.f90 new file mode 100644 index 0000000..a14211c --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/rdf_offload_loop.f90 @@ -0,0 +1,166 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + !$omp target data map(x(:,:), y (:,:), z (:,:), g (:)) + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + !$omp target teams loop private(dx,dy,dz,r,ind) + do 
i=1,natoms + !$omp loop + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; +////////////////////////////////////////////////////////////////////////// +#pragma omp target data map(h_x [0:nconf * numatm], h_y [0:nconf * numatm], h_z [0:nconf * numatm], h_g2 [0:nbin]) + { + nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + } + //////////////////////////////////////////////////////////////////////// + double pi = acos(-1.0); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + 
else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + printf("\n %d %d ", nconf, numatm); + + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); +#pragma omp target teams distribute + + for (int id1 = 0; id1 < numatm; id1++) + { +#pragma omp parallel for private(dx, dy, dz, r, ig2) + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); +#pragma omp atomic + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + } //frame ends +} diff --git a/_basic/openmp/source_code/SOLUTION/rdf_offload_split.f90 b/_basic/openmp/source_code/SOLUTION/rdf_offload_split.f90 new file mode 100644 index 0000000..f41bc8d --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/rdf_offload_split.f90 @@ -0,0 +1,166 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
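
The `_split` variants above separate the two levels explicitly: `target teams distribute` hands each outer iteration to a team, and a nested `parallel for` shares the inner loop among that team's threads; the `_num` variants additionally cap the league with `num_teams(65535)`. A compact sketch of the split mapping, with illustrative names (`row_sum`, `s`):

```cpp
// Sketch of the split mapping: outer loop across teams, inner loop
// workshared within each team.
#include <cstdio>
#include <omp.h>

int main()
{
  const int n = 1024;
  double row_sum[1024];

#pragma omp target teams distribute map(from : row_sum[0:n])
  for (int i = 0; i < n; i++) // one team per chunk of i
  {
    double s = 0.0;
#pragma omp parallel for reduction(+ : s) // team's threads share the j loop
    for (int j = 0; j < n; j++)
      s += (double)i * j;
    row_sum[i] = s;
  }
  printf("%f\n", row_sum[n - 1]);
  return 0;
}
```

Splitting the levels gives finer control than the combined construct, at the cost of committing to a specific mapping that may or may not match what the compiler would have chosen.
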
+ +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + !$omp target data map(x(:,:), y (:,:), z (:,:), g (:)) + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + !$omp target teams distribute + do i=1,natoms + !$omp parallel do private(dx,dy,dz,r,ind) + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r +#include +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + 
stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; +////////////////////////////////////////////////////////////////////////// +#pragma omp target data map(h_x [0:nconf * numatm], h_y [0:nconf * numatm], h_z [0:nconf * numatm], h_g2 [0:nbin]) + { + nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + } + //////////////////////////////////////////////////////////////////////// + double pi = acos(-1.0); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + printf("\n %d %d ", nconf, numatm); + + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); +#pragma omp target teams 
distribute num_teams(65535) + + for (int id1 = 0; id1 < numatm; id1++) + { +#pragma omp parallel for private(dx, dy, dz, r, ig2) + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); +#pragma omp atomic + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + } //frame ends +} diff --git a/_basic/openmp/source_code/SOLUTION/rdf_offload_split_num.f90 b/_basic/openmp/source_code/SOLUTION/rdf_offload_split_num.f90 new file mode 100644 index 0000000..e5bd724 --- /dev/null +++ b/_basic/openmp/source_code/SOLUTION/rdf_offload_split_num.f90 @@ -0,0 +1,166 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + !$omp target data map(x(:,:), y (:,:), z (:,:), g (:)) + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + !$omp target 
teams distribute num_teams(65535) + do i=1,natoms + !$omp parallel do private(dx,dy,dz,r,ind) + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r +#include +#include +#include +#include +#include +#include +#include "dcdread.h" +#include +#include + +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, + int d_bin); + +int main(int argc, char *argv[]) +{ + double xbox, ybox, zbox; + double *h_x, *h_y, *h_z; + unsigned int *h_g2; + int nbin; + int numatm, nconf, inconf; + string file; + + /////////////////////////////////////////////////////////////// + + inconf = 10; + nbin = 2000; + file = "../../_common/input/alk.traj.dcd"; + /////////////////////////////////////// + std::ifstream infile; + infile.open(file.c_str()); + if (!infile) + { + cout << "file " << file.c_str() << " not found\n"; + return 1; + } + assert(infile); + + ofstream pairfile, stwo; + pairfile.open("RDF.dat"); + stwo.open("Pair_entropy.dat"); + + ///////////////////////////////////////////////////////// + dcdreadhead(&numatm, &nconf, infile); + cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl; + if (inconf > nconf) + cout << "nconf is reset to " << nconf << endl; + else + { + nconf = inconf; + } + cout << "Calculating RDF for " << nconf << " frames" << endl; + //////////////////////////////////////////////////////// + + unsigned long long int sizef = nconf * numatm * sizeof(double); + unsigned long long int sizebin = nbin * sizeof(unsigned int); + + h_x = (double *)malloc(sizef); + h_y = (double *)malloc(sizef); + h_z = (double *)malloc(sizef); + h_g2 = (unsigned int *)malloc(sizebin); + + memset(h_g2, 0, sizebin); + + /////////reading cordinates////////////////////////////////////////////// + nvtxRangePush("Read_File"); + + double ax[numatm], ay[numatm], az[numatm]; + for (int i = 0; i < nconf; i++) + { + dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox); + for (int j = 0; j < numatm; j++) + { + h_x[i * numatm + j] = ax[j]; + h_y[i * numatm + j] = ay[j]; + h_z[i * numatm + j] = az[j]; + } + } + nvtxRangePop(); //pop for REading file + cout << "Reading of input file is completed" << endl; + ////////////////////////////////////////////////////////////////////////// + nvtxRangePush("Pair_Calculation"); + pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin); + nvtxRangePop(); //Pop for Pair Calculation + //////////////////////////////////////////////////////////////////////// + double pi = acos(-1.0); + double rho = (numatm) / (xbox * ybox * zbox); + double norm = (4.0l * pi * rho) / 3.0l; + double rl, ru, nideal; + double g2[nbin]; + double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l; + double box = min(xbox, ybox); + box = min(box, zbox); + double del = box / (2.0l * nbin); + nvtxRangePush("Entropy_Calculation"); + for (int i = 0; i < nbin; i++) + { + rl = (i)*del; + ru = rl + del; + nideal = norm * (ru * ru * ru - rl * rl * rl); + g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal); + r = (i)*del; + pairfile << (i + 0.5l) * del << " " << g2[i] << endl; + if (r < 2.0l) + { + gr = 0.0l; + } + else + { + gr = g2[i]; + } + if (gr < 1e-5) + { + lngr = 0.0l; + } + else + { + lngr = log(gr); + } + + if (g2[i] < 1e-6) + { + 
lngrbond = 0.0l; + } + else + { + lngrbond = log(g2[i]); + } + s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r; + s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r; + } + nvtxRangePop(); //Pop for Entropy Calculation + stwo << "s2 value is " << s2 << endl; + stwo << "s2bond value is " << s2bond << endl; + + cout << "#Freeing Host memory" << endl; + free(h_x); + free(h_y); + free(h_z); + free(h_g2); + + cout << "#Number of atoms processed: " << numatm << endl + << endl; + cout << "#Number of confs processed: " << nconf << endl + << endl; + return 0; +} +void pair_gpu(const double *d_x, const double *d_y, const double *d_z, + unsigned int *d_g2, int numatm, int nconf, + const double xbox, const double ybox, const double zbox, int d_bin) +{ + double r, cut, dx, dy, dz; + int ig2; + double box; + box = min(xbox, ybox); + box = min(box, zbox); + + double del = box / (2.0 * d_bin); + cut = box * 0.5; + printf("\n %d %d ", nconf, numatm); + + { + for (int frame = 0; frame < nconf; frame++) + { + printf("\n %d ", frame); + + for (int id1 = 0; id1 < numatm; id1++) + { + for (int id2 = 0; id2 < numatm; id2++) + { + dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]; + dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]; + dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]; + + dx = dx - xbox * (round(dx / xbox)); + dy = dy - ybox * (round(dy / ybox)); + dz = dz - zbox * (round(dz / zbox)); + + r = sqrtf(dx * dx + dy * dy + dz * dz); + if (r < cut) + { + ig2 = (int)(r / del); + + d_g2[ig2] = d_g2[ig2] + 1; + } + } + } + } //frame ends + } // end of target map +} diff --git a/_basic/openmp/source_code/rdf.f90 b/_basic/openmp/source_code/rdf.f90 new file mode 100644 index 0000000..e181995 --- /dev/null +++ b/_basic/openmp/source_code/rdf.f90 @@ -0,0 +1,161 @@ +!///////////////////////////////////////////////////////////////////////////////////////// +!// Author: Manish Agarwal and Gourav Shrivastava , IIT Delhi +!///////////////////////////////////////////////////////////////////////////////////////// + +! Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+ +module readdata + contains + subroutine readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + integer i,j + integer maxframes,maxatoms + + double precision d(6),xbox,ybox,zbox + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + + real*4 dummyr + integer*4 nset, natoms, dummyi,nframes,tframes + character*4 dummyc + + open(10,file='../../_common/input/alk.traj.dcd',status='old',form='unformatted') + read(10) dummyc, tframes,(dummyi,i=1,8),dummyr, (dummyi,i=1,9) + read(10) dummyi, dummyr,dummyr + read(10) natoms + print*,"Total number of frames and atoms are",tframes,natoms + + allocate ( x(maxframes,natoms) ) + allocate ( y(maxframes,natoms) ) + allocate ( z(maxframes,natoms) ) + + do i = 1,nframes + read(10) (d(j),j=1, 6) + + read(10) (x(i,j),j=1,natoms) + read(10) (y(i,j),j=1,natoms) + read(10) (z(i,j),j=1,natoms) + end do + + xbox=d(1) + ybox=d(3) + zbox=d(6) + + print*,"File reading is done: xbox,ybox,zbox",xbox,ybox,zbox + return + + end subroutine readdcd + end module readdata + +program rdf + use readdata + use nvtx + implicit none + integer n,i,j,iconf,ind + integer natoms,nframes,nbin + integer maxframes,maxatoms + parameter (maxframes=10,maxatoms=60000,nbin=2000) + real*4, allocatable :: x(:,:) + real*4, allocatable :: y(:,:) + real*4, allocatable :: z(:,:) + double precision dx,dy,dz + double precision xbox,ybox,zbox,cut + double precision vol,r,del,s2,s2bond + double precision, allocatable :: g(:) + double precision rho,gr,lngr,lngrbond,pi,const,nideal,rf + double precision rlower,rupper + character atmnm*4 + real*4 start,finish + + open(23,file='RDF.dat',status='unknown') + open(24,file='Pair_entropy.dat',status='unknown') + + nframes=10 + + call cpu_time(start) + + print*,"Going to read coordinates" + call nvtxStartRange("Read File") + call readdcd(maxframes,maxatoms,x,y,z,xbox,ybox,zbox,natoms,nframes) + call nvtxEndRange + + allocate ( g(nbin) ) + g = 0.0d0 + + pi=dacos(-1.0d0) + vol=xbox*ybox*zbox + rho=dble(natoms)/vol + + del=xbox/dble(2.0*nbin) + write(*,*) "bin width is : ",del + cut = dble(xbox * 0.5); + + !pair calculation + call nvtxStartRange("Pair Calculation") + do iconf=1,nframes + if (mod(iconf,1).eq.0) print*,iconf + do i=1,natoms + do j=1,natoms + dx=x(iconf,i)-x(iconf,j) + dy=y(iconf,i)-y(iconf,j) + dz=z(iconf,i)-z(iconf,j) + + dx=dx-nint(dx/xbox)*xbox + dy=dy-nint(dy/ybox)*ybox + dz=dz-nint(dz/zbox)*zbox + + r=dsqrt(dx**2+dy**2+dz**2) + ind=int(r/del)+1 + !if (ind.le.nbin) then + if(r nconf:\n", + " print(\"nconf is reset to {}\".format(nconf))\n", + " else:\n", + " nconf = inconf\n", + " print(\"Calculating RDF for {} frames\".format(nconf))\n", + " #numatm = 100\n", + " sizef = nconf * numatm\n", + " sizebin = nbin\n", + "\n", + " ########### reading cordinates ##############\n", + " nvtx.RangePush(\"Read_File\")\n", + " xbox, ybox, zbox, d_x, d_y, d_z = dcdreadframe(infile, numatm, nconf)\n", + " nvtx.RangePop() # pop for reading file\n", + " print(\"Reading of input file is completed\")\n", + " ############# Stream from Host to Device #########################\n", + " d_x = cp.asarray(d_x)\n", + " d_y = cp.asarray(d_y)\n", + " d_z = cp.asarray(d_z)\n", + " d_g2 = np.zeros(sizebin, dtype=np.int64)\n", + " d_g2 = cp.asarray(d_g2)\n", + "\n", + " ############################## RAW KERNEL #################################################\n", + " nthreads = 128;\n", + " near2 = nthreads * (int(0.5 * numatm * (numatm - 1) / nthreads) + 1);\n", + " nblock = (near2 / nthreads);\n", + " print(\" Initial 
blocks are {} and now changing to\".format(nblock))\n", + " maxblock = 65535\n", + " blockloop = int(nblock / maxblock)\n", + " if blockloop != 0:\n", + " nblock = maxblock\n", + " print(\"{} and will run over {} blockloops\".format(nblock, blockloop+1))\n", + "\n", + " nvtx.RangePush(\"CuPy_Pair_gpu_Circulation\")\n", + " #t1 = timer()\n", + " for bl in range(blockloop+1):\n", + " raw_kernel((nblock,),(nthreads,), (d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, nbin, bl)) ## cupy raw kernel\n", + " \n", + " cp.cuda.Device(0).synchronize()\n", + " #print(\"Kernel compute time:\", timer() - t1)\n", + " \n", + " d_g2 = cp.asnumpy(d_g2)\n", + " nvtx.RangePop() # pop for Pair Calculation\n", + " #############################################################################################\n", + " pi = math.acos(np.int64(-1.0))\n", + " rho = (numatm) / (xbox * ybox * zbox)\n", + " norm = (np.int64(4.0) * pi * rho) / np.int64(3.0)\n", + " g2 = np.zeros(nbin, dtype=np.float32)\n", + " s2 =np.int64(0.0); s2bond = np.int64(0.0)\n", + " lngrbond = np.float32(0.0)\n", + " box = min(xbox, ybox)\n", + " box = min(box, zbox)\n", + " _del =box / (np.int64(2.0) * nbin)\n", + " gr = np.float32(0.0)\n", + " # loop to calculate entropy\n", + " nvtx.RangePush(\"Entropy_Calculation\")\n", + " for i in range(nbin):\n", + " rl = (i) * _del\n", + " ru = rl + _del\n", + " nideal = norm * (ru * ru * ru - rl * rl * rl)\n", + " g2[i] = d_g2[i] / (nconf * numatm * nideal)\n", + " r = (i) * _del\n", + " temp = (i + 0.5) * _del\n", + " pairfile.write(str(temp) + \" \" + str(g2[i]) + \"\\n\")\n", + "\n", + " if r < np.int64(2.0):\n", + " gr = np.int64(0.0)\n", + " else:\n", + " gr = g2[i]\n", + " if gr < 1e-5:\n", + " lngr = np.int64(0.0)\n", + " else:\n", + " lngr = math.log(gr)\n", + " if g2[i] < 1e-6:\n", + " lngrbond = np.int64(0.0)\n", + " else:\n", + " lngrbond = math.log(g2[i])\n", + " s2 = s2 - (np.int64(2.0) * pi * rho * ((gr * lngr) - gr + np.int64(1.0)) * _del * r * r)\n", + " s2bond = s2bond - np.int64(2.0) * pi * rho * ((g2[i] * lngrbond) - g2[i] + np.int64(1.0)) * _del * r * r\n", + "\n", + " nvtx.RangePop() # pop for entropy Calculation\n", + " stwo.writelines(\"s2 value is {}\\n\".format(s2))\n", + " stwo.writelines(\"s2bond value is {}\".format(s2bond))\n", + " \n", + " print(\"\\n s2 value is {}\\n\".format(s2))\n", + " print(\"s2bond value is {}\\n\".format(s2bond))\n", + "\n", + " print(\"#Freeing Host memory\")\n", + " del (d_x)\n", + " del (d_y)\n", + " del (d_z)\n", + " del (d_g2)\n", + " print(\"#Number of atoms processed: {} \\n\".format(numatm))\n", + " print(\"#number of confs processed: {} \\n\".format(nconf))\n", + " #total_time = timer() - start\n", + " #print(\"total time spent:\", total_time)\n", + " \n", + "\n", + "if __name__ == \"__main__\":\n", + " main()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "### Output Files\n", + "\n", + "\n", + "\n", + "---\n", + "\n", + "### Profiling\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "---\n", + "\n", + "#

[HOME](../../../_common/_start_nways_python.ipynb)

\n", + "\n", + "---\n", + "\n", + "\n", + "# Links and Resources\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download the latest version of the Nsight System from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "---\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/python/jupyter_notebook/cupy/cupy_guide.ipynb b/_basic/python/jupyter_notebook/cupy/cupy_guide.ipynb new file mode 100644 index 0000000..0b46d1a --- /dev/null +++ b/_basic/python/jupyter_notebook/cupy/cupy_guide.ipynb @@ -0,0 +1,983 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# CuPy Lab 1\n", + "---\n", + "\n", + "## Learning Objectives\n", + "- **The goal of this lab is to:**\n", + " - enable you to quickly start using CuPy (beginner to intermediate level)\n", + " - teach you to apply the concepts of GPU programming to HPC field(s); and\n", + " - show you how to achieve a computational speedup on the GPU to maximize the throughput of your HPC implementation.\n", + "\n", + "Before we begin, let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by clicking on it with your mouse, and pressing Ctrl-Enter, or pressing the play button in the toolbar above. You should see some output returned below the grey cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Introduction\n", + "CuPy is an open-source library that implements GPU-accelerated NumPy arrays on CUDA. CuPy represents a GPU version of NumPy. NumPy runs only on CPU cores while CuPy leverages on multiple CUDA cores for parallel execution, therefore, CuPy is considered to run fastest and delivers maximum speed up. Due to the NumPy-compatibility nature of CuPy, almost all NumPy functionalities including multi-dimensional arrays and data types are implemented by CuPy. 
The rest of this notebook illustrates the `CuPy architecture`, `CuPy fundamentals`, and `CuPy CUDA kernels`, and frequently uses the terms `Host` (the CPU), `Device` (the GPU), and `Kernel` (a user-defined function that runs on the GPU).\n",
+ " \n",
+ " \n",
+ "## CuPy Architecture\n",
+ "The CuPy architecture exposes functionality through the CuPy API that allows developers to create user-defined CUDA kernels and to use deep neural network primitives through `cuDNN`. Dense linear algebra is handled by `cuBLAS`, while systems of equations are solved with `cuSOLVER`. The `cuSPARSE` and `cuTENSOR` APIs target sparse-matrix and tensor operations respectively. Random numbers are generated using `cuRAND`. Sort, scan, and reduction operations are executed using `CUB` and `Thrust`. Multi-GPU data transfers are initiated with `NCCL`. All of these APIs rely on `CUDA`, and CUDA itself depends on an `NVIDIA GPU`, as shown in figure 1.0.\n",
+ "\n",
+ "*Figure 1.0: CuPy Architecture*\n"
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## CuPy Fundamentals\n", + "\n", + "In this section, three frequently used CuPy paradigm namely variable initialization, data transfer, and device selection would be considered. \n", + "\n", + "- **Variable or data initialization**: This is the process of assigning data or value to CuPy ndarray. The first step is to import the CuPy library and then initialize variables with data type as follows:\n", + "```python \n", + "import cupy as cp\n", + "X1 = cp.array([1,2,3,4,5,6,7,8,9,10], dtype=cp.int32)#array of 10 values\n", + "X2 = cp.arange(100, dtype=cp.float32)#generating array of 100 values \n", + "X3 = cp.empty((3,3), dtype=cp.float32)#initializing empty 2D array of 3X3 matrix\n", + "Sizebin = 10000\n", + "X4 = cp.zeros(sizebin, dtype=cp.int64)#initializing array filled with 10,000 zeros\n", + "```\n", + "\n", + "\n", + "- **Data transfer**: The idea is to move or copy data from the Host (NumPy) to the Device (CuPy) and vice versa such that data is visible to the Kernel and the resulting output would be copied back to the Host.\n", + "```python\n", + "import numpy as np\n", + "import cupy as cp\n", + "#copy data from Host to Device using cp.asarray()\n", + "h_X = np.arange(100, dtype=np.float32)#generating array of 100 values on the Host with NumPy\n", + "d_X = cp.asarray(x)# copy data to Device \n", + "#copy data from Device to Host using cp.asnumpy()\n", + "h_X = cp.asnumpy(d_X)\n", + "```\n", + "\n", + "\n", + "- **Device selection**: This is a mechanism used by CuPy to select a particular GPU or switch from one Device to another (when there are more than one Device, default device is given 0 index id).\n", + "```python\n", + "Using default Device\n", + "X1 = cp.array([1,2,3,4,5,6,7,8,9,10], dtype=cp.int32)\n", + "```\n", + "Switching Devices\n", + "```python\n", + "cp.cuda.Device(1)\n", + "X1 = cp.array([1,2,3,4,5,6,7,8,9,10], dtype=cp.int32)\n", + "```\n", + "\n", + "\n", + "Switch GPU temporarily to GPU index 2 (minimum of 3 GPUs must exist to use index 2)\n", + "```python\n", + "with cp.cuda.Device(2): \n", + "\tX2 = cp.arange(100, dtype=cp.float32)\n", + "```\n", + "```python\n", + "Sizebin = 10000\n", + "X4 = cp.zeros(sizebin, dtype=cp.int64)# back to default GPU with index 0 \n", + "```\n", + "\n", + "Having establish some basic steps, let’s consider example 1.\n", + "\n", + "**Example 1**: *Write a CuPy program that adds two arrays A and B and store the result in array C. Assume that A and B have 10,000 elements each*." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cupy as cp\n", + "\n", + "N = 10000\n", + "#select Device with index 1. \n", + "with cp.cuda.Device(1):\n", + " #input data initialzed\n", + " d_A = cp.arange(N, dtype=cp.int32)\n", + " d_B = cp.arange(N, dtype=cp.int32)\n", + " d_C = cp.zeros(N, dtype=cp.int32) # initialize zero filled array\n", + " d_C = d_A + d_B\n", + "\n", + "#optional: copy result from Device to Host \n", + "h_C = cp.asnumpy(d_C)\n", + "print(h_C)\n", + "#expected output: [ 0 2 4 ... 19994 19996 19998]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "**Exercise 1**: *Follow the steps highlighted above and write a CuPy program to add two arrays. The size of each array is 500,000*. 
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "**Exercise 1**: *Follow the steps highlighted above and write a CuPy program to add two arrays, each of size 500,000. Execute this task in the cell below.* \n",
+ "\n",
+ "---\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cupy as cp\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "#expected output: [ 0 2 4 ... 999994 999996 999998]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2-Dimensional Array\n",
+ "\n",
+ "In this section, the focus is on performing simple calculations with 2D arrays. 2D arrays are usually in matrix form, and matrix-matrix multiplication can be performed on them using CuPy `SGEMM` (Single precision GEneral Matrix Multiplication) and `DGEMM` (Double precision GEneral Matrix Multiplication). Let’s consider two examples of matrix multiplication. The first example uses simple, mathematically verifiable `4x4` matrices `A` and `B`, as shown in figure 2.0, while the second example uses large matrices `d_A` and `d_B` of shape `10,000x10,000`. The latter example uses `cuRAND` to randomly generate the values of `d_A` and `d_B` on the Device, and the Python matrix operator `@`, backed by `cuBLAS`, to perform the multiplication. \n",
+ "\n",
+ "**Example 2**: Multiplication of matrices A & B using `cp.dot()` and `@`. \n",
+ "\n",
+ "*Figure 2.0: Matrix A & B multiplication*\n"
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "N = 4\n", + "A = cp.array([[0,0,0,0],[1,1,1,1],[2,2,2,2],[3,3,3,3]],dtype=cp.int32)\n", + "B = cp.array([[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]],dtype=cp.int32)\n", + "\n", + "C = cp.dot(A,B)\n", + "C2 = A@B\n", + "print(\"dot ops:\", C)\n", + "print(\"@ ops:\", C2)\n", + "\n", + "#expected output\n", + "#dot ops: \n", + "#[[ 0 0 0 0]\n", + "# [ 0 4 8 12]\n", + "# [ 0 8 16 24]\n", + "# [ 0 12 24 36]]\n", + "#@ ops: \n", + "#[[ 0 0 0 0]\n", + "# [ 0 4 8 12]\n", + "# [ 0 8 16 24]\n", + "# [ 0 12 24 36]]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Example 3**: Multiply matrixes d_A and d_B using Python matrix operator `@`. \n", + "\n", + "- **Step 1**: initialize matrix size (assume the two matrixes have equal rows and columns)\n", + "```python\n", + "import cupy as cp\n", + "N = 10000\n", + "```\n", + "\n", + "\n", + "- **Step 2**: Fetch or generate matrix values. Matrixes d_A and d_B would be generated using `cuRAND`\n", + "```python\n", + "d_A = cp.random.random((N,N), dtype=cp.float32)\n", + "d_B = cp.random.random(N*N, dtype=cp.float32).reshape(N, N)\n", + "```\n", + "\n", + "\n", + "- **Step 3**: Apply Python matrix operator `@` that uses `cuBLAS` \n", + "```python\n", + "d_C = d_A@d_B\n", + "print(d_C)\n", + "#expected output\n", + "...\n", + "[2496.929 2493.3096 2512.024 ... 2523.2388 2486.2688 2502.8193]\n", + "[2512.366 2522.0713 2518.3489 ... 2529.164 2493.486 2488.1067]\n", + "[2493.215 2483.601 2493.606 ... 2523.578 2474.8271 2469.6057]]\n", + "```\n", + "\n", + "---\n", + "**Exercise 2**: *Write a CuPy program that multiply two matrixes of dimensions 225 x 225. Part of the code has been written for you in the cell below and you are to complete the rest*.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cupy as cp\n", + "\n", + "N = 225\n", + "\n", + "#generate matrix \n", + "\n", + "\n", + "\n", + "#apply matrix operator @ or cp.dot()\n", + "\n", + "\n", + "\n", + "#expected output:\n", + "#[[ 848610000 848635200 848660400 ... 854204400 854229600 854254800]\n", + "# [ 2124360000 2124435825 2124511650 ... 2141193150 2141268975 2141344800]\n", + "# [ -894857296 -894730846 -894604396 ... -866785396 -866658946 -866532496]\n", + "# ...\n", + "# [ 597268464 608532414 619796364 ... -1197101932 -1185837982 -1174574032]\n", + "# [ 1873018464 1884333039 1895647614 ... 89886818 101201393 112515968]\n", + "# [-1146198832 -1134833632 -1123468432 ... 1376875568 1388240768 1399605968]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Kernel Fusion\n", + "\n", + "Kernel fusion is all about fusing functions and it is defined by specifying a decorator `@cp.fuse()` at the top of a user-defined function. 
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Kernel Fusion\n",
+ "\n",
+ "Kernel fusion combines the operations inside a user-defined function into a single CUDA kernel; it is enabled by placing the decorator `@cp.fuse()` at the top of the function. On its first call, the fused function compiles and caches a CUDA kernel; subsequent calls with the same input types run on the cached kernel, which is where the additional speedup comes from.\n",
+ "\n",
+ "```python\n",
+ "@cp.fuse(kernel_name='function_name')\n",
+ "def function_name():\n",
+ "    # function body\n",
+ "```\n",
+ "or as\n",
+ "\n",
+ "```python\n",
+ "@cp.fuse()\n",
+ "def function_name():\n",
+ "    # function body\n",
+ "```\n",
+ "\n",
+ "**Example 4**: compute $z = \\sum_{i=1}^{N} x_i w_i$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cupy as cp\n",
+ "\n",
+ "@cp.fuse()\n",
+ "def compute(x,w):\n",
+ "    return cp.sum(x * w)\n",
+ "\n",
+ "N = 225\n",
+ "#input data\n",
+ "x = cp.random.random((N), dtype=cp.float32)\n",
+ "w = cp.random.random((N), dtype=cp.float32)\n",
+ "\n",
+ "#calling fuse function\n",
+ "z = compute(x,w)\n",
+ "print(z)\n",
+ "#expected output: 57.776024. The output varies because of the random values of x & w"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## CuPy CUDA Kernels\n",
+ "\n",
+ "CuPy supports three kinds of user-defined CUDA kernels:\n",
+ "- Elementwise Kernels\n",
+ "- Reduction Kernels\n",
+ "- Raw Kernels\n",
+ "\n",
+ "### Elementwise Kernels\n",
+ "\n",
+ "The elementwise kernel class definition comprises a list of input and output arguments with data types, specified either explicitly or in generic form (`T`), following C style. It also includes the kernel body code, which holds the computation statement, and the kernel name. Note that the names `i` and `n`, and variable names that begin with `_`, are reserved and may not be used within the elementwise kernel definition. A stepwise example is illustrated below:\n",
+ "\n",
+ "**Example 5**: compute $r = \\sqrt{x^2 + y^2 + z^2}$ \n",
+ "\n",
+ "**Step 1**: Set the list of input and output arguments and their data type\n",
+ "```python\n",
+ "input_list = 'float32 d_x, float32 d_y, float32 d_z'\n",
+ "output_list = 'float32 r'\n",
+ "```\n",
+ "you may as well use a generic form of data type as follows:\n",
+ "\n",
+ "```python\n",
+ "input_list = 'T d_x, T d_y, T d_z'\n",
+ "output_list = 'T r'\n",
+ "```\n",
+ "**Step 2**: Write the kernel body code to compute the equation\n",
+ "```python\n",
+ "code_body = 'r = sqrt(d_x*d_x + d_y*d_y + d_z*d_z)'\n",
+ "```\n",
+ "**Step 3**: Define the elementwise class and set the kernel name\n",
+ "```python\n",
+ "compute_call = cp.ElementwiseKernel(input_list, output_list, code_body, 'compute')\n",
+ "```\n",
+ "**Step 4**: Initialize the input values\n",
+ "```python\n",
+ "N = 2000\n",
+ "d_x = cp.arange(N, dtype=cp.float32)\n",
+ "d_y = cp.arange(N, dtype=cp.float32)\n",
+ "d_z = cp.arange(N, dtype=cp.float32)\n",
+ "r = cp.empty(N, dtype=cp.float32)\n",
+ "```\n",
+ "\n",
+ "**Step 5**: Make the kernel call\n",
+ "```python\n",
+ "compute_call(d_x, d_y, d_z, r)\n",
+ "print(r)\n",
+ "#expected output: [0.0000000e+00 1.7320508e+00 3.4641016e+00 ... 3.4589055e+03 3.4606375e+03 3.4623696e+03]\n",
+ "```\n",
+ "You can run the above code in the cell below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cupy as cp\n",
+ "\n",
+ "input_list = 'float32 d_x, float32 d_y, float32 d_z'\n",
+ "output_list = 'float32 r'\n",
+ "code_body = 'r = sqrt(d_x*d_x + d_y*d_y + d_z*d_z)'\n",
+ "\n",
+ "# ElementwiseKernel class defined\n",
+ "compute_call = cp.ElementwiseKernel(input_list, output_list, code_body, 'compute')\n",
+ "# data\n",
+ "N = 2000\n",
+ "\n",
+ "d_x = cp.arange(N, dtype=cp.float32)\n",
+ "d_y = cp.arange(N, dtype=cp.float32)\n",
+ "d_z = cp.arange(N, dtype=cp.float32)\n",
+ "r = cp.empty(N, dtype=cp.float32)\n",
+ "# kernel call with argument passing\n",
+ "compute_call(d_x, d_y, d_z, r)\n",
+ "print(r)\n"
+ ]
+ },
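+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Two conveniences worth noting before moving on, shown with the classic squared-difference kernel from the CuPy documentation (this sketch is not part of the lab exercises): with the generic type `T` the kernel is compiled once per input dtype, scalars are broadcast against arrays, and the output array is allocated automatically when it is not supplied.\n",
+ "\n",
+ "```python\n",
+ "import cupy as cp\n",
+ "\n",
+ "squared_diff = cp.ElementwiseKernel(\n",
+ "    'T x, T y', 'T z',\n",
+ "    'z = (x - y) * (x - y)',\n",
+ "    'squared_diff')\n",
+ "\n",
+ "a = cp.arange(10, dtype=cp.float32)\n",
+ "print(squared_diff(a, 2.0))  # scalar y is broadcast; z is allocated for us\n",
+ "```\n"
+ ]
+ },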
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Reduction Kernels\n",
+ "\n",
+ "A reduction kernel is defined by the following pieces:\n",
+ "\n",
+ "- the input and output arguments, with data types given explicitly or in generic form (`T`), following C style;\n",
+ "- a mapping expression, applied to each input element to produce the values to be reduced;\n",
+ "- a reduction expression, which combines two intermediate operands `a` and `b`;\n",
+ "- a post-mapping expression, which transforms the reduced value `a` into the output;\n",
+ "- an identity value, the initial value of the reduction (`'0'` for a sum);\n",
+ "- the kernel name.\n",
+ "\n",
+ "For ease of understanding, `example 6` is used to demonstrate a reduction kernel.\n",
+ "\n",
+ "**Example 6**: Evaluate $z = \\sum_{i=1}^{N} x_i w_i + \\mathrm{bias}$\n",
+ "\n",
+ "\n",
+ "**Step 1**: Set the list of input and output arguments and their data type\n",
+ "```python\n",
+ "input_list = 'float32 x, float32 w, float32 bias'\n",
+ "output_list = 'float32 y'\n",
+ "```\n",
+ "you may as well use a generic form of data type as follows:\n",
+ "```python\n",
+ "input_list = 'T x, T w, T bias'\n",
+ "output_list = 'T y'\n",
+ "```\n",
+ "**Step 2**: Set the mapping expression\n",
+ "```python\n",
+ "mapping_expr = 'x * w'\n",
+ "```\n",
+ "**Step 3**: Set the reduction expression over `a` and `b`\n",
+ "```python\n",
+ "reduction_expr = 'a + b'\n",
+ "```\n",
+ "**Step 4**: Set the post-mapping expression for `a`\n",
+ "```python\n",
+ "post_expr = 'y = a + bias'\n",
+ "```\n",
+ "**Step 5**: Initialize the identity value 0\n",
+ "```python\n",
+ "identity_value = '0'\n",
+ "```\n",
+ "**Step 6**: Define the reduction kernel class and set the kernel name\n",
+ "```python\n",
+ "dnnLayer = cp.ReductionKernel(\n",
+ "    input_list,\n",
+ "    output_list,\n",
+ "    mapping_expr,\n",
+ "    reduction_expr,\n",
+ "    post_expr,\n",
+ "    identity_value,\n",
+ "    'dnnLayer')\n",
+ "```\n",
+ "**Step 7**: Initialize the input values\n",
+ "```python\n",
+ "N = 2000\n",
+ "x = cp.random.random(N, dtype=cp.float32)\n",
+ "w = cp.random.random(N, dtype=cp.float32)\n",
+ "bias = -0.01\n",
+ "```\n",
+ "**Step 8**: Make the kernel call\n",
+ "```python\n",
+ "y = dnnLayer(x,w,bias)\n",
+ "print(y)\n",
+ "```\n",
+ "You can run the above code in the cell below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cupy as cp\n",
+ "\n",
+ "input_list = 'float32 x, float32 w, float32 bias'\n",
+ "output_list = 'float32 y'\n",
+ "mapping_expr = 'x * w'\n",
+ "reduction_expr = 'a + b'\n",
+ "post_expr = 'y = a + bias'\n",
+ "identity_value = '0'\n",
+ "\n",
+ "dnnLayer = cp.ReductionKernel(\n",
+ "    input_list,\n",
+ "    output_list,\n",
+ "    mapping_expr,\n",
+ "    reduction_expr,\n",
+ "    post_expr,\n",
+ "    identity_value,\n",
+ "    'dnnLayer')\n",
+ "\n",
+ "N = 2000\n",
+ "x = cp.random.random(N, dtype=cp.float32)\n",
+ "w = cp.random.random(N, dtype=cp.float32)\n",
+ "bias = -0.01\n",
+ "\n",
+ "y = dnnLayer(x,w,bias)\n",
+ "print(y)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Raw Kernels\n",
+ "\n",
+ "CuPy raw kernels are defined through the `RawKernel` object, which enables the direct use of kernels written as CUDA source via CUDA’s `cuLaunchKernel` interface. Raw kernels are written in the CUDA C paradigm, so it is necessary to understand the memory architecture in order to manipulate threads, thread blocks, and grid sizes well; this is what makes it possible to write raw kernels that solve complex tasks effectively.\n",
+ "\n",
+ "### Memory Architecture\n",
+ "\n",
+ "When code runs on the Device (GPU), execution is spread across threads and blocks: a single launch can map onto thousands of threads organised into blocks and grids. This idea is illustrated in figure 3.0, where a thread can be seen as a single executing unit on the Device. A `thread block` (also known simply as a block) is a collection of threads that can cooperate, while a collection of these blocks is referred to as a `grid`. On most devices the maximum number of threads within a thread block is `1,024`, and these labs assume a maximum of `65,535` blocks within a grid.\n",
+ "\n",
+ "*Figure 3.0: Thread, block, and grid concept*\n",
\n", + "\n", + "\n", + "As shown in figure 4.0, the GPU memory space is hierarchically arranged into `shared memory`, `local memory`, `global memory`, `constant memory`, and `texture memory`. Within a block, each thread has its own local memory and register and does communicate with other threads using the shared memory.\n", + "\n", + "
\n", + "
Figure 4.0. Memory Architecture
\n", + "\n", + "**Image source** : Bhaumik Vaidya, Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA, Packt Publishing, 2018.\n", + "\n", + "\n", + "\n", + "A raw kernel runs on the Device and it is defined by creating a `RawKernel` object the embeds CUDA C kernel codes. Let’s illustrate this using example 7 as follows:\n", + "\n", + "**Example 7**: Write a CuPy raw kernel program that adds two arrays assume that both arrays contain 10,000 elements each.\n", + "\n", + "**Step 1**:\n", + "- First, import `cupy as cp` at the top of your notebook to access `RawKernel` class.\n", + "- Next, write an empty raw kernel function enclosed in parenthesis. An example is given below:\n", + "```python\n", + "import cupy as cp\n", + "add_array = cp.RawKernel(r'''\n", + "extern \"C\" __global__\n", + "void () {\n", + "\n", + " \n", + "}\n", + "''', '')\n", + "```\n", + "\n", + "\n", + "- **Write code body**: To successfully write the kernel code body, it is important to know that computations within CUDA kernels execute in thread blocks and grids in a way that input array elements are accessed using global thread id as index. Therefore, it is necessary to uniquely identify distinct threads. A simple illustration on how to estimate global thread `id(s)` is given in figure 5.0 using four blocks of threads stacked over each other to form a matrix in rows and columns arrangement. Global thread ids are calculated in `x-dimension` (ideally thread block are in x,y,z dimensions) by rearranging the thread blocks as single row and then estimate using statement below:\n", + "\n", + "```python\n", + "tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n", + "```\n", + "
\n", + "
Figure 5.0 Estimating thread id for threads in green and orange
\n", + "\n", + "Now that we know how to compute global thread ids, we can proceed to write the CUDA C body code within the Raw Kernel as follows:\n", + "\n", + "```python\n", + "import cupy as cp\n", + "N = 10000 #initialize array size\n", + "add_array = cp.RawKernel(r'''\n", + "extern \"C\" __global__\n", + "void addFunc(const int* d_A, const int* d_B, int* d_C ) {\n", + " int tid = blockDim.x * blockIdx.x + threadIdx.x;\n", + " d_C[tid]= d_A[tid] + d_B[**tid];\n", + "}\n", + "''', 'addFunc')\n", + "```\n", + "\n", + "**Step 2**:\n", + "\n", + "- **Write the Host code**: The first thing to do is to initialize your input arrays as follows\n", + "```python\n", + "import numpy as np\n", + "h_A = np.arange(N, dtype=np.int32)\n", + "h_B = np.arange(N, dtype=np.int32)\n", + "```\n", + "Do data transfer by copying data (input array) from the `Host` to the `Device` using `cp.asarray()` function.\n", + "\n", + "```python\n", + "d_A = cp.asarray(h_A)\n", + "d_B = cp.asarray(h_B)\n", + "d_C = cp.zeros(N, dtype=cp.int32) # initialize zero filled array\n", + "```\n", + "\n", + "**Step 3**:\n", + "\n", + "The next step is to call the raw kernel function from the Host. But before that, a vital move would be to initialize the number of threads that would make up a single block (thread block) so that number of blocks required in a grid to execute the raw kernel can be estimated. In CuPy, raw kernel calls have a definition pattern as follows:\n", + "```python\n", + "((),(),())\n", + "```\n", + "The total number of threads required is equivalent to the size of initialized array, which is 10,000, therefore:\n", + "\n", + "```python\n", + "num_of_threads_per_block = 256 # this has not exceeded the limit i.e < 1024\n", + "```\n", + "Then, `num_of_blocks_per_grid` can be estimated as:\n", + "```python\n", + "num_of_blocks_per_grid = math.ceil (N / num_of_threads_per_block)\n", + "```\n", + "Subsequently, the raw kernel function is called this way:\n", + "\n", + "```python\n", + "add_array((num_of_blocks_per_grid,),(num_of_threads_per_block,),(d_A, d_B, d_C))\n", + "```\n", + "\n", + "**Step 4**:\n", + "\n", + "Copy result from Device to Host using `cp.asnumpy()` function, thus:\n", + "\n", + "```python\n", + "h_C = cp.asnumpy(d_C)\n", + "```\n", + "\n", + "You can run the entire code in the cell below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cupy as cp\n", + "import numpy as np\n", + "import math\n", + "\n", + "N = 10000 #initialize array size\n", + "add_array = cp.RawKernel(r'''\n", + "extern \"C\" __global__\n", + "void addFunc(const int* d_A, const int* d_B, int* d_C ) {\n", + " int tid = blockDim.x * blockIdx.x + threadIdx.x;\n", + " d_C[tid]= d_A[tid] + d_B[tid];\n", + "\n", + "}\n", + "''', 'addFunc')\n", + "\n", + "h_A = np.arange(N, dtype=np.int32)\n", + "h_B = np.arange(N, dtype=np.int32)\n", + "\n", + "d_A = cp.asarray(h_A)\n", + "d_B = cp.asarray(h_B)\n", + "d_C = cp.zeros(N, dtype=cp.int32) # initialize zero filled array\n", + "\n", + "num_of_threads_per_block = 256\n", + "num_of_blocks_per_grid = math.ceil(N / num_of_threads_per_block)\n", + "\n", + "add_array((num_of_blocks_per_grid,),(num_of_threads_per_block,),(d_A, d_B, d_C))\n", + "h_C = cp.asnumpy(d_C)\n", + "print(h_C)\n", + "\n", + "#expected output: [ 0 2 4 ... 
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "**Exercise 3**: *Follow the steps highlighted above and write a CuPy Raw Kernel program that multiplies two arrays and stores the result in a third array. The size of each array is `500,000`. Execute this task in the cell below:*\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cupy as cp\n",
+ "import numpy as np\n",
+ "import math\n",
+ "\n",
+ "N = 500000 #initialize array size\n",
+ "\n",
+ "\n",
+ "\n",
+ "#expected output: [ 0 1 4 ... 888896841 889896836 890896833]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Raw Modules\n",
+ "\n",
+ "A raw module follows the same procedure as a raw kernel but, as the name suggests, several CUDA C kernel functions can be included together in the form of a module. Each kernel function within the raw module is accessed by instantiating a `RawModule` object and calling its `get_function()` method. \n",
+ "\n",
+ "**Example 8:** \n",
+ "(i) $z = \\sum_{i=1}^{N} x_i w_i$ \n",
+ "\n",
+ "(ii) $r = \\sqrt{x^2 + y^2 + z^2}$\n",
+ "\n",
+ "The two tasks in `example 8` are solved using the raw module approach: the `sum_mul` kernel solves example 8(i) and the `compute_xyz` kernel solves example 8(ii). In `sum_mul`, each thread forms one product and `atomicAdd()` accumulates it into `d_z`; the atomic operation prevents multiple threads from updating the sum at the same time, so the additions cannot corrupt each other, and the bounds guard (`tid < n`) keeps the extra threads of the final block from reading past the end of the arrays. Note that atomics are not the fastest way to reduce; the kernel is written this way to reduce complexity at this level. A library-based alternative is sketched just below, and the walkthrough continues after it.\n",
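+ "\n",
+ "As a point of comparison (a sketch, not part of the walkthrough): because 8(i) is a plain dot product, the library call, backed by cuBLAS, produces the same value with no atomics or explicit synchronization:\n",
+ "\n",
+ "```python\n",
+ "import cupy as cp\n",
+ "\n",
+ "N = 2000\n",
+ "d_x = cp.arange(N, dtype=cp.float32)\n",
+ "d_w = cp.arange(N, dtype=cp.float32)\n",
+ "d_z = cp.dot(d_x, d_w)   # or: cp.sum(d_x * d_w)\n",
+ "print(d_z)\n",
+ "```\n",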
\n", + "\n", + "```python\n", + "raw_module_code = r'''\n", + "extern \"C\" {\n", + " __global__ void sum_mul(float* d_x, float* d_w, float* d_z) \n", + " {\n", + " float sum[2000];\n", + " int tid = blockDim.x * blockIdx.x + threadIdx.x;\n", + " sum[tid] = d_x[tid] * d_w[tid];\n", + " __syncthreads();\n", + " atomicAdd(d_z, sum[tid]);\n", + " }\n", + " __global__ void compute_xyz(float* x, float* y, float* z, float* r ) \n", + " {\n", + " int tid = blockDim.x * blockIdx.x + threadIdx.x;\n", + " r[tid] = sqrt(x[tid] * x[tid] + y[tid] * y[tid] + z[tid] * z[tid]) ;\n", + "\n", + " }\n", + " }\n", + "'''\n", + "```\n", + "The next step is to load the raw module by creating an object.\n", + "```python\n", + "#loading module through RawModule object\n", + "raw_module_object = cp.RawModule(code = raw_module_code)\n", + "```\n", + "Get the kernels within the raw module through the `get_function()` method\n", + "```python\n", + "#acessing kernels within the Raw module\n", + "sum_mul = raw_module_object.get_function('sum_mul')\n", + "compute_xyz = raw_module_object.get_function('compute_xyz')\n", + "```\n", + "Initialize data size, thread block size and, grid size\n", + "```python\n", + "#data\n", + "N = 2000 #initialize array size\n", + "num_of_threads_per_block = 128\n", + "num_of_blocks_per_grid = math.ceil(N / num_of_threads_per_block)\n", + "```\n", + "Initialize data for example 8(i) and copy data to the Device using `cp.asarray()`\n", + "```python\n", + "h_x = np.arange(N, dtype=np.float32)\n", + "h_w = np.arange(N, dtype=np.float32)\n", + "\n", + "d_x = cp.asarray(h_x)\n", + "d_w = cp.asarray(h_w)\n", + "d_z = cp.zeros(1, dtype=cp.float32)# initialize zero\n", + "```\n", + "\n", + "Call kernel `sum_mul` and pass the required arguments\n", + "```python\n", + "sum_mul((num_of_blocks_per_grid,),(num_of_threads_per_block,),(d_x, d_w, d_z))\n", + "h_z = cp.asnumpy(d_z)\n", + "print(\"h_z:\", h_z)\n", + "\n", + "verifying result\n", + "print(\"non kernel:\", cp.sum(h_x * h_w))\n", + "```\n", + "Initialize data for example 8(ii) directly on the Device using `cp.arange()`\n", + "\n", + "```python\n", + "x = cp.arange(N, dtype=cp.float32)\n", + "y = cp.arange(N, dtype=cp.float32)\n", + "z = cp.arange(N, dtype=cp.float32)\n", + "r = cp.empty(N, dtype=cp.float32)\n", + "```\n", + "Call kernel `compute_xyz` and pass the required arguments \n", + "```python\n", + "compute_xyz((num_of_blocks_per_grid,),(num_of_threads_per_block,),(x, y, z, r))\n", + "h_r = cp.asnumpy(r)\n", + "print(\"h_r:\", h_r)\n", + "\n", + "Verifying result\n", + "print(\"non kernel:\", cp.sqrt(x * x + y * y+ z * z ))\n", + "#expected result: \n", + "h_z: [2.6646702e+09]\n", + "h_r: [0.0000000e+00 1.7320508e+00 3.4641016e+00 ... 3.4589055e+03 3.4606375e+03 3.4623696e+03]\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## JIT Kernel\n", + "\n", + "The JIT kernel is defined through the `cupyx.jit.rawkernel` decorator. It uses the same concept as the raw kernel but differs by using python functions rather than CUDA C kernels. The decorator the specify at the top of a python function, hence the function becomes a JIT kernel. 
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## JIT Kernel\n",
+ "\n",
+ "The JIT kernel is defined through the `cupyx.jit.rawkernel` decorator. It follows the same concept as the raw kernel, but uses Python functions rather than CUDA C source. The decorator is specified at the top of a Python function, turning that function into a JIT kernel. Let’s illustrate this using `example 7` from the raw kernel section.\n",
+ "\n",
+ "Firstly, import `jit` from the `cupyx` library\n",
+ "\n",
+ "```python\n",
+ "import cupy as cp\n",
+ "from cupyx import jit\n",
+ "```\n",
+ "Next, write the JIT kernel; the `size` argument and the guard keep the extra threads of the final block from writing out of bounds\n",
+ "\n",
+ "```python\n",
+ "@jit.rawkernel()\n",
+ "def addFunc(d_A, d_B, d_C, size):\n",
+ "    tid = jit.blockDim.x * jit.blockIdx.x + jit.threadIdx.x\n",
+ "    if tid < size:\n",
+ "        d_C[tid] = d_A[tid] + d_B[tid]\n",
+ "```\n",
+ "Initialize the data size, thread-block size, and grid size\n",
+ "\n",
+ "```python\n",
+ "N = 10000 #initialize array size\n",
+ "num_of_threads_per_block = 128\n",
+ "num_of_blocks_per_grid = math.ceil(N / num_of_threads_per_block)\n",
+ "```\n",
+ "Initialize the data directly on the Device using `cp.arange()`\n",
+ "\n",
+ "```python\n",
+ "d_A = cp.arange(N, dtype=cp.int32)\n",
+ "d_B = cp.arange(N, dtype=cp.int32)\n",
+ "d_C = cp.zeros(N, dtype=cp.int32) # initialize zero filled array\n",
+ "```\n",
+ "Call the JIT kernel `addFunc` and pass the required arguments\n",
+ "```python\n",
+ "addFunc((num_of_blocks_per_grid,),(num_of_threads_per_block,),(d_A, d_B, d_C, cp.int32(N)))\n",
+ "\n",
+ "print(\"d_C:\", d_C)\n",
+ "\n",
+ "#expected output: d_C: [ 0 2 4 ... 19994 19996 19998]\n",
+ "```\n",
+ "\n",
+ "There are two notable actions in the code above: first, the data was not initialized on the Host but directly on the Device; second, `d_C` was printed directly rather than first being copied back with `cp.asnumpy()` (printing a CuPy array implicitly transfers its contents to the Host). This demonstrates data visibility; it is, however, not always the best data-management approach. Please run the cell below: "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cupy as cp\n",
+ "from cupyx import jit\n",
+ "import math\n",
+ "\n",
+ "@jit.rawkernel()\n",
+ "def addFunc(d_A, d_B, d_C, size):\n",
+ "    tid = jit.blockDim.x * jit.blockIdx.x + jit.threadIdx.x\n",
+ "    if tid < size:\n",
+ "        d_C[tid] = d_A[tid] + d_B[tid]\n",
+ "\n",
+ "N = 100000 #initialize array size\n",
+ "num_of_threads_per_block = 128\n",
+ "num_of_blocks_per_grid = math.ceil(N / num_of_threads_per_block)\n",
+ "\n",
+ "d_A = cp.arange(N, dtype=cp.int32)\n",
+ "d_B = cp.arange(N, dtype=cp.int32)\n",
+ "d_C = cp.zeros(N, dtype=cp.int32) # initialize zero filled array\n",
+ "addFunc((num_of_blocks_per_grid,),(num_of_threads_per_block,),(d_A, d_B, d_C, cp.int32(N)))\n",
+ "print(\"d_C:\", d_C)\n",
+ "\n",
+ "#expected output: d_C: [ 0 2 4 ... 199994 199996 199998]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "**Exercise 4**: *Follow the steps highlighted above and write a CuPy JIT kernel program that multiplies two arrays and stores the result in a third array. The size of each array is 500,000. Execute this task in the cell below:*\n",
+ "\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cupy as cp\n",
+ "import numpy as np\n",
+ "import math\n",
+ "\n",
+ "N = 500000 #initialize array size\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "#expected output: [ 0 1 4 ... 2147483647 2147483647 2147483647]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Summary\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## Lab Task\n",
+ "\n",
+ "In this section, click on the **Serial Code Lab Assignment** link to proceed to Lab 2. In that lab you will find three serial Python functions. You are required to revise the **pair_gpu** function to run on the GPU, and to make a few modifications within the **main** function as well.\n",
+ "\n",
+ "##

[Serial Code Lab Assignment](serial_RDF.ipynb)
\n", + "\n", + " \n", + "---\n", + "\n", + "\n", + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, we recommend you go to your browser's File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied as well. You can also execute the following cell block to create a zip-file of the files you've been working on and download it with the link below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd ..\n", + "rm -f _files.zip\n", + "zip -r _files.zip *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../_files.zip) and choosing save Link As.\n", + "\n", + "**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n", + "\n", + "---\n", + "#
[HOME](../../../_common/_start_nways_python.ipynb)
\n", + "\n", + "---\n", + "\n", + "# Links and Resources\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download Nsight System latest version from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "---\n", + "\n", + "\n", + "## References\n", + "- https://docs.cupy.dev/en/stable/\n", + "- https://cupy.dev/\n", + "- CuPy Documentation Release 8.5.0, Preferred Networks, inc. and Preferred Infrastructure inc., Feb 26, 2021.\n", + "- Bhaumik Vaidya, Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA, Packt Publishing, 2018.\n", + "- Crissman Loomis and Emilio Castillo, CuPy Overview: NumPy Syntax Computation with Advanced CUDA Features, GTC Digital March, March 2020.\n", + "- https://www.gpuhackathons.org/technical-resources\n", + "- https://rapids.ai/start.html\n", + "\n", + "--- \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/python/jupyter_notebook/cupy/serial_RDF.ipynb b/_basic/python/jupyter_notebook/cupy/serial_RDF.ipynb new file mode 100644 index 0000000..a5398f7 --- /dev/null +++ b/_basic/python/jupyter_notebook/cupy/serial_RDF.ipynb @@ -0,0 +1,416 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# \n", + "\n", + "\n", + "# CuPy Lab 2: Serial Code Lab Assignment\n", + "---\n", + "\n", + "#### [<<--CuPy Lab 1](cupy_guide.ipynb)\n", + "\n", + "\n", + "## A Recap on RDF\n", + "\n", + "- The radial distribution function (RDF) denoted as g(r) defines the probability of finding a particle at a distance r from another tagged particle. The RDF is strongly dependent on the type of matter so will vary greatly for solids, gases and liquids. You can read more [here](https://en.wikibooks.org/wiki/Molecular_Simulation/Radial_Distribution_Functions).\n", + "- The code complexity of the algorithm is $N^{2}$. \n", + "- The input data for the serial code is fetched from a DCD binary trajectory file.\n", + "\n", + "\n", + "### The Serial Code\n", + "- The cell below consists of two functions, namely **dcdreadhead** and **dcdreadframe**\n", + "- The **dcdreadhead** function computes the total number of frames and atoms from the DCDFile **(input/alk.traj.dcd)**, while the **dcdreadframe** function reads 10 frames and 6720 atoms (note: each frame contains 6720 atoms) using the MDAnalysis library. 
\n", + "- Both functions run on the Host (CPU) and are being called from the function **main()**.\n", + "### Cell 1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "import cupy as cp\n", + "import numpy as np\n", + "import math\n", + "import cupy.cuda.nvtx as nvtx\n", + "from MDAnalysis.lib.formats.libdcd import DCDFile\n", + "from timeit import default_timer as timer\n", + "\n", + "\n", + "def dcdreadhead(infile):\n", + " nconf = infile.n_frames\n", + " _infile = infile.header\n", + " numatm = _infile['natoms']\n", + " return numatm, nconf\n", + "\n", + "def dcdreadframe(infile, numatm, nconf):\n", + "\n", + " d_x = np.zeros(numatm * nconf, dtype=np.float64)\n", + " d_y = np.zeros(numatm * nconf, dtype=np.float64)\n", + " d_z = np.zeros(numatm * nconf, dtype=np.float64)\n", + "\n", + " for i in range(nconf):\n", + " data = infile.readframes(i, i+1)\n", + " box = data[1]\n", + " atomset = data[0][0]\n", + " xbox = round(box[0][0], 8)\n", + " ybox = round(box[0][2],8)\n", + " zbox = round(box[0][5], 8)\n", + "\n", + " for row in range(numatm):\n", + " d_x[i * numatm + row] = round(atomset[row][0], 8) # 0 is column\n", + " d_y[i * numatm + row] = round(atomset[row][1], 8) # 1 is column\n", + " d_z[i * numatm + row] = round(atomset[row][2], 8) # 2 is column\n", + "\n", + " return xbox, ybox, zbox, d_x, d_y, d_z" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## pair_gpu function\n", + "\n", + "- The pair_gpu is the function where the main task of the RDF serial implementation is being executed. The function computes differences in xyz DCD frames.\n", + "- The essence of njit(just-in-time) decorator is to get pair_gpu function to compile under no python mode, and this is important for good performance. \n", + "- The decorator **@njit** or **@jit(nopython=True)** ensures that an exception is raised when compilation fails as a way to alert the user that a bug is found within the decorated function. You can read more [here](https://numba.pydata.org/numba-doc/latest/user/performance-tips.html).\n", + "\n", + "### Cell 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from numba import njit\n", + "\n", + "@njit()\n", + "def pair_gpu(d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, d_bin):\n", + " box = min(xbox, ybox)\n", + " box = min(box, zbox)\n", + " _del = box / (2.0 * d_bin)\n", + " cut = box * 0.5\n", + "\n", + " for frame in range(nconf):\n", + " # print(\"\\n {}\".format(frame))\n", + " for id1 in range(numatm):\n", + " for id2 in range(numatm):\n", + " dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]\n", + " dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]\n", + " dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2 ]\n", + " dx = dx - xbox * (round(dx / xbox))\n", + " dy = dy - ybox * (round(dy / ybox))\n", + " dz = dz - zbox * (round(dz / zbox))\n", + "\n", + " r = math.sqrt(dx * dx + dy * dy + dz * dz)\n", + " if r < cut :\n", + " ig2 = int((r/_del))\n", + " d_g2[ig2] = d_g2[ig2] + 1\n", + "\n", + " return d_g2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Brief Analysis on Tasks Performed within pair_gpu function\n", + "- The graphic below identifies the various operations executed in the pair_gpu function. 
This function executes three nested loops using tricky indexing manipulation within the arrays.\n", + "\n", + "\n", + "\n", + "\n", + "- The indexing flow for the operation 1 is simulated using the graphic below. Each green box simulates the subtraction operation within the two inner loops (id1 & id2) while the indexes written in blue signifies the outer-most loop (frame) which iterates 10 times. \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "### The Main Function\n", + "- This is the entry point of the program where every other function including the **pair_gpu** function are called. The output of the main function is written into two files. An image version of the output files (\"**cupy_RDF.dat**\" & \"**cupy_Pair_entropy.dat**\") are displayed below the code cell.\n", + "\n", + "### Cell 3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from MDAnalysis.lib.formats.libdcd import DCDFile\n", + "import os\n", + "from pathlib import Path\n", + "\n", + "def main():\n", + " start = timer()\n", + " ########## Input Details ###########\n", + " inconf = 10\n", + " nbin = 2000\n", + " global xbox, ybox, zbox\n", + " \n", + " fileDir = os.path.dirname(os.path.realpath('__file__'))\n", + " dataRoot = Path(fileDir).parents[1]\n", + " file = os.path.join(dataRoot, 'source_code/input/alk.traj.dcd')\n", + " \n", + " infile = DCDFile(file)\n", + " pairfile = open(\"RDF.dat\", \"w+\")\n", + " stwo = open(\"Pair_entropy.dat\", \"w+\")\n", + "\n", + " numatm, nconf = dcdreadhead(infile)\n", + " print(\"Dcd file has {} atoms and {} frames\".format(numatm, nconf))\n", + " if inconf > nconf:\n", + " print(\"nconf is reset to {}\".format(nconf))\n", + " else:\n", + " nconf = inconf\n", + " print(\"Calculating RDF for {} frames\".format(nconf))\n", + " #numatm = 50\n", + " sizef = nconf * numatm\n", + " sizebin = nbin\n", + " ########### reading cordinates ##############\n", + " nvtx.RangePush(\"Read_File\")\n", + " xbox, ybox, zbox, h_x, h_y, h_z = dcdreadframe(infile, numatm, nconf)\n", + " nvtx.RangePop() # pop for reading file\n", + "\n", + " h_g2 = np.zeros(sizebin, dtype=np.longlong)\n", + " print(\"Reading of input file is completed\")\n", + " \n", + " print(\"\\n {} {}\".format(nconf, numatm))\n", + " ############# This where we will concentrate #########################\n", + " nvtx.RangePush(\"Pair_Circulation\")\n", + " h_g2 = pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin)\n", + " nvtx.RangePop() #pop for Pair Calculation\n", + " ######################################################################\n", + " \n", + " pi = math.acos(np.int64(-1.0))\n", + " rho = (numatm) / (xbox * ybox * zbox)\n", + " norm = (np.int64(4.0) * pi * rho) / np.int64(3.0)\n", + " g2 = np.zeros(nbin, dtype=np.float32)\n", + " s2 = np.int64(0.0);\n", + " s2bond = np.int64(0.0)\n", + " lngrbond = np.int64(0.0)\n", + " box = min(xbox, ybox)\n", + " box = min(box, zbox)\n", + " _del = box / (np.int64(2.0) * nbin)\n", + " gr = np.float32(0.0)\n", + " # loop to calculate entropy\n", + " nvtx.RangePush(\"Entropy_Calculation\")\n", + " for i in range(nbin):\n", + " rl = (i) * _del\n", + " ru = rl + _del\n", + " nideal = norm * (ru * ru * ru - rl * rl * rl)\n", + " g2[i] = h_g2[i] / (nconf * numatm * nideal)\n", + " r = (i) * _del\n", + " temp = (i + 0.5) * _del\n", + " \n", + " #writing to file\n", + " pairfile.write(str(temp) + \" \" + str(g2[i]) + \"\\n\")\n", + "\n", + " if r < np.int64(2.0):\n", + " gr = np.int64(0.0)\n", + 
" else:\n", + " gr = g2[i]\n", + " if gr < 1e-5:\n", + " lngr = np.int64(0.0)\n", + " else:\n", + " lngr = math.log(gr)\n", + " if g2[i] < 1e-6:\n", + " lngrbond = np.int64(0.0)\n", + " else:\n", + " lngrbond = math.log(g2[i])\n", + " s2 = s2 - (np.int64(2.0) * pi * rho * ((gr * lngr) - gr + np.int64(1.0)) * _del * r * r)\n", + " s2bond = s2bond - np.int64(2.0) * pi * rho * ((g2[i] * lngrbond) - g2[i] + np.int64(1.0)) * _del * r * r\n", + "\n", + " nvtx.RangePop() # pop for entropy Calculation\n", + " \n", + " #writing s2 and s2bond to file\n", + " stwo.writelines(\"s2 value is {}\\n\".format(s2))\n", + " stwo.writelines(\"s2bond value is {}\".format(s2bond))\n", + " \n", + " # printing s2 and s2bond to jupyter output\n", + " print(\"\\n s2 value is {}\\n\".format(s2))\n", + " print(\"s2bond value is {}\\n\".format(s2bond))\n", + "\n", + " print(\"#Freeing Host memory\")\n", + " del(h_x)\n", + " del(h_y)\n", + " del(h_z)\n", + " del(h_g2)\n", + " print(\"#Number of atoms processed: {} \\n\".format(numatm))\n", + " print(\"#number of confs processed: {} \\n\".format(nconf))\n", + " \n", + "\n", + "if __name__ == \"__main__\":\n", + " #main() " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "### Output Files\n", + "\n", + " \n", + " \n", + " \n", + "
\n", + " \n", + "
\n", + "\n", + "\n", + "---\n", + "\n", + "# Lab Task \n", + "\n", + "1. **Run the serial code from cell 1, 2, & 3**.\n", + " - Remove the **\"#\"** behind the **main()** before running the cell 3:\n", + " ```python\n", + " if __name__ == \"__main__\":\n", + " main()\n", + " ```\n", + "2. **Now, let's start modifying the original code to CuPy code constructs.**\n", + "> Click on the [Modify](../../source_code/serial/nways_serial.py) link, and modify **nways_serial.py** to `CuPy code construct`. Remember to SAVE your code after changes, and then run the cell below. \n", + "> Hints: focus on the **pair_gpu** function and you may need to modify few lines in the **main** function as well.\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%run ../../source_code/serial/nways_serial.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output should be the following:\n", + "\n", + "```\n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + "```\n", + "\n", + "3. **Profile the code by running the cell bellow** " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd ../../source_code/serial&& nsys profile --stats=true --force-overwrite true -o serial_cpu_rdf python3 nways_serial.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view the profiler report, you need to download and save the report file by holding down Shift and Right-Clicking [Here](../../source_code/serial/serial_cpu_rdf.nsys-rep) and choosing save Link As Once done, open it via the GUI. A sample expected profile report is shown below:\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "From the profile report, we can see that the pair_gpu function now takes milliseconds to run as compared to the serial version which takes more than 3 seconds as shown [here](../serial/rdf_overview.ipynb). \n", + " \n", + "\n", + "---\n", + "### [View ](../../source_code/cupy/cupy_rdf.py) or [Run](../../jupyter_notebook/cupy/cupy_RDF.ipynb) Solution\n", + "---\n", + "\n", + "\n", + "\n", + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, we recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied as well. You can also execute the following cell block to create a zip-file of the files you've been working on and download it with the link below.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd ..\n", + "rm -f _files.zip\n", + "zip -r _files.zip *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../_files.zip) and choosing save Link As.\n", + "\n", + "---\n", + "\n", + "#

[HOME](../../../_common/_start_nways_python.ipynb)

\n", + "\n", + "---\n", + "\n", + "\n", + "## Links and Resources\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download the latest version of the Nsight System from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "--- \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/python/jupyter_notebook/numba/numba_RDF.ipynb b/_basic/python/jupyter_notebook/numba/numba_RDF.ipynb new file mode 100644 index 0000000..01a4cfb --- /dev/null +++ b/_basic/python/jupyter_notebook/numba/numba_RDF.ipynb @@ -0,0 +1,314 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# \n", + "\n", + "\n", + "# Numba Lab 3: Solution\n", + "---\n", + "\n", + "#### [<<-- Numba Lab 2](serial_RDF.ipynb)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import math\n", + "import cupy.cuda.nvtx as nvtx\n", + "from MDAnalysis.lib.formats.libdcd import DCDFile\n", + "from timeit import default_timer as timer\n", + "import numba.cuda as cuda\n", + "\n", + "\n", + "def dcdreadhead(infile):\n", + " nconf = infile.n_frames\n", + " _infile = infile.header\n", + " numatm = _infile['natoms']\n", + " return numatm, nconf\n", + "\n", + "def dcdreadframe(infile, numatm, nconf):\n", + "\n", + " d_x = np.zeros(numatm * nconf, dtype=np.float64)\n", + " d_y = np.zeros(numatm * nconf, dtype=np.float64)\n", + " d_z = np.zeros(numatm * nconf, dtype=np.float64)\n", + "\n", + " for i in range(nconf):\n", + " data = infile.readframes(i, i+1)\n", + " box = data[1]\n", + " atomset = data[0][0]\n", + " xbox = round(box[0][0], 8)\n", + " ybox = round(box[0][2],8)\n", + " zbox = round(box[0][5], 8)\n", + "\n", + " for row in range(numatm):\n", + " d_x[i * numatm + row] = round(atomset[row][0], 8) # 0 is column\n", + " d_y[i * numatm + row] = round(atomset[row][1], 8) # 1 is column\n", + " d_z[i * numatm + row] = round(atomset[row][2], 8) # 2 is column\n", + "\n", + " return xbox, ybox, zbox, d_x, d_y, d_z\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### The Numba CUDA-jit pair_gpu acceleration code " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@cuda.jit\n", + "def pair_gpu_kernel(d_x, d_y,d_z, d_g2, numatm, nconf, xbox, ybox,zbox,d_bin, bl):\n", + " box = min(xbox, ybox)\n", + " box = 
min(box, zbox)\n", + " _del= box / (2.0 * d_bin)\n", + " cut = box * 0.5;\n", + " i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n", + " maxi = min(int(0.5 * numatm * (numatm - 1) - (bl * 65535 * 128)), (65535 * 128))\n", + "\n", + " if i < maxi:\n", + " thisi=bl * 65535 * 128+i\n", + " n = (0.5) * (1+ ( math.sqrt (1.0+4.0 * 2.0 * thisi)))\n", + " id1 = int(n)\n", + " id2 = thisi-(0.5 * id1 * (id1-1))\n", + " for frame in range(0, nconf):\n", + " t1 = int(frame * numatm+id1)\n", + " t2 = int(frame * numatm+id2)\n", + " dx = d_x[t1] - d_x[t2]\n", + " dy = d_y[t1] - d_y[t2]\n", + " dz = d_z[t1] - d_z[t2]\n", + " dx = dx - xbox * (round(dx / xbox))\n", + " dy = dy - ybox * (round(dy / ybox))\n", + " dz = dz - zbox * (round(dz / zbox))\n", + "\n", + " r= math.sqrt(dx * dx+dy * dy+dz * dz)\n", + " if r < cut:\n", + " ig2=(int)(r / _del )\n", + " cuda.atomic.add(d_g2, ig2, 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### The Main Function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from MDAnalysis.lib.formats.libdcd import DCDFile\n", + "import os\n", + "from pathlib import Path\n", + "\n", + "def main():\n", + " #start = timer()\n", + " ########## Input Details ###########\n", + " global xbox, ybox, zbox\n", + " inconf = 10\n", + " nbin =np.int32(2000)\n", + " xbox = np.float32(0)\n", + " ybox =np.float32(0)\n", + " zbox = np.float32(0)\n", + " \n", + " fileDir = os.path.dirname(os.path.realpath('__file__'))\n", + " dataRoot = Path(fileDir).parents[1]\n", + " file = os.path.join(dataRoot, 'source_code/input/alk.traj.dcd')\n", + " \n", + " infile = DCDFile(file)\n", + "\n", + " pairfile = open(\"numba_RDF.dat\", \"w+\")\n", + " stwo = open(\"numba_Pair_entropy.dat\", \"w+\")\n", + "\n", + " numatm, nconf = dcdreadhead(infile)\n", + "\n", + " print(\"Dcd file has {} atoms and {} frames\".format(numatm, nconf))\n", + " if inconf > nconf:\n", + " print(\"nconf is reset to {}\".format(nconf))\n", + " else:\n", + " nconf = inconf\n", + " print(\"Calculating RDF for {} frames\".format(nconf))\n", + "\n", + " #numatm = 100\n", + " sizef = nconf * numatm\n", + " sizebin = nbin\n", + "\n", + " ########### reading cordinates ##############\n", + " nvtx.RangePush(\"Read_File\")\n", + " xbox, ybox, zbox, d_x, d_y, d_z = dcdreadframe(infile, numatm, nconf)\n", + " nvtx.RangePop() # pop for reading file\n", + " print(\"Reading of input file is completed\")\n", + "\n", + " ############################## Numba KERNEL #################################################\n", + "\n", + " nthreads = 128;\n", + " near2 = nthreads * (int(0.5 * numatm * (numatm - 1) / nthreads) + 1);\n", + " nblock = (near2 / nthreads);\n", + " print(\" Initial blocks are {} and now changing to\".format(nblock))\n", + " maxblock = 65535\n", + " blockloop = int(nblock / maxblock)\n", + " if blockloop != 0:\n", + " nblock = maxblock\n", + " print(\"{} and will run over {} blockloops \\n\".format(nblock, blockloop+1))\n", + "\n", + " # cp.cuda.runtime.memset(d_g2,0,sizebin)\n", + " d_g2 = np.zeros(sizebin, dtype=np.int64)\n", + " d_g2 = cuda.to_device(d_g2) #numba copy to device\n", + "\n", + " nvtx.RangePush(\"Pair_Circulation_Numba\")\n", + " #t1 = timer()\n", + " for bl in range(blockloop+1):\n", + " pair_gpu_kernel[nblock,nthreads ](d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, nbin, bl) ## numba jit kernel\n", + " \n", + " cuda.synchronize()\n", + " # print(\"Kernel compute time:\", timer() - t1)\n", + " 
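+ "    # smallest box edge: keeps the half-box cutoff below valid in every direction\n",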
d_g2 = d_g2.copy_to_host() ## numba copy to host\n", + " nvtx.RangePop() # pop for Pair Calculation\n", + "\n", + " pi = math.acos(np.int64(-1.0))\n", + " rho = (numatm) / (xbox * ybox * zbox)\n", + " norm = (np.int64(4.0) * pi * rho) / np.int64(3.0)\n", + " g2 = np.zeros(nbin, dtype=np.float32)\n", + " s2 =np.int64(0.0); s2bond = np.int64(0.0)\n", + " lngrbond = np.float32(0.0)\n", + " box = min(xbox, ybox)\n", + " box = min(box, zbox)\n", + " _del =box / (np.int64(2.0) * nbin)\n", + " gr = np.float32(0.0)\n", + " # loop to calculate entropy\n", + " nvtx.RangePush(\"Entropy_Calculation\")\n", + " for i in range(nbin):\n", + " rl = (i) * _del\n", + " ru = rl + _del\n", + " nideal = norm * (ru * ru * ru - rl * rl * rl)\n", + " g2[i] = d_g2[i] / (nconf * numatm * nideal)\n", + " r = (i) * _del\n", + " temp = (i + 0.5) * _del\n", + " pairfile.write(str(temp) + \" \" + str(g2[i]) + \"\\n\")\n", + "\n", + " if r < np.int64(2.0):\n", + " gr = np.int64(0.0)\n", + " else:\n", + " gr = g2[i]\n", + " if gr < 1e-5:\n", + " lngr = np.int64(0.0)\n", + " else:\n", + " lngr = math.log(gr)\n", + " if g2[i] < 1e-6:\n", + " lngrbond = np.int64(0.0)\n", + " else:\n", + " lngrbond = math.log(g2[i])\n", + " s2 = s2 - (np.int64(2.0) * pi * rho * ((gr * lngr) - gr + np.int64(1.0)) * _del * r * r)\n", + " s2bond = s2bond - np.int64(2.0) * pi * rho * ((g2[i] * lngrbond) - g2[i] + np.int64(1.0)) * _del * r * r\n", + "\n", + " nvtx.RangePop() # pop for entropy Calculation\n", + " stwo.writelines(\"s2 value is {}\\n\".format(s2))\n", + " stwo.writelines(\"s2bond value is {}\".format(s2bond))\n", + " \n", + " print(\"\\n s2 value is {}\\n\".format(s2))\n", + " print(\"s2bond value is {}\\n\".format(s2bond))\n", + " \n", + " print(\"#Freeing Host memory\")\n", + " del (d_x)\n", + " del (d_y)\n", + " del (d_z)\n", + " del (d_g2)\n", + " print(\"#Number of atoms processed: {} \\n\".format(numatm))\n", + " print(\"#number of confs processed: {} \\n\".format(nconf))\n", + " #total_time = timer() - start\n", + " #print(\"total time spent:\", total_time)\n", + " \n", + "if __name__ == \"__main__\":\n", + " main()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Output Files\n", + "\n", + "\n", + "\n", + "### Profiling Sample\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "---\n", + "\n", + "#

[HOME](../../../_common/_start_nways_python.ipynb)

\n", + "\n", + "---\n", + "\n", + "\n", + "# Links and Resources\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download the latest version of the Nsight System from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/python/jupyter_notebook/numba/numba_guide.ipynb b/_basic/python/jupyter_notebook/numba/numba_guide.ipynb new file mode 100644 index 0000000..072c9ef --- /dev/null +++ b/_basic/python/jupyter_notebook/numba/numba_guide.ipynb @@ -0,0 +1,720 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Numba Lab1: Numba For CUDA GPU\n", + "---\n", + "\n", + "## Learning Objectives\n", + "- **The goal of this lab is to:**\n", + " - enable you to quickly start using Numba (beginner to advanced level)\n", + " - teach you to apply the concepts of CUDA GPU programming to HPC field(s); and\n", + " - show you how to achieve computational speedup on GPUs to maximize the throughput of your HPC implementation.\n", + "\n", + "\n", + "Before we begin, let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by clicking on it with your mouse, and pressing Ctrl-Enter, or pressing the play button in the toolbar above. You should see some output returned below the grey cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " \n", + "## Introduction\n", + "\n", + "**Numba** is a just-in-time compiler for python that converts python functions into optimized machine code at runtime. In other words, user-defined functions written in python would be run at native machine code speed. For example, a programmer can delegate functions that are computationally intensive (especially those with consecutive nested loops and arrays) within his/her code to Numba execution and gain speed up. This is achievable by placing Numba decorator at the top of a user-define function. A Numba decorator determines how a function would be compiled, and more on it would be explained later in the notebook. 
Numba has strong support for the NumPy library and enables parallel programming on both the `CPU (multicore)` and the `GPU (via the CUDA API binding)`, making execution on NumPy arrays faster. Our focus here is `Numba for CUDA GPUs`, so parallel programming concepts are described from a CUDA C kernel perspective. The rest of the notebook uses a few recurring terms: **Host** (the CPU), **Device** (the GPU), and **Kernel** (a user-defined function, marked with a Numba decorator, that runs on the GPU).\n",
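+ "\n",
+ "As a minimal illustration of the decorator idea (a sketch with a made-up function name), compiling a plain Python loop is a one-line change:\n",
+ "\n",
+ "```python\n",
+ "from numba import njit\n",
+ "import numpy as np\n",
+ "\n",
+ "@njit() # compiled to native machine code on the first call\n",
+ "def sum_of_squares(arr):\n",
+ "    total = 0.0\n",
+ "    for x in arr:\n",
+ "        total += x * x\n",
+ "    return total\n",
+ "\n",
+ "print(sum_of_squares(np.arange(1000000, dtype=np.float64)))\n",
+ "```\n",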
\n", + "
Figure 1.0. Thread, block, and grid concept
\n", + " \n", + " As shown in figure 2.0, the GPU memory space is hierarchically arranged into `shared memory`, `local memory`, `global memory`, `constant memory`, and `texture memory`. Within a block, each thread has its own local memory and register and does communicate with other threads using the shared memory.\n", + " \n", + "
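+ "\n",
+ "You can query these limits for the GPU you are actually running on; a small sketch using Numba's device API:\n",
+ "\n",
+ "```python\n",
+ "import numba.cuda as cuda\n",
+ "\n",
+ "dev = cuda.get_current_device()\n",
+ "print(dev.name)\n",
+ "print(\"max threads per block:\", dev.MAX_THREADS_PER_BLOCK)\n",
+ "print(\"max grid dim (x):\", dev.MAX_GRID_DIM_X)\n",
+ "```\n",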
\n", + "
Figure 2.0. Memory Architecture
\n", + " \n", + " **Image source**: Bhaumik Vaidya, Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA, Packt Publishing, 2018.\n", + " \n", + "\n", + "## Numba CUDA Kernel\n", + "\n", + "This section highlights steps in writing your first CUDA Kernel in Numba. The steps are illustrated using a simple task as follows:\n", + "\n", + "**Example 1**: Write a Numba Kernel program that adds two arrays A and B and stores the result in array C. Assume A and B contain 10,000 values.\n", + "\n", + "#### Step 1: \n", + " - First, import `numba.cuda as cuda` library at the top of your notebook in order to access `cuda.jit`. \n", + " - Next, write an empty python function and specify `@cuda.jit` at the top of the function. An example is given below:\n", + "\n", + "\n", + "```python\n", + "import numba.cuda as cuda\n", + "@cuda.jit\n", + "def ():\n", + " #...code body ...\n", + "```\n", + "\n", + " - **Write code body**: To successfully write the kernel code body, it is important to know that computations within CUDA kernels execute in thread blocks and grids in a way that input array elements are accessed using global thread id as index. Therefore, it is necessary to uniquely identify distinct threads. A simple illustration on how to estimate global thread `id(s)` is given in figure 3.0 using four blocks of threads stacked over each other to form a matrix in rows and columns arrangement. Global thread ids are calculated in `x-dimension` (ideally thread block are in x,y,z dimensions) by rearranging the thread blocks as single row and then estimate using statement below:\n", + "\n", + "```python\n", + "tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n", + "```\n", + "
\n", + "
Figure 3.0 Estimating thread id for threads in green and orange
\n", + "\n", + "Now that we know how to compute thread ids, we can proceed to write the kernel body code as follows: \n", + "\n", + "```python\n", + "import numba.cuda as cuda\n", + "N = 10000 #initialize array size\n", + "@cuda.jit\n", + "def Add_arrays(d_A, d_B, d_C):\n", + "\ttid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n", + " if tid < N: #ensuring that index is not out of bound\n", + " d_C[tid] = d_A[tid] + d_B[tid]\n", + "\n", + "```\n", + "\n", + "Note that kernel function does not return value through variable but stream/copy back from the Device to the Host.\n", + "\n", + "#### Step 2:\n", + "\n", + "- **Write the Host code**: The first thing to do is to initialize your input arrays as follows:\n", + " \n", + "```python\n", + "import numpy as np\n", + "h_A = np.arange(N, dtype=np.int32)\n", + "h_B = np.arange(N, dtype=np.int32)\n", + "h_C = np.zeros(N, dtype=np.int32) #initialize zero filled array\n", + "````\n", + "Do data transfer by copying data (input array) from the `Host` to the `Device` using `cuda.to_device()` function. \n", + " \n", + "```python\n", + "d_A = cuda.to_device(h_A)\n", + "d_B = cuda.to_device(h_B)\n", + "d_C = cuda.to_device(h_C)\n", + "\n", + "```\n", + "**Attention!!!** Numba Kernels automatically have direct access to NumPy arrays residing on the Host thus, explicitly stating `cuda.to_device()` is may not be required for data visibility to the Device.\n", + "\n", + "#### Step 3: \n", + "\n", + "The next step is to call the kernel function from the Host. But before that, a vital move would be to initialize the number of threads that would make up a single block (thread block) so that number of blocks required in a grid to execute the `Add_array` Kernel can be estimated. In Numba, Kernel calls have a definition pattern as follows:\n", + "\n", + "` [ , ] ()`\n", + "\n", + "The total number of threads required is equivalent to the size of initialized array, which is 10,000, therefore:\n", + "\n", + "```python \n", + "num_of_threads_per_block = 256 # this has not exceed the limit i.e < 1024 \n", + "```\n", + "\n", + "Then, `num_of_blocks_per_grid` can be estimated as:\n", + "\n", + "```python \n", + "num_of_blocks_per_grid = math.ceil(N / num_of_threads_per_block)\n", + "```\n", + "\n", + "Subsequently, `Add_arrays` Kernel function can be called this way:\n", + "\n", + "`Add_arrays[num_of_blocks_per_grid , num_of_threads_per_block](d_A, d_B, d_C)`\n", + "\n", + "#### Step 4:\n", + "\n", + "Copy result from Device to Host using `copy_to_host()` function, thus:\n", + "\n", + "```python\n", + "h_C = d_C.copy_to_host()\n", + "````\n", + "You can run the entire code in the below." 
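+ "\n",
+ "Numba also provides `cuda.grid(1)` as a shorthand that returns the same global index in a single call (the 2D form `cuda.grid(2)` appears later in this notebook):\n",
+ "\n",
+ "```python\n",
+ "tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n",
+ "tid = cuda.grid(1) # equivalent one-liner\n",
+ "```\n",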
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "import numba.cuda as cuda\n", + "import numpy as np\n", + "\n", + "N = 10000 #initialize array size\n", + "#kernel function\n", + "@cuda.jit\n", + "def Add_arrays(d_A, d_B, d_C):\n", + " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n", + " if tid < N: #ensuring that index is not out of bound\n", + " d_C[tid] = d_A[tid] + d_B[tid]\n", + " \n", + "#input data initialzed on the Host\n", + "h_A = np.arange(N, dtype=np.int32)\n", + "h_B = np.arange(N, dtype=np.int32)\n", + "h_C = np.zeros(N, dtype=np.int32) # initialize zero filled array\n", + "\n", + "#input data copied to Device\n", + "d_A = cuda.to_device(h_A)\n", + "d_B = cuda.to_device(h_B)\n", + "d_C = cuda.to_device(h_C)\n", + "\n", + "#set block and grid size and call kernel\n", + "num_of_threads_per_block = 256 # this has no exceed the limit i.e < 1024\n", + "num_of_blocks_per_grid = math.ceil(N / num_of_threads_per_block)\n", + "Add_arrays[num_of_blocks_per_grid , num_of_threads_per_block](d_A, d_B, d_C)\n", + "\n", + "#copy result back to Host\n", + "h_C = d_C.copy_to_host()\n", + "print(\"h_C :\", h_C)\n", + "\n", + "####################################################################\n", + "#expected output: h_C : [ 0 2 4 ... 19994 19996 19998]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "**Exercise 1**: Follow the steps highlighted above and write a Numba program (Host & Device code) that adds two arrays and stores the result in a third array. The size of each array is 500,000. Execute this task in the cell below:\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "#import library\n", + "#import library\n", + "\n", + "N = 500000 #initialize array size\n", + "\n", + "#kernel function\n", + "def Addition_kernel(A, B, C):\n", + " #write kernel code body\n", + "\n", + "\n", + "########## Host code body #########\n", + "\n", + "#input data initialzed on the Host\n", + "\n", + "\n", + "\n", + "#input data copied to Device\n", + "\n", + "\n", + "\n", + "#set block and grid size and call kernel\n", + "\n", + "\n", + "\n", + "#copy result back to Host\n", + "\n", + "\n", + "\n", + "################### expected output #################################\n", + "# [ 0 2 4 ... 999994 999996 999998]\n", + "######################################################################" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Thread Reuse\n", + "\n", + "It is possible to specify a few numbers of thread for a large data in a way that threads are reused to complete the computation of the entire data. This is one of the approaches used when data to be computed is larger than the maximum number of threads available on the device memory. The statement below is used in a `while loop` to achieve such purpose: \n", + "\n", + "```python \n", + "tid += cuda.blockDim.x * cuda.gridDim.x\n", + "```\n", + "The sample code given below illustrates thread reuse using `example 1` as a case study. In the example, number of blocks per grid is set to 1 on purpose to show the possibility of this approach. Therefore, a single block of thread having 256 threads would be reused to compute addition operation on two arrays, each of size 10,000. 
\n", + "\n", + "```python\n", + "import numba.cuda as cuda\n", + "import numpy as np\n", + "\n", + "N = 10000 #initialize array size\n", + "#kernel function\n", + "@cuda.jit\n", + "def arrayAdd(d_A, d_B, d_C):\n", + " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n", + " while tid < N:\n", + " array_out[tid] = array_A[tid] + array_B[tid]\n", + " tid += cuda.blockDim.x * cuda.gridDim.x\n", + " \n", + "################################################\n", + "#set block and grid size and call kernel\n", + "num_of_threads_per_block = 256\n", + "num_of_blocks_per_grid = 1\n", + "arrayAdd[num_of_blocks_per_grid, threadsperblock](d_A, d_B, d_C)\n", + "\n", + "print(array_out)\n", + "\n", + "#output: [ 0 2 4 ... 19994 19996 19998]\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2-Dimensional Array\n", + "\n", + "In this section, the focus would be on performing simple calculation with 2D arrays. In two different approaches, let’s consider multiplication of `4x4` arrays A and B. In the first approach, each array is split into `2x2` segments in a way that threads per block and blocks per grid would be `2x2` respectively. In the second approach, each array would fit into a thread block as `4x4` and block per grid would be `1x1(1 grid)`. This process is exemplified in figure 4.0 using small array size for ease of understanding. It is assumed that the mathematical process of multiply two matrixes is already known, however, a glimpse on the process is shown in figure 5.0.\n", + "\n", + "
\n", + "
Figure 4.0 2D array device fitting logic
\n", + "
\n", + "
\n", + "
Figure 5.0 Matrix multiplication logic
\n", + "\n", + "**Implementation:** The 4 basic steps that were previously explained would be followed to solve the first approach.\n", + "\n", + "#### Step 1\n", + "```python\n", + "import numba.cuda as cuda\n", + "import numpy as np\n", + "import math\n", + "\n", + "N = 4 #initialize array size\n", + "\n", + "@cuda.jit()\n", + "def MatrixMul2D(d_A, d_B, d_C):\n", + " row, col = cuda.grid(2)\n", + " if row < d_C.shape[0] and col < d_C.shape[1]:\n", + " for k in range(N):\n", + " d_C[row][col]+= d_A[row][k] * d_B[k][col]\n", + "```\n", + "#### step 2\n", + "```python\n", + "#input data initialzed on the Host \n", + "h_A = np.array([[0,0,0,0],[1,1,1,1],[2,2,2,2],[3,3,3,3]], dtype=np.int32)\n", + "h_B = np.array([[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]], dtype=np.int32)\n", + "h_C = np.zeros(N*N, dtype=np.int32).reshape(N, N)\n", + "\n", + "#input data copied to Device\n", + "d_A = cuda.to_device(h_A)\n", + "d_B = cuda.to_device(h_B)\n", + "d_C = cuda.to_device(h_C)\n", + "\n", + "```\n", + "#### step 3\n", + "```python\n", + "#set block and grid size and call kernel\n", + "num_of_threads_per_block = (2,2)\n", + "num_of_blocks_per_grid_x = (math.ceil( N / num_of_threads_per_block[0]))\n", + "num_of_blocks_per_grid_y = (math.ceil( N / num_of_threads_per_block[1]))\n", + "num_of_blocks_per_grid = (num_of_blocks_per_grid_x , num_of_blocks_per_grid_y)\n", + "\n", + "MatrixMul2D[num_of_blocks_per_grid, num_of_threads_per_block](d_A, d_B, d_C)\n", + "```\n", + "#### Step 4\n", + "```python\n", + "h_C = d_C.copy_to_host()\n", + "\n", + "print(\"h_A:\\n {}\\n\".format(h_A))\n", + "print(\"h_B:\\n {}\\n\".format(h_B))\n", + "print(\"h_C:\\n {}\".format(h_C))\n", + "\n", + "```\n", + "---\n", + "**Exercise 2**: Write a Numba program (Host & Device) that multiply two matrixes of dimension `225 x 225`. Part of the code has been written for you in the cell below and you are to complete the rest. In the solution, the intention is to split each matrix into `25x25` segments such that threads per block and blocks per grid would be `25x25` respectively.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "import numpy \n", + "#import library\n", + "\n", + "N = 225 #initialize array size\n", + "\n", + "#kernel function\n", + "def MatrixMul2D(d_A, d_B, d_C):\n", + " x, y = cuda.grid(2)\n", + " #complete kernel code\n", + "\n", + "\n", + "#input data initialized on the Host\n", + "h_A = np.arange((N*N), dtype=np.int32).reshape(N,N)\n", + "h_B = np.arange((N*N), dtype=np.int32).reshape(N,N)\n", + "h_C = np.zeros((N*N), dtype=np.int32).reshape(N,N)\n", + "\n", + "\n", + "#input data copied to Device\n", + "\n", + "\n", + "\n", + "#set block and grid size and call kernel\n", + "num_of_threads_per_block = (25,25)\n", + "\n", + "\n", + "\n", + "#copy result back to Host\n", + "\n", + "\n", + "\n", + "\n", + "###################### expected output #########################################\n", + "\n", + "#[[ 848610000 848635200 848660400 ... 854204400 854229600 854254800]\n", + "# [ 2124360000 2124435825 2124511650 ... 2141193150 2141268975 2141344800]\n", + "# [ -894857296 -894730846 -894604396 ... -866785396 -866658946 -866532496]\n", + "# ...\n", + "# [ 597268464 608532414 619796364 ... -1197101932 -1185837982 -1174574032]\n", + "# [ 1873018464 1884333039 1895647614 ... 89886818 101201393 112515968]\n", + "# [-1146198832 -1134833632 -1123468432 ... 
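+ "\n",
+ "The rule depicted in figure 5.0 is the usual inner product: each output element is $C_{ij} = \\sum_{k} A_{ik} B_{kj}$, which is exactly what the `for k in range(N)` loop in the kernel below computes.\n",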
1376875568 1388240768 1399605968]]\n", + "################################################################################\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## CUDA Universal Functions (Ufuncs) and Device Function\n", + "\n", + "Ufuncs are NumPy functions used to implement vectorization on ndarray. Vectorization supports broadcasting mechanism that eliminates the use of loop(s) when operating on ndarrays and as a result, execution is faster. Numba implements CUDA Ufuncs using `@vectorize()` decorator as shown in figure 6.0. Please note that the arguments are not limited to just two but depends on the task to solve.\n", + "\n", + "\n", + "
Figure 6.0 Ufuncs definition
\n", + "\n", + "**Example 2**- Addition of two squares: $a^2$+ $b^2$= $(a-b)^2$ + 2ab.\n", + "\n", + "**Solution**:\n", + "```python\n", + "from numba import vectorize\n", + "import numpy as np\n", + "\n", + "@vectorize(['float32(float32, float32)'], target='cuda')\n", + "def additionOfSquares(a, b):\n", + " return (a - b)**2 + (2*a*b)\n", + "\n", + "N = 10000\n", + "# prepare the input\n", + "A = np.arange(N , dtype=np.float32)\n", + "B = np.arange(N, dtype=np.float32)\n", + "\n", + "# calling ufuncs\n", + "C = additionOfSquares(A, B)\n", + "print(C.reshape(100,100)) # print result\n", + "\n", + "#expected result:\n", + "...\n", + "[1.8818000e+08 1.8821880e+08 1.8825760e+08 ... 1.9196242e+08 1.9200160e+08 1.9204080e+08]\n", + "[1.9208000e+08 1.9211920e+08 1.9215840e+08 ... 1.9590122e+08 1.9594080e+08 1.9598040e+08]\n", + "[1.9602000e+08 1.9605960e+08 1.9609920e+08 ... 1.9988002e+08 1.9992000e+08 1.9996000e+08]]\n", + "```\n", + "---\n", + "**Exercise 3**: Write a CUDA ufunc to solve difference of two squares $a^2$- $b^2$ = (a-b)(a+b)\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#write Ufunc here\n", + "\n", + "\n", + "N = 10000\n", + "# prepare the input\n", + "A = np.arange(N , dtype=np.float32)\n", + "B = np.arange(N, dtype=np.float32)\n", + "\n", + "# call Ufunc here\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The CUDA device functions can only be invoked from a Kernel function (not from the Host) and would return value like Python functions. The device function is usually placed before the CUDA ufunc kernel otherwise a call to it may not be visible in the ufunc kernel. The attributes `device=True` and `inline=true` indicates `device_ufunc` as a device function. Solution for addition of two squares is be rewritten as follows:\n", + "\n", + "```python\n", + "from numba import vectorize\n", + "import numpy as np\n", + "\n", + "@cuda.jit('float32(float32,float32)', device=True, inline=True)\n", + "def device_ufunc(a,b):\n", + " return (a - b)**2\n", + "\n", + "@vectorize(['float32(float32, float32)'], target='cuda')\n", + "def additionOfSquares(a, b):\n", + " return device_ufunc(a,b) + (2*a*b)\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Atomic Operation\n", + "\n", + "Atomic operation is required when multiple threads attempt to modify a common portion of the memory. Typical example includes simultaneous withdrawal from a bank account through ATM machines or many threads modifying a particular index of an array based on certain condition(s). In parallel execution, atomic operation helps eliminate race conditions that often occur in share resources. List of presently implemented atomic operations supported in Numba are:\n", + "\n", + " - cuda.atomic.add(array, index, value)\n", + " - cuda.atomic.min(array, index, value)\n", + "\t- cuda.atomic.max(array, index, value)\n", + "\t- cuda.atomic.nanmax(array, index, value)\n", + "\t- cuda.atomic.nanmin(array, index, value)\n", + " - cuda.atomic.compare_and_swap(array, old_value, current_value)\n", + "\t- cuda.atomic.sub(array, index, value)\n", + "\n", + "Complete list can be found here: https://numba.pydata.org/numba-doc/dev/cuda/intrinsics.html#\n", + "\n", + "There are two examples in the cell below that sum elements of an array in parallel. The first uses atomic operation approach and gives correct result while the second does not. 
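+ "\n",
+ "The decorator pattern the figure depicts has the following general shape (a sketch with a made-up function name; the signature string lists the return type followed by the argument types):\n",
+ "\n",
+ "```python\n",
+ "from numba import vectorize\n",
+ "\n",
+ "@vectorize(['float32(float32, float32)'], target='cuda')\n",
+ "def my_ufunc(a, b):\n",
+ "    return a + b # scalar expression, broadcast element-wise over the inputs\n",
+ "```\n",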
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Task ==> sum of an array: [1,2,3,4,5,6,7,8,9,10] in parallel\n", + "# Note that threads are executed randomly\n", + "\n", + "import numba.cuda as cuda\n", + "import numpy as np\n", + "\n", + "# atomic operation example \n", + "size = 10\n", + "nthread = 10\n", + "@cuda.jit()\n", + "def add_atomic(my_array, total):\n", + " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n", + " cuda.atomic.add(total,0, my_array[tid])\n", + "\n", + "my_array = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.int32)\n", + "total = np.zeros(1, dtype=np.int32)\n", + "nblock = int(size / nthread)\n", + "add_atomic[nblock, nthread](my_array, total)\n", + "print(\"Atomic:\", total)\n", + "\n", + "######################################################################################\n", + "# Non-atomic operation example \n", + "size = 10\n", + "nthread = 10\n", + "@cuda.jit()\n", + "def add_non_atomic(my_array, total):\n", + " tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x\n", + " total[0] += my_array[tid]\n", + " \n", + "\n", + "my_array = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.int32)\n", + "total = np.zeros(1, dtype=np.int32)\n", + "nblock = int(size / nthread)\n", + "add_non_atomic[nblock, nthread](my_array, total)\n", + "print(\"Non atomic: \", total)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the atomic operation sample code, `cuda.to_device()` and `copy_to_host()` were purposely not used in order to show that NumPy arrays are visible from Host to Device and vice versa. \n", + "\n", + "---\n", + "**Exercise 4**: Write a Numba program (Host & Device) that counts prime numbers between `1 – 1,000,000`. The serial function is written for you to serve as a guide to writing the corresponding Numba Kernel `count_prime_nos`. 
Part of the code has been written for you; complete the rest in the cell below.\n",
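+ "\n",
+ "A hint for the GPU kernel (one possible sketch, assuming `import numba.cuda as cuda` as in the cell below): let each thread test a single candidate number and atomically increment the counter when it finds a prime, mirroring the serial logic shown below:\n",
+ "\n",
+ "```python\n",
+ "@cuda.jit()\n",
+ "def count_prime_nos(num_list, counter):\n",
+ "    i = cuda.grid(1)\n",
+ "    if i < num_list.shape[0] and num_list[i] > 1:\n",
+ "        for j in range(2, num_list[i]):\n",
+ "            if num_list[i] % j == 0:\n",
+ "                return # composite, contributes nothing\n",
+ "        cuda.atomic.add(counter, 0, 1)\n",
+ "```\n",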

[Serial Code Lab Assignment](serial_RDF.ipynb)
\n", + "\n", + "---\n", + "\n", + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, we recommend you go to your browser's File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied as well. You can also execute the following cell block to create a zip-file of the files you've been working on and download it with the link below.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd ..\n", + "rm -f _files.zip\n", + "zip -r _files.zip *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../_files.zip) and choosing save Link As.\n", + "\n", + "**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n", + "\n", + "---\n", + "#
[HOME](../../../_common/_start_nways_python.ipynb)
\n", + "\n", + "---\n", + "\n", + "\n", + "# Links and Resources\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download the latest version Nsight System from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "---\n", + "\n", + "\n", + "## References\n", + "\n", + "- Numba Documentation, Release 0.52.0-py3.7-linux-x86_64.egg, Anaconda, Nov 30, 2020.\n", + "- Bhaumik Vaidya, Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA, Packt Publishing, 2018.\n", + "- https://docs.nvidia.com/cuda/cuda-c-programming-guide/\n", + "\n", + "\n", + "--- \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/python/jupyter_notebook/numba/serial_RDF.ipynb b/_basic/python/jupyter_notebook/numba/serial_RDF.ipynb new file mode 100644 index 0000000..171b86c --- /dev/null +++ b/_basic/python/jupyter_notebook/numba/serial_RDF.ipynb @@ -0,0 +1,415 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# \n", + "\n", + "\n", + "# Numba Lab 2: HPC Approach with Serial Code\n", + "---\n", + "\n", + "#### [<<--Numba Lab 1](numba_guide.ipynb)\n", + "\n", + "\n", + "## A Recap on RDF\n", + "\n", + "- The radial distribution function (RDF) denoted as g(r) defines the probability of finding a particle at a distance r from another tagged particle. The RDF is strongly dependent on the type of matter so will vary greatly for solids, gases and liquids. You can read more [here](https://en.wikibooks.org/wiki/Molecular_Simulation/Radial_Distribution_Functions).\n", + "- The code complexity of the algorithm is $N^{2}$. \n", + "- The input data for the serial code is fetched from a DCD binary trajectory file.\n", + "\n", + "\n", + "### The Serial Code\n", + "- The cell below consists of two functions, namely **dcdreadhead** and **dcdreadframe**\n", + "- The **dcdreadhead** function computes the total number of frames and atoms from the DCDFile **(input/alk.traj.dcd)**, while the **dcdreadframe** function reads 10 frames and 6720 atoms (note: each frame contains 6720 atoms) using the MDAnalysis library. 
\n", + "- Both functions run on the Host (CPU) and are being called from the function **main()**.\n", + "\n", + "### Cell 1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "import cupy as cp\n", + "import numpy as np\n", + "import math\n", + "import cupy.cuda.nvtx as nvtx\n", + "from MDAnalysis.lib.formats.libdcd import DCDFile\n", + "from timeit import default_timer as timer\n", + "\n", + "\n", + "def dcdreadhead(infile):\n", + " nconf = infile.n_frames\n", + " _infile = infile.header\n", + " numatm = _infile['natoms']\n", + " return numatm, nconf\n", + "\n", + "def dcdreadframe(infile, numatm, nconf):\n", + "\n", + " d_x = np.zeros(numatm * nconf, dtype=np.float64)\n", + " d_y = np.zeros(numatm * nconf, dtype=np.float64)\n", + " d_z = np.zeros(numatm * nconf, dtype=np.float64)\n", + "\n", + " for i in range(nconf):\n", + " data = infile.readframes(i, i+1)\n", + " box = data[1]\n", + " atomset = data[0][0]\n", + " xbox = round(box[0][0], 8)\n", + " ybox = round(box[0][2],8)\n", + " zbox = round(box[0][5], 8)\n", + "\n", + " for row in range(numatm):\n", + " d_x[i * numatm + row] = round(atomset[row][0], 8) # 0 is column\n", + " d_y[i * numatm + row] = round(atomset[row][1], 8) # 1 is column\n", + " d_z[i * numatm + row] = round(atomset[row][2], 8) # 2 is column\n", + "\n", + " return xbox, ybox, zbox, d_x, d_y, d_z" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## pair_gpu function\n", + "\n", + "- The pair_gpu is the function where the main task of the RDF serial implementation is being executed. The function computes differences in xyz DCD frames.\n", + "- The essence of njit(just-in-time) decorator is to get pair_gpu function to compile under no python mode, and this is important for good performance. \n", + "- The decorator **@njit** or **@jit(nopython=True)** ensures that an exception is raised when compilation fails as a way to alert the user that a bug is found within the decorated function. You can read more [here](https://numba.pydata.org/numba-doc/latest/user/performance-tips.html).\n", + "\n", + "### Cell 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from numba import njit\n", + "\n", + "@njit()\n", + "def pair_gpu(d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, d_bin):\n", + " box = min(xbox, ybox)\n", + " box = min(box, zbox)\n", + " _del = box / (2.0 * d_bin)\n", + " cut = box * 0.5\n", + " #print(\"\\n {} {}\".format(nconf, numatm))\n", + "\n", + " for frame in range(nconf):\n", + " #print(\"\\n {}\".format(frame))\n", + " for id1 in range(numatm):\n", + " for id2 in range(numatm):\n", + " dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]\n", + " dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]\n", + " dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2 ]\n", + " dx = dx - xbox * (round(dx / xbox))\n", + " dy = dy - ybox * (round(dy / ybox))\n", + " dz = dz - zbox * (round(dz / zbox))\n", + "\n", + " r = math.sqrt(dx * dx + dy * dy + dz * dz)\n", + " if r < cut :\n", + " ig2 = int((r/_del))\n", + " d_g2[ig2] = d_g2[ig2] + 1\n", + "\n", + " return d_g2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Brief Analysis on Tasks Performed within pair_gpu function\n", + "- The graphic below identifies the various operations executed in the pair_gpu function. 
The function executes three nested loops and relies on non-trivial index arithmetic to address the flattened coordinate arrays.\n",
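+ "\n",
+ "A quick sketch of the flattened, frame-major layout these loops rely on (the names match the code above; this is an illustrative fragment, not a standalone program):\n",
+ "\n",
+ "```python\n",
+ "# frame-major layout: all atoms of frame 0, then all atoms of frame 1, ...\n",
+ "# so the x-coordinate of atom id1 in a given frame lives at:\n",
+ "index = frame * numatm + id1\n",
+ "dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]\n",
+ "```\n",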
gr = g2[i]\n", + " if gr < 1e-5:\n", + " lngr = np.int64(0.0)\n", + " else:\n", + " lngr = math.log(gr)\n", + " if g2[i] < 1e-6:\n", + " lngrbond = np.int64(0.0)\n", + " else:\n", + " lngrbond = math.log(g2[i])\n", + " s2 = s2 - (np.int64(2.0) * pi * rho * ((gr * lngr) - gr + np.int64(1.0)) * _del * r * r)\n", + " s2bond = s2bond - np.int64(2.0) * pi * rho * ((g2[i] * lngrbond) - g2[i] + np.int64(1.0)) * _del * r * r\n", + "\n", + " nvtx.RangePop() # pop for entropy Calculation\n", + " \n", + " #writing s2 and s2bond to file\n", + " stwo.writelines(\"s2 value is {}\\n\".format(s2))\n", + " stwo.writelines(\"s2bond value is {}\".format(s2bond))\n", + " \n", + " # printing s2 and s2bond to jupyter output\n", + " print(\"\\n s2 value is {}\\n\".format(s2))\n", + " print(\"s2bond value is {}\\n\".format(s2bond))\n", + "\n", + " print(\"#Freeing Host memory\")\n", + " del(h_x)\n", + " del(h_y)\n", + " del(h_z)\n", + " del(h_g2)\n", + " print(\"#Number of atoms processed: {} \\n\".format(numatm))\n", + " print(\"#number of confs processed: {} \\n\".format(nconf))\n", + " \n", + "\n", + "if __name__ == \"__main__\":\n", + " #main() " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "### Console Output and Output Files\n", + "\n", + " \n", + " \n", + " \n", + "
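+ "            # guard against log(0): values below threshold contribute zero to the integrand\n",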
\n", + " \n", + "
\n", + "\n", + "\n", + "---\n", + "\n", + "## Lab Task \n", + "\n", + "1. 1. **Run the serial code from cell 1, 2, & 3**.\n", + " - Remove the **\"#\"** behind the **main()** before running the cell 3:\n", + " ```python\n", + " if __name__ == \"__main__\":\n", + " main()\n", + " ```\n", + "2. **Now, lets start modifying the original code to Numba code constructs.**\n", + "> Click on the [Modify](../../source_code/serial/nways_serial.py) link, and modify **nways_serial.py** to `Numba code construct`. Remember to SAVE your code after changes, and then run the cell below. \n", + "> Hints: focus on the **pair_gpu** function and you may need to modify few lines in the **main** function as well.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%run ../../source_code/serial/nways_serial.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The output should be the following:\n", + "\n", + "```\n", + "s2 value is -2.43191\n", + "s2bond value is -3.87014\n", + "```\n", + "\n", + "3. **Profile the code by running the cell bellow** " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd ../../source_code/serial&& nsys profile --stats=true --force-overwrite true -o serial_cpu_rdf python3 nways_serial.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To view the profiler report, you need to download and save the report file by holding down Shift and Right-Clicking [Here](../../source_code/serial/serial_cpu_rdf.nsys-rep) and choosing save Link As Once done, open it via the GUI. A sample expected profile report is given below:\n", + "\n", + "\n", + "\n", + "\n", + "From the profile report, we can see that the pair_gpu function now takes milliseconds to run as compared to the serial version which takes more than 3 seconds as shown [here](../serial/rdf_overview.ipynb). \n", + "\n", + "---\n", + "### [View](../../source_code/numba/numba_rdf.py) or [Run](../../jupyter_notebook/numba/numba_RDF.ipynb) Solution \n", + "--- \n", + "\n", + "## Post-Lab Summary\n", + "\n", + "If you would like to download this lab for later viewing, we recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied as well. You can also execute the following cell block to create a zip-file of the files you've been working on and download it with the link below.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd ..\n", + "rm -f _files.zip\n", + "zip -r _files.zip *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../_files.zip) and choosing save Link As.\n", + "\n", + "---\n", + "\n", + "#

[HOME](../../../_common/_start_nways_python.ipynb)

\n", + "\n", + "---\n", + "\n", + "\n", + "# Links and Resources\n", + "\n", + "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", + "\n", + "[NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", + "\n", + "**NOTE**: To be able to see the Nsight System profiler output, please download the latest version of the Nsight System from [here](https://developer.nvidia.com/nsight-systems).\n", + "\n", + "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", + "\n", + "---\n", + "\n", + "\n", + "## Licensing \n", + "\n", + "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/_basic/python/source_code/cupy/cupy_rdf.py b/_basic/python/source_code/cupy/cupy_rdf.py new file mode 100644 index 0000000..fbdadb7 --- /dev/null +++ b/_basic/python/source_code/cupy/cupy_rdf.py @@ -0,0 +1,210 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +import cupy as cp +import numpy as np +import math +import cupy.cuda.nvtx as nvtx +from MDAnalysis.lib.formats.libdcd import DCDFile +from timeit import default_timer as timer +import os +from pathlib import Path + +#pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed) +#cp.cuda.set_allocator(pool.malloc) + + +def dcdreadhead(infile): + nconf = infile.n_frames + _infile = infile.header + numatm = _infile['natoms'] + return numatm, nconf + +def dcdreadframe(infile, numatm, nconf): + + d_x = np.zeros(numatm * nconf, dtype=np.float64) + d_y = np.zeros(numatm * nconf, dtype=np.float64) + d_z = np.zeros(numatm * nconf, dtype=np.float64) + + for i in range(nconf): + data = infile.readframes(i, i+1) + box = data[1] + atomset = data[0][0] + xbox = round(box[0][0], 8) + ybox = round(box[0][2],8) + zbox = round(box[0][5], 8) + + for row in range(numatm): + d_x[i * numatm + row] = round(atomset[row][0], 8) # 0 is column + d_y[i * numatm + row] = round(atomset[row][1], 8) # 1 is column + d_z[i * numatm + row] = round(atomset[row][2], 8) # 2 is column + + return xbox, ybox, zbox, d_x, d_y, d_z + + +def main(): + start = timer() + ########## Input Details ########### + global xbox, ybox, zbox + inconf = 10 + nbin =np.int32(2000) + xbox = np.float32(0) + ybox =np.float32(0) + zbox = np.float32(0) + + ########use on jupyter notebook####### + fileDir = os.path.dirname(os.path.realpath('__file__')) + dataRoot = Path(fileDir).parents[1] + file = os.path.join(dataRoot, 'source_code/input/alk.traj.dcd') + + ########use on local computer########## + #file = "input/alk.traj.dcd" + ####################################### + infile = DCDFile(file) + pairfile = open("cupy_RDF.dat", "w+") + stwo = open("cupy_Pair_entropy.dat", "w+") + + numatm, nconf = dcdreadhead(infile) + print("Dcd file has {} atoms and {} 
frames".format(numatm, nconf)) + if inconf > nconf: + print("nconf is reset to {}".format(nconf)) + else: + nconf = inconf + print("Calculating RDF for {} frames".format(nconf)) + #numatm = 10 + sizef = nconf * numatm + sizebin = nbin + + ########### reading cordinates ############## + nvtx.RangePush("Read_File") + xbox, ybox, zbox, d_x, d_y, d_z = dcdreadframe(infile, numatm, nconf) + nvtx.RangePop() # pop for reading file + print("Reading of input file is completed") + ############# Stream from Host to Device ######################### + d_x = cp.asarray(d_x) + d_y = cp.asarray(d_y) + d_z = cp.asarray(d_z) + d_g2 = np.zeros(sizebin, dtype=np.int64) + d_g2 = cp.asarray(d_g2) + ############################## RAW KERNEL ################################################# + nthreads = 128; + near2 = nthreads * (int(0.5 * numatm * (numatm - 1) / nthreads) + 1); + nblock = (near2 / nthreads); + print(" Initial blocks are {} and now changing to".format(nblock)) + maxblock = 65535 + blockloop = int(nblock / maxblock) + if blockloop != 0: + nblock = maxblock + print("{} and will run over {} blockloops".format(nblock, blockloop+1)) + + nvtx.RangePush("CuPy_Pair_Circulation") + ################################# + t1 = timer() + for bl in range(blockloop+1): + raw_kernel((nblock,),(nthreads,), (d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, nbin, bl)) ## cupy raw kernel + cp.cuda.Device(0).synchronize() + print("Kernel compute time:", timer() - t1) + d_g2 = cp.asnumpy(d_g2) + nvtx.RangePop() # pop for Pair Calculation + ###################################################################### + pi = math.acos(np.int64(-1.0)) + rho = (numatm) / (xbox * ybox * zbox) + norm = (np.int64(4.0) * pi * rho) / np.int64(3.0) + g2 = np.zeros(nbin, dtype=np.float32) + s2 =np.int64(0.0); s2bond = np.int64(0.0) + lngrbond = np.float32(0.0) + box = min(xbox, ybox) + box = min(box, zbox) + _del =box / (np.int64(2.0) * nbin) + gr = np.float32(0.0) + # loop to calculate entropy + nvtx.RangePush("Entropy_Calculation") + for i in range(nbin): + rl = (i) * _del + ru = rl + _del + nideal = norm * (ru * ru * ru - rl * rl * rl) + g2[i] = d_g2[i] / (nconf * numatm * nideal) + r = (i) * _del + temp = (i + 0.5) * _del + pairfile.write(str(temp) + " " + str(g2[i]) + "\n") + + if r < np.int64(2.0): + gr = np.int64(0.0) + else: + gr = g2[i] + if gr < 1e-5: + lngr = np.int64(0.0) + else: + lngr = math.log(gr) + if g2[i] < 1e-6: + lngrbond = np.int64(0.0) + else: + lngrbond = math.log(g2[i]) + s2 = s2 - (np.int64(2.0) * pi * rho * ((gr * lngr) - gr + np.int64(1.0)) * _del * r * r) + s2bond = s2bond - np.int64(2.0) * pi * rho * ((g2[i] * lngrbond) - g2[i] + np.int64(1.0)) * _del * r * r + + nvtx.RangePop() # pop for entropy Calculation + stwo.writelines("s2 value is {}\n".format(s2)) + stwo.writelines("s2bond value is {}".format(s2bond)) + + print("#Freeing Host memory") + del (d_x) + del (d_y) + del (d_z) + del (d_g2) + print("#Number of atoms processed: {} \n".format(numatm)) + print("#number of confs processed: {} \n".format(nconf)) + total_time = timer() - start + print("total time spent:", total_time) + +################################################################################## + +raw_kernel = cp.RawKernel(r''' +extern "C" +__global__ void cupy_pair_gpu( + const double* d_x, const double* d_y, const double* d_z, + unsigned long long int *d_g2, int numatm, int nconf, + const double xbox,const double ybox,const double zbox,int d_bin, unsigned long long int bl) +{ + double r,cut,dx,dy,dz; + int ig2,id1,id2; + double box; 
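+    # clamp the requested number of frames to what the trajectory actually contains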
+
+    box = min(xbox, ybox);
+    box = min(box, zbox);
+
+    double del = box / (2.0 * d_bin);
+    cut = box * 0.5;
+    int thisi;
+    double n;
+
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    int maxi = min(int(0.5 * numatm * (numatm - 1) - (bl * 65535 * 128)), (65535 * 128));
+
+    if (i < maxi) {
+        thisi = bl * 65535 * 128 + i;
+
+        n = 0.5 * (1 + ((double) sqrt(1.0 + 4.0 * 2.0 * thisi)));
+        id1 = int(n);
+        id2 = thisi - (0.5 * id1 * (id1 - 1));
+
+        for (int frame = 0; frame < nconf; frame++) {
+            int t1 = frame * numatm + id1;
+            int t2 = frame * numatm + id2;
+            dx = d_x[t1] - d_x[t2];
+            dy = d_y[t1] - d_y[t2];
+            dz = d_z[t1] - d_z[t2];
+
+            /* minimum-image convention: wrap into the nearest periodic image */
+            dx = dx - xbox * (round(dx / xbox));
+            dy = dy - ybox * (round(dy / ybox));
+            dz = dz - zbox * (round(dz / zbox));
+
+            r = sqrt(dx * dx + dy * dy + dz * dz);
+            if (r < cut) {
+                ig2 = (int)(r / del);
+                atomicAdd(&d_g2[ig2], 2ULL);
+            }
+        }
+    }
+}
+''', 'cupy_pair_gpu')
+
+
+if __name__ == "__main__":
+    main()
+
diff --git a/_basic/python/source_code/numba/nways_numba.py b/_basic/python/source_code/numba/nways_numba.py
new file mode 100644
--- /dev/null
+++ b/_basic/python/source_code/numba/nways_numba.py
@@ -0,0 +1,160 @@
+# Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+
+import numpy as np
+import math
+import cupy.cuda.nvtx as nvtx
+from numba import cuda
+from MDAnalysis.lib.formats.libdcd import DCDFile
+from timeit import default_timer as timer
+import os
+from pathlib import Path
+
+def dcdreadhead(infile):
+    nconf = infile.n_frames
+    _infile = infile.header
+    numatm = _infile['natoms']
+    return numatm, nconf
+
+def dcdreadframe(infile, numatm, nconf):
+
+    d_x = np.zeros(numatm * nconf, dtype=np.float64)
+    d_y = np.zeros(numatm * nconf, dtype=np.float64)
+    d_z = np.zeros(numatm * nconf, dtype=np.float64)
+
+    for i in range(nconf):
+        data = infile.readframes(i, i+1)
+        box = data[1]
+        atomset = data[0][0]
+        # indices 0, 2, 5 of the 6-entry DCD unit-cell record hold the box edge lengths
+        xbox = round(box[0][0], 8)
+        ybox = round(box[0][2], 8)
+        zbox = round(box[0][5], 8)
+
+        for row in range(numatm):
+            d_x[i * numatm + row] = round(atomset[row][0], 8)  # column 0: x
+            d_y[i * numatm + row] = round(atomset[row][1], 8)  # column 1: y
+            d_z[i * numatm + row] = round(atomset[row][2], 8)  # column 2: z
+
+    return xbox, ybox, zbox, d_x, d_y, d_z
+
+def main():
+    start = timer()
+    ########## Input Details ###########
+    inconf = 10
+    nbin = 2000
+    global xbox, ybox, zbox
+    ######## for jupyter notebook ########################
+    fileDir = os.path.dirname(os.path.realpath('__file__'))
+    dataRoot = Path(fileDir).parents[1]
+    file = os.path.join(dataRoot, 'source_code/input/alk.traj.dcd')
+
+    ######## local computer #############
+    #file = "input/alk.traj.dcd"
+
+    infile = DCDFile(file)
+    pairfile = open("RDF.dat", "w+")
+    stwo = open("Pair_entropy.dat", "w+")
+
+    numatm, nconf = dcdreadhead(infile)
+    print("Dcd file has {} atoms and {} frames".format(numatm, nconf))
+    if inconf > nconf:
+        print("nconf is reset to {}".format(nconf))
+    else:
+        nconf = inconf
+    print("Calculating RDF for {} frames".format(nconf))
+
+    #numatm = 100
+    sizef = nconf * numatm
+    sizebin = nbin
+
+    ########### reading coordinates ##############
+    nvtx.RangePush("Read_File")
+    xbox, ybox, zbox, d_x, d_y, d_z = dcdreadframe(infile, numatm, nconf)
+    nvtx.RangePop() # pop for reading file
+    print("Reading of input file is completed")
+
+    ############################## Numba KERNEL #################################################
+
+    nthreads = 128
+    # round the pair count up to a whole number of thread blocks
+    near2 = nthreads * (int(0.5 * numatm * (numatm - 1) / nthreads) + 1)
+    nblock = near2 // nthreads  # integer division: grid dimensions must be ints
+    print("Initial blocks are {} and now changing to".format(nblock))
+    maxblock = 65535
+    blockloop = nblock // maxblock
+    if blockloop != 0:
+        nblock = maxblock
+    print("{} and will run over {} blockloops".format(nblock, blockloop+1))
+
+    # cp.cuda.runtime.memset(d_g2,0,sizebin)
+    d_g2 = np.zeros(sizebin, dtype=np.int64)
+    d_g2 = cuda.to_device(d_g2) # numba copy to device
+
+    nvtx.RangePush("Pair_Circulation_Numba")
+    t1 = timer()
+    for bl in range(blockloop+1):
+        pair_gpu_kernel[nblock, nthreads](d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, nbin, bl) ## numba jit kernel
+    cuda.synchronize()
+    print("Kernel compute time:", timer() - t1)
+    d_g2 = d_g2.copy_to_host() ## numba copy to host
+    nvtx.RangePop() # pop for Pair Calculation
+
+    pi = math.acos(-1.0)
+    rho = numatm / (xbox * ybox * zbox)
+    norm = (4.0 * pi * rho) / 3.0
+    g2 = np.zeros(nbin, dtype=np.float32)
+    s2 = 0.0
+    s2bond = 0.0
+    box = min(xbox, ybox)
+    box = min(box, zbox)
+    _del = box / (2.0 * nbin)
+    # loop to calculate entropy
+    nvtx.RangePush("Entropy_Calculation")
+    for i in range(nbin):
+        rl = i * _del
+        ru = rl + _del
+        nideal = norm * (ru * ru * ru - rl * rl * rl)  # ideal-gas count in shell [rl, ru)
+        g2[i] = d_g2[i] / (nconf * numatm * nideal)
+        r = i * _del
+        temp = (i + 0.5) * _del
+        pairfile.write(str(temp) + " " + str(g2[i]) + "\n")
+
+        if r < 2.0:
+            gr = 0.0
+        else:
+            gr = g2[i]
+        if gr < 1e-5:
+            lngr = 0.0
+        else:
+            lngr = math.log(gr)
+        if g2[i] < 1e-6:
+            lngrbond = 0.0
+        else:
+            lngrbond = math.log(g2[i])
+        s2 = s2 - (2.0 * pi * rho * ((gr * lngr) - gr + 1.0) * _del * r * r)
+        s2bond = s2bond - 2.0 * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0) * _del * r * r
+
+    nvtx.RangePop() # pop for entropy Calculation
+    stwo.writelines("s2 value is {}\n".format(s2))
+    stwo.writelines("s2bond value is {}".format(s2bond))
+
+    print("#Freeing Host memory")
+    del d_x
+    del d_y
+    del d_z
+    del d_g2
+    print("#Number of atoms processed: {} \n".format(numatm))
+    print("#number of confs processed: {} \n".format(nconf))
+    total_time = timer() - start
+    print("total time spent:", total_time)
+
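+# Launch-configuration note (explanatory; the atom count below is an assumed
+# example): one thread per unordered atom pair, so the grid must cover
+# 0.5*numatm*(numatm-1) threads. near2 rounds that count up to a multiple of
+# nthreads = 128, the grid is capped at maxblock = 65535 blocks, and blockloop
+# relaunches the kernel with offset `bl` until all pairs are covered.
+# E.g. numatm = 8192: pairs = 0.5*8192*8191 = 33,550,336;
+# nblock = int(pairs/128) + 1 = 262,113 -> capped at 65,535;
+# blockloop = int(262113/65535) = 3 -> 4 launches (bl = 0..3).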
+##--------------------------NUMBA KERNEL---------------------------------------------------------##
+@cuda.jit
+def pair_gpu_kernel(d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, d_bin, bl):
+    box = min(xbox, ybox)
+    box = min(box, zbox)
+    _del = box / (2.0 * d_bin)
+    cut = box * 0.5
+    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
+    maxi = min(int(0.5 * numatm * (numatm - 1) - (bl * 65535 * 128)), (65535 * 128))
+
+    if i < maxi:
+        thisi = bl * 65535 * 128 + i
+        # decode the flat pair index into (id1, id2); see the note in the CuPy kernel
+        n = 0.5 * (1 + math.sqrt(1.0 + 4.0 * 2.0 * thisi))
+        id1 = int(n)
+        id2 = thisi - (0.5 * id1 * (id1 - 1))
+        for frame in range(0, nconf):
+            t1 = int(frame * numatm + id1)
+            t2 = int(frame * numatm + id2)
+            dx = d_x[t1] - d_x[t2]
+            dy = d_y[t1] - d_y[t2]
+            dz = d_z[t1] - d_z[t2]
+            # minimum-image convention: wrap into the nearest periodic image
+            dx = dx - xbox * (round(dx / xbox))
+            dy = dy - ybox * (round(dy / ybox))
+            dz = dz - zbox * (round(dz / zbox))
+
+            r = math.sqrt(dx * dx + dy * dy + dz * dz)
+            if r < cut:
+                ig2 = int(r / _del)
+                cuda.atomic.add(d_g2, ig2, 2)
+
+##--------------------------END NUMBA KERNEL---------------------------------------------------------##
+
+
+if __name__ == "__main__":
+    main()
+
diff --git a/_basic/python/source_code/serial/nways_serial.py b/_basic/python/source_code/serial/nways_serial.py
new file mode 100644
index 0000000..a5a05b6
--- /dev/null
+++ b/_basic/python/source_code/serial/nways_serial.py
@@ -0,0 +1,160 @@
+# Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+
+import numpy as np
+import math
+import cupy.cuda.nvtx as nvtx
+from MDAnalysis.lib.formats.libdcd import DCDFile
+from timeit import default_timer as timer
+from numba import njit
+import os
+from pathlib import Path
+
+def dcdreadhead(infile):
+    nconf = infile.n_frames
+    _infile = infile.header
+    numatm = _infile['natoms']
+    return numatm, nconf
+
+def dcdreadframe(infile, numatm, nconf):
+
+    d_x = np.zeros(numatm * nconf, dtype=np.float64)
+    d_y = np.zeros(numatm * nconf, dtype=np.float64)
+    d_z = np.zeros(numatm * nconf, dtype=np.float64)
+
+    for i in range(nconf):
+        data = infile.readframes(i, i+1)
+        box = data[1]
+        atomset = data[0][0]
+        # indices 0, 2, 5 of the 6-entry DCD unit-cell record hold the box edge lengths
+        xbox = round(box[0][0], 8)
+        ybox = round(box[0][2], 8)
+        zbox = round(box[0][5], 8)
+
+        for row in range(numatm):
+            d_x[i * numatm + row] = round(atomset[row][0], 8)  # column 0: x
+            d_y[i * numatm + row] = round(atomset[row][1], 8)  # column 1: y
+            d_z[i * numatm + row] = round(atomset[row][2], 8)  # column 2: z
+
+    return xbox, ybox, zbox, d_x, d_y, d_z
+
+def main():
+    start = timer()
+    ########## Input Details ###########
+    inconf = 10
+    nbin = 2000
+    global xbox, ybox, zbox
+    ######## for jupyter notebook ########################
+    fileDir = os.path.dirname(os.path.realpath('__file__'))
+    dataRoot = Path(fileDir).parents[1]
+    file = os.path.join(dataRoot, 'source_code/input/alk.traj.dcd')
+
+    ######## local computer #############
+    #file = "input/alk.traj.dcd"
+
+    infile = DCDFile(file)
+    pairfile = open("RDF.dat", "w+")
+    stwo = open("Pair_entropy.dat", "w+")
+
+    numatm, nconf = dcdreadhead(infile)
+    print("Dcd file has {} atoms and {} frames".format(numatm, nconf))
+    if inconf > nconf:
+        print("nconf is reset to {}".format(nconf))
+    else:
+        nconf = inconf
+    print("Calculating RDF for {} frames".format(nconf))
+    #numatm = 50
+    sizef = nconf * numatm
+    sizebin = nbin
+    ########### reading coordinates ##############
+    nvtx.RangePush("Read_File")
+    xbox, ybox, zbox, h_x, h_y, h_z = dcdreadframe(infile, numatm, nconf)
+    nvtx.RangePop() # pop for reading file
+
+    h_g2 = np.zeros(sizebin, dtype=np.longlong)
+    print("Reading of input file is completed")
+    ############# This is where we will concentrate #########################
+    nvtx.RangePush("Pair_Circulation")
+    h_g2 = pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin)
+    nvtx.RangePop() # pop for Pair Calculation
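+    # Explanatory note: the loop below discretizes the two-body excess entropy
+    # integral s2 = -2*pi*rho * Int_0^{box/2} ( g(r) ln g(r) - g(r) + 1 ) r^2 dr
+    # with shell width _del = box/(2*nbin), after normalizing the raw pair
+    # histogram h_g2 by the ideal-gas shell count nideal.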
+    ######################################################################
+    pi = math.acos(-1.0)
+    rho = numatm / (xbox * ybox * zbox)
+    norm = (4.0 * pi * rho) / 3.0
+    g2 = np.zeros(nbin, dtype=np.float32)
+    s2 = 0.0
+    s2bond = 0.0
+    box = min(xbox, ybox)
+    box = min(box, zbox)
+    _del = box / (2.0 * nbin)
+    # loop to calculate entropy
+    nvtx.RangePush("Entropy_Calculation")
+    for i in range(nbin):
+        rl = i * _del
+        ru = rl + _del
+        nideal = norm * (ru * ru * ru - rl * rl * rl)  # ideal-gas count in shell [rl, ru)
+        g2[i] = h_g2[i] / (nconf * numatm * nideal)
+        r = i * _del
+        temp = (i + 0.5) * _del
+        pairfile.write(str(temp) + " " + str(g2[i]) + "\n")
+
+        if r < 2.0:
+            gr = 0.0
+        else:
+            gr = g2[i]
+        if gr < 1e-5:
+            lngr = 0.0
+        else:
+            lngr = math.log(gr)
+        if g2[i] < 1e-6:
+            lngrbond = 0.0
+        else:
+            lngrbond = math.log(g2[i])
+        s2 = s2 - (2.0 * pi * rho * ((gr * lngr) - gr + 1.0) * _del * r * r)
+        s2bond = s2bond - 2.0 * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0) * _del * r * r
+
+    nvtx.RangePop() # pop for entropy Calculation
+    stwo.writelines("s2 value is {}\n".format(s2))
+    stwo.writelines("s2bond value is {}".format(s2bond))
+
+    print("\n s2 value is {}\n".format(s2))
+    print("s2bond value is {}\n".format(s2bond))
+    print("#Freeing Host memory")
+    del h_x
+    del h_y
+    del h_z
+    del h_g2
+    print("#Number of atoms processed: {} \n".format(numatm))
+    print("#number of confs processed: {} \n".format(nconf))
+    total_time = timer() - start
+    print("total time spent:", total_time)
+
+@njit()
+def pair_gpu(d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, d_bin):
+    box = min(xbox, ybox)
+    box = min(box, zbox)
+    _del = box / (2.0 * d_bin)
+    cut = box * 0.5
+    #print("\n {} {}".format(nconf, numatm))
+
+    # all ordered pairs are visited, so each unordered pair is counted twice
+    for frame in range(nconf):
+        #print("\n {}".format(frame))
+        for id1 in range(numatm):
+            for id2 in range(numatm):
+                dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]
+                dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]
+                dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]
+                # minimum-image convention: wrap into the nearest periodic image
+                dx = dx - xbox * (round(dx / xbox))
+                dy = dy - ybox * (round(dy / ybox))
+                dz = dz - zbox * (round(dz / zbox))
+
+                r = math.sqrt(dx * dx + dy * dy + dz * dz)
+                if r < cut:
+                    ig2 = int(r / _del)
+                    d_g2[ig2] = d_g2[ig2] + 1
+
+    return d_g2
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/_basic/python/source_code/serial/nways_serial_overview.py b/_basic/python/source_code/serial/nways_serial_overview.py
new file mode 100644
index 0000000..4359f55
--- /dev/null
+++ b/_basic/python/source_code/serial/nways_serial_overview.py
@@ -0,0 +1,160 @@
+# Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
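+# NOTE: this overview variant appears to be identical to nways_serial.py above,
+# except for the dataset path in main(), which is resolved relative to a
+# different directory level ('python/source_code/input/...').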
+
+import numpy as np
+import math
+import cupy.cuda.nvtx as nvtx
+from MDAnalysis.lib.formats.libdcd import DCDFile
+from timeit import default_timer as timer
+from numba import njit
+import os
+from pathlib import Path
+
+def dcdreadhead(infile):
+    nconf = infile.n_frames
+    _infile = infile.header
+    numatm = _infile['natoms']
+    return numatm, nconf
+
+def dcdreadframe(infile, numatm, nconf):
+
+    d_x = np.zeros(numatm * nconf, dtype=np.float64)
+    d_y = np.zeros(numatm * nconf, dtype=np.float64)
+    d_z = np.zeros(numatm * nconf, dtype=np.float64)
+
+    for i in range(nconf):
+        data = infile.readframes(i, i+1)
+        box = data[1]
+        atomset = data[0][0]
+        # indices 0, 2, 5 of the 6-entry DCD unit-cell record hold the box edge lengths
+        xbox = round(box[0][0], 8)
+        ybox = round(box[0][2], 8)
+        zbox = round(box[0][5], 8)
+
+        for row in range(numatm):
+            d_x[i * numatm + row] = round(atomset[row][0], 8)  # column 0: x
+            d_y[i * numatm + row] = round(atomset[row][1], 8)  # column 1: y
+            d_z[i * numatm + row] = round(atomset[row][2], 8)  # column 2: z
+
+    return xbox, ybox, zbox, d_x, d_y, d_z
+
+def main():
+    start = timer()
+    ########## Input Details ###########
+    inconf = 10
+    nbin = 2000
+    global xbox, ybox, zbox
+    ######## for jupyter notebook ########################
+    fileDir = os.path.dirname(os.path.realpath('__file__'))
+    dataRoot = Path(fileDir).parents[1]
+    file = os.path.join(dataRoot, 'python/source_code/input/alk.traj.dcd')
+
+    ######## local computer #############
+    #file = "input/alk.traj.dcd"
+
+    infile = DCDFile(file)
+    pairfile = open("RDF.dat", "w+")
+    stwo = open("Pair_entropy.dat", "w+")
+
+    numatm, nconf = dcdreadhead(infile)
+    print("Dcd file has {} atoms and {} frames".format(numatm, nconf))
+    if inconf > nconf:
+        print("nconf is reset to {}".format(nconf))
+    else:
+        nconf = inconf
+    print("Calculating RDF for {} frames".format(nconf))
+    #numatm = 50
+    sizef = nconf * numatm
+    sizebin = nbin
+    ########### reading coordinates ##############
+    nvtx.RangePush("Read_File")
+    xbox, ybox, zbox, h_x, h_y, h_z = dcdreadframe(infile, numatm, nconf)
+    nvtx.RangePop() # pop for reading file
+
+    h_g2 = np.zeros(sizebin, dtype=np.longlong)
+    print("Reading of input file is completed")
+    ############# This is where we will concentrate #########################
+    nvtx.RangePush("Pair_Circulation")
+    h_g2 = pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin)
+    nvtx.RangePop() # pop for Pair Calculation
+    ######################################################################
+    pi = math.acos(-1.0)
+    rho = numatm / (xbox * ybox * zbox)
+    norm = (4.0 * pi * rho) / 3.0
+    g2 = np.zeros(nbin, dtype=np.float32)
+    s2 = 0.0
+    s2bond = 0.0
+    box = min(xbox, ybox)
+    box = min(box, zbox)
+    _del = box / (2.0 * nbin)
+    # loop to calculate entropy
+    nvtx.RangePush("Entropy_Calculation")
+    for i in range(nbin):
+        rl = i * _del
+        ru = rl + _del
+        nideal = norm * (ru * ru * ru - rl * rl * rl)  # ideal-gas count in shell [rl, ru)
+        g2[i] = h_g2[i] / (nconf * numatm * nideal)
+        r = i * _del
+        temp = (i + 0.5) * _del
+        pairfile.write(str(temp) + " " + str(g2[i]) + "\n")
+
+        if r < 2.0:
+            gr = 0.0
+        else:
+            gr = g2[i]
+        if gr < 1e-5:
+            lngr = 0.0
+        else:
+            lngr = math.log(gr)
+        if g2[i] < 1e-6:
+            lngrbond = 0.0
+        else:
+            lngrbond = math.log(g2[i])
+        s2 = s2 - (2.0 * pi * rho * ((gr * lngr) - gr + 1.0) * _del * r * r)
+        s2bond = s2bond - 2.0 * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0) * _del * r * r
+
+    nvtx.RangePop() # pop for entropy Calculation
+    stwo.writelines("s2 value is {}\n".format(s2))
+    stwo.writelines("s2bond value is {}".format(s2bond))
+
+    print("\n s2 value is {}\n".format(s2))
+    print("s2bond value is {}\n".format(s2bond))
+    print("#Freeing Host memory")
+    del h_x
+    del h_y
+    del h_z
+    del h_g2
+    print("#Number of atoms processed: {} \n".format(numatm))
+    print("#number of confs processed: {} \n".format(nconf))
+    total_time = timer() - start
+    print("total time spent:", total_time)
+
+@njit()
+def pair_gpu(d_x, d_y, d_z, d_g2, numatm, nconf, xbox, ybox, zbox, d_bin):
+    box = min(xbox, ybox)
+    box = min(box, zbox)
+    _del = box / (2.0 * d_bin)
+    cut = box * 0.5
+    #print("\n {} {}".format(nconf, numatm))
+
+    # all ordered pairs are visited, so each unordered pair is counted twice
+    for frame in range(nconf):
+        #print("\n {}".format(frame))
+        for id1 in range(numatm):
+            for id2 in range(numatm):
+                dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2]
+                dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2]
+                dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2]
+                # minimum-image convention: wrap into the nearest periodic image
+                dx = dx - xbox * (round(dx / xbox))
+                dy = dy - ybox * (round(dy / ybox))
+                dz = dz - zbox * (round(dz / zbox))
+
+                r = math.sqrt(dx * dx + dy * dy + dz * dz)
+                if r < cut:
+                    ig2 = int(r / _del)
+                    d_g2[ig2] = d_g2[ig2] + 1
+
+    return d_g2
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/_scripts/cuda_Singularity b/_scripts/cuda_Singularity
new file mode 100644
index 0000000..7caee05
--- /dev/null
+++ b/_scripts/cuda_Singularity
@@ -0,0 +1,52 @@
+# Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+
+Bootstrap: docker
+FROM: nvcr.io/nvidia/nvhpc:22.11-devel-cuda_multi-ubuntu20.04
+
+%environment
+    export XDG_RUNTIME_DIR=
+    export PATH="$PATH:/usr/local/bin:/opt/anaconda3/bin:/usr/bin"
+
+%post
+    build_tmp=$(mktemp -d) && cd ${build_tmp}
+
+    apt-get -y update
+    apt-get -y dist-upgrade
+    DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends \
+        m4 vim-nox emacs-nox nano zip \
+        python3-pip python3-setuptools git-core inotify-tools \
+        curl git-lfs \
+        build-essential libtbb-dev
+    rm -rf /var/lib/apt/cache/*
+
+    pip3 install --upgrade pip
+    pip3 install gdown
+    apt-get update -y
+    apt-get -y install git nvidia-modprobe
+    pip3 install jupyterlab
+    pip3 install ipywidgets
+
+    apt-get install --no-install-recommends -y build-essential
+
+    python3 /labs/_common/dataset.py
+
+    apt-get update -y
+    apt-get install --no-install-recommends -y build-essential
+
+    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/anaconda3
+    rm Miniconda3-latest-Linux-x86_64.sh
+
+    cd /
+    rm -rf ${build_tmp}
+
+%files
+    ../_basic/_common/ /labs/_common
+    ../_basic/cuda/ /labs/cuda
+    ../_basic/_start_nways.ipynb /labs
+
+%runscript
+    "$@"
+
+%labels
+    AUTHOR mozhgank
diff --git a/_scripts/nways_Dockerfile b/_scripts/nways_Dockerfile
new file mode 100644
index 0000000..cf0dfa1
--- /dev/null
+++ b/_scripts/nways_Dockerfile
@@ -0,0 +1,40 @@
+# Copyright (c) 2021 NVIDIA Corporation. All rights reserved.
+
+# To build the docker container, run: $ sudo docker build -f nways_Dockerfile -t nways:c .
+# To run: $ sudo docker run --rm -it --runtime nvidia -p 8888:8888 nways:c +# Finally, open http://localhost:8888/ + +FROM nvcr.io/nvidia/nvhpc:22.7-devel-cuda_multi-ubuntu20.04 + +RUN apt-get -y update && \ + DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends python3-pip python3-setuptools nginx zip make build-essential libtbb-dev && \ + rm -rf /var/lib/apt/lists/* && \ + pip3 install --upgrade pip &&\ + pip3 install gdown + +RUN apt-get update -y +RUN apt-get install -y git nvidia-modprobe +RUN pip3 install jupyterlab +# Install required python packages +RUN pip3 install ipywidgets + +############################################ +RUN apt-get update -y + +# TO COPY the data +COPY openacc/ /labs/openacc +COPY openmp/ /labs/openmp +COPY _common/ /labs/_common +COPY iso/ /labs/iso +COPY cuda/ /labs/cuda +COPY _start_nways.ipynb /labs + +RUN python3 /labs/_common/dataset.py + +################################################# +ENV PATH="/usr/local/bin:/opt/anaconda3/bin:/usr/bin:$PATH" +################################################# + +#ADD nways_labs/ /labs +WORKDIR /labs +CMD jupyter-lab --no-browser --allow-root --ip=0.0.0.0 --port=8888 --NotebookApp.token="" --notebook-dir=/labs diff --git a/_scripts/nways_Dockerfile_python b/_scripts/nways_Dockerfile_python new file mode 100644 index 0000000..540acdf --- /dev/null +++ b/_scripts/nways_Dockerfile_python @@ -0,0 +1,65 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +# To build the docker container, run: $ sudo docker build -f nways_Dockerfile_python -t nways:p . +# To run: $ sudo docker run --rm -it --runtime nvidia -p 8888:8888 nways:p +# Finally, open http://localhost:8888/ + +#FROM nvidia/cuda:11.2.2-devel-ubuntu20.04 +FROM nvidia/cuda:11.4.2-devel-ubuntu20.04 + +##### +# Read https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772 +RUN apt-key del 7fa2af80 +RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub +##### + +RUN apt-get -y update && \ + DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends \ + python3-dev \ + python3-pip python3-setuptools nginx zip make build-essential libtbb-dev && \ + rm -rf /var/lib/apt/lists/* + +RUN pip3 install --no-cache-dir -U install setuptools pip +RUN pip3 install gdown +RUN apt-get update -y +RUN apt-get install -y git nvidia-modprobe +# Install required python packages +RUN pip3 install jupyterlab +RUN pip3 install ipywidgets +#RUN pip3 install --upgrade numpy==1.19.5 +RUN pip3 install --upgrade numpy==1.21.1 +#RUN pip3 install --no-cache-dir "cupy-cuda112==9.0.0" \ +RUN pip3 install --no-cache-dir "cupy-cuda114==10.3.1" \ + numba==0.53.1 scipy + +############################################ +# NVIDIA nsight-systems-cli-2022.1.1, nsight-compute-2022.1.1 +RUN apt-get update -y && \ + DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ + apt-transport-https \ + ca-certificates \ + gnupg \ + wget && \ + #apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys F60F4B3D7FA2AF80 && \ + wget -qO - https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/nvidia.pub | apt-key add - &&\ + echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" >> /etc/apt/sources.list.d/nsight.list &&\ + apt-get update -y + +RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends nsight-systems-cli-2022.1.1 nsight-compute-2022.1.1 + +# TO COPY the data +COPY 
python/ /labs/python +COPY _common/ /labs/_common +COPY _start_nways.ipynb /labs + +RUN python3 /labs/_common/dataset_python.py + +################################################# +ENV LD_LIBRARY_PATH="/usr/local/lib:/usr/local/lib/python3.8/dist-packages:/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" +ENV PATH="/opt/nvidia/nsight-systems/2022.1.1/bin:/opt/nvidia/nsight-compute/2022.1.1:/usr/local/bin:/bin:/usr/local/cuda/bin:/usr/bin${PATH:+:${PATH}}" + +RUN pip3 install --no-cache-dir MDAnalysis + +#ADD nways_labs/ /labs +WORKDIR /labs +CMD jupyter-lab --no-browser --allow-root --ip=0.0.0.0 --port=8888 --NotebookApp.token="" --notebook-dir=/labs diff --git a/_scripts/nways_Singularity b/_scripts/nways_Singularity new file mode 100644 index 0000000..b4ddb5f --- /dev/null +++ b/_scripts/nways_Singularity @@ -0,0 +1,60 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. + +# To build the singularity container, run: $ singularity build --fakeroot nways_c.simg nways_Singularity +# To copy the content of the container: $ singularity run nways_c.simg cp -rT /labs ~/labs +# To run: $ singularity run --nv nways_c.simg jupyter-lab --notebook-dir=~/labs +# Finally, open http://localhost:8888/ + +Bootstrap: docker +FROM: nvcr.io/nvidia/nvhpc:22.7-devel-cuda_multi-ubuntu20.04 + +%environment + export XDG_RUNTIME_DIR= + export PATH="$PATH:/usr/local/bin:/opt/anaconda3/bin:/usr/bin" + +%post + build_tmp=$(mktemp -d) && cd ${build_tmp} + + apt-get -y update + apt-get -y dist-upgrade + DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends \ + m4 vim-nox emacs-nox nano zip\ + python3-pip python3-setuptools git-core inotify-tools \ + curl git-lfs \ + build-essential libtbb-dev + rm -rf /var/lib/apt/cache/* + + pip3 install --upgrade pip + pip3 install gdown + apt-get update -y + apt-get -y install git nvidia-modprobe + pip3 install jupyterlab + pip3 install ipywidgets + + apt-get install --no-install-recommends -y build-essential + + python3 /labs/_common/dataset.py + + apt-get update -y + apt-get install --no-install-recommends -y build-essential + + wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh + bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/anaconda3 + rm Miniconda3-latest-Linux-x86_64.sh + + cd / + rm -rf ${build_tmp} + +%files + ../_basic/openacc/ /labs/openacc + ../_basic/openmp/ /labs/openmp + ../_basic/_common/ /labs/_common + ../_basic/iso/ /labs/iso + ../_basic/cuda/ /labs/cuda + ../_basic/_start_nways.ipynb /labs + +%runscript + "$@" + +%labels + AUTHOR mozhgank diff --git a/_scripts/nways_Singularity_python b/_scripts/nways_Singularity_python new file mode 100644 index 0000000..a5df837 --- /dev/null +++ b/_scripts/nways_Singularity_python @@ -0,0 +1,86 @@ +# Copyright (c) 2021 NVIDIA Corporation. All rights reserved. 
+
+# To build the singularity container, run: $ singularity build --fakeroot nways_p.simg nways_Singularity_python
+# To copy the content of the container: $ singularity run nways_p.simg cp -rT /labs ~/labs
+# To run: $ singularity run --nv nways_p.simg jupyter-lab --notebook-dir=~/labs
+# Finally, open http://localhost:8888/
+
+Bootstrap: docker
+#FROM: nvidia/cuda:11.2.2-devel-ubuntu20.04
+FROM: nvidia/cuda:11.4.2-devel-ubuntu20.04
+
+%environment
+    export XDG_RUNTIME_DIR=
+    export PATH="$PATH:/usr/local/bin:/usr/bin"
+    export PATH="/opt/nvidia/nsight-systems/2022.1.1/bin:/opt/nvidia/nsight-compute/2022.1.1:/bin:/usr/local/cuda/bin:$PATH"
+    export LD_LIBRARY_PATH="/usr/include/python3.8:/usr/local/lib:/usr/local/lib/python3.8/dist-packages:/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
+
+
+%post
+    build_tmp=$(mktemp -d) && cd ${build_tmp}
+
+#####
+# Read https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212772
+    apt-key del 7fa2af80
+    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
+#####
+
+    apt-get -y update
+    apt-get -y dist-upgrade
+    DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends python3-dev \
+        m4 vim-nox emacs-nox nano zip \
+        python3-pip python3-setuptools nginx zip make build-essential libtbb-dev
+    rm -rf /var/lib/apt/cache/*
+
+    pip3 install --no-cache-dir -U install setuptools pip
+    apt-get -y update
+    apt-get -y install git nvidia-modprobe
+    pip3 install 'chardet>=3.0.2,<3.1.0' 'idna>=2.5,<2.8' 'urllib3>=1.21.1,<1.24' 'certifi>=2017.4.17'
+    pip3 install jupyterlab
+    pip3 install ipywidgets
+    pip3 install gdown
+    pip3 install --upgrade numpy==1.21.1
+    # pip3 install --upgrade numpy==1.19.5
+    #pip3 install --no-cache-dir "cupy-cuda112==9.0.0" \
+    pip3 install --no-cache-dir "cupy-cuda114==10.3.1" \
+        numba==0.53.1 scipy
+
+    #apt-get install --no-install-recommends -y build-essential
+
+    python3 /labs/_common/dataset_python.py
+
+    touch /labs/python/jupyter_notebook/cupy/RDF.dat
+    touch /labs/python/jupyter_notebook/cupy/Pair_entropy.dat
+
+# NVIDIA nsight-systems-cli-2022.1.1, nsight-compute-2022.1.1
+    apt-get update -y
+    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends apt-transport-https ca-certificates gnupg wget
+    # apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys F60F4B3D7FA2AF80
+    wget -qO - https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/nvidia.pub | apt-key add -
+    echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" >> /etc/apt/sources.list.d/nsight.list
+    apt-get update -y
+    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends nsight-systems-cli-2022.1.1 nsight-compute-2022.1.1
+    #rm -rf /var/lib/apt/lists/*
+
+    apt-get install --no-install-recommends -y build-essential
+
+    pip3 install --no-cache-dir MDAnalysis
+
+    chmod -R 777 /labs/python/jupyter_notebook/cupy/RDF.dat
+    chmod -R 777 /labs/python/jupyter_notebook/cupy/Pair_entropy.dat
+
+    cd /
+    rm -rf ${build_tmp}
+
+%files
+    ../_basic/python/ /labs/python
+    ../_basic/_common/ /labs/_common
+    ../_basic/_start_nways.ipynb /labs
+
+%runscript
+    "$@"
+
+%labels
+    AUTHOR Tosin, Mozhgan