Venado optimizations #297

Merged: 75 commits, Feb 27, 2025
9d12c58
Add gpmdk for the Venado hackathon
mewall Jul 2, 2024
f2a3c08
Add build script for hackathon
mewall Jul 2, 2024
3efdf37
Update bml submodule
mewall Jul 2, 2024
8c1bcab
Update bml submodule
mewall Jul 3, 2024
48b8022
Add electrons.dat to latteTBparams
mewall Jul 3, 2024
05314a5
Add MPI barrier to pin down performance issue
mewall Jul 5, 2024
ff1bd08
Add venado build, env, and run scripts
mewall Jul 5, 2024
7525e35
Bug fix
mewall Jul 5, 2024
17ad747
Add TrpCage example for gpmdk
mewall Jul 5, 2024
639473f
Fixed line truncations
Jul 8, 2024
61a5be8
Added sedacs partition and field induced forces
Jul 17, 2024
a61c7a6
Added main for field
Jul 17, 2024
f62b952
NVTX tags
mewall Jul 15, 2024
c1aaa69
Introducing nvtx tags and some optimizations
mewall Jul 17, 2024
e3f9a44
Debug resizing and resize two more arrays
mewall Jul 19, 2024
2c63bed
Resize zqt array and use new bml_transpose_inplace to avoid allocation
mewall Jul 19, 2024
a044282
Move response into gpmdk in preparation for GPU kernel
mewall Jul 19, 2024
dfeb3a1
Add some useful build scripts
mewall Jul 19, 2024
1aebd59
Update bml submodule
mewall Jul 19, 2024
15594cf
Reduce allocations in hcsf method
mewall Jul 19, 2024
11695ec
Update build script
mewall Jul 19, 2024
f1d152c
Modifications to support the new bml_transpose Fortran API
mewall Jul 22, 2024
6195172
Add nvtx tags for charges and thread charge calculation
mewall Jul 24, 2024
2c72a40
Add nvtx tag methods to gpmdk
mewall Jul 24, 2024
2c7afd7
Debug cray build and work on omp offload
mewall Jul 25, 2024
0ae84e1
Working omp offload for gpmdcov_response
mewall Jul 29, 2024
468bca1
Bug fix
mewall Jul 30, 2024
07f9dcb
Allocate smaller array for work in kernel
mewall Jul 30, 2024
5edf99b
Code decorations to investigate MPI imbalance
mewall Aug 6, 2024
b941d17
Build updates
mewall Aug 7, 2024
77f9065
Beginning update of graph MPI
mewall Aug 7, 2024
ee06263
Preliminary low-communication graph update introduced. Not tested.
mewall Aug 8, 2024
df07623
Working subgraph graph update method with less MPI communication
mewall Aug 8, 2024
a4a61d9
Added new script
Aug 8, 2024
6f08dcd
added bch dslp script
Aug 8, 2024
769d607
Attempt to fix the hackathon branch
mewall Aug 12, 2024
6a2cb5f
Eliminate debug output of FORCESS
mewall Aug 12, 2024
6722bda
Update bml submodule
mewall Aug 13, 2024
87cf0c8
Update bml submodule and move build scripts to scripts/ dir
mewall Aug 13, 2024
63828d5
breaking lines
Aug 22, 2024
52e7fd6
fixing gpmd.py
Aug 22, 2024
23c3af7
Add more nvtx tags
mewall Aug 23, 2024
69d937a
Added err_var
Aug 26, 2024
a4dc3a2
Workaround for Cray matmul bug in gpmdk
mewall Sep 6, 2024
26f34d8
Report max time and rank for dH+dS
mewall Sep 9, 2024
e2b232c
Fix kernel bug. Better matmul fix.
mewall Sep 9, 2024
fde467f
Clean up comments and begin working on offload of get_dH_or_dS_vect
mewall Sep 10, 2024
59a45c5
Added protections against double alloc
Sep 11, 2024
49154ca
Prepare for OMP offload optimization
mewall Sep 12, 2024
9fff9ab
Fix bug in graph update
mewall Sep 13, 2024
0315f8c
Use MAGMA pointer in offload response kernel
mewall Sep 17, 2024
9b5b4bd
Working on openACC
mewall Oct 29, 2024
829eb87
Working offload of response using magma pointer
mewall Oct 31, 2024
578dec1
fixing compilation bug gfortran
Nov 6, 2024
bdc7776
Added voltage option
Nov 21, 2024
7779af8
Added dos
Nov 22, 2024
c5dc023
changes for coarse MD
Dec 4, 2024
7b2ece1
Update bml submodule
mewall Dec 15, 2024
bfd58e4
Added formatting statements to prg_system_mod XYZ trajectory writing …
rcorrigan Jan 8, 2025
8a35116
Opts for nvidia build
mewall Nov 18, 2024
89f9eaa
NVIDIA opt get_hsmat
mewall Nov 19, 2024
305dbc9
NVIDIA opt hsderivative
mewall Nov 20, 2024
cb20d3d
OpenACC accelerated nonorthocoul
mewall Nov 23, 2024
33cf90a
Add variables to select MD steps for nsys profiling
mewall Jan 14, 2025
0933417
Disable bisection in gpmd. Nonorthocoul opts.
mewall Jan 14, 2025
03ef662
Fix merge error
mewall Jan 23, 2025
aa9f795
More NVTX tags
mewall Jan 31, 2025
06f07f5
dH optimization
mewall Jan 31, 2025
dc22aa2
Bug fix
mewall Jan 31, 2025
b73ceec
Remove LANL-specific build script
mewall Feb 6, 2025
d21ba4c
Debugging
mewall Feb 13, 2025
26114df
Logistic SMD
mewall Feb 26, 2025
90b8ffb
Fix some git merge issues with examples in gpmdk
mewall Feb 27, 2025
f2b430b
Update bml submodule
mewall Feb 27, 2025
0e29867
Update .gitmodules
mewall Feb 27, 2025
Working on openACC
mewall committed Jan 14, 2025
commit 9b5b4bd1fa6575d2c7929bba83cc067ff20845a6
32 changes: 16 additions & 16 deletions examples/gpmdk/src/gpmdcov_energandforces.F90
@@ -82,27 +82,29 @@ subroutine get_SKBlock_vect_local(sp,refcoord,coord,lattice_vectors&
  integer, intent(in) :: norbs(:), sp(:), atnum
  real(dp), allocatable :: HPPP(:), HPPS(:), HSPS(:), HSSS(:), PPSMPP(:)
  real(dp), allocatable :: L(:), M(:), N(:)
- real(dp), allocatable, save :: dr(:), rab(:,:)
- real(dp), allocatable :: dr_m(:)
+ real(dp), allocatable :: dr(:), dr_m(:), rab(:,:)
  real(dp), allocatable :: onsites_m(:)
  real(dp), intent(inout) :: blk_out(:,:)
- real(dp), allocatable, save :: blk(:,:)
+ real(dp), allocatable :: blk(:,:)
  real(dp), intent(in) :: refcoord(:),coord(:,:), lattice_vectors(:,:)
  real(dp), intent(in) :: onsites(:,:)
  real(dp), intent(in) :: intParams1(:,:,:),intParams2(:,:,:)
- logical, allocatable,save :: dist_mask(:), onsite_mask(:), calc_mask(:), calcs_mask(:)
- logical, allocatable,save :: calcsp_mask(:), param_mask(:,:), calc_mask_for_porbs(:)
- logical, allocatable,save :: sorb_mask(:), pxorb_mask(:), pyorb_mask(:), pzorb_mask(:)
- logical, allocatable,save :: sorb_at_mask(:), sporb_at_mask(:)
- integer, allocatable,save :: atomidx(:), atomidx_m(:), orbidx(:), orbidx_m(:), orbidx_sel(:)
- real(dp), allocatable,save :: intParams(:,:)
+ logical, allocatable :: dist_mask(:), onsite_mask(:), calc_mask(:), calcs_mask(:)
+ logical, allocatable :: calcsp_mask(:), param_mask(:,:), calc_mask_for_porbs(:)
+ logical, allocatable :: sorb_mask(:), pxorb_mask(:), pyorb_mask(:), pzorb_mask(:)
+ logical, allocatable :: sorb_at_mask(:), sporb_at_mask(:)
+ integer, allocatable :: atomidx(:), atomidx_m(:), orbidx(:), orbidx_m(:), orbidx_sel(:)
+ real(dp), allocatable :: intParams(:,:)

  nats = size(coord,dim=2)
  norbsall = sum(norbs)

+ allocate(blk(size(blk_out,dim=1),size(blk_out,dim=2)))
+
+ blk(:,:)=0.0_dp
+
  if(allocated(dr))then
  if(nats.ne.size(dr,dim=1))then
- deallocate(blk)
  deallocate(dr)
  deallocate(atomidx)
  deallocate(orbidx)
@@ -123,7 +125,6 @@ subroutine get_SKBlock_vect_local(sp,refcoord,coord,lattice_vectors&
  endif
  if(.not.allocated(dr))then
  allocate(dr(nats))
- allocate(blk(4,size(blk_out,dim=2)))
  allocate(atomidx(nats))
  atomidx = (/(i,i=1,nats)/)
  allocate(orbidx(norbsall))
@@ -175,9 +176,7 @@ subroutine get_SKBlock_vect_local(sp,refcoord,coord,lattice_vectors&
  endif
  enddo
  endif
-
- blk(:,:)=0.0_dp
-
+
  do i = 1,3
  Rab(:,i) = coord(i,:)
  Rab(:,i) = modulo((Rab(:,i) - refcoord(i) + 0.5_dp*lattice_vectors(i,i)),lattice_vectors(i,i)) - 0.5_dp * lattice_vectors(i,i)
@@ -331,7 +330,7 @@ subroutine get_dH_or_dS_vect_local(dx,coords,hindex,spindex,intPairsH,onsitesH,s
  real(dp) :: Rax_m(3), Rax_p(3), Ray_m(3), Ray_p(3)
  real(dp) :: Raz_m(3), Raz_p(3), Rb(3), d, maxblockij
  real(dp), allocatable :: Rx(:), Ry(:), Rz(:), blockm(:,:,:)
- real(dp), allocatable :: blockp(:,:,:), dh0(:,:), dH0x(:,:), dH0y(:,:), dH0z(:,:)
+ real(dp), allocatable :: blockp(:,:,:), dH0x(:,:), dH0y(:,:), dH0z(:,:)
  real(dp), intent(in) :: coords(:,:), dx, lattice_vectors(:,:), onsitesH(:,:)
  real(dp), intent(in) :: threshold
  type(bml_matrix_t), intent(inout) :: dH0x_bml, dH0y_bml, dH0z_bml
@@ -620,6 +619,7 @@ subroutine gpmdcov_EnergAndForces(charges)
  type(rankReduceData_t) :: mpimax_in(1), mpimax_out(1)
  integer :: k
  logical :: testsmd
+
  call gpmdcov_msMem("gpmdcov","Before gpmd_EnergAndForces",lt%verbose,myRank)

  if(.not.allocated(coul_forces)) allocate(coul_forces(3,sy%nats))
@@ -781,7 +781,7 @@ subroutine gpmdcov_EnergAndForces(charges)
  GFSCOUL(:,gpat%sgraph(ipt)%core_halo_index(i)+1) = syprt(ipt)%estr%FSCOUL(:,i)
  SKForce(:,gpat%sgraph(ipt)%core_halo_index(i)+1) = syprt(ipt)%estr%SKForce(:,i)
  enddo
-
+
  call bml_deallocate(dSx_bml)
  call bml_deallocate(dSy_bml)
  call bml_deallocate(dSz_bml)
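The declarations changed in `get_SKBlock_vect_local` drop the `save` attribute, so `blk`, `dr`, the masks, and the index arrays become ordinary local allocatables. A minimal sketch of what that changes, assuming nothing about gpmdk itself: the program and the routine names `with_save` and `without_save` are hypothetical and exist only for illustration. A saved allocatable keeps its allocation between calls, while an unsaved one is deallocated automatically on return and has to be allocated again on every call.

```fortran
program save_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer :: n

  ! Call each routine twice and report whether the work array
  ! was still allocated when the routine was entered.
  do n = 1, 2
    call with_save(n)
    call without_save(n)
  end do

contains

  ! Hypothetical routine: the work array has the SAVE attribute,
  ! so the allocation made in the first call survives into the second.
  subroutine with_save(call_no)
    integer, intent(in) :: call_no
    real(dp), allocatable, save :: work(:)
    print *, "with_save    call", call_no, " allocated on entry: ", allocated(work)
    if (.not. allocated(work)) allocate(work(4))
    work = 0.0_dp
  end subroutine with_save

  ! Hypothetical routine: the work array is an ordinary local allocatable.
  ! It is unallocated on every entry and freed automatically on return.
  subroutine without_save(call_no)
    integer, intent(in) :: call_no
    real(dp), allocatable :: work(:)
    print *, "without_save call", call_no, " allocated on entry: ", allocated(work)
    allocate(work(4))
    work = 0.0_dp
  end subroutine without_save

end program save_demo
```

This is consistent with the hunk above allocating and zeroing `blk` unconditionally at the top of the routine once `save` is gone. The trade-off is an allocation per call in exchange for no hidden state shared between calls.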
60 changes: 45 additions & 15 deletions examples/gpmdk/src/gpmdcov_response.F90
@@ -86,37 +86,63 @@ subroutine gpmdcov_response_dpdmu(P1_bml,dPdMu,H1_bml,Norbs,beta,Q_bml,evals,mu0
  #endif
  #endif ! USE_OFFLOAD

- #ifdef USE_NVTX
- call nvtxStartRange("Response Kernel",3)
- #endif
-
  #ifdef USE_OFFLOAD
  P1_bml_c_ptr = bml_get_data_ptr_dense(P1_bml)
  P1_bml_ld = bml_get_ld_dense(P1_bml)

- !$omp target enter data map(alloc:p_0(1:HDIM))
- !$omp target update to(p_0(1:HDIM))
-
  call c_f_pointer(P1_bml_c_ptr,P1_bml_ptr,shape=[P1_bml_ld,HDIM])

- #endif
+ #ifdef USE_NVTX
+ call nvtxStartRange("Response Kernel",3)
+ #endif
+ !call offload_kernel(p_0,P1_bml_ptr,P1_bml_ld,HDIM)
+
+ #ifdef USE_OMP
+ !$omp target enter data map(alloc:p_0(1:HDIM))
+ !$omp target update to(p_0(1:HDIM))
+ #else
+ !$acc enter data copyin(p_0(1:HDIM))
+ #endif
  do i = 1,m ! Loop over m recursion steps
- #ifdef USE_OFFLOAD
+ #ifdef USE_OMP
  !$omp target teams distribute default(none) &
  !$omp shared(HDIM,P1_bml_ptr,p_0)
+
+ #else
+ !$acc parallel loop deviceptr(P1_bml_ptr) present(p_0)
+ #endif
  do k = 1,HDIM
+ #ifdef USE_OMP
  !$omp parallel do
+ #else
+ !$acc loop
+ #endif
  do j = 1,HDIM
  P1_bml_ptr(j,k) = 1.D0/(2.D0*p_0(j)*(p_0(j)-1.D0)+1.D0)*((p_0(j) + p_0(k))*P1_bml_ptr(j,k) &
  & + 2.D0*(P1_bml_ptr(j,k)-(p_0(j) + p_0(k))*P1_bml_ptr(j,k))*1.D0/(2.D0*p_0(k)*(p_0(k)-1.0D0)+1.D0)*p_0(k)*p_0(k))
  enddo
+ #ifdef USE_OMP
  !$omp end parallel do
- enddo
- !$omp end target teams distribute
- !$omp target
- p_0 = 1.D0/(2.D0*(p_0(:)*p_0(:)-p_0(:))+1.D0)*p_0(:)*p_0(:)
- !$omp end target
+ #else
+ !$acc end loop
+ #endif
+ enddo
+ #ifdef USE_OMP
+ !$omp end target teams distribute
+ !$omp target
+ #else
+ !$acc end parallel
+ !$acc kernels present(p_0)
+ #endif
+ p_0 = 1.D0/(2.D0*(p_0(:)*p_0(:)-p_0(:))+1.D0)*p_0(:)*p_0(:)
+ #ifdef USE_OMP
+ !$omp end target
+ #else
+ !$acc end kernels
+ #endif
+ enddo
  #else
+ do i = 1,m ! Loop over m recursion steps
  !$omp parallel do default(none) &
  !$omp private(k,j) &
  !$omp shared(HDIM,P1,p_0)
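The offloaded loop nest in this hunk applies one element-wise update to `P1_bml_ptr(j,k)` using the current `p_0`, then replaces `p_0` with `p_0**2 / (2*(p_0**2 - p_0) + 1)`. A serial host-side sketch of the same arithmetic can serve as a reference when checking an OpenMP or OpenACC build. Everything here beyond the two formulas is an assumption for illustration: the program name, the 3x3 size, the step count, and the input values are made up and do not come from gpmdk.

```fortran
program response_reference
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: hdim = 3, m = 10   ! illustrative sizes, not from the PR
  real(dp) :: p_0(hdim), P1(hdim,hdim)
  real(dp) :: fj, fk, s
  integer :: i, j, k

  p_0 = [0.2_dp, 0.5_dp, 0.9_dp]   ! made-up occupation-like values in (0,1)
  P1  = 1.0_dp                     ! made-up starting response matrix

  do i = 1, m                      ! m recursion steps, as in the kernel
    do k = 1, hdim
      fk = 1.0_dp/(2.0_dp*p_0(k)*(p_0(k)-1.0_dp)+1.0_dp)
      do j = 1, hdim
        fj = 1.0_dp/(2.0_dp*p_0(j)*(p_0(j)-1.0_dp)+1.0_dp)
        s  = p_0(j) + p_0(k)
        ! Same expression as the P1_bml_ptr(j,k) update in the diff,
        ! evaluated with the p_0 values from before this step.
        P1(j,k) = fj*(s*P1(j,k) &
             & + 2.0_dp*(P1(j,k) - s*P1(j,k))*fk*p_0(k)*p_0(k))
      end do
    end do
    ! p_0 is updated only after the whole matrix has been processed.
    p_0 = p_0*p_0/(2.0_dp*(p_0*p_0 - p_0) + 1.0_dp)
  end do

  print *, "p_0 after recursion:", p_0
  print *, "P1 after recursion: ", P1
end program response_reference
```

Since `2*(p**2 - p) + 1` equals `p**2 + (1 - p)**2`, the `p_0` map leaves 0, 0.5, and 1 fixed, pushes values below 0.5 toward 0, and pushes values above 0.5 toward 1. A device build that reads `p_0` after it has already been updated within the same step would diverge from this reference, which is one reason the kernel keeps the two updates in separate regions.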
@@ -130,13 +156,17 @@ subroutine gpmdcov_response_dpdmu(P1_bml,dPdMu,H1_bml,Norbs,beta,Q_bml,evals,mu0
  enddo
  !$omp end parallel do
  p_0 = 1.D0/(2.D0*(p_0(:)*p_0(:)-p_0(:))+1.D0)*p_0(:)*p_0(:)
- #endif
  enddo
+ #endif
  #ifdef USE_NVTX
  call nvtxEndRange
  #endif
  #ifdef USE_OFFLOAD
+ #ifdef USE_OMP
  !$omp target exit data map(delete:p_0(1:HDIM))
+ #else
+ !$acc exit data delete(p_0(1:HDIM))
+ #endif
  #else

  bml_type = bml_get_type(P1_bml)