GPU in LIGGGHTS

Submitted by jikra on Fri, 08/12/2011 - 22:16

Hi,

Can I add GPU support to LIGGGHTS? How?
Does anybody have any experience with this?

I tried to set up GPU support for LIGGGHTS as follows:
- downloaded and installed the NVIDIA CUDA toolkit
- added the CUDA directories to PATH and cuda/lib to the library path

But when I run "make -f Makefile.nvidia" I get the error: "gb_gpu_memory.cu:67: error: ‘__T11’ has not been declared".

Many thanks,
jikra

ckloss_ | Fri, 08/12/2011 - 22:49

jikra,

I would ask you for a bit of patience. There has been considerable progress in LAMMPS regarding GPU support, and this is not in LIGGGHTS yet. It will be part of the next major release, probably by the end of the year.

For the future, we will try to keep LIGGGHTS more tightly in line with LAMMPS so that this kind of functionality becomes available to LIGGGHTS users faster.

Christoph

chegdan | Mon, 07/09/2012 - 16:42

Jikra and Christoph,

I'm curious what the current status of the GPU-LIGGGHTS project is. Internet searches for GPU-LIGGGHTS lead me either to this post or to (https://nf.nci.org.au/facilities/software/software.php?software=LIGGGHTS...), which has no download link. I understand that there is a GPU-LAMMPS (http://lammps.sandia.gov/doc/Section_accelerate.html#acc_4); if so, what needs to be done to tie GPU-LAMMPS into LIGGGHTS? Is there anything I can do to help move this forward? Thanks in advance for your reply.

Dan

richti83 | Thu, 12/06/2012 - 21:27

The upcoming PGI compiler R2013 will support C++ OpenACC acceleration.

https://developer.nvidia.com/openacc
http://www.pgroup.com/pricing/aewsa.htm
I have already given it a try:

#include <stdio.h>
#include <time.h>

#define N 1000000000L /* problem size; value assumed, not given in the original post */

int main(void) {
    clock_t begin = clock();
    double pi = 0.0;
    long i;
    #pragma acc parallel loop reduction(+:pi) // here the magic happens; reduction needed for the parallel sum
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    pi /= N;
    clock_t end = clock();
    printf("pi=%.15f (time=%f)\n", pi, (double)(end - begin) / CLOCKS_PER_SEC);
    return 0;
}

With a Quadro 4000 and CUDA 5.0 vs. a Xeon E5-2687W I got:

#Portland-Group-Compiler
dem@T7600:~/playground/pgiTEST$ pgcc -ta=nvidia:5.0 -acc -Minfo=accel pi.c -o pi
dem@T7600:~/playground/pgiTEST$ ./pi
pi=3.141592653589731 (time=0.700000)
dem@T7600:~/playground/pgiTEST$ ./pi
pi=3.141592653589731 (time=0.600000)
dem@T7600:~/playground/pgiTEST$ ./pi
pi=3.141592653589731 (time=0.600000)


#Standard-Gnu-C-Compiler
dem@T7600:~/playground/pgiTEST$ gcc pi.c -o piCPU
dem@T7600:~/playground/pgiTEST$ ./piCPU
pi=3.141592653589731 (time=2.000000)
dem@T7600:~/playground/pgiTEST$ ./piCPU
pi=3.141592653589731 (time=2.100000)

Now I'm waiting for the release.

PGI answered me:

C++ support will be included in PGI release 13.0 due out in early
December. We'll send you a notice when it's available for download.

richti83 | Tue, 12/11/2012 - 12:42

True, but the PGI compiler with the accelerator feature is only 300 EUR (university license); any GPU costs more than that.

jorisheyman | Tue, 12/11/2012 - 13:04

Yep, that's true.
I wonder if we could move to the GPU gradually by porting a few routines at a time while keeping the others on the CPU. For instance, the loop over all particles in fix_addforce... That way we could parallelize the work needed for the GPU migration ;)
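For illustration, a directive-based port of such a per-particle loop might look roughly like the sketch below. This is only a sketch: the function name add_force_acc, the arrays f and mask, and the group-mask test are placeholders in the spirit of fix_addforce, not the actual LIGGGHTS code.

// Sketch only: OpenACC-annotated per-particle loop in the spirit of fix_addforce.
// All names (add_force_acc, f, mask, groupbit) are illustrative placeholders.
void add_force_acc(int nlocal, double **f, int *mask, int groupbit,
                   double xvalue, double yvalue, double zvalue)
{
    // copy the force array to the device and back, the mask only to the device
    #pragma acc parallel loop copy(f[0:nlocal][0:3]) copyin(mask[0:nlocal])
    for (int i = 0; i < nlocal; i++) {
        if (mask[i] & groupbit) {   // only touch atoms in the fix group
            f[i][0] += xvalue;
            f[i][1] += yvalue;
            f[i][2] += zvalue;
        }
    }
}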

richti83 | Wed, 12/19/2012 - 11:38

"porting a few routines at a time while keeping the others on the CPU"

I tried this with a simple fix that should only add a constant velocity to all atoms, and with pair_gran_hooke. The big problem is that we have to copy a lot of data to and from the GPU, and the cost of that is much bigger than the gain from parallelization.


// "sped up" set_vel function: sets a constant velocity on all n atoms
void set_vel(int n, double **vel, double vx, double vy, double vz) {
    // copyout: the velocity array is only written on the device,
    // then copied back to the host
    #pragma acc data copyout(vel[0:n][0:3])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            vel[i][0] = vx;
            vel[i][1] = vy;
            vel[i][2] = vz;
        }
    }
}

Result: about 2 times slower than the CPU-only version.
And for the pair style:

#pragma acc kernels loop \
    copyin(numneigh[0:maxatoms]) \
    copyin(ilist[0:maxatoms]) \
    copyin(x[0:maxatoms][0:3]) \
    copy(v[0:maxatoms][0:3]) \
    copy(f[0:maxatoms][0:3]) \
    copy(torque[0:maxatoms][0:3]) \
    copy(omega[0:maxatoms][0:3]) \
    copyin(rmass[0:maxatoms]) \
    copyin(type[0:maxatoms]) \
    copyin(radius[0:maxatoms]) \
    copyin(mass[0:maxatoms]) \
    copyin(firstneigh[0:maxatoms][0:3]) \
    create(betaeff[0:maxtype][0:maxtype])
    // note: [0:3] is the wrong 2nd dimension for firstneigh, but with n x n my GPU runs out of memory!
{
    for (ii = 0; ii < inum; ii++) {
        int i = ilist[ii];
        xtmp = x[i][0];
        ytmp = x[i][1];
        ztmp = x[i][2];
        radi = radius[i];
        jlist = firstneigh[i];
        jnum = numneigh[i];

        for (jj = 0; jj < 3; jj++) { // should be jnum
            ...

This was ~50 times slower, because the copy clauses are invoked at every timestep.

Conclusion: we need to switch to the GPU in one big step and keep the data resident on the device (a sketch of how that could look follows below the list).
Now it's time to think about how this could happen:
USER-CUDA: pro: no special compiler needed; con: a lot of hand-crafted code needed, and no idea yet how to handle the granular atom style
USER-GPU: same
OpenACC: pro: fast and easy by adding a few directives; con: needs an extra compiler, less control over the generated code, and at the moment no C++ support (I worked around this by offloading the compute function into a C99 library and linking it into LIGGGHTS with pgiCXX)
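
For illustration, one way to avoid the per-step copies would be to keep the arrays resident on the GPU with OpenACC unstructured data regions (enter data / exit data / update) and only synchronize with the host when output or communication requires it. This is only a sketch under assumptions: it needs a compiler with OpenACC 2.0 support, and the array names and shapes (x, v, f, nmax) are taken from the examples above, not from the real LIGGGHTS data structures.

// Sketch only: keep arrays resident on the device across timesteps
// instead of copying them in and out inside every force computation.
// Names and shapes (x, v, f, nmax) are assumptions, not LIGGGHTS code.

void device_setup(int nmax, double **x, double **v, double **f) {
    // allocate on the device and copy in once, e.g. after setup
    #pragma acc enter data copyin(x[0:nmax][0:3], v[0:nmax][0:3], f[0:nmax][0:3])
}

void device_step(int nmax, double **x, double **v, double **f) {
    // data is already present on the device: no copy clauses per step
    #pragma acc parallel loop present(x[0:nmax][0:3], v[0:nmax][0:3], f[0:nmax][0:3])
    for (int i = 0; i < nmax; i++) {
        f[i][0] = 0.0;  // placeholder: the real force computation would go here
        f[i][1] = 0.0;
        f[i][2] = 0.0;
    }
}

void device_fetch(int nmax, double **x, double **v, double **f) {
    // copy back to the host only when really needed (output, MPI exchange, ...)
    #pragma acc update host(x[0:nmax][0:3], v[0:nmax][0:3], f[0:nmax][0:3])
}

void device_teardown(int nmax, double **x, double **v, double **f) {
    #pragma acc exit data delete(x[0:nmax][0:3], v[0:nmax][0:3], f[0:nmax][0:3])
}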

For motivation, there is a nice graph on this topic here: http://developer.download.nvidia.com/GTC/GTC-Express-PGI-Webinar.pdf

Have a nice Christmas,
Christian

richti83 | Sun, 11/03/2013 - 23:01

I have successfully combined LAMMPSCUDA with LIGGGHTS 2.3.8.
NVE/SPHERE/CUDA seems to be working,
the atom style granular/cuda seems to be working too,
and the neighbor list can be used as is.

I forked the official git repository and uploaded the changes here:
https://github.com/richti83/LIGGGHTS-CUDA

Next step is to write new cuda-kernels for the new pair styles in liggghts.

Finally, some timings with the granular package:

#CPU only:
#Loop time of 39.8425 on 1 procs for 1000 steps with 32000 atoms
#Loop time of 12.2629 on 4 procs for 1000 steps with 32000 atoms
#with GPU acceleration
#Loop time of 9.15723 on 1 procs for 1000 steps with 32000 atoms

The test script used can be downloaded here:
https://raw.github.com/richti83/LIGGGHTS-CUDA/master/cuda_benchmark/in.b...

Unfortunately it does not work with MPI at the moment ...

As this is not an officially funded project I have little time to code, and I'm happy about any contribution / donation.

Best,
Christian.

ckloss | Mon, 11/04/2013 - 12:36

Hi Christian,

Thanks for sharing the paper! For the moment, we're not looking into GPU support, as we have no one who could really work on it and support it.
Also, as far as I know the LAMMPS developers plan to deprecate the CUDA package at some point, so I think we won't work on this one but will instead wait for the next major release on the LAMMPS side.

Cheers
Christoph

ckloss | Mon, 01/25/2016 - 22:42

Hi Victor,

No news on this - we're following the developments on the LAMMPS side (that's where the GPU package/lib comes from), but we are not fully convinced of its merits for granular simulations. So no developments are planned in the very near future.

Christoph