hi,
Can I add GPU support to LIGGGHTS? How?
Does anybody have experience with this?
I tried to build LIGGGHTS with GPU support:
- downloaded and installed the NVIDIA CUDA toolkit
- added the directory to PATH and included cuda/lib
but when I run "make -f Makefile.nvidia" I get the error: "gb_gpu_memory.cu:67: error: ‘__T11’ has not been declared"
many thanks
jikra
ckloss_ | Fri, 08/12/2011 - 22:49
jikra,
I would ask you for a bit of patience. There has been considerable progress in LAMMPS regarding GPU support, and this is not in LIGGGHTS yet. It will be part of the next major release, probably by the end of the year.
For the future, we will try to keep LIGGGHTS more tightly in line with LAMMPS, so that this kind of functionality becomes available faster for LIGGGHTS users.
Christoph
chegdan | Mon, 07/09/2012 - 16:42
What is the status of GPU-LIGGGHTS?
Jikra and Christoph,
I'm curious what the status of the GPU-LIGGGHTS project currently is. Internet searches for GPU-LIGGGHTS lead me either to this post or to (https://nf.nci.org.au/facilities/software/software.php?software=LIGGGHTS...) without a download link. I understand that there is a GPU-LAMMPS (http://lammps.sandia.gov/doc/Section_accelerate.html#acc_4); what needs to be done to tie GPU-LAMMPS into LIGGGHTS? Is there anything I can do to help move this forward? Thanks in advance for your reply.
Dan
jorisheyman | Thu, 12/06/2012 - 12:32
Ready to help also!
richti83 | Thu, 12/06/2012 - 21:27
Maybe OpenACC is the way we should go
The upcoming PGI compiler release 2013 will support OpenACC acceleration for C++.
https://developer.nvidia.com/openacc
http://www.pgroup.com/pricing/aewsa.htm
I have already given it a try:
#include <stdio.h>
#include <time.h>

#define N 1000000000L /* number of integration steps; value not shown in the original post */

int main(void) {
    clock_t begin = clock();
    double pi = 0.0;
    long i;
    #pragma acc parallel loop reduction(+:pi) // here the magic happens
    for (i = 0; i < N; i++) {
        double t = (i + 0.5) / N;
        pi += 4.0 / (1.0 + t * t);
    }
    pi /= N; // average of 4/(1+t^2) over [0,1] is pi
    clock_t end = clock();
    printf("pi=%.15f (time=%f)\n", pi, (double)(end - begin) / CLOCKS_PER_SEC);
    return 0;
}
With a Quadro 4000 and CUDA 5 vs. a Xeon E5-2687W I got:
#Portland-Group-Compiler
dem@T7600:~/playground/pgiTEST$ pgcc -ta=nvidia:5.0 -acc -Minfo=accel pi.c -o pi
dem@T7600:~/playground/pgiTEST$ ./pi
pi=3.141592653589731 (time=0.700000)
dem@T7600:~/playground/pgiTEST$ ./pi
pi=3.141592653589731 (time=0.600000)
dem@T7600:~/playground/pgiTEST$ ./pi
pi=3.141592653589731 (time=0.600000)
#Standard-Gnu-C-Compiler
dem@T7600:~/playground/pgiTEST$ gcc pi.c -o piCPU
dem@T7600:~/playground/pgiTEST$ ./piCPU
pi=3.141592653589731 (time=2.000000)
dem@T7600:~/playground/pgiTEST$ ./piCPU
pi=3.141592653589731 (time=2.100000)
Now I'm waiting for the release.
PGI answered me:
C++ support will be included in PGI release 13.0 due out in early
December. We'll send you a notice when it's available for download.
jorisheyman | Tue, 12/11/2012 - 08:58
It seems not to be OpenSource
It doesn't seem to be open source, does it?
richti83 | Tue, 12/11/2012 - 12:42
true, but it's only 300 EUR
True, but it's only 300 EUR for the PGI compiler with the accelerator feature (university licence); every GPU costs more than that.
jorisheyman | Tue, 12/11/2012 - 13:04
yep that's true.
I wonder if we could move to GPU gradually, by porting a few routines at a time while keeping the others on the CPU. For instance, the loop over all particles in fix_addforce... That way we could parallelize the work needed for the GPU migration ;)
richti83 | Wed, 12/19/2012 - 11:38
parallelize the work needed for GPU migration
"modifying successively a few routines while keeping the others in CPU"
I tried this with a simple fix that should only add a constant velocity to all atoms, and with pair_gran_hooke. The big problem is that we have to copy a lot of data to/from the GPU, and the cost of that is much bigger than the gain from parallelization.
// sped-up set_vel function
void set_vel(int n, double **vel, double vx, double vy, double vz) {
    #pragma acc data copyout(vel[0:n][0:3])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            vel[i][0] = vx;
            vel[i][1] = vy;
            vel[i][2] = vz;
        }
    }
}
This was 2 times slower than the CPU version.
and:
#pragma acc kernels loop \
copyin(numneigh[0:maxatoms]) \
copyin(ilist[0:maxatoms]) \
copyin(x[0:maxatoms][0:3]) \
copy(v[0:maxatoms][0:3]) \
copy(f[0:maxatoms][0:3]) \
copy(torque[0:maxatoms][0:3]) \
copy(omega[0:maxatoms][0:3]) \
copyin(rmass[0:maxatoms]) \
copyin(type[0:maxatoms]) \
copyin(radius[0:maxatoms]) \
copyin(mass[0:maxatoms]) \
copyin(firstneigh[0:maxatoms][0:3]) \
create(betaeff[0:maxtype][0:maxtype])
// note: the 2nd dimension of firstneigh above is wrong, but with n x n my GPU runs out of memory!
{
for (ii = 0; ii < inum; ii++) {
int i = ilist[ii];
xtmp = x[i][0];
ytmp = x[i][1];
ztmp = x[i][2];
radi = radius[i];
jlist = firstneigh[i];
jnum = numneigh[i];
for (jj = 0; jj < 3; jj++) { //jnum
...
~50 times slower, because the copy clauses are invoked at every calculation step.
Conclusion: we need to switch to GPU in one big step.
Now it's time to think about how this could happen.
USER-CUDA: pro: no special compiler needed; con: a lot of handcraft needed, no idea about the granular atom style
USER-GPU: same
OpenACC: pro: fast and easy by adding some directives; con: needs an extra compiler, less control over the generated code, at the moment no C++ support (I avoided this by offloading the compute function into a C99 library and linking it into LIGGGHTS with pgiCXX)
For motivation, a nice graph on this topic (taken from here: http://developer.download.nvidia.com/GTC/GTC-Express-PGI-Webinar.pdf)
Have a nice Christmas,
Christian
richti83 | Sat, 11/02/2013 - 14:46
UPDATE
This sounds very, very interesting:
http://hpc2013.hpclatam.org/papers/HPCLatAm2013-paper-10.pdf
Are there plans to support this in LIGGGHTS 3 ?
I'm willing to help as much as I can.
Best,
Christian.
richti83 | Sun, 11/03/2013 - 23:01
UPDATE 2
I successfully crossed lammpscuda with LIGGGHTS 2.3.8.
NVE/SPHERE/CUDA seems to be working,
atom style granular/cuda seems to be working too,
and the neighbor list can be used as is.
I forked the official git repository and uploaded the changes here:
https://github.com/richti83/LIGGGHTS-CUDA
Next step is to write new cuda-kernels for the new pair styles in liggghts.
At last some timings with granular package:
#CPU only:
#Loop time of 39.8425 on 1 procs for 1000 steps with 32000 atoms
#Loop time of 12.2629 on 4 procs for 1000 steps with 32000 atoms
#with GPU acceleration
#Loop time of 9.15723 on 1 procs for 1000 steps with 32000 atoms
The test script used can be downloaded here:
https://raw.github.com/richti83/LIGGGHTS-CUDA/master/cuda_benchmark/in.b...
Unfortunately it does not work with MPI at the moment ...
As this is not an officially funded project, I have little time to code, and I'm happy about any contribution / donation.
Best,
Christian.
ckloss | Mon, 11/04/2013 - 12:36
hi christian,
thanks for sharing the paper! For the moment, we're not looking into GPU support, as we have no one who could seriously work on it and support it.
Also, AFAIK the LAMMPS developers plan to deprecate the CUDA package at some point, so I think we won't work on this one but will wait for the next major release from the LAMMPS side instead.
Cheers
Christoph
vkoppejan | Wed, 01/06/2016 - 21:24
Is there anything new yet?
Dear Christian,
I came across this lib folder on GitHub:
https://github.com/CFDEMproject/LIGGGHTS-PUBLIC/tree/master/lib/gpu
Is this a new development, and if so, is there any documentation on how to use it?
Thank you for your feedback.
Regards,
Victor
ckloss | Mon, 01/25/2016 - 22:42
Hi Victor,
no news on this - we're following the developments on the LAMMPS side (that's where the GPU package/lib comes from), but we are not fully convinced of the merits for granular simulations. So there are no developments planned in the very near future.
Christoph
gilgamesch | Wed, 10/02/2019 - 08:49
Bumping again.
Sorry for bumping old threads. Is there anything new happening on this front, or is it still not worth the effort?
Most of you probably know this, but this presentation from 2016 (I think) shows a major decrease in processing time with a single GPU (from 46 h to 0.19 h):
http://on-demand.gputechconf.com/gtc/2016/presentation/s6113-nicolin-gov...
It would be nice to make use of that kind of power.