Memory issues

Submitted by tkulju on Thu, 10/06/2011 - 09:22

Hi!
Does LIGGGHTS impose an upper limit on the one and page values? If I increase the page value too much, I get a memory error when the run is executed. If this is not LIGGGHTS-related, do you know whether it is a Linux or hardware issue? I'm using 64-bit Linux on a distributed-memory cluster with 8 GB of RAM on each node, and LIGGGHTS is run in parallel across the nodes.

The reason for using high values for these parameters is that I quite often get "malloc(): memory corruption" errors, and the docs say that "LAMMPS can crash without an error message if the number of neighbors for a single particle is larger than the page setting". Below is the output of the latest error (for which I used one=5000 and page=1000*one=5000000):

*** glibc detected *** /opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora: malloc(): memory corruption: 0x00000000143a8160 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3451a72fae]
/lib64/libc.so.6(__libc_malloc+0x6e)[0x3451a74cde]
/opt/share/apps/openmpi-1.5.3/lib/openmpi/mca_coll_tuned.so[0x2af33c1df589]
/opt/share/apps/openmpi-1.5.3/lib/libmpi.so.1(PMPI_Allreduce+0x1f9)[0x2af3381746a9]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(_ZN9LAMMPS_NS15FixTriNeighlist9pre_forceEi+0x6f7)[0x59c409]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(_ZN9LAMMPS_NS6Modify9pre_forceEi+0x43)[0x5d305f]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(_ZN9LAMMPS_NS6Verlet3runEi+0x233)[0x69425f]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(_ZN9LAMMPS_NS3Run7commandEiPPc+0x668)[0x6743cc]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(_ZN9LAMMPS_NS5Input15execute_commandEv+0xfb1)[0x5c077d]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(_ZN9LAMMPS_NS5Input4fileEv+0x2b6)[0x5c12c6]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(main+0x5f)[0x5c8cef]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3451a1d994]
/opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora(__gxx_personality_v0+0x341)[0x4713e9]
======= Memory map: ========
00400000-0071f000 r-xp 00000000 00:1b 94147262 /opt/share/apps/LIGGGHTS/liggghtsdev-1.4.4/src/lmp_fedora
.
.
.

- Timo
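
A rough way to sanity-check the one and page values quoted above is sketched below. It assumes the rule from the LAMMPS neigh_modify documentation that page must be at least 10 times one, and it assumes each neighbor entry is stored as a 4-byte integer. The function name and the per-page estimate are illustrative only and are not part of LIGGGHTS.

```python
# Hypothetical helper, not part of LIGGGHTS or LAMMPS.
BYTES_PER_ENTRY = 4  # assumption: neighbor indices are 4-byte ints


def check_neigh_settings(one, page):
    """Return (ok, bytes_per_page) for a neigh_modify one/page pair.

    ok is True when page >= 10 * one, the minimum ratio given in the
    LAMMPS neigh_modify documentation.
    """
    ok = page >= 10 * one
    bytes_per_page = page * BYTES_PER_ENTRY
    return ok, bytes_per_page


ok, size = check_neigh_settings(one=5000, page=5_000_000)
print(ok, size / 1e6, "MB per page")
```

Under these assumptions a single page of 5,000,000 entries takes about 20 MB. How many pages are actually allocated depends on the run, so this gives only a lower bound on the memory involved.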

ckloss | Thu, 10/06/2011 - 09:52

Hi Timo,

>> I get quite often these "malloc(): memory corruption"-errors
Can you post these, please? I do not think that your issues are related to the one and page settings.

Thanks, Christoph
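
One way to read the mangled frames in the backtrace from the first post is to decode them by hand. The sketch below handles only the simplest Itanium C++ ABI form, `_ZN<len><name>...E`, which is enough for the LAMMPS_NS frames shown there; the function name is made up for this example, and a full tool such as c++filt covers the general case.

```python
def demangle_nested(symbol):
    """Decode the qualified name from a simple Itanium-ABI nested symbol.

    Handles only the plain form _ZN<len><name>...E; templates,
    substitutions and parameter types are ignored.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a simple nested name: " + symbol)
    pos = len(prefix)
    parts = []
    while symbol[pos] != "E":
        start = pos
        # Each component is a decimal length followed by that many characters.
        while symbol[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported encoding in " + symbol)
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    return "::".join(parts)


frames = [
    "_ZN9LAMMPS_NS15FixTriNeighlist9pre_forceEi",
    "_ZN9LAMMPS_NS6Modify9pre_forceEi",
    "_ZN9LAMMPS_NS6Verlet3runEi",
]
for frame in frames:
    print(demangle_nested(frame))
```

Read this way, the innermost LIGGGHTS frame in that backtrace is FixTriNeighlist::pre_force, which calls MPI_Allreduce before the abort is raised inside malloc.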

tkulju | Thu, 10/06/2011 - 10:31

Hi Christoph!
I sent you the error output and the script by e-mail. One more thing that has been puzzling me: are the

INFO: Maxmimum number of particle-tri neighbors >380 at step 108568, growing array...done!
INFO: Maxmimum number of particle-tri contacts >12 at step 108581, growing array

lines critical? I get them quite a lot, especially the "particle-tri neighbors" ones.

- Timo

msbentley | Thu, 10/06/2011 - 11:47

I was also wondering about this. I have a long, narrow cylinder in which the triangles making up the mesh (a) are rather elongated (running the whole length of the cylinder) and (b) give the above messages due to many particle-triangle contacts.

Is it better to re-mesh with smaller triangles? And are the issues related to computational efficiency, or are the results in fact then wrong?

Thanks!

Mark

ckloss | Thu, 10/06/2011 - 12:00

Hi Mark and Timo,

Elongated triangles are numerically bad (as are, e.g., elongated cells in CFD). I would advise you to use a mesher like gmsh (which can generate such a mesh from IGES or STEP files) instead of using CAD data directly.

Christoph

tkulju | Thu, 10/20/2011 - 07:36

Hi Christoph & Mark!

Thanks!! I remeshed the surface of my geometry with gmsh and Salome. After importing the better mesh and modifying the simulation parameters (particle Young's modulus and cohesion) a bit, my case is working much better now.

- Timo

tkulju | Mon, 10/24/2011 - 10:12

Hi!
Now I'm getting "Dangerous builds" messages on a regular basis after the simulation has been running for a while. Any idea how to localize this? Or is it an issue to be concerned about?
My Rayleigh and Hertz times are 0.036936609 and 0.021932516, so I don't think it's a time step issue.

- Timo
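
For reference, the Rayleigh time is commonly estimated from the particle radius, density, shear modulus and Poisson's ratio. The sketch below uses that textbook expression. The material values and the time step are invented for illustration and are not taken from Timo's case, and this is not necessarily how LIGGGHTS arrives at the numbers quoted above.

```python
import math


def rayleigh_time(radius, density, youngs_modulus, poisson_ratio):
    """Rayleigh time T_R = pi * r * sqrt(rho / G) / (0.1631 * nu + 0.8766)."""
    shear_modulus = youngs_modulus / (2.0 * (1.0 + poisson_ratio))
    return (math.pi * radius * math.sqrt(density / shear_modulus)
            / (0.1631 * poisson_ratio + 0.8766))


# Hypothetical material: 1 mm radius, 2500 kg/m^3, E = 5e6 Pa, nu = 0.3
t_r = rayleigh_time(1e-3, 2500.0, 5e6, 0.3)
dt = 1e-5  # hypothetical time step in seconds
print("dt / T_R =", dt / t_r)
```

A softer material (lower Young's modulus) gives a longer Rayleigh time, so the same time step becomes a smaller fraction of it.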

ckloss | Mon, 10/24/2011 - 10:18

Two possible reasons are:
+ the time-step is too large (or the skin too small), so that particles travel further than the skin distance in one time-step (you should get a warning in this case)
+ particles that are inserted instantaneously overlap a CAD wall
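
The first reason can be checked with a back-of-the-envelope estimate. The sketch below simply compares the distance the fastest particle can cover in one time-step with the skin distance, as described in the first bullet; the function names and all numbers are placeholders, not values from the case discussed here, and the exact criterion LIGGGHTS uses for the warning may be stricter.

```python
def displacement_per_step(max_speed, timestep):
    """Largest distance a particle can travel in one time step."""
    return max_speed * timestep


def skin_is_sufficient(max_speed, timestep, skin):
    """True if no particle can travel further than the skin in one step."""
    return displacement_per_step(max_speed, timestep) <= skin


# Placeholder values, not taken from the case discussed here.
max_speed = 5.0   # m/s, fastest particle
timestep = 1e-5   # s
skin = 1e-3       # m
print(skin_is_sufficient(max_speed, timestep, skin))
```

If the check fails, either a smaller time-step or a larger skin brings the per-step displacement back below the skin distance.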

Christoph

tkulju | Tue, 10/25/2011 - 14:46

Hi!
The message usually appears at different times than when the particles are inserted, and I don't get any warning about the skin size. Could a wall mesh that is too dense or too coarse be the reason?

For debugging purposes, is there a way to get the coordinates where these "dangerous builds" happen?

- Timo

ckloss | Tue, 10/25/2011 - 23:30

Hi Timo,

>>For debugging purposes, is there a way to get the coordinates, where this "dangerous builds" happens?
I have a debugging option in my version, but as I am out of office I unfortunately can't send it to you...

If you could narrow down the issue a bit (small test case) and send it to me, I can have a look when I return.

For now, you could change the skin size (make it larger) and see if that resolves the issue.

Christoph

tkulju | Fri, 10/28/2011 - 07:10

Hi Christoph!

It seems that the problem is in my rotating .stl geometry, so it is somehow related to the particle-wall interaction with the rotating mesh. Or my mesh is just too bad... It would be nice if you could send me the debug version, or advise me on how I could get the coordinates. I've noticed that these warnings are issued in fix_tri_neighlist.cpp at lines 340-341, so appending the coordinate information there would help to locate the problem.

I'll try to create a simpler test case, which I could send to you. But it may take some time..

- Timo

tkulju | Fri, 10/28/2011 - 14:56

Hi Christoph!
OK, now I have a small test case. Although I wasn't able to reproduce the "Dangerous build" warning, I have another question. My geometry consists of two .stl surfaces, one moving and the other static; is it bad if they overlap? I'll soon send you an example of this issue.

- Timo

tkulju | Tue, 11/01/2011 - 09:37

Hi Christoph!
I think this has something to do with parallelization. I ran the same case (rotating mesh with Hertzian cohesion), and with the serial version I didn't get this "WARNING: Dangerous build in triangle neighbor list." message. So my intuition is that it has something to do with inter-processor communication. Should I increase/set the cutoff value?

- Timo

ckloss | Tue, 11/01/2011 - 10:32

Well, increasing the cutoff value could help. Anyway, this behavior is not to be expected, so I would like to have a look at it. Is this the case you sent me a few days ago?

Christoph

tkulju | Tue, 11/01/2011 - 11:38

Hi!
The case I sent you was a bad imitation of my real case. I'll try to create a better one and send it to you.

- Timo

tkulju | Fri, 11/04/2011 - 14:07

Hi!
I haven't been able to reproduce it in any of the simpler geometries, but the problem seems to be mesh-related: modifying the mesh seems to make it go away. My geometry contains lots of sharp edges and high curvature. The "failed" meshes contained some long and narrow (i.e. high-aspect-ratio) faces, which in my understanding is bad.

Would it be possible to have the fix mesh/gran command report some information about the aspect ratio, or do you know how I can get this information, e.g. in gmsh?

- Timo

ckloss | Sun, 11/06/2011 - 14:02

The aspect ratio is checked in fix mesh/gran, and a warning is issued if the aspect ratio is >10 I think. The code is a bit of a mess (we are currently re-writing it), but if you search for error->warning in fix_mesh_gran.cpp, you should be able to find the place in the code.
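
To inspect a mesh before loading it, the aspect ratio of each triangle can also be estimated offline. The sketch below uses the longest edge divided by the triangle height on that edge as the measure. That definition is an assumption and may differ from the one in fix_mesh_gran.cpp, and the limit of 10 is only the value recalled above, not a verified constant.

```python
import math


def triangle_aspect_ratio(a, b, c):
    """Longest edge divided by the triangle height on that edge.

    a, b, c are (x, y, z) vertex tuples. Returns math.inf for a
    degenerate (zero-area) triangle.
    """
    def sub(p, q):
        return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

    def norm(v):
        return math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)

    ab, ac, bc = sub(b, a), sub(c, a), sub(c, b)
    # The cross product of two edge vectors has length 2 * area.
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    area = 0.5 * norm(cross)
    longest = max(norm(ab), norm(ac), norm(bc))
    if area == 0.0:
        return math.inf
    height = 2.0 * area / longest
    return longest / height


LIMIT = 10.0  # threshold recalled in this thread, not verified
sliver = ((0.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(triangle_aspect_ratio(*sliver) > LIMIT)
```

Looping this over the facets of an STL file would list the triangles most likely to trigger the warning, so they can be targeted when remeshing.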

Cheers, Christoph