Simulation crashing with segfault

Submitted by QuinnReynolds on Mon, 09/27/2010 - 07:29

Hi all,

We have successfully installed LIGGGHTS 1.1.6 on a small cluster (4 nodes, each with a quad-core processor). In all testing so far the software has performed well and stably, both in serial and in parallel. However, when I try to run a larger simulation (continuous feed and removal of particles, 100000+ particles), LIGGGHTS crashes with what looks like a segmentation fault some way into the run; see the output below. Could this be caused by exceeding the total memory available on the nodes? The simulation was running on all 16 CPUs, with ~512 MB of memory per CPU (2 GB per quad-core processor).

...
124000 98446 103.97684 2.0611838 3.75
125000 98286 83.144795 2.0069003 3.75
WARNING: Less insertions than requested
WARNING: Less insertions than requested
WARNING: Less insertions than requested
126000 99560 90.43412 2.2861112 3.75
127000 99400 83.596227 1.7898005 3.75
WARNING: Less insertions than requested
WARNING: Less insertions than requested
WARNING: Less insertions than requested
128000 100642 79.100547 2.0405716 3.75
WARNING: Less insertions than requested
WARNING: Less insertions than requested
WARNING: Less insertions than requested
129000 101961 90.427392 2.2216379 3.75
130000 101801 82.805503 1.7609675 3.75
WARNING: Less insertions than requested
WARNING: Less insertions than requested
WARNING: Less insertions than requested
[node001:01014] *** Process received signal ***
[node001:01014] Signal: Segmentation fault (11)
[node001:01014] Signal code: Address not mapped (1)
[node001:01014] Failing at address: 0xf5d42328
[node001:01014] [ 0] [0xb78d1410]
[node001:01014] [ 1] /home/quinnr/bin/lmp_openmpi(_ZN9LAMMPS_NS15FixTriNeighlist9pre_forceEi+0x8c1) [0x823d121]
[node001:01014] [ 2] /home/quinnr/bin/lmp_openmpi(_ZN9LAMMPS_NS6Modify9pre_forceEi+0x39) [0x8286b99]
[node001:01014] [ 3] /home/quinnr/bin/lmp_openmpi(_ZN9LAMMPS_NS6Verlet3runEi+0x3cc) [0x8367e3c]
[node001:01014] [ 4] /home/quinnr/bin/lmp_openmpi(_ZN9LAMMPS_NS3Run7commandEiPPc+0x254) [0x83461e4]
[node001:01014] [ 5] /home/quinnr/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input15execute_commandEv+0xa7f) [0x8269fdf]
[node001:01014] [ 6] /home/quinnr/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input4fileEv+0x4fe) [0x826a9fe]
[node001:01014] [ 7] /home/quinnr/bin/lmp_openmpi(main+0x53) [0x8271d63]
[node001:01014] [ 8] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xb747fbd6]
[node001:01014] [ 9] /home/quinnr/bin/lmp_openmpi() [0x80a86f1]
[node001:01014] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 1014 on node node001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[node002][[15420,1],6][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node003][[15420,1],10][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
quinnr@mntksrv005:~/software/liggghts_1p1p6/testcases/sil_fullscale$
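
The frames in the backtrace are mangled C++ symbols, which makes it harder to see where the crash happened. As a rough aid, the sketch below demangles the simple pattern that appears in this trace (nested names followed by a few builtin parameter types). It is not a full demangler: the helper names `demangle_simple` and `demangle_trace_line` are made up for this example, and only a small subset of the Itanium C++ ABI type codes is handled. A tool such as `c++filt` covers the general case.

```python
import re

# Small subset of Itanium C++ ABI builtin type codes (illustrative, not complete).
BUILTINS = {"v": "void", "i": "int", "c": "char", "d": "double", "b": "bool"}


def demangle_simple(symbol):
    """Demangle symbols shaped like _ZN<len><id>...E<params>.

    Returns None for anything outside this narrow pattern.
    """
    if not symbol.startswith("_ZN"):
        return None
    pos = 3
    parts = []
    # Nested name: length-prefixed identifiers, terminated by 'E'.
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        pos += len(digits)
        length = int(digits)
        parts.append(symbol[pos:pos + length])
        pos += length
    if not parts or pos >= len(symbol) or symbol[pos] != "E":
        return None
    pos += 1
    name = "::".join(parts)
    if pos == len(symbol):
        return name  # no parameter list, e.g. a static data member
    params = []
    while pos < len(symbol):
        stars = 0
        while pos < len(symbol) and symbol[pos] == "P":  # 'P' marks a pointer
            stars += 1
            pos += 1
        if pos >= len(symbol) or symbol[pos] not in BUILTINS:
            return None
        params.append(BUILTINS[symbol[pos]] + "*" * stars)
        pos += 1
    if params == ["void"]:  # a lone 'v' encodes an empty parameter list
        params = []
    return name + "(" + ", ".join(params) + ")"


def demangle_trace_line(line):
    """Pull the mangled symbol out of one backtrace line and demangle it."""
    match = re.search(r"\((_Z\w+)\+0x[0-9a-fA-F]+\)", line)
    return demangle_simple(match.group(1)) if match else None


frame = ("[node001:01014] [ 1] /home/quinnr/bin/lmp_openmpi"
         "(_ZN9LAMMPS_NS15FixTriNeighlist9pre_forceEi+0x8c1) [0x823d121]")
print(demangle_trace_line(frame))
```

Applied to frame 1, this gives `LAMMPS_NS::FixTriNeighlist::pre_force(int)`, so the faulting address was reached from the triangle neighbor list fix during the pre-force stage of the Verlet run loop.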

Kind regards,
Quinn Reynolds

ckloss | Mon, 09/27/2010 - 09:09

Hi Quinn,

hm - difficult to say.

>>Could this be related to the total memory available on the nodes being exceeded
That is possible. Are you running on a 32-bit or a 64-bit system?
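
One way to check this for the executable itself is to read the class byte in its ELF header, which is 1 for 32-bit and 2 for 64-bit binaries. The sketch below does that; the function name `elf_bits` is invented for this example. The `/lib/tls/i686/cmov/libc.so.6` path and the 8-digit addresses in the trace above hint at a 32-bit build, but that is only an inference from the log.

```python
def elf_bits(path):
    """Return 32 or 64 based on the ELF header's EI_CLASS byte.

    Returns None if the file is not an ELF binary or the class is unknown.
    """
    with open(path, "rb") as f:
        header = f.read(5)
    # Bytes 0-3 are the ELF magic number; byte 4 is EI_CLASS.
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    return {1: 32, 2: 64}.get(header[4])


# Example call, using the binary path from the trace above:
#   print(elf_bits("/home/quinnr/bin/lmp_openmpi"))
```

If the binary turns out to be 32-bit, each process is limited to a 4 GB virtual address space regardless of how much memory the node has.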

You can send me the input script plus any geometry files it needs (use the "contact us" form), and I will have a look.

Christoph