segfault when running MPI simulation

Submitted by msbentley on Wed, 06/13/2012 - 15:55

Hi all,

I'm having a problem that only occurs when I run a simulation on multiple processors using MPI. My script is a hybrid granular/molecular simulation: it simply reads in a data file defining ~2k aggregates (each made of 32 spheres) and lets them settle under gravity. Without MPI this works fine. With MPI (no matter how many processors I try to use) it consistently crashes with errors like:

[comp-l06:18276] *** Process received signal ***
[comp-l06:18276] Signal: Segmentation fault (11)
[comp-l06:18276] Signal code: Address not mapped (1)
[comp-l06:18276] Failing at address: 0x18d62390
[comp-l06:18276] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7fd2b1cf88f0]
[comp-l06:18276] [ 1] /home/mab/bin/lmp_openmpi(_ZN9LAMMPS_NS20PairGranHookeHistory7computeEiii+0x526) [0x738d16]
[comp-l06:18276] [ 2] /home/mab/bin/lmp_openmpi(_ZN9LAMMPS_NS6Verlet3runEi+0x195) [0x7d9905]
[comp-l06:18276] [ 3] /home/mab/bin/lmp_openmpi(_ZN9LAMMPS_NS3Run7commandEiPPc+0x26a) [0x7acf5a]
[comp-l06:18276] [ 4] /home/mab/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input15execute_commandEv+0x9ae) [0x6aee2e]
[comp-l06:18276] [ 5] /home/mab/bin/lmp_openmpi(_ZN9LAMMPS_NS5Input4fileEv+0x3c8) [0x6af688]
[comp-l06:18276] [ 6] /home/mab/bin/lmp_openmpi(main+0x49) [0x6b8979]
[comp-l06:18276] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x7fd2b1984c4d]
[comp-l06:18276] [ 8] /home/mab/bin/lmp_openmpi() [0x47c199]
[comp-l06:18276] *** End of error message **

Smaller simulations run fine both on a single CPU and under MPI, but this test case and larger ones break (though they fail at different steps, etc.). Is this user error, or a bug? I've attached my script and data file...
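For reference, the runs are launched roughly like this (the input file name is just a placeholder for the script in the attached archive):

# serial run - completes without problems
lmp_openmpi < in.agg_test

# MPI run - segfaults in PairGranHookeHistory::compute()
# (same result with other processor counts)
mpirun -np 4 lmp_openmpi < in.agg_test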

Thanks!

Mark

Attachment: agg_test.tar_.gz (1.61 MB)

ryan.houlihan | Fri, 06/15/2012 - 02:23

I am also running a hybrid granular/molecular simulation, which reads in a data file containing multiple spheres of a specific radius arranged in a square lattice and lets these spheres settle under gravity. Running without MPI works fine; with MPI, however, it consistently crashes in what seems to be the same function as in your case, PairGranHookeHistory::compute(). The most important part of the error output is below (the lines marked "(..... continues multiple more times)" repeat in a similar pattern; please let me know if the full error is needed):

MPI: On host r1i0n10, Program /home/houlihan/Inputs/RigidGran/lmp_rmm_icc, Rank 108, Process 6871 received signal SIGSEGV(11)

(..... continues multiple more times)

MPI: --------stack traceback-------
MPI: Attaching to program: /proc/8094/exe, process 8094
MPI: Try: zypper install -C "debuginfo(build-id)=341d7c595fd2db49df98b8a6ae2c319f46b43c5b"
MPI: (no debugging symbols found)...done.
MPI: [Thread debugging using libthread_db enabled]
MPI: Try: zypper install -C "debuginfo(build-id)=e907b88d15f5e1312d1ae0c7c61f8da92745738b"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=4e9fa1a2c1141fc0123a142783efd044c40bdaaf"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=3f06bcfc74f9b01780d68e89b8dce403bef9b2e3"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=9e0264386fde8570b215fd4c32465fdda3c1c996"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=f607b21f9a513c99bba9539050c01236d19bf22b"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=c1807b5762068e6c5f4a6a0ed48d9d4469965351"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=d44cbcbbcbdc9ed66abdcd82fa04fb4140bc155e"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=7bcdd7deb661fbb367edf63273568fc962aefbed"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=7d7940c46e5ea77fb4896ce5dba45bc9299c5e0c"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=02c78a8ec7997130f18f6c4fdef78ed36b853133"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=f1e396fe2fd218a097a0fe16bd3e8951056cb0e6"
MPI: (no debugging symbols found)...done.
MPI: 0x00002b257b3aa1e5 in waitpid () from /lib64/libpthread.so.0
MPI: (gdb) #0 0x00002b257b3aa1e5 in waitpid () from /lib64/libpthread.so.0
MPI: #1 0x00002b257b115a74 in mpi_sgi_system (command=) at sig.c:89
MPI: #2 MPI_SGI_stacktraceback (header=) at sig.c:272
MPI: #3 0x00002b257b116425 in first_arriver_handler (signo=11, stack_trace_sem=0x2b257eca9d00) at sig.c:415
MPI: #4 0x00002b257b116620 in slave_sig_handler (signo=11, siginfo=, extra=) at sig.c:494
MPI: #5 <signal handler called>
MPI: #6 0x000000000062c39b in LAMMPS_NS::PairGranHookeHistory::compute(int, int, int) ()
MPI: #7 0x0000000000628ff2 in LAMMPS_NS::PairGran::compute(int, int) ()
MPI: #8 0x00000000006cd0a7 in LAMMPS_NS::Verlet::setup() ()
MPI: #9 0x000000000069ea27 in LAMMPS_NS::Run::command(int, char**) ()
MPI: #10 0x00000000005a9eb6 in LAMMPS_NS::Input::execute_command() ()
MPI: #11 0x00000000005a6fe1 in LAMMPS_NS::Input::file() ()
MPI: #12 0x00000000005b8a0d in main ()
MPI: (gdb) A debugging session is active.

(....... continues multiple more times)

MPI: -----stack traceback ends-----
MPI: On host r1i0n4, Program /home/houlihan/Inputs/RigidGran/lmp_rmm_icc, Rank 48, Process 6620: Dumping core on signal SIGSEGV(11) into directory /home/houlihan/Inputs/RigidGran

(....... continues multiple more times)

MPI: MPI_COMM_WORLD rank 86 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11

ckloss | Sat, 06/16/2012 - 11:04

Hi Mark and Ryan,

please check whether the issue exists in LIGGGHTS 2.0RC as well; there have been several bug fixes related to that

Thanks,
Christoph

msbentley | Mon, 06/18/2012 - 10:49

Hi Christoph,

Thanks for your reply! I'm afraid I fell at the first hurdle here, getting the error:

atom_style hybrid sphere molecular
ERROR: Invalid atom style (atom.cpp:334)

Has the syntax for hybrid atom styles changed in LIGGGHTS 2, or something similar?

Thanks! Mark

ckloss | Mon, 06/18/2012 - 19:02

"atom_style molecular" is part of the MOLECULE package - did you compile with "make yes-MOLECULE" ?

Cheers, Christoph

msbentley | Tue, 06/19/2012 - 09:56

Great, thanks - previously the MOLECULE package was built by default, obviously not now. I'll finish adapting the rest of my script for LIGGGHTS 2 and report back on the MPI crash...

EDIT: as above, I unfortunately get the same errors as before...

msbentley | Thu, 06/28/2012 - 10:01

I've just run my script again after upgrading to the 2.0 release, and it seems to be running better - the same script (which crashed before) has now run with 2 and 4 CPUs under MPI without crashing. I'll start the "full size" run now and see how that goes! Thanks, Mark

ryan.houlihan | Mon, 06/18/2012 - 17:59

Christoph,

Thanks for the response. I have downloaded and installed LIGGGHTS 2 but also receive the following error:

atom_style hybrid granular molecular
ERROR: Invalid atom style (atom.cpp:334)

I took a look at atom.cpp but how to resolve this problem was not immediately obvious. Further help would be greatly appreciated.

Thanks,

Ryan

ryan.houlihan | Wed, 06/20/2012 - 00:10

Sorry for the slow response. I am also receiving the same errors as before using LIGGGHTS 2.

Thanks,

Ryan

zamir | Wed, 02/27/2013 - 19:26

I just wanted to note that this is no longer a bug in v2.3. It may have been fixed in an earlier version, but I never went back to check until now. Thank you, Mr. Kloss.