SIGINT/SIGTERM caught

Submitted by AzamatSalamat on Tue, 05/19/2020 - 11:12

Hello!

I am trying to simulate about 7 million particles with sizes ranging from 0.0003175 m to 0.0031 m. During the simulation I received the following error:

Memory usage per processor = 356.933 Mbytes
Step Atoms KinEng CPU
66000 1860696 980.5158 0
INFO: Particle insertion ins: inserted 1019560 particle templates (mass 6.060482e+01) at step 66667
- a total of 2880256 particle templates (mass 1.712086e+02) inserted so far.

SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.

[the line above is printed 31 times in total]
--------------------------------------------------------------------------
mpirun noticed that process rank 11 with PID 0 on node andromeda exited on signal 9 (Killed).

Other forum answers suggested that the problem might be an overly large neighbor bin distance (cut-off distance). I reduced this distance from 0.004 m to 0.0003 m, but the error was the same. Does anyone know what to do? I am also attaching my input script.

I've been running this on an HPC cluster with CentOS 7 using 32 cores. There was no problem on this cluster when I simulated a similar case with particles 5 times larger in diameter than the actual size (~2 500 000 particles). On my home laptop using 8 cores, the smallest size I could simulate was 30 times larger in diameter (~20 000 particles); anything smaller than that caused this SIGINT/SIGTERM error. So I believe this error has something to do with the hardware and the load distribution between processors, but I do not know how to solve it.
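
To put a rough number on the hardware side: the log above reports 356.933 Mbytes per processor at step 66000 with 1860696 atoms, and the run uses 32 cores. The sketch below multiplies that out and then extrapolates to about 7 million particles. The linear scaling with particle count is an assumption, not something LIGGGHTS guarantees, and the function names are made up for illustration. Comparing the result with the RAM of the node would show whether "exited on signal 9 (Killed)" could come from the system running out of memory, which is only a guess at this point.

```python
def total_memory_mb(per_rank_mb, ranks):
    """Aggregate memory if every rank uses the reported per-processor amount."""
    return per_rank_mb * ranks


def scale_linear(memory_mb, atoms_now, atoms_target):
    """Extrapolate memory to a larger particle count (ASSUMES linear scaling)."""
    return memory_mb * atoms_target / atoms_now


if __name__ == "__main__":
    # Values taken from the log: 356.933 Mbytes per processor, 32 cores,
    # 1860696 atoms at step 66000; the target is about 7 million particles.
    now_mb = total_memory_mb(356.933, 32)
    target_mb = scale_linear(now_mb, 1860696, 7000000)
    print(f"at step 66000: {now_mb:.0f} Mbytes over all ranks")
    print(f"at ~7 million particles: {target_mb:.0f} Mbytes over all ranks")
```

The estimate ignores neighbor lists growing with the cut-off distance and any imbalance between ranks, so it is at best a lower bound to compare with the output of `free -m` on the node.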

Thanks,
Azamat

Attachment: in.txt (6.39 KB)

AzamatSalamat | Tue, 05/19/2020 - 12:37

I tried a similar case with particles 2 times larger and a cut-off distance of 0.003 m, and it gave me a more detailed error:

[andromeda:53267] *** Process received signal ***
[andromeda:53267] Signal: Bus error (7)
[andromeda:53267] Signal code: Non-existant physical address (2)
[andromeda:53267] Failing at address: 0x7fc7245a4000
[andromeda:53267] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7fc73de3b5e0]
[andromeda:53267] [ 1] /lib64/libc.so.6(+0x14c6b0)[0x7fc73dbb56b0]
[andromeda:53267] [ 2] /opt/openmpi/lib/libopen-pal.so.20(opal_convertor_pack+0x175)[0x7fc73d3f0755]
[andromeda:53267] [ 3] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_prepare_src+0x199)[0x7fc73d440c59]
[andromeda:53267] [ 4] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7fc73e6cc546]
[andromeda:53267] [ 5] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_recv_frag_callback_ack+0x178)[0x7fc73e6c6ec8]
[andromeda:53267] [ 6] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_component_progress+0x3d5)[0x7fc73d442695]
[andromeda:53267] [ 7] /opt/openmpi/lib/libopen-pal.so.20(opal_progress+0x3c)[0x7fc73d3e124c]
[andromeda:53267] [ 8] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send+0x28d)[0x7fc73e6c425d]
[andromeda:53267] [ 9] /opt/openmpi/lib/libmpi.so.20(MPI_Send+0xf2)[0x7fc73e5f0a82]
[andromeda:53267] [10] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9ff326]
[andromeda:53267] [11] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9c8585]
[andromeda:53267] [12] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0xa17ed8]
[andromeda:53267] [13] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x48ac2f]
[andromeda:53267] [14] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x488787]
[andromeda:53267] [15] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x489280]
[andromeda:53267] [16] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x409f5a]
[andromeda:53267] [17] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc73da8ac05]
[andromeda:53267] [18] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x40a137]
[andromeda:53267] *** End of error message ***

SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
...

A forum answer from 2010 says only that it is a bug. Is it still a bug?
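
The backtrace fails inside mca_btl_sm_prepare_src, that is, in Open MPI's shared-memory transport, with a bus error on a non-existent physical address. One possible cause (an assumption, not confirmed in this thread) is that the file system backing the shared-memory segments has filled up. The sketch below reports the free space for a few candidate locations. The paths /dev/shm and /tmp are guesses at where those segments live on this system, and the function name is hypothetical.

```python
import os
import shutil


def free_space_mb(paths):
    """Return {path: free space in MB} for every listed directory that exists."""
    report = {}
    for path in paths:
        if os.path.isdir(path):
            usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
            report[path] = usage.free / 1e6
    return report


if __name__ == "__main__":
    # ASSUMPTION: these are the locations used for shared-memory backing files.
    for path, free_mb in free_space_mb(["/dev/shm", "/tmp"]).items():
        print(f"{path}: {free_mb:.0f} MB free")
```

Running it while the simulation is active, shortly before the crash, would be more informative than running it on an idle node.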

AzamatSalamat | Wed, 05/20/2020 - 15:34

I also tried reducing the time step from 1e-5 s to 1e-6 s but got the same error.
I have not tried 1e-7 s or smaller yet, so I might try that later.