Bug when simulating particles of various sizes

Submitted by AzamatSalamat on Tue, 05/19/2020 - 15:40

Hello!

I have been simulating particle insertion with 9 different radii, and the run keeps failing with a bus error (signal 7) whose signal code is "non-existent physical address". The error did not occur when I used only 3 radii. Here is the full error message:

[andromeda:66872] *** Process received signal ***
[andromeda:66872] Signal: Bus error (7)
[andromeda:66872] Signal code: Non-existant physical address (2)
[andromeda:66872] Failing at address: 0x7f5cbc4093e0
[andromeda:66872] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f5cd6d615e0]
[andromeda:66872] [ 1] /lib64/libc.so.6(+0x14ae7b)[0x7f5cd6ad9e7b]
[andromeda:66872] [ 2] /opt/openmpi/lib/libopen-pal.so.20(opal_convertor_pack+0x175)[0x7f5cd6316755]
[andromeda:66872] [ 3] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_prepare_src+0x199)[0x7f5cd6366c59]
[andromeda:66872] [ 4] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7f5cd75f2546]
[andromeda:66872] [ 5] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_recv_frag_callback_ack+0x178)[0x7f5cd75ecec8]
[andromeda:66872] [ 6] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_component_progress+0x3d5)[0x7f5cd6368695]
[andromeda:66872] [ 7] /opt/openmpi/lib/libopen-pal.so.20(opal_progress+0x3c)[0x7f5cd630724c]
[andromeda:66872] [ 8] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send+0x28d)[0x7f5cd75ea25d]
[andromeda:66872] [ 9] /opt/openmpi/lib/libmpi.so.20(MPI_Send+0xf2)[0x7f5cd7516a82]
[andromeda:66872] [10] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9ff326]
[andromeda:66872] [11] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9c8585]
[andromeda:66872] [12] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0xa17ed8]
[andromeda:66872] [13] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x48ac2f]
[andromeda:66872] [14] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x488787]
[andromeda:66872] [15] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x489280]
[andromeda:66872] [16] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x409f5a]
[andromeda:66872] [17] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cd69b0c05]
[andromeda:66872] [18] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x40a137]
[andromeda:66872] *** End of error message ***
[andromeda:66895] *** Process received signal ***
[andromeda:66895] Signal: Bus error (7)
[andromeda:66895] Signal code: Non-existant physical address (2)
[andromeda:66895] Failing at address: 0x7fd324856000
[andromeda:66895] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7fd33e3745e0]
[andromeda:66895] [ 1] /lib64/libc.so.6(+0x14c6b0)[0x7fd33e0ee6b0]
[andromeda:66895] [ 2] /opt/openmpi/lib/libopen-pal.so.20(opal_convertor_pack+0x175)[0x7fd33d929755]
[andromeda:66895] [ 3] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_prepare_src+0x199)[0x7fd33d979c59]
[andromeda:66895] [ 4] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7fd33ec05546]
[andromeda:66895] [ 5] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_recv_frag_callback_ack+0x178)[0x7fd33ebffec8]
[andromeda:66895] [ 6] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_component_progress+0x3d5)[0x7fd33d97b695]
[andromeda:66895] [ 7] /opt/openmpi/lib/libopen-pal.so.20(opal_progress+0x3c)[0x7fd33d91a24c]
[andromeda:66895] [ 8] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send+0x28d)[0x7fd33ebfd25d]
[andromeda:66895] [ 9] /opt/openmpi/lib/libmpi.so.20(MPI_Send+0xf2)[0x7fd33eb29a82]
[andromeda:66895] [10] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9ff326]
[andromeda:66895] [11] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9c8585]
[andromeda:66895] [12] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0xa17ed8]
[andromeda:66895] [13] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x48ac2f]
[andromeda:66895] [14] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x488787]
[andromeda:66895] [15] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x489280]
[andromeda:66895] [16] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x409f5a]
[andromeda:66895] [17] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd33dfc3c05]
[andromeda:66895] [18] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x40a137]
[andromeda:66895] *** End of error message ***
[andromeda:66908] *** Process received signal ***
[andromeda:66908] Signal: Bus error (7)
[andromeda:66908] Signal code: Non-existant physical address (2)
[andromeda:66908] Failing at address: 0x7f5ce004a370
[andromeda:66908] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f5cf7f3b5e0]
[andromeda:66908] [ 1] /lib64/libc.so.6(+0x14ae7b)[0x7f5cf7cb3e7b]
[andromeda:66908] [ 2] /opt/openmpi/lib/libopen-pal.so.20(opal_convertor_pack+0x175)[0x7f5cf74f0755]
[andromeda:66908] [ 3] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_prepare_src+0x199)[0x7f5cf7540c59]
[andromeda:66908] [ 4] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send_request_schedule_once+0x1b6)[0x7f5cf87cc546]
[andromeda:66908] [ 5] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_recv_frag_callback_ack+0x178)[0x7f5cf87c6ec8]
[andromeda:66908] [ 6] /opt/openmpi/lib/libopen-pal.so.20(mca_btl_sm_component_progress+0x3d5)[0x7f5cf7542695]
[andromeda:66908] [ 7] /opt/openmpi/lib/libopen-pal.so.20(opal_progress+0x3c)[0x7f5cf74e124c]
[andromeda:66908] [ 8] /opt/openmpi/lib/libmpi.so.20(mca_pml_ob1_send+0x28d)[0x7f5cf87c425d]
[andromeda:66908] [ 9] /opt/openmpi/lib/libmpi.so.20(MPI_Send+0xf2)[0x7f5cf86f0a82]
[andromeda:66908] [10] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9ff326]
[andromeda:66908] [11] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x9c8585]
[andromeda:66908] [12] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0xa17ed8]
[andromeda:66908] [13] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x48ac2f]
[andromeda:66908] [14] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x488787]
[andromeda:66908] [15] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x489280]
[andromeda:66908] [16] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x409f5a]
[andromeda:66908] [17] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cf7b8ac05]
[andromeda:66908] [18] /home/student/Azamat/LIGGGHTS-PUBLIC/src/lmp_auto[0x40a137]
[andromeda:66908] *** End of error message ***

SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
(the line above was printed 29 times in total)
--------------------------------------------------------------------------
mpirun noticed that process rank 21 with PID 0 on node andromeda exited on signal 7 (Bus error).

I am not sure, but I believe this is a bug, and I have not found anyone else who has experienced the same thing. I have been running LIGGGHTS 3.8.0 on a CentOS 7 cluster using 32 cores. I am also attaching my input script in case it is needed.

Thanks,
Azamat

Attachment: inn.txt (6.43 KB)