Clean Finish of LIGGGHTS run

Submitted by JoG on Mon, 04/16/2018 - 11:32

I run a simulation with large moving meshes and about 200,000 particles and regularly write out restart files at the end of each simulation run. The files are larger than 100 MB. LIGGGHTS is called from a bash script, and sometimes the simulation crashes with the error

"SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that."

This error is shown multiple times, sometimes more often and sometimes less often, so I suspect it has something to do with the parallelism. Probably the master process has finished writing the restart file while some other processes have not. So here is my question: is there a command to finish a simulation cleanly? I tried the "quit" command, but this seems to set an error flag, and mpirun then reports an error.

JoG | Mon, 04/16/2018 - 16:33

It really seems to have something to do with writing the restart file to disk. If I run post-processing on the same cluster node while a job is running, the probability that the job crashes seems to be higher.
What I tried so far:
- used a run 0 command after the write_restart command --> problem persists
- defined a command mpi_barrier which only executes MPI_Barrier(world), recompiled, and used it after the write_restart command --> problem persists
- placed a compute after the write_restart command --> problem persists
- wrote restart files for each processor instead of a single, big one --> problem persists

I start LIGGGHTS from a bash script, which I submit to an LSF queue. The corresponding line in the bash script is:
mpirun -np $((numprocs)) "${liggghts}" -in "${loopDEM_inputfile}" -log "post/logs/${current_step}.log" > "post/logs/${current_step}.log.live"
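For context, the surrounding loop script looks roughly like this (a sketch: apart from the mpirun line above, the variable values, the executable path, and the iteration count are placeholders):

#!/bin/bash
# Sketch of the loop script; the values below are placeholders
numprocs=20
liggghts=/home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2
loopDEM_inputfile=in.loopDEM

for current_step in $(seq 1 100); do
    mpirun -np $((numprocs)) "${liggghts}" -in "${loopDEM_inputfile}" -log "post/logs/${current_step}.log" > "post/logs/${current_step}.log.live"
done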

So far the problem only occurs if I use the queue system and the bash script. If I run the iterations sequentially and start LIGGGHTS manually, the crashed cases execute fine and I can continue my case (also by submitting to the queue again) until the next error occurs. However, this could also just be a coincidence.

I would provide a minimal example, but in this case that is not possible, as the problem only occurs for huge restart files and particle numbers.

If anybody else writes huge restart files at the end of a simulation run and runs LIGGGHTS from a bash script in a loop, it would be nice if you could share your experiences.


richti83 | Mon, 04/16/2018 - 19:36

It's only a guess, but did you ask the queue system for enough RAM? It sounds to me like one or more processes are running out of memory when collecting the restart file information.
Can you share the full crash output? Sometimes the signal codes give a hint about what is going wrong.
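If it is an LSF system, you can also request memory explicitly in the submission, e.g. (just a sketch: the slot count, memory value, and script name are only examples, and whether the memory values are taken as MB or another unit depends on the cluster configuration):

bsub -n 20 -M 4000 -R "rusage[mem=4000]" < run_liggghts.sh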

I'm not an associate of DCS GmbH and not a core developer of LIGGGHTS®

JoG | Sun, 04/22/2018 - 11:51

Hi,

thanks for your help. I looked for a way to get a bit more information about the error and used the -e errLog and -o outfile options of the bsub command.
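The submission therefore has roughly this form (a sketch: the slot count and the script name are placeholders; only the -e and -o options are the ones I actually use):

bsub -n 20 -o outfile -e errLog < run_liggghts_loop.sh

In the output log, the interesting part is: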

Exited with exit code 134.

Resource usage summary:

CPU time : 235389.84 sec.
Max Memory : 1541 MB
Max Swap : 6136 MB

Max Processes : 20

And the error log is the following:

*** Error in `/home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2': free(): invalid next size (fast): 0x00000000097d7560 ***
[MIRACULIX:02568] *** Process received signal ***
[MIRACULIX:02568] Signal: Aborted (6)
[MIRACULIX:02568] Signal code: (-6)
[MIRACULIX:02568] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x2b5054764330]
[MIRACULIX:02568] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x2b50549a8c37]
[MIRACULIX:02568] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x2b50549ac028]
[MIRACULIX:02568] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x732a4) [0x2b50549e52a4]
[MIRACULIX:02568] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x7f82e) [0x2b50549f182e]
[MIRACULIX:02568] [ 5] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS14FixMeshSurface20deleteOtherNeighListEPKc+0x1ba) [0xd045aa]
[MIRACULIX:02568] [ 6] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS17FixMassflowMeshJGD1Ev+0x40) [0xcf6a50]
[MIRACULIX:02568] [ 7] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS17FixMassflowMeshJGD0Ev+0x9) [0xcf6a89]
[MIRACULIX:02568] [ 8] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS6Modify10delete_fixEPKcb+0x59) [0x1ae8809]
[MIRACULIX:02568] [ 9] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS6ModifyD1Ev+0x43) [0x1aeb6b3]
[MIRACULIX:02568] [10] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS6ModifyD0Ev+0x9) [0x1aebb39]
[MIRACULIX:02568] [11] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS6LAMMPS7destroyEv+0x70) [0xdf1700]
[MIRACULIX:02568] [12] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(_ZN9LAMMPS_NS6LAMMPSD1Ev+0x9) [0xdf1789]
[MIRACULIX:02568] [13] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2(main+0xd2) [0xb4f3a2]
[MIRACULIX:02568] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x2b5054993f45]
[MIRACULIX:02568] [15] /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2() [0xb50507]
[MIRACULIX:02568] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 2568 on node MIRACULIX exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The LIGGGHTS log file looks like this when the error occurs (this is the end of the input script; the error occurs after all commands have been processed, which is why I asked about a clean finish of the LIGGGHTS run):

variable heatSourceIn delete
variable current_step equal step
write_restart post/restart/*.restartProc%
System init for write_restart ...
print ${current_step} file current_timestep.txt screen no
print 10815965 file current_timestep.txt screen no
compute avTemp all reduce ave f_Temp
mpi_barrier
#print "average particle temperature: $c_avTemp K"
print "#################### DEM RUN ${current_step} DONE ################"
#################### DEM RUN 10815965 DONE ################
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.
SIGINT/SIGTERM caught - Writing restart on next occasion and quitting after that.

The RAM of my machine is 64 GB plus 64 GB of swap, so the size itself is not the problem. As the problem only occurs occasionally (sometimes after 30 minutes, sometimes after 2 days; my feeling is that it happens more often if other processes run on the same node), I find it very hard to tackle. I recently also had the problem in a case with only the rotating meshes and a single particle, so the error does not only occur at big particle numbers.

I will now dig further and read up on the error messages, but my impression is that there is some faulty memory handling somewhere in my LIGGGHTS code. I modified the code slightly, so it is not completely identical to the LIGGGHTS 3.8 version.

As I am not a very experienced C++ programmer, can anybody give me a hint on how to tackle this problem? I use the gdb debugger, but only its basic features, and I don't really know how to approach a problem like this that only occurs after a couple of hours. Has anybody already encountered a similar error while developing? I would really appreciate any help.
Thank you,
Johannes

JoG | Mon, 04/23/2018 - 11:26

OK, I think it was really my own small code changes that caused the problem :-)

I installed valgrind and looked for memory leaks. It found a lot of errors, and some of them were caused by my code. So I looked at the parts caused by me, and I think I found the problem.
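A typical invocation for this kind of check looks roughly like the following (a sketch: the process count, input script, and log file name are placeholders, and it is best done on a much smaller case than the production run, since valgrind slows everything down a lot):

mpirun -np 4 valgrind --leak-check=full --track-origins=yes --log-file=valgrind.%p.log /home/jg/LIGGGHTS/LIGGGHTS-PUBLIC/src/lmp_auto2 -in in.small_test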

I had added a new fix_massflow_mesh by copying the original one and renaming it. When I updated my version to 3.8.0 a couple of months ago, this file did not exist in the official version, so I did not have to merge any changes into it. I now compared the original fix_massflow_mesh with my modified one and found parts that also needed to be changed in my version. I have changed them now and am curious to see whether the problem occurs again. However, I don't know yet, since even the version with the error has been running since last evening. So hopefully I won't be back in this thread a couple of days from now ;-)

Thank you Christian for pointing me in the right direction.