write_dump of a gzipped file hangs the simulation when running on several nodes

Submitted by iluvatar on Sun, 12/20/2020 - 04:24

Hello,
(Edit: Moved to the LIGGGHTS users forum.)
My simulations hang when I run them on several nodes. They work fine on a single node with several CPUs. I was able to build a minimal working example based on the packing example: adding just the single line
write_dump all custom out/dump_initial.gz id
is enough to hang the simulation (around step 4000 when run on two nodes) or to trigger a strange MPI error. I checked that the dump file itself is written successfully, but it seems that something does not finish correctly among the processes, and the simulation hangs. I do not know enough about the code to fix this, so I came here for help.
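One way to confirm that the compressed dump is really complete, rather than merely present, is to test the gzip stream itself: a gzip file whose writer did not finish normally usually lacks its trailer, and `gzip -t` reports that as an error. The function name `check_gz` below is only illustrative.

```shell
#!/bin/bash
# check_gz FILE: succeed only if FILE is a complete, uncorrupted gzip stream.
# gzip -t decompresses without writing output and fails on a truncated
# or damaged archive (e.g. "unexpected end of file").
check_gz() {
    local f="$1"
    if gzip -t "$f" 2>/dev/null; then
        echo "OK: $f is a complete gzip file"
        return 0
    else
        echo "BROKEN: $f is truncated or not valid gzip" >&2
        return 1
    fi
}
```

For example, `check_gz out/dump_initial.gz` after the run prints an `OK:` line for an intact file and returns a non-zero status for a truncated one.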

Note that the simulation runs without problems on a single node. Also, if I write an uncompressed file instead, the simulation works even on several nodes, so the problem might be related to the gzip writer.
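A possible cause, stated here as an assumption that has not been checked against the LIGGGHTS 3.8.0 source: LAMMPS, on which LIGGGHTS is based, writes `.gz` dumps (when built with gzip support) by opening a pipe to an external `gzip` process with `popen()`, which forks a child from the MPI rank, and some MPI implementations (Open MPI, for example) warn that `fork()` can misbehave on certain multi-node network transports. Since the uncompressed dump works on several nodes, one workaround is to write the dump uncompressed and compress it from the batch script after `srun` returns, outside of MPI. The helper name `compress_dump` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: compress a dump file that LIGGGHTS wrote uncompressed,
# e.g. via "write_dump all custom out/dump_initial id", after the MPI run.
compress_dump() {
    local f="$1"
    if [ ! -f "$f" ]; then
        echo "missing dump file: $f" >&2
        return 1
    fi
    # -f overwrites a stale .gz left over from a previous run;
    # gzip replaces "$f" with "$f.gz", then gzip -t verifies the new archive.
    gzip -f "$f" && gzip -t "$f.gz"
}
```

In the SLURM script, this would mean switching the input file to the uncompressed `write_dump` line, defining the function in the script, and calling `compress_dump out/dump_initial` on its own line right after `srun lmp_mpi -i in.packing > log-packing.txt`.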

I have copied both the SLURM script and the input file at the end of this message.

I am using the following LIGGGHTS version:
LIGGGHTS (Version LIGGGHTS-PUBLIC 3.8.0, compiled 2019-09-05-14:30:08 by modules, git commit ce1931e377a2945e3fa25dae0eebc1f697009a32)

Any guidance is greatly appreciated.
Thank you.
____________________________________________________________________________________________
SLURM SCRIPT:

#!/bin/bash
# change the name to something related to lambda, eta, mus, np
#SBATCH --job-name=test_liggghts
#SBATCH --output=results-%x.txt
# The number of tasks is the number of MPI processes
#SBATCH --ntasks=5
##SBATCH --tasks-per-node=20
#SBATCH --time=10
# The number of nodes: with more than one node the processes freeze (set to 2 here to reproduce the problem)
#SBATCH --nodes=2
#SBATCH --partition=debug
#SBATCH --mem=10000
# working directory
#SBATCH --chdir ./

hostname
# Clear the environment from any previously loaded modules
#module purge > /dev/null 2>&1
# load python
module load Python/3.7.2
# load LIGGGHTS
module load LIGGGHTS/3.8.0-patched-20190905

echo "Running packing with input file: in.packing "
date
srun lmp_mpi -i in.packing > log-packing.txt
date

_____________________________________________________________________________________________
LIGGGHTS SCRIPT: in.packing

#Particle packing by insertion and successive growing of particles

atom_style granular
atom_modify map array
boundary m m m
newton off
#echo both

communicate single vel yes

units si

region reg block -0.05 0.05 -0.05 0.05 0. 0.15 units box
create_box 1 reg

neighbor 0.002 bin
neigh_modify delay 0

#Material properties required for new pair styles

fix m1 all property/global youngsModulus peratomtype 5.e6
fix m2 all property/global poissonsRatio peratomtype 0.45
fix m3 all property/global coefficientRestitution peratomtypepair 1 0.3
fix m4 all property/global coefficientFriction peratomtypepair 1 0.5

#New pair style
pair_style gran model hertz tangential history #Hertzian without cohesion
pair_coeff * *

timestep 0.00001

fix xwalls1 all wall/gran model hertz tangential history primitive type 1 xplane -0.05
fix xwalls2 all wall/gran model hertz tangential history primitive type 1 xplane +0.05
fix ywalls1 all wall/gran model hertz tangential history primitive type 1 yplane -0.05
fix ywalls2 all wall/gran model hertz tangential history primitive type 1 yplane +0.05
fix zwalls1 all wall/gran model hertz tangential history primitive type 1 zplane 0.00
fix zwalls2 all wall/gran model hertz tangential history primitive type 1 zplane 0.15

#distributions for insertion
fix pts1 all particletemplate/sphere 15485863 atom_type 1 density constant 2500 radius constant 0.001
fix pts2 all particletemplate/sphere 15485867 atom_type 1 density constant 2500 radius constant 0.002
fix pdd1 all particledistribution/discrete 32452843 2 pts1 0.3 pts2 0.7

#parameters for gradually growing particle diameter
variable alphastart equal 0.50
variable alphatarget equal 0.67
variable growts equal 50000
variable growevery equal 40
variable relaxts equal 20000

#region and insertion
group nve_group region reg

#particle insertion
fix ins nve_group insert/pack seed 32452867 distributiontemplate pdd1 &
maxattempt 200 insert_every once overlapcheck yes all_in yes vel constant 0. 0. 0. &
region reg volumefraction_region ${alphastart}

#apply nve integration to all particles that are inserted as single particles
fix integr nve_group nve/sphere

#output settings, include rotational kinetic energy (c_1)
compute 1 all erotate/sphere
thermo_style custom step atoms ke c_1 vol
thermo 1000
thermo_modify lost ignore norm no

#insert the first particles
run 1
#dump dmp all custom/vtk 350 post/packing_*.vtk id type type x y z ix iy iz vx vy vz fx fy fz omegax omegay omegaz radius
unfix ins

#####################################################
## ERROR WHEN USING SEVERAL NODES !!!!!!!!!!!!!!!!!
write_dump all custom out/dump_initial.gz id # ERROR
#write_dump all custom out/dump_initial id # WORKS
#####################################################

#calculate grow rate
variable Rgrowrate equal (${alphatarget}/${alphastart})^(${growevery}/(3.*${growts}))
print "The radius grow rate is ${Rgrowrate}"

#do the diameter grow
compute rad all property/atom radius

variable dgrown atom ${Rgrowrate}*2.*c_rad
fix grow all adapt ${growevery} atom diameter v_dgrown

#run
run ${growts}

#let the packing relax
unfix grow
run ${relaxts}
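For reference, the `Rgrowrate` expression in the input script can be derived as follows, assuming the particle count and box volume stay fixed during growth (so the volume fraction scales with the cube of the radius) and that `fix adapt` rescales every diameter by the same factor g once every N_every = `growevery` steps over N_grow = `growts` steps, i.e. N_grow/N_every = 1250 times:

```latex
\alpha \propto r^{3}
\;\Rightarrow\;
\frac{r_{\mathrm{end}}}{r_{\mathrm{start}}} = \left(\frac{\alpha_{t}}{\alpha_{s}}\right)^{1/3},
\qquad
g^{N_{\mathrm{grow}}/N_{\mathrm{every}}} = \left(\frac{\alpha_{t}}{\alpha_{s}}\right)^{1/3}
\;\Rightarrow\;
g = \left(\frac{\alpha_{t}}{\alpha_{s}}\right)^{N_{\mathrm{every}}/(3 N_{\mathrm{grow}})}
= \left(\frac{0.67}{0.50}\right)^{40/(3\cdot 50000)}
= 1.34^{1/3750} \approx 1.000078
```

With these values the radii grow in total by 1.34^{1/3} ≈ 1.1025, which raises the volume fraction from 0.50 to 0.67.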