Bond atoms missing

Submitted by leaso on Mon, 06/10/2013 - 21:12

Hi all,

I am trying to add a modified bond model to the LIGGGHTS source code.

I have added 3 files which are similar to bond create, bond break and bond_quartic. I also added some more parameters to the atom class and modified the file atom_vec_sphere.cpp to include the new parameters in the atom class.

I keep getting this error when I run in parallel.

ERROR on proc 12: Bond atoms 700 with force -0.000287329i 0.000350155j 0.00232285k and 747 missing on proc 12 at step 182685 (neigh_bond.cpp:60)
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 12
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(855).............: MPI_Allreduce(sbuf=0x7fff4cbcf958, rbuf=0x7fff4cbcf950, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(712).......:
MPIR_Allreduce_intra(208)......:
MPIR_Bcast_impl(1321)..........:
MPIR_Bcast_intra(1155).........:
MPIR_Bcast_binomial(213).......: Failure during collective
MPIR_Allreduce_intra(197)......:
allreduce_intra_or_coll_fn(106):
MPIR_Allreduce_intra(357)......:
dequeue_and_set_error(596).....: Communication error with rank 0
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(855).............: MPI_Allreduce(sbuf=0x7fffc7b59cc8, rbuf=0x7fffc7b59cc0, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(712).......:
MPIR_Allreduce_intra(208)......:
MPIR_Bcast_impl(1321)..........:
MPIR_Bcast_intra(1155).........:
MPIR_Bcast_binomial(213).......: Failure during collective
MPIR_Allreduce_intra(197)......:
allreduce_intra_or_coll_fn(106):
MPIR_Allreduce_intra(357)......:
dequeue_and_set_error(596).....: Communication error with rank 0
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(958).......: MPI_Allgather(sbuf=0x7fffb631b850, scount=3, MPI_DOUBLE, rbuf=0x1e824e0, rcount=3, MPI_DOUBLE, MPI_COMM_WORLD) failed
MPIR_Allgather_impl(805)..:
MPIR_Allgather(766).......:
MPIR_Allgather_intra(181).:
dequeue_and_set_error(596): Communication error with rank 1
MPIR_Allgather_intra(181).:
dequeue_and_set_error(596): Communication error with rank 2
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(958).......: MPI_Allgather(sbuf=0x7ffff2873b60, scount=3, MPI_DOUBLE, rbuf=0x14e6610, rcount=3, MPI_DOUBLE, MPI_COMM_WORLD) failed
MPIR_Allgather_impl(805)..:
MPIR_Allgather(766).......:
MPIR_Allgather_intra(181).:
dequeue_and_set_error(596): Communication error with rank 5

I tried increasing the pairwise cutoff, but I still get the same error. I also printed the force on those two particles to check whether it blows up, and the force seems reasonable.
It runs fine in serial though.
Does anybody know the cause of this error?

Thanks for any help in advance,
Liza

richti83 | Tue, 06/11/2013 - 09:06

Hi Liza,

I have had similar difficulties with the "bondspackage" in Polyun's git repo (btw, this is a good starting point for your own bond implementation!):
https://github.com/Polyun/LIGGGHTS-PUBLIC/tree/master/bondspackage

It is based on Christoph's early alpha implementation, http://cfdem.dcs-computing.com/?q=node/525 (in fact, there are few differences between the two sources).

The problem seems to be related to multi-core use! Try single-core mode first. It also ran with 2 cores, but gave different results.

When you find out where the error comes from, please post the solution here. I gave up because I'm not a parallel computing expert.

I'm not an associate of DCS GmbH and not a core developer of LIGGGHTS®

leaso | Wed, 08/21/2013 - 21:58

Hi again,

I figured out that the crash is due to incorrect communication of a new variable I introduced in the atom class.
The atom style I am using is a hybrid of bond and gran.
I had made another file like atom_vec_sphere.cpp and atom_vec_bond.cpp that incorporated this new variable in the atom class. But when I run in parallel it does not communicate the correct value for that new variable.
Can someone tell me what I am missing? Why is it passing incorrect values?
The serial version works correctly.
Thanks for any help in advance!
Liza
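
Editor's sketch: a common cause of this symptom in LAMMPS-derived codes is a new per-atom field that is not handled symmetrically in every pack/unpack pair of the atom style (for example pack_border/unpack_border and pack_exchange/unpack_exchange, plus their *_hybrid variants when a hybrid atom style is used). Whether that is the cause here is an assumption; the thread does not confirm it. The toy example below is not LIGGGHTS code. ToyAtoms, damage, pack_atoms and unpack_atoms are hypothetical names that only illustrate the rule that the sender and the receiver must agree on the number and order of values per atom.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for per-atom storage; NOT the LIGGGHTS Atom class.
struct ToyAtoms {
    std::vector<double> x;       // a single coordinate per atom, for brevity
    std::vector<double> radius;
    std::vector<double> damage;  // hypothetical new per-atom variable
};

// Number of doubles sent per atom. Pack and unpack must both honor it.
constexpr std::size_t kDoublesPerAtom = 3;

// Pack the atoms whose indices are in `list` into a flat send buffer.
std::vector<double> pack_atoms(const ToyAtoms& a,
                               const std::vector<std::size_t>& list) {
    std::vector<double> buf;
    buf.reserve(list.size() * kDoublesPerAtom);
    for (std::size_t i : list) {
        buf.push_back(a.x[i]);
        buf.push_back(a.radius[i]);
        buf.push_back(a.damage[i]);  // if omitted, the receiver never sees the new field
    }
    return buf;
}

// Append received atoms, reading fields in exactly the order they were packed.
void unpack_atoms(ToyAtoms& a, const std::vector<double>& buf) {
    assert(buf.size() % kDoublesPerAtom == 0);
    for (std::size_t m = 0; m < buf.size(); m += kDoublesPerAtom) {
        a.x.push_back(buf[m]);
        a.radius.push_back(buf[m + 1]);
        a.damage.push_back(buf[m + 2]);
    }
}
```

If a field is packed but not unpacked (or the reverse), every value after it in the buffer is read at the wrong offset, which could explain values that look fine on one process and wrong on another.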

ckloss | Fri, 08/30/2013 - 15:32

>>Can someone tell me what I am missing? Why is it passing incorrect values?
All we can do here in the forum is give you hints, pointers, or references.
I know debugging with MPI can be quite a pain, but since it's your code, you will have to do it :-)

Cheers
Christoph