MPI issue using "atom_style hybrid" and "fix adapt" commands

Submitted by thboivin on Thu, 11/24/2022 - 17:00

Hello everyone,

I am new to this wonderful forum, which has helped me many times, and I really thank all the people who take the time to answer ^^
This time I am asking for help with a tricky issue.

I currently use LIGGGHTS to simulate the breathing behaviour of a bed of particles with changing diameters. Consequently, I use the method proposed in the packing example of Tutorials_public, which is based on the fix adapt command (https://www.cfdem.com/media/DEM/docu/fix_adapt.html).
So far, I had met no particular issue using this command.

However, I recently tried to extend my simulation using atom_style hybrid, and I ran into the following error message:

[aar093:20997] *** An error occurred in MPI_Wait
[aar093:20997] *** reported by process [908394497,4594212051357794305]
[aar093:20997] *** on communicator MPI_COMM_WORLD
[aar093:20997] *** MPI_ERR_TRUNCATE: message truncated
[aar093:20997] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[aar093:20997] *** and potentially your MPI job)

This error is quite common on forums, but I have not managed to relate the proposed solutions to my particular case.
Nevertheless, I have reduced the problem to the smallest case I could find. In my setup, the issue only appears when:
→ fix adapt is used in an atom_style hybrid simulation;
→ the simulation is run on multiple processors, with the threshold depending on the number of particles (for example, 2 processors work for a small number of particles, but the error appears when the number increases).

It seems the issue comes from a "partial" incompatibility of the fix adapt command with atom_style hybrid when running on multiple processors: beyond a certain number of particles and/or processors, the MPI run crashes.
On my side, I use LIGGGHTS-PUBLIC 3.8.0 with OpenMPI 2.0.1 picked up through $PATH. An important constraint is that I do not have administrator rights on my Unix environment.

Up to now, I have tried several things:
→ Checking whether the problem came from the associated compute property/atom command;
→ Using the LIGGGHTS_Flexible_Fibers version (https://github.com/schrummy14/LIGGGHTS_Flexible_Fibers);
→ Using other fix commands that read/change per-atom data (fix ave/atom and fix addforce) in a hybrid atom style simulation;
→ Changing the diameter of only one atom through the fix adapt command;
→ Using a more recent version of OpenMPI (version 3.1.6).

I have attached a small test case, in case anyone wants to check whether it produces the same error on their side.

So if anyone has an idea about this strange behaviour, it would help a lot; otherwise I will have to run all these simulations on a single processor.

Thank you very much in advance!
Best regards,
Theo

Daniel Queteschiner | Fri, 11/25/2022 - 12:20

The issue is that the size of the communication buffer doesn't get defined correctly in that case.
In more detail, the creation of an atom style follows these steps (sketched in code after the list):
1. create a new atom style (hybrid, granular, etc.) -> AtomVec::AtomVec()
2. parse the settings of the atom style -> AtomVec::settings(...)
3. do some initialization of the atom style -> AtomVec::init()
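Schematically, the sequence looks like this (call sites simplified; this is only an illustration, not verbatim LIGGGHTS code):

AtomVec *avec = new AtomVecHybrid(lmp);   // 1. construction of the atom style
avec->settings(narg, arg);                // 2. parse the style arguments
                                          //    -> AtomVecHybrid::settings() sums up the
                                          //       sub-styles' communication buffer sizes here
// ... later, when the run is set up ...
avec->init();                             // 3. per-run initialization
                                          //    -> AtomVecSphere::init() may enlarge its own
                                          //       size_forward here, after the sum was taken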

When combining fix adapt and atom style granular, this triggers a change in buffer size in AtomVecSphere::init() (additional communication of type, radius, mass and density).
However, when using atom style hybrid (+granular), the overall buffer size is defined in AtomVecHybrid::settings(...), and the later change of the granular sub-style's buffer size in AtomVecSphere::init() causes a mismatch, as it is not recognized by the AtomVecHybrid class.

A quick fix could be to add the following line to the end of the AtomVecHybrid::init() method:
if (atom->radvary_flag == 1) size_forward += 4;
Note that I have not tested this change thoroughly.
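In context, the patched method would then read roughly as follows (a sketch only; I am assuming the stock body does nothing more than delegate to the sub-styles, so please check it against your source tree):

void AtomVecHybrid::init()
{
  AtomVec::init();
  for (int k = 0; k < nstyles; k++) styles[k]->init();

  // quick fix: if some fix made the radii time-varying, account for the extra
  // forward communication (type, radius, mass, density) that
  // AtomVecSphere::init() adds for the granular sub-style
  if (atom->radvary_flag == 1) size_forward += 4;
}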

LAMMPS seems to solve this issue by adding a setting to the granular/sphere atom style (i.e. adjusting the buffer size already in AtomVecSphere::settings(...)) such that it can be properly recognized by the AtomVecHybrid class.
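The idea there is roughly the following (a simplified illustration, not the actual upstream code; in LAMMPS the sphere style accepts an optional 0/1 argument, e.g. "atom_style sphere 1", to declare time-varying radii):

void AtomVecSphere::settings(int narg, char **arg)
{
  // declare up-front that radii may vary, so the wider buffer is already
  // known when AtomVecHybrid::settings() sums the sub-style buffer sizes
  radvary = 0;
  if (narg == 1) radvary = atoi(arg[0]);
  size_forward = radvary ? 7 : 3;   // x only, or x + type, radius, mass, density
}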

thboivin | Wed, 11/30/2022 - 13:21

Thank you very much for your clear and fast answer, Daniel !

I understand the issue better now, and it gave me the opportunity to learn more about the LIGGGHTS code structure.
First of all, your solution works perfectly. Indeed, the AtomVecSphere::init() method raises the size_forward parameter from 3 to 7 if particle diameters are time-varying due to some fix, whereas the AtomVecHybrid::init() method does not, so your one-line fix corrects the buffer size whenever the radvary flag is set to 1.
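For anyone finding this thread later, the relevant part of AtomVecSphere::init() looks roughly like this (my simplified reading of the 3.8.0 source, not a verbatim copy):

void AtomVecSphere::init()
{
  AtomVec::init();
  size_forward = 3;                 // by default only x is forward-communicated
  if (atom->radvary_flag == 1)
    size_forward = 7;               // x plus type, radius, mass and density
}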

Thank you once again for your help and all the best to you !
Theo