Hi there,
I am trying to run a simulation with ~120 million particles on an HPC system using a slightly modified version of LIGGGHTS-INL.
When I restart the simulation after an initial setup stage, it becomes stuck in an apparent infinite loop in read_restart.cpp. After extensive troubleshooting, I suspect this may be related to unexpected behavior in how certain variables are handled during restart, and I would appreciate any insight from those more familiar with this part of the code.
This issue does not occur when I reduce the number of particles. I have also let the simulation run for an extended time with substantial computational resources (thousands of cores and terabytes of memory) to check whether progress was simply very slow, but it does not advance.
The problematic section appears to be:
m = 0;
while (m < n) {
  x = &buf[m+1];
  if (triclinic) {
    domain->x2lamda(x,lamda);
    coord = lamda;
  } else coord = x;
  if (coord[0] >= sublo[0] && coord[0] < subhi[0] &&
      coord[1] >= sublo[1] && coord[1] < subhi[1] &&
      coord[2] >= sublo[2] && coord[2] < subhi[2]) {
    m += avec->unpack_restart(&buf[m]);
  }
  else m += static_cast<int> (buf[m]);
}
When the loop stalls, triclinic is 0 and the coordinate bounds check is false, so execution always takes the else branch on line 243. At that point, buf[m] is a denormal value, far too small to be a valid record length:
(gdb) p buf[0]
$3 = 1.3410234044177822e-312
As a result, static_cast<int>(buf[m]) truncates to 0, so m never increments. The loop then repeats indefinitely with no change in state.
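To make the failure mode concrete, here is a minimal standalone sketch of the check I am considering adding while debugging. It is not part of LIGGGHTS or LAMMPS: checked_record_length and bits_of are hypothetical helper names of my own, and the sketch assumes the first value of each record is its length in doubles, as the loop above implies.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not in LIGGGHTS/LAMMPS): return the record length
// stored in buf[m], or throw if that value could not advance the loop.
// Assumes buf[m] holds the length of the record in doubles.
int checked_record_length(const double *buf, int m, int n)
{
  assert(m >= 0 && m < n);
  const double raw = buf[m];
  // Validate the double before converting: a denormal truncates to 0,
  // and converting NaN or an out-of-range value to int is undefined.
  // The negated form also rejects NaN, since every comparison with NaN is false.
  if (!(raw >= 1.0 && raw <= static_cast<double>(n - m)))
    throw std::runtime_error("bad restart record length at offset " +
                             std::to_string(m));
  return static_cast<int>(raw);
}

// Hypothetical helper: raw bit pattern of a double, to check whether a
// suspicious value looks like integer data being read as a double.
std::uint64_t bits_of(double v)
{
  std::uint64_t u;
  std::memcpy(&u, &v, sizeof u);  // well-defined way to reinterpret bits
  return u;
}
```

In the loop above, this would stand in for static_cast<int> (buf[m]) in the else branch, so a corrupted length aborts and reports the offset instead of spinning. Printing bits_of(buf[m]) in hex might also indicate whether the buffer is being read at the wrong offset, though that is only a guess on my part.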
Has anyone encountered similar behavior when restarting very large systems? Is there a known issue that could cause buf[m] to be corrupted or underflow in this way, or something I may be missing in how restart records are interpreted at this scale?