cfdemSolverPiso simulations stop without any error message

Submitted by Min Zhang on Mon, 02/17/2020 - 17:34

I am trying to run a case with a lot of particles: there will be approximately 6,658,051 particles in the simulation domain.

An image of my simulation geometry is attached.

I have the following error message:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Time = 0.000717

Courant Number mean: 0.00584625 max: 0.340819

Coupling...
Starting up LIGGGHTS
Executing command: 'run 100 '
run 100
Setting up run ...
Memory usage per processor = 93.5025 Mbytes
Step Atoms KinEng RotE ts[1] ts[2]
71601 2767016 88028623 21778.316 0 0
CFD Coupling established at step 71700
INFO: Particle insertion ins: inserted 111627 particle templates (mass 0.522742) at step 71701
- a total of 2902302 particle templates (mass 13.591296) inserted so far.
71701 2878426 91242628 21663.997 0 0
Loop time of 13.6019 on 32 procs for 100 steps with 2878426 atoms

Pair time (%) = 2.91061 (21.3985)
Neigh time (%) = 0.10412 (0.76548)
Comm time (%) = 0.167363 (1.23043)
Outpt time (%) = 0.321599 (2.36437)
Other time (%) = 10.0982 (74.2412)

Nlocal: 89950.8 ave 232887 max 0 min
Histogram: 16 0 3 1 0 0 0 0 4 8
Nghost: 6673.22 ave 20177 max 0 min
Histogram: 16 0 0 4 0 1 5 2 0 4
Neighs: 442032 ave 1.38182e+06 max 0 min
Histogram: 16 0 4 0 0 0 2 8 0 2

Total # of neighbors = 14145021
Ave neighs/atom = 4.91415
Neighbor list builds = 1
Dangerous builds = 0
LIGGGHTS finished
srun: error: nid01141: tasks 16-18,20-31: Killed
srun: Terminating job step 2700774.0
slurmstepd: error: *** STEP 2700774.0 ON nid01013 CANCELLED AT 2020-02-15T22:36:24 ***
srun: error: nid01013: tasks 0-9,11-15: Killed
srun: error: nid01141: task 19: Killed
srun: error: nid01013: task 10: Killed
srun: Force Terminated job step 2700774.0
TACC: MPI job exited with code: 137

TACC: Shutdown complete. Exiting.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

From the above output, I suspect that the case failed because of a memory issue, since I have so many particles.
If it is a memory issue, what options do I have to run this case? Coarse-grained CFD-DEM? Scaling down my simulation geometry?
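
The job above exited with code 137. By the usual shell convention, an exit status above 128 means the process was ended by a signal whose number is the status minus 128, so 137 corresponds to signal 9 (SIGKILL). That is consistent with the job being killed for exceeding memory, but it does not prove it. A minimal Python sketch of the decoding, assuming a Linux machine (describe_exit_code is a hypothetical helper, not part of CFDEM, LIGGGHTS, or Slurm):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if code > 128:
        # Statuses above 128 conventionally encode "128 + signal number".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


# 137 appears in the first log; 143 appears in the later ones.
for code in (137, 143):
    print(code, "->", describe_exit_code(code))
```

SIGKILL cannot be caught by the program, which would explain why neither OpenFOAM nor LIGGGHTS prints an error of its own before the job ends.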

Thanks and best regards,
Min

Attachment: simulation_geometry.png (235.25 KB)

mschramm | Tue, 02/25/2020 - 16:40

According to the LIGGGHTS output, you are only using about 100 MB of memory per processor, which is not very much.

I would recommend scaling down the number of particles so you can run the case on your own machine (or running Slurm in some debug mode; I do not use Slurm) to get a better error message.
I think your issue is on the CFD side of things: your output states that the LIGGGHTS run completed, so the error is either in the coupling or in the CFD.
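
The reasoning above (look at the last thing the log printed to decide which side failed) can be sketched in Python. This is only an illustration: last_stage is a hypothetical helper, and the marker strings are simply the ones that appear in the logs posted in this thread.

```python
def last_stage(log_text: str) -> str:
    """Return the most recent stage that the log shows as started."""
    stage = "unknown"
    for line in log_text.splitlines():
        if "Starting up LIGGGHTS" in line:
            stage = "DEM run in progress"
        elif "LIGGGHTS finished" in line:
            stage = "DEM finished; coupling or CFD in progress"
        elif "Solving for" in line:
            stage = "CFD solve in progress"
    return stage


# Tail of the first posted log, shortened.
sample = """Coupling...
Starting up LIGGGHTS
Executing command: 'run 100 '
Dangerous builds = 0
LIGGGHTS finished
srun: error: nid01141: tasks 16-18,20-31: Killed
"""
print(last_stage(sample))
```

For the sample, the last marker is "LIGGGHTS finished", which points at the coupling or the CFD rather than at the DEM run itself.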

mschramm | Tue, 02/25/2020 - 20:40

If the memory issue is in LIGGGHTS, you should get an error stating that LIGGGHTS could not allocate x bytes.
If you start to get these, you may want to manually set your bin size.

Without more information about the CFD side of things, I can't really go into more detail.

Min Zhang | Fri, 04/24/2020 - 18:37

I have two runs of the identical case, but they produced different error messages and crashed at different time points.

In addition, the first run stopped at the beginning of the LIGGGHTS run, but the second run stopped after the LIGGGHTS run finished.

Please note that these two runs are of a different case from the first one I posted. What they have in common is that there are a lot of particles in the simulation.

The following are the detailed error messages for the two identical runs.
The first run: /scratch/04454/minzhang/CFDEM/ClusterBasedPTE_20200228/Case10:
Time = 0.026

Courant Number mean: 0.00334543 max: 0.0449133

Coupling...
Starting up LIGGGHTS
Executing command: 'run 100 '
srun: error: nid00401: tasks 0-25: Killed
srun: Terminating job step 2742389.0
srun: error: nid00404: tasks 78,80-102: Terminated
srun: error: nid00405: tasks 103-104,106-116,118-127: Terminated
srun: error: nid00403: tasks 52,55-77: Terminated
srun: error: nid00402: tasks 26-34,36-51: Terminated
srun: error: nid00404: task 79: Terminated
srun: error: nid00405: tasks 105,117: Terminated
srun: error: nid00402: task 35: Terminated
srun: error: nid00403: tasks 53-54: Terminated
srun: Force Terminated job step 2742389.0
TACC: MPI job exited with code: 143

TACC: Shutdown complete. Exiting.

The second run: /scratch/04454/minzhang/CFDEM/ClusterBasedPTE_20200228/Case10_again
Time = 0.02678

Courant Number mean: 0.00334562 max: 0.0443878

Coupling...
Starting up LIGGGHTS
Executing command: 'run 100 '
run 100
Setting up run ...
Memory usage per processor = 27.8005 Mbytes
Step Atoms KinEng RotE ts[1] ts[2]
2677901 1460455 4.6400994e+08 391187.51 0.0010050491 0.00073688914
INFO: Particle insertion ins: inserted 3925 particle templates (mass 1.176354) at step 2677996
- a total of 1632546 particle templates (mass 489.286997) inserted so far.
CFD Coupling established at step 2678000
2678001 1464360 4.655741e+08 390592.78 0.0010050491 0.00073688914
Loop time of 2.35193 on 128 procs for 100 steps with 1464360 atoms

Pair time (%) = 0.528199 (22.4581)
Neigh time (%) = 0.0171635 (0.72976)
Comm time (%) = 0.104113 (4.42669)
Outpt time (%) = 0.0142617 (0.60638)
Other time (%) = 1.6882 (71.7791)

Nlocal: 11440.3 ave 21176 max 0 min
Histogram: 10 7 19 14 2 3 5 59 7 2
Nghost: 3240.59 ave 5988 max 0 min
Histogram: 9 6 8 24 3 12 30 17 8 11
Neighs: 74830.6 ave 226078 max 0 min
Histogram: 15 24 7 45 21 4 3 6 1 2

Total # of neighbors = 9578314
Ave neighs/atom = 6.54096
Neighbor list builds = 1
Dangerous builds = 0
LIGGGHTS finished
srun: error: nid00908: tasks 52-63,65-77: Killed
srun: Terminating job step 2789702.0
slurmstepd: error: *** STEP 2789702.0 ON nid00849 CANCELLED AT 2020-04-09T12:42:53 ***
srun: error: nid00908: task 64: Killed
srun: error: nid00849: task 24: Terminated
srun: error: nid01155: tasks 103-109,111-127: Terminated
srun: error: nid00872: tasks 26-31,33-51: Terminated
srun: error: nid00849: tasks 0,2-22,25: Terminated
srun: error: nid00849: task 1: Terminated
srun: error: nid00989: tasks 78-102: Terminated
srun: error: nid01155: task 110: Terminated
srun: error: nid00849: task 23: Terminated
srun: error: nid00872: task 32: Terminated
srun: Force Terminated job step 2789702.0
TACC: MPI job exited with code: 143

TACC: Shutdown complete. Exiting.
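
In both runs above, srun reports the tasks on a single node as "Killed" (nid00401 in the first run, nid00908 in the second) and the tasks on all other nodes as "Terminated". "Killed" corresponds to SIGKILL and "Terminated" to SIGTERM, so the remaining tasks were most likely just torn down after the first node failed. A minimal Python sketch for tallying this per node; tasks_by_node and count_tasks are hypothetical helpers, and the regular expression only assumes the srun line format shown in these logs:

```python
import re
from collections import defaultdict

SRUN_ERROR = re.compile(
    r"srun: error: (\S+): tasks? ([\d,\-]+): (Killed|Terminated)"
)


def count_tasks(spec: str) -> int:
    """Count task IDs in a list such as '0-9,11-15'."""
    total = 0
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            total += int(hi) - int(lo) + 1
        else:
            total += 1
    return total


def tasks_by_node(log_text: str) -> dict:
    """Map each node name to its counts of Killed and Terminated tasks."""
    counts = defaultdict(lambda: {"Killed": 0, "Terminated": 0})
    for node, spec, status in SRUN_ERROR.findall(log_text):
        counts[node][status] += count_tasks(spec)
    return dict(counts)


# Three lines taken from the first run above.
sample = """srun: error: nid00401: tasks 0-25: Killed
srun: error: nid00404: tasks 78,80-102: Terminated
srun: error: nid00404: task 79: Terminated
"""
for node, counts in tasks_by_node(sample).items():
    print(node, counts)
```

The node whose tasks were killed is the one worth checking for memory use, for example by asking the cluster support staff for that node's job accounting records.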

Min Zhang | Fri, 04/24/2020 - 18:20

I have been running a lot of CFDEMcoupling cases successfully.
The only difference between this failed case and other successful cases is that I have more particles (the particle concentration is higher in this failed case).
So if I scale down my simulation to have fewer particles, I expect that I won't have error messages.

Min Zhang | Fri, 05/01/2020 - 01:48

I checked a lot of my simulation cases and found one that fails in the same way (it just stops without a detailed error message). This case is relatively small, so I can run it on my desktop.
However, I still could not get a detailed error message. Here is what I got from running it on my desktop:

timeStepFraction() = 1
update Ksl.internalField()
TotalForceImp: (-2.41381e+06 -1.12201e+07 -525035)
DILUPBiCG: Solving for Ux, Initial residual = 0.000479415, Final residual = 3.27988e-07, No Iterations 1
DILUPBiCG: Solving for Uy, Initial residual = 0.00167556, Final residual = 4.788e-10, No Iterations 2
DILUPBiCG: Solving for Uz, Initial residual = 0.000458776, Final residual = 1.69818e-06, No Iterations 1
GAMG: Solving for p, Initial residual = 0.0338918, Final residual = 0.00303889, No Iterations 4
GAMG: Solving for p, Initial residual = 0.00353127, Final residual = 0.000142403, No Iterations 8
GAMG: Solving for p, Initial residual = 0.000280811, Final residual = 1.9855e-05, No Iterations 8
time step continuity errors : sum local = 6.0155e-10, global = -7.30549e-11, cumulative = -3.39956e-09
GAMG: Solving for p, Initial residual = 0.000331648, Final residual = 1.79172e-05, No Iterations 2
GAMG: Solving for p, Initial residual = 2.14069e-05, Final residual = 1.61388e-06, No Iterations 13
GAMG: Solving for p, Initial residual = 2.65873e-06, Final residual = 7.69618e-07, No Iterations 4
time step continuity errors : sum local = 2.33062e-11, global = 5.69171e-12, cumulative = -3.39387e-09
GAMG: Solving for p, Initial residual = 1.2601e-05, Final residual = 1.1836e-06, No Iterations 1
GAMG: Solving for p, Initial residual = 1.19209e-06, Final residual = 7.52028e-07, No Iterations 1
GAMG: Solving for p, Initial residual = 7.64527e-07, Final residual = 7.64527e-07, No Iterations 0
time step continuity errors : sum local = 2.3152e-11, global = 8.54349e-13, cumulative = -3.39301e-09
GAMG: Solving for p, Initial residual = 1.08442e-06, Final residual = 4.44126e-07, No Iterations 1
GAMG: Solving for p, Initial residual = 4.51034e-07, Final residual = 4.51034e-07, No Iterations 0
DICPCG: Solving for p, Initial residual = 4.51034e-07, Final residual = 4.51034e-07, No Iterations 0
time step continuity errors : sum local = 1.36585e-11, global = 1.30824e-13, cumulative = -3.39288e-09
smoothSolver: Solving for omega, Initial residual = 0.000101484, Final residual = 1.04937e-06, No Iterations 1
bounding omega, min: -2.47704 max: 310223 average: 2591.95
DILUPBiCG: Solving for k, Initial residual = 4.06896e-05, Final residual = 4.65268e-06, No Iterations 1
ExecutionTime = 60099.1 s ClockTime = 60101 s

Time = 0.015652

Courant Number mean: 0.00435321 max: 0.348735

Coupling...
Starting up LIGGGHTS
Executing command: 'run 100 '
run 100
Setting up run at Thu Apr 30 11:30:33 2020

Memory usage per processor = 100.38 Mbytes
Step Atoms KinEng RotE ts[1] ts[2]
1565101 539639 8.3200005e+08 357899.19 0.00037987286 0.00043571677
CFD Coupling established at step 1565200
1565201 539639 8.3228921e+08 357995.8 0.00037987286 0.00043571677
Loop time of 4.21332 on 8 procs for 100 steps with 539639 atoms, finish time Thu Apr 30 11:30:38 2020

Pair time (%) = 1.86279 (44.2119)
Neigh time (%) = 0 (0)
Comm time (%) = 0.184399 (4.37656)
Outpt time (%) = 0.0111896 (0.265576)
Other time (%) = 2.15495 (51.146)

Nlocal: 67454.9 ave 70013 max 66029 min
Histogram: 2 2 0 0 2 0 0 1 0 1
Nghost: 8015 ave 8272 max 7657 min
Histogram: 1 0 0 1 2 1 0 0 0 3
Neighs: 377207 ave 495310 max 311715 min
Histogram: 4 0 0 1 1 0 0 0 1 1

Total # of neighbors = 3017653
Ave neighs/atom = 5.59198
Neighbor list builds = 0
Dangerous builds = 0
LIGGGHTS finished

timeStepFraction() = 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
It just stopped without any error message... In addition, on my desktop, the case crashed/stopped at an earlier time point.

I tested a lot of cases. I now think the problem might NOT be a memory issue and might instead be related to the turbulence modeling, for two reasons:
(1) I checked the log.liggghts file and didn't see anything wrong.
(2) In the terminal output, I saw some negative values of k and omega (for example, the "bounding omega, min: -2.47704" line above).
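
To check how often the turbulence fields go negative, the solver log can be scanned for "bounding" lines like the one in the output above. A minimal Python sketch, assuming every such line follows the same "bounding <field>, min: ... max: ... average: ..." layout (negative_bounds is a hypothetical helper):

```python
import re

BOUNDING = re.compile(
    r"bounding (\w+), min: (\S+) max: (\S+) average: (\S+)"
)


def negative_bounds(log_text: str) -> list:
    """Return (field, min, max, average) for each bounding line with min < 0."""
    hits = []
    for field, lo, hi, avg in BOUNDING.findall(log_text):
        if float(lo) < 0:
            hits.append((field, float(lo), float(hi), float(avg)))
    return hits


# Two lines copied from the desktop run above.
sample = """smoothSolver: Solving for omega, Initial residual = 0.000101484, Final residual = 1.04937e-06, No Iterations 1
bounding omega, min: -2.47704 max: 310223 average: 2591.95
"""
for hit in negative_bounds(sample):
    print(hit)
```

Counting these hits per time step shows whether the bounding is an occasional event or grows steadily before the run stops.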

It might also be related to the information transfer from LIGGGHTS to OpenFOAM.

Your comments would be appreciated!

Min Zhang | Sun, 05/03/2020 - 22:13

I have been testing a lot of cases, and it turns out that many of them have the same issue: they just stop without a detailed error message.
Just now, I got a relatively detailed error message:

Total # of neighbors = 8574777
Ave neighs/atom = 5.33162
Neighbor list builds = 1
Dangerous builds = 0
LIGGGHTS finished

timeStepFraction() = 1
update Ksl.internalField()
TotalForceImp: (2864.22 -2.39358e+06 -295816)
DILUPBiCG: Solving for Ux, Initial residual = 0.00440567, Final residual = 1.12107e-06, No Iterations 1
DILUPBiCG: Solving for Uy, Initial residual = 0.00147887, Final residual = 2.9036e-07, No Iterations 1
DILUPBiCG: Solving for Uz, Initial residual = 0.00451077, Final residual = 1.14307e-06, No Iterations 1
suppressing ddt(voidfraction)
GAMG: Solving for p, Initial residual = 0.0595724, Final residual = 0.00506171, No Iterations 1
suppressing ddt(voidfraction)
srun: error: nid00807: tasks 26-34,36-37,39-41,43-51: Killed
srun: Terminating job step 2828076.0
slurmstepd: error: *** STEP 2828076.0 ON nid00806 CANCELLED AT 2020-05-03T08:17:34 ***
srun: error: nid00807: tasks 35,38,42: Killed
srun: error: nid00846: tasks 78-94,96-100,102: Terminated
srun: error: nid00867: tasks 103-107,109,111-127: Terminated
srun: error: nid00806: tasks 0-14,16-25: Terminated
srun: error: nid00845: tasks 52-77: Terminated
srun: error: nid00867: tasks 108,110: Terminated
srun: error: nid00806: task 15: Terminated
srun: error: nid00846: tasks 95,101: Terminated
srun: Force Terminated job step 2828076.0
TACC: MPI job exited with code: 143

TACC: Shutdown complete. Exiting.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This output tells me that the error happened on the CFD side.
Your valuable comments would be greatly appreciated!!

Min Zhang | Sat, 09/05/2020 - 01:11

I ran three simulations with proppant concentrations of 0.5 ppg, 1 ppg, and 3 ppg; all other settings are the same.

I noticed that the 0.5 ppg case stopped at around 0.06 s, the 1 ppg case stopped at around 0.03 s, and the 3 ppg case stopped at around 0.01 s. So what do you think?

I wonder whether it is related to the maximum number of particles that I can inject.
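
A quick arithmetic check of these three runs: if the particle injection rate is proportional to the concentration (an assumption, with the same fluid flow rate in each run), then concentration times stop time is a rough proxy for the amount of particles injected before the stop. A minimal Python sketch using only the approximate numbers quoted above:

```python
# (concentration in ppg, approximate stop time in s), as reported above.
runs = [(0.5, 0.06), (1.0, 0.03), (3.0, 0.01)]

# Proxy for the cumulative injected particle amount, ASSUMING the injection
# rate scales linearly with concentration and the flow rate is identical.
products = [conc * t_stop for conc, t_stop in runs]

for (conc, t_stop), product in zip(runs, products):
    print(f"{conc} ppg: conc * t_stop = {product:.3f} ppg*s")
```

All three products come out at about 0.03 ppg*s, which is consistent with the runs stopping once a similar total amount of particles is in the domain. The stop times are only approximate, so this is a hint rather than a proof.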

Thanks and best regards,
Min