Running a parallel simulation on a cluster

Submitted by gilgamesch on Mon, 09/09/2019 - 07:59

Hello everyone,

Before I ask my question and describe my problem, I want to summarize what I have done so far and what I observed, so that the question makes more sense.

I installed LIGGGHTS-Public and Paraview according to the tutorials and guides in this forum and the documentation (using make auto). Everything works fine, I can run simulations and they look good so far.

So I wanted to go a step further and run the simulations in parallel (using -np numberofparalleljobs), and this worked as well. The more I increased the number of parallel jobs, the faster the simulation became, up to n-1 cores (I use an Intel Core i5 @ 2.80GHz × 6). After that the calculation slows down again.
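
For reference, the single-machine runs were launched roughly like this (here with 5 of the 6 cores; in.sand is the input script used below):

mpirun -np 5 ./lmp_auto < in.sand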

The next step was to use the cluster provided by my lab. I installed Open MPI on another PC in the network and ran some mpirun commands to check that I could remotely control the other PC, and everything was fine. I created a shared folder on our public server where I placed the executable lmp_auto and all the necessary files, and set up a HOSTFILE with the IP addresses of my PC (the master PC) and the other PC. I then ran the command

mpirun --hostfile HOSTFILE ./lmp_auto < in.sand

HOSTFILE is:
IPmaster slots=5
IPslave slots=11

It worked but was not very fast. The simulation ran on 10 procs without me defining -np XX. I tried increasing the proc count with -np to 16 (the other PC is an Intel Core i7 @ 3.40GHz × 12) and it slowed down quite a bit.

mpirun -np 16 --hostfile HOSTFILE ./lmp_auto < in.sand

I cannot make out what the problem is here, and this is where I am stuck right now.

So my questions are:

Why are my parallel jobs on two PCs slower with more procs even though there are still cores that are not used?
How do I fix the problem?
Is there anything I am doing wrong here?

mschramm | Thu, 09/12/2019 - 15:45

Hello,

To answer your questions:
1) Because the two computers must now transfer information from one to the other. As you add more cores, the number of times information must be passed increases.
You also have two completely different CPUs (and probably memory modules), so when one computer has finished a routine it must wait for the other computer to catch up.
2) Hire someone who has built clusters before.
3) Answers are going to be based on how you have the cluster set up.
How are the communications between computers being handled?
Are they connected using infiniband or an equivalent, by CAT5/6/7, or by wifi (please no...)?
Are all of the computers the same (or at least have the same processor and memory)?
I personally use a cluster with 10 computers for LIGGGHTS. The main difference is that I do not run a single simulation across multiple computers. I mostly use my cluster for DOE runs, using the -partition MxN command to run M jobs utilizing N cores each (see the sketch below).
This is the way I would recommend you utilize your cluster. If you must use a cluster for a single simulation, I would suggest looking into HPC cluster design.
(A very big rabbit hole... Don't think I will ever need to know about creating a 6d torus cluster but I did read about it...)
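
To make the -partition option concrete, a launch looks roughly like this; the 5x4 split, the core count, and the in.doe script name are just example values, and note that multi-partition runs generally take the input script via -in rather than via stdin:

mpirun -np 20 ./lmp_auto -partition 5x4 -in in.doe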

gilgamesch | Fri, 09/13/2019 - 07:47

First of all, thank you!

I want to address some of what you said before I talk about what I did in the meantime.

1) I am aware that communication takes processing power as well, but judging from my experiments below, this does not seem to be the problem as far as I can tell. The different CPUs should also not make the outcome worse, since the slowest single calculation process will not get slower with more cores. If, say, the slowest core takes 10 s and the others all take 2 s, the ones that finish early will wait for the slow 10-second core (as you said). That should make all of the runs roughly equally fast, since the bottleneck is always the 10 seconds (please correct me if I am wrong), but what I observed is that the simulation slows down significantly (see below).
2) Hiring someone is not an option for me, since it is a project that we have to deal with ourselves.
3) The WiFi option made me laugh out loud :) I am running a CAT5e connection between the PCs. All of them have more than enough RAM (32 GB is the lowest).

So as far as I understand, you run multiple simulations on a few cores each instead of running one simulation on many cores, right? That, again, will probably not work for what I am planning with the simulation.
I need to run one simulation as fast as possible.

HPC Clusters indeed look like a rabbit hole I would love to avoid if I can :)

While I was waiting for a reply I did a few tests with different numbers of cores and a few different PCs.
The same basic simulation was run in all of these configurations, and these are the results:

2.40GHz x 6 (master)
One PC on 1/6 cores 441s
One PC on 2/6 cores 406s
One PC on 3/6 cores 402s
One PC on 4/6 cores 401s
One PC on 5/6 cores 365s
One PC on 6/6 cores 404s

2.40GHz x 6 (master) and 4.20GHz x 8 (slave) (CoresOnMaster:CoresOnSlave)
Two PCs on 06/14 cores 410s (5:1)
Two PCs on 07/14 cores 297s (6:1)
Two PCs on 07/14 cores 298s (5:2)
Two PCs on 07/14 cores 407s (1:6)
Two PCs on 08/14 cores 433s (6:2)
Two PCs on 08/14 cores 430s (5:3)
Two PCs on 09/14 cores 420s (5:4)
Two PCs on 10/14 cores 456s (5:5)
Two PCs on 11/14 cores 407s (5:6)
Two PCs on 11/14 cores 403s (6:5)
Two PCs on 12/14 cores 660s (5:7)
Two PCs on 13/14 cores 603s (5:8)
Two PCs on 14/14 cores 498s (6:8)

2.40GHz x 6 (master) and 3.40GHz x 12 (slave)
Two PCs on 06/18 cores 424s (5:1)
Two PCs on 07/18 cores 303s (5:2)
Two PCs on 08/18 cores 436s (5:3)
Two PCs on 09/18 cores 434s (5:4)
Two PCs on 10/18 cores 444s (5:5)
Two PCs on 11/18 cores 422s (5:6)
Two PCs on 12/18 cores 513s (5:7)
Two PCs on 12/18 cores 605s (4:8)
Two PCs on 13/18 cores 416s (6:7)
Two PCs on 13/18 cores 546s (5:8)
Two PCs on 14/18 cores 483s (5:9)
Two PCs on 15/18 cores 586s (5:10)
Two PCs on 16/18 cores 701s (5:11)
Two PCs on 17/18 cores 492s (5:12)
Two PCs on 18/18 cores 685s (6:12)

I hope it makes sense. Apparently the times fluctuate significantly with the number of cores used. As you can see, in both configurations 7 cores yield the best result (I don't know why or how). If I use all available cores it is slow, which makes sense to me, since background processes and the MPI communication have to wait for a core to become free again. But the rest looks random to me, and I cannot make any progress without parallel processing.

Now I read in the documentation that LIGGGHTS creates a 3D grid out of the procs assigned to the simulation, which then divide the simulation volume between them. Maybe there is a connection between this division of the simulation domain and the values above? It looks to me like the runtimes at 7 cores and around 14 cores are significantly lower than the values around them. As a matter of fact, I did one more test as I was writing this, with the following results:

Two PCs on 5/18 cores 364s (3:2)
Two PCs on 7/18 cores 360s (1:6)

As 5 and 7 cores were the fastest overall, I thought that maybe this would prove or disprove my theory about the processor grid.
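
If the decomposition really is the cause, I suppose the grid could also be pinned explicitly with the processors command in the input script; a minimal sketch (the 2x7x1 split is just an example and has to multiply out to the number of MPI processes):

processors 2 7 1    # example: force a 2x7x1 decomposition for 14 cores; must appear before the simulation box is defined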

Hope this helps you help me.

mschramm | Fri, 09/13/2019 - 20:07

Hello,
How many particles are you simulating? How many per processor?
I did a few quick simulations with 80,000 spheres:
1 core == 1x speed up
4 cores == 1.79x speed up (1 computer)
8 cores == 2.44x speed up (2 computers)
12 cores == 2.27x speed up (3 computers)
16 cores == 2.04x speed up (4 computers)
20 cores == 2.35x speed up (5 computers)
I am using CAT6 cables with a gigabit network switch (longest cable between switch and computer is 5 ft)
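
(For these runs that works out to 80,000 spheres / 8 cores = 10,000 spheres per core, simply dividing the total by the number of MPI processes.)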

gilgamesch | Mon, 09/16/2019 - 05:50

At the moment it's a small setup with 12,000 spheres. How do I check the number of atoms per core? (As I said, I'm fairly new to LIGGGHTS.)

It's strange. In my results the highest speedup is 1.5x in any of my one- or two-PC setups. I tried more PCs after I saw your reply, but I can't get it to work properly. It gets stuck after:

Created orthogonal box = (-0.03 -0.03 -0.006) to (0.198 0.298 0.2)

Nothing else happens. My HOSTFILE looks like this:

masterIP slots=1 max_slots=6
slave1IP slots=1 max_slots=12
slave2IP slots=1 max_slots=8

My command has the tree spawn disabled because I thought that might be causing the problem, but that is not the case, since it still gets stuck at the exact same place every single time:

mpirun --mca plm_no_tree_spawn 1 --hostfile HOSTFILE lmp_auto < in.test

The connection between all of them works, since the hostname command works. Does that mean there might be something wrong with my simulation or my executable?

$ mpirun --mca no_tree_spawn 1 --hostfile HOSTFILE hostname
master
slave1
slave2

Could this also be the reason for the much slower cluster results?