Running a parallel simulation on a cluster

Submitted by gilgamesch on Mon, 09/09/2019 - 07:59

Hello everyone,

Before I ask my question and describe my problem, I want to summarize what I have done so far and what I experienced, so that it makes more sense.

I installed LIGGGHTS-Public and ParaView according to the tutorials and guides in this forum and the documentation (using make auto). Everything works fine; I can run simulations and they look good so far.

So I wanted to go a step further and run the simulations in parallel (using -np numberofparalleljobs), which also kind of worked. The more I increased the number of parallel processes, the faster the simulation became, up to n-1 cores (I use an Intel Core i5 @ 2.80GHz × 6). Beyond that the calculation slows down again.

The next step was to use the cluster provided by my lab. I installed Open MPI on another PC in the network and ran a few mpirun commands to check that I could control the other PC remotely; everything was fine. I created a shared folder on our public server, placed the executable lmp_auto and all the necessary files in it, and set up a HOSTFILE with the IP addresses of my PC (the master) and the other PC. I then ran the command

mpirun --hostfile HOSTFILE ./lmp_auto < in.sand

HOSTFILE is:
IPmaster slots=5
IPslave slots=11

It worked but was not very fast. The simulation ran on 10 procs without me defining -np XX. I tried increasing the proc count to 16 with -np (the other PC is an Intel Core i7 @ 3.40GHz × 12) and it slowed down quite a bit.

mpirun -np 16 --hostfile HOSTFILE ./lmp_auto < in.sand

I cannot work out what the problem is, and this is where I am stuck right now.

So my questions are:

Why are my parallel jobs on two PCs slower with more procs, even though there are still unused cores?
How do I fix the problem?
Is there anything I am doing wrong here?

mschramm | Thu, 09/12/2019 - 15:45

Hello,

To answer your questions:
1) Because the two computers must now transfer information from one to the other. As you add more cores, the number of times information must be passed increases.
You also have two completely different CPUs (and probably memory modules), so when one computer has finished a routine, it must wait for the other to catch up.
2) Hire someone who has built clusters before.
3) Answers are going to depend on how you have the cluster set up.
How are the communications between the computers handled?
Are they connected using InfiniBand or an equivalent, by CAT5/6/7, or by wifi (please no...)?
Are all of the computers the same (or do they at least have the same processor and memory)?
I personally use a cluster with 10 computers for LIGGGHTS. The main difference is that I do not run a single simulation across multiple computers. I mostly use my cluster for DOE runs, using the -partition MxN option to run M jobs utilizing N cores each.
This is how I would recommend you utilize your cluster. If you must use a cluster for a single simulation, I would suggest looking into HPC cluster design.
(A very big rabbit hole... Don't think I will ever need to know about creating a 6d torus cluster but I did read about it...)
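
For reference, a partition-mode invocation might look roughly like the sketch below. The 8x2 split and the script name in.doe_case are placeholders on my part, and as far as I know the input has to be passed with -in rather than via < redirection when partitions are used:

mpirun -np 16 ./lmp_auto -partition 8x2 -in in.doe_case

That would run 8 independent jobs with 2 cores each, so almost all communication stays inside a single machine.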

gilgamesch | Fri, 09/13/2019 - 07:47

First of all, Thank You!

I want to address some of what you said before I talk about what I did in the meantime.

1) I am aware that communication takes processing power as well, but judging from my experiments below, this does not seem to be the problem as far as I can tell. The different CPUs should also not make the outcome worse, since the slowest single calculation process will not get slower with more cores. If, say, the slowest core takes 10 s and the others all take 2 s, the ones that finish early will wait for the slow 10 s core (as you said). That should make all of the runs equally fast, since the bottleneck is always 10 s (please correct me if I am wrong), but what I observed is that the simulation slows down significantly (see below, and the rough cost sketch after this list).
2) Hiring someone is not an option for me, since this is a project we have to deal with ourselves.
3) The wifi option made me laugh out loud :) I am running a CAT5e connection between the PCs. All of them have more than enough RAM (32 GB is the lowest).
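
To make the trade-off in point 1) explicit (a rough back-of-the-envelope model on my part, nothing measured):

time per step ≈ max over all ranks of (compute time of that rank + communication time of that rank)

More ranks shrink the compute term for each rank, but each rank also gets more neighbouring subdomains to exchange boundary particles with, and every exchange that crosses the Ethernet link pays its latency each step. So the open question is whether the communication term grows faster than the compute term shrinks.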

So, as far as I understand, you run multiple simulations on a few cores each instead of running one simulation on many cores, right? That again will probably not work for what I am planning with the simulation.
I need to run one simulation as fast as possible.

HPC Clusters indeed look like a rabbit hole I would love to avoid if I can :)

While I was waiting for a reply I ran a few tests with different numbers of cores and a few different PCs.
The same basic simulation was run in all of these configurations; these are the results:

2.40GHz x 6 (master)
One PC on 1/6 cores 441s
One PC on 2/6 cores 406s
One PC on 3/6 cores 402s
One PC on 4/6 cores 401s
One PC on 5/6 cores 365s
One PC on 6/6 cores 404s

2.40GHz x 6 (master) and 4.20GHz x 8 (slave) (CoresOnMaster:CoresOnSlave)
Two PCs on 06/14 cores 410s (5:1)
Two PCs on 07/14 cores 297s (6:1)
Two PCs on 07/14 cores 298s (5:2)
Two PCs on 07/14 cores 407s (1:6)
Two PCs on 08/14 cores 433s (6:2)
Two PCs on 08/14 cores 430s (5:3)
Two PCs on 09/14 cores 420s (5:4)
Two PCs on 10/14 cores 456s (5:5)
Two PCs on 11/14 cores 407s (5:6)
Two PCs on 11/14 cores 403s (6:5)
Two PCs on 12/14 cores 660s (5:7)
Two PCs on 13/14 cores 603s (5:8)
Two PCs on 14/14 cores 498s (6:8)

2.40GHz x 6 (master) and 3.40GHz x 12 (slave)
Two PCs on 06/18 cores 424s (5:1)
Two PCs on 07/18 cores 303s (5:2)
Two PCs on 08/18 cores 436s (5:3)
Two PCs on 09/18 cores 434s (5:4)
Two PCs on 10/18 cores 444s (5:5)
Two PCs on 11/18 cores 422s (5:6)
Two PCs on 12/18 cores 513s (5:7)
Two PCs on 12/18 cores 605s (4:8)
Two PCs on 13/18 cores 416s (6:7)
Two PCs on 13/18 cores 546s (5:8)
Two PCs on 14/18 cores 483s (5:9)
Two PCs on 15/18 cores 586s (5:10)
Two PCs on 16/18 cores 701s (5:11)
Two PCs on 17/18 cores 492s (5:12)
Two PCs on 18/18 cores 685s (6:12)

I hope this makes sense. Apparently the times fluctuate significantly with the number of cores used. As you can see, in both configurations 7 cores yield the best result (I don't know why or how). If I use all available cores it is slow, which makes sense to me, since background processes and communication have to wait for a core to become free again. But the rest looks random to me, and I cannot make any progress without parallel processing.

Now, I read in the documentation that LIGGGHTS creates a 3D matrix out of the procs assigned to the simulation, which then divide the simulation volume between them. Maybe there is a connection between this division of the simulation area and the values above? It looks to me like 7 cores and around 14 cores are significantly lower than the rest of the values around them. As a matter of fact, I ran one more test as I was writing this:

Two PCs on 5/18 cores 364s (3:2)
Two PCs on 7/18 cores 360s (1:6)

As 5 and 7 cores were the fastest overall, I thought that maybe this would prove or disprove my theory about the matrix.
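
If my reading of the documentation is right (I am only guessing here), the possible grids would be, for example: 5 = 5x1x1 and 7 = 7x1x1 (both prime, so only a slab decomposition along one axis is possible), while 16 could be split as 2x2x4, which cuts the volume in all three directions and creates many more subdomain boundaries to communicate across. That would at least be consistent with the prime core counts being fast and 16 being slow.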

Hope this helps you help me.

mschramm | Fri, 09/13/2019 - 20:07

Hello,
How many particles are you simulating? How many per processor?
I did a few quick simulations with 80,000 spheres
1 core == 1x speed up
4 cores == 1.79x speed up (1 computer)
8 cores == 2.44x speed up (2 computers)
12 cores == 2.27x speed up (3 computers)
16 cores == 2.04x speed up (4 computers)
20 cores == 2.35x speed up (5 computers)
I am using CAT6 cables with a gigabit network switch (longest cable between switch and computer is 5 ft)

gilgamesch | Mon, 09/16/2019 - 05:50

At the moment it's a small setup with 12,000 spheres. How do I check the number of atoms per core? (As I said, I'm fairly new to LIGGGHTS.)

It's strange. In my results the highest speed-up is 1.5x in any of my one- and two-PC setups. I tried more PCs after I saw your reply, but I can't get it to work properly. It gets stuck after:

Created orthogonal box = (-0.03 -0.03 -0.006) to (0.198 0.298 0.2)

Nothing else happens. My HOSTFILE looks like this:

masterIP slots=1 max_slots=6
slave1IP slots=1 max_slots=12
slave2IP slots=1 max_slots=8

My command has the tree spawn disabled because I thought that might be causing the problem, but that is not the case, since it still gets stuck at the exact same place every single time:

mpirun --mca plm_no_tree_spawn 1 --hostfile HOSTFILE lmp_auto < in.test

The connection between them all works, since the hostname command works. Does that mean there might be something wrong with my simulation or my executable?

$ mpirun --mca no_tree_spawn 1 --hostfile HOSTFILE hostname
master
slave1
slave2

Could this also be the reason for the much slower cluster results?


richti83 | Tue, 09/17/2019 - 22:22

A rule of thumb is 10K particles/core.
Be aware of the latency of your Ethernet connection. I tried with 10 Gbit Ethernet adapters and ended up with slower performance because of network latency. There is a good reason that HPC clusters use InfiniBand or other standards of up to 50 GbE for interprocessor connections.

The best choice for "small" systems of up to 6,000,000 particles is a dual-processor Xeon workstation with 2x10 cores (or more) and a high clock frequency (3.5 GHz or more). The reason is that communication becomes more expensive than computation when the interprocessor connections are slow. It has been shown that, on dual-processor machines, small systems (10K particles/core) perform best on one single core (MPI option --bind-to socket).
More power helps not in terms of core count but in terms of clock rate. We own two small Intel 4x4 GHz computers which, for small systems, are much faster than the big Xeon workstations with 2x12 cores at 3.0 GHz.
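
Applied to the numbers in this thread (a rough estimate only): 12,000 particles / 10,000 particles per core ≈ 1.2, so by the rule of thumb there is enough work for roughly one or two cores before communication starts to dominate. With 9 or more ranks, each core owns only on the order of 1,000 particles, and a large share of every step goes into exchanging boundary information rather than computing.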

I'm not an associate of DCS GmbH and not a core developer of LIGGGHTS®

gilgamesch | Wed, 09/18/2019 - 03:10

Thank you for that explanation. While Ethernet might be a bottleneck as you said, I still don't understand why the computation time fluctuates up and down so much. For example, using 16 cores over 2 PCs results in approx. 600 s, while 17 cores give me 400 s. The 17-core run is slower than the single-PC 5-core run, and that is explained by what you replied, but the jump between these two cluster examples doesn't make sense to me.

I'm sorry if the answer is trivial; I might be missing the forest for the trees.

gilgamesch | Wed, 09/18/2019 - 04:17

As I said in my last message, I cannot seem to get more than 2 PCs to work in the cluster. Every single PC combination results in a terminal stuck at:

Created orthogonal box = (-0.03 -0.03 -0.006) to (0.198 0.298 0.2)

Hostname execution works, though (as stated above).


richti83 | Wed, 09/18/2019 - 09:19

This depends on how LIGGGHTS spreads the available cores over the dimensions of the simulation region. Have a look at the line printed after the box is created:

X by Y by Z MPI processor grid

For 16 cores it is possible to distribute them over an even grid, let's say 2x2x4.
For 17 cores only 1x1x17 is possible, as far as I can see.
Now, when your particles sit only in the bottom layer, communication rules the cost: in this example the bottom layer does all the computation. In the 2x2x4 case every bottom-layer subdomain has to communicate with the 3 neighbours in its own xy-plane and with the z+1 layer ==> 4 MPI walls per subdomain.
In the 1x1x17 case (assuming all particles are in the z0 and z1 layers) only communication between z0 and z1 happens.
You can force LIGGGHTS to use a specific distribution with the processors command; try

processors 1 1 *

and run the three tests again.
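
A minimal sketch of where this would sit in the input script (the region bounds are copied from the box in your log; the region name and the number of atom types are placeholders, and the processors command has to come before the box is created):

processors    1 1 *
region        domain block -0.03 0.198 -0.03 0.298 -0.006 0.2 units box
create_box    1 domain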

Also, using hyperthreading slows everything down because the physical cores are "oversubscribed".

When you have compiled LIGGGHTS with VTK support, you can use the dump style atom/vtk, which outputs the processor assigned to each particle:

dump dmp all atom/vtk 1000 post/dump*.vtu #no arguments

I'm not an associate of DCS GmbH and not a core developer of LIGGGHTS®

gilgamesch | Mon, 09/23/2019 - 07:02

I will look into the processor grid as soon as I get back to the cluster again. Thank You.

Just to make it clear: my simulation area is divided into fixed-size cubes depending on the number of processors used? So there is no way for me to assign an equal amount of work to each processor? Is that the reason why even one PC does not show a significant speed increase with more cores?
2.40GHz x 6 (master)
One PC on 1/6 cores 441s
One PC on 2/6 cores 406s
One PC on 3/6 cores 402s
One PC on 4/6 cores 401s
One PC on 5/6 cores 365s
One PC on 6/6 cores 404s

And the other question I asked was about the cluster not working with more than 2 PCs. The command I used is $ mpirun --mca plm_no_tree_spawn 1 --hostfile HOSTFILE lmp_auto < in.test. Every single PC combination results in a terminal stuck at:

Created orthogonal box = (-0.03 -0.03 -0.006) to (0.198 0.298 0.2)

Hostname execution works (as stated above)

mschramm | Mon, 09/23/2019 - 07:23

Hello, are you using the -np flag to set the number of processors?

I use the following command
mpirun --mca plm_rsh_no_tree_spawn 1 --mca orte_base_help_aggregate 0 --mca btl_tcp_if_include em1 --verbose --show-progress --display-map --display-topo -np 36 -hostfile /PATHTOHOSTFILE/.HOSTFILE liggghts -in in.liggghts

This shows me some additional information that I like to have.
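
One of those flags may also matter for your hang: --mca btl_tcp_if_include em1 restricts Open MPI's TCP transport to a single network interface (em1 is just the interface name on my machines). If your nodes have several interfaces, something along these lines is worth a try (the subnet is a placeholder; check yours with ip addr):

mpirun --mca plm_rsh_no_tree_spawn 1 --mca btl_tcp_if_include 192.168.1.0/24 -np 9 -hostfile HOSTFILE lmp_auto -in in.test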

gilgamesch | Tue, 09/24/2019 - 09:18

That command is pretty useful, thank you. However, there is no real error or hint as to why it won't work. As I said, the basic communication between the PCs works. This is what I get:

$ mpirun --mca plm_rsh_no_tree_spawn 1 --mca orte_base_help_aggregate 0 --mca --verbose --show-progress --display-map --display-topo -np 9 -hostfile HOSTFILE lmp_auto -in in.test
Data for JOB [60085,1] offset 0

======================== JOB MAP ========================

Data for node: master Num slots: 5 Max slots: 6 Num procs: 5
Process OMPI jobid: [60085,1] App: 0 Process rank: 0
Process OMPI jobid: [60085,1] App: 0 Process rank: 1
Process OMPI jobid: [60085,1] App: 0 Process rank: 2
Process OMPI jobid: [60085,1] App: 0 Process rank: 3
Process OMPI jobid: [60085,1] App: 0 Process rank: 4

Data for node: slave1 Num slots: 2 Max slots: 12 Num procs: 2
Process OMPI jobid: [60085,1] App: 0 Process rank: 5
Process OMPI jobid: [60085,1] App: 0 Process rank: 6

Data for node: slave2 Num slots: 2 Max slots: 8 Num procs: 2
Process OMPI jobid: [60085,1] App: 0 Process rank: 7
Process OMPI jobid: [60085,1] App: 0 Process rank: 8

=============================================================
LIGGGHTS (Version LIGGGHTS-PUBLIC 3.8.0, compiled 2019-08-06-15:44:08 by master, git commit ce1931e377a2945e3fa25dae0eebc1f697009a32)
Created orthogonal box = (-0.03 -0.03 -0.006) to (0.198 0.298 0.2)

And here it freezes. The 9 CPUs are occupied on their respective PCs though.

gilgamesch | Wed, 10/02/2019 - 05:48

Is anyone able to help me with this? I am struggling with my simulation itself as well, so I might need to open a new topic for that, but this problem keeps me from using more nodes.