Amazon EC2 and StarCluster

Submitted by philmartin on Fri, 11/29/2013 - 02:42

I am experimenting with Amazon EC2 to see if I can get some significant wins in LIGGGHTS running time. I've tried a variety of datasets, and I seem to be encountering some sort of slowdown beyond 24 CPUs.

The bigger test dataset I'm running is a conveyor simulation. At the end of the 10-second run, there are around 200k particles. The path the particles travel is quite narrow but oblique, meaning the simulation is highly sensitive to processor partitioning, because there are large empty spaces in the simulation box.
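For reference, the sort of manual decomposition I mean is set with the processors command; the 1 x 4 x 6 grid below (for a 24-rank run) is only an illustration of the idea, not the exact layout I settled on:

# illustrative only: put most of the cuts along the long axis of the chute
# so fewer sub-domains end up covering empty space
processors 1 4 6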

The best partitioning I've found produces times similar to this:

4 CPUs: 15 hours
8 CPUs: 12 hours
24 CPUs: 9 hours

However, if I go to 32 CPUs on a single machine, computation time goes back to around the 8-CPU speed. If I set up a cluster using StarCluster with 96 CPUs across 4 nodes, performance degrades to below that of 4 cores. I've also tried extremely simple simulations (< 5000 particles), and their running time is very slow compared to running on just a couple of CPUs.

I am guessing there may be some issue with how I've set up the EC2 cluster that is slowing down node communication, or perhaps the message sizes LIGGGHTS uses are too large for EC2 to handle nicely.

Does anyone have any experience with StarCluster and EC2? Or any ideas on how to diagnose where the bottleneck is? Or should I just give up and rent some CPU time on a real HPC cluster somewhere instead?


richti83 | Fri, 11/29/2013 - 07:59

Communication costs play a significant role.
I don't know the hardware of the nodes you used, in particular the interconnect between them, but I can tell you the following:
We own two Dell T7600 workstations; each workstation has two Intel eight-core CPUs, i.e. 16 _real_ cores per node (hyperthreading decreases MPI performance).
Because of hardware limits, the two workstations are currently connected only by simple 1 Gbit Ethernet.
When I run a simulation on only one workstation, the speedup is nearly linear in the number of cores used.
When I use both workstations, the simulation time increases by a factor of about 10 compared to the same setup on only one "node".

Benchmarking with fewer particles gives poor results for me (for example, the moving-mesh example in LIGGGHTS has its smallest simulation time on a single core). This is because communication starts to dominate computation.

"However, if I go to 32 CPUs on a single machine"
Did you oversubscribe the hardware, or do you really have 32 cores in one hardware node?

Good tuning factors are the bin size and the neighbour cut-off distance (this should be 2*r_max). Maybe your script is not optimal; I think 200k particles is not a big computing problem. What does your conveyor mesh look like? A rule of thumb is that the triangles should not be smaller than the particle radius (remember: every triangle is about as expensive as 11 particles, so many triangles mean high computing costs).
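As a rough sketch (the skin value is only a placeholder, you have to adapt it to your particle sizes), these are the lines in the input script I am talking about:

neighbor        0.002 bin            # skin distance; neighbour cutoff is the particle radii plus this skin
neigh_modify    delay 0 every 1 check yes
communicate     single vel yes       # ghost-particle exchange across processor boundaries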

Could you please tell me the simulation time of this script for 1, 4, 8, 16, 32, and 96 cores?
https://raw.github.com/richti83/LIGGGHTS-CUDA/master/cuda_benchmark/in.b...

good luck,
Christian

I'm not an associate of DCS GmbH and not a core developer of LIGGGHTS®

philmartin | Fri, 11/29/2013 - 10:57

Thanks so much Christian, I really appreciate your efforts.

It is reassuring that other people experience a significant communication cost. It certainly would explain the behaviour I'm seeing.

I will try out that benchmark early next week; that will be a very interesting exercise. I should also write an article somewhere explaining how I set up StarCluster, since I haven't seen anyone else do that.

And the bin size and neighbour cutoff look very interesting - thanks for that. Is r_max the maximum particle radius? The conveyor and transfer chute mesh is complicated, but the triangles are significantly larger than 95% of the particles (screenshot), though I really should double-check that. Even if that were the case, I would find it odd that the message passing is being driven by triangle checks.

Did I oversubscribe the hardware? I'm not sure. The OS reports 32 CPUs, but I'm unsure whether they are physical CPUs, cores, or virtual CPUs. I thought perhaps I had, so I scaled it back to 24 or 16 CPUs on that machine just to be sure.

It sure has been an interesting exercise. Thanks again for the input Christian.

Kind regards,
Phil Martin

philmartin | Fri, 11/29/2013 - 11:37

Pointing me at the cutoffs looks like it will be very beneficial. To double-check that I am looking in the right place: are the communicate and neighbor commands what you are referring to?

These look very relevant because I have mostly small particles (10-15 mm radius), but about 0.5% of the particles are 50 mm in radius. The bin neighbour style would be adding as many as 15 times the number of force-interaction checks because of the large variation in particle size. The single communicate style would be doing something similar at the processor boundaries, though by a smaller factor since the sub-box sizes are so big. Does all of that sound like a reasonable conclusion?
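To make that concrete, this is the sort of change I'm imagining trying - purely a guess on my part from reading the docs, and I haven't yet checked whether the multi styles are supported in my LIGGGHTS build:

neighbor        0.002 multi          # per-size-class binning instead of one cutoff driven by the largest particles
communicate     multi vel yes        # per-type ghost cutoffs at the processor boundaries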

Interesting. I guess a simpler way to test if that is the problem is to just make the particle size the same everywhere.

I can't wait to experiment next week!