[aspect-devel] ASPECT scaling on a Cray XC30

Rene Gassmoeller rengas at gfz-potsdam.de
Fri Jan 17 13:17:28 PST 2014


Thanks very much for the hints, Timo. I have rerun part of the models so
far (adjusted to the 24 cores per node) and just wanted to let you know
so that you do not spend more time thinking about my questions for now.
It seems like a lot will be resolved by excluding I/O from the runtimes.
A typical beginner's mistake in scaling analysis, I guess.
I will post an update in the next few days, when the new models have
finished.

Cheers,
Rene

On 01/17/2014 04:47 PM, Timo Heister wrote:
> Rene,
> 
> some quick comments (I will look through your data in more detail later):
> - you normally start with one full node, not one core (I assume #CPUS
> means number of cores). This is because memory bandwidth is shared
> between the cores of a node.
> - How many cores does one node have on this machine? Do you fill them
> with one thread per core? Be aware that hyperthreading or a lower
> number of floating point units (on AMD) could mean that using cores/2
> is faster than using all cores. You should try that out (not that it
> matters for the scaling, though)
> - weak scaling is normally number of cores vs. runtime with a fixed
> number of DoFs/core. Optimal scaling shows up as horizontal lines (or
> you plot it as cores vs. efficiency)
> - assembly might not scale perfectly due to the communication with the
> ghost neighbors. I have not seen that being much of an issue in the
> past, but it depends on the speed of the interconnect (and if you are
> bandwidth-limited it will get slower with the number of DoFs/core).
> You can try to add another timing section for the matrix.compress()
> call (see the rough sketch below the list).
> - what are the timings in the first tables? Total runtime? I would
> only look at setup, assembly, and solve (and not random stuff like
> I/O)
> - for later: you need to run identical setups more than once to
> average out run-to-run differences in the timings
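> 
> Something along these lines should work for the extra timing section;
> an untested sketch, assuming the usual computing_timer member of
> ASPECT's Simulator class and the Trilinos matrix/vector types (adjust
> the names to whatever the assembly routine actually uses):
> 
>   #include <deal.II/base/timer.h>
> 
>   // ...inside the function that runs the assembly loop, after the
>   // loop over the locally owned cells has finished:
>   {
>     // give the communication cost of compress() its own entry in the
>     // wallclock table
>     dealii::TimerOutput::Scope timing_section (computing_timer,
>                                                "Assembly: compress Stokes matrix");
> 
>     // exchange the off-processor entries added during assembly
>     system_matrix.compress (dealii::VectorOperation::add);
>     system_rhs.compress (dealii::VectorOperation::add);
>   }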
> 
> 
> 
> On Fri, Jan 17, 2014 at 10:17 AM, Rene Gassmoeller
> <rengas at gfz-potsdam.de> wrote:
>> Ok, here are my setup (with help from Juliane) and the results for the
>> ASPECT 3D scaling up to 310 million DOFs and 4096 CPUs. The models were
>> run with ASPECT revision 2265 (release mode) and deal.II revision 31464.
>> The setup is an octant of a 3D sphere with a spherical temperature and
>> compositional anomaly close to the inner boundary, a mildly
>> temperature-dependent viscosity, and incompressible material properties
>> (simple material model).
>> I attach the results for the total wallclock time as given in ASPECT's
>> output. Sorry for the mess in the calc sheet, but this is the raw data,
>> so you can figure out whether I made a mistake in my calculations or
>> whether my questions below are real features of ASPECT's scaling
>> behaviour.
>> In general I am quite pleased with the results, especially that even
>> the global resolution of 7 (310 million DOFs) scales very well from
>> 1024 to 4096 CPUs. However, two questions arose, maybe just because of
>> my lack of experience with scaling tests.
>>
>> 1. In the plot "Speedup to maximal #DOFs/CPU" it looks to me like the
>> speedup of ASPECT is optimal for a medium number of DOFs per CPU. The
>> speedup falls off to suboptimal for both very high (>~300000) and very
>> low #DOFs/CPU (<~30000). This is somewhat surprising to me, since I
>> always thought that minimizing the number of CPUs / maximizing the DOFs
>> per CPU (and hence minimizing the communication overhead) would give
>> the best computational efficiency. This is especially apparent in this
>> plot because I used the runtime with the smallest number of CPUs as the
>> reference value for optimal speedup, and hence the curve lies above the
>> "optimal" line for medium #DOFs per CPU. It is a consistent effect for
>> large models (resolutions 5, 6, 7) and across many different CPU
>> numbers, so it is not a one-time effect. Does anyone have an idea what
>> the reason for this effect might be? I thought of swap space being used
>> for high #DOFs/CPU, but that should crush performance much more than
>> this slightly suboptimal scaling. Cray promotes the XC30 for a
>> particularly fast node interconnect, but that is certainly not faster
>> than accessing system memory on a single CPU, right? On the other hand,
>> more CPUs and a fast interconnect could mean that a larger part of the
>> system matrices fits into the CPU caches, which could speed up the
>> computation significantly. I am no expert in this, so feel free to make
>> comments.
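>>
>> (For concreteness, here is roughly what I mean by speedup relative to
>> the run with the smallest number of CPUs, as a small standalone sketch;
>> the core counts and timings below are made-up placeholders, not the
>> measured values from the calc sheet:)
>>
>>   #include <cstdio>
>>   #include <utility>
>>   #include <vector>
>>
>>   int main()
>>   {
>>     // (cores, wallclock seconds) for one fixed problem size;
>>     // placeholder numbers only
>>     const std::vector<std::pair<unsigned int, double>> runs =
>>       { {128, 1000.0}, {256, 520.0}, {512, 280.0}, {1024, 160.0} };
>>
>>     // reference run = smallest number of cores (largest #DOFs/CPU)
>>     const double ref_cores = runs.front().first;
>>     const double ref_time  = runs.front().second;
>>
>>     for (const auto &run : runs)
>>       {
>>         const double speedup    = ref_time / run.second;
>>         const double optimal    = run.first / ref_cores;
>>         const double efficiency = speedup / optimal;
>>         std::printf ("%5u cores: speedup %6.2f (optimal %6.2f), "
>>                      "efficiency %5.1f%%\n",
>>                      run.first, speedup, optimal, 100. * efficiency);
>>       }
>>   }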
>>
>> 2. Another feature I got interested in is the scaling for a fixed
>> #DOFs/CPU (~150000), i.e. scaling the model size with the #CPUs. The
>> numbers are shown in the lowest part of the calc sheet. Over an
>> increase of model size and #CPUs by a factor of 512, the time needed
>> for a single timestep increased by a factor of 3. While most parts of
>> the wallclock time stayed below this increase (nearly constant to
>> slightly increasing), most of the additional time was spent in the
>> Stokes assembly and the Stokes solve (and a bit in the Stokes
>> preconditioner). Especially "Solve Stokes" stood out by doubling its
>> computing time for each doubling of the resolution (increasing the
>> #DOFs by around a factor of 8). The jump from resolution 4 to 5 is even
>> higher, but that may be because resolution 4 fits on a single node, so
>> no node interconnect is needed for that solution. Is this increase in
>> time expected behaviour for the Stokes solver, due to the increased
>> communication overhead / size of the problem, or might there be
>> something wrong (with my setup, compilation, or in the code)? Or is
>> this already very good scaling in this dimension? As I said, I have
>> limited experience with this.
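>>
>> (In other words, what I am asking about is the weak-scaling efficiency;
>> a minimal sketch, using only the factor-of-3 number from above rather
>> than the actual timings:)
>>
>>   #include <cstdio>
>>
>>   int main()
>>   {
>>     // weak scaling: problem size grows with the core count, so ideally
>>     // the time per timestep stays constant (100% efficiency)
>>     const double time_small = 1.0;  // normalized time of the small run
>>     const double time_large = 3.0;  // ~3x longer on 512x the cores/DOFs
>>
>>     std::printf ("weak-scaling efficiency over a 512x increase: %.0f%%\n",
>>                  100. * time_small / time_large);  // prints 33%
>>   }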
>>
>> Feel free to comment, or to try the setup yourself and compare the
>> runtimes on your systems (I just found that the Xeon E5-2695v2 on the
>> Cray beats my laptop's Core i5-430M by around 66% on 2 CPUs despite
>> comparable clock frequencies). I would be especially interested if
>> somebody finds much faster runtimes on any system, which could point to
>> problems in our configuration here.
>>
>> Cheers,
>> Rene
>>
>>
>>
>> _______________________________________________
>> Aspect-devel mailing list
>> Aspect-devel at geodynamics.org
>> http://geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
> 
> 
> 


More information about the Aspect-devel mailing list