[aspect-devel] ASPECT scaling on a Cray XC30

Timo Heister heister at clemson.edu
Fri Jan 17 07:47:12 PST 2014


Rene,

some quick comments (I will look through your data in more detail later):
- you normally start with one full node, not 1 core (I assume #CPUS
means the number of cores). This is because the memory bandwidth is
shared between the cores of a node.
- How many cores does one node have on this machine? Do you fill them
with one thread per core? Be aware that hyperthreading (or, on AMD,
the smaller number of floating point units) could mean that using
cores/2 is faster than using all cores. You should try that out (not
that it matters for scaling, though).
- weak scaling is normally plotted as number of cores vs. runtime with
a fixed number of DoFs/core. Optimal scaling shows up as horizontal
lines (or you plot cores vs. efficiency, i.e. T(reference)/T(N), which
should stay at 1; see the small example after this list).
- assembly might not scale perfectly due to communication with the
ghost neighbors. I have not seen much of that in the past, but it
depends on the speed of the interconnect (and if you are
bandwidth-limited it will get slower with the number of DoFs/core).
You can try to add another timing section for the matrix.compress()
call (see the sketch after this list).
- what are the timings in the first tables? Total runtime? I would
only look at setup, assembly, and solve (and not random things like
I/O).
- for later: you need to run identical setups more than once to
average out run-to-run differences in the timings.
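
To make the weak-scaling comparison easier to read, you can convert
the runtimes into efficiencies directly. A minimal standalone sketch
(not ASPECT code; the numbers below are placeholders, not your data):

  // Weak-scaling efficiency: with DoFs/core fixed, the ideal runtime is
  // constant, so efficiency(N) = T(N_ref) / T(N).
  #include <cstdio>

  int main()
  {
    struct Run { int cores; double seconds; };
    // (cores, wall time in seconds) -- placeholder values
    const Run runs[] = {{8, 100.0}, {64, 105.0}, {512, 120.0}, {4096, 300.0}};

    const double t_ref = runs[0].seconds;
    for (const Run &r : runs)
      std::printf("%5d cores: efficiency %.2f\n", r.cores, t_ref / r.seconds);
    return 0;
  }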
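
Regarding the extra timing section: in ASPECT/deal.II that is just a
scoped timer around the compress() calls at the end of the Stokes
assembly. A rough sketch (assuming the usual computing_timer member
and Trilinos matrices; the section label is made up):

  {
    // everything until the end of this scope is added to the timing
    // table that is printed at the end of the run
    TimerOutput::Scope timer_section (computing_timer, "Stokes matrix compress");

    system_matrix.compress (VectorOperation::add);
    system_rhs.compress (VectorOperation::add);
  }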



On Fri, Jan 17, 2014 at 10:17 AM, Rene Gassmoeller
<rengas at gfz-potsdam.de> wrote:
> OK, here are my setup (with help from Juliane) and the results for the
> ASPECT 3D scaling up to 310 million DoFs and 4096 CPUs. The models were
> run with ASPECT revision 2265 (release mode) and deal.II revision 31464.
> The setup is an octant of a 3D sphere with a spherical temperature and
> compositional anomaly close to the inner boundary, a mildly
> temperature-dependent viscosity, and incompressible material properties
> (simple material model).
> I attach the results for the total wallclock time as given in ASPECT's
> output. Sorry for the mess in the Calc sheet, but this is the raw data,
> so you can figure out whether I made a mistake in my calculations or
> whether my questions below are real features of ASPECT's scaling
> behaviour. In general I am quite pleased with the results, especially
> that even global resolution 7 (310 million DoFs) scales very well from
> 1024 to 4096 CPUs. However, two questions arose, maybe just because of
> my lack of experience with scaling tests.
>
> 1. In the plot "Speedup to maximal #DOFs/CPU" it looks to me like the
> speedup of ASPECT is optimal for a medium number of DoFs per CPU. The
> speedup falls off to suboptimal for both very high (>~300000) and very
> low numbers of DoFs/CPU (<~30000). This is somewhat surprising to me,
> since I always thought that minimizing the number of CPUs / maximizing
> the DoFs per CPU (and hence minimizing the communication overhead)
> would give the best computational efficiency. It is especially apparent
> in this plot because I used the runtime with the smallest number of
> CPUs as the reference value for optimal speedup, and hence the curve
> lies above the "optimal" line for a medium number of DoFs per CPU. It
> is a consistent effect for the large models (resolutions 5, 6, 7) and
> across many different CPU counts, so it is not a one-time effect. Does
> anyone have an idea what causes it? I thought of swap space being used
> at high DoFs/CPU, but that should crush performance much more than this
> slightly suboptimal scaling. Cray promotes the XC30 for its
> particularly fast node interconnect, but it is certainly not faster
> than accessing system memory on a single CPU, right? On the other hand,
> more CPUs and a fast interconnect could mean that a larger part of the
> system matrices fits into the CPU caches, which could speed up the
> computation significantly. I am no expert in this, so feel free to make
> comments.
>
> 2. Another feature I got interested in was the scaling for a fixed
> number of DoFs/CPU (~150000), i.e. scaling the model size with the
> number of CPUs. The numbers are shown in the lowest part of the Calc
> sheet. Over an increase of the model size and the number of CPUs by a
> factor of 512, the time needed for a single timestep increased by a
> factor of 3. While most parts of the wallclock time stayed below this
> increase (nearly constant to slightly increasing), most of the
> additional time was spent in the Stokes assembly and the Stokes solve
> (and a bit in the Stokes preconditioner). Especially "Solve Stokes"
> stood out by doubling its computing time for each doubling of the
> resolution (which increases the number of DoFs by roughly a factor of
> 8). The jump from resolution 4 to 5 is even larger, but that may be
> because resolution 4 fits on a single node, so no node interconnect is
> needed for that solution. Is this increase in time expected behaviour
> for the Stokes solver, due to the increased communication overhead /
> problem size, or might there be something wrong (with my setup, the
> compilation, or the code)? Or is this already very good scaling in this
> dimension? As I said, I have limited experience with this.
>
> Feel free to comment, or to try the setup yourself and compare the
> runtimes on your systems (I just found that the Xeon E5-2695v2 on the
> Cray beats the Core i5-430M in my laptop by around 66% for 2 CPUs,
> despite comparable clock frequencies). I would be especially interested
> if somebody finds much faster runtimes on any system, which could point
> to problems in our configuration here.
>
> Cheers,
> Rene



-- 
Timo Heister
http://www.math.clemson.edu/~heister/

