[aspect-devel] ASPECT scaling on a Cray XC30
Thomas Geenen
geenen at gmail.com
Tue Feb 4 02:53:35 PST 2014
Hey Rene,
how did you pin your MPI processes on the node? For the implicit
solver this makes a big difference: these routines are memory-bandwidth
limited, and you usually saturate the bandwidth already when running on
half the number of cores per socket. So for a dual-socket node you want
to use half the cores per socket, but use both sockets. Depending on the
configuration of the system, the default strategy is usually to fill the
first socket completely before using cores from the second socket.
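On the XC30 you can control the placement with aprun, for example
roughly like this (a sketch, assuming two 12-core Xeon sockets per node
and a 64-node job; adjust -n, -N and -S to your machine and use your own
input file name):

    # spread 12 ranks per node over both sockets, 6 per socket,
    # and bind each rank to a core
    #   -n  total MPI ranks        -N  ranks per node
    #   -S  ranks per NUMA node    -cc cpu  bind ranks to cores
    aprun -n 768 -N 12 -S 6 -cc cpu ./aspect scaling.prm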
cheers
Thomas
PS: I assume you run on Intel Xeon CPUs? For AMD it's a little more
complex, since there you also have to make sure you pin your processes
to a single socket in the right way.
On Tue, Feb 4, 2014 at 11:22 AM, Rene Gassmoeller <rengas at gfz-potsdam.de> wrote:
> Ok, so finally here is the promised update on the scaling results. I
> attached the new calc spreadsheet, in which I subtracted the
> initialization and I/O timing and averaged the runtimes over 3
> subsequent runs (not much change there, except for the very small
> models). In fact, removing the initialization and I/O times from the
> runtime resolved the issue with the apparent slowdown for a high number
> of DoFs/core; apparently the I/O speed is somewhat limiting, but this
> will not be a problem for the final models.
>
> Using half the available cores per node did not change much in terms of
> efficiency (at least on a single node).
>
> >> - weak scaling is normally number of cores vs. runtime with a fixed
> >> number of DoFs/core. Optimal scaling corresponds to horizontal lines
> >> (or you plot it as cores vs. efficiency).
>
> Done in the new spreadsheet. The lines are not horizontal, but see the
> next point for that.
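>
> For reference, the usual weak-scaling efficiency, with the smallest run
> of a series as the baseline, is
>
>   E(N) = T(N_ref) / T(N)
>
> so perfect weak scaling corresponds to E = 1, i.e. a horizontal line.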
>
> >> - assembly might not scale perfectly due to communication to the ghost
> >> neighbors. I have not seen that much in the past, but it depends on
> >> the speed of the interconnect (and if you are bandwidth-limited it
> >> will get slower with number of DoFs/core). You can try to add another
> >> timing section for the matrix.compress() call.
>
> Thanks for the hint. In the new setup (maximum number of cores per
> node) the assembly scales quite perfectly for 50 kDoFs/core at the
> moment. At 400 kDoFs/node, at least the Stokes assembly increases its
> computing time by 10% for an increase in #DoFs by a factor of 8
> (increasing the global resolution by 1). This does support your point.
> However, the increase is so small that it does not bother me at the
> moment.
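>
> For completeness, the extra timing section around the matrix.compress()
> call suggested above would look roughly like this in deal.II (a sketch,
> assuming the Simulator's TimerOutput member is called computing_timer
> and the Stokes matrix is system_matrix, as in ASPECT's assembly code;
> it needs <deal.II/base/timer.h>):
>
>   {
>     // time only the exchange of off-processor matrix entries
>     TimerOutput::Scope timer_section (computing_timer,
>                                       "Stokes matrix compress");
>     system_matrix.compress (VectorOperation::add);
>   }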
>
> The increase in computing time for the Stokes solve, however, is by far
> the stronger effect. On the other hand, I think this might be specific
> to this model setup. Since we wanted to use a setup that is not too far
> from the models we will run productively, we decided to include a
> spherical temperature/composition anomaly in the setup, which of course
> is resolved differently at the different resolutions. This may be the
> reason for the increased number of Stokes iterations at increased
> resolution. For an actual assessment of the code scaling (instead of
> the scaling of our model setup) one would need to repeat the models
> with a resolution-independent setup (i.e. a harmonic perturbation of
> low degree), I guess. I finished a model setup for this, and if I find
> some free time I will update the results for resolution independence.
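>
> Such a resolution-independent setup could be as simple as switching the
> initial condition to the harmonic perturbation plugin, roughly like
> this (a sketch with placeholder values; the exact parameter names may
> differ between ASPECT versions, so check the manual):
>
>   subsection Initial conditions
>     set Model name = harmonic perturbation
>
>     subsection Harmonic perturbation
>       # low-degree lateral perturbation, placeholder values
>       set Lateral wave number one = 2
>       set Lateral wave number two = 2
>       set Vertical wave number    = 1
>       set Magnitude               = 50
>       set Reference temperature   = 1600
>     end
>   end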
>
> Cheers and thanks for the help,
> Rene