[aspect-devel] ASPECT scaling on a Cray XC30

Rene Gassmoeller rengas at gfz-potsdam.de
Fri Jan 17 07:17:41 PST 2014


OK, here are my setup (with help from Juliane) and results for the ASPECT
3D scaling tests up to 310 million DOFs and 4096 CPUs. The models were run
with ASPECT revision 2265 (release mode) and deal.II revision 31464. The
setup is an octant of a 3D spherical shell with a spherical temperature
and compositional anomaly close to the inner boundary, a mildly
temperature-dependent viscosity, and incompressible material properties
(simple material model).
I attach the results for the total wallclock time as reported in ASPECT's
output. Sorry for the mess in the Calc sheet, but this is the raw data,
so you can figure out whether I made a mistake in my calculations or
whether my questions below reflect real features of ASPECT's scaling
behaviour.
In general I am quite pleased with the results, especially that even the
global resolution of 7 (310 million DOFs) scales very well from 1024 to
4096 CPUs. However, two questions arose, maybe just because of my lack of
experience with scaling tests.

1. In the plot "Speedup to maximal #DOFs/CPU" it looks to me as if the
speedup of ASPECT is optimal for a medium number of DOFs per CPU. The
speedup falls off to suboptimal for both very high (>~300000) and very
low #DOFs/CPU (<~30000). This is somewhat surprising to me, since I
always thought that minimizing the number of CPUs / maximizing the DOFs
per CPU (and hence minimizing the communication overhead) would give the
best computational efficiency. This is especially apparent in this plot
because I used the runtime with the smallest number of CPUs as the
reference value for optimal speedup, and hence the curve lies above the
"optimal" line for medium #DOFs per CPU. The effect is consistent for the
large models (resolutions 5, 6, 7) and across many different CPU counts,
so it is not a one-time effect. Does anyone have an idea what causes it?
I thought of swap space being used at high #DOFs/CPU, but that should
degrade performance much more than this slightly suboptimal scaling.
Cray promotes the XC30 for a particularly fast node interconnect, but it
is certainly not faster than accessing system memory on a single CPU,
right? On the other hand, more CPUs and a fast interconnect could mean
that a larger part of the system matrices fits into the CPU caches, which
could speed up computation significantly. I am no expert in this, so feel
free to comment.
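
For reference, the speedup numbers in the sheet boil down to the
following small Python sketch (the runtimes here are placeholders, not
the measured values):

cpus  = [512, 1024, 2048, 4096]         # CPUs used for one fixed resolution
times = [480.0, 250.0, 135.0, 80.0]     # total wallclock time in seconds

ref_cpus, ref_time = cpus[0], times[0]  # smallest-CPU run is the reference
for n, t in zip(cpus, times):
    speedup = ref_time / t              # measured speedup
    optimal = n / ref_cpus              # ideal (linear) speedup
    print(n, speedup, optimal, speedup / optimal)  # last column: efficiency

An efficiency above 1 in the last column is exactly what makes the curve
lie above the "optimal" line in the plot.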

2. Another feature I got interested in is the scaling for a fixed
#DOFs/CPU (~150000), i.e. scaling the model size with the number of CPUs
(weak scaling). The numbers are shown in the lowest part of the Calc
sheet. Over an increase in model size and #CPUs by a factor of 512, the
time needed for a single timestep increased by a factor of 3. While most
parts of the wallclock time stayed below this increase (nearly constant
to slightly increasing), most of the additional time was spent in Stokes
assembly and Stokes solution (and a bit in the Stokes preconditioner).
Solve Stokes in particular stood out by doubling its computing time for
each doubling of the resolution (an increase in #DOFs by roughly a factor
of 8). The jump from resolution 4 to 5 is even higher, but that may be
because resolution 4 fits on a single node, so no node interconnect is
needed for that solution. Is this increase in time expected behaviour for
the Stokes solver, due to the increased communication overhead / size of
the problem, or might there be something wrong (with my setup, my
compilation, or in the code)? Or is this already very good scaling in
this dimension? As I said, I have limited experience with this.
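
To make explicit what I am comparing, the weak-scaling efficiency is just
the ratio of per-timestep times (again a sketch with placeholder values,
chosen only to match the factor of 3 over a factor of 512 in #CPUs
mentioned above):

cpus      = [8, 64, 512, 4096]            # model size grows with #CPUs
step_time = [100.0, 140.0, 200.0, 300.0]  # wallclock per timestep in seconds

ref = step_time[0]                        # smallest run as reference
for n, t in zip(cpus, step_time):
    print(n, t, ref / t)                  # last column: weak-scaling efficiency

Perfect weak scaling would keep that last column at 1.0.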

Feel free to comment, or to try the setup yourself and compare the
runtimes on your systems (I just found that the Xeon E5-2695v2 on the
Cray beats my laptop's Core i5-430M by around 66% on 2 CPUs, despite
comparable clock frequencies). I would be especially interested if
somebody finds much faster runtimes on any system, which could point to
problems in our configuration here.

Cheers,
Rene


-------------- next part --------------
set Adiabatic surface temperature          = 1600               # default: 0
set CFL number                             = 1.0
set Composition solver tolerance           = 1e-12
set Linear solver tolerance                = 1e-4

set Dimension                              = 3

set End time                               = 0                # default: 1e8

set Output directory                       = scaling_test # default: output

set Pressure normalization                 = surface
set Surface pressure                       = 0
set Resume computation                     = false
set Start time                             = 0

set Use years in output instead of seconds = true


subsection Compositional fields
  set List of normalized fields = 
  set Number of fields          = 2 # default: 0
end


subsection Compositional initial conditions
  set Model name = function

  subsection Function
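    # Two expressions separated by ';', one per compositional field:
    # the first marks a basal layer (graded over the lowermost 430 km,
    # with a smooth taper up to radius d) plus a spherical blob of
    # radius r centred at (x0,x0,x0); the second marks a 70 km thick
    # layer below the surface (radius > l).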
    set Function constants  = r=840000,h=3911000,r0=3481000,l=6301000,x0=2009756,a=2.25347e-12,d=4169000 
    set Function expression = if(sqrt(x*x+y*y+z*z)<h, 0.15 + 0.5 * (h - sqrt(x*x+y*y+z*z))/430000, if(sqrt((x-x0)*(x-x0)+(y-x0)*(y-x0)+(z-x0)*(z-x0)) < r, 0.15,if(sqrt(x*x+y*y+z*z)<d, a*(sqrt(x*x+y*y+z*z)-d)*(sqrt(x*x+y*y+z*z)-d),0.0))); if((sqrt(x*x+y*y+z*z))>l, 1.0, 0.0)
    set Variable names      =  x,y,z
  end

end


subsection Initial conditions
  set Model name = function

  subsection Function
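    # Background temperature of 1600 K with a linear increase by 250 K
    # over the lowermost 430 km, plus a warm (1850 K) spherical anomaly
    # of radius r centred at (x0,x0,x0) close to the inner boundary.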
    set Function constants  = r=840000,h=3911000,r0=3481000,l=6301000,x0=2009756
    set Function expression = if(sqrt(x*x+y*y+z*z)<h, 1600 + 250 * (h - sqrt(x*x+y*y+z*z))/430000, if(sqrt((x-x0)*(x-x0)+(y-x0)*(y-x0)+(z-x0)*(z-x0)) < r, 1850.0,1600.0))
    set Variable names      =  x,y,z
  end

end


subsection Discretization

  set Composition polynomial degree = 2
  set Temperature polynomial degree = 2

  subsection Stabilization parameters
    set beta  = 0.117
    set cR    = 0.5
  end

end


subsection Geometry model
  set Model name = spherical shell # default: 

  subsection Spherical shell
    set Inner radius  = 3481000
    set Opening angle = 90
    set Outer radius  = 6371000
  end

end


subsection Gravity model

  set Model name = radial constant # default: 

  subsection Radial constant
    set Magnitude = 10.0 # default: 30
  end

end

subsection Material model

  set Model name = simple # default: 

  subsection Simple model
    set Viscosity = 1e22
    set Thermal viscosity exponent = 9.0
    set Reference temperature = 1600
  end

end


subsection Mesh refinement

  set Additional refinement times              = 
  set Coarsening fraction                      = 0
  set Refinement fraction                      = 0

  set Initial adaptive refinement              = 0                    # default: 2
  set Initial global refinement                = 4                    # default: 2

  set Normalize individual refinement criteria = true

  set Run postprocessors on initial refinement = false
end


subsection Boundary temperature model

  set Model name = spherical constant

  subsection Spherical constant
    set Inner temperature = 2340 # default: 6000
    set Outer temperature = 273  # default: 0
  end

end

subsection Model settings

  set Fixed composition boundary indicators   = 
  set Fixed temperature boundary indicators   = 0,1        # default: 
  set Include adiabatic heating               = false       # default: false
  set Include latent heat                     = false
  set Include shear heating                   = false
  set Prescribed velocity boundary indicators =

  set Tangential velocity boundary indicators = 0,1,2,3,4
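  # i.e. free slip on all five boundaries: inner, outer, and the three
  # symmetry planes of the octant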
  set Zero velocity boundary indicators       =           # default: 

end

set Timing output frequency                   = 20

subsection Postprocess

  set List of postprocessors = visualization # default: all

  subsection Visualization

    set List of output variables      = density, viscosity # default:

    # VTU file output supports grouping files from several CPUs into one file
    # using MPI I/O when writing on a parallel filesystem. Select 0 for no
    # grouping. This will disable parallel file output and instead write one
    # file per processor in a background thread. A value of 1 will generate
    # one big file containing the whole solution.
    set Number of grouped files       = 0
    set Output format                 = vtu
    set Time between graphical output = 2e5 # default: 1e8
  end

end



-------------- next part --------------
A non-text attachment was scrubbed...
Name: speedup_3d.ods
Type: application/vnd.oasis.opendocument.spreadsheet
Size: 76007 bytes
Desc: not available
URL: <http://geodynamics.org/pipermail/aspect-devel/attachments/20140117/eb2b1ae3/attachment-0001.ods>

