[cig-commits] commit: Updated parallel scaling results with those from Lonestar.

Mercurial hg at geodynamics.org
Fri Aug 31 12:12:42 PDT 2012


changeset:   149:6a76f6eb25ff
tag:         tip
user:        Brad Aagaard <baagaard at usgs.gov>
date:        Fri Aug 31 12:12:39 2012 -0700
files:       faultRup.tex figs/solvertest_scaling.pdf
description:
Updated parallel scaling results with those from Lonestar.


diff -r b3e79c713b83 -r 6a76f6eb25ff faultRup.tex
--- a/faultRup.tex	Wed Aug 29 21:41:37 2012 -0500
+++ b/faultRup.tex	Fri Aug 31 12:12:39 2012 -0700
@@ -1126,12 +1126,12 @@ We generate both hexahedral meshes and t
 We generate both hexahedral meshes and tetrahedral meshes using CUBIT
 (available from http://cubit.sandia.gov) and construct meshes so that
 the problem size (number of DOF) for the two different cell types
-(hexahedra and tetrahedra) are nearly the same. The suite of
-simulations examine increasing larger problem sizes as we increase the
-number of processes, with $1.8\times 10^5$ DOF for 1 process up to
-$1.1\times 10^7$ DOF for 64 processes. The corresponding
-discretization sizes are 2033 m to 500 m for the hexahedral meshes and
-2326 m to 581 m for the tetrahedral meshes.
+(hexahedra and tetrahedra) is nearly the same (within 2\%). The suite
+of simulations examines increasingly large problem sizes as we increase
+the number of processes, with $7.8\times 10^4$ DOF for 1 process up to
+$7.1\times 10^6$ DOF for 96 processes. The corresponding
+discretization sizes are 2033 m to 437 m for the hexahedral meshes and
+2326 m to 712 m for the tetrahedral meshes.
 Figure~\ref{fig:solvertest:mesh} shows the 1846 m resolution
 tetrahedral mesh. As we will see in
 Section~\ref{sec:verification:quasi-static}, the hexahedral mesh for a
@@ -1195,7 +1195,7 @@ same as conventional bulk cells while pa
 same as conventional bulk cells while partitioning. In this
 performance benchmark matrix-vector multiplication (the PETSc
 \texttt{MatMult} function) has a load imbalance of up to 20\%
-on 64 processors. The cell partition balances the number of cells
+on 96 processors. The cell partition balances the number of cells
 across the processes using ParMetis \citep{Karypis:etal:1999} in order
 to achieve good balance for the finite element integration. This does
 not take into account a reduction in the number of DOF associated with
@@ -1214,20 +1214,22 @@ for the various stages of the simulation
 for the various stages of the simulation is independent of the number
 of processes. For this performance benchmark we use the entire suite
 of hexahedral and tetrahedral meshes described earlier that range in
-size from $1.8\times 10^5$ DOF (1 process) to $1.1\times 10^7$ DOF
-(64 processes). We employ the AMG preconditioner for the elasticity
+size from $7.8\times 10^4$ DOF (1 process) to $7.1\times 10^6$ DOF (96
+processes). We employ the AMG preconditioner for the elasticity
 submatrix and our custom preconditioner for the Lagrange multipliers
-submatrix. We ran the simulations on a Beowulf cluster comprised of 24
-compute nodes connected by QDR Infiniband, where each compute node
-consisted of two quad-core Intel Xeon E5620 processors with 24 GB of
-RAM. Simulations run on eight or fewer cores were run on a single
+submatrix. We ran the simulations on Lonestar at the Texas Advanced
+Computing Center. Lonestar comprises 1888 compute nodes
+connected by QDR InfiniBand in a fat-tree topology, where each compute
+node consists of two six-core Intel Xeon E5650 processors with 24 GB
+of RAM. Simulations on twelve or fewer cores ran on a single
 compute node with processes distributed across processors and then
 cores. For example, the two process simulation used one core on each
 of two processors. In addition to algorithm bottlenecks, runtime
 performance is potentially impeded by core/memory affinity, memory
-bandwidth, and communication among compute nodes.
+bandwidth, and communication among compute nodes (including
+communication from other jobs running on the machine).
 
-The single node scaling for PyLith (eight processes or less in this
+The single node scaling for PyLith (twelve processes or less in this
 case) is almost completely controlled by the available memory
 bandwidth. Good illustrations of the memory system performance are
 given by the \texttt{VecAXPY}, \texttt{VecMAXPY} and \texttt{VecMDot}
@@ -1236,7 +1238,7 @@ rate at which a processor can perform fl
 rate at which a processor can perform floating point operations. From
 Table~\ref{tab:solvertest:memory:events}, we see that we saturate the
 memory bandwidth using two processes per processor, since scaling
-plateaus from 2 to 4 processes, but shows good scaling from 8 to 16
+plateaus from 2 to 4 processes, but shows good scaling from 12 to 24
 processes. This lack of memory bandwidth will depress overall
 performance, but should not affect the inter-node scaling of the
 application.
@@ -1245,32 +1247,30 @@ operation for vector reductions, and \te
 operation for vector reductions, and \texttt{MatMult} for
 point-to-point communication. In
 Table~\ref{tab:solvertest:memory:events} we see that the vector
-reduction shows good scaling up to 64 processes. Similarly in
+reduction shows good scaling up to 96 processes. Similarly in
 Table~\ref{tab:solvertest:solver:events}, we see that \texttt{MatMult}
 has good scalability, but that it is a small fraction of the overall
 solver time. The AMG preconditioner setup (\texttt{PCSetUp}) and
 application (\texttt{PCApply}) dominate the overall solver time. The
-AMG preconditioner setup time increases roughly linearly with the
-number of processes. Note that many weak scaling studies do not
-include this event, because it is amortized over the
-iteration. Nevertheless, in our benchmark it is responsible for most
-of the deviation from perfect weak scaling.  The scalability of the
-application of the AMG preconditioner decays more slowly, but there is
-still serious deterioration by 64 processes. We could trade
-preconditioner strength for scalability by reducing the work done on
-the coarse AMG grids, so that the solver uses more iterations which
-scale very well.  However, that would increase overall solver time and
-thus would not be the choice to maximize scientific output.
+AMG preconditioner setup time increases with the number of
+processes. Note that many weak scaling studies do not include this
+event, because it is amortized over the iteration. Nevertheless, in
+our benchmark it is responsible for most of the deviation from perfect
+weak scaling.  We could trade preconditioner strength for scalability
+by reducing the work done on the coarse AMG grids, so that the solver
+uses more iterations which scale very well.  However, that would
+increase overall solver time and thus would not be the choice to
+maximize scientific output.
 
 Figure~\ref{fig:solvertest:scaling} illustrates the excellent parallel
 performance for the finite-element assembly routines (reforming the
-Jacobian sparse matrix and computing the residual). From the ASM
-performance, we see that the basic solver building-blocks, like
-parallel sparse matrix-vector multiplication, scale well. However, the
-ASM preconditioner itself is not scalable, and the number of iterations
-increases significantly with the number of processes. The introduction
-of Schur complement methods and an AMG preconditioner slows the growth
-considerably, but future work will pursue the ultimate goal of
+Jacobian sparse matrix and computing the residual). As discussed
+earlier in this section, the ASM preconditioner performance is not
+scalable because the number of iterations increases significantly with
+the number of processes. As shown in
+Figure~\ref{fig:solvertest:scaling}, the introduction of Schur
+complement methods and an AMG preconditioner slows the growth
+considerably, and future work will pursue the ultimate goal of
 iteration counts independent of the number of processes.
 
 % ------------------------------------------------------------------
@@ -1591,7 +1591,10 @@ rupture propagation.
   been supported by NSF grants EAR/ITR-0313238 and EAR-0745391. This
   is SCEC contribution number 1665. Several of the figures were
   produced using Matplotlib \citep{matplotlib} and PGF/TikZ (available
-  from \url{http://sourceforge.net/projects/pgf/}).
+  from \url{http://sourceforge.net/projects/pgf/}). Computing
+  resources for the parallel scalability benchmarks were provided by
+  the Texas Advanced Computing Center (TACC) at The University of
+  Texas at Austin (\url{http://www.tacc.utexas.edu}).
 \end{acknowledgments}
 
 
@@ -1679,7 +1682,9 @@ rupture propagation.
     the problem size. The linear solve (solid lines in the top panel)
     does not scale as well, which we attribute to the poor scaling of
     the algebraic multigrid setup and application as well as limited
-    memory and interconnect bandwidth.}
+    memory and interconnect bandwidth. We attribute fluctuations in
+    the relative performance to variations in the machine load
+    from other jobs on the cluster.}
   \label{fig:solvertest:scaling}
 \end{figure}
 
@@ -1951,35 +1956,37 @@ rupture propagation.
   \hline
   Event & \# Cores & Load Imbalance & MFlops/s \\
   \hline
-VecMDot &    1 & 1.0 &   2188 \\
-     &    2 & 1.1 &   3968 \\
-     &    4 & 1.1 &   5510 \\
-     &    8 & 1.1 &   6008 \\
-     &   16 & 1.3 &  10249 \\
-     &   32 & 1.2 &   4270 \\
-     &   64 & 1.2 &  12300 \\
+VecMDot &    1 & 1.0 &   2007 \\
+     &    2 & 1.1 &   3809 \\
+     &    4 & 1.1 &   5431 \\
+     &    6 & 1.1 &   5967 \\
+     &   12 & 1.2 &   5714 \\
+     &   24 & 1.2 &  11784 \\
+     &   48 & 1.2 &  20958 \\
+     &   96 & 1.3 &  17976 \\
   \hline
-VecAXPY &    1 & 1.0 &   1453 \\
-     &    2 & 1.1 &   2708 \\
-     &    4 & 1.1 &   5002 \\
-     &    8 & 1.1 &   4224 \\
-     &   16 & 1.3 &   8158 \\
-     &   32 & 1.2 &  13872 \\
-     &   64 & 1.2 &  25802 \\
+VecAXPY &    1 & 1.0 &   1629 \\
+     &    2 & 1.1 &   3694 \\
+     &    4 & 1.1 &   5969 \\
+     &    6 & 1.1 &   6028 \\
+     &   12 & 1.2 &   5055 \\
+     &   24 & 1.2 &  10071 \\
+     &   48 & 1.2 &  18761 \\
+     &   96 & 1.3 &  33676 \\
   \hline
-VecMAXPY &    1 & 1.0 &   1733 \\
-     &    2 & 1.1 &   3284 \\
-     &    4 & 1.1 &   4990 \\
-     &    8 & 1.1 &   5610 \\
-     &   16 & 1.3 &  11051 \\
-     &   32 & 1.2 &  21678 \\
-     &   64 & 1.2 &  42680 \\
+VecMAXPY &    1 & 1.0 &   1819 \\
+     &    2 & 1.1 &   3415 \\
+     &    4 & 1.1 &   5200 \\
+     &    6 & 1.1 &   5860 \\
+     &   12 & 1.2 &   6051 \\
+     &   24 & 1.2 &  12063 \\
+     &   48 & 1.2 &  23072 \\
+     &   96 & 1.3 &  28461 \\
   \hline
 \end{tabular}
 \tablenotetext{a}{Examination of memory system performance using three
   PETSc vector operations for simulations with the hexahedral
-  meshes. The performance for the tetrahedral meshes is nearly
-  the same. For ideal scaling the number of floating point operations
+  meshes. The performance for the tetrahedral meshes is very similar. For ideal scaling the number of floating point operations
   per second should scale linearly with the number of processes. \texttt{VecMDot}
   corresponds to the operation for vector reductions, \texttt{VecAXPY}
   corresponds to vector scaling and addition, and \texttt{VecMAXPY}
@@ -1995,34 +2002,34 @@ VecMAXPY &    1 & 1.0 &   1733 \\
   \hline
   Event & \# Calls & Time (s) & MFlops/s \\
   \hline
-\multicolumn{4}{c}{p = 8} \\
-  MatMult & 168 &      2.1 &     4946 \\
-  PCSetUp &   1 &      5.8 &      159 \\
-  PCApply &  53 &      4.2 &     3081 \\
-  KSPSolve &   1 &     12.9 &     2246 \\
+\multicolumn{4}{c}{p = 12} \\
+  MatMult & 180 &      2.7 &     6113 \\
+  PCSetUp &   1 &      5.7 &      232 \\
+  PCApply &  57 &      5.5 &     3690 \\
+  KSPSolve &   1 &     15.1 &     3013 \\
 \hline
-\multicolumn{4}{c}{p = 16} \\
-  MatMult & 174 &      2.2 &     9691 \\
-  PCSetUp &   1 &      7.0 &      258 \\
-  PCApply &  55 &      4.9 &     5629 \\
-  KSPSolve &   1 &     14.9 &     4033 \\
+\multicolumn{4}{c}{p = 24} \\
+  MatMult & 207 &      3.1 &    12293 \\
+  PCSetUp &   1 &      5.2 &      526 \\
+  PCApply &  66 &      6.6 &     7285 \\
+  KSPSolve &   1 &     16.4 &     6666 \\
 \hline
-\multicolumn{4}{c}{p = 32} \\
-  MatMult & 189 &      3.8 &    12003 \\
-  PCSetUp &   1 &     15.3 &      241 \\
-  PCApply &  60 &      7.3 &     8174 \\
-  KSPSolve &   1 &     26.0 &     5034 \\
+\multicolumn{4}{c}{p = 48} \\
+  MatMult & 222 &      4.0 &    21136 \\
+  PCSetUp &   1 &     10.1 &      628 \\
+  PCApply &  71 &      9.4 &    12032 \\
+  KSPSolve &   1 &     25.1 &    10129 \\
 \hline
-\multicolumn{4}{c}{p = 64} \\
-  MatMult & 219 &      3.2 &    34534 \\
-  PCSetUp &   1 &     29.0 &      348 \\
-  PCApply &  70 &     14.8 &    10228 \\
-  KSPSolve &   1 &     47.5 &     7067 \\
+\multicolumn{4}{c}{p = 96} \\
+  MatMult & 234 &      4.0 &    42130 \\
+  PCSetUp &   1 &     11.8 &     1943 \\
+  PCApply &  75 &     11.6 &    20422 \\
+  KSPSolve &   1 &     30.5 &    17674 \\
 \hline
 \end{tabular}
 \tablenotetext{a}{Examination of solver performance using three of the
   main events comprising the linear solve for simulations with the
-  hexahedral meshes and 8, 16, 32, and 64 processes. The performance
+  hexahedral meshes and 12, 24, 48, and 96 processes. The performance
   for the tetrahedral meshes is nearly the same. For ideal scaling
   the time for each event should be constant as the number of
   processes increases. The \texttt{KSPSolve} event encompasses the
diff -r b3e79c713b83 -r 6a76f6eb25ff figs/solvertest_scaling.pdf
Binary file figs/solvertest_scaling.pdf has changed
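
For readers who want to experiment with a solver configuration along the
lines the revised text describes (an AMG preconditioner for the elasticity
submatrix combined with a separate treatment of the Lagrange-multiplier
submatrix via a Schur-complement field split), a minimal sketch of PETSc
command-line options is shown below. The split names (`displacement`,
`lagrange`) and the choice of `gamg` and `jacobi` are illustrative
assumptions, not the exact settings used in these benchmarks:

```
# Hypothetical PETSc options sketch -- split names and preconditioner
# choices are illustrative, not the benchmark's actual configuration.
-ksp_type gmres                        # Krylov method for the full system
-pc_type fieldsplit                    # split displacement/Lagrange blocks
-pc_fieldsplit_type schur              # Schur-complement formulation
-fieldsplit_displacement_pc_type gamg  # AMG on the elasticity submatrix
-fieldsplit_lagrange_pc_type jacobi    # stand-in for the custom preconditioner
-log_summary                           # per-event timings (MatMult, PCSetUp, ...)
```

With `-log_summary` (renamed `-log_view` in later PETSc releases), PETSc
reports the per-event flop rates and load imbalance figures of the kind
tabulated in the diff above.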


