[cig-commits] commit: Rewrote the performance section and added refs

Mercurial hg at geodynamics.org
Wed Jul 18 20:09:24 PDT 2012


changeset:   129:d25491644e1e
tag:         tip
user:        Matthew G. Knepley <knepley at gmail.com>
date:        Wed Jul 18 22:09:18 2012 -0500
files:       faultRup.tex references.bib
description:
Rewrote the performance section and added refs


diff -r 5ef3c48c864c -r d25491644e1e faultRup.tex
--- a/faultRup.tex	Thu May 17 09:34:59 2012 +1200
+++ b/faultRup.tex	Wed Jul 18 22:09:18 2012 -0500
@@ -23,7 +23,9 @@
 \newcommand\brad[1]{{\color{red}\bf [BRAD: #1]}}
 \newcommand\matt[1]{{\color{blue}\bf [MATT: #1]}}
 \newcommand\charles[1]{{\color{green}\bf [CHARLES: #1]}}
-\newcommand\event[1]{\texttt{\bf #1}}
+\newcommand\PetscFunction[1]{\texttt{\bf #1}}
+\newcommand\PetscClass[1]{\texttt{\bf #1}}
+\newcommand\PetscEvent[1]{\texttt{\bf #1}}
 
 % ======================================================================
 % PREAMBLE
@@ -910,7 +912,7 @@ which leads to a simple block diagonal p
 
 The elastic submatrix $K$, in the absence of boundary conditions,
 has three translational and three rotational null modes. These are
-provided to the algebraic multigrid preconditioner, such as the
+provided to the algebraic multigrid (AMG) preconditioner, such as the
 ML library \citep{ML:users:guide} or the PETSc GAMG preconditioner,
 in order to assure an accurate coarse grid solution. AMG mimics the
 action of traditional geometric multigrid, but it generates coarse
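
The paragraph above supplies the six rigid-body modes of the elastic submatrix to the AMG
preconditioner. As an illustrative sketch only (not PyLith's actual code; the coordinate
vector 'coords' and its interlaced layout are assumptions), the modes can be attached to a
PETSc matrix roughly as follows; this is the path read by GAMG, while ML is typically
handed the nodal coordinates instead:

#include <petscmat.h>

/* Sketch: attach the three translational and three rotational rigid-body
 * modes to the elasticity submatrix K so that algebraic multigrid can build
 * an accurate coarse space.  'coords' is assumed to hold the nodal
 * coordinates interlaced as (x0,y0,z0,x1,y1,z1,...). */
static PetscErrorCode AttachRigidBodyModes(Mat K, Vec coords)
{
  MatNullSpace   nearNull;
  PetscErrorCode ierr;

  ierr = VecSetBlockSize(coords, 3);CHKERRQ(ierr);
  /* Construct the six rigid-body vectors from the coordinates. */
  ierr = MatNullSpaceCreateRigidBody(coords, &nearNull);CHKERRQ(ierr);
  /* GAMG consults the near-null space when coarsening. */
  ierr = MatSetNearNullSpace(K, nearNull);CHKERRQ(ierr);
  ierr = MatNullSpaceDestroy(&nearNull);CHKERRQ(ierr);
  return 0;
}
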
@@ -1212,25 +1214,24 @@ accelerates the convergence with an 80\%
 accelerates the convergence with an 80\% further reduction in the
 number of iterations required for convergence.
 
-\matt{Add additional comments, explanation of performance}
-The underlying PETSc solver infrastructure has dmeonstrated optimal scalability
+\subsection{Parallel Scaling Performance}
+
+The underlying PETSc solver infrastructure has demonstrated optimal scalability
 on the largest machines available today. However, very often computer science
-scalability results are bsed upon unrelaistically simple problems which do not
+scalability results are based upon unrealistically simple problems which do not
 advance the scientific state-of-the-art. We will concentrate on explaining the
 sources of reduced scalability, and propose possible algorithmic mitigation.
 
 The main impediment to scalability in PyLith is load imbalance in the solver stage.
 This imbalance is the combination of three effects: the inherent imbalance in the
 partition of an unstructured mesh, the use of a cell partition, and lack of
-incorporation of cohesive cells in the partition. In our full test case, the unstructured
-partition calculated with both ParMetis could have a load imbalance of up to 30\%
-on 128 processors. On top of this, the cell partition, which is necessary in order
-to achieve good balance for the finite element integration, does not take into
-account Dirichlet boundary conditions or unknowns on the fault, which can exacerbate
+incorporation of cohesive cells in the partition. In our full test case, matrix-vector
+multiplication (the PETSc \PetscFunction{MatMult} function) could have a load imbalance
+of up to 30\% on 128 processors. The cell partition, calculated with ParMetis, is necessary
+to achieve good balance for the finite-element integration, but it does not
+take into account Dirichlet boundary conditions or unknowns on the fault, which can exacerbate
 the imbalance. However, elimination of constrained unknowns preserves the symmetry
 of the overall systems, and can result in better conditioned linear systems.
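
For concreteness (our reading, assuming the imbalance is measured the way PETSc's log
summary reports it, as the ratio of the maximum to the minimum time spent in an event
across processes): a ratio of $t_{\max}/t_{\min} = 1.3$, as listed for 128 cores in
Table~\ref{tab:memBandwidth}, means the slowest process spends about 30\% more time in
\PetscFunction{MatMult} than the fastest, which is the figure quoted above.
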
-
-\subsection{Parallel Scaling Performance}
 
 We evaluate the parallel performance via a weak scaling
 criterion. That is, we run simulations on various numbers of
@@ -1241,29 +1242,44 @@ performance benchmark we use the entire 
 performance benchmark we use the entire suite of hexahedral and
 tetrahedral meshes described earlier that range in size from
 $1.78\times 10^5$ DOF to $2.14\times 10^7$ DOF. In each of these
-simulations, we employ the field split algebraic multigrid
-preconditioner with multiplicative composition for the elasticity
-submatrix and the custom preconditioner for the Lagrange multipliers
+simulations, we compare the standard additive Schwarz parallel
+preconditioner~\citep{Smith:etal:1996} with an approximate block factorization
+preconditioner~\citep{elman2008tcp} built using the PETSc \PetscClass{PCFieldSplit}
+object. We test additive, multiplicative, and Schur complement block compositions,
+and employ the ML algebraic multigrid preconditioner for the elasticity
+submatrix and our custom preconditioner for the Lagrange multipliers
 submatrix. We ran the simulations on a Beowulf cluster comprised of 24
 compute nodes connected by QDR Infiniband, where each compute node
 consisted of two quad-core Intel Xeon E5620 processors with 24 GB
 of RAM. Simulations run on eight or fewer cores were run on a single
-compute node. Thus, in addition to algorithm bottlenecks, runtime
-performance is potentially impeded by core/memory affinity, memory
-bandwidth, and communication among compute nodes.
+compute node. In the USGS configuration, four dual quad-core chips share a single backplane.
+Thus, in addition to algorithm bottlenecks, runtime performance is potentially impeded
+by core/memory affinity, memory bandwidth, and communication among compute nodes.
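
To make the solver configuration concrete, the following is a minimal, hypothetical sketch
(not PyLith's actual code) of how such a split preconditioner could be assembled with the
PETSc API; the index sets isDisp and isLag selecting the displacement and Lagrange
multiplier unknowns, and a KSP whose operators are already set, are assumed:

#include <petscksp.h>

/* Sketch: block preconditioner with multiplicative composition, ML algebraic
 * multigrid on the elasticity block, and a shell preconditioner standing in
 * for the custom Lagrange multiplier preconditioner. */
static PetscErrorCode ConfigureSplitPC(KSP ksp, IS isDisp, IS isLag)
{
  PC             pc, subpc;
  KSP           *subksp;
  PetscInt       nsplits;
  PetscErrorCode ierr;

  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCFIELDSPLIT);CHKERRQ(ierr);
  ierr = PCFieldSplitSetType(pc, PC_COMPOSITE_MULTIPLICATIVE);CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, "displacement", isDisp);CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, "lagrange",     isLag);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);               /* creates the sub-solvers */
  ierr = PCFieldSplitGetSubKSP(pc, &nsplits, &subksp);CHKERRQ(ierr);
  ierr = KSPGetPC(subksp[0], &subpc);CHKERRQ(ierr);
  ierr = PCSetType(subpc, PCML);CHKERRQ(ierr);      /* ML AMG on elasticity    */
  ierr = KSPGetPC(subksp[1], &subpc);CHKERRQ(ierr);
  ierr = PCSetType(subpc, PCSHELL);CHKERRQ(ierr);   /* custom Lagrange PC      */
  ierr = PetscFree(subksp);CHKERRQ(ierr);
  return 0;
}

The additive and Schur complement variants compared here differ only in the composition
type (PC_COMPOSITE_ADDITIVE or PC_COMPOSITE_SCHUR); equivalent settings are usually
selected at run time with options such as -pc_type fieldsplit and -pc_fieldsplit_type
multiplicative.
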
 
-The scaling of \event{VecMAXPY} and \event{MatMult} give a good idea of the
-memory system performance since both operations are limited by memory bandwidth.
-From Fig.~\ref{fig:memBandwidth}, we see that we saturate the memory system
-using two processes per processor, since scaling plateaus from 2 to 4 processes,
-but shows good scaling from 4 to 8.
+The single node scaling for PyLith is almost completely controlled by the
+available memory bandwidth. Good illustrations of the memory system
+performance are given by the \PetscEvent{VecAXPY}, \PetscEvent{VecMAXPY} and
+\PetscEvent{VecMDot} operations reported in the log summary~\citep{PETSc:manual},
+since these operations are limited by available memory bandwidth rather than
+processor flop rate. From Table~\ref{tab:memBandwidth}, we see that we saturate
+the memory system using two or three processes per processor, since scaling
+plateaus from 4 to 8 processes, but shows good scaling from 8 to 16. This lack of
+bandwidth will depress overall performance, but should not affect the inter-node
+scaling of the application.
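
As a rough back-of-the-envelope check (our estimate, not a measurement reported here):
\PetscEvent{VecAXPY} computes $y \leftarrow \alpha x + y$, i.e.\ 2 flops per vector entry
while streaming roughly 24 bytes (read $x$, read and write $y$), or about 12 bytes per
flop. The single-core rate of 1453 MFlops/s in Table~\ref{tab:memBandwidth} therefore
corresponds to on the order of 17 GB/s of sustained memory traffic, a large fraction of
what one socket of this machine can deliver, consistent with the bandwidth-saturation
argument above.
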
 
-Machine network performance can be elucidated by the \event{VecMDot} operation
-for reductions, and \event{MatMult} for point-to-point communication. In the USGS
-configuration, four dual-quadcore chips shared a single backplane. In
-Fig.~\ref{fig:comm} we see that these operations show good scaling up to 32 processes,
-which corresponds to a single 2U box, but very poor scaling thereafter, indicating
-a very poor internode network.
+Machine network performance can be elucidated by the \PetscEvent{VecMDot} operation
+for reductions, and \PetscEvent{MatMult} for point-to-point communication. In
+Table~\ref{tab:memBandwidth} we see that the vector reduction shows good scaling up
+to 64 processes \matt{Recheck with new 128 results}. Similarly in Table~\ref{tab:solver},
+we see that \PetscEvent{MatMult} has good scalability, but that it is the smaller component
+of overall solver time. The AMG setup time increases roughly linearly with the number of
+processes. It is often not included in weak scaling studies since it is amortized over
+the iterations, but it is responsible for most of the deviation from perfect weak scaling.
+The scalability of the AMG apply decays more slowly, but there is still serious deterioration
+by 64 processes. Here we could trade preconditioner strength for scalability by turning down
+the work done on coarse AMG grids, so that the solver uses more iterations, which scale very well.
+However, that would increase overall solver time and thus would not be the choice to maximize
+scientific output.
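
As a concrete reading of Table~\ref{tab:solver} (our arithmetic, using the weak-scaling
convention that ideal time is constant): \PetscEvent{KSPSolve} grows from 12.9 s on 8
cores to 47.5 s on 64 cores, a parallel efficiency of roughly $12.9/47.5 \approx 27\%$,
and \PetscEvent{PCSetUp} alone accounts for $29.0 - 5.8 = 23.2$ s of that 34.6 s increase,
consistent with attributing most of the deviation from perfect weak scaling to the AMG
setup.
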
 
 Figure~\ref{fig:solvertest:scaling} illustrates the excellent parallel
 performance for the finite-element assembly routines (reforming the
@@ -1274,9 +1290,8 @@ Schur complement methods and AMG slows t
 Schur complement methods and AMG slows the growth considerably, but future work will
 pursue the ultimate goal of iteration counts independent of process number.
 
-Separately plot VecMDot and MatMult
-This shows that increasing iterates have a penalty from sync VecMDot, and MatMult scales
-better. Could tradeoff with iterative method like BCGS.
+%Separately plot VecMDot and MatMult. This shows that increasing iterates have a penalty from sync VecMDot, and MatMult
+%scales better. Could tradeoff with iterative method like BCGS.
 
 % ------------------------------------------------------------------
 \section{Code Verification Benchmarks}
@@ -1870,6 +1885,91 @@ simulations of earthquake rupture propag
 
 
 \begin{table}
+\caption{Performance Benchmark Memory System Evaluation\tablenotemark{a}}
+\label{tab:memBandwidth}
+\centering
+\begin{tabular}{rcc}
+  \# Cores & Load Imbalance & MFlops/s \\
+  \hline
+  \multicolumn{3}{c}{\PetscEvent{VecMDot}} \\
+     1 & 1.0 &  2188 \\
+     2 & 1.0 &  3969 \\
+     4 & 1.0 &  5511 \\
+     8 & 1.1 &  6007 \\
+    16 & 1.3 & 10249 \\
+    32 & 1.2 &  4270 \\
+    64 & 1.2 & 12299 \\
+   128 & 1.3 &  2019 \\
+  \hline
+  \multicolumn{3}{c}{\PetscEvent{VecAXPY}} \\
+     1 & 1.0 &  1453 \\
+     2 & 1.1 &  2708 \\
+     4 & 1.1 &  5001 \\
+     8 & 1.1 &  4225 \\
+    16 & 1.3 &  8157 \\
+    32 & 1.2 & 13876 \\
+    64 & 1.2 & 25807 \\
+   128 & 1.3 & 58759 \\
+  \hline
+  \multicolumn{3}{c}{\PetscEvent{VecMAXPY}} \\
+     1 & 1.0 &  1733 \\
+     2 & 1.1 &  3283 \\
+     4 & 1.1 &  4991 \\
+     8 & 1.1 &  5611 \\
+    16 & 1.3 & 11050 \\
+    32 & 1.2 & 21680 \\
+    64 & 1.2 & 42697 \\
+   128 & 1.3 & 84691 \\
+  \hline
+\end{tabular}
+\tablenotetext{a}{Examination of memory system performance using three PETSc vector operations.}
+\end{table}
+
+
+\begin{table}
+\caption{Performance Benchmark Solver Evaluation\tablenotemark{a}}
+\label{tab:solver}
+\centering
+\begin{tabular}{lrrr}
+  Event & Calls & Time (s) & MFlops/s \\
+  \hline
+  \multicolumn{4}{c}{p = 8}    \\
+  MatMult  & 168 &  2.1 & 4947 \\
+  PCSetUp  &   1 &  5.8 &  159 \\
+  PCApply  &  53 &  4.2 & 3081 \\
+  KSPSolve &   1 & 12.9 & 2246 \\
+  \hline
+  \multicolumn{4}{c}{p = 16}   \\
+  MatMult  & 174 &  2.2 & 9690 \\
+  PCSetUp  &   1 &  7.0 &  258 \\
+  PCApply  &  55 &  4.9 & 5629 \\
+  KSPSolve &   1 & 14.9 & 4033 \\
+  \hline
+  \multicolumn{4}{c}{p = 32}   \\
+  MatMult  & 189 &  3.8 & 12003 \\
+  PCSetUp  &   1 & 15.3 &   241 \\
+  PCApply  &  60 &  7.3 &  8174 \\
+  KSPSolve &   1 & 26.0 &  5034 \\
+  \hline
+  \multicolumn{4}{c}{p = 64}   \\
+  MatMult  & 219 &  3.2 & 34538 \\
+  PCSetUp  &   1 & 29.0 &   348 \\
+  PCApply  &  70 & 14.8 & 10229 \\
+  KSPSolve &   1 & 47.5 &  7067 \\
+  \hline
+  \multicolumn{4}{c}{p = 128}  \\
+  MatMult  & 222 &   22.1 & 9880 \\
+  PCSetUp  &   1 &  314.1 &  169 \\
+  PCApply  &  71 & 6865.0 &   46 \\
+  KSPSolve &   1 & 7198.9 &   99 \\
+  \hline
+\end{tabular}
+\tablenotetext{a}{Examination of solver performance using PETSc events for the matrix-vector product (MatMult), preconditioner setup (PCSetUp) and application (PCApply), and the complete Krylov solve (KSPSolve).}
+\end{table}
+
+
+\clearpage
+\begin{table}
 \caption{SCEC Benchmark TPV13 Parameters\tablenotemark{a}}
 \label{tab:tpv13:parameters}
 \centering
diff -r 5ef3c48c864c -r d25491644e1e references.bib
--- a/references.bib	Thu May 17 09:34:59 2012 +1200
+++ b/references.bib	Wed Jul 18 22:09:18 2012 -0500
@@ -985,6 +985,17 @@
   url       = {http://www.mcs.anl.gov/\~{ }bsmith/ddbook.html}
 }
 
+@article{elman2008tcp,
+  title={{A taxonomy and comparison of parallel block multi-level preconditioners for the incompressible Navier-Stokes equations}},
+  author={Elman, H.C. and Howle, V.E. and Shadid, J. and Shuttleworth, R. and Tuminaro, R.},
+  journal={Journal of Computational Physics},
+  volume={227},
+  number={1},
+  pages={1790--1808},
+  year={2008},
+  publisher={Academic Press}
+}
+
 @Article{Liu:etal:2006,
   author = 	 {Liu, P. and Archuleta, R.~J. and Hartzell, S.~H.},
   title = 	 {Prediction of Broadband Ground-Motion Time


