[cig-commits] commit: Cleanup of discussion of parallel performance.

Mercurial hg at geodynamics.org
Mon Aug 13 14:56:20 PDT 2012


changeset:   131:56ed4e9edde6
tag:         tip
user:        Brad Aagaard <baagaard at usgs.gov>
date:        Mon Aug 13 14:56:14 2012 -0700
files:       faultRup.tex references.bib
description:
Cleanup of discussion of parallel performance.


diff -r 9268656f1d00 -r 56ed4e9edde6 faultRup.tex
--- a/faultRup.tex	Mon Aug 13 09:33:09 2012 -0700
+++ b/faultRup.tex	Mon Aug 13 14:56:14 2012 -0700
@@ -24,9 +24,6 @@
 \newcommand\brad[1]{{\color{red}\bf [BRAD: #1]}}
 \newcommand\matt[1]{{\color{blue}\bf [MATT: #1]}}
 \newcommand\charles[1]{{\color{green}\bf [CHARLES: #1]}}
-\newcommand\PetscFunction[1]{\texttt{\bf #1}}
-\newcommand\PetscClass[1]{\texttt{\bf #1}}
-\newcommand\PetscEvent[1]{\texttt{\bf #1}}
 
 % ======================================================================
 % PREAMBLE
@@ -80,7 +77,7 @@
   equations that provides excellent scalability with problem size
   compared to conventional Additive Schwarz methods. We demonstrate
   application of this approach using benchmarks for both quasi-static
-  viscoelastic deformation and spontaneous dynamic rupture propagation
+  viscoelastic deformation and dynamic spontaneous rupture propagation
   that verify the numerical implementation.  Future work will focus on
   linking the quasi-static and dynamic simulations together to capture
   both the slow strain accumulation and post-seismic relaxation at
@@ -223,7 +220,7 @@ fault slip evolve according to a fault c
 fault slip evolve according to a fault constitutive model that
 specifies the friction on the fault surface. Here, we describe a
 robust, yet flexible method for implementing fault slip with a domain
-decomposition approach, its affect on the overall design of PyLith,
+decomposition approach, its effect on the overall design of PyLith,
 and verification of its implementation using benchmarks.
 
 % ------------------------------------------------------------------
@@ -233,7 +230,7 @@ equations using the finite-element metho
 equations using the finite-element method. We augment the conventional
 finite-element formulation for elasticity with a domain decomposition
 approach \citep{Smith:etal:1996,Zienkiewicz:Taylor:2005} to implement
-the fault slip.  The PyLith manual \citep{PyLith:manual:1.6.2} provides
+the fault slip.  The PyLith manual \citep{PyLith:manual:1.7.1} provides
 a step-by-step description of the formulation.
 
 We solve the elasticity equation including inertial terms,
@@ -298,8 +295,8 @@ the Lagrange multipliers (fault traction
   \begin{split}
     - \int_{V} \nabla\bm{\phi} : \bm{\sigma} \, dV
     + \int_{S_T} \bm{\phi} \cdot \bm{T} \, dS
-    - \int_{S_{f^+}} \bm{\phi} \cdot \bm{l} \, dS
-    + \int_{S_{f^-}} \bm{\phi} \cdot \bm{l} \, dS \\
+    - \int_{S_{f^+}} \bm{\phi} \cdot \bm{l} \, dS \\
+    + \int_{S_{f^-}} \bm{\phi} \cdot \bm{l} \, dS
     + \int_{V} \bm{\phi} \cdot \bm{f} \, dV 
     - \int_{V} \bm{\phi} \cdot \rho \frac{\partial^{2}\bm{u}}{\partial t^{2}} \, dV
     =0.
@@ -363,7 +360,7 @@ for all $\bm{a}_m$, which leads to
 \end{gather}
 We want to solve these equations for the coefficients $\bm{u}_n$
 and $\bm{l}_p$ subject to $\bm{u} = \bm{u}_0 \text{ on
-}S_u$. When we prescribed the slip, we specify $\bm{d}$ on $S_f$,
+}S_u$. When we prescribe the slip, we specify $\bm{d}$ on $S_f$,
 and when we use a fault constitutive model, we specify how the
 Lagrange multipliers $\bm{l}$ depend on the fault slip, slip rate,
 and state variables.
@@ -414,7 +411,7 @@ Extensible Toolkit for Scientific Comput
 Extensible Toolkit for Scientific Computation (PETSc), which provides
 a suite of tools for solving linear systems of algebraic equations
 with parallel processing
-\citep{PETSc:manual,PETSC:efficient}. In solving the
+\citep{PETSc:efficient,PETSc:manual}. In solving the
 system, we compute the residual (i.e., $\bm{r} = \bm{b} -
 \bm{A} \cdot \bm{u}$) and the Jacobian of the system
 ($\bm{A}$). In our case the solution is $\bm{u} =
@@ -447,8 +444,10 @@ some algebra we find this portion of the
   (\nabla^T + \nabla) \bm{N}_m^T \cdot
   \bm{C} \cdot (\nabla + \nabla^T) \bm{N}_n  \, dV.
 \end{equation}
-Following a similar procedure, we find the portion of the Jacobian
-associated with the constraints, equation~(\ref{eqn:quasi-static:residual:fault}), is
+This matches the stiffness matrix in conventional solid mechanics
+finite-element formulations. Following a similar procedure, we find
+the portion of the Jacobian associated with the constraints,
+equation~(\ref{eqn:quasi-static:residual:fault}), is
 \begin{equation}
   \label{eqn:jacobian:constraint}
   \bm{L} = \int_{S_f} \bm{N}_p^T \cdot (\bm{N}_{n^+} - \bm{N}_{n^-}) \, dS.
@@ -494,7 +493,7 @@ the fault slip constraint remains unchan
 the fault slip constraint remains unchanged, so the corresponding
 portions of the Jacobian ($\bm{L}$) and residual ($\bm{r}_p$)
 are exactly the same as in the quasi-static simulations. Including the
-inertial term in equation~\ref{eqn:quasi-static:residual:elasticity} yields
+inertial term in equation~(\ref{eqn:quasi-static:residual:elasticity}) yields
 \begin{equation}
   \label{eqn:dynamic:residual:elasticity}
   \begin{split}
@@ -541,10 +540,12 @@ so that the upper portion of the Jacobia
   \bm{K} = 
     \frac{1}{\Delta t^2} \int_{V} \rho \bm{N}_m^T\ \cdot \bm{N}_n \, dV.
 \end{equation}
+This matches the mass matrix in conventional solid mechanics
+finite-element formulations.
 
-Earthquake ruptures in which the slip has a short rise time tends to
+Earthquake ruptures in which the slip has a short rise time tend to
 introduce deformation at short length scales (high frequencies) that
-the numerical models can resolve accurately. This is especially true
+the numerical models cannot resolve accurately. This is especially true
 in spontaneous rupture simulations, because the rise time is sensitive
 to the evolution of the fault rupture. In order to reduce the
 introduction of deformation at such short length scales we add
@@ -558,7 +559,7 @@ artificial damping via Kelvin-Voigt visc
   \bm{u_d} = \bm{u} + \eta^* \Delta t \frac{\partial
     \bm{u}}{\partial t},
 \end{gather}
-where $\eta^*$ is a nomdimensional viscosity on the order of
+where $\eta^*$ is a nondimensional viscosity on the order of
 0.1--1.0. 
 
 % ------------------------------------------------------------------
@@ -593,7 +594,7 @@ magnitude. This results in a rather ill-
 magnitude. This results in a rather ill-conditioned system. We avoid
 this ill-conditioning by nondimensionalizing all of the quantities
 involved in the problem based upon user-specified scales
-\citep{PyLith:manual:1.6.2}, facilitating forming well-conditioned
+\citep{PyLith:manual:1.7.1}, facilitating forming well-conditioned
 systems of equations for problems across a wide range of spatial and
 temporal scales.
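A minimal sketch of how such scales might be supplied in a PyLith
parameter (.cfg) file follows; the component path and property names
here are assumptions for illustration and should be checked against
the PyLith manual rather than read as the exact interface:

    # Hypothetical .cfg fragment: characteristic scales used to
    # nondimensionalize a quasi-static problem (names assumed).
    [pylithapp.timedependent.normalizer]
    length_scale = 1.0*km
    shear_modulus = 30.0*GPa
    relaxation_time = 100.0*year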
 
@@ -614,7 +615,7 @@ rate), the integral of Brune's far-field
 \citep{Brune:1970}, a sine-cosine function developed by
 \citet{Liu:etal:2006}, and a user-defined time history. These are
 discussed in detail in the PyLith manual
-\citep{PyLith:manual:1.6.2}. PyLith allows specification of the slip
+\citep{PyLith:manual:1.7.1}. PyLith allows specification of the slip
 initiation time independently at each location as well as
 superposition of multiple earthquake ruptures with different origin
 times, thereby permitting complex slip behavior.
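As a concrete instance (written out here for illustration; the precise
forms implemented in the code are documented in the manual), the
integral of Brune's far-field time function gives a slip history that
rises smoothly to the final slip $d_\mathit{final}$ with
characteristic time $\tau$,
\begin{equation*}
  d(t) = d_\mathit{final}
  \left[1 - \left(1 + \frac{t}{\tau}\right) e^{-t/\tau}\right],
  \quad t \geq 0,
\end{equation*}
so that the corresponding slip rate, $\dot{d}(t) =
d_\mathit{final}\, t\, e^{-t/\tau} / \tau^{2}$, is the familiar Brune
pulse.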
@@ -623,11 +624,11 @@ times, thereby permitting complex slip b
 % ------------------------------------------------------------------
 \subsection{Spontaneous Fault Rupture}
 
-In a spontaneous (dynamic) fault rupture a constitutive model controls
-the tractions on the fault surface. The fault slip evolves based on
-the fault tractions as driven by the initial conditions, boundary
+In a spontaneous fault rupture a constitutive model controls the
+tractions on the fault surface. The fault slip evolves based on the
+fault tractions as driven by the initial conditions, boundary
 conditions and deformation. In our formulation of fault slip, we
-specify the fault slip and the fault tractions (Lagrange multipliers)
+specify the fault slip, and the fault tractions (Lagrange multipliers)
 are part of the solution (unknowns). The fault constitutive model
 places bounds on the Lagrange multipliers and the system of equations
 is nonlinear-- when a location on the fault is slipping, we must solve
@@ -700,7 +701,7 @@ significantly with slip, then the estima
 significantly with slip, then the estimate of how much slip is
 required to match the fault constitutive model will be
 poor. Similarly, in rare cases in which the fault slip extends across
-the entire domain, deformation extends much farther from the fault and
+the entire domain, deformation extends far from the fault and
 the estimate derived using only the fault DOF will be
 poor. In order to make this iterative procedure more robust so that it
 works well across a wide variety of fault constitutive models, we add
@@ -736,7 +737,7 @@ zero. The fault constitutive models incl
 zero. The fault constitutive models include static friction, linear
 slip-weakening, linear time-weakening, and Dieterich-Ruina rate-state
 friction with an aging law (see the PyLith manual
-\citep{PyLith:manual:1.6.2} for details).
+\citep{PyLith:manual:1.7.1} for details).
 
 % ------------------------------------------------------------------
 \section{Finite-Element Mesh Processing}
@@ -754,7 +755,7 @@ representation and manipulation is based
 representation and manipulation is based upon a directed acyclic graph
 representation of the \textit{covering} relation in a mesh. For
 example, faces cover cells, edges cover faces, and points cover
-edges. By focusing on the key topological relation, the interface can
+edges. By focusing on the key topological relations, the interface can
 be both concise and quite general. Using this generic API, PyLith is
 able to support one, two, and three dimensional meshes, with
 simplicial, hex, and even prismatic cell shapes, while using very
@@ -881,7 +882,7 @@ accommodate an arbitrary number of field
 accommodate an arbitrary number of fields, mixed discretizations,
 fields defined over a subset of the mesh, etc. Once these fields are
 defined, a substantial range of preconditioners can be assembled using
-only PyLith options for PETSc. Table~\ref{tab:solver:options} shows
+only PyLith options for PETSc. Table~\ref{tab:preconditioner:options} shows
 example preconditioners and the options necessary to construct them.
 
 Another option involves using the field split preconditioner in PETSc
@@ -911,26 +912,26 @@ which leads to a simple block diagonal p
   \end{array} \right).
 \end{equation}
 
-The elastic submatrix $K$, in the absence of boundary conditions,
+The elastic submatrix $\bm{K}$, in the absence of boundary conditions,
 has three translational and three rotational null modes. These are
 provided to the algebraic multigrid (AMG) preconditioner, such as the
-ML library \citep{ML:users:guide} or the PETSc GAMG preconditioner,
-in order to assure an accurate coarse grid solution. AMG mimics the
-action of traditional geometric multgrid, but it generates coarse
+ML library \citep{ML:users:guide} or the PETSc GAMG preconditioner, in
+order to assure an accurate coarse grid solution. AMG mimics the
+action of traditional geometric multigrid, but it generates coarse
 level operators and interpolation matrices using only the system
 matrix, treated as a weighted graph, rather than a separate
-description of the problem geometry, such as a mesh. We use
-PCFIELDSPLIT to split the elastic block from the fault block, and
-also to manage Schur complements. In this way, all block preconditioners,
-including those nested with multigrid, can be controlled from the
-options file without recompilation or special code.
+description of the problem geometry, such as a mesh. We split the
+elastic block from the fault block and also manage the Schur
+complements. In this way, all block preconditioners, including those
+nested with multigrid, can be controlled from the options file without
+recompilation or special code.
 
 We now turn our attention to evaluating the fault portion of the
 preconditioning matrix associated with the Lagrange multipliers, since
 stock PETSc preconditioners can handle the elastic portion as
 discussed in the previous paragraph. In computing
 $\bm{P_\mathit{fault}}$ we approximate $\bm{K}^{-1}$ with
-the inverse of the diagonal portion of $\bm{K}$. $\bm{L}$, which
+the inverse of the diagonal portion of $\bm{K}$. The matrix $\bm{L}$
 consists of integrals of products of basis functions over the fault
 faces. Its structure depends on the quadrature scheme and the choice
 of basis functions. For conventional low order finite-elements and
@@ -956,12 +957,12 @@ equation~(\ref{eqn:saddle:point:code}).
 %  $\bm{K}$.}
 
 Our preferred setup uses the field splitting options in PETSc to
-combine an AMG preconditioner for the elasticity submatrix with out
+combine an AMG preconditioner for the elasticity submatrix with our
 custom fault preconditioner for the Lagrange multiplier submatrix. See
-Section~\ref{sec:solvertest} for a comparison of preconditioner
-performance for an application involved a static simulation with
-multiple faults. It shows the clear superiority of this setup over
-several other possible preconditioning strategies.
+Section~\ref{sec:performance:benchmark} for a comparison of
+preconditioner performance for an application involving a static
+simulation with multiple faults. It shows the clear superiority of
+this setup over several other possible preconditioning strategies.
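A hedged sketch of the corresponding parameter settings is given
below. Only split_fields and the fs_fieldsplit_1 options also appear
in the table of example preconditioners; the remaining section and
option names are assumptions for illustration and should be checked
against the PyLith manual.

    # Sketch of the preferred setup: field splitting with ML algebraic
    # multigrid on the elasticity block and the custom fault
    # preconditioner on the Lagrange multiplier block.
    [pylithapp.problem.formulation]
    split_fields = True
    # Assumed name of the switch enabling the custom fault
    # preconditioner described in the text.
    use_custom_constraint_pc = True

    [pylithapp.petsc]
    # The fs_ options prefix is inferred from the table entries.
    fs_pc_type = fieldsplit
    fs_pc_fieldsplit_type = multiplicative
    fs_fieldsplit_0_pc_type = ml
    fs_fieldsplit_0_ksp_type = preonly
    fs_fieldsplit_1_pc_type = jacobi
    fs_fieldsplit_1_ksp_type = gmres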
 
 \subsection{Dynamic Simulations}
 
@@ -1048,7 +1049,7 @@ associated with the fault slip constrain
 associated with the fault slip constraint in solving the system of
 equations via a Schur's complement algorithm. We compute an initial
 residual assuming the increment in the solution is zero (i.e.,
-$\bm{du}_n = \bm{0}$ and $\bm{dl}_p = \bm{0}$,
+$\bm{du}_n = \bm{0}$ and $\bm{dl}_p = \bm{0}$),
 \begin{equation}
   \bm{r}^* = \begin{pmatrix} \bm{r}_n^* \\ \bm{r}_p^* \end{pmatrix} =
   \begin{pmatrix} \bm{b}_n \\ \bm{b}_p \end{pmatrix}
@@ -1169,10 +1170,10 @@ the problem size (number of DOF) for the
 the problem size (number of DOF) for the two different cell types
 (hexahedra and tetrahedra) are nearly the same. The suite of
 simulations examine problem sizes increasing by about a factor of two
-from $1.78\times 10^5$ DOF to $2.14\times 10^7$ DOF. The corresponding
-discretization sizes are 1520 m to 392 m for the hexahedral meshes and
-1744 m to 456 m for the tetrahedral meshes.
-Figure~\ref{fig:solvertest:mesh} shows the 1744 m resolution
+from $1.78\times 10^5$ DOF (1 process) to $1.10\times 10^7$ DOF (64
+processes). The corresponding discretization sizes are 2033 m to 500 m
+for the hexahedral meshes and 2326 m to 581 m for the tetrahedral
+meshes.  Figure~\ref{fig:solvertest:mesh} shows the 1846 m resolution
 tetrahedral mesh. As we will see in
 Section~\ref{sec:verification:quasi-static}, the hexahedral mesh for a
 given resolution in a quasi-static problem is slightly more accurate,
@@ -1212,87 +1213,103 @@ family of field split preconditioners, t
 family of field split preconditioners, the one with multiplicative
 composition minimizes the number of iterations. The custom
 preconditioner for the Lagrange multiplier submatrix greatly
-accelerates the convergence with an 80\% further reduction in the
+accelerates the convergence with an 80\% reduction in the
 number of iterations required for convergence.
 
 \subsection{Parallel Scaling Performance}
 
-The underlying PETSc solver infrastructure has demonstrated optimal scalability
-on the largest machines available today. However, very often computer science
-scalability results are based upon unrealistically simple problems which do not
-advance the scientific state-of-the-art. We will concentrate on explaining the
-sources of reduced scalability, and propose possible algorithmic mitigation.
+The underlying PETSc solver infrastructure has demonstrated optimal
+scalability on the largest machines available today \ref{??} \matt{Add
+  reference for this statement}. However, computer science scalability
+results are often based upon unrealistically simple problems which do
+not advance the scientific state-of-the-art. In evaluating the
+parallel scalability of PyLith, we consider the sources responsible
+for reducing the scalability and propose possible steps for
+mitigation.
 
-The main impediment to scalability in PyLith is load imbalance in the solver stage.
-This imbalance is the combination of three effects: the inherent imbalance in the
-partition of an unstructured mesh, the use of a cell partition, and lack of
-incorporation of cohesive cells in the partition. In our full test case, matrix-vector
-multiplication (the PETSc \PetscFunction{MatMult} function) could have a load imbalance
-of up to 30\% on 128 processors. The cell partition, calculated with ParMetis, which is
-necessary in order to achieve good balance for the finite element integration, does not
-take into account Dirichlet boundary conditions or unknowns on the fault, which can exacerbate
-the imbalance. However, elimination of constrained unknowns preserves the symmetry
-of the overall systems, and can result in better conditioned linear systems.
+The main impediment to scalability in PyLith is load imbalance during
+the solve.  This imbalance is the combination of three effects: the
+inherent imbalance in partitioning an unstructured mesh, partitioning
+based on cells rather than DOF, and weighting the cohesive cells the
+same as conventional bulk cells while partitioning. In this
+performance benchmark matrix-vector multiplication (the PETSc
+\texttt{MatMult} function) has a load imbalance of up to 20\%
+on 64 processors. The cell partition balances the number of cells
+across the processes using ParMetis \cite{Karypis:etal:1999} in order
+to achieve good balance for the finite element integration. This does
+not take into account a reduction in the number of DOF associated with
+constraints from Dirichlet boundary conditions or the additional DOF
+associated with the Lagrange multiplier constraints, which can
+exacerbate any imbalance. Nevertheless, eliminating DOF associated
+with Dirichlet boundary conditions preserves the symmetry of the
+overall system and, in many cases, results in better-conditioned
+linear systems.
 
 We evaluate the parallel performance via a weak scaling
 criterion. That is, we run simulations on various numbers of
 processors/cores with an increase in the problem size as the number of
-processors increases to maintain the same workload (number of DOF) for
-each core. Ideally, the time for the various stages of the simulation
-should be independent of the number of processors/cores. For this
-performance benchmark we use the entire suite of hexahedral and
-tetrahedral meshes described earlier that range in size from
-$1.78\times 10^5$ DOF to $2.14\times 10^7$ DOF. In each of these
-simulations, we compare the standard Additive Schwarz parallel
-preconditioner~\cite{Smith:etal:1996} with an approximate block factorization
-preconditioner~\cite{elman2008tcp} built using the PETSc \PetscClass{PC FieldSplit}
-object. We test additive, multiplicative, and Schur complement block compositions,
-and employ the ML algebraic multigrid preconditioner for the elasticity
+processes increases to maintain the same workload (e.g., number of
+cells and number of DOF) for each core. In ideal weak scaling the time
+for the various stages of the simulation is independent of the number
+of processes. For this performance benchmark we use the entire suite
+of hexahedral and tetrahedral meshes described earlier that range in
+size from $1.78\times 10^5$ DOF (1 process) to $1.10\times 10^7$ DOF
+(64 processes). We employ the AMG preconditioner for the elasticity
 submatrix and our custom preconditioner for the Lagrange multipliers
 submatrix. We ran the simulations on a Beowulf cluster comprised of 24
 compute nodes connected by QDR Infiniband, where each compute node
-consisted of two quad-core Intel Xeon E5620 processors with 24 GB
-of RAM. Simulations run on eight or fewer cores were run on a single
-compute node. In the USGS configuration, four dual-quadcore chips share a single backplane.
-Thus, in addition to algorithm bottlenecks, runtime performance is potentially impeded
-by core/memory affinity, memory bandwidth, and communication among compute nodes.
+consisted of two quad-core Intel Xeon E5620 processors with 24 GB of
+RAM. Simulations using eight or fewer cores were run on a single
+compute node with processes distributed across processors and then
+cores. For example, the two process simulation used one core on each
+of two processors. In addition to algorithm bottlenecks, runtime
+performance is potentially impeded by core/memory affinity, memory
+bandwidth, and communication among compute nodes.
 
-The single node scaling for PyLith is almost completely controlled by the
-available memory bandwidth. Good illustrations of the memory system
-performance are given by the \PetscEvent{VecAXPY}, \PetscEvent{VecMAXPY} and
-\PetscEvent{VecMDot} operations reported in the log summary~\cite{PETSc:manual},
-since these operations are limited by available memory bandwidth rather than
-processor flop rate. From Table~\ref{tab:memBandwidth}, we see that we saturate
-the memory system using two or three processes per processor, since scaling
-plateaus from 2 to 4 processes, but shows good scaling from 8 to 16. This lack of
-bandwidth will depress overall performance, but should not affect the inter-node
-scaling of the application.
+The single node scaling for PyLith (eight processes or less in this
+case) is almost completely controlled by the available memory
+bandwidth. Good illustrations of the memory system performance are
+given by the \texttt{VecAXPY}, \texttt{VecMAXPY} and \texttt{VecMDot}
+operations reported in the log summary \cite{PETSc:manual}. These
+operations are limited by available memory bandwidth rather than the
+rate at which a processor can perform floating-point operations. From
+Table~\ref{tab:solvertest:memory:events}, we see that we saturate the memory
+system using two processes per processor, since scaling plateaus from
+2 to 4 processes, but shows good scaling from 8 to 16 processes. This
+lack of memory bandwidth will depress overall performance, but should
+not affect the inter-node scaling of the application.
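The event timings and load-imbalance figures discussed here come from
PETSc's log summary. As a minimal sketch (assuming the usual PyLith
convention of passing PETSc options through a [pylithapp.petsc]
section), the logging can be requested directly from the parameter
file:

    # Enable PETSc performance logging; the resulting report includes
    # times, flop rates, and load imbalance for events such as
    # VecAXPY, VecMDot, VecMAXPY, MatMult, PCSetUp, PCApply, KSPSolve.
    [pylithapp.petsc]
    log_summary = true
    # Optional: also print the assembled Krylov solver and
    # preconditioner configuration for the run.
    ksp_view = true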
 
-Machine network performance can be elucidated by the \PetscEvent{VecMDot} operation
-for reductions, and \PetscEvent{MatMult} for point-to-point communication. In
-Table~\ref{tab:memBandwidth} we see that the vector reduction shows good scaling up
-to 64 processes \matt{Recheck with new 128 results}. Similarly in Table~\ref{tab:solver},
-we see that \PetscEvent{MatMult} has good scalability, but that it is the smaller component
-of overall solver time. The AMG setup time increases roughly linearly with the number of
-processes. It is often not included in weak scaling studies since it is amortized over
-the iteration, but it is responsible for most of the deviation from perfect weak scaling.
-The scalability of the AMG apply decays more slowly, but there is still serious deterioation
-by 64 processes. Here we could trade preconditioner strength for scalability by turning down
-the work done on coarse AMG grids, so that the solver uses more iterations which scale very well.
-However, that would increase overall solver time and thus would not be the choice to maximize
-scientific output.
+Machine network performance can be elucidated by the \texttt{VecMDot}
+operation for vector reductions, and \texttt{MatMult} for
+point-to-point communication. In
+Table~\ref{tab:solvertest:memory:events} we see that the vector
+reduction shows good scaling up to 64 processes. Similarly in
+Table~\ref{tab:solvertest:solver:events}, we see that \texttt{MatMult}
+has good scalability, but that it is a small fraction of the overall
+solver time. The AMG preconditioner setup (\texttt{PCSetUp}) and
+application (\texttt{PCApply}) dominate the overall solver time. The
+AMG preconditioner setup time increases roughly linearly with the
+number of processes. Note that many weak scaling studies do not
+include this event, because it is amortized over the
+iteration. Nevertheless, in our benchmark it is responsible for most
+of the deviation from perfect weak scaling.  The scalability of the
+application of the AMG preconditioner decays more slowly, but there is
+still serious deterioration by 64 processes. We could trade
+preconditioner strength for scalability by reducing the work done on
+the coarse AMG grids, so that the solver uses more iterations which
+scale very well.  However, that would increase overall solver time and
+thus would not be the choice to maximize scientific output.
 
 Figure~\ref{fig:solvertest:scaling} illustrates the excellent parallel
 performance for the finite-element assembly routines (reforming the
-Jacobian sparse matrix and computing the residual). From the ASM performance,
-we see that the basic solver building-blocks, like parallel sparse matrix-vector
-multiplication, scale well. However, the preconditioner itself is not scalable, and
-the number of iterates increases strongly with process count. The introduction of
-Schur complement methods and AMG slows the growth considerably, but future work will
-pursue the ultimate goal of iteration counts independent of process number.
-
-%Separately plot VecMDot and MatMult. This shows that increasing iterates have a penalty from sync VecMDot, and MatMult
-%scales better. Could tradeoff with iterative method like BCGS.
+Jacobian sparse matrix and computing the residual). From the ASM
+performance, we see that the basic solver building-blocks, like
+parallel sparse matrix-vector multiplication, scale well. However, the
+ASM preconditioner itself is not scalable, and the number of iterates
+increases strongly with the number of processes. The introduction of Schur
+complement methods and an AMG preconditioner slows the growth
+considerably, but future work will pursue the ultimate goal of
+iteration counts independent of the number of processes.
 
 % ------------------------------------------------------------------
 \section{Code Verification Benchmarks}
@@ -1307,19 +1324,19 @@ verify that the code properly solves the
 verify that the code properly solves the numerical problem.  These
 benchmark problems include quasi-static strike-slip and reverse
 viscoelastic simulations and various problems in the suite of
-spontaneous dynamic rupture benchmarks developed by the Southern
+dynamic spontaneous rupture benchmarks developed by the Southern
 California Earthquake Center (SCEC) and the United States Geological
 Survey \citep{Harris:etal:SRL:2009}. In this section we focus on two
 benchmarks that test different scientific applications: quasi-static
 relaxation of a Maxwell viscoelastic material subjected to multiple
 earthquake cycles involving slip and steady creep on a vertical
 strike-slip fault \citep{Savage:Prescott:1978} and supershear
-spontaneous dynamic rupture of a 60 degree dipping normal fault in a
+dynamic spontaneous rupture of a 60 degree dipping normal fault in a
 Drucker-Prager elastoplastic medium. This second benchmark corresponds
-to benchmark TPV13 in the SCEC suite of spontaneous dynamic rupture
+to benchmark TPV13 in the SCEC suite of dynamic spontaneous rupture
 benchmarks.
 
-\subsection{Quasi-static}
+\subsection{Quasi-static Benchmark}
 \label{sec:verification:quasi-static}
 
 As a test of our quasi-static solution, we compare our numerical
@@ -1361,13 +1378,13 @@ domain. Finally, we fix the x-displaceme
 domain. Finally, we fix the x-displacements on the -y and +y faces to
 enforce symmetry consistent with an infinitely long strike-slip fault.
 
-We examine four difference numerical solutions considering the effects
+We examine four different numerical solutions considering the effects
 of cell type (hexahedral versus tetrahedral) and discretization
 size. In our coarse hexahedral mesh we use a uniform resolution of 20
 km. In our higher resolution hexahedral mesh we refine an inner region
 (480 km by 240 km by 100 km) by a factor of three, yielding a
 resolution near the center of the fault of 6.7 km. For the
-tetrahedral meshes, we match the discretization size of the hexehedral
+tetrahedral meshes, we match the discretization size of the hexahedral
 mesh near the center of the fault (20km or 6.7 km) while increasing
 the discretization size in a geometric progression at a rate of
 1.02. This results in a maximum discretization size of approximately
@@ -1404,8 +1421,8 @@ second earthquake cycle, the far-field n
 second earthquake cycle, the far-field numerical solution does not yet
 accurately represent steady plate motion and the numerical simulations
 underpredict the displacement. By the tenth earthquake cycle, steady
-plate motion is accurately simulated and the numerical results thus
-match the analytical solution.
+plate motion is accurately simulated and the numerical results match
+the analytical solution.
 
 Within about one elastic thickness of the fault the effect of the
 resolution of the numerical models becomes apparent. We find large
@@ -1415,21 +1432,21 @@ solution. The 6.7 km hexahedral solution
 solution. The 6.7 km hexahedral solution is indistinguishable from the
 analytical solution in Figure~\ref{fig:savage:prescott:profiles}(b);
 the 6.7 km tetrahedral solution slightly underpredicts the analytical
-solution for times later in the earthquake cycle.  The greater
+solution for times late in the earthquake cycle.  The greater
 accuracy of the hexahedral cells relative to the tetrahedral cells
 with the same nominal discretization size for quasi-static solutions
 is consistent with our findings for other benchmarks. The greater
 number of polynomial terms in the basis functions of the hexahedra
 allows the model to capture a more complex deformation field.
 
-\subsection{Dynamic}
+\subsection{Dynamic Benchmark}
 \label{sec:verification:dynamic}
 
 SCEC Spontaneous Rupture Benchmark TPV13 focuses on modeling an
 earthquake that produces extreme (very large) ground motions
 associated with supershear rupture towards the ground surface on a
-dipping fault a large stress drop to generate large slip and fast slip
-rates \citep{Harris:etal:SRL:2011}. It uses a Drucker-Prager
+dipping fault with a large stress drop to generate large slip and fast
+slip rates \citep{Harris:etal:SRL:2011}. It uses a Drucker-Prager
 elastoplastic bulk rheology and a slip-weakening friction model in a
 depth-dependent initial stress field. Figure~\ref{fig:tpv13:geometry}
 shows the geometry of the benchmark problem and the size of the domain
@@ -1441,12 +1458,12 @@ both triangular and quadrilateral discre
 both triangular and quadrilateral discretizations with resolutions on
 the fault of 50 m, 100 m, and 200 m for TPV13-2D and 100 m and 200 m
 for TPV13. We gradually coarsen the mesh with distance from the fault
-by increasing the discretization size at a geometric rate of
-2\%. This provides high resolution at the fault surface to resolve the
-small scale features of the rupture process with less resolution at
-the edges of the boundary where the solution is much
-smoother. Figure~\ref{fig:tpv13-2d:mesh} shows the triangular mesh for a
-discretization size of 100 m on the fault.
+by increasing the discretization size at a geometric rate of 2\%. This
+provides high resolution at the fault surface to resolve the small
+scale features of the rupture process with less resolution at the
+edges of the boundary where the solution is much
+smoother. Figure~\ref{fig:tpv13-2d:mesh} shows the triangular mesh for
+a discretization size of 100 m on the fault.
 
 Rupture initiates due to a low static coefficient of friction in the
 nucleation region. Figure~\ref{fig:tpv13-2d:stress:slip} illustrates
@@ -1480,11 +1497,11 @@ cohesive zone \citep{Rice:1993}. The fin
 cohesive zone \citep{Rice:1993}. The finer meshes provide sufficient
 resolution of the cohesive zone so there is very little high-frequency
 oscillation in the slip rate time histories. The triangular cells
-result in less oscillation compared with quadrilateral cells.
+generate less oscillation compared with quadrilateral cells.
 
 In this problem without an analytical solution, as in all of the
 benchmarks in the SCEC spontaneous rupture benchmark suite, we rely on
-comparison with other spontaneous dynamic rupture modeling codes to
+comparison with other dynamic spontaneous rupture modeling codes to
 verify the numerical implementation in
 PyLith. Figure~\ref{fig:tpv13-2d:slip:rate}(b) compares the slip rate
 time histories from PyLith with four other codes (see
@@ -1529,8 +1546,8 @@ produces a small time shift in the time 
 
 From the 2-D and 3-D versions of the SCEC spontaneous rupture
 benchmark TPV13, we conclude that PyLith performs very similarly
-to other finite-element and finite-difference spontaneous dynamic
-rupture modeling codes. In particular it is well-suited to problems
+to other finite-element and finite-difference dynamic
+spontaneous rupture modeling codes. In particular it is well-suited to problems
 with complex geometry as we are able to vary the discretization size
 while simulating a dipping normal fault. The code accurately captures
 supershear rupture and properly implements a Drucker-Prager
@@ -1554,7 +1571,7 @@ in PyLith with excellent agreement with 
 in PyLith with excellent agreement with the analytical solution for
 viscoelastic relaxation and strike-slip faulting over multiple
 earthquake cycles and excellent agreement with other codes for
-supershear spontaneous dynamic rupture on a dipping normal fault
+supershear dynamic spontaneous rupture on a dipping normal fault
 embedded in an elastoplastic domain. Consequently, we believe this
 methodology will provide a promising avenue for modeling the
 earthquake cycle through coupling of quasi-static simulations of the
@@ -1676,7 +1693,7 @@ simulations of earthquake rupture propag
   \noindent\includegraphics{figs/savageprescott_profiles}
   \caption{Comparison of displacement profiles perpendicular to the
     fault in the Savage and Prescott benchmark during earthquake
-    cycles (a) 2 and (b) 10. The displacement values shown are
+    cycles (a) two and (b) ten. The displacement values shown are
     relative to the values at the beginning of the earthquake cycle to
     facilitate comparison between the analytical solution and the
     numerical models. Both the analytical and numerical simulations
@@ -1689,8 +1706,8 @@ simulations of earthquake rupture propag
     coarser (20 km) resolutions are unable to match the details of the
     displacement field at distances less than about one elastic
     thickness, but all of the numerical models provide a good fit to
-    the analytical solution at distances greater than 2-3 times the
-    elastic thickness.}
+    the analytical solution in the tenth earthquake cycle at distances
+    greater than 2--3 times the elastic thickness.}
   \label{fig:savage:prescott:profiles}
 \end{figure}
 
@@ -1701,7 +1718,7 @@ simulations of earthquake rupture propag
     friction, a depth-dependent stress field, and a normal fault with a
     60 degree dip angle. The 2-D version corresponds to the vertical
     slice shown by the dashed line. The red dots denote locations on
-    the fault used in the comparison of the vertical slip dates
+    the fault used in the comparison of the vertical slip rates
     (Figures~\ref{fig:tpv13-2d:slip:rate}
     and~\ref{fig:tpv13:slip:rate}). The blue dots indicate locations
     on the ground surface used in the comparison of fault normal and
@@ -1711,7 +1728,7 @@ simulations of earthquake rupture propag
 
 \begin{figure}
   \noindent\includegraphics[width=84mm]{figs/tpv13-2d_mesh}
-  \caption{Finite-element mesh comprised of triangular cells for SCEC
+  \caption{Finite-element mesh comprised of quadrilateral cells for SCEC
     spontaneous rupture benchmark TPV13-2D. The discretization size is 100
     m on the fault surface and increases at a geometric rate of 2\%
     with distance from the fault. We employ this same spatial
@@ -1813,12 +1830,13 @@ simulations of earthquake rupture propag
 % ------------------------------------------------------------------
 % TABLES
 % ------------------------------------------------------------------
-\begin{table*}
+\begin{table}
   \caption{Example Preconditioners for the Saddle Point Problem in
     Equation~(\ref{eqn:saddle:point})\tablenotemark{a}}
-\label{tab:solver:options}
+  \label{tab:preconditioner:options}
 \centering
 \begin{tabular}{ll}
+  AMG with additive relaxation                         & AMG with multiplicative relaxation \\
   $\begin{pmatrix}\bm{K} & \bm{0} \\ \bm{0} & \bm{I}\end{pmatrix}$~\label{prec:add} & $\begin{pmatrix}\bm{K} & \bm{L}^T \\ \bm{0} & \bm{I}\end{pmatrix}$ \label{prec:mult}\\
   \texttt{[pylithapp.problem.formulation]}             & \texttt{[pylithapp.problem.formulation]} \\
   \texttt{split\_fields = True}                        & \texttt{split\_fields = True} \\
@@ -1832,6 +1850,7 @@ simulations of earthquake rupture propag
   \texttt{fs\_fieldsplit\_1\_pc\_type = jacobi}        & \texttt{fs\_fieldsplit\_1\_pc\_type = jacobi} \\
   \texttt{fs\_fieldsplit\_1\_ksp\_type = gmres}        & \texttt{fs\_fieldsplit\_1\_ksp\_type = gmres} \\
   \smallskip \\
+  Schur complement, upper factorization                       & Schur complement, full factorization \\
   $\begin{pmatrix}\bm{K} & \bm{L}^T \\ \bm{0} & \bm{S}\end{pmatrix}$ \label{prec:schurUpper} & $\begin{pmatrix}\bm{I} & \bm{0} \\ \bm{B}^T \bm{A}^{-1} & \bm{I}\end{pmatrix}\begin{pmatrix}\bm{A} & \bm{0} \\ \bm{0} & \bm{S}\end{pmatrix}\begin{pmatrix}\bm{I} & \bm{A}^{-1} \bm{B} \\ \bm{0} & \bm{I}\end{pmatrix}$ \label{prec:schurFull}\\
   \texttt{[pylithapp.problem.formulation]}                    & \texttt{[pylithapp.problem.formulation]} \\
   \texttt{split\_fields = True}                               & \texttt{split\_fields = True} \\
@@ -1847,8 +1866,16 @@ simulations of earthquake rupture propag
   \texttt{fieldsplit\_1\_pc\_type = jacobi}                   & \texttt{fieldsplit\_1\_pc\_type = jacobi} \\
   \texttt{fieldsplit\_1\_ksp\_type = gmres}                   & \texttt{fieldsplit\_1\_ksp\_type = gmres} \\
 \end{tabular}
-\tablenotetext{a}{ADD STUFF HERE}
-\end{table*}
+\tablenotetext{a}{Four examples of preconditioners often used to
+  accelerate convergence in saddle point problems. Below the
+  mathematical expression for the preconditioner, we show the PyLith
+  parameters used to construct the preconditioner. In the performance
+  benchmark we consider the AMG preconditioner with multiplicative
+  relaxation, the Schur complement preconditioner with upper
+  factorization, and the Schur complement preconditioner with full
+  factorization. The AMG preconditioner with additive relaxation is
+  shown for completeness.}
+\end{table}
 
 \clearpage
 \begin{table}
@@ -1887,12 +1914,12 @@ simulations of earthquake rupture propag
 
 \begin{table}
 \caption{Performance Benchmark Memory System Evaluation\tablenotemark{a}}
-\label{tab:memBandwidth}
+\label{tab:solvertest:memory:events}
 \centering
 \begin{tabular}{rcc}
   \# Cores & Load Imbalance & MFlops/s \\
   \hline
-  \multicolumn{3}{c}{\PetscEvent{VecMDot}} \\
+  \multicolumn{3}{c}{\texttt{VecMDot}} \\
      1 & 1.0 &  2188 \\
      2 & 1.0 &  3969 \\
      4 & 1.0 &  5511 \\
@@ -1900,9 +1927,8 @@ simulations of earthquake rupture propag
     16 & 1.3 & 10249 \\
     32 & 1.2 &  4270 \\
     64 & 1.2 & 12299 \\
-   128 & 1.3 &  2019 \\
   \hline
-  \multicolumn{3}{c}{\PetscEvent{VecAXPY}} \\
+  \multicolumn{3}{c}{\texttt{VecAXPY}} \\
      1 & 1.0 &  1453 \\
      2 & 1.1 &  2708 \\
      4 & 1.1 &  5001 \\
@@ -1910,9 +1936,8 @@ simulations of earthquake rupture propag
     16 & 1.3 &  8157 \\
     32 & 1.2 & 13876 \\
     64 & 1.2 & 25807 \\
-   128 & 1.3 & 58759 \\
   \hline
-  \multicolumn{3}{c}{\PetscEvent{VecMAXPY}} \\
+  \multicolumn{3}{c}{\texttt{VecMAXPY}} \\
      1 & 1.0 &  1733 \\
      2 & 1.1 &  3283 \\
      4 & 1.1 &  4991 \\
@@ -1920,16 +1945,19 @@ simulations of earthquake rupture propag
     16 & 1.3 & 11050 \\
     32 & 1.2 & 21680 \\
     64 & 1.2 & 42697 \\
-   128 & 1.3 & 84691 \\
   \hline
 \end{tabular}
-\tablenotetext{a}{Examination of memory system performance using three PETSc vector operations.}
+\tablenotetext{a}{Examination of memory system performance using three
+  PETSc vector operations for simulations with the hexahedral meshes. \texttt{VecMDot}
+  corresponds to the operation for vector reductions, \texttt{VecAXPY}
+  corresponds to vector scaling and addition, and \texttt{VecMAXPY}
+  corresponds to multiple vector scaling and addition.}
 \end{table}
 
 
 \begin{table}
 \caption{Performance Benchmark Solver Evaluation\tablenotemark{a}}
-\label{tab:solver}
+\label{tab:solvertest:solver:events}
 \centering
 \begin{tabular}{lrrr}
   Event & Calls & Time (s) & MFlops/s \\
@@ -1958,14 +1986,14 @@ simulations of earthquake rupture propag
   PCApply  &  70 & 14.8 & 10229 \\
   KSPSolve &   1 & 47.5 &  7067 \\
   \hline
-  \multicolumn{4}{c}{p = 128}  \\
-  MatMult  & 222 &   22.1 & 9880 \\
-  PCSetUp  &   1 &  314.1 &  169 \\
-  PCApply  &  71 & 6865.0 &   46 \\
-  KSPSolve &   1 & 7198.9 &   99 \\
-  \hline
 \end{tabular}
-\tablenotetext{a}{Examination of memory system performance using three PETSc vector operations.}
+\tablenotetext{a}{Examination of solver performance using three of the
+  main events comprising the linear solve for simulations with the
+  hexahedral meshes and 8, 16, 32, and 64 processes. The
+  \texttt{KSPSolve} event encompasses the entire linear
+  solve. \texttt{MatMult} corresponds to matrix-vector
+  multiplications. \texttt{PCSetUp} and \texttt{PCApply} correspond to
+  the setup and application of the AMG preconditioner.}
 \end{table}
 
 
@@ -2030,7 +2058,7 @@ simulations of earthquake rupture propag
   preconditioners for tetrahedral and hexahedral discretizations and
   three problem sizes (S1 with $1.8\times 10^5$ DOF, S2 with
   $3.5\times 10^5$ DOF, and S3 with $6.9\times 10^5$ DOF). The field
-  split preconditioner with multiplicative composittion and the custom
+  split preconditioner with multiplicative composition and the custom
   fault block preconditioner yields good performance with only a
   fraction of the iterates as the other preconditioners and a small
   increase with problem size.}
diff -r 9268656f1d00 -r 56ed4e9edde6 references.bib
--- a/references.bib	Mon Aug 13 09:33:09 2012 -0700
+++ b/references.bib	Mon Aug 13 14:56:14 2012 -0700
@@ -982,7 +982,7 @@
   title     = {Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations},
   publisher = {Cambridge University Press},
   year      = {1996},
-  url       = {http://www.mcs.anl.gov/\~{ }bsmith/ddbook.html}
+  note       = {http://www.mcs.anl.gov/\~{ }bsmith/ddbook.html}
 }
 
 @article{elman2008tcp,
@@ -1093,6 +1093,50 @@
   doi     = {10.3233/SPR-2009-0249}
 }
 
+@article{Karypis:etal:1999,
+  author =	 {Karypis, G. and Aggarwal, R. and Kumar, V. and
+                  Shekhar, S.},
+  title =	 {Multilevel hypergraph partitioning: Applications in
+                  {VLSI} domain},
+  journal =	 {IEEE Transactions on Very Large Scale Integration
+                  (VLSI) Systems},
+  year =	 1999,
+  month =	 mar,
+  volume =	 7,
+  number =	 1,
+  pages =	 {69--79}
+}
+
 @Book{Zienkiewicz:Taylor:2005,
   author = 	 {Zienkiewicz, O.~C. and Taylor, R.~L. and Zhu, J.~Z.},
   title = 	 {The Finite-Element Method: Its Basis and Fundamentals},
@@ -1118,7 +1162,7 @@
    Number      = "ANL-95/11 - Revision 3.1",
    Institution = "Argonne National Laboratory",
    Year              = 2010,
-  url    = {http://www.mcs.anl.gov/petsc}
+  note    = "http://www.mcs.anl.gov/petsc"
 }
 
 @InProceedings{PETSc:efficient,
@@ -1133,14 +1177,14 @@
     Year           = 1997,
 }
 
-@Manual{PyLith:manual:1.6.2,
-  title = 	 {PyLith User Manual, Version 1.6.2},
+@Manual{PyLith:manual:1.7.1,
+  title = 	 {PyLith User Manual, Version 1.7.1},
   author = 	 {Aagaard, B. and Kientz, S. and Knepley, M. and
                   Somala, S. and Strand, L. and Williams, C.},
   organization = {Computational Infrastructure for Geodynamics (CIG)},
   address = 	 {University of California, Davis},
-  year = 	 2011,
-  note =         {http://www.geodynamics.org/cig/software/pylith/pylith\_manual-1.6.2.pdf}
+  year = 	 2012,
+  note =         {http://www.geodynamics.org/cig/software/pylith/pylith\_manual-1.7.1.pdf}
 }
 
 @book{Saad03,


