[aspect-devel] Assembly speedup

Wolfgang Bangerth bangerth at math.tamu.edu
Tue Oct 8 14:57:21 PDT 2013


> from my profiles the walltime in local_assemble_stokes_system for instance
> is spend in this nested loop
>   if (rebuild_stokes_matrix)
>            for (unsigned int i=0; i<dofs_per_cell; ++i)
>              for (unsigned int j=0; j<dofs_per_cell; ++j)
>                data.local_matrix(i,j) += ( eta * 2.0 *
> (scratch.grads_phi_u[i] * scratch.grads_phi_u[j])
>                                            -
> (material_model->is_compressible()
>                                               ?
>                                               eta * 2.0/3.0 *
> (scratch.div_phi_u[i] * scratch.div_phi_u[j])
>                                               :
>                                               0)
>                                            - (pressure_scaling *
>                                               scratch.div_phi_u[i] *
> scratch.phi_p[j])
>                                            - (pressure_scaling *
>                                               scratch.phi_p[i] *
> scratch.div_phi_u[j]))
>                                          *
> scratch.finite_element_values.JxW(q);
>
>   moving the
> material_model->is_compressible()
> ?
>
> outside the loop made this part of the code run 35% faster.
> other then that i do not observe any conditionals or are calls
> like scratch.phi_p[i] more complex then just returning a value?

Just for everyone's reference -- using the last few patches makes 
assembly on Timo's recently added 3dbox.prm significantly faster 
(timings for the first 20 time steps on 16 processors):


revision 1930 (baseline):
+---------------------------------------------+------------+------------
| Total wallclock time elapsed since start    |      43.4s |
|                                             |            |
| Section                         | no. calls |  wall time | % of total
+---------------------------------+-----------+------------+------------
| Assemble Stokes system          |        23 |      7.83s |        18%
| Assemble temperature system     |        23 |      19.4s |        45%
| Build Stokes preconditioner     |         4 |      2.81s |       6.5%
| Build temperature preconditioner|        23 |     0.735s |       1.7%
| Solve Stokes system             |        23 |      8.19s |        19%
| Solve temperature system        |        23 |      1.13s |       2.6%
| Initialization                  |         4 |     0.128s |      0.29%
| Postprocessing                  |        21 |     0.741s |       1.7%
| Refine mesh structure, part 1   |         3 |     0.401s |      0.92%
| Refine mesh structure, part 2   |         3 |     0.103s |      0.24%
| Setup dof systems               |         4 |      1.52s |       3.5%
+---------------------------------+-----------+------------+------------


revision 1931 (only loop over temperature/compositional fields in 
advection assembly):
+---------------------------------------------+------------+------------
| Total wallclock time elapsed since start    |      31.6s |
|                                             |            |
| Section                         | no. calls |  wall time | % of total
+---------------------------------+-----------+------------+------------
| Assemble Stokes system          |        23 |      7.83s |        25%
| Assemble temperature system     |        23 |      6.87s |        22%
| Build Stokes preconditioner     |         4 |      2.83s |         9%
| Build temperature preconditioner|        23 |     0.736s |       2.3%
| Solve Stokes system             |        23 |      8.73s |        28%
| Solve temperature system        |        23 |      1.21s |       3.8%
| Initialization                  |         4 |     0.125s |       0.4%
| Postprocessing                  |        21 |     0.741s |       2.3%
| Refine mesh structure, part 1   |         3 |     0.398s |       1.3%
| Refine mesh structure, part 2   |         3 |     0.104s |      0.33%
| Setup dof systems               |         4 |      1.54s |       4.9%
+---------------------------------+-----------+------------+------------


revision 1932 (move is_compressible() out of the inner loop of Stokes 
assembly):
+---------------------------------------------+------------+------------
| Total wallclock time elapsed since start    |      27.7s |
|                                             |            |
| Section                         | no. calls |  wall time | % of total
+---------------------------------+-----------+------------+------------
| Assemble Stokes system          |        23 |      5.31s |        19%
| Assemble temperature system     |        23 |      6.97s |        25%
| Build Stokes preconditioner     |         4 |      2.79s |        10%
| Build temperature preconditioner|        23 |     0.719s |       2.6%
| Solve Stokes system             |        23 |       7.5s |        27%
| Solve temperature system        |        23 |      1.09s |       3.9%
| Initialization                  |         4 |     0.124s |      0.45%
| Postprocessing                  |        21 |     0.739s |       2.7%
| Refine mesh structure, part 1   |         3 |     0.399s |       1.4%
| Refine mesh structure, part 2   |         3 |     0.104s |      0.37%
| Setup dof systems               |         4 |      1.53s |       5.5%
+---------------------------------+-----------+------------+------------


So temperature assembly now only takes 35% of the time it used to take, 
Stokes assembly 68%. Good job, Thomas & Timo!

I'm working on filtering the local systems, let's see if that helps some 
more.

Best
  Wolfgang

-- 
------------------------------------------------------------------------
Wolfgang Bangerth               email:            bangerth at math.tamu.edu
                                 www: http://www.math.tamu.edu/~bangerth/



More information about the Aspect-devel mailing list