<div dir="ltr">With respect to local_assemble_advection_system, I see a speedup of almost 20x using linear elements for temperature. <div>However, copy_local_to_global on 512 cores still takes too much time.</div><div>With the new patches it runs 10% faster, but a lot of time is still spent inserting matrix values for off-process entries.</div>

<div>I will do some timing using MPI_Wtime to make sure we are not looking at profiling overhead.</div><div>If that gives the same results, I will post this on the Trilinos forum.</div><div><br></div><div>Cheers</div><div>
Thomas</div>
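<div>For reference, a minimal sketch of this kind of hand-rolled timing. It uses std::chrono as a stand-in so that it runs without MPI; in the actual runs the clock would be MPI_Wtime, and the maximum over all ranks would be the number to compare. SectionTimer and its members are hypothetical names, not part of ASPECT, deal.II or Trilinos.</div>

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper (not part of ASPECT, deal.II or Trilinos):
// accumulates wall-clock time and call counts per named section.
class SectionTimer
{
public:
  // Run `work` once and add its elapsed wall time to section `name`.
  template <typename Work>
  void run(const std::string &name, Work &&work)
  {
    assert(!name.empty());

    // Assumption: in an MPI run, MPI_Wtime() would replace steady_clock.
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();

    Entry &e = entries_[name];
    e.seconds += std::chrono::duration<double>(stop - start).count();
    ++e.calls;
  }

  // Accumulated wall time in seconds; 0 for a section never timed.
  double seconds(const std::string &name) const
  {
    const auto it = entries_.find(name);
    return it == entries_.end() ? 0.0 : it->second.seconds;
  }

  // Number of times the section was run; 0 for a section never timed.
  unsigned int calls(const std::string &name) const
  {
    const auto it = entries_.find(name);
    return it == entries_.end() ? 0u : it->second.calls;
  }

private:
  struct Entry
  {
    double       seconds = 0.0;
    unsigned int calls   = 0;
  };
  std::map<std::string, Entry> entries_;
};
```

<div>Wrapping only the copy_local_to_global call in run() and comparing the accumulated time with the profiler's number would show whether the cost is real or mostly profiling overhead.</div>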
<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 9, 2013 at 12:42 AM, Wolfgang Bangerth <span dir="ltr">&lt;<a href="mailto:bangerth@math.tamu.edu" target="_blank">bangerth@math.tamu.edu</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
revision 1932 (move is_compressible() out of the inner loop of Stokes<br>
assembly):<br>
+---------------------------------------------+------------+------------<br>
| Total wallclock time elapsed since start    |      27.7s |<br>
|                                             |            |<br>
| Section                         | no. calls |  wall time | % of total<br>
+---------------------------------+-----------+------------+------------<br>
| Assemble Stokes system          |        23 |      5.31s |        19%<br>
| Assemble temperature system     |        23 |      6.97s |        25%<br>
| Build Stokes preconditioner     |         4 |      2.79s |        10%<br>
| Build temperature preconditioner|        23 |     0.719s |       2.6%<br>
| Solve Stokes system             |        23 |       7.5s |        27%<br>
| Solve temperature system        |        23 |      1.09s |       3.9%<br>
| Initialization                  |         4 |     0.124s |      0.45%<br>
| Postprocessing                  |        21 |     0.739s |       2.7%<br>
| Refine mesh structure, part 1   |         3 |     0.399s |       1.4%<br>
| Refine mesh structure, part 2   |         3 |     0.104s |      0.37%<br>
| Setup dof systems               |         4 |      1.53s |       5.5%<br>
+---------------------------------+-----------+------------+------------<br>
</blockquote>
<br></div>
And this is after revision 1948 where I filter out all degrees of freedom in the temperature assembly that I don't care about:<br>
<br>
+---------------------------------------------+------------+------------<br>
| Total wallclock time elapsed since start    |      26.1s |<div class="im"><br>
|                                             |            |<br>
| Section                         | no. calls |  wall time | % of total<br>
+---------------------------------+-----------+------------+------------<br></div>
| Assemble Stokes system          |        23 |      5.37s |        21%<br>
| Assemble temperature system     |        23 |      6.13s |        23%<br>
| Build Stokes preconditioner     |         4 |      2.81s |        11%<br>
| Build temperature preconditioner|        23 |     0.726s |       2.8%<br>
| Solve Stokes system             |        23 |      6.64s |        25%<br>
| Solve temperature system        |        23 |      1.14s |       4.3%<br>
| Initialization                  |         4 |     0.125s |      0.48%<br>
| Postprocessing                  |        21 |     0.742s |       2.8%<br>
| Refine mesh structure, part 1   |         3 |     0.399s |       1.5%<br>
| Refine mesh structure, part 2   |         3 |     0.104s |       0.4%<br>
| Setup dof systems               |         4 |      1.52s |       5.8%<br>
+---------------------------------+-----------+------------+------------<br>
<br>
This is probably almost in the noise, but should help significantly with the problem Thomas sees on many processors. In any case, we're now at less than 1/3 of the time for temperature assembly :-)<br>
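<br>
To illustrate the idea of filtering, here is a minimal sketch, not the actual ASPECT code: it keeps only the rows and columns of a cell's local matrix that belong to one component (for example temperature), so that fewer entries are handed to the global matrix. LocalSystem, filter_component and the component numbering are hypothetical and chosen only for this example.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one cell's contribution: a dense n x n
// local matrix (row-major) plus the n global indices of its DoFs.
struct LocalSystem
{
  std::vector<double>      matrix;
  std::vector<std::size_t> global_dofs;
};

// Return the sub-block of `full` whose rows and columns belong to
// component `wanted`. component_of_dof[i] is the component of local DoF i.
LocalSystem filter_component(const LocalSystem               &full,
                             const std::vector<unsigned int> &component_of_dof,
                             const unsigned int               wanted)
{
  const std::size_t n = full.global_dofs.size();
  assert(full.matrix.size() == n * n);
  assert(component_of_dof.size() == n);

  // Local indices of the DoFs we care about.
  std::vector<std::size_t> keep;
  for (std::size_t i = 0; i < n; ++i)
    if (component_of_dof[i] == wanted)
      keep.push_back(i);

  // Copy only the kept rows and columns, preserving their order.
  LocalSystem reduced;
  reduced.global_dofs.reserve(keep.size());
  reduced.matrix.reserve(keep.size() * keep.size());
  for (const std::size_t i : keep)
    {
      reduced.global_dofs.push_back(full.global_dofs[i]);
      for (const std::size_t j : keep)
        reduced.matrix.push_back(full.matrix[i * n + j]);
    }
  return reduced;
}
```

If m of the n local DoFs belong to the wanted component, the number of entries passed on to the global matrix drops from n*n to m*m per cell, and it is the off-process part of those entries that has to be communicated.<br>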
<br>
<br>
@Thomas: Can you see whether that makes a difference?<br>
<br>
@Timo: Want to re-run your 3d simulation with the same setup and compare results on your end?<div class="HOEnZb"><div class="h5"><br>
<br>
Best<br>
 Wolfgang<br>
<br>
<br>
-- <br>
------------------------------------------------------------------------<br>
Wolfgang Bangerth               email:            <a href="mailto:bangerth@math.tamu.edu" target="_blank">bangerth@math.tamu.edu</a><br>
                                www: <a href="http://www.math.tamu.edu/~bangerth/" target="_blank">http://www.math.tamu.edu/~bangerth/</a><br>
<br>
</div></div></blockquote></div><br></div>