<div dir="ltr">with respect to local_assemble_advection system i see a speedup of almost 20x using linear elements for temperature. <div>however, copy_local_to_global on 512 cores still takes too much time.</div><div>with the new patches it runs 10% faster, but a lot of time is still spent inserting matrix values for off-process entries</div>
<div>i will do some timing using MPI_Wtime to make sure we are not looking at profiling overhead</div><div>if that gives the same results i will post this on the Trilinos forum</div><div><br></div><div>cheers</div><div>
Thomas</div>
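<div>A minimal sketch of that kind of manual timing, written in Python for brevity. It wraps one code section in explicit clock reads and accumulates the wall time per section, so the result can be compared against what the profiler reports. Assumptions: <code>time.perf_counter</code> stands in for MPI_Wtime, and <code>timed</code> and <code>copy_local_to_global_stand_in</code> are hypothetical names that do not exist in the code base. In the real MPI run each rank would read MPI_Wtime before and after the section, and the per-rank times would then be compared.</div>
<pre>
```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall time and call count per named section.
wall_time = defaultdict(float)
calls = defaultdict(int)


@contextmanager
def timed(section, clock=time.perf_counter):
    """Add the wall time spent inside the with-block to wall_time[section]."""
    start = clock()
    try:
        yield
    finally:
        wall_time[section] += clock() - start
        calls[section] += 1


def copy_local_to_global_stand_in():
    # Hypothetical placeholder for the real work being timed.
    return sum(i * i for i in range(10000))


for _ in range(3):
    with timed("copy_local_to_global"):
        copy_local_to_global_stand_in()

for name in wall_time:
    print(f"{name}: {calls[name]} calls, {wall_time[name]:.6f}s")
```
</pre>
<div>If the totals measured this way agree with the profiler, the time is really being spent in the section and is not profiling overhead.</div>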
<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 9, 2013 at 12:42 AM, Wolfgang Bangerth <span dir="ltr">&lt;<a href="mailto:bangerth@math.tamu.edu" target="_blank">bangerth@math.tamu.edu</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
revision 1932 (move is_compressible() out of the inner loop of Stokes<br>
assembly):<br>
<pre>
+---------------------------------------------+------------+------------
| Total wallclock time elapsed since start    |      27.7s |
|                                             |            |
| Section                         | no. calls |  wall time | % of total
+---------------------------------+-----------+------------+------------
| Assemble Stokes system          |        23 |      5.31s |        19%
| Assemble temperature system     |        23 |      6.97s |        25%
| Build Stokes preconditioner     |         4 |      2.79s |        10%
| Build temperature preconditioner|        23 |     0.719s |       2.6%
| Solve Stokes system             |        23 |       7.5s |        27%
| Solve temperature system        |        23 |      1.09s |       3.9%
| Initialization                  |         4 |     0.124s |      0.45%
| Postprocessing                  |        21 |     0.739s |       2.7%
| Refine mesh structure, part 1   |         3 |     0.399s |       1.4%
| Refine mesh structure, part 2   |         3 |     0.104s |      0.37%
| Setup dof systems               |         4 |      1.53s |       5.5%
+---------------------------------+-----------+------------+------------
</pre>
</blockquote>
<br></div>
And this is after revision 1948 where I filter out all degrees of freedom in the temperature assembly that I don't care about:<br>
<br>
<pre>
+---------------------------------------------+------------+------------
| Total wallclock time elapsed since start    |      26.1s |
|                                             |            |
| Section                         | no. calls |  wall time | % of total
+---------------------------------+-----------+------------+------------
| Assemble Stokes system          |        23 |      5.37s |        21%
| Assemble temperature system     |        23 |      6.13s |        23%
| Build Stokes preconditioner     |         4 |      2.81s |        11%
| Build temperature preconditioner|        23 |     0.726s |       2.8%
| Solve Stokes system             |        23 |      6.64s |        25%
| Solve temperature system        |        23 |      1.14s |       4.3%
| Initialization                  |         4 |     0.125s |      0.48%
| Postprocessing                  |        21 |     0.742s |       2.8%
| Refine mesh structure, part 1   |         3 |     0.399s |       1.5%
| Refine mesh structure, part 2   |         3 |     0.104s |       0.4%
| Setup dof systems               |         4 |      1.52s |       5.8%
+---------------------------------+-----------+------------+------------
</pre>
<br>
This is probably almost in the noise, but should help significantly with the problem Thomas sees on many processors. In any case, we're now at less than 1/3 of the time for temperature assembly :-)<br>
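<div>To put "almost in the noise" into numbers, here is a small sketch that compares the two tables above (revision 1932 against revision 1948). The wall times are copied from the tables; the variable and function names are only illustrative. It gives roughly 12% less wall time for temperature assembly and roughly 6% less in total, each from a single run, so the run-to-run variation is not known.</div>
<pre>
```python
# Wall times in seconds, copied from the two timing tables
# (revision 1932 first, revision 1948 second).
before = {"Assemble temperature system": 6.97, "Total": 27.7}
after = {"Assemble temperature system": 6.13, "Total": 26.1}


def relative_saving(old, new):
    """Fraction of the old wall time that was saved."""
    return (old - new) / old


savings = {name: relative_saving(before[name], after[name]) for name in before}

for name, value in savings.items():
    print(f"{name}: {100 * value:.1f}% less wall time")
```
</pre>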
<br>
<br>
@Thomas: Can you see whether that makes a difference?<br>
<br>
@Timo: Want to re-run your 3d simulation with the same setup and compare results on your end?<div class="HOEnZb"><div class="h5"><br>
<br>
Best<br>
Wolfgang<br>
<br>
<br>
-- <br>
------------------------------------------------------------------------<br>
Wolfgang Bangerth email: <a href="mailto:bangerth@math.tamu.edu" target="_blank">bangerth@math.tamu.edu</a><br>
www: <a href="http://www.math.tamu.edu/~bangerth/" target="_blank">http://www.math.tamu.edu/~bangerth/</a><br>
<br>
</div></div></blockquote></div><br></div>