<div dir="ltr">with respect to local_assemble_advection system i see a speedup of almost 20X using linear elements for temperature. <div>however the copy_local_to_global on 512 cores still takes too much time.</div><div>with the new patches it runs 10% faster but still a lot of time is spent inserting matrix values for off-process entries</div>


<div>i will do some timing using MPI_Wtime to make sure we are not looking at profiling overhead</div><div>if that gives the same results i will post this on the Trilinos forum</div><div><br></div><div>cheers</div><div>

Thomas</div>

<div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 9, 2013 at 12:42 AM, Wolfgang Bangerth <span dir="ltr">&lt;<a href="mailto:bangerth@math.tamu.edu" target="_blank">bangerth@math.tamu.edu</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

revision 1932 (move is_compressible() out of the inner loop of Stokes<br>

assembly):<br>

+---------------------------------------------+------------+------------<br>

| Total wallclock time elapsed since start    |      27.7s |<br>

|                                             |            |<br>

| Section                         | no. calls |  wall time | % of total<br>

+---------------------------------+-----------+------------+------------<br>

| Assemble Stokes system          |        23 |      5.31s |        19%<br>

| Assemble temperature system     |        23 |      6.97s |        25%<br>

| Build Stokes preconditioner     |         4 |      2.79s |        10%<br>

| Build temperature preconditioner|        23 |     0.719s |       2.6%<br>

| Solve Stokes system             |        23 |       7.5s |        27%<br>

| Solve temperature system        |        23 |      1.09s |       3.9%<br>

| Initialization                  |         4 |     0.124s |      0.45%<br>

| Postprocessing                  |        21 |     0.739s |       2.7%<br>

| Refine mesh structure, part 1   |         3 |     0.399s |       1.4%<br>

| Refine mesh structure, part 2   |         3 |     0.104s |      0.37%<br>

| Setup dof systems               |         4 |      1.53s |       5.5%<br>

+---------------------------------+-----------+------------+------------<br>

</blockquote>

<br></div>

And this is after revision 1948 where I filter out all degrees of freedom in the temperature assembly that I don't care about:<br>

<br>

+---------------------------------------------+------------+------------<br>

| Total wallclock time elapsed since start    |      26.1s |<div class="im"><br>

|                                             |            |<br>

| Section                         | no. calls |  wall time | % of total<br>

+---------------------------------+-----------+------------+------------<br></div>

| Assemble Stokes system          |        23 |      5.37s |        21%<br>

| Assemble temperature system     |        23 |      6.13s |        23%<br>

| Build Stokes preconditioner     |         4 |      2.81s |        11%<br>

| Build temperature preconditioner|        23 |     0.726s |       2.8%<br>

| Solve Stokes system             |        23 |      6.64s |        25%<br>

| Solve temperature system        |        23 |      1.14s |       4.3%<br>

| Initialization                  |         4 |     0.125s |      0.48%<br>

| Postprocessing                  |        21 |     0.742s |       2.8%<br>

| Refine mesh structure, part 1   |         3 |     0.399s |       1.5%<br>

| Refine mesh structure, part 2   |         3 |     0.104s |       0.4%<br>

| Setup dof systems               |         4 |      1.52s |       5.8%<br>

+---------------------------------+-----------+------------+------------<br>

<br>

This is probably almost in the noise, but should help significantly with the problem Thomas sees on many processors. In any case, we're now at less than 1/3 of the time for temperature assembly :-)<br>
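<br>
A rough, language-neutral sketch of what this kind of filtering can look like, written in Python for brevity. It is not the code from revision 1948: the function name, the keep mask, and the dict standing in for the distributed sparse matrix are all assumptions made for illustration. The idea is that only rows and columns belonging to the block of interest (here, temperature) are ever handed to the global matrix, so fewer entries are written overall and, in parallel, fewer of them can land on rows owned by another process.<br>

```python
def copy_filtered_local_to_global(cell_matrix, global_dof_indices, keep, global_matrix):
    """Add only the kept rows/columns of a dense cell matrix to a sparse matrix.

    cell_matrix        -- n x n list of lists with the local contributions
    global_dof_indices -- global index of each of the n local dofs
    keep               -- keep[i] is True if local dof i is in the block we want
    global_matrix      -- dict (row, col) -> value; a stand-in (assumption) for
                          the real distributed sparse matrix

    Returns the number of entries written.
    """
    n = len(global_dof_indices)
    if len(cell_matrix) != n or len(keep) != n:
        raise ValueError("cell_matrix, global_dof_indices and keep must match in size")

    # Local indices that survive the filter (e.g. temperature dofs only).
    kept = [i for i in range(n) if keep[i]]

    written = 0
    for i in kept:
        for j in kept:
            key = (global_dof_indices[i], global_dof_indices[j])
            # Accumulate, as assembly adds contributions from several cells.
            global_matrix[key] = global_matrix.get(key, 0.0) + cell_matrix[i][j]
            written += 1
    return written
```

With four local dofs of which two are kept, this writes 4 entries per cell instead of 16; the saving grows with the share of dofs on each cell that belong to other blocks.<br>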

<br>

<br>

@Thomas: Can you see whether that makes a difference?<br>

<br>

@Timo: Want to re-run your 3d simulation with the same setup and compare results on your end?<div class="HOEnZb"><div class="h5"><br>

<br>

Best<br>

 Wolfgang<br>

<br>

<br>

-- <br>

------------------------------------------------------------------------<br>

Wolfgang Bangerth               email:            <a href="mailto:bangerth@math.tamu.edu" target="_blank">bangerth@math.tamu.edu</a><br>

                                www: <a href="http://www.math.tamu.edu/~bangerth/" target="_blank">http://www.math.tamu.edu/~bangerth/</a><br>

<br>

</div></div></blockquote></div><br></div>