[aspect-devel] Memory issues when increasing the number of processors?

Timo Heister heister at clemson.edu
Thu Aug 31 15:00:12 PDT 2017


Another thing that might be happening is some aggressive tuning done on
the cluster, so that you run into MPI timeouts (the code would run if the
nodes had a little more time to respond). Maybe you can ask the sysadmins? Do
they have other MPI libraries installed you could try out?
I find it unlikely that this is a bug inside Trilinos, though of course
that is not impossible.

On Aug 31, 2017 17:44, "John Naliboff" <jbnaliboff at ucdavis.edu> wrote:

> Hello again,
>
> A quick update. I ran scaling tests with deal.II step-32 (8, 9 or 10 global
> refinements) across 192, 384, 768 or 1536 cores.
>
> The same error as reported previously (see attached file) typically occurs
> with 768 or 1536 cores, although in some cases models that previously
> crashed are able to run without issue. There is still no real correlation
> between d.o.f. per core and the number of cores at which the model crashes.
>
> Is it worth reporting this issue to the Trilinos mailing list?
>
> Cheers,
> John
>
> *************************************************
> John Naliboff
> Assistant Project Scientist, CIG
> Earth & Planetary Sciences Dept., UC Davis
>
> On 08/24/2017 03:46 PM, John Naliboff wrote:
>
> Hi all,
>
> Below are messages I accidentally only sent to Timo rather than the whole
> mailing list.
>
> Timo - I tried Trilinos 12.10.1 and this did not resolve the issue. I'm
> going to try to reproduce the issue with step-32 and/or on a different
> cluster next.
>
> Cheers,
> John
>
> *************************************************
> John Naliboff
> Assistant Project Scientist, CIG
> Earth & Planetary Sciences Dept., UC Davis
>
> On 08/23/2017 02:56 PM, Timo Heister wrote:
>
> John,
>
>
> /home/jboff/software/trilinos/trilinos-12.4.2/install/lib/libml.so.12(ML_Comm_Send+0x20)[0x2ba25c648cc0]
>
> So this is inside the multigrid preconditioner from Trilinos. One
> option might be to try a newer Trilinos release. Sorry, I know that is
> annoying.
>
>
> /home/jboff/software/aspect/master/aspect/./aspect(_ZN6aspect18FreeSurfaceHandlerILi3EE26compute_mesh_displacementsEv+0x55c)
>
> You are using free surface computations. This is something that we
> haven't tested as much. Do you also get crashes without free surface
> computations?
>
>
>
>
> On Wed, Aug 23, 2017 at 5:40 PM, John Naliboff <jbnaliboff at ucdavis.edu> wrote:
>
> Hi Timo,
>
> Thanks for the feedback. I tried a few more tests with a different model
> (lithospheric deformation) and encountered similar issues. The attached
> error output provides a bit more info this time. The model was run across
> 768 cores.
>
> From the output it looks like there is an issue in Epetra?
>
>  Perhaps unrelated, but I am using LAPACK 3.6.0 and had to change some of the
> symbol labels in packages/epetra/src/Epetra_LAPACK_wrappers.h (e.g.
> following https://www.dealii.org/8.5.0/external-libs/trilinos.html ).
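For anyone hitting the same build issue: the change involved is a small edit to the Fortran-mangling macros in that header. A sketch only; the exact macro and routine names vary by Trilinos version, but the underlying cause is that LAPACK 3.6 removed the deprecated ?GGSVD routines in favor of ?GGSVD3:

```c
/* packages/epetra/src/Epetra_LAPACK_wrappers.h (illustrative):
 * LAPACK 3.6.0 removed dggsvd/sggsvd, so the wrapper macros must point
 * at the replacement ggsvd3 symbols (or the affected calls must be
 * disabled). Roughly:
 *
 * Before:
 *   #define DGGSVD_F77  F77_BLAS_MANGLE(dggsvd,DGGSVD)
 * After:
 *   #define DGGSVD_F77  F77_BLAS_MANGLE(dggsvd3,DGGSVD3)
 */
```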
>
> Cheers,
> John
>
> *************************************************
> John Naliboff
> Assistant Project Scientist, CIG
> Earth & Planetary Sciences Dept., UC Davis
>
> On 08/23/2017 09:41 AM, Timo Heister wrote:
>
> John,
>
> it would be neat to have a longer callstack to see where this error is
> happening.
>
> Some ideas:
> 1. This could be a hardware issue (one of the nodes cannot
> communicate, has packet loss, or whatever).
> 2. This could be a configuration problem ("too many retries sending
> message to 0x5a90:0x000639a2, giving up" could mean some MPI timeouts
> are being triggered).
> 3. It could be a bug in some MPI code (in Trilinos, deal.II, or
> ASPECT). A longer callstack would help narrow that down.
>
> If you feel like experimenting, you could see if you can trigger the
> same issue with deal.II step-32.
>
>
> On Tue, Aug 22, 2017 at 4:44 PM, John Naliboff <jbnaliboff at ucdavis.edu>
> wrote:
>
> Hi all,
>
> I'm looking for feedback on a memory error(s?) that has me somewhat
> perplexed.
>
> The errors are occurring on the XSEDE cluster Comet:
> http://www.sdsc.edu/support/user_guides/comet.html
>
> The models in question are a series of scaling tests following the tests run
> by Rene Gassmoeller:
> https://github.com/gassmoeller/aspect-performance-statistics
>
> When using up to 192 processors and global refinement levels of 2, 3, 4 or 5
> the scaling results are "roughly" (not too far off) what I would expect
> based on Rene's results.
>
> However, once I get up to 384 cores the models almost always crash with a
> segmentation fault error. Here is part of the error message from a model run
> on 384 cores with 4 global refinement levels.
>   Number of active cells: 393,216 (on 5 levels)
>   Number of degrees of freedom: 16,380,620
> (9,585,030+405,570+3,195,010+3,195,010)
>
>   *** Timestep 0:  t=0 years
>      Solving temperature system... 0 iterations.
>      Solving C_1 system ... 0 iterations.
>      Rebuilding Stokes preconditioner...[comet-06-22:09703] *** Process
> received signal ***
>   [comet-06-22:09703] Signal: Segmentation fault (11)
>
> The full model output is located in the attached file.
>
> Thoughts on what might be causing a memory issue when increasing the number
> of cores?
>
> The perplexing part is that the error does not seem to be tied to the
> number of d.o.f. per processor. Also somewhat perplexing: one model that
> crashed with this error was able to run successfully using the exact same
> submission script, input file, etc. However, this only happened once
> (a failed job running successfully) and the errors are otherwise almost
> always reproducible.
>
> If no one has encountered this issue before, any suggestions for debugging
> tricks with this number of processors? I may be able to run an interactive
> session in debug mode with this number of processors, but I would need to
> check with the cluster administrator.
>
> Thanks!
> John
>
> --
>
> *************************************************
> John Naliboff
> Assistant Project Scientist, CIG
> Earth & Planetary Sciences Dept., UC Davis
>
>
> _______________________________________________
> Aspect-devel mailing list
> Aspect-devel at geodynamics.org
> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>
>
>
>
>
>
> _______________________________________________
> Aspect-devel mailing list
> Aspect-devel at geodynamics.org
> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.geodynamics.org/pipermail/aspect-devel/attachments/20170831/702c3f2d/attachment.html>

