[aspect-devel] Memory issues when increasing the number of processors?

John Naliboff jbnaliboff at ucdavis.edu
Fri Sep 1 09:48:19 PDT 2017


Good news: switching the MPI library from OpenMPI to MVAPICH2 seems to have fixed the issue. Three different test cases that previously failed now run without problems. Thanks for the suggestions, Timo!
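
In case it is useful to others, one quick way to confirm which MPI implementation an executable is actually running against is to query it at runtime. Below is a minimal sketch using the standard MPI-3 call MPI_Get_library_version; nothing in it is COMET- or ASPECT-specific:

  // mpi_version.cc -- report which MPI library the binary is linked against
  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    MPI_Get_library_version(version, &len);   // e.g. "Open MPI ..." vs. "MVAPICH2 ..."
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
      std::printf("%s\n", version);
    MPI_Finalize();
    return 0;
  }

Compiling this with the same mpicxx wrapper used for ASPECT and running it from the job script makes it easy to spot when a job silently picks up a different MPI module than the one the code was built against.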

I’ll report back if the system administrator and I are able to pin down exactly what went wrong with the OpenMPI library.

For anyone interested in using COMET (http://www.sdsc.edu/support/user_guides/comet.html), I’ll add installation instructions and scaling results to the GitHub pages in the next week or two. Feel free to send me an email if you would like to get started before then.

FYI, anyone in the University of California system can apply once for a large allocation on COMET (up to 500,000 core hours) with a short (1-2 page) proposal. Any subsequent allocation requests for COMET need to go through the regular XSEDE process, but nonetheless it is a great opportunity.

Cheers,
John

*************************************************
John Naliboff
Assistant Project Scientist, CIG
Earth & Planetary Sciences Dept., UC Davis






> On Aug 31, 2017, at 3:00 PM, Timo Heister <heister at clemson.edu> wrote:
> 
> Another thing that might be happening is some aggressive tuning done on the cluster, so that you run into MPI timeouts (the code would run if they had a little more time to respond). Maybe you can ask the sysadmins? Do they have other MPI libraries installed that you could try out?
> I find it unlikely that this is a bug inside Trilinos, though of course that is not impossible.
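> 
> If you want a quick sanity check of plain MPI communication at that core count, a minimal ring-exchange test (a sketch, independent of ASPECT, built against the same MPI module and run with the same node layout) should either pass quickly or run into the same timeouts:
> 
>   // mpi_ring.cc -- every rank sends ~8 MB to its right neighbor and
>   // receives the same amount from its left neighbor
>   #include <mpi.h>
>   #include <cstdio>
>   #include <vector>
> 
>   int main(int argc, char **argv)
>   {
>     MPI_Init(&argc, &argv);
>     int rank, size;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>     std::vector<double> sendbuf(1 << 20, rank);   // ~8 MB of doubles
>     std::vector<double> recvbuf(1 << 20, -1.0);
>     const int dest = (rank + 1) % size;
>     const int src  = (rank - 1 + size) % size;
> 
>     MPI_Sendrecv(sendbuf.data(), (int)sendbuf.size(), MPI_DOUBLE, dest, 0,
>                  recvbuf.data(), (int)recvbuf.size(), MPI_DOUBLE, src, 0,
>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> 
>     if (rank == 0)
>       std::printf("ring exchange over %d ranks completed\n", size);
>     MPI_Finalize();
>     return 0;
>   }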
> 
> On Aug 31, 2017 17:44, "John Naliboff" <jbnaliboff at ucdavis.edu> wrote:
> Hello again,
> 
> A quick update. I ran scaling tests with deal.II step-32 (8, 9, or 10 global refinements) across 192, 384, 768, or 1536 cores.
> 
> The same error as reported previously (see attached file) typically occurs with 768 or 1536 cores, although in some cases models that previously crashed are able to run without issue. There is still no real correlation between d.o.f. per core and the number of cores at which a model crashes.
> 
> Is it worth reporting this issue to the Trilinos mailing list?
> 
> Cheers,
> John
>  *************************************************
> John Naliboff
> Assistant Project Scientist, CIG
> Earth & Planetary Sciences Dept., UC Davis
> On 08/24/2017 03:46 PM, John Naliboff wrote:
>> Hi all,
>> 
>> Below are messages I accidentally sent only to Timo rather than to the whole mailing list. 
>> 
>> Timo - I tried Trilinos 12.10.1 and this did not resolve the issue. I'm going to try to reproduce the issue with step-32 and/or on a different cluster next.
>> 
>> Cheers,
>> John
>>  *************************************************
>> John Naliboff
>> Assistant Project Scientist, CIG
>> Earth & Planetary Sciences Dept., UC Davis
>> On 08/23/2017 02:56 PM, Timo Heister wrote:
>>> John,
>>> 
>>>> /home/jboff/software/trilinos/trilinos-12.4.2/install/lib/libml.so.12(ML_Comm_Send+0x20)[0x2ba25c648cc0]
>>> So this is inside the multigrid preconditioner from Trilinos. One
>>> option might be to try a newer Trilinos release. Sorry, I know that is
>>> annoying.
>>> 
>>>> /home/jboff/software/aspect/master/aspect/./aspect(_ZN6aspect18FreeSurfaceHandlerILi3EE26compute_mesh_displacementsEv+0x55c)
>>> You are using free surface computations. This is something that we
>>> haven't tested as much. Do you also get crashes without free surface
>>> computations?
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Aug 23, 2017 at 5:40 PM, John Naliboff <jbnaliboff at ucdavis.edu> <mailto:jbnaliboff at ucdavis.edu> wrote:
>>>> Hi Timo,
>>>> 
>>>> Thanks for the feedback. I tried a few more tests with a different model
>>>> (lithospheric deformation) and encountered similar issues. The attached
>>>> error output provides a bit more info this time. The model was run across
>>>> 768 cores.
>>>> 
>>>> From the output it looks like there is an issue in Epetra?
>>>> 
>>>> Perhaps unrelated, but I am using LAPACK 3.6.0 and had to change some of the
>>>> symbol labels in packages/epetra/src/Epetra_LAPACK_wrappers.h (e.g.
>>>> following https://www.dealii.org/8.5.0/external-libs/trilinos.html).
>>>> 
>>>> Cheers,
>>>> John
>>>> 
>>>> *************************************************
>>>> John Naliboff
>>>> Assistant Project Scientist, CIG
>>>> Earth & Planetary Sciences Dept., UC Davis
>>>> 
>>>> On 08/23/2017 09:41 AM, Timo Heister wrote:
>>>> 
>>>> John,
>>>> 
>>>> it would be neat to have a longer callstack to see where this error is
>>>> happening.
>>>> 
>>>> Some ideas:
>>>> 1. This could be a hardware issue (one of the nodes cannot
>>>> communicate, has packet loss, or similar).
>>>> 2. This could be a configuration problem ("too many retries sending
>>>> message to 0x5a90:0x000639a2, giving up" could mean some MPI timeouts
>>>> are being triggered).
>>>> 3. It could be a bug in some MPI code (in Trilinos, deal.II, or
>>>> ASPECT). A longer callstack would help narrow that down; see the sketch below.
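>>>> 
>>>> For the callstack, one generic option (a sketch, not something already wired into ASPECT here) is to install a SIGSEGV handler early in main() that dumps a glibc backtrace before the process dies:
>>>> 
>>>>   // backtrace_on_segfault.cc -- print a backtrace when a segfault occurs
>>>>   // (glibc-specific; link with -rdynamic so that symbol names appear)
>>>>   #include <execinfo.h>
>>>>   #include <csignal>
>>>>   #include <unistd.h>
>>>> 
>>>>   extern "C" void segfault_handler(int sig)
>>>>   {
>>>>     void *frames[64];
>>>>     const int n = backtrace(frames, 64);
>>>>     backtrace_symbols_fd(frames, n, STDERR_FILENO);  // symbols to stderr
>>>>     _exit(sig);
>>>>   }
>>>> 
>>>>   int main()
>>>>   {
>>>>     std::signal(SIGSEGV, segfault_handler);
>>>>     volatile int *p = nullptr;
>>>>     return *p;   // deliberate segfault, just to demonstrate the handler
>>>>   }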
>>>> 
>>>> If you feel like experimenting, you could see if you can trigger the
>>>> same issue with deal.II step-32.
>>>> 
>>>> 
>>>> On Tue, Aug 22, 2017 at 4:44 PM, John Naliboff <jbnaliboff at ucdavis.edu>
>>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I'm looking for feedback on a memory error(s?) that has me somewhat
>>>> perplexed.
>>>> 
>>>> The errors are occurring on the XSEDE cluster Comet:
>>>> 
>>>> http://www.sdsc.edu/support/user_guides/comet.html
>>>> 
>>>> The models in question are a series of scaling tests following the tests run
>>>> by Rene Gassmoeller:
>>>> 
>>>> https://github.com/gassmoeller/aspect-performance-statistics
>>>> 
>>>> When using up to 192 processors and global refinement levels of 2, 3, 4, or 5,
>>>> the scaling results are roughly (not too far off) what I would expect
>>>> based on Rene's results.
>>>> 
>>>> However, once I get up to 384 cores, the models almost always crash with a
>>>> segmentation fault. Here is part of the error message from a model run
>>>> on 384 cores with 4 global refinement levels:
>>>>   Number of active cells: 393,216 (on 5 levels)
>>>>   Number of degrees of freedom: 16,380,620
>>>> (9,585,030+405,570+3,195,010+3,195,010)
>>>> 
>>>>   *** Timestep 0:  t=0 years
>>>>      Solving temperature system... 0 iterations.
>>>>      Solving C_1 system ... 0 iterations.
>>>>      Rebuilding Stokes preconditioner...[comet-06-22:09703] *** Process
>>>> received signal ***
>>>>   [comet-06-22:09703] Signal: Segmentation fault (11)
>>>> 
>>>> The full model output is located in the attached file.
>>>> 
>>>> Thoughts on what might be causing a memory issue when increasing the number
>>>> of cores?
>>>> 
>>>> The perplexing part is that the error does not seem to be tied to the
>>>> number of d.o.f. per processor. Also somewhat perplexing: one model that
>>>> crashed with this error was later able to run successfully using the exact same
>>>> submission script, input file, etc. However, this only happened once
>>>> (a previously failed job completing successfully), and otherwise the errors
>>>> are almost always reproducible.
>>>> 
>>>> If no one has encountered this issue before, any suggestions for debugging
>>>> tricks with this number of processors? I may be able to run an interactive
>>>> session in debug mode with this number of processors, but I would need to
>>>> check with the cluster administrator.
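>>>> 
>>>> One thing I may try in the meantime (a rough sketch, assuming gdb is available on the compute nodes) is the usual trick of having one chosen rank announce its host and PID and then spin until a debugger attaches:
>>>> 
>>>>   // debug_attach.cc -- call wait_for_debugger() right after MPI_Init();
>>>>   // attach with "gdb -p <pid>", then "set var hold = 0" and "continue"
>>>>   #include <mpi.h>
>>>>   #include <unistd.h>
>>>>   #include <cstdio>
>>>> 
>>>>   void wait_for_debugger(const int target_rank)
>>>>   {
>>>>     int rank;
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     if (rank == target_rank)
>>>>       {
>>>>         char host[256];
>>>>         gethostname(host, sizeof(host));
>>>>         std::printf("rank %d waiting: ssh %s; gdb -p %d\n",
>>>>                     rank, host, (int)getpid());
>>>>         std::fflush(stdout);
>>>>         volatile int hold = 1;
>>>>         while (hold == 1)          // the debugger clears 'hold' to continue
>>>>           sleep(1);
>>>>       }
>>>>     MPI_Barrier(MPI_COMM_WORLD);   // all other ranks wait here meanwhile
>>>>   }
>>>> 
>>>>   int main(int argc, char **argv)
>>>>   {
>>>>     MPI_Init(&argc, &argv);
>>>>     wait_for_debugger(0);          // pick whichever rank you want to inspect
>>>>     // ... the actual work would go here ...
>>>>     MPI_Finalize();
>>>>     return 0;
>>>>   }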
>>>> 
>>>> Thanks!
>>>> John
>>>> 
>>>> --
>>>> 
>>>> *************************************************
>>>> John Naliboff
>>>> Assistant Project Scientist, CIG
>>>> Earth & Planetary Sciences Dept., UC Davis
>>>> 	
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
> 
> _______________________________________________
> Aspect-devel mailing list
> Aspect-devel at geodynamics.org
> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel


