[aspect-devel] Memory issues when increasing the number of processors?

John Naliboff jbnaliboff at ucdavis.edu
Thu Aug 24 15:46:54 PDT 2017


Hi all,

Below are messages I accidentally only sent to Timo rather than the 
whole mailing list.

Timo - I tried Trilinos 12.10.1 and it did not resolve the issue. I'm 
going to try to reproduce the problem with step-32 and/or on a different 
cluster next.

Cheers,
John

*************************************************
John Naliboff
Assistant Project Scientist, CIG
Earth & Planetary Sciences Dept., UC Davis

On 08/23/2017 02:56 PM, Timo Heister wrote:
> John,
>
>> /home/jboff/software/trilinos/trilinos-12.4.2/install/lib/libml.so.12(ML_Comm_Send+0x20)[0x2ba25c648cc0]
> So this is inside the multigrid preconditioner from Trilinos. One
> option might be to try a newer Trilinos release. Sorry, I know that is
> annoying.
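>
> For context: ML is what deal.II's TrilinosWrappers::PreconditionAMG wraps,
> so a frame like the one above comes from an AMG setup/solve roughly of the
> following form. This is only a sketch of where ML enters the picture, not
> ASPECT's actual code; the function and matrix names here are made up:
>
>   #include <deal.II/lac/trilinos_precondition.h>
>   #include <deal.II/lac/trilinos_sparse_matrix.h>
>
>   // Sketch: build an ML (smoothed-aggregation AMG) preconditioner
>   // for one Trilinos matrix block.
>   void build_amg(const dealii::TrilinosWrappers::SparseMatrix &A,
>                  dealii::TrilinosWrappers::PreconditionAMG    &amg)
>   {
>     dealii::TrilinosWrappers::PreconditionAMG::AdditionalData data;
>     data.elliptic              = true;   // settings for an elliptic operator
>     data.higher_order_elements = true;   // e.g. Q2 velocities
>     amg.initialize(A, data);             // this call is what ends up in ML
>   }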
>
>> /home/jboff/software/aspect/master/aspect/./aspect(_ZN6aspect18FreeSurfaceHandlerILi3EE26compute_mesh_displacementsEv+0x55c)
> You are using free surface computations. This is something that we
> haven't tested as much. Do you also get crashes without free surface
> computations?
>
>
>
>
> On Wed, Aug 23, 2017 at 5:40 PM, John Naliboff <jbnaliboff at ucdavis.edu> wrote:
>> Hi Timo,
>>
>> Thanks for the feedback. I tried a few more tests with a different model
>> (lithospheric deformation) and encountered similar issues. The attached
>> error output provides a bit more info this time. The model was run across
>> 768 cores.
>>
>> From the output it looks like there is an issue in Epetra?
>>
>>   Perhaps unrelated, but I am using LAPACK 3.6.0 and had to change some of
>> the symbol labels in packages/epetra/src/Epetra_LAPACK_wrappers.h (e.g.,
>> following https://www.dealii.org/8.5.0/external-libs/trilinos.html).
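>>
>> For reference, the kind of edit involved is renaming the Fortran wrapper
>> macros in that header, something of this form (the routine names below are
>> only an example of the pattern, not an exact copy of what I changed):
>>
>>   /* packages/epetra/src/Epetra_LAPACK_wrappers.h                   */
>>   /* Point a routine that LAPACK 3.6.0 removed at a replacement,    */
>>   /* using the F77_BLAS_MANGLE macro the header already uses.       */
>>   #define DGGSVD_F77  F77_BLAS_MANGLE(dggsvd3, DGGSVD3)
>>   #define SGGSVD_F77  F77_BLAS_MANGLE(sggsvd3, SGGSVD3)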
>>
>> Cheers,
>> John
>>
>> *************************************************
>> John Naliboff
>> Assistant Project Scientist, CIG
>> Earth & Planetary Sciences Dept., UC Davis
>>
>> On 08/23/2017 09:41 AM, Timo Heister wrote:
>>
>> John,
>>
>> it would be neat to have a longer callstack to see where this error is
>> happening.
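>>
>> One cheap way to get a longer callstack out of an optimized run is to
>> install a small segfault handler that dumps the frames before the process
>> dies. A rough sketch (glibc-specific, and not something that is currently
>> built into ASPECT):
>>
>>   #include <execinfo.h>   // backtrace, backtrace_symbols_fd (glibc)
>>   #include <csignal>
>>   #include <unistd.h>     // STDERR_FILENO, _exit
>>
>>   extern "C" void print_backtrace_and_die(int /*signal*/)
>>   {
>>     void *frames[64];
>>     const int n = backtrace(frames, 64);
>>     backtrace_symbols_fd(frames, n, STDERR_FILENO);  // async-signal-safe
>>     _exit(1);
>>   }
>>
>>   // near the top of main(), after MPI is initialized:
>>   //   std::signal(SIGSEGV, print_backtrace_and_die);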
>>
>> Some ideas:
>> 1. This could be a hardware issue (one of the nodes can not
>> communicate, has packet loss or whatever).
>> 2. This could be a configuration problem ("too many retries sending
>> message to 0x5a90:0x000639a2, giving up" could mean some MPI timeouts
>> are triggered)
>> 3. It could be a bug in some MPI code (in Trilinos, deal.II, or
>> ASPECT). A longer callstack would help narrow that down.
>>
>> If you feel like experimenting, you could see if you can trigger the
>> same issue with deal.II step-32.
>>
>>
>> On Tue, Aug 22, 2017 at 4:44 PM, John Naliboff <jbnaliboff at ucdavis.edu>
>> wrote:
>>
>> Hi all,
>>
>> I'm looking for feedback on a memory error(s?) that has me somewhat
>> perplexed.
>>
>> The errors are occurring on the XSEDE cluster Comet:
>>
>> http://www.sdsc.edu/support/user_guides/comet.html
>>
>> The models in question are a series of scaling tests following the tests run
>> by Rene Gassmoeller:
>>
>> https://github.com/gassmoeller/aspect-performance-statistics
>>
>> When using up to 192 processors and global refinement levels of 2, 3, 4 or 5
>> the scaling results are "roughly" (not too far off) what I would expect
>> based on Rene's results.
>>
>> However, once I get up to 384 cores the models almost always crash with a
>> segmentation fault error. Here is part of the error message from a model run
>> on 384 cores with 4 global refinement levels.
>>    Number of active cells: 393,216 (on 5 levels)
>>    Number of degrees of freedom: 16,380,620
>> (9,585,030+405,570+3,195,010+3,195,010)
>>
>>    *** Timestep 0:  t=0 years
>>       Solving temperature system... 0 iterations.
>>       Solving C_1 system ... 0 iterations.
>>       Rebuilding Stokes preconditioner...[comet-06-22:09703] *** Process
>> received signal ***
>>    [comet-06-22:09703] Signal: Segmentation fault (11)
>>
>> The full model output is located in the attached file.
>>
>> Thoughts on what might be causing a memory issue when increasing the number
>> of cores?
>>
>> The perplexing part is that the error does not seem to be tied to the number
>> of d.o.f. per processor. Also somewhat perplexing: one model that crashed
>> with this error was later able to run successfully using the exact same
>> submission script, input file, etc. However, that only happened once (a
>> failed job re-running successfully), and otherwise the errors are almost
>> always reproducible.
>>
>> If no one has encountered this issue before, any suggestions for debugging
>> tricks with this number of processors? I may be able to run an interactive
>> session in debug mode with this number of processors, but I would need to
>> check with the cluster administrator.
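>>
>> One thing I may try, if the admins allow an interactive job at this size, is
>> the usual attach-gdb-to-one-rank trick: have every rank print its host and
>> PID and then spin until a debugger attaches. A rough sketch (the environment
>> variable name is made up for this example):
>>
>>   #include <mpi.h>
>>   #include <unistd.h>   // sleep, getpid, gethostname
>>   #include <cstdio>
>>   #include <cstdlib>
>>
>>   // Call right after MPI is initialized.
>>   void wait_for_debugger()
>>   {
>>     if (std::getenv("WAIT_FOR_GDB") == nullptr)
>>       return;
>>
>>     int rank = 0;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     char host[256];
>>     gethostname(host, sizeof(host));
>>     std::printf("rank %d: pid %d on %s waiting for gdb\n",
>>                 rank, (int)getpid(), host);
>>     std::fflush(stdout);
>>
>>     volatile int go = 0;
>>     while (go == 0)   // in gdb: attach <pid>; set var go = 1; continue
>>       sleep(5);
>>   }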
>>
>> Thanks!
>> John
>>
>> --
>>
>> *************************************************
>> John Naliboff
>> Assistant Project Scientist, CIG
>> Earth & Planetary Sciences Dept., UC Davis
>>
>>
>>
>>
>>
>
>
