[aspect-devel] Memory issues when increasing the number of processors?

Timo Heister heister at clemson.edu
Wed Aug 23 09:41:21 PDT 2017


John,

it would be neat to have a longer callstack to see where this error is
happening.

Some ideas:
1. This could be a hardware issue (one of the nodes cannot
communicate, has packet loss, or similar).
2. This could be a configuration problem ("too many retries sending
message to 0x5a90:0x000639a2, giving up" could mean some MPI timeouts
are triggered)
3. It could be a bug in some MPI code (in Trilinos, deal.II, or
ASPECT). A longer callstack would help narrow that down.

If you feel like experimenting, you could see if you can trigger the
same issue with deal.II step-32.
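
Regarding the longer callstack: one generic way to get one is to
install a SIGSEGV handler that dumps the call stack with glibc's
backtrace() before the process dies. This is only a minimal sketch and
not something ASPECT or deal.II provides in this form. The names
print_backtrace, segv_handler and install_backtrace_handler are made up
for illustration, and the sketch assumes Linux with glibc. backtrace()
is not guaranteed to be async-signal-safe, so treat it as a debugging
aid only.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc-specific)
#include <unistd.h>    // write, STDERR_FILENO
#include <csignal>     // std::signal, std::raise, SIGSEGV, SIG_DFL

// Hypothetical helper: write the current call stack to the file
// descriptor fd and return the number of frames that were captured.
int print_backtrace(int fd)
{
  constexpr int max_frames = 64;
  void *frames[max_frames];
  const int n_frames = backtrace(frames, max_frames);
  backtrace_symbols_fd(frames, n_frames, fd);  // no malloc, writes directly
  return n_frames;
}

// Hypothetical handler: print a marker and the call stack, then restore
// the default action and re-raise so the process still dies from the signal.
void segv_handler(int sig)
{
  static const char msg[] = "*** fatal signal, backtrace follows ***\n";
  const ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;
  print_backtrace(STDERR_FILENO);
  std::signal(sig, SIG_DFL);
  std::raise(sig);
}

// Hypothetical setup function: register the handler for SIGSEGV.
void install_backtrace_handler()
{
  std::signal(SIGSEGV, segv_handler);
}
```

Calling install_backtrace_handler() once near the start of main()
should be enough. It is probably best done after MPI_Init, since the
MPI library may install its own signal handlers there. Linking with
-rdynamic makes function names show up in the output; otherwise the raw
addresses can be passed to addr2line.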


On Tue, Aug 22, 2017 at 4:44 PM, John Naliboff <jbnaliboff at ucdavis.edu> wrote:
> Hi all,
>
> I'm looking for feedback on a memory error(s?) that has me somewhat
> perplexed.
>
> The errors are occurring on the XSEDE cluster Comet:
>    http://www.sdsc.edu/support/user_guides/comet.html
>
> The models in question are a series of scaling tests following the tests run
> by Rene Gassmoeller:
>    https://github.com/gassmoeller/aspect-performance-statistics
>
> When using up to 192 processors and global refinement levels of 2, 3, 4 or 5
> the scaling results are "roughly" (not too far off) what I would expect
> based on Rene's results.
>
> However, once I get up to 384 cores the models almost always crash with a
> segmentation fault error. Here is part of the error message from a model run
> on 384 cores with 4 global refinement levels.
>   Number of active cells: 393,216 (on 5 levels)
>   Number of degrees of freedom: 16,380,620
> (9,585,030+405,570+3,195,010+3,195,010)
>
>   *** Timestep 0:  t=0 years
>      Solving temperature system... 0 iterations.
>      Solving C_1 system ... 0 iterations.
>      Rebuilding Stokes preconditioner...[comet-06-22:09703] *** Process
> received signal ***
>   [comet-06-22:09703] Signal: Segmentation fault (11)
>
> The full model output is located in the attached file.
>
> Thoughts on what might be causing a memory issue when increasing the number
> of cores?
>
> The perplexing part is that the error does not seem to be tied to the
> number of d.o.f. per processor. Also somewhat perplexing is that one model
> that crashed with this error later ran successfully using the exact same
> submission script, input file, etc. However, a failed job rerunning
> successfully only happened once, and the errors are almost always
> reproducible.
>
> If no one has encountered this issue before, any suggestions for debugging
> tricks with this number of processors? I may be able to run an interactive
> session in debug mode with this number of processors, but I would need to
> check with the cluster administrator.
>
> Thanks!
> John
>
> --
>
> *************************************************
> John Naliboff
> Assistant Project Scientist, CIG
> Earth & Planetary Sciences Dept., UC Davis
> 	
>
>
> _______________________________________________
> Aspect-devel mailing list
> Aspect-devel at geodynamics.org
> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel



-- 
Timo Heister
http://www.math.clemson.edu/~heister/

