[aspect-devel] Memory issues when increasing the number of processors?

John Naliboff jbnaliboff at ucdavis.edu
Tue Aug 22 13:44:47 PDT 2017


Hi all,

I'm looking for feedback on a memory error (or possibly several) that has me somewhat 
perplexed.

The errors are occurring on the XSEDE cluster Comet:
    http://www.sdsc.edu/support/user_guides/comet.html

The models in question are a series of scaling tests following the tests 
run by Rene Gassmoeller:
    https://github.com/gassmoeller/aspect-performance-statistics

When using up to 192 processors and global refinement levels of 2, 3, 4, or 5, 
the scaling results are roughly (not too far off) what I would expect based 
on Rene's results.
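
For concreteness, here is a back-of-envelope sketch of the mesh sizes in the 
run matrix. This is a minimal sketch, not taken from any input file: it assumes 
a 3-D box mesh, so each global refinement multiplies the active-cell count by 8, 
and a 96-cell coarse mesh inferred from the level-4 cell count in the attached log.

    // Back-of-envelope mesh sizes for the scaling runs. Assumptions:
    // 3-D box mesh (each refinement step multiplies cells by 8) and a
    // coarse mesh of 96 cells, inferred from the 393,216 active cells
    // at refinement level 4 in the attached log.
    #include <cstdio>

    int main()
    {
      const long coarse_cells = 96;   // inferred, not from an input file
      for (int level = 2; level <= 5; ++level)
      {
        long cells = coarse_cells;
        for (int i = 0; i < level; ++i)
          cells *= 8;                 // refinement halves each of 3 dimensions
        std::printf("refinement %d: %9ld active cells\n", level, cells);
      }
      return 0;
    }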

However, once I get up to 384 cores the models almost always crash with 
a segmentation fault. Here is part of the error output from a 
model run on 384 cores with 4 global refinement levels:
  Number of active cells: 393,216 (on 5 levels)
  Number of degrees of freedom: 16,380,620 (9,585,030+405,570+3,195,010+3,195,010)

  *** Timestep 0:  t=0 years
     Solving temperature system... 0 iterations.
     Solving C_1 system ... 0 iterations.
     Rebuilding Stokes preconditioner...[comet-06-22:09703] *** Process received signal ***
  [comet-06-22:09703] Signal: Segmentation fault (11)

The full model output is located in the attached file.

Thoughts on what might be causing a memory issue when increasing the 
number of cores?

The perplexing part is that the error does not seem to be tied to the 
number of d.o.f. per processor. Also somewhat perplexing: one model 
that crashed with this error later ran successfully using the 
exact same submission script, input file, etc. However, this has happened 
only once (a failed job re-running successfully), and otherwise the errors 
are almost always reproducible.
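
For reference, a quick back-of-envelope check of the per-process workload, 
using the level-4 DoF count from the attached log:

    // DoFs per MPI process for the refinement-level-4 runs, using the
    // 16,380,620 total DoFs reported in the attached log.
    #include <cstdio>

    int main()
    {
      const long dofs = 16380620;
      for (const int procs : {192, 384})
        std::printf("%3d cores -> ~%ld DoFs/process\n", procs, dofs / procs);
      // 192 cores -> ~85,315 DoFs/process (runs fine)
      // 384 cores -> ~42,657 DoFs/process (crashes)
      // The crash occurs at the *lower* per-process load, so a simple
      // per-process memory shortage looks unlikely.
      return 0;
    }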

If no one has encountered this issue before, are there any suggestions for 
debugging tricks at this number of processors? I may be able to run an 
interactive session in debug mode at this processor count, but I 
would need to check with the cluster administrator.
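
One generic trick I could try is installing a SIGSEGV handler that prints a 
rank-tagged backtrace, so the failing rank is identifiable among hundreds of 
processes. Below is a minimal sketch, assuming Linux/glibc; this is not 
ASPECT's actual code, and deal.II may already provide something similar, so 
treat it as illustrative only:

    // Minimal sketch (assumes Linux/glibc): print a rank-tagged
    // backtrace on SIGSEGV so the failing process can be identified.
    #include <mpi.h>
    #include <csignal>
    #include <cstdio>
    #include <unistd.h>
    #include <execinfo.h>

    static int my_rank = -1;

    extern "C" void segv_handler(int)
    {
      // Stick to write()/backtrace_symbols_fd() here; avoid
      // malloc-based I/O, since the heap may be corrupted.
      char msg[64];
      const int len =
        std::snprintf(msg, sizeof(msg), "SIGSEGV on MPI rank %d\n", my_rank);
      write(STDERR_FILENO, msg, len);

      void *frames[64];
      const int n = backtrace(frames, 64);
      backtrace_symbols_fd(frames, n, STDERR_FILENO);
      _exit(1);
    }

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      std::signal(SIGSEGV, segv_handler);

      // ... set up and run the model as usual ...

      MPI_Finalize();
      return 0;
    }

Alternatively, enabling core dumps in the batch script and opening the failing 
rank's core file with gdb should give the same backtrace without recompiling 
anything.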

Thanks!
John

-- 

*************************************************
John Naliboff
Assistant Project Scientist, CIG
Earth & Planetary Sciences Dept., UC Davis
	

-------------- next part --------------
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
--     . version 2.0.0-pre
--     . running in OPTIMIZED mode
--     . running with 384 MPI processes
--     . using Trilinos
-----------------------------------------------------------------------------


-----------------------------------------------------------------------------
The output directory <output_sses_384_4_1/> provided in the input file appears not to exist.
ASPECT will create it for you.
-----------------------------------------------------------------------------


Number of active cells: 393,216 (on 5 levels)
Number of degrees of freedom: 16,380,620 (9,585,030+405,570+3,195,010+3,195,010)

*** Timestep 0:  t=0 years
   Solving temperature system... 0 iterations.
   Solving C_1 system ... 0 iterations.
   Rebuilding Stokes preconditioner...[comet-06-22:09703] *** Process received signal ***
[comet-06-22:09703] Signal: Segmentation fault (11)
[comet-06-22:09703] Signal code: Address not mapped (1)
[comet-06-22:09703] Failing at address: 0x3fcafb010
[comet-06-22:09703] [ 0] /lib64/libc.so.6[0x3c40632510]
[comet-06-22:09703] [ 1] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x1c8)[0x2ac481e86748]
[comet-06-22:09703] [ 2] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_memalign+0x52)[0x2ac481e89332]
[comet-06-22:09703] [ 3] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0xbf)[0x2ac481e8976f]
[comet-06-22:09703] [ 4] /opt/openmpi/gnu/ib/lib/libmpi.so.1(+0x1134d9)[0x2ac480b374d9]
[comet-06-22:09703] [ 5] /opt/openmpi/gnu/ib/lib/libmpi.so.1(+0x113cbc)[0x2ac480b37cbc]
[comet-06-22:09703] [ 6] /opt/openmpi/gnu/ib/lib/libmpi.so.1(+0x10b488)[0x2ac480b2f488]
[comet-06-22:09703] [ 7] /lib64/libpthread.so.0[0x3c40e07aa1]
[comet-06-22:09703] [ 8] /lib64/libc.so.6(clone+0x6d)[0x3c406e8bcd]
[comet-06-22:09703] *** End of error message ***
--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.

Host: comet-06-04
PID:  28006

--------------------------------------------------------------------------
[comet-21-02.sdsc.edu:09948] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-21-16.sdsc.edu:04982] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-28.sdsc.edu:18414] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-53.sdsc.edu:30097] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-66.sdsc.edu:04734] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14443] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-21-16.sdsc.edu:04994] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-66.sdsc.edu:04720] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14448] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14449] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14447] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14457] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-04.sdsc.edu:27969] 1 more process has sent help message help-orte-odls-base.txt / orte-odls-base:could-not-kill
[comet-06-04.sdsc.edu:27969] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[comet-06-52.sdsc.edu:20479] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14445] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-66.sdsc.edu:04732] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-66.sdsc.edu:04735] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-28.sdsc.edu:18424] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14446] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-66.sdsc.edu:04733] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-14.sdsc.edu:24525] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-14.sdsc.edu:24531] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-66.sdsc.edu:04737] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-66.sdsc.edu:04739] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14455] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14456] too many retries sending message to 0x5a90:0x000639a2, giving up
[comet-06-57.sdsc.edu:14442] too many retries sending message to 0x5a90:0x000639a2, giving up
--------------------------------------------------------------------------
mpirun noticed that process rank 79 with PID 0 on node comet-06-22 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

