[aspect-devel] Memory issues when increasing the number of processors?

John Naliboff jbnaliboff at ucdavis.edu
Thu Aug 31 13:42:24 PDT 2017


Hello again,

A quick update. I ran scaling tests with deal.II step-32 (8, 9, or 10 
global refinements) across 192, 384, 768, or 1536 cores.

The same error as reported previously (see attached file) typically 
occurs at 768 or 1536 cores, although in some cases models that 
previously crashed are able to run without issue. There is still no 
real correlation between the number of d.o.f. per core and the core 
count at which a model crashes.
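
In case it helps with reproducing this, the standard trick for getting 
the longer callstack Timo asked for is to stall every rank until a 
debugger is attached to the one that crashes. A minimal sketch 
(assuming gdb is available on the compute nodes; the "attached" flag is 
a hypothetical name, not anything from step-32):

// Minimal sketch: pause each MPI rank so gdb can be attached to the
// rank of interest and a full backtrace taken at the crash site.
#include <mpi.h>
#include <unistd.h>   // gethostname, getpid, sleep
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char hostname[256];
  gethostname(hostname, sizeof(hostname));
  std::printf("rank %d: pid %d on %s waiting for debugger\n",
              rank, (int)getpid(), hostname);
  std::fflush(stdout);

  // In gdb: attach <pid>, then "set var attached = 1" and "continue".
  volatile int attached = 0;
  while (attached == 0)
    sleep(1);

  // ... the actual solver would run from here ...

  MPI_Finalize();
  return 0;
}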

Is it worth reporting this issue to the Trilinos mailing list?

Cheers,
John

*************************************************
John Naliboff
Assistant Project Scientist, CIG
Earth & Planetary Sciences Dept., UC Davis

On 08/24/2017 03:46 PM, John Naliboff wrote:
> Hi all,
>
> Below are messages that I accidentally sent only to Timo rather than 
> to the whole mailing list.
>
> Timo - I tried Trilinos 12.10.1 and it did not resolve the issue. 
> I'm going to try to reproduce the issue with step-32 and/or a 
> different cluster next.
>
> Cheers,
> John
> *************************************************
> John Naliboff
> Assistant Project Scientist, CIG
> Earth & Planetary Sciences Dept., UC Davis
> On 08/23/2017 02:56 PM, Timo Heister wrote:
>> John,
>>
>>> /home/jboff/software/trilinos/trilinos-12.4.2/install/lib/libml.so.12(ML_Comm_Send+0x20)[0x2ba25c648cc0]
>> So this is inside the multigrid preconditioner from Trilinos. One
>> option might be to try a newer Trilinos release. Sorry, I know that is
>> annoying.
>>
>>> /home/jboff/software/aspect/master/aspect/./aspect(_ZN6aspect18FreeSurfaceHandlerILi3EE26compute_mesh_displacementsEv+0x55c)
>> You are using free surface computations. This is something that we
>> haven't tested as much. Do you also get crashes without free surface
>> computations?
>>
>>
>>
>>
>> On Wed, Aug 23, 2017 at 5:40 PM, John Naliboff <jbnaliboff at ucdavis.edu> wrote:
>>> Hi Timo,
>>>
>>> Thanks for the feedback. I tried a few more tests with a different model
>>> (lithospheric deformation) and encountered similar issues. The attached
>>> error output provides a bit more info this time. The model was run across
>>> 768 cores.
>>>
>>> From the output it looks like there is an issue in Epetra?
>>>
>>> Perhaps unrelated, but I am using LAPACK 3.6.0 and had to change some 
>>> of the symbol labels in packages/epetra/src/Epetra_LAPACK_wrappers.h 
>>> (e.g., following https://www.dealii.org/8.5.0/external-libs/trilinos.html).
>>>
>>> Cheers,
>>> John
>>>
>>> *************************************************
>>> John Naliboff
>>> Assistant Project Scientist, CIG
>>> Earth & Planetary Sciences Dept., UC Davis
>>>
>>> On 08/23/2017 09:41 AM, Timo Heister wrote:
>>>
>>> John,
>>>
>>> it would be neat to have a longer callstack to see where this error is
>>> happening.
>>>
>>> Some ideas:
>>> 1. This could be a hardware issue (one of the nodes cannot
>>> communicate, has packet loss, or whatever).
>>> 2. This could be a configuration problem ("too many retries sending
>>> message to 0x5a90:0x000639a2, giving up" could mean some MPI timeouts
>>> are triggered)
>>> 3. It could be a bug in some MPI code (in Trilinos, deal.II, or
>>> ASPECT). A longer callstack would help narrow that down.
>>>
>>> If you feel like experimenting, you could see if you can trigger the
>>> same issue with deal.II step-32.
>>>
>>>
>>> On Tue, Aug 22, 2017 at 4:44 PM, John Naliboff <jbnaliboff at ucdavis.edu> wrote:
>>>
>>> Hi all,
>>>
>>> I'm looking for feedback on a memory error(s?) that has me somewhat
>>> perplexed.
>>>
>>> The errors are occurring on the XSEDE cluster Comet:
>>>
>>> http://www.sdsc.edu/support/user_guides/comet.html
>>>
>>> The models in question are a series of scaling tests following the tests run
>>> by Rene Gassmoeller:
>>>
>>> https://github.com/gassmoeller/aspect-performance-statistics
>>>
>>> When using up to 192 processors and global refinement levels of 2, 3, 4 or 5
>>> the scaling results are "roughly" (not too far off) what I would expect
>>> based on Rene's results.
>>>
>>> However, once I get up to 384 cores the models almost always crash with a
>>> segmentation fault error. Here is part of the error message from a model run
>>> on 384 cores with 4 global refinement levels.
>>>    Number of active cells: 393,216 (on 5 levels)
>>>    Number of degrees of freedom: 16,380,620
>>> (9,585,030+405,570+3,195,010+3,195,010)
>>>
>>>    *** Timestep 0:  t=0 years
>>>       Solving temperature system... 0 iterations.
>>>       Solving C_1 system ... 0 iterations.
>>>       Rebuilding Stokes preconditioner...[comet-06-22:09703] *** Process
>>> received signal ***
>>>    [comet-06-22:09703] Signal: Segmentation fault (11)
>>>
>>> The full model output is located in the attached file.
>>>
>>> Thoughts on what might be causing a memory issue when increasing the number
>>> of cores?
>>>
>>> The perplexing part is that the error does not seem to be tied to the
>>> number of d.o.f. per processor. Also somewhat perplexing: one model that
>>> crashed with this error was able to run successfully using the exact same
>>> submission script, input file, etc. However, this only happened once
>>> (a previously failed job completing successfully), and otherwise the
>>> errors are almost always reproducible.
>>>
>>> If no one has encountered this issue before, any suggestions for debugging
>>> tricks with this number of processors? I may be able to run an interactive
>>> session in debug mode with this number of processors, but I would need to
>>> check with the cluster administrator.
>>>
>>> Thanks!
>>> John
>>>
>>> --
>>>
>>> *************************************************
>>> John Naliboff
>>> Assistant Project Scientist, CIG
>>> Earth & Planetary Sciences Dept., UC Davis
>>>
>>>
>>> _______________________________________________
>>> Aspect-devel mailing list
>>> Aspect-devel at geodynamics.org
>>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>>>
>>>
>>>
>>
>

-------------- next part --------------
Number of active cells: 12,582,912 (on 11 levels)
Number of degrees of freedom: 188,817,408 (100,712,448+37,748,736+50,356,224)

Timestep 0:  t=0 years

   Rebuilding Stokes preconditioner...[comet-17-38:12976] *** Process received signal ***
[comet-17-38:12976] Signal: Segmentation fault (11)
[comet-17-38:12976] Signal code: Address not mapped (1)
[comet-17-38:12976] Failing at address: 0x36dfaf010
[comet-17-38:12976] [ 0] /lib64/libc.so.6[0x344ca32510]
[comet-17-38:12976] [ 1] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x1c8)[0x2b45a2c9a748]
[comet-17-38:12976] [ 2] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_memalign+0x52)[0x2b45a2c9d332]
[comet-17-38:12976] [ 3] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0xbf)[0x2b45a2c9d76f]
[comet-17-38:12976] [ 4] /usr/lib64/libmlx4-rdmav2.so(+0x1a8ef)[0x2b45a51388ef]
[comet-17-38:12976] [ 5] /usr/lib64/libmlx4-rdmav2.so(+0x1aa2f)[0x2b45a5138a2f]
[comet-17-38:12976] [ 6] /usr/lib64/libibverbs.so.1(ibv_create_ah+0x127)[0x2b45a26f0287]
[comet-17-38:12976] [ 7] /opt/openmpi/gnu/ib/lib/libmpi.so.1(+0x113040)[0x2b45a194b040]
[comet-17-38:12976] [ 8] /opt/openmpi/gnu/ib/lib/libmpi.so.1(mca_btl_openib_endpoint_send+0x8d)[0x2b45a193ae8d]
[comet-17-38:12976] [ 9] /opt/openmpi/gnu/ib/lib/libmpi.so.1(mca_pml_ob1_send_request_start_copy+0x470)[0x2b45a1a33c80]
[comet-17-38:12976] [10] /opt/openmpi/gnu/ib/lib/libmpi.so.1(mca_pml_ob1_send+0x2bb)[0x2b45a1a2a01b]
[comet-17-38:12976] [11] /opt/openmpi/gnu/ib/lib/libmpi.so.1(MPI_Send+0x153)[0x2b45a18e49c3]
[comet-17-38:12976] [12] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Comm_Send+0x20)[0x2b45990dbb30]
[comet-17-38:12976] [13] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_CommInfoOP_GenUsingGIDExternals+0x2ad)[0x2b45990e4bdd]
[comet-17-38:12976] [14] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_back_to_local+0x684)[0x2b45990e66a4]
[comet-17-38:12976] [15] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_rap+0x3ef)[0x2b459912ef0f]
[comet-17-38:12976] [16] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Gen_AmatrixRAP+0x10d)[0x2b45990d152d]
[comet-17-38:12976] [17] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Gen_MultiLevelHierarchy+0xbe5)[0x2b459907dbc5]
[comet-17-38:12976] [18] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Gen_MultiLevelHierarchy_UsingAggregation+0x214)[0x2b459907e504]
[comet-17-38:12976] [19] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(_ZN9ML_Epetra24MultiLevelPreconditioner21ComputePreconditionerEb+0x3e70)[0x2b45991a2250]
[comet-17-38:12976] [20] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(_ZN9ML_Epetra24MultiLevelPreconditionerC1ERK16Epetra_RowMatrixRKN7Teuchos13ParameterListEb+0x272)[0x2b45991a6e42]
[comet-17-38:12976] [21] /home/jboff/software/dealii/v8.5.0/install/lib/libdeal_II.g.so.8.5.0(_ZN6dealii16TrilinosWrappers15PreconditionAMG10initializeERK16Epetra_RowMatrixRKN7Teuchos13ParameterListE+0xae)[0x2b45928cd8ba]
[comet-17-38:12976] [22] /home/jboff/software/dealii/v8.5.0/install/lib/libdeal_II.g.so.8.5.0(_ZN6dealii16TrilinosWrappers15PreconditionAMG10initializeERK16Epetra_RowMatrixRKNS1_14AdditionalDataE+0x2c6b)[0x2b45928d08d1]
[comet-17-38:12976] [23] /home/jboff/software/dealii/v8.5.0/install/lib/libdeal_II.g.so.8.5.0(_ZN6dealii16TrilinosWrappers15PreconditionAMG10initializeERKNS0_12SparseMatrixERKNS1_14AdditionalDataE+0x14)[0x2b45928d11a0]
[comet-17-38:12976] [24] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32(_ZN6Step3221BoussinesqFlowProblemILi2EE27build_stokes_preconditionerEv+0x3c1)[0x4e69d1]
[comet-17-38:12976] [25] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32(_ZN6Step3221BoussinesqFlowProblemILi2EE3runEv+0x296)[0x4f035a]
[comet-17-38:12976] [26] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32(main+0x92)[0x4973cb]
[comet-17-38:12976] [27] /lib64/libc.so.6(__libc_start_main+0xfd)[0x344ca1ed1d]
[comet-17-38:12976] [28] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32[0x497159]
[comet-17-38:12976] *** End of error message ***
[comet-11-13:17731] *** Process received signal ***
[comet-11-13:17731] Signal: Segmentation fault (11)
[comet-11-13:17731] Signal code:  (128)
[comet-11-13:17731] Failing at address: (nil)
[comet-11-13:17731] [ 0] /lib64/libc.so.6[0x32d9432510]
[comet-11-13:17731] [ 1] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_libevent2021_event_active_nolock+0x110)[0x2b5235b5dda0]
[comet-11-13:17731] [ 2] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x522)[0x2b5235b5fab2]
[comet-11-13:17731] [ 3] /opt/openmpi/gnu/ib/lib/libopen-pal.so.6(opal_progress+0x98)[0x2b5235b1d6d8]
[comet-11-13:17731] [ 4] /opt/openmpi/gnu/ib/lib/libmpi.so.1(mca_pml_ob1_send+0x325)[0x2b5234930085]
[comet-11-13:17731] [ 5] /opt/openmpi/gnu/ib/lib/libmpi.so.1(MPI_Send+0x153)[0x2b52347ea9c3]
[comet-11-13:17731] [ 6] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Comm_Send+0x20)[0x2b522bfe1b30]
[comet-11-13:17731] [ 7] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_CommInfoOP_GenUsingGIDExternals+0x2ad)[0x2b522bfeabdd]
[comet-11-13:17731] [ 8] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_back_to_local+0x684)[0x2b522bfec6a4]
[comet-11-13:17731] [ 9] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_rap+0x3ef)[0x2b522c034f0f]
[comet-11-13:17731] [10] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Gen_AmatrixRAP+0x10d)[0x2b522bfd752d]
[comet-11-13:17731] [11] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Gen_MultiLevelHierarchy+0xbe5)[0x2b522bf83bc5]
[comet-11-13:17731] [12] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(ML_Gen_MultiLevelHierarchy_UsingAggregation+0x214)[0x2b522bf84504]
[comet-11-13:17731] [13] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(_ZN9ML_Epetra24MultiLevelPreconditioner21ComputePreconditionerEb+0x3e70)[0x2b522c0a8250]
[comet-11-13:17731] [14] /home/jboff/software/trilinos/trilinos-12.10.1/install/lib/libml.so.12(_ZN9ML_Epetra24MultiLevelPreconditionerC1ERK16Epetra_RowMatrixRKN7Teuchos13ParameterListEb+0x272)[0x2b522c0ace42]
[comet-11-13:17731] [15] /home/jboff/software/dealii/v8.5.0/install/lib/libdeal_II.g.so.8.5.0(_ZN6dealii16TrilinosWrappers15PreconditionAMG10initializeERK16Epetra_RowMatrixRKN7Teuchos13ParameterListE+0xae)[0x2b52257d38ba]
[comet-11-13:17731] [16] /home/jboff/software/dealii/v8.5.0/install/lib/libdeal_II.g.so.8.5.0(_ZN6dealii16TrilinosWrappers15PreconditionAMG10initializeERK16Epetra_RowMatrixRKNS1_14AdditionalDataE+0x2c6b)[0x2b52257d68d1]
[comet-11-13:17731] [17] /home/jboff/software/dealii/v8.5.0/install/lib/libdeal_II.g.so.8.5.0(_ZN6dealii16TrilinosWrappers15PreconditionAMG10initializeERKNS0_12SparseMatrixERKNS1_14AdditionalDataE+0x14)[0x2b52257d71a0]
[comet-11-13:17731] [18] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32(_ZN6Step3221BoussinesqFlowProblemILi2EE27build_stokes_preconditionerEv+0x3c1)[0x4e69d1]
[comet-11-13:17731] [19] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32(_ZN6Step3221BoussinesqFlowProblemILi2EE3runEv+0x296)[0x4f035a]
[comet-11-13:17731] [20] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32(main+0x92)[0x4973cb]
[comet-11-13:17731] [21] /lib64/libc.so.6(__libc_start_main+0xfd)[0x32d941ed1d]
[comet-11-13:17731] [22] /home/jboff/software/dealii/v8.5.0/install/examples/step-32/./step-32[0x497159]
[comet-11-13:17731] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1087 with PID 0 on node comet-17-38 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
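
For what it's worth, the frames above descend from step-32's 
build_stokes_preconditioner through deal.II's 
TrilinosWrappers::PreconditionAMG::initialize into ML, and both traces 
fail during the coarse-hierarchy RAP product (ML_rap / ML_Comm_Send). 
A minimal sketch of that call site, assuming the deal.II 8.5 
TrilinosWrappers interface (parameter values are illustrative, not 
necessarily step-32's exact settings):

// Sketch of the call at the top of the traces above: initializing the
// Trilinos ML (AMG) preconditioner through deal.II's wrappers.
#include <deal.II/lac/trilinos_sparse_matrix.h>
#include <deal.II/lac/trilinos_precondition.h>

void build_amg_preconditioner(const dealii::TrilinosWrappers::SparseMatrix &matrix)
{
  dealii::TrilinosWrappers::PreconditionAMG                 amg;
  dealii::TrilinosWrappers::PreconditionAMG::AdditionalData data;

  data.elliptic              = true;  // vector-Laplace-like operator
  data.higher_order_elements = true;  // e.g. Q2 velocity elements
  data.smoother_sweeps       = 2;
  data.aggregation_threshold = 0.02;

  // This call constructs ML_Epetra::MultiLevelPreconditioner and runs
  // ML_Gen_MultiLevelHierarchy / ML_rap, where the segfaults occur.
  amg.initialize(matrix, data);
}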

