[aspect-devel] convection-box-3d example hangs when more than one node is used (Rene Gassmoeller)

Rene Gassmoeller rengas at gfz-potsdam.de
Fri Aug 7 11:39:04 PDT 2015


Hi Rob,
great, then we know where to look. Could you add

'set Number of grouped files = 1'

in the Visualization subsection and re-enable the plugin? With this
setting, all output is written via MPI-IO into a single file per timestep.
If this works, the problem lies in the default path, where each process
writes its own file to a temporary folder from a background thread. If it
does not work, we will have to look deeper into deal.II.

If this works, it is also a good (semi-)permanent workaround for your
problem, since you usually do not need one output file per process. It is
not the default only because we cannot rely on MPI-IO being available on
every system ASPECT runs on.
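
As a sketch, the relevant part of the parameter file might look like the
fragment below. The nesting under 'subsection Postprocess' is an assumption
based on the layout of the ASPECT cookbooks, not taken from Rob's actual
file, and any other entries in that section are left out:

```
subsection Postprocess
  subsection Visualization
    # Assumed placement: write all visualization output via MPI-IO
    # into one file per timestep instead of one file per process.
    set Number of grouped files = 1
  end
end
```

With 'visualization' back in 'set List of postprocessors', a run that now
completes on more than one node would point at the per-process output path
rather than at the solver.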

Best,
Rene

PS: Alternatively you could try the hdf5 output format instead of vtu.
('set Output format = hdf5' in the Visualization subsection).
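
A minimal sketch of that alternative, again assuming the Visualization
subsection sits inside 'subsection Postprocess' as in the cookbooks. It
presumably also requires a deal.II build configured with HDF5 support:

```
subsection Postprocess
  subsection Visualization
    # Assumed placement: switch from the default vtu output to hdf5.
    set Output format = hdf5
  end
end
```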


On 08/07/2015 05:17 PM, Robert Moucha wrote:
> Rene, the cause of the hang is the visualization plugin. When I
> removed it, I could run on more than one node to completion.
> 
> Rob
> 
> On Mon, Aug 3, 2015 at 3:00 PM,  <aspect-devel-request at geodynamics.org> wrote:
>>
>> Message: 1
>> Date: Mon, 03 Aug 2015 14:20:20 +0200
>> From: Rene Gassmoeller <rengas at gfz-potsdam.de>
>> To: aspect-devel at geodynamics.org
>> Subject: Re: [aspect-devel] convection-box-3d example hangs when more
>>         than one node is used
>> Message-ID: <55BF5C84.6010908 at gfz-potsdam.de>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi Rob,
>> I tried to reproduce your problem with the same Trilinos, p4est, deal.II
>> and ASPECT versions (older MPI and newer gcc, though), but unfortunately
>> without success. It seems we need to run some tests on your machine to
>> find the problem. As a first step, could you disable the visualization
>> output plugin by removing 'visualization' from the line 'set List of
>> postprocessors'? This is just to make sure the hang does not come from
>> the previous timestep: the output is written in a background thread and
>> might cause a delayed crash.
>> Next, we need to figure out what happens between the last output and the
>> next expected output. From what I see, the last output you get is the
>> start of the new timestep. Between there and the next message ("Solving
>> temperature system") only the boundary conditions and user plugins are
>> updated, and the temperature system and temperature preconditioner are
>> assembled. I attached a patch with additional debug output for core.cc.
>> When you apply the patch ('patch -p1 < patch' in your ASPECT folder) and
>> rebuild ASPECT, you should see additional output after the last line.
>> Could you check which lines get printed in Timestep 1? After that we can
>> think about what is causing the issue.
>>
>> Best
>> Rene
>>
>> On 07/31/2015 10:51 PM, Robert Moucha wrote:
>>> Finally got a chance to get back to this after some travel:
>>>
>>> I tried the step-32 example of deal.II and it runs without a problem on
>>> more than one compute node.
>>>
>>> ASPECT is the default 1.3 version; the problem occurs with both the debug
>>> and release builds.
>>>
>>> To compile ASPECT I'm using:
>>>
>>> gcc 4.4.7
>>> BLAS and LAPACK 3.2.1-4
>>> OpenMPI 1.8.4
>>> Trilinos 12.0.1 with CXX11=OFF
>>> p4est 1.1
>>> phdf5 1.8.15
>>> deal.II 8.2.1
>>>
>>> No problems compiling these, as far as I could tell.
>>>
>>> When ASPECT hangs, right after the initial time step, the
>>> solution-00000 files for all processors, as well as the other output
>>> files, are written without issue; then it just hangs.
>>>
>>> Thanks,
>>> Rob
>>>
>>>> Message: 1
>>>> Date: Sun, 19 Jul 2015 12:25:14 +0200
>>>> From: Rene Gassmoeller <rengas at gfz-potsdam.de>
>>>> To: aspect-devel at geodynamics.org
>>>> Subject: Re: [aspect-devel] convection-box-3d example hangs when more
>>>>         than one node is used
>>>> Message-ID: <55AB7B0A.8040503 at gfz-potsdam.de>
>>>> Content-Type: text/plain; charset=utf-8
>>>>
>>>> Hi Rob,
>>>> I just checked the convection-box-3d on 2 nodes of our cluster and it
>>>> runs fine. So it seems there is either something special about your
>>>> installation or a bug that only shows up in this configuration. Could
>>>> you test the following things so that we can give you some more help:
>>>>
>>>> 1. Try running convection-box with ASPECT compiled in debug mode.
>>>> Maybe there is an error message that is suppressed in release mode.
>>>> 2. Could you compile the deal.II example step-32 and run it on more
>>>> than one node of your cluster? It should be in your deal.II folder
>>>> under /examples/step-32, and compiling should be as simple as
>>>> 'cmake . && make'. This will tell us whether something in ASPECT is
>>>> causing the problem or whether it is an issue with the deal.II code or
>>>> configuration.
>>>> 3. We need some more information on your deal.II configuration (your
>>>> ASPECT is an unchanged 1.3, right?). Which version of deal.II are you
>>>> using? Which Trilinos, p4est and compiler? Were there any problems
>>>> while compiling those?
>>>>
>>>> Best,
>>>> Rene
>>>>
>>>> On 07/18/2015 12:23 AM, Robert Moucha wrote:
>>>>> Hi Timo,
>>>>>
>>>>> Yes, I still have the same problem. It occurs with the following
>>>>> cookbooks (I have not tried all of them, but it looks like anything to
>>>>> do with time stepping causes the hang):
>>>>>
>>>>> convection-box
>>>>> convection-box-3d
>>>>> shell_simple_2d
>>>>> van-keken-discontinuous
>>>>>
>>>>> Thanks
>>>>> Rob
>>>>>
>>>>>> Hey Robert,
>>>>>>
>>>>>> sorry for only getting back to this now. Any update on your problem?
>>>>>> Does this happen with every .prm file (like a simple 2d problem)?
>>>>>>
>>>>>> On Sun, Jul 5, 2015 at 6:10 PM, Robert Moucha <rmoucha at gmail.com> wrote:
>>>>>>> OK, it appears that I solved last week's issue with the files; it turns
>>>>>>> out one of the nodes did not have the correct paths (thanks).
>>>>>>>
>>>>>>> However, I am still having problems when using more than one node.
>>>>>>> This time ASPECT just hangs on time step 1 with no error: the
>>>>>>> solution-00000 files are created on each of the nodes, then nothing.
>>>>>>>
>>>>>>> It runs fine on a single node. I should point out that the ASPECT
>>>>>>> example stokes.prm, as well as CitcomS, runs on the cluster without
>>>>>>> issues.
>>>>>>>
>>>>>>> Here is the log.txt for convection-box-3d.prm. Thanks in advance, Rob
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------
>>>>>>> -- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
>>>>>>> --     . version 1.3
>>>>>>> --     . running in OPTIMIZED mode
>>>>>>> --     . running with 12 MPI processes
>>>>>>> --     . using Trilinos
>>>>>>> -----------------------------------------------------------------------------
>>>>>>>
>>>>>>> Number of active cells: 512 (on 4 levels)
>>>>>>> Number of degrees of freedom: 20381 (14739+729+4913)
>>>>>>>
>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>    Solving Stokes system... 29 iterations.
>>>>>>>
>>>>>>> Number of active cells: 1583 (on 5 levels)
>>>>>>> Number of degrees of freedom: 63622 (46077+2186+15359)
>>>>>>>
>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>    Solving Stokes system... 30+4 iterations.
>>>>>>>
>>>>>>> Number of active cells: 3256 (on 5 levels)
>>>>>>> Number of degrees of freedom: 122269 (88647+4073+29549)
>>>>>>>
>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>    Solving Stokes system... 30+4 iterations.
>>>>>>>
>>>>>>> Number of active cells: 9010 (on 6 levels)
>>>>>>> Number of degrees of freedom: 333145 (241677+10909+80559)
>>>>>>>
>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>    Solving Stokes system... 30+4 iterations.
>>>>>>>
>>>>>>>    Postprocessing:
>>>>>>>      RMS, max velocity:                  57.6 m/s, 176 m/s
>>>>>>>      Temperature min/avg/max:            0 K, 0.5 K, 1 K
>>>>>>>      Heat fluxes through boundary parts: 7.682e-07 W, -7.682e-07 W, 1.685e-15 W, 2.362e-15 W, -1 W, 1 W
>>>>>>>      Writing graphical output:           /state/partition1/RMOUCHA/output/solution-00000
>>>>>>>
>>>>>>> *** Timestep 1:  t=8.87115e-05 seconds
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------
>>>>>>> Robert Moucha
>>>>>>> Assistant Professor of Geophysics
>>>>>>> Department of Earth Sciences
>>>>>>> 204 Heroy Geology Lab
>>>>>>> Syracuse University
>>>>>>> Syracuse, NY, 13244-1070
>>>>>>> _______________________________________________
>>>>>>> Aspect-devel mailing list
>>>>>>> Aspect-devel at geodynamics.org
>>>>>>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>> -------------- next part --------------
>> diff --git a/source/simulator/core.cc b/source/simulator/core.cc
>> index 6516ac2..ecf881c 100644
>> --- a/source/simulator/core.cc
>> +++ b/source/simulator/core.cc
>> @@ -636,6 +636,8 @@ namespace aspect
>>      gravity_model->update();
>>      heating_model->update();
>>      adiabatic_conditions->update();
>> +
>> +    pcout << "   Updated constraints and plugins." << std::endl;
>>    }
>>
>>
>> @@ -1532,7 +1534,12 @@ namespace aspect
>>            if (parameters.free_surface_enabled)
>>              free_surface->execute ();
>>
>> +          pcout << "   Assemble temperature system." << std::endl;
>> +
>>            assemble_advection_system (AdvectionField::temperature());
>> +
>> +          pcout << "   Build temperature preconditioner." << std::endl;
>> +
>>            build_advection_preconditioner(AdvectionField::temperature(),
>>                                           T_preconditioner);
>>            solve_advection(AdvectionField::temperature());