[aspect-devel] convection-box-3d example hangs when more than one node is used (Rene Gassmoeller)

Rene Gassmoeller rengas at gfz-potsdam.de
Fri Aug 7 11:52:43 PDT 2015


Actually, the code currently looks for $TMPDIR, not $TMP, but setting the
export is still worth a try. In particular, it would tell us whether the
temporary directory or the background writing thread is causing the issue.
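
As a rough sketch of that test (assuming a POSIX shell job script;
prepare_tmpdir, the directory name and the mpirun line are placeholders,
not anything ASPECT provides), the launch script could create a
directory, check that it is writable and export it under both names
before starting the run:

```shell
# Hypothetical helper for a launch script; not part of ASPECT or of MPI.
# Creates the directory if needed, confirms that a file can be written
# into it, and exports it as both TMPDIR and TMP so that it does not
# matter which of the two names the code reads.
prepare_tmpdir() {
    dir=$1
    mkdir -p "$dir" || return 1

    # Create and remove a probe file to confirm write access.
    probe="$dir/.write_test.$$"
    if ! touch "$probe" 2>/dev/null; then
        echo "cannot write to $dir" >&2
        return 1
    fi
    rm -f "$probe"

    TMPDIR=$dir
    TMP=$dir
    export TMPDIR TMP
}

# Example use in a job script (path and process count are placeholders):
#   prepare_tmpdir "$HOME/mytmp" || exit 1
#   mpirun -np 12 ./aspect convection-box-3d.prm
```

If the hang disappears with this in place, the temporary directory is
the likely culprit; if it persists, the background writing thread is the
more likely suspect. With Open MPI the variable may also have to be
forwarded to the other nodes, for example with 'mpirun -x TMPDIR'.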

Maybe we could make the temporary directory an input parameter, and only
try to use a temporary directory when the input parameter is set, when
$TMP or $TMPDIR is set, or when /tmp is available (in the order of
precedence: parameter, then shell variable, then /tmp). On the other hand,
we already check whether the folder is available before writing, and if
it is not, the code should simply write the file directly to the final
location. But somehow this seems to crash in Rob's case.
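
The proposed lookup order can be sketched as a small shell function.
This is only an illustration of the idea, not ASPECT code: choose_tmpdir
and its argument (standing in for the suggested input parameter) are
hypothetical, and trying $TMPDIR before $TMP is an assumption.

```shell
# choose_tmpdir PARAM
# Prints the first candidate that is an existing, writable directory:
#   1. PARAM (stand-in for a hypothetical input-file parameter),
#   2. $TMPDIR, then $TMP (this relative order is an assumption),
#   3. /tmp.
# Prints nothing and returns 1 if no candidate is usable; the caller
# would then write the file directly to its final location.
choose_tmpdir() {
    for candidate in "$1" "${TMPDIR:-}" "${TMP:-}" /tmp; do
        if [ -n "$candidate" ] && [ -d "$candidate" ] && [ -w "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (the parameter value is a placeholder):
#   if dir=$(choose_tmpdir "$HOME/mytmp"); then
#       echo "writing via temporary directory $dir"
#   else
#       echo "no temporary directory, writing to the final location"
#   fi
```

An empty or missing candidate is skipped, so an unset variable simply
falls through to the next entry in the list.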


On 08/07/2015 08:44 PM, Timo Heister wrote:
> I was just about to write the same suggestions, Rene. :-)
> 
> It could be that $TMP is not set up correctly on one of the nodes or
> the disk is full. Not sure how we can make this more robust from
> inside ASPECT.
> 
> Rob, another thing to try would be to "export TMP=~/mytmp" with some
> directory that you can write into. Do this in the launch script before
> mpirun.
> 
> 
> 
> 
> On Fri, Aug 7, 2015 at 1:39 PM, Rene Gassmoeller <rengas at gfz-potsdam.de> wrote:
>> Hi Rob,
>> great, then we know where to look. Could you add
>>
>> 'set Number of grouped files = 1'
>>
>> in the Visualization subsection and re-enable the plugin? With this
>> setting all output will be written via MPI-IO into one file per timestep.
>> If this works, the problem lies in writing one file per process from a
>> background thread into a temporary folder. If it does not work, we will
>> have to look deeper into deal.II.
>>
>> If this works, it is also a good (semi-)permanent workaround for your
>> problem, since you usually do not need one output file per process. It
>> is not the default only because we cannot rely on MPI-IO being
>> available on all systems ASPECT runs on.
>>
>> Best,
>> Rene
>>
>> PS: Alternatively you could try the hdf5 output format instead of vtu.
>> ('set Output format = hdf5' in the Visualization subsection).
>>
>>
>> On 08/07/2015 05:17 PM, Robert Moucha wrote:
>>> Rene, the cause of the hang is the visualization plugin. When I
>>> removed it, I could run on more than one node to completion.
>>>
>>> Rob
>>>
>>> On Mon, Aug 3, 2015 at 3:00 PM,  <aspect-devel-request at geodynamics.org> wrote:
>>>>
>>>> Message: 1
>>>> Date: Mon, 03 Aug 2015 14:20:20 +0200
>>>> From: Rene Gassmoeller <rengas at gfz-potsdam.de>
>>>> To: aspect-devel at geodynamics.org
>>>> Subject: Re: [aspect-devel] convection-box-3d example hangs when more
>>>>         than one node is used
>>>> Message-ID: <55BF5C84.6010908 at gfz-potsdam.de>
>>>> Content-Type: text/plain; charset="utf-8"
>>>>
>>>> Hi Rob,
>>>> I tried to reproduce your problem with the same Trilinos, p4est, deal.II
>>>> and ASPECT versions (older MPI and newer gcc though), but unfortunately
>>>> without success. It seems we need to do some tests on your machine to
>>>> find the problem. Could you, as a first step, disable the visualization
>>>> output plugin by removing 'visualization' from the line 'set List of
>>>> postprocessors' (just to be sure it is nothing from the last timestep
>>>> ... it is written in a background thread and might cause a delayed crash).
>>>> Next we need to figure out what is happening between the last output
>>>> and the next expected output. From what I see, the last output you get is
>>>> the start of the new timestep. Between there and the next message
>>>> ("Solving temperature system") only the boundary conditions and user
>>>> plugins get updated, and the temperature system and temperature
>>>> preconditioner are assembled. I attached a patch with additional debug
>>>> output for core.cc. When you apply the patch ('patch -p1 < patch' in
>>>> your ASPECT folder) and rebuild ASPECT, you should see additional output
>>>> after the last line. Could you check which lines get printed in Timestep
>>>> 1? After that we can think about what is causing the issue.
>>>>
>>>> Best
>>>> Rene
>>>>
>>>> On 07/31/2015 10:51 PM, Robert Moucha wrote:
>>>>> Finally got a chance to get back to this after some travel:
>>>>>
>>>>> I tried step-32 example of deal.II and it runs without a problem on
>>>>> more than one compute node.
>>>>>
>>>>> ASPECT is the default 1.3 version, the problem occurs with both debug
>>>>> and release versions.
>>>>>
>>>>> To compile ASPECT I'm using:
>>>>>
>>>>> gcc 4.4.7
>>>>> BLAS and LAPACK 3.2.1-4
>>>>> OpenMPI 1.8.4
>>>>> Trilinos 12.0.1 with CXX11=OFF
>>>>> p4est 1.1
>>>>> phdf5 1.8.15
>>>>> deal.II 8.2.1
>>>>>
>>>>> No problems compiling these, as far as I could tell.
>>>>>
>>>>> When ASPECT hangs, right after the initial time step, the
>>>>> solution-00000 files for all processors are written without issue,
>>>>> as are the other files; then it just hangs.
>>>>>
>>>>> Thanks,
>>>>> Rob
>>>>>
>>>>>> Message: 1
>>>>>> Date: Sun, 19 Jul 2015 12:25:14 +0200
>>>>>> From: Rene Gassmoeller <rengas at gfz-potsdam.de>
>>>>>> To: aspect-devel at geodynamics.org
>>>>>> Subject: Re: [aspect-devel] convection-box-3d example hangs when more
>>>>>>         than one node is used
>>>>>> Message-ID: <55AB7B0A.8040503 at gfz-potsdam.de>
>>>>>> Content-Type: text/plain; charset=utf-8
>>>>>>
>>>>>> Hi Rob,
>>>>>> I just checked the convection-box-3d on 2 nodes of our cluster and it
>>>>>> runs fine. So it seems there is something special about your
>>>>>> installation or there is a bug that is only showing in this
>>>>>> configuration. Could you test the following things for us to be able to
>>>>>> give you some more help:
>>>>>>
>>>>>> 1. Try running convection-box with an ASPECT compiled in debug mode.
>>>>>> Maybe there is some error message suppressed by the release mode.
>>>>>> 2. Could you try to compile the deal.II example step-32 and run that one
>>>>>> on more than one node of your cluster? It should be in your deal.II
>>>>>> folder under /examples/step-32, and compiling should be as simple as
>>>>>> 'cmake . && make'. This will tell us whether something in ASPECT is
>>>>>> causing the problem or whether it is an issue with the deal.II code or
>>>>>> configuration.
>>>>>> 3. We need some more information on your deal.II configuration (your
>>>>>> ASPECT is an unchanged 1.3, right?). Which version of deal.II are you
>>>>>> using? Which Trilinos, p4est and compiler? Were there any problems
>>>>>> while compiling those?
>>>>>>
>>>>>> Best,
>>>>>> Rene
>>>>>>
>>>>>> On 07/18/2015 12:23 AM, Robert Moucha wrote:
>>>>>>> Hi Timo,
>>>>>>>
>>>>>>> Yes, I still have the same problem. It occurs with the following
>>>>>>> cookbooks (I have not tried all of them, but it looks like anything
>>>>>>> to do with time stepping is causing the hang):
>>>>>>>
>>>>>>> convection-box
>>>>>>> convection-box-3d
>>>>>>> shell_simple_2d
>>>>>>> van-keken-discontinuous
>>>>>>>
>>>>>>> Thanks
>>>>>>> Rob
>>>>>>>
>>>>>>>> Hey Robert,
>>>>>>>>
>>>>>>>> sorry for only getting back to this now. Any update on your problem?
>>>>>>>> Does this happen with every .prm file (like a simple 2d problem)?
>>>>>>>>
>>>>>>>> On Sun, Jul 5, 2015 at 6:10 PM, Robert Moucha <rmoucha at gmail.com> wrote:
>>>>>>>>> OK, it appears that I solved last week's issue with the files; it
>>>>>>>>> turns out one of the nodes did not have the correct paths (thanks).
>>>>>>>>>
>>>>>>>>> However, I am still having problems when using more than one node.
>>>>>>>>> This time ASPECT just hangs on time step 1 with no error: the
>>>>>>>>> solution-00000 files are created on each of the nodes, then nothing.
>>>>>>>>>
>>>>>>>>> It runs fine on a single node. I should point out that the ASPECT
>>>>>>>>> example stokes.prm as well as CitcomS run on the cluster without
>>>>>>>>> issues.
>>>>>>>>>
>>>>>>>>> Here is the log.txt for the convection-box-3d.prm -- thanks in advance Rob
>>>>>>>>>
>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>> -- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
>>>>>>>>> --     . version 1.3
>>>>>>>>> --     . running in OPTIMIZED mode
>>>>>>>>> --     . running with 12 MPI processes
>>>>>>>>> --     . using Trilinos
>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Number of active cells: 512 (on 4 levels)
>>>>>>>>> Number of degrees of freedom: 20381 (14739+729+4913)
>>>>>>>>>
>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>>>    Solving Stokes system... 29 iterations.
>>>>>>>>>
>>>>>>>>> Number of active cells: 1583 (on 5 levels)
>>>>>>>>> Number of degrees of freedom: 63622 (46077+2186+15359)
>>>>>>>>>
>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>>>    Solving Stokes system... 30+4 iterations.
>>>>>>>>>
>>>>>>>>> Number of active cells: 3256 (on 5 levels)
>>>>>>>>> Number of degrees of freedom: 122269 (88647+4073+29549)
>>>>>>>>>
>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>>>    Solving Stokes system... 30+4 iterations.
>>>>>>>>>
>>>>>>>>> Number of active cells: 9010 (on 6 levels)
>>>>>>>>> Number of degrees of freedom: 333145 (241677+10909+80559)
>>>>>>>>>
>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>    Solving temperature system... 0 iterations.
>>>>>>>>>    Rebuilding Stokes preconditioner...
>>>>>>>>>    Solving Stokes system... 30+4 iterations.
>>>>>>>>>
>>>>>>>>>    Postprocessing:
>>>>>>>>>      RMS, max velocity:                  57.6 m/s, 176 m/s
>>>>>>>>>      Temperature min/avg/max:            0 K, 0.5 K, 1 K
>>>>>>>>>      Heat fluxes through boundary parts: 7.682e-07 W, -7.682e-07 W, 1.685e-15 W, 2.362e-15 W, -1 W, 1 W
>>>>>>>>>      Writing graphical output:           /state/partition1/RMOUCHA/output/solution-00000
>>>>>>>>>
>>>>>>>>> *** Timestep 1:  t=8.87115e-05 seconds
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------
>>>>>>>>> Robert Moucha
>>>>>>>>> Assistant Professor of Geophysics
>>>>>>>>> Department of Earth Sciences
>>>>>>>>> 204 Heroy Geology Lab
>>>>>>>>> Syracuse University
>>>>>>>>> Syracuse, NY, 13244-1070
>>>>>>>>> _______________________________________________
>>>>>>>>> Aspect-devel mailing list
>>>>>>>>> Aspect-devel at geodynamics.org
>>>>>>>>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>> -------------- next part --------------
>>>> diff --git a/source/simulator/core.cc b/source/simulator/core.cc
>>>> index 6516ac2..ecf881c 100644
>>>> --- a/source/simulator/core.cc
>>>> +++ b/source/simulator/core.cc
>>>> @@ -636,6 +636,8 @@ namespace aspect
>>>>      gravity_model->update();
>>>>      heating_model->update();
>>>>      adiabatic_conditions->update();
>>>> +
>>>> +    pcout << "   Updated constraints and plugins." << std::endl;
>>>>    }
>>>>
>>>>
>>>> @@ -1532,7 +1534,12 @@ namespace aspect
>>>>            if (parameters.free_surface_enabled)
>>>>              free_surface->execute ();
>>>>
>>>> +          pcout << "   Assemble temperature system." << std::endl;
>>>> +
>>>>            assemble_advection_system (AdvectionField::temperature());
>>>> +
>>>> +          pcout << "   Build temperature preconditioner." << std::endl;
>>>> +
>>>>            build_advection_preconditioner(AdvectionField::temperature(),
>>>>                                           T_preconditioner);
>>>>            solve_advection(AdvectionField::temperature());
>>>>
>>>
>>>
>>>
> 
> 
> 

