[aspect-devel] Aspect-devel Digest, Vol 45, Issue 8

Cox, Samuel P. spc29 at leicester.ac.uk
Tue Feb 9 02:40:23 PST 2016


Hi all,

Sorry to dredge up a fairly old conversation, but I ran into the same problem when running ASPECT on our cluster: it runs fine on a single node, but hangs on visualization output when multiple nodes are used. I tried the fixes suggested below. Nothing I did to TMP or TMPDIR seemed to make a difference, but changing to 'set Number of grouped files = 1' does fix it. I am not too distressed by having everything output to a single file, but was wondering whether anybody has come up with a better understanding of the problem or a better fix?
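
For reference, the workaround discussed in the quoted thread would look roughly like the fragment below in the .prm file. This is only a sketch: it assumes the usual nesting of the Visualization subsection inside Postprocess, and everything else in the input file (including the 'set List of postprocessors' line) is assumed to stay as it is in the cookbook. The commented-out line is the alternative Rene mentions in his PS below; whether it also avoids the hang is not reported in the thread.

```
subsection Postprocess
  subsection Visualization
    # Workaround from the thread: write one file per timestep via MPI-IO
    # instead of one file per process.
    set Number of grouped files = 1

    # Alternative suggested in the quoted thread (untested here):
    # set Output format = hdf5
  end
end
```

As noted in the quoted reply, this relies on MPI-IO being available on the system, which is why it is not the default.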

Sam

> On 7 Aug 2015, at 20:00, aspect-devel-request at geodynamics.org wrote:
> 
> Send Aspect-devel mailing list submissions to
> 	aspect-devel at geodynamics.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
> or, via email, send a message with subject or body 'help' to
> 	aspect-devel-request at geodynamics.org
> 
> You can reach the person managing the list at
> 	aspect-devel-owner at geodynamics.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Aspect-devel digest..."
> 
> 
> Today's Topics:
> 
>   1. Re: convection-box-3d example hangs when more than one node
>      is used (Rene Gassmoeller) (Rene Gassmoeller)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Fri, 07 Aug 2015 20:52:43 +0200
> From: Rene Gassmoeller <rengas at gfz-potsdam.de>
> To: aspect-devel at geodynamics.org
> Subject: Re: [aspect-devel] convection-box-3d example hangs when more
> 	than one node is used (Rene Gassmoeller)
> Message-ID: <55C4FE7B.4070604 at gfz-potsdam.de>
> Content-Type: text/plain; charset=utf-8
> 
> Actually, currently the code looks for $TMPDIR not $TMP, but setting the
> export is still worth a try. In particular we could find out if the
> temporary directory is causing the issue or the background writing thread.
> 
> Maybe we could make the temporary directory an input parameter, and only
> try to use a temporary directory when either $TMP or $TMPDIR is set, or
> the input parameter is set, or /tmp is available (in the order: parameter,
> then shell variable, then /tmp). On the other hand we already check if the
> folder is available before writing, and if it is not available it should
> simply write the file directly to the final location. But somehow this
> seems to crash in Rob's case.
> 
> 
> On 08/07/2015 08:44 PM, Timo Heister wrote:
>> I was just about to write the same suggestions, Rene. :-)
>> 
>> It could be that $TMP is not set up correctly on one of the nodes or
>> the disk is full. Not sure how we can make this more robust from
>> inside ASPECT.
>> 
>> Rob, another thing to try would be to "export TMP=~/mytmp" with some
>> directory that you can write into. Do this in the launch script before
>> mpirun.
>> 
>> 
>> 
>> 
>> On Fri, Aug 7, 2015 at 1:39 PM, Rene Gassmoeller <rengas at gfz-potsdam.de> wrote:
>>> Hi Rob,
>>> great, then we know where to look. Could you add
>>> 
>>> 'set Number of grouped files = 1'
>>> 
>>> in the Visualization subsection and re-enable the plugin? With this
>>> setting all output will be written via MPI-IO into one file per timestep.
>>> If this works, something with writing one file per process in a
>>> background thread in a temporary folder is not working. If it does not
>>> work we will have to look deeper into deal.II.
>>> 
>>> If this works it is also a great (semi-)permanent workaround for your
>>> problem, since you usually do not need one output file per process (we
>>> just cannot rely on MPI-IO being available on all systems ASPECT is
>>> running on, therefore it is not the default option).
>>> 
>>> Best,
>>> Rene
>>> 
>>> PS: Alternatively you could try the hdf5 output format instead of vtu.
>>> ('set Output format = hdf5' in the Visualization subsection).
>>> 
>>> 
>>> On 08/07/2015 05:17 PM, Robert Moucha wrote:
>>>> Rene, the cause of the hang is the visualization plugin. When I
>>>> removed it, I could run on more than one node to completion.
>>>> 
>>>> Rob
>>>> 
>>>> On Mon, Aug 3, 2015 at 3:00 PM,  <aspect-devel-request at geodynamics.org> wrote:
>>>>> 
>>>>> Today's Topics:
>>>>> 
>>>>>   1. Re: convection-box-3d example hangs when more than one node
>>>>>      is used (Rene Gassmoeller)
>>>>> 
>>>>> 
>>>>> ----------------------------------------------------------------------
>>>>> 
>>>>> Message: 1
>>>>> Date: Mon, 03 Aug 2015 14:20:20 +0200
>>>>> From: Rene Gassmoeller <rengas at gfz-potsdam.de>
>>>>> To: aspect-devel at geodynamics.org
>>>>> Subject: Re: [aspect-devel] convection-box-3d example hangs when more
>>>>>        than one node is used
>>>>> Message-ID: <55BF5C84.6010908 at gfz-potsdam.de>
>>>>> Content-Type: text/plain; charset="utf-8"
>>>>> 
>>>>> Hi Rob,
>>>>> I tried to reproduce your problem with the same trilinos, p4est, deal.II
>>>>> and aspect versions (older mpi and newer gcc though), but unfortunately
>>>>> without success. It seems we need to do some tests on your machine to
>>>>> find the problem. Could you as a first step disable the visualization
>>>>> output plugin by removing 'visualization' from the line 'set List of
>>>>> postprocessors' (just to be sure it is nothing from the last timestep
>>>>> ... it is written in a background thread and might cause a delayed crash).
>>>>> Next we need to figure out what is happening between the last output
>>>>> and the next expected output. From what I see the last output you get is
>>>>> the start of the new timestep. Between there and the next message
>>>>> ("Solving Temperature System") only the boundary conditions and user
>>>>> plugins get updated and the temperature system and temperature
>>>>> preconditioner are assembled. I attached a patch with additional debug
>>>>> output for core.cc. When you apply the patch ('patch -p1 < patch' in
>>>>> your aspect folder) and rebuild aspect you should see additional output
>>>>> after the last line. Could you check which lines get printed in Timestep
>>>>> 1? After that we can think about what is causing the issue.
>>>>> 
>>>>> Best
>>>>> Rene
>>>>> 
>>>>> On 07/31/2015 10:51 PM, Robert Moucha wrote:
>>>>>> Finally got a chance to get back to this after some travel:
>>>>>> 
>>>>>> I tried the step-32 example of deal.II and it runs without a problem on
>>>>>> more than one compute node.
>>>>>> 
>>>>>> ASPECT is the default 1.3 version, the problem occurs with both debug
>>>>>> and release versions.
>>>>>> 
>>>>>> To compile ASPECT I'm using:
>>>>>> 
>>>>>> gcc 4.4.7
>>>>>> BLAS and LAPACK 3.2.1-4
>>>>>> OpenMPI 1.8.4
>>>>>> Trilinos 12.0.1 with CXX11=OFF
>>>>>> p4est 1.1
>>>>>> phdf5 1.8.15
>>>>>> deal.II 8.2.1
>>>>>> 
>>>>>> no problems with compiling these as far as I could tell
>>>>>> 
>>>>>> When ASPECT hangs, right after initial time step, all the files
>>>>>> solution-00000 for all processors are written without issue as well as
>>>>>> other files, then it just hangs.
>>>>>> 
>>>>>> Thanks,
>>>>>> Rob
>>>>>> 
>>>>>>> Message: 1
>>>>>>> Date: Sun, 19 Jul 2015 12:25:14 +0200
>>>>>>> From: Rene Gassmoeller <rengas at gfz-potsdam.de>
>>>>>>> To: aspect-devel at geodynamics.org
>>>>>>> Subject: Re: [aspect-devel] convection-box-3d example hangs when more
>>>>>>>        than one node is used
>>>>>>> Message-ID: <55AB7B0A.8040503 at gfz-potsdam.de>
>>>>>>> Content-Type: text/plain; charset=utf-8
>>>>>>> 
>>>>>>> Hi Rob,
>>>>>>> I just checked the convection-box-3d on 2 nodes of our cluster and it
>>>>>>> runs fine. So it seems there is something special about your
>>>>>>> installation or there is a bug that is only showing in this
>>>>>>> configuration. Could you test the following things for us to be able to
>>>>>>> give you some more help:
>>>>>>> 
>>>>>>> 1. Try running convection-box with an ASPECT compiled in debug mode.
>>>>>>> Maybe there is some error message suppressed by the release mode.
>>>>>>> 2. Could you try to compile the deal.II example step-32 and run that one
>>>>>>> on more than one node of your cluster? It should be in your deal.II
>>>>>>> folder /examples/step-32, and compiling should be as simple as 'cmake .
>>>>>>> && make'. This will give us some insight if something in aspect is
>>>>>>> causing the problem or if it is an issue with the deal.II code or
>>>>>>> configuration.
>>>>>>> 3. We need some more information on your deal.II configuration (your
>>>>>>> ASPECT is an unchanged 1.3, right?). Which version of deal.II are you
>>>>>>> using? Which trilinos, p4est and compiler? Were there any problems
>>>>>>> during compiling those?
>>>>>>> 
>>>>>>> Best,
>>>>>>> Rene
>>>>>>> 
>>>>>>> On 07/18/2015 12:23 AM, Robert Moucha wrote:
>>>>>>>> Hi Timo,
>>>>>>>> 
>>>>>>>> Yes, I still have the same problem. It occurs with the following
>>>>>>>> cookbooks (I have not tried all, but it looks like anything to do with
>>>>>>>> time stepping is causing the hang):
>>>>>>>> 
>>>>>>>> convection-box
>>>>>>>> convection-box-3d
>>>>>>>> shell_simple_2d
>>>>>>>> van-keken-discontinuous
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Rob
>>>>>>>> 
>>>>>>>>> Hey Robert,
>>>>>>>>> 
>>>>>>>>> sorry for only getting back to this now. Any update on your problem?
>>>>>>>>> Does this happen with every .prm file (like a simple 2d problem)?
>>>>>>>>> 
>>>>>>>>> On Sun, Jul 5, 2015 at 6:10 PM, Robert Moucha <rmoucha at gmail.com> wrote:
>>>>>>>>>> OK, it appears that I solved last week's issue with the files; it turns
>>>>>>>>>> out one of the nodes did not have the correct paths (thanks).
>>>>>>>>>> 
>>>>>>>>>> However, I am still having problems when using more than one node:
>>>>>>>>>> this time ASPECT just hangs on time step 1 with no error. The
>>>>>>>>>> solution-00000 files are created on each of the nodes, then nothing.
>>>>>>>>>> 
>>>>>>>>>> It runs fine on a single node. I should point out that the ASPECT
>>>>>>>>>> example stokes.prm as well as CitcomS run on the cluster without
>>>>>>>>>> issues.
>>>>>>>>>> 
>>>>>>>>>> Here is the log.txt for the convection-box-3d.prm -- thanks in advance Rob
>>>>>>>>>> 
>>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>>> -- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
>>>>>>>>>> --     . version 1.3
>>>>>>>>>> --     . running in OPTIMIZED mode
>>>>>>>>>> --     . running with 12 MPI processes
>>>>>>>>>> --     . using Trilinos
>>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> Number of active cells: 512 (on 4 levels)
>>>>>>>>>> Number of degrees of freedom: 20381 (14739+729+4913)
>>>>>>>>>> 
>>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>>   Solving temperature system... 0 iterations.
>>>>>>>>>>   Rebuilding Stokes preconditioner...
>>>>>>>>>>   Solving Stokes system... 29 iterations.
>>>>>>>>>> 
>>>>>>>>>> Number of active cells: 1583 (on 5 levels)
>>>>>>>>>> Number of degrees of freedom: 63622 (46077+2186+15359)
>>>>>>>>>> 
>>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>>   Solving temperature system... 0 iterations.
>>>>>>>>>>   Rebuilding Stokes preconditioner...
>>>>>>>>>>   Solving Stokes system... 30+4 iterations.
>>>>>>>>>> 
>>>>>>>>>> Number of active cells: 3256 (on 5 levels)
>>>>>>>>>> Number of degrees of freedom: 122269 (88647+4073+29549)
>>>>>>>>>> 
>>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>>   Solving temperature system... 0 iterations.
>>>>>>>>>>   Rebuilding Stokes preconditioner...
>>>>>>>>>>   Solving Stokes system... 30+4 iterations.
>>>>>>>>>> 
>>>>>>>>>> Number of active cells: 9010 (on 6 levels)
>>>>>>>>>> Number of degrees of freedom: 333145 (241677+10909+80559)
>>>>>>>>>> 
>>>>>>>>>> *** Timestep 0:  t=0 seconds
>>>>>>>>>>   Solving temperature system... 0 iterations.
>>>>>>>>>>   Rebuilding Stokes preconditioner...
>>>>>>>>>>   Solving Stokes system... 30+4 iterations.
>>>>>>>>>> 
>>>>>>>>>>   Postprocessing:
>>>>>>>>>>     RMS, max velocity:                  57.6 m/s, 176 m/s
>>>>>>>>>>     Temperature min/avg/max:            0 K, 0.5 K, 1 K
>>>>>>>>>>     Heat fluxes through boundary parts: 7.682e-07 W, -7.682e-07 W, 1.685e-15 W, 2.362e-15 W, -1 W, 1 W
>>>>>>>>>>     Writing graphical output:           /state/partition1/RMOUCHA/output/solution-00000
>>>>>>>>>> 
>>>>>>>>>> *** Timestep 1:  t=8.87115e-05 seconds
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ------------------------------------------------------------
>>>>>>>>>> Robert Moucha
>>>>>>>>>> Assistant Professor of Geophysics
>>>>>>>>>> Department of Earth Sciences
>>>>>>>>>> 204 Heroy Geology Lab
>>>>>>>>>> Syracuse University
>>>>>>>>>> Syracuse, NY, 13244-1070
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Aspect-devel mailing list
>>>>>>>>>> Aspect-devel at geodynamics.org
>>>>>>>>>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ------------------------------
>>>>>>>>> 
>>>>>>>>> Subject: Digest Footer
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Aspect-devel mailing list
>>>>>>>>> Aspect-devel at geodynamics.org
>>>>>>>>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>>>>>>>>> 
>>>>>>>>> ------------------------------
>>>>>>>>> 
>>>>>>>>> End of Aspect-devel Digest, Vol 44, Issue 9
>>>>>>>>> *******************************************
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------------
>>>>>>> 
>>>>>>> End of Aspect-devel Digest, Vol 44, Issue 11
>>>>>>> ********************************************
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> -------------- next part --------------
>>>>> diff --git a/source/simulator/core.cc b/source/simulator/core.cc
>>>>> index 6516ac2..ecf881c 100644
>>>>> --- a/source/simulator/core.cc
>>>>> +++ b/source/simulator/core.cc
>>>>> @@ -636,6 +636,8 @@ namespace aspect
>>>>>     gravity_model->update();
>>>>>     heating_model->update();
>>>>>     adiabatic_conditions->update();
>>>>> +
>>>>> +    pcout << "   Updated constraints and plugins." << std::endl;
>>>>>   }
>>>>> 
>>>>> 
>>>>> @@ -1532,7 +1534,12 @@ namespace aspect
>>>>>           if (parameters.free_surface_enabled)
>>>>>             free_surface->execute ();
>>>>> 
>>>>> +          pcout << "   Assemble temperature system." << std::endl;
>>>>> +
>>>>>           assemble_advection_system (AdvectionField::temperature());
>>>>> +
>>>>> +          pcout << "   Build temperature preconditioner." << std::endl;
>>>>> +
>>>>>           build_advection_preconditioner(AdvectionField::temperature(),
>>>>>                                          T_preconditioner);
>>>>>           solve_advection(AdvectionField::temperature());
>>>>> 
>>>>> ------------------------------
>>>>> 
>>>>> End of Aspect-devel Digest, Vol 45, Issue 2
>>>>> *******************************************
>>>> 
>>>> 
>>>> 
>> 
>> 
>> 
> 
> 
> ------------------------------
> 
> End of Aspect-devel Digest, Vol 45, Issue 8
> *******************************************


