[aspect-devel] Solution goes wrong after restart

Siqi Zhang siqi.zhang at mq.edu.au
Thu Jun 4 01:01:27 PDT 2015


Hi Timo,

I finally found a solution to this problem.
The problem lies in the HDF5 output: the simulation is actually correct
after the restart; only the HDF5 output is wrong. During the restart, the
node numbering somehow changes in the HDF5 output even though the mesh
itself does not change (I am not sure why; I guess it may be related to
the redundant-points filter). However, the HDF5 output does not write a
new mesh file if the mesh has not changed, so after the restart the
solution looks like rubbish: it is plotted on the old mesh with the old
numbering. After forcing ASPECT to rewrite the HDF5 mesh file after every
restart, the problem is fixed.
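
For anyone hitting the same thing, here is a minimal sketch of the idea,
written against deal.II 8.2's HDF5/XDMF interface (this is not ASPECT's
actual code; the just_resumed flag and the file names are made up for
illustration):

  #include <deal.II/base/data_out_base.h>
  #include <deal.II/numerics/data_out.h>

  // Sketch: force the HDF5 mesh file to be rewritten on the first output
  // after a restart, so the node numbering in mesh.h5 matches solution.h5.
  template <int dim>
  void write_hdf5_output(dealii::DataOut<dim> &data_out,
                         const bool            mesh_changed,
                         const bool            just_resumed)
  {
    // The duplicate-vertex filter may number nodes differently across
    // runs, so a cached mesh.h5 can get out of sync with solution.h5.
    dealii::DataOutBase::DataOutFilter data_filter(
      dealii::DataOutBase::DataOutFilterFlags(
        /*filter_duplicate_vertices=*/true, /*xdmf_hdf5_output=*/true));
    data_out.write_filtered_data(data_filter);

    // The workaround: treat the mesh as changed right after a restart.
    const bool write_mesh = mesh_changed || just_resumed;
    data_out.write_hdf5_parallel(data_filter, write_mesh,
                                 "mesh.h5", "solution.h5", MPI_COMM_WORLD);
  }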

Regards,

Siqi

2015-05-21 14:48 GMT+10:00 Siqi Zhang <siqi.zhang at mq.edu.au>:

> I changed nothing in the VM. I didn't recompile the code; I just used the
> ~/aspect/aspect binary compiled there, which is in DEBUG mode.
> It's so strange that it can produce different results. I tried a few
> times, and got the same wrong results every time.
>
> 2015-05-21 14:41 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>
>> Did you change anything else? Did you update ASPECT? Debug or release
>> mode?
>>
>> On Wed, May 20, 2015 at 7:49 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
>> > Hi Timo,
>> > Thanks for the test.
>> > This is super strange. I got garbage results at step 1 doing exactly
>> > the same thing in the VM.
>> >
>> > Siqi
>> >
>> > 2015-05-21 7:30 GMT+10:00 Timo Heister <heister at clemson.edu>:
>> >>
>> >> Siqi,
>> >>
>> >> so I should be able to (in the VM):
>> >>
>> >> 1. run the .prm with mpirun -n 24
>> >> 2. wait until it ends
>> >> 3. change: set Resume computation = true and set End time = 5e5
>> >> 4. run again with mpirun -n 24, stop after timestep 1
>> >> 5. look at T of solution-00001 and see garbage
>> >>
>> >> right? Because this works just fine for me.
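>> >>
>> >> (For reference, the relevant .prm fragment for steps 3-4 is just:
>> >>
>> >>   set Resume computation = true
>> >>   set End time           = 5e5
>> >>
>> >> with everything else left unchanged from the first run.)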
>> >>
>> >>
>> >> On Tue, May 19, 2015 at 8:46 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
>> >> > Hi Timo,
>> >> >
>> >> > Thanks for your reply. Since it only happens with some settings, I
>> >> > stopped chasing it for a while.
>> >> > The problem still exists at my end, even with the most recent v1.3
>> >> > version. I found this problem can also be reproduced with the virtual
>> >> > machine you created on the ASPECT website (v11 with 24 MPI processes).
>> >> > So I think this must be a bug. (However, it might be a bug inside
>> >> > deal.II or p4est rather than ASPECT.)
>> >> > And I found some additional information regarding this problem. I
>> >> > managed to short-wire the dof_indices into the output, and found that
>> >> > the indices of some processes changed during the restart (for a
>> >> > 24-process restart test, the indices of processes 0, 2, 4, 6, 8, ...
>> >> > are OK, while those of processes 1, 3, 5, 7, ... changed). I guess
>> >> > they should stay the same for the restart to succeed.
>> >> > It seems this problem is caused by the node numbering changing during
>> >> > the restart rather than by the solution vector not being stored
>> >> > properly.
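>> >> >
>> >> > A minimal sketch of how one can dump those indices for diffing across
>> >> > the two runs (assuming an MPI-enabled deal.II build and an already
>> >> > set-up dof_handler; the output file name is made up):
>> >> >
>> >> >   #include <deal.II/base/mpi.h>
>> >> >   #include <deal.II/base/utilities.h>
>> >> >   #include <fstream>
>> >> >
>> >> >   // Write one file per MPI rank; diff them between the original run
>> >> >   // and the restarted run to see which ranks' indices changed.
>> >> >   const unsigned int rank =
>> >> >     dealii::Utilities::MPI::this_mpi_process(MPI_COMM_WORLD);
>> >> >   std::ofstream out("dof_indices." +
>> >> >                     dealii::Utilities::int_to_string(rank, 4));
>> >> >   dof_handler.locally_owned_dofs().print(out);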
>> >> >
>> >> > I attached the prm file again. I just start and restart with
>> >> > "End time = 0".
>> >> > I hope this will help you reproduce it and figure out what goes wrong.
>> >> >
>> >> > Regards,
>> >> >
>> >> > Siqi
>> >> >
>> >> > 2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:
>> >> >>
>> >> >> Hey Siqi,
>> >> >>
>> >> >> I cannot reproduce it on my workstation:
>> >> >> - changed end time to 0, resume=false
>> >> >> - ran with mpirun -n 24
>> >> >> - waited until it stopped
>> >> >> - set end time to 2e5, resume=true
>> >> >> - ran with mpirun -n 24
>> >> >> - output/solution-00001 looks fine
>> >> >>
>> >> >> Sorry, I have no idea what is going on, and I don't think that this
>> >> >> is a configuration problem (because you experience it on different
>> >> >> machines).
>> >> >>
>> >> >> On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
>> >> >> wrote:
>> >> >> > Hi Timo,
>> >> >> >
>> >> >> > I tried to troubleshoot this problem and still have no clue. This
>> >> >> > thing just drives me crazy.
>> >> >> > Disabling/enabling MPI_IO in the p4est build doesn't change the
>> >> >> > result, and reverting the p4est version from 1.1 to 0.3.4.2 doesn't
>> >> >> > change it either. I also tried the development version of deal.II;
>> >> >> > the problem still exists.
>> >> >> >
>> >> >> > After setting "End time = 0" while keeping the refinement
>> >> >> > settings, and restarting with the same settings:
>> >> >> > The problem seems repeatable with 24 processors (across 2 nodes).
>> >> >> > The problem seems repeatable with 24 processors (on 1 node).
>> >> >> > The problem disappears with 12 processors (across 2 nodes).
>> >> >> >
>> >> >> > The problem also disappears after removing the initial refinement
>> >> >> > (predefined refinement levels depending on depth); I guess the grid
>> >> >> > needs to be complex enough for this to happen.
>> >> >> >
>> >> >> > The problem is not so random here: for a given prm file with a
>> >> >> > given number of processors, it seems to always happen, but it may
>> >> >> > disappear when the prm file or the processor count changes.
>> >> >> > I have also encountered a similar problem on other machines (same
>> >> >> > package versions, built manually with different compilers,
>> >> >> > different MPI versions, and different file systems): the
>> >> >> > supercomputer NCI_Raijin (OpenMPI 1.6.3, Intel compiler 12.1.9.293,
>> >> >> > Lustre file system) and our single-node machine (OpenMPI 1.8.4,
>> >> >> > Intel compiler 15.0.0, local disk).
>> >> >> > It is also strange that I have never encountered similar problems
>> >> >> > with some large 3D models running on NCI_Raijin (using more than
>> >> >> > 200 processors, restarted quite a few times, and using similar mesh
>> >> >> > refinements).
>> >> >> >
>> >> >> > The simulation runs fine; just the checkpoint gets corrupted. So I
>> >> >> > guess it happens when saving/loading the distributed triangulation.
>> >> >> > And the grid looks fine to me; just some of the solution seems to
>> >> >> > be in the wrong place after the restart.
>> >> >> >
>> >> >> > So could you verify whether the prm file restarts fine on some of
>> >> >> > your machines? If it works fine, could you send me some information
>> >> >> > on your deal.II/p4est/MPI package versions? If it is something I
>> >> >> > did wrong while building those packages, do you have any clue what
>> >> >> > could lead to this problem?
>> >> >> >
>> >> >> > Thanks and regards,
>> >> >> >
>> >> >> > Siqi
>> >> >> >
>> >> >> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>> >> >> >>
>> >> >> >> Any news on this issue, Siqi?
>> >> >> >>
>> >> >> >> Can you experiment with the problem to find out when it happens?
>> >> >> >> How many processors do you need to see the problem? How often does
>> >> >> >> it occur? Can you maybe simplify the .prm to do one checkpoint
>> >> >> >> after timestep 1 and then end, to check if that is enough?
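>> >> >> >>
>> >> >> >> Something like this in the .prm should do it (I am writing the
>> >> >> >> parameter names from memory, so double-check them against the
>> >> >> >> manual):
>> >> >> >>
>> >> >> >>   subsection Checkpointing
>> >> >> >>     set Steps between checkpoint = 1
>> >> >> >>   end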
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
>> >> >> >> > Hi Timo,
>> >> >> >> >
>> >> >> >> > Thanks for your reply.
>> >> >> >> > The file system I am using for output in the previous tests on
>> >> >> >> > our in-house cluster is just a remotely mounted drive, not a
>> >> >> >> > distributed file system. However, a different test on another
>> >> >> >> > Australian supercomputer, NCI_Raijin, which uses a Lustre file
>> >> >> >> > system, also produces a similar problem.
>> >> >> >> >
>> >> >> >> > My current p4est setup should have MPI_IO enabled; I will try
>> >> >> >> > to disable it and see if it changes the story.
>> >> >> >> >
>> >> >> >> > Regards,
>> >> >> >> >
>> >> >> >> > Siqi
>> >> >> >> >
>> >> >> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>> >> >> >> >>
>> >> >> >> >> Hey Siqi,
>> >> >> >> >>
>> >> >> >> >> I wonder if this could be related to the filesystem you are
>> >> >> >> >> writing the files to. If p4est is not using MPI_IO, it uses
>> >> >> >> >> POSIX I/O in the hope that this works reliably.
>> >> >> >> >>
>> >> >> >> >> Do you know what kind of filesystem your output directory is on?
>> >> >> >> >> Is there a different filesystem you can try?
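>> >> >> >> >>
>> >> >> >> >> If you want to flip that explicitly, I believe the relevant
>> >> >> >> >> p4est configure switch is --enable-mpiio (worth double-checking
>> >> >> >> >> against your p4est version), e.g.:
>> >> >> >> >>
>> >> >> >> >>   ./configure --enable-mpi --enable-mpiio CC=mpicc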
>> >> >> >> >>
>> >> >> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang
>> >> >> >> >> <siqi.zhang at mq.edu.au>
>> >> >> >> >> wrote:
>> >> >> >> >> > Hi all,
>> >> >> >> >> >
>> >> >> >> >> > Recently, I have encountered an issue while trying to
>> >> >> >> >> > restart from a checkpoint. It seems the solution is misplaced
>> >> >> >> >> > during the restart (see the two figures). However, the problem
>> >> >> >> >> > is not always repeatable (sometimes it restarts fine), and it
>> >> >> >> >> > might be related to something wrong in deal.II or p4est.
>> >> >> >> >> >
>> >> >> >> >> > The versions I used to build ASPECT are:
>> >> >> >> >> > deal.II    8.2.1
>> >> >> >> >> > p4est      0.3.4.2 (encountered a similar problem on 1.1 as well)
>> >> >> >> >> > Trilinos   11.12.1
>> >> >> >> >> > MPI        OpenMPI 1.8.3 (with gcc 4.4.7)
>> >> >> >> >> > and I am using the most recent development version of ASPECT
>> >> >> >> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
>> >> >> >> >> >
>> >> >> >> >> > I reproduced the problem with the attached prm file (using 2
>> >> >> >> >> > nodes, 24 processors total). I wonder if any of you would like
>> >> >> >> >> > to give it a try, to see whether it is a bug or just a bad
>> >> >> >> >> > installation on my machine?
>> >> >> >> >> >
>> >> >> >> >> > Regards,
>> >> >> >> >> >
>> >> >> >> >> > Siqi
>> >> >> >> >> >



-- 
Siqi Zhang

Research Associate
ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
Department of Earth and Planetary Sciences
Macquarie University
NSW 2109

Telephone: +61 2 9850 4727
http://www.CCFS.mq.edu.au
http://www.GEMOC.mq.edu.au