[aspect-devel] Solution goes wrong after restart

Siqi Zhang siqi.zhang at mq.edu.au
Mon Jun 8 19:26:55 PDT 2015


Thanks Timo. I am glad we finally got this fixed.

2015-06-07 19:05 GMT+10:00 Timo Heister <timo.heister at gmail.com>:

> Thanks Siqi. Of course I switched output to vtu before testing and
> therefore couldn't see the problem. Sorry about that. :-(
>
> See https://github.com/geodynamics/aspect/pull/525
>
> On Thu, Jun 4, 2015 at 4:01 AM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> > Hi Timo,
> >
> > I finally found a solution for this problem.
> > The problem lies in the HDF5 output. The simulation actually restarts
> > correctly; only the HDF5 output is wrong.
> > During the restart, the node numbering in the HDF5 output somehow changes
> > although the mesh itself does not change (not sure why; I guess it may be
> > related to the redundant points filter). However, the HDF5 output does not
> > write a new mesh file if the mesh has not changed. After the restart, the
> > solution looks like rubbish because it is displayed on the old mesh file
> > with the old numbering.
> > After forcing ASPECT to rewrite the HDF5 mesh file after every restart, the
> > problem is fixed.
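A minimal sketch of the failure mode described above, using made-up toy data rather than any ASPECT or HDF5 calls: the solution values themselves are correct, but pairing them with a mesh file written under a different node numbering puts them at the wrong coordinates. The permutation `perm` and all names here are hypothetical.

```python
# Toy 1D "mesh": 4 nodes on a line, temperature increasing with x.
old_coords = [0.0, 1.0, 2.0, 3.0]      # mesh file written before the restart
old_values = [10.0, 20.0, 30.0, 40.0]  # solution in the old numbering

# Assumed renumbering after restart: new node i is old node perm[i].
perm = [2, 0, 3, 1]
new_coords = [old_coords[p] for p in perm]
new_values = [old_values[p] for p in perm]


def as_points(coords, values):
    """Pair each value with a coordinate and sort by x, as a viewer would."""
    return sorted(zip(coords, values))


# New solution read against the stale mesh file (old numbering).
stale = as_points(old_coords, new_values)
# New solution read against a mesh file rewritten after the restart.
fresh = as_points(new_coords, new_values)

print("stale mesh:", stale)
print("fresh mesh:", fresh)
```

With the stale mesh the values land on the wrong nodes even though nothing in the solution vector is wrong, which is consistent with rewriting the mesh file after a restart making the output look right again.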
> >
> > Regards,
> >
> > Siqi
> >
> > 2015-05-21 14:48 GMT+10:00 Siqi Zhang <siqi.zhang at mq.edu.au>:
> >>
> >> I changed nothing in the VM. I didn't recompile the code; I just used the
> >> ~/aspect/aspect compiled there, which is in DEBUG mode.
> >> It's so strange that it can produce different results. I tried a few times
> >> and got the same wrong results every time.
> >>
> >> 2015-05-21 14:41 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >>>
> >>> did you change anything else? Did you update ASPECT? Debug or release
> >>> mode?
> >>>
> >>> On Wed, May 20, 2015 at 7:49 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >>> wrote:
> >>> > Hi Timo,
> >>> > Thanks for the test.
> >>> > This is super strange. I got garbage results at step 1 when doing
> >>> > exactly the same thing with the VM.
> >>> >
> >>> > Siqi
> >>> >
> >>> > 2015-05-21 7:30 GMT+10:00 Timo Heister <heister at clemson.edu>:
> >>> >>
> >>> >> Siqi,
> >>> >>
> >>> >> so I should be able to (in the vm):
> >>> >>
> >>> >> 1. run the .prm with mpirun -n 24
> >>> >> 2. wait until it ends
> >>> >> 3. change: set Resume computation = true and set End time = 5e5
> >>> >> 4. run again with mpirun -n 24, stop after timestep 1
> >>> >> 5. look at T of solution-00001 and see garbage
> >>> >>
> >>> >> right? Because this works just fine for me.
> >>> >>
> >>> >>
> >>> >> On Tue, May 19, 2015 at 8:46 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >>> >> wrote:
> >>> >> > Hi Timo,
> >>> >> >
> >>> >> > Thanks for your reply. Since it only happens with some settings, I
> >>> >> > stopped chasing it for a while.
> >>> >> > The problem still exists at my end, even with the most recent v1.3
> >>> >> > version. I found that it can also be reproduced with the virtual
> >>> >> > machine you created on the ASPECT website (v11 with 24 MPI
> >>> >> > processes), so I think this must be a bug. (However, it might be a
> >>> >> > bug inside deal.II or p4est rather than ASPECT.)
> >>> >> > I also found some additional information regarding this problem. I
> >>> >> > managed to short-wire the dof_indices into the output. I found that
> >>> >> > the indices of some processes changed during the restart (for a
> >>> >> > 24-process restart test, the indices of processes 0, 2, 4, 6, 8, ...
> >>> >> > are OK, and those of processes 1, 3, 5, 7, ... have changed). I guess
> >>> >> > they should stay the same for the restart to succeed.
> >>> >> > It seems this problem is caused by node numbering changes during the
> >>> >> > restart rather than by the solution vector not being stored properly.
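The per-process comparison described above can be sketched as follows. This is only an illustration with made-up index lists for 4 ranks; how the indices are dumped from a real run is not shown, and `changed_ranks` is a hypothetical helper, not an ASPECT or deal.II function.

```python
def changed_ranks(before, after):
    """Return the sorted ranks whose index lists differ between two runs."""
    return sorted(rank for rank in before if before[rank] != after.get(rank))


# Made-up dof indices per MPI rank, before and after a restart.
before = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
after = {0: [0, 1, 2], 1: [5, 3, 4], 2: [6, 7, 8], 3: [11, 9, 10]}

print("ranks with changed indices:", changed_ranks(before, after))
```

In this toy data only the odd ranks are reported, mirroring the pattern seen in the 24-process test.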
> >>> >> >
> >>> >> > I attached the prm file again. I just start and restart with "end
> >>> >> > time = 0".
> >>> >> > I hope this will help you reproduce it and figure out what goes
> >>> >> > wrong.
> >>> >> >
> >>> >> > Regards,
> >>> >> >
> >>> >> > Siqi
> >>> >> >
> >>> >> > 2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:
> >>> >> >>
> >>> >> >> Hey Siqi,
> >>> >> >>
> >>> >> >> I cannot reproduce it on my workstation:
> >>> >> >> - changed end time to 0, resume=false
> >>> >> >> - ran with mpirun -n 24
> >>> >> >> - waited until it stopped
> >>> >> >> - set end time to 2e5, resume=true
> >>> >> >> - ran with mpirun -n 24
> >>> >> >> - output/solution-00001 looks fine
> >>> >> >>
> >>> >> >> Sorry, I have no idea what is going on, and I don't think that this
> >>> >> >> is a configuration problem (because you experience this on different
> >>> >> >> machines).
> >>> >> >>
> >>> >> >> On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >>> >> >> wrote:
> >>> >> >> > Hi Timo,
> >>> >> >> >
> >>> >> >> > I tried to troubleshoot this problem, but I still have no clue.
> >>> >> >> > This thing just drives me crazy.
> >>> >> >> > Disabling/enabling MPI_IO in the p4est build doesn't change the
> >>> >> >> > result, and reverting p4est from version 1.1 to 0.3.4.2 doesn't
> >>> >> >> > change it either. I also tried the development version of
> >>> >> >> > deal.II; the problem still exists.
> >>> >> >> >
> >>> >> >> > After setting "end time = 0" while keeping the refinement
> >>> >> >> > settings, and restarting with the same settings:
> >>> >> >> > The problem seems repeatable with 24 processors (across 2 nodes).
> >>> >> >> > The problem seems repeatable with 24 processors (on 1 node).
> >>> >> >> > The problem disappears with 12 processors (across 2 nodes).
> >>> >> >> >
> >>> >> >> > The problem disappears after removing the initial refinement
> >>> >> >> > (predefined refinement levels that depend on depth). I guess the
> >>> >> >> > grid needs to be complex enough for this to happen.
> >>> >> >> >
> >>> >> >> > The problem is not so random here. For a certain prm with a
> >>> >> >> > certain number of processors, the problem seems to happen every
> >>> >> >> > time. But it may disappear when changing the prm file and the
> >>> >> >> > number of processors.
> >>> >> >> > I also encountered a similar problem on other machines (using the
> >>> >> >> > same package settings, built manually with a different compiler,
> >>> >> >> > a different MPI version, and a different file system): the
> >>> >> >> > supercomputer NCI_Raijin (OpenMPI 1.6.3, Intel compiler
> >>> >> >> > 12.1.9.293, Lustre file system) and our single-node machine
> >>> >> >> > (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
> >>> >> >> > It is also strange that I never encountered similar problems with
> >>> >> >> > some large 3D models running on NCI_Raijin (using more than 200
> >>> >> >> > processors, restarted quite a few times, and using similar mesh
> >>> >> >> > refinements).
> >>> >> >> >
> >>> >> >> > The simulation runs fine; just the checkpoint gets corrupted. So I
> >>> >> >> > guess it happens when saving/loading the distributed
> >>> >> >> > triangulation. The grid seems fine to me; just some of the
> >>> >> >> > solution seems to be in the wrong place at restart.
> >>> >> >> >
> >>> >> >> > So could you verify whether the prm file restarts fine on some of
> >>> >> >> > your machines?
> >>> >> >> > If it works fine, could you send me some information on the
> >>> >> >> > package versions of deal.II/p4est/MPI? If it is something I did
> >>> >> >> > wrong while building those packages, do you have any clue what
> >>> >> >> > could lead to this problem?
> >>> >> >> >
> >>> >> >> > Thanks and regards,
> >>> >> >> >
> >>> >> >> > Siqi
> >>> >> >> >
> >>> >> >> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >>> >> >> >>
> >>> >> >> >> Any news on this issue, Siqi?
> >>> >> >> >>
> >>> >> >> >> Can you experiment with the problem to find out when it happens?
> >>> >> >> >> How many processors do you need to see the problem? How often
> >>> >> >> >> does it occur? Can you maybe simplify the .prm to do one
> >>> >> >> >> checkpoint after timestep 1 and end, to check if that is enough?
> >>> >> >> >>
> >>> >> >> >>
> >>> >> >> >>
> >>> >> >> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang
> >>> >> >> >> <siqi.zhang at mq.edu.au>
> >>> >> >> >> wrote:
> >>> >> >> >> > Hi Timo,
> >>> >> >> >> >
> >>> >> >> >> > Thanks for your reply.
> >>> >> >> >> > The file system I used for output in the previous test on our
> >>> >> >> >> > in-house cluster is just a remotely mounted drive, not a
> >>> >> >> >> > distributed file system.
> >>> >> >> >> > However, a different test on another Australian supercomputer,
> >>> >> >> >> > NCI_Raijin, which uses the Lustre file system, also produces a
> >>> >> >> >> > similar problem.
> >>> >> >> >> >
> >>> >> >> >> > My current p4est setup should have MPI_IO enabled. I will try
> >>> >> >> >> > to disable it to see if it changes the story.
> >>> >> >> >> >
> >>> >> >> >> > Regards,
> >>> >> >> >> >
> >>> >> >> >> > Siqi
> >>> >> >> >> >
> >>> >> >> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister
> >>> >> >> >> > <timo.heister at gmail.com>:
> >>> >> >> >> >>
> >>> >> >> >> >> Hey Siqi,
> >>> >> >> >> >>
> >>> >> >> >> >> I wonder if this could be related to the filesystem you are
> >>> >> >> >> >> writing the files to. If p4est is not using MPI_IO, it uses
> >>> >> >> >> >> POSIX in the hope that this works reliably.
> >>> >> >> >> >>
> >>> >> >> >> >> Do you know what kind of filesystem your output directory is
> >>> >> >> >> >> on? Is there a different filesystem you can try?
> >>> >> >> >> >>
> >>> >> >> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang
> >>> >> >> >> >> <siqi.zhang at mq.edu.au>
> >>> >> >> >> >> wrote:
> >>> >> >> >> >> > Hi all,
> >>> >> >> >> >> >
> >>> >> >> >> >> > Recently, I have encountered an issue while trying to
> >>> >> >> >> >> > restart from a checkpoint. It seems the solution is put in
> >>> >> >> >> >> > the wrong place during the restart (see the two figures).
> >>> >> >> >> >> > However, the problem is not always repeatable (sometimes it
> >>> >> >> >> >> > restarts fine), and it might be related to something wrong
> >>> >> >> >> >> > in deal.II or p4est.
> >>> >> >> >> >> >
> >>> >> >> >> >> > The versions I used to build ASPECT are:
> >>> >> >> >> >> > DEAL.II     8.2.1
> >>> >> >> >> >> > P4EST       0.3.4.2 (encountered a similar problem on 1.1 as well)
> >>> >> >> >> >> > TRILINOS    11.12.1
> >>> >> >> >> >> > MPI         openmpi 1.8.3 (with gcc 4.4.7)
> >>> >> >> >> >> > and I am using the most recent development version of
> >>> >> >> >> >> > aspect (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
> >>> >> >> >> >> >
> >>> >> >> >> >> > I reproduced the problem with the attached prm file (using
> >>> >> >> >> >> > 2 nodes, 24 processors in total). I wonder if any of you
> >>> >> >> >> >> > would like to give it a try to see if it is a bug or just a
> >>> >> >> >> >> > bad installation on my machine?
> >>> >> >> >> >> >
> >>> >> >> >> >> > Regards,
> >>> >> >> >> >> >
> >>> >> >> >> >> > Siqi
> >>> >> >> >> >> >
> >>> >> >> >> >> > --
> >>> >> >> >> >> > Siqi Zhang
> >>> >> >> >> >> >
> >>> >> >> >> >> > Research Associate
> >>> >> >> >> >> > ARC Centre of Excellence for Core to Crust Fluid Systems
> >>> >> >> >> >> > (CCFS)
> >>> >> >> >> >> > Department of Earth and Planetary Sciences
> >>> >> >> >> >> > Macquarie University
> >>> >> >> >> >> > NSW 2109
> >>> >> >> >> >> >
> >>> >> >> >> >> > Telephone: +61 2 9850 4727
> >>> >> >> >> >> > http://www.CCFS.mq.edu.au
> >>> >> >> >> >> > http://www.GEMOC.mq.edu.au
> >>> >> >> >> >> >
> >>> >> >> >> >> > _______________________________________________
> >>> >> >> >> >> > Aspect-devel mailing list
> >>> >> >> >> >> > Aspect-devel at geodynamics.org
> >>> >> >> >> >> > http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
> >>> >> >>
> >>> >> >> --
> >>> >> >> Timo Heister
> >>> >> >> http://www.math.clemson.edu/~heister/
>



-- 
Siqi Zhang

Research Associate
ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
Department of Earth and Planetary Sciences
Macquarie University
NSW 2109

Telephone: +61 2 9850 4727
http://www.CCFS.mq.edu.au
http://www.GEMOC.mq.edu.au