[aspect-devel] Solution goes wrong after restart
Siqi Zhang
siqi.zhang at mq.edu.au
Wed May 20 21:48:27 PDT 2015
I changed nothing in the vm. I didn't recompile the code; I am just using the
~/aspect/aspect binary compiled there, which is in DEBUG mode.
It's strange that it produces different results for the two of us. I tried a
few times and got the same wrong results every time.
2015-05-21 14:41 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> did you change anything else? Did you update ASPECT? Debug or release mode?
>
> On Wed, May 20, 2015 at 7:49 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> > Hi Timo,
> > Thanks for the test.
> > This is super strange. I got garbage results at step 1 when doing exactly
> > the same thing with the vm.
> >
> > Siqi
> >
> > 2015-05-21 7:30 GMT+10:00 Timo Heister <heister at clemson.edu>:
> >>
> >> Siqi,
> >>
> >> so I should be able to (in the vm):
> >>
> >> 1. run the .prm with mpirun -n 24
> >> 2. wait until it ends
> >> 3. change: set Resume computation = true and set
> >> End time = 5e5
> >> 4. run again with mpirun -n 24, stop after timestep 1
> >> 5. look at T of solution-00001 and see garbage
> >>
> >> right? Because this works just fine for me.
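The switch from the first run to the restart run in step 3 above can be scripted so that both of us edit the file the same way. This is only a minimal sketch: the helper names `set_prm_value` and `make_restart_prm` are hypothetical, and it assumes that `Resume computation` and `End time` appear as top-level `set` lines in the .prm (or are absent and can be appended).

```python
import re


def set_prm_value(prm_text: str, name: str, value: str) -> str:
    """Replace the first top-level `set <name> = ...` line, or append one.

    Hypothetical helper for illustration; it does not understand
    subsections and only touches the first matching line.
    """
    pattern = re.compile(
        r"^([ \t]*)set[ \t]+" + re.escape(name) + r"[ \t]*=.*$",
        re.MULTILINE,
    )
    new_line = f"set {name} = {value}"
    if pattern.search(prm_text):
        # Keep the original indentation of the matched line.
        return pattern.sub(lambda m: m.group(1) + new_line, prm_text, count=1)
    return prm_text.rstrip("\n") + "\n" + new_line + "\n"


def make_restart_prm(prm_text: str, end_time: str = "5e5") -> str:
    """Apply the two changes from step 3: resume, and extend the end time."""
    text = set_prm_value(prm_text, "Resume computation", "true")
    return set_prm_value(text, "End time", end_time)


if __name__ == "__main__":
    first_run = "set End time = 0\nset Resume computation = false\n"
    print(make_restart_prm(first_run))
```

The resulting file would then be run with the same `mpirun -n 24` invocation as the first run.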
> >>
> >>
> >> On Tue, May 19, 2015 at 8:46 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> >> > Hi Timo,
> >> >
> >> > Thanks for your reply. Since it only happens with some settings, I stopped
> >> > chasing it for a while.
> >> > The problem still exists at my end, even with the most recent v1.3 version.
> >> > I found that this problem can also be reproduced with the virtual machine
> >> > you created on the aspect website (v11 with 24 MPI processes). So I think
> >> > this must be a bug. (However, it might be a bug inside deal.II or p4est
> >> > rather than aspect.)
> >> > I also found some additional information regarding this problem. I managed
> >> > to hard-wire the dof_indices into the output. I found that the indices of
> >> > some processes changed during the restart (for a 24-process restart test,
> >> > the indices of processes 0,2,4,6,8,... are OK, and those of processes
> >> > 1,3,5,7,... have changed). I guess they should stay the same for the
> >> > restart to succeed.
> >> > It seems this problem is caused by the node numbering changing during the
> >> > restart rather than by the solution vector not being stored properly.
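The per-process comparison described above could be automated along the following lines. This is a sketch under assumptions: the function name `ranks_with_changed_indices` is hypothetical, the dof indices are assumed to have already been collected into one list per MPI rank before and after the restart, and the numbers in the example are made up for illustration, not taken from the actual runs.

```python
def ranks_with_changed_indices(before, after):
    """Return the sorted MPI ranks whose dof index lists differ.

    `before` and `after` map rank -> list of global dof indices,
    recorded before the checkpoint and after the restart. A rank that
    is missing from one of the two mappings counts as changed.
    """
    changed = []
    for rank in sorted(set(before) | set(after)):
        if before.get(rank) != after.get(rank):
            changed.append(rank)
    return changed


if __name__ == "__main__":
    # Illustrative data only: even ranks unchanged, odd ranks reordered.
    before = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
    after = {0: [0, 1, 2], 1: [4, 3, 5], 2: [6, 7, 8], 3: [9, 11, 10]}
    print(ranks_with_changed_indices(before, after))
```

An empty result would mean the numbering survived the restart on every process; a pattern such as only the odd ranks showing up would match what is reported above.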
> >> >
> >> > I attached the prm file again. I just start and restart with "end time = 0".
> >> > Hope this will help you to reproduce it and figure out what goes wrong.
> >> >
> >> > Regards,
> >> >
> >> > Siqi
> >> >
> >> > 2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:
> >> >>
> >> >> Hey Siqi,
> >> >>
> >> >> I can not reproduce it on my workstation:
> >> >> - changed end time to 0, resume=false
> >> >> - ran with mpirun -n 24
> >> >> - waited until it stopped
> >> >> - set end time to 2e5, resume=true
> >> >> - ran with mpirun -n 24
> >> >> - output/solution-00001 looks fine
> >> >>
> >> >> Sorry, I have no idea what is going on and I don't think that this is
> >> >> a configuration problem (because you experience this on different
> >> >> machines).
> >> >>
> >> >> On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >> >> wrote:
> >> >> > Hi Timo,
> >> >> >
> >> >> > I tried to troubleshoot this problem, but still have no clue. This thing
> >> >> > just drives me crazy.
> >> >> > Disabling/enabling MPI_IO in the p4est build doesn't change the result, and
> >> >> > reverting the p4est version from 1.1 to 0.3.4.2 doesn't change it either.
> >> >> > I also tried the development version of deal.II; the problem still exists.
> >> >> >
> >> >> > After setting "end time = 0" while keeping the refinement settings, and
> >> >> > restarting with the same settings:
> >> >> > The problem seems repeatable with 24 processors (across 2 nodes).
> >> >> > The problem seems repeatable with 24 processors (on 1 node).
> >> >> > The problem disappears with 12 processors (across 2 nodes).
> >> >> >
> >> >> > The problem disappears after removing the initial refinement (predefined
> >> >> > refinement levels that depend on depth). I guess the grid needs to be
> >> >> > complex enough for this to happen.
> >> >> >
> >> >> > The problem is not so random here. For a given prm file with a given
> >> >> > number of processors, the problem seems to always happen. But it may
> >> >> > disappear when changing the prm file and the number of processors.
> >> >> > I also encountered a similar problem on other machines (using the same
> >> >> > package settings, manually built with a different compiler, a different
> >> >> > MPI version, and a different file system): the supercomputer NCI_Raijin
> >> >> > (OpenMPI 1.6.3, Intel compiler 12.1.9.293, Lustre file system), and our
> >> >> > single-node machine (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
> >> >> > It is also strange that I never encountered similar problems with some
> >> >> > large 3D models running on NCI_Raijin (using more than 200 processors,
> >> >> > restarted quite a few times, and using similar mesh refinements).
> >> >> >
> >> >> > The simulation runs fine; just the checkpoint gets corrupted. So I guess
> >> >> > it happens when saving/loading the distributed triangulation. The grid
> >> >> > looks fine to me; just some of the solution seems to be in the wrong
> >> >> > place at restart.
> >> >> >
> >> >> > So could you verify whether the prm file restarts fine on some of your
> >> >> > machines? If it works fine, could you send me some information on the
> >> >> > package versions of deal.II/p4est/mpi? If it is something I did wrong
> >> >> > while building those packages, do you have any clue what could lead to
> >> >> > this problem?
> >> >> >
> >> >> > Thanks and regards,
> >> >> >
> >> >> > Siqi
> >> >> >
> >> >> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >> >> >>
> >> >> >> Any news on this issue, Siqi?
> >> >> >>
> >> >> >> Can you experiment with the problem to find out when this problem
> >> >> >> happens? How many processors do you need to see the problem? How often
> >> >> >> does it occur? Can you maybe simplify the .prm to do one check point
> >> >> >> after timestep 1 and end to check if that is enough?
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >> >> >> wrote:
> >> >> >> > Hi Timo,
> >> >> >> >
> >> >> >> > Thanks for your reply.
> >> >> >> > The file system I used for output in the previous test on our in-house
> >> >> >> > cluster is just a remotely mounted drive, not a distributed file
> >> >> >> > system.
> >> >> >> > However, a different test on another Australian supercomputer,
> >> >> >> > NCI_Raijin, which uses the Lustre file system, also produces a similar
> >> >> >> > problem.
> >> >> >> >
> >> >> >> > My current p4est setup should have MPI_IO enabled. I will try to disable
> >> >> >> > it to see if that changes the story.
> >> >> >> >
> >> >> >> > Regards,
> >> >> >> >
> >> >> >> > Siqi
> >> >> >> >
> >> >> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >> >> >> >>
> >> >> >> >> Hey Siqi,
> >> >> >> >>
> >> >> >> >> I wonder if this could be related to the filesystem you are writing
> >> >> >> >> the files to. If p4est is not using MPI_IO, it uses POSIX in the hope
> >> >> >> >> that this works reliably.
> >> >> >> >>
> >> >> >> >> Do you know what kind of filesystem your output directory is on? Is
> >> >> >> >> there a different filesystem you can try?
> >> >> >> >>
> >> >> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang
> >> >> >> >> <siqi.zhang at mq.edu.au>
> >> >> >> >> wrote:
> >> >> >> >> > Hi all,
> >> >> >> >> >
> >> >> >> >> > Recently, I have encountered an issue while trying to restart from a
> >> >> >> >> > checkpoint. It seems the solution is put in the wrong place during
> >> >> >> >> > the restart (see the two figures). However, the problem is not always
> >> >> >> >> > repeatable (sometimes it restarts fine), and it might be related to
> >> >> >> >> > something wrong in deal.II or p4est.
> >> >> >> >> >
> >> >> >> >> > The versions I used to build ASPECT are:
> >> >> >> >> > DEAL.II 8.2.1
> >> >> >> >> > P4EST 0.3.4.2 (encountered a similar problem on 1.1 as well)
> >> >> >> >> > TRILINOS 11.12.1
> >> >> >> >> > MPI openmpi 1.8.3 (with gcc 4.4.7)
> >> >> >> >> > and I am using the most recent development version of aspect
> >> >> >> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
> >> >> >> >> >
> >> >> >> >> > I reproduced the problem with the attached prm file (using 2 nodes,
> >> >> >> >> > 24 processors in total). I wonder if any of you would like to give it
> >> >> >> >> > a try to see if it is a bug or just a bad installation on my machine?
> >> >> >> >> >
> >> >> >> >> > Regards,
> >> >> >> >> >
> >> >> >> >> > Siqi
> >> >> >> >> >
> >> >> >> >> > --
> >> >> >> >> > Siqi Zhang
> >> >> >> >> >
> >> >> >> >> > Research Associate
> >> >> >> >> > ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
> >> >> >> >> > Department of Earth and Planetary Sciences
> >> >> >> >> > Macquarie University
> >> >> >> >> > NSW 2109
> >> >> >> >> >
> >> >> >> >> > Telephone: +61 2 9850 4727
> >> >> >> >> > http://www.CCFS.mq.edu.au
> >> >> >> >> > http://www.GEMOC.mq.edu.au
> >> >> >> >> >
> >> >> >> >> > _______________________________________________
> >> >> >> >> > Aspect-devel mailing list
> >> >> >> >> > Aspect-devel at geodynamics.org
> >> >> >> >> > http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
> >> >>
> >> >> --
> >> >> Timo Heister
> >> >> http://www.math.clemson.edu/~heister/