[aspect-devel] Solution goes wrong after restart

Siqi Zhang siqi.zhang at mq.edu.au
Wed May 20 16:49:12 PDT 2015


Hi Timo,
Thanks for the test.
This is super strange. I still get garbage results at step 1 when doing exactly
the same thing in the VM.

Siqi

2015-05-21 7:30 GMT+10:00 Timo Heister <heister at clemson.edu>:

> Siqi,
>
> so I should be able to (in the vm):
>
> 1. run the .prm with mpirun -n 24
> 2. wait until it ends
> 3. change: set Resume computation = true and set End time = 5e5
> 4. run again with mpirun -n 24, stop after timestep 1
> 5. look at T of solution-00001 and see garbage
>
> right? Because this works just fine for me.
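>
> For reference, the only difference between the two runs is then something
> like this in the .prm:
>
>   set Resume computation = true
>   set End time           = 5e5
>
> with both runs started the same way, e.g.
>
>   mpirun -n 24 ./aspect <the attached prm file>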
>
>
> On Tue, May 19, 2015 at 8:46 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> > Hi Timo,
> >
> > Thanks for your reply. Since it only happens with some settings, I stopped
> > chasing it for a while.
> > The problem still exists at my end, even with the most recent v1.3 version.
> > I found that the problem can also be reproduced with the virtual machine you
> > created on the ASPECT website (v11, with 24 MPI processes). So I think this
> > must be a bug (although it might be a bug inside deal.II or p4est rather
> > than ASPECT).
> > I also found some additional information regarding this problem. I managed
> > to hard-wire the dof_indices into the output, and I found that the indices
> > of some processes change during the restart (in a 24-process restart test,
> > the indices on processes 0, 2, 4, 6, 8, ... are OK, while those on
> > processes 1, 3, 5, 7, ... have changed). I guess they should stay the same
> > for the restart to succeed.
> > So it seems the problem is caused by the node numbering changing during the
> > restart rather than by the solution vector not being stored properly.
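> >
> > For reference, a minimal sketch of the kind of instrumentation I mean, in
> > plain deal.II (names such as dof_handler are placeholders, not the actual
> > ASPECT code): each rank writes its locally owned DoF indices to a file,
> > once before the checkpoint and once after the restart, and the files are
> > then compared.
> >
> >   #include <deal.II/base/mpi.h>
> >   #include <deal.II/base/utilities.h>
> >   #include <deal.II/dofs/dof_handler.h>
> >   #include <fstream>
> >   #include <string>
> >
> >   // Write the locally owned DoF indices of this MPI rank to a file so
> >   // that the numbering can be compared before and after a restart.
> >   template <int dim>
> >   void print_owned_dofs(const dealii::DoFHandler<dim> &dof_handler,
> >                         const MPI_Comm                 mpi_communicator)
> >   {
> >     const unsigned int rank =
> >       dealii::Utilities::MPI::this_mpi_process(mpi_communicator);
> >     const std::string filename =
> >       "owned_dofs." + dealii::Utilities::int_to_string(rank);
> >     std::ofstream out(filename.c_str());
> >     dof_handler.locally_owned_dofs().print(out);
> >   }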
> >
> > I attached the prm file again. I just start, and then restart, with
> > "End time = 0".
> > I hope this helps you reproduce it and figure out what goes wrong.
> >
> > Regards,
> >
> > Siqi
> >
> > 2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:
> >>
> >> Hey Siqi,
> >>
> >> I cannot reproduce it on my workstation:
> >> - changed end time to 0, resume=false
> >> - ran with mpirun -n 24
> >> - waited until it stopped
> >> - set end time to 2e5, resume=true
> >> - ran with mpirun -n 24
> >> - output/solution-00001 looks fine
> >>
> >> Sorry, I have no idea what is going on and I don't think that this is
> >> a configuration problem (because you experience this on different
> >> machines).
> >>
> >> > On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> >> > Hi Timo,
> >> >
> >> > I tried to troubleshoot this problem, but still have no clue. This thing
> >> > just drives me crazy.
> >> > Disabling/enabling MPI_IO in the p4est build doesn't change the result,
> >> > and reverting p4est from version 1.1 to 0.3.4.2 doesn't change it either.
> >> > I also tried the development version of deal.II; the problem still exists.
> >> >
> >> > After setting "End time = 0" while keeping the refinement settings, and
> >> > restarting with the same settings:
> >> > The problem seems repeatable with 24 processors (across 2 nodes).
> >> > The problem seems repeatable with 24 processors (on 1 node).
> >> > The problem disappears with 12 processors (across 2 nodes).
> >> >
> >> > The problem disappears after removing the initial refinement (predefined
> >> > refinement levels that depend on depth); I guess the grid needs to be
> >> > complex enough for this to happen.
> >> >
> >> > The problem is not so random here: for a given prm file with a given
> >> > number of processors, it seems to happen every time, but it may disappear
> >> > when changing the prm file or the number of processors.
> >> > I have also encountered a similar problem on other machines (using the
> >> > same package versions, built manually with different compilers, different
> >> > MPI versions, and different file systems): the supercomputer NCI_Raijin
> >> > (OpenMPI 1.6.3, Intel compiler 12.1.9.293, Lustre file system) and our
> >> > single-node machine (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
> >> > It is also strange that I have never encountered similar problems with
> >> > some large 3D models running on NCI_Raijin (using more than 200
> >> > processors, restarted quite a few times, and using similar mesh
> >> > refinements).
> >> >
> >> > The simulation itself runs fine; only the checkpoint gets corrupted. So I
> >> > guess it happens when saving/loading the distributed triangulation. The
> >> > grid looks fine to me; it is just that some of the solution ends up in
> >> > the wrong place after the restart.
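> >> >
> >> > For what it's worth, my understanding of what the checkpoint does
> >> > underneath is roughly the following deal.II pattern (a simplified sketch
> >> > with placeholder names, not the actual ASPECT code):
> >> >
> >> >   #include <deal.II/distributed/tria.h>
> >> >   #include <deal.II/distributed/solution_transfer.h>
> >> >   #include <deal.II/dofs/dof_handler.h>
> >> >   #include <deal.II/fe/fe.h>
> >> >   #include <deal.II/lac/trilinos_vector.h>
> >> >
> >> >   using namespace dealii;
> >> >
> >> >   // Save: attach the solution vector to the triangulation, then write
> >> >   // the triangulation plus the attached data to disk.
> >> >   template <int dim>
> >> >   void save_checkpoint(parallel::distributed::Triangulation<dim> &tria,
> >> >                        DoFHandler<dim>                           &dof_handler,
> >> >                        const TrilinosWrappers::MPI::Vector       &solution)
> >> >   {
> >> >     parallel::distributed::SolutionTransfer<dim, TrilinosWrappers::MPI::Vector>
> >> >       transfer(dof_handler);
> >> >     transfer.prepare_serialization(solution);  // prepare_for_serialization() in newer deal.II
> >> >     tria.save("restart.mesh");
> >> >   }
> >> >
> >> >   // Restart: load the triangulation, redistribute the DoFs, and read the
> >> >   // solution back. If the DoF numbering here differs from the one used at
> >> >   // save time, the restored values can end up attached to the wrong nodes.
> >> >   template <int dim>
> >> >   void load_checkpoint(parallel::distributed::Triangulation<dim> &tria,
> >> >                        DoFHandler<dim>                           &dof_handler,
> >> >                        const FiniteElement<dim>                  &fe,
> >> >                        TrilinosWrappers::MPI::Vector             &solution)
> >> >   {
> >> >     tria.load("restart.mesh");
> >> >     dof_handler.distribute_dofs(fe);
> >> >     solution.reinit(dof_handler.locally_owned_dofs(),
> >> >                     tria.get_communicator());
> >> >     parallel::distributed::SolutionTransfer<dim, TrilinosWrappers::MPI::Vector>
> >> >       transfer(dof_handler);
> >> >     transfer.deserialize(solution);
> >> >   }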
> >> >
> >> > So could you verify whether the prm file restarts fine on some of your
> >> > machines?
> >> > If it works fine, could you send me information on the package versions
> >> > of deal.II/p4est/MPI you used? If it is something I did wrong while
> >> > building those packages, do you have any clues about what could lead to
> >> > this problem?
> >> >
> >> > Thanks and regards,
> >> >
> >> > Siqi
> >> >
> >> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >> >>
> >> >> Any news on this issue, Siqi?
> >> >>
> >> >> Can you experiment with the problem to find out when it happens? How
> >> >> many processors do you need to see the problem? How often does it occur?
> >> >> Can you maybe simplify the .prm to do one checkpoint after timestep 1
> >> >> and then end, to check whether that is enough?
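> >> >>
> >> >> If I remember the parameter names right, something along these lines in
> >> >> the .prm should do that (please double-check the names against the
> >> >> manual):
> >> >>
> >> >>   subsection Checkpointing
> >> >>     set Steps between checkpoint = 1
> >> >>   end
> >> >>
> >> >>   set End time = <a value just past the first time step>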
> >> >>
> >> >>
> >> >>
> >> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >> >> wrote:
> >> >> > Hi Timo,
> >> >> >
> >> >> > Thanks for your reply.
> >> >> > The file system I used for output in the previous tests on our in-house
> >> >> > cluster is just a remotely mounted drive, not a distributed file system.
> >> >> > However, a different test on another Australian supercomputer,
> >> >> > NCI_Raijin, which uses a Lustre file system, also produces a similar
> >> >> > problem.
> >> >> >
> >> >> > My current p4est setup should have MPI_IO enabled; I will try disabling
> >> >> > it to see if that changes the story.
> >> >> >
> >> >> > Regards,
> >> >> >
> >> >> > Siqi
> >> >> >
> >> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >> >> >>
> >> >> >> Hey Siqi,
> >> >> >>
> >> >> >> I wonder if this could be related to the filesystem you are writing
> >> >> >> the files to. If p4est is not using MPI_IO, it uses POSIX I/O in the
> >> >> >> hope that this works reliably.
> >> >> >>
> >> >> >> Do you know what kind of filesystem your output directory is on? Is
> >> >> >> there a different filesystem you can try?
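> >> >> >>
> >> >> >> (If you are not sure, a generic command such as
> >> >> >>
> >> >> >>   df -T output/
> >> >> >>
> >> >> >> run in the run directory will print the filesystem type; "output/"
> >> >> >> stands for whatever your prm uses as the output directory.)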
> >> >> >>
> >> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> >> >> >> > Hi all,
> >> >> >> >
> >> >> >> > Recently, I have encountered an issue while trying to restart from
> >> >> >> > a checkpoint. It seems the solution is put in the wrong place on
> >> >> >> > restart (see the two figures). However, the problem is not always
> >> >> >> > repeatable (sometimes it restarts fine), and it might be related to
> >> >> >> > something wrong in deal.II or p4est.
> >> >> >> >
> >> >> >> > The versions I used to build ASPECT are:
> >> >> >> > deal.II    8.2.1
> >> >> >> > p4est      0.3.4.2  (I encountered a similar problem with 1.1 as well)
> >> >> >> > Trilinos   11.12.1
> >> >> >> > MPI        OpenMPI 1.8.3 (with gcc 4.4.7)
> >> >> >> > and I am using the most recent development version of ASPECT
> >> >> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8).
> >> >> >> >
> >> >> >> > I reproduced the problem with the attached prm file (using 2 nodes,
> >> >> >> > 24 processors in total). I wonder if any of you would like to give
> >> >> >> > it a try to see whether it is a bug or just a bad installation on my
> >> >> >> > machine?
> >> >> >> >
> >> >> >> > Regards,
> >> >> >> >
> >> >> >> > Siqi
> >> >> >> >
> >> >> >> > --
> >> >> >> > Siqi Zhang
> >> >> >> >
> >> >> >> > Research Associate
> >> >> >> > ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
> >> >> >> > Department of Earth and Planetary Sciences
> >> >> >> > Macquarie University
> >> >> >> > NSW 2109
> >> >> >> >
> >> >> >> > Telephone: +61 2 9850 4727
> >> >> >> > http://www.CCFS.mq.edu.au
> >> >> >> > http://www.GEMOC.mq.edu.au
> >> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> >
> >
>
>
>
> --
> Timo Heister
> http://www.math.clemson.edu/~heister/
>



-- 
Siqi Zhang

Research Associate
ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
Department of Earth and Planetary Sciences
Macquarie University
NSW 2109

Telephone: +61 2 9850 4727
http://www.CCFS.mq.edu.au
http://www.GEMOC.mq.edu.au