[aspect-devel] Solution goes wrong after restart

Siqi Zhang siqi.zhang at mq.edu.au
Tue May 19 17:46:48 PDT 2015


Hi Timo,

Thanks for your reply. Since it only happens with certain settings, I stopped
chasing it for a while.
The problem still exists at my end, even with the most recent version, v1.3.
I found that this problem can also be reproduced with the virtual machine you
created on the ASPECT website (v11 with 24 MPI processes), so I think this must
be a bug (although it might be a bug inside deal.II or p4est rather than
ASPECT).
I also found some additional information regarding this problem. I managed to
wire the dof_indices into the output for debugging, and found that the indices
on some processes change during the restart (in a 24-process restart test, the
indices on processes 0,2,4,6,8,... are OK, while those on processes
1,3,5,7,... have changed). I assume they should stay the same for the restart
to succeed.
It seems this problem is caused by the node numbering changing during the
restart rather than by the solution vector not being stored properly.
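In case it is useful, here is a minimal sketch of how one could dump each
rank's locally owned DoF indices before checkpointing and again after
restarting, to compare the enumerations. This is not the code I actually used
and not part of ASPECT; the function name and file naming are made up for
illustration:

    // Hypothetical debugging helper: write each MPI rank's locally owned
    // DoF index set to its own file so that the "before checkpoint" and
    // "after restart" enumerations can be diffed.
    #include <deal.II/base/index_set.h>
    #include <deal.II/base/mpi.h>
    #include <deal.II/dofs/dof_handler.h>

    #include <fstream>
    #include <sstream>
    #include <string>

    template <int dim>
    void dump_owned_dofs(const dealii::DoFHandler<dim> &dof_handler,
                         const MPI_Comm                 mpi_communicator,
                         const std::string             &label)
    {
      const unsigned int rank =
        dealii::Utilities::MPI::this_mpi_process(mpi_communicator);

      std::ostringstream filename;
      filename << "owned_dofs_" << label << "_rank" << rank << ".txt";

      std::ofstream out(filename.str().c_str());
      dof_handler.locally_owned_dofs().print(out); // one IndexSet per rank
    }

Calling this with label "before" just prior to the checkpoint and with label
"after" right after resuming, then diffing the files, shows directly whether
the DoF enumeration changed on the odd-numbered ranks.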

I have attached the prm file again; I simply start and then restart with "end
time = 0". I hope this helps you reproduce it and figure out what goes wrong.
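In prm terms, the test procedure (matching Timo's steps quoted below) is just
the following sketch; the checkpointing settings themselves are left as in the
attached prm:

    # First run: writes the checkpoint and stops immediately
    set End time           = 0
    set Resume computation = false

    # Second run: change only these two lines and launch with the same
    # mpirun command
    #   set End time           = 2e5
    #   set Resume computation = true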

Regards,

Siqi

2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:

> Hey Siqi,
>
> I cannot reproduce it on my workstation:
> - changed end time to 0, resume=false
> - ran with mpirun -n 24
> - waited until it stopped
> - set end time to 2e5, resume=true
> - ran with mpirun -n 24
> - output/solution-00001 looks fine
>
> Sorry, I have no idea what is going on and I don't think that this is
> a configuration problem (because you experience this on different
> machines).
>
> On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> > Hi Timo,
> >
> > I tried to troubleshoot this problem, but I still have no clue. This thing
> > just drives me crazy.
> > Disabling/enabling MPI_IO in the p4est build doesn't change the result, and
> > reverting p4est from version 1.1 to 0.3.4.2 doesn't change it either. I
> > also tried the development version of deal.II; the problem still exists.
> >
> > After setting "end time = 0" while keeping the refinement settings, and
> > restarting with the same settings:
> > The problem seems repeatable with 24 processors (across 2 nodes).
> > The problem seems repeatable with 24 processors (on 1 node).
> > The problem disappears with 12 processors (across 2 nodes).
> >
> > The problem disappears after removing the initial refinement (predefined
> > refinement levels depending on depth), so I guess the grid needs to be
> > complex enough for this to happen.
> >
> > The problem is not so random here: for a given prm file and a given number
> > of processors, it seems to happen every time, but it may disappear when the
> > prm file or the number of processors is changed.
> > I have also encountered a similar problem on other machines (using the same
> > package versions, manually built with a different compiler, a different MPI
> > version, and a different file system): the supercomputer NCI_Raijin
> > (OpenMPI 1.6.3, Intel compiler 12.1.9.293, Lustre file system) and our
> > single-node machine (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
> > It is also strange that I have never encountered similar problems with some
> > large 3D models running on NCI_Raijin (using more than 200 processors,
> > restarted quite a few times, and using similar mesh refinements).
> >
> > The simulation itself runs fine; it is just the checkpoint that gets
> > corrupted, so I guess it happens when saving/loading the distributed
> > triangulation. The grid also looks fine to me; it is just that some of the
> > solution ends up in the wrong place after the restart.
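(For context, a minimal sketch of the save/load mechanism referred to here,
assuming deal.II 8.x. This is not ASPECT's actual checkpoint code, the
function names are made up, and method names may differ slightly between
deal.II versions.)

    // Sketch: serialize / deserialize a parallel::distributed::Triangulation
    // together with a solution vector. The data is attached to the
    // triangulation and written/read through p4est.
    #include <deal.II/base/index_set.h>
    #include <deal.II/distributed/solution_transfer.h>
    #include <deal.II/distributed/tria.h>
    #include <deal.II/dofs/dof_handler.h>
    #include <deal.II/fe/fe.h>
    #include <deal.II/lac/trilinos_vector.h>

    #include <string>

    template <int dim>
    void save_checkpoint(dealii::parallel::distributed::Triangulation<dim> &tria,
                         const dealii::DoFHandler<dim>               &dof_handler,
                         const dealii::TrilinosWrappers::MPI::Vector &solution,
                         const std::string                           &filename)
    {
      dealii::parallel::distributed::SolutionTransfer<
        dim, dealii::TrilinosWrappers::MPI::Vector> sol_trans(dof_handler);

      // attach the solution data to the triangulation, then write
      // mesh + data in parallel
      sol_trans.prepare_serialization(solution);
      tria.save(filename.c_str());
    }

    template <int dim>
    void load_checkpoint(dealii::parallel::distributed::Triangulation<dim> &tria,
                         dealii::DoFHandler<dim>               &dof_handler,
                         const dealii::FiniteElement<dim>      &fe,
                         dealii::TrilinosWrappers::MPI::Vector &solution,
                         const MPI_Comm                         mpi_communicator,
                         const std::string                     &filename)
    {
      // the same coarse mesh must have been re-created before load();
      // DoFs are then redistributed and the attached data read back
      tria.load(filename.c_str());
      dof_handler.distribute_dofs(fe);

      solution.reinit(dof_handler.locally_owned_dofs(), mpi_communicator);

      dealii::parallel::distributed::SolutionTransfer<
        dim, dealii::TrilinosWrappers::MPI::Vector> sol_trans(dof_handler);
      sol_trans.deserialize(solution);
    }

If the DoF enumeration at load time differs on some ranks from the one used at
save time, the restored values end up associated with the wrong nodes, which
is exactly the misplaced-solution symptom described here.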
> >
> > So could you verify whether the prm file restarts fine on some of your
> > machines? If it works fine, could you send me some information on the
> > package versions of deal.II/p4est/MPI you used? If it is something I did
> > wrong while building those packages, do you have any clue about what could
> > be leading to this problem?
> >
> > Thanks and regards,
> >
> > Siqi
> >
> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >>
> >> Any news on this issue, Siqi?
> >>
> >> Can you experiment with the problem to find out when it happens? How many
> >> processors do you need to see the problem? How often does it occur? Can
> >> you maybe simplify the .prm to do one checkpoint after timestep 1 and then
> >> end, to check whether that is enough?
> >>
> >>
> >>
> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> >> > Hi Timo,
> >> >
> >> > Thanks for your reply.
> >> > The file system I am using for output in the previous tests on our
> >> > in-house cluster is just a remotely mounted drive, not a distributed
> >> > file system. However, a different test on another Australian
> >> > supercomputer, NCI_Raijin, which uses a Lustre file system, also
> >> > produces a similar problem.
> >> >
> >> > My current p4est setup should have MPI_IO enabled; I will try to disable
> >> > it and see if it changes the story.
> >> >
> >> > Regards,
> >> >
> >> > Siqi
> >> >
> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >> >>
> >> >> Hey Siqi,
> >> >>
> >> >> I wonder if this could be related to the filesystem you are writing the
> >> >> files to. If p4est is not using MPI_IO, it uses POSIX I/O in the hope
> >> >> that this works reliably.
> >> >>
> >> >> Do you know what kind of filesystem your output directory is on? Is
> >> >> there a different filesystem you can try?
> >> >>
> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >> >> wrote:
> >> >> > Hi all,
> >> >> >
> >> >> > Recently, I have encountered an issue while trying to restart from a
> >> >> > checkpoint. It seems the solution is placed in the wrong location on
> >> >> > restart (see the two figures). However, the problem is not always
> >> >> > reproducible (sometimes it restarts fine), and it might be related to
> >> >> > something wrong in deal.II or p4est.
> >> >> >
> >> >> > The versions I used to build ASPECT are:
> >> >> > deal.II     8.2.1
> >> >> > p4est       0.3.4.2  (encountered a similar problem with 1.1 as well)
> >> >> > Trilinos    11.12.1
> >> >> > MPI         OpenMPI 1.8.3 (with gcc 4.4.7)
> >> >> > and I am using the most recent development version of ASPECT
> >> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8).
> >> >> >
> >> >> > I reproduced the problem with the attached prm file (using 2 nodes,
> >> >> > 24 processors in total). I wonder whether any of you would like to
> >> >> > give it a try and see if it is a bug or just a bad installation on my
> >> >> > machine?
> >> >> >
> >> >> > Regards,
> >> >> >
> >> >> > Siqi
> >> >> >
>
> --
> Timo Heister
> http://www.math.clemson.edu/~heister/
>



-- 
Siqi Zhang

Research Associate
ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
Department of Earth and Planetary Sciences
Macquarie University
NSW 2109

Telephone: +61 2 9850 4727
http://www.CCFS.mq.edu.au
http://www.GEMOC.mq.edu.au
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple-compressible.prm
Type: application/octet-stream
Size: 3354 bytes
Desc: not available
URL: <http://lists.geodynamics.org/pipermail/aspect-devel/attachments/20150520/91031d81/attachment-0001.obj>

