[aspect-devel] Solution goes wrong after restart

Timo Heister timo.heister at gmail.com
Wed May 20 21:41:37 PDT 2015


Did you change anything else? Did you update ASPECT? Debug or release mode?

On Wed, May 20, 2015 at 7:49 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> Hi Timo,
> Thanks for the test.
> This is super strange. I got garbage results at step 1 when doing exactly
> the same thing in the VM.
>
> Siqi
>
> 2015-05-21 7:30 GMT+10:00 Timo Heister <heister at clemson.edu>:
>>
>> Siqi,
>>
>> so I should be able to (in the VM):
>>
>> 1. run the .prm with mpirun -n 24
>> 2. wait until it ends
>> 3. change: set Resume computation = true and set End time = 5e5
>> 4. run again with mpirun -n 24, stop after timestep 1
>> 5. look at T of solution-00001 and see garbage
>>
>> right? Because this works just fine for me.
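
[Editor's note: for reference, the two parameter changes between the initial
run and the restart, as quoted in the steps above, would look roughly like
this in the .prm file; all surrounding parameters stay unchanged.]

```
# Initial run: run the model to its end time, writing a checkpoint on the way.
set Resume computation = false

# Restart run: pick up from the checkpoint and continue to a later end time.
set Resume computation = true
set End time           = 5e5
```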
>>
>>
>> On Tue, May 19, 2015 at 8:46 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
>> > Hi Timo,
>> >
>> > Thanks for your reply. Since it only happens with some settings, I stopped
>> > chasing it for a while.
>> > The problem still exists on my end, even with the most recent v1.3 version.
>> > I found that it can also be reproduced with the virtual machine you created
>> > on the ASPECT website (v11 with 24 MPI processes), so I think this must be
>> > a bug. (However, it might be a bug inside deal.II or p4est rather than
>> > ASPECT.)
>> > I also found some additional information regarding this problem. I managed
>> > to wire the dof_indices into the output, and found that the indices of
>> > some processes change during the restart (for a 24-process restart test,
>> > the indices on processes 0, 2, 4, 6, 8, ... are OK, while those on
>> > processes 1, 3, 5, 7, ... have changed). I believe they should stay the
>> > same for the restart to succeed.
>> > So it seems this problem is caused by the node numbering changing during
>> > the restart, rather than by the solution vector not being stored properly.
>> >
>> > I attached the prm file again; I just start and restart with "End time =
>> > 0". I hope this helps you reproduce the problem and figure out what goes
>> > wrong.
>> >
>> > Regards,
>> >
>> > Siqi
>> >
>> > 2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:
>> >>
>> >> Hey Siqi,
>> >>
>> >> I cannot reproduce it on my workstation:
>> >> - changed end time to 0, resume=false
>> >> - ran with mpirun -n 24
>> >> - waited until it stopped
>> >> - set end time to 2e5, resume=true
>> >> - ran with mpirun -n 24
>> >> - output/solution-00001 looks fine
>> >>
>> >> Sorry, I have no idea what is going on and I don't think that this is
>> >> a configuration problem (because you experience this on different
>> >> machines).
>> >>
>> >> On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
>> >> wrote:
>> >> > Hi Timo,
>> >> >
>> >> > I tried to troubleshoot this problem, but still have no clue; it is
>> >> > driving me crazy.
>> >> > Disabling/enabling MPI_IO in the p4est build doesn't change the result,
>> >> > and reverting the p4est version from 1.1 to 0.3.4.2 doesn't change it
>> >> > either. I also tried the development version of deal.II; the problem
>> >> > still exists.
>> >> >
>> >> > After setting "End time = 0" while keeping the refinement settings, and
>> >> > restarting with the same settings:
>> >> > The problem seems repeatable with 24 processors (across 2 nodes).
>> >> > The problem seems repeatable with 24 processors (on 1 node).
>> >> > The problem disappears with 12 processors (across 2 nodes).
>> >> >
>> >> > The problem also disappears after removing the initial refinement
>> >> > (predefined refinement levels depending on depth); I guess the grid
>> >> > needs to be complex enough for this to happen.
>> >> >
>> >> > The problem is not so random here: for a given prm file and a given
>> >> > number of processors, it seems to always happen, but it may disappear
>> >> > when the prm file or the number of processors changes.
>> >> > I have also encountered a similar problem on other machines (using the
>> >> > same package setup, manually built with different compilers, MPI
>> >> > versions, and file systems): the supercomputer NCI_Raijin (OpenMPI
>> >> > 1.6.3, Intel compiler 12.1.9.293, Lustre file system) and our
>> >> > single-node machine (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
>> >> > It is also strange that I have never encountered a similar problem with
>> >> > some large 3D models running on NCI_Raijin (using more than 200
>> >> > processors, restarted quite a few times, and using similar mesh
>> >> > refinements).
>> >> >
>> >> > The simulation itself runs fine; it is the checkpoint that gets
>> >> > corrupted, so I guess it happens when saving/loading the distributed
>> >> > triangulation. The grid looks fine to me; it is just that some of the
>> >> > solution ends up in the wrong place at restart.
>> >> >
>> >> > So could you verify whether the prm file restarts fine on some of your
>> >> > machines? If it works, could you send me the package versions of
>> >> > deal.II/p4est/MPI you used? And if it is something I did wrong while
>> >> > building those packages, do you have any clue what could lead to this
>> >> > problem?
>> >> >
>> >> > Thanks and regards,
>> >> >
>> >> > Siqi
>> >> >
>> >> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>> >> >>
>> >> >> Any news on this issue, Siqi?
>> >> >>
>> >> >> Can you experiment with the problem to find out when it happens? How
>> >> >> many processors do you need to see the problem? How often does it
>> >> >> occur? Can you maybe simplify the .prm to write one checkpoint after
>> >> >> timestep 1 and then end, to check whether that is enough?
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
>> >> >> wrote:
>> >> >> > Hi Timo,
>> >> >> >
>> >> >> > Thanks for your reply.
>> >> >> > The file system I was using for output in the previous tests on our
>> >> >> > in-house cluster is just a remotely mounted drive, not a distributed
>> >> >> > file system. However, different tests on another Australian
>> >> >> > supercomputer, NCI_Raijin, which uses a Lustre file system, also
>> >> >> > produce a similar problem.
>> >> >> >
>> >> >> > My current p4est setup should have MPI_IO enabled; I will try
>> >> >> > disabling it and see whether that changes the story.
>> >> >> >
>> >> >> > Regards,
>> >> >> >
>> >> >> > Siqi
>> >> >> >
>> >> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>> >> >> >>
>> >> >> >> Hey Siqi,
>> >> >> >>
>> >> >> >> I wonder if this could be related to the filesystem you are writing
>> >> >> >> the files to. If p4est is not using MPI_IO, it falls back to POSIX
>> >> >> >> I/O in the hope that this works reliably.
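
[Editor's note: for anyone trying to reproduce this, the MPI_IO behavior
discussed here is chosen when p4est is configured. A sketch, assuming a
standard autotools build of p4est; check `./configure --help` of your p4est
version for the exact flag names:]

```
# configure p4est with MPI and MPI-based file I/O enabled
./configure --enable-mpi --enable-mpiio

# or without MPI_IO, so checkpoint files are written via POSIX I/O instead
./configure --enable-mpi
```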
>> >> >> >>
>> >> >> >> Do you know what kind of filesystem your output directory is on?
>> >> >> >> Is
>> >> >> >> there a different filesystem you can try?
>> >> >> >>
>> >> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang
>> >> >> >> <siqi.zhang at mq.edu.au>
>> >> >> >> wrote:
>> >> >> >> > Hi all,
>> >> >> >> >
>> >> >> >> > Recently I encountered an issue while trying to restart from a
>> >> >> >> > checkpoint. It seems the solution is put in the wrong place on
>> >> >> >> > restart (see the two figures). However, the problem is not always
>> >> >> >> > repeatable (sometimes it restarts fine), and it might be related
>> >> >> >> > to something wrong in deal.II or p4est.
>> >> >> >> >
>> >> >> >> > The versions I used to build ASPECT are:
>> >> >> >> > deal.II    8.2.1
>> >> >> >> > p4est      0.3.4.2  (encountered a similar problem on 1.1 as well)
>> >> >> >> > Trilinos   11.12.1
>> >> >> >> > MPI        OpenMPI 1.8.3 (with gcc 4.4.7)
>> >> >> >> > and I am using the most recent development version of ASPECT
>> >> >> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
>> >> >> >> >
>> >> >> >> > I reproduced the problem with the attached prm file (using 2
>> >> >> >> > nodes, 24 processors total). I wonder if any of you would like to
>> >> >> >> > give it a try and see whether it is a bug or just a bad
>> >> >> >> > installation on my machine?
>> >> >> >> >
>> >> >> >> > Regards,
>> >> >> >> >
>> >> >> >> > Siqi
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Siqi Zhang
>> >> >> >> >
>> >> >> >> > Research Associate
>> >> >> >> > ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
>> >> >> >> > Department of Earth and Planetary Sciences
>> >> >> >> > Macquarie University
>> >> >> >> > NSW 2109
>> >> >> >> >
>> >> >> >> > Telephone: +61 2 9850 4727
>> >> >> >> > http://www.CCFS.mq.edu.au
>> >> >> >> > http://www.GEMOC.mq.edu.au
>> >> >> >> >
>> >> >> >> > _______________________________________________
>> >> >> >> > Aspect-devel mailing list
>> >> >> >> > Aspect-devel at geodynamics.org
>> >> >> >> >
>> >> >> >> > http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>> >> --
>> >> Timo Heister
>> >> http://www.math.clemson.edu/~heister/
>> >
>> >
>> >
>> >
>>
>>
>>
>
>
>
>

