[aspect-devel] Solution goes wrong after restart
Timo Heister
heister at clemson.edu
Wed May 20 14:30:53 PDT 2015
Siqi,
so I should be able to (in the VM):
1. run the .prm with mpirun -n 24
2. wait until it ends
3. change: set Resume computation = true and set
End time = 5e5
4. run again with mpirun -n 24, stop after timestep 1
5. look at T of solution-00001 and see garbage
right? Because this works just fine for me.
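For reference, steps 3-4 amount to changing only these two lines in the
.prm file:

    set Resume computation = true
    set End time           = 5e5
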
On Tue, May 19, 2015 at 8:46 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> Hi Timo,
>
> Thanks for your reply. Since it only happens with some settings, I stopped
> chasing it for a while.
> The problem still exists on my end, even with the most recent v1.3 version. I
> found this problem can also be reproduced with the virtual machine you
> created on the ASPECT website (v11 with 24 MPI processes). So I think this
> must be a bug (though it might be a bug inside deal.II or p4est rather than
> ASPECT).
> And I found some additional information regarding this problem. I managed to
> short-wire the dof_indices into the output. I found that the indices of some
> processes have changed during the restart (in a 24-process restart test, the
> indices of processes 0, 2, 4, 6, 8, ... are OK, while those of processes
> 1, 3, 5, 7, ... have changed). I guess they should stay the same for the
> restart to succeed. It seems this problem is caused by the node numbering
> changing during the restart rather than by the solution vector not being
> stored properly.
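>
> (A quick way to compare, assuming each rank dumps its dof indices to a file
> like indices-<rank>.txt before the checkpoint and again after the restart;
> the file names here are just hypothetical:
>
>     for r in $(seq 0 23); do
>         cmp -s before/indices-$r.txt after/indices-$r.txt \
>             || echo "rank $r: indices changed"
>     done
>
> )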
>
> I attached the prm file again. I just start and restart with "End time = 0".
> I hope this helps you reproduce it and figure out what goes wrong.
>
> Regards,
>
> Siqi
>
> 2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:
>>
>> Hey Siqi,
>>
>> I cannot reproduce it on my workstation:
>> - changed end time to 0, resume=false
>> - ran with mpirun -n 24
>> - waited until it stopped
>> - set end time to 2e5, resume=true
>> - ran with mpirun -n 24
>> - output/solution-00001 looks fine
>>
>> Sorry, I have no idea what is going on and I don't think that this is
>> a configuration problem (because you experience this on different
>> machines).
>>
>> On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
>> > Hi Timo,
>> >
>> > I tried to troubleshoot this problem, but still have no clue. This thing
>> > just drives me crazy.
>> > Disabling/enabling MPI_IO in the p4est build doesn't change the result,
>> > and reverting the p4est version from 1.1 to 0.3.4.2 doesn't change it
>> > either. I also tried the development version of deal.II; the problem
>> > still exists.
>> >
>> > After setting "End time = 0" while keeping the refinement settings, and
>> > restarting with the same settings:
>> > The problem seems repeatable with 24 processors (across 2 nodes).
>> > The problem seems repeatable with 24 processors (on 1 node).
>> > The problem disappears with 12 processors (across 2 nodes).
>> >
>> > The problem disappears after removing the initial refinement (predefined
>> > refinement levels depending on depth); I guess the grid needs to be
>> > complex enough for this to happen.
>> >
>> > The problem is not so random here. For a certain prm file with a certain
>> > number of processors, the problem seems to always happen, but it may
>> > disappear when changing the prm file and processor numbers.
>> > I also encounter a similar problem on other machines (using the same
>> > package versions, manually built with different compilers, different MPI
>> > versions, and different file systems): the supercomputer NCI_Raijin
>> > (OpenMPI 1.6.3, Intel compiler 12.1.9.293, Lustre file system), and our
>> > single-node machine (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
>> > It is also strange that I never encounter similar problems with some
>> > large 3D models running on NCI_Raijin (using more than 200 processors,
>> > restarted quite a few times, and using similar mesh refinements).
>> >
>> > The simulation runs fine; just the checkpoint gets corrupted. So I guess
>> > it happens when saving/loading the distributed triangulation. And the
>> > grid seems fine to me; just some of the solution seems to be in the wrong
>> > place after restart.
>> >
>> > So could you verify whether the prm file restarts fine on some of your
>> > machines? If it works fine, could you send me some information on the
>> > package versions of deal.II/p4est/MPI? If it is something I did wrong
>> > while building those packages, do you have any clues about what could
>> > lead to this problem?
>> >
>> > Thanks and regards,
>> >
>> > Siqi
>> >
>> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>> >>
>> >> Any news on this issue, Siqi?
>> >>
>> >> Can you experiment with the problem to find out when it happens? How
>> >> many processors do you need to see the problem? How often does it occur?
>> >> Can you maybe simplify the .prm to do one checkpoint after timestep 1
>> >> and then end, to check if that is enough?
>> >>
>> >>
>> >>
>> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
>> >> wrote:
>> >> > Hi Timo,
>> >> >
>> >> > Thanks for your reply.
>> >> > The file system I am using for output in the previous tests on our
>> >> > in-house cluster is just a remotely mounted drive, not a distributed
>> >> > file system. However, different tests on another Australian
>> >> > supercomputer, NCI_Raijin, which uses a Lustre file system, also
>> >> > produce a similar problem.
>> >> >
>> >> > My current p4est setup should have MPI_IO enabled; I will try to
>> >> > disable it and see if it changes the story.
>> >> >
>> >> > Regards,
>> >> >
>> >> > Siqi
>> >> >
>> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>> >> >>
>> >> >> Hey Siqi,
>> >> >>
>> >> >> I wonder if this could be related to the filesystem you are writing
>> >> >> the files to. If p4est is not using MPI_IO, it uses POSIX I/O in the
>> >> >> hope that this works reliably.
>> >> >>
>> >> >> Do you know what kind of filesystem your output directory is on? Is
>> >> >> there a different filesystem you can try?
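>> >> >>
>> >> >> (On Linux you can check the filesystem type of the output directory
>> >> >> with, e.g.:
>> >> >>
>> >> >>     df -T output/
>> >> >>
>> >> >> )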
>> >> >>
>> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang <siqi.zhang at mq.edu.au>
>> >> >> wrote:
>> >> >> > Hi all,
>> >> >> >
>> >> >> > Recently, I have encountered an issue while trying to restart from
>> >> >> > a checkpoint. It seems the solution is mapped to the wrong place on
>> >> >> > restart (see the two figures). However, the problem is not always
>> >> >> > repeatable (sometimes it restarts fine), and it might be related to
>> >> >> > something wrong in deal.II or p4est.
>> >> >> >
>> >> >> > The versions I used to build ASPECT are:
>> >> >> > DEAL.II 8.2.1
>> >> >> > P4EST 0.3.4.2 (I encountered a similar problem on 1.1 as well)
>> >> >> > TRILINOS 11.12.1
>> >> >> > MPI openmpi 1.8.3 (with gcc 4.4.7)
>> >> >> > and I am using the most recent development version of ASPECT
>> >> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
>> >> >> >
>> >> >> > I reproduced the problem with the attached prm file (using 2 nodes,
>> >> >> > 24 processors total). I wonder if any of you would like to give it
>> >> >> > a try to see whether it is a bug or just a bad installation on my
>> >> >> > machine?
>> >> >> >
>> >> >> > Regards,
>> >> >> >
>> >> >> > Siqi
>> >> >> >
>> >> >> > --
>> >> >> > Siqi Zhang
>> >> >> >
>> >> >> > Research Associate
>> >> >> > ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
>> >> >> > Department of Earth and Planetary Sciences
>> >> >> > Macquarie University
>> >> >> > NSW 2109
>> >> >> >
>> >> >> > Telephone: +61 2 9850 4727
>> >> >> > http://www.CCFS.mq.edu.au
>> >> >> > http://www.GEMOC.mq.edu.au
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Aspect-devel mailing list
>> >> >> > Aspect-devel at geodynamics.org
>> >> >> > http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>>
>> --
>> Timo Heister
>> http://www.math.clemson.edu/~heister/
--
Timo Heister
http://www.math.clemson.edu/~heister/