[aspect-devel] Solution goes wrong after restart

Timo Heister heister at clemson.edu
Mon May 4 06:28:00 PDT 2015


Hey Siqi,

I cannot reproduce it on my workstation:
- changed end time to 0, resume=false
- ran with mpirun -n 24
- waited until it stopped
- set end time to 2e5, resume=true
- ran with mpirun -n 24
- output/solution-00001 looks fine

Sorry, I have no idea what is going on and I don't think that this is
a configuration problem (because you experience this on different
machines).

On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> Hi Timo,
>
> I tried to troubleshoot this problem, but I still have no clue. This thing
> is driving me crazy.
> Disabling/enabling MPI_IO in the p4est build doesn't change the result, and
> reverting the p4est version from 1.1 to 0.3.4.2 doesn't change it either. I
> also tried the development version of deal.II; the problem still exists.
>
> After setting "end time = 0" while keeping the refinement settings, and
> restarting with the same settings:
> The problem seems repeatable with 24 processors (across 2 nodes).
> The problem seems repeatable with 24 processors (on 1 node).
> The problem disappears with 12 processors (across 2 nodes).
>
> The problem disappears after removing the initial refinement (predefined
> refinement levels that depend on depth), so I guess the grid needs to be
> complex enough for this to happen.
>
> The problem is not so random here: for a given prm file with a given number
> of processors, it seems to always happen, but it may disappear when the prm
> file or the number of processors changes.
> I also encounter a similar problem on other machines (using the same package
> versions, built manually with a different compiler, a different MPI version,
> and a different file system): the supercomputer NCI_Raijin (OpenMPI 1.6.3,
> Intel compiler 12.1.9.293, Lustre file system) and our single-node machine
> (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
> It is also strange that I have never encountered similar problems with some
> large 3D models running on NCI_Raijin (using more than 200 processors,
> restarted quite a few times, and using similar mesh refinements).
>
> The simulation itself runs fine; only the checkpoint gets corrupted, so I
> guess it happens when saving/loading the distributed triangulation. The grid
> also looks fine to me; just some of the solution ends up in the wrong place
> after the restart.
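>
> For context, my understanding is that ASPECT's checkpoint goes through
> deal.II's parallel::distributed::Triangulation::save()/load() together with
> parallel::distributed::SolutionTransfer, so I assume the corruption happens
> somewhere along that path. A minimal sketch of that pattern (not ASPECT's
> actual code; the Trilinos vector type is just an assumption for
> illustration):
>
>   #include <deal.II/distributed/tria.h>
>   #include <deal.II/distributed/solution_transfer.h>
>   #include <deal.II/dofs/dof_handler.h>
>   #include <deal.II/fe/fe.h>
>   #include <deal.II/lac/trilinos_vector.h>
>   using namespace dealii;
>
>   template <int dim>
>   void save_checkpoint (parallel::distributed::Triangulation<dim> &tria,
>                         const DoFHandler<dim>               &dof_handler,
>                         const TrilinosWrappers::MPI::Vector &solution)
>   {
>     parallel::distributed::SolutionTransfer<dim,TrilinosWrappers::MPI::Vector>
>       transfer (dof_handler);
>     transfer.prepare_serialization (solution); // attach solution to the cells
>     tria.save ("restart.mesh");                // collective write via p4est
>   }
>
>   template <int dim>
>   void load_checkpoint (parallel::distributed::Triangulation<dim> &tria,
>                         DoFHandler<dim>                 &dof_handler,
>                         const FiniteElement<dim>        &fe,
>                         TrilinosWrappers::MPI::Vector   &solution)
>   {
>     // the coarse mesh must have been recreated identically before this point
>     tria.load ("restart.mesh");        // rebuild the mesh and repartition it
>     dof_handler.distribute_dofs (fe);  // DoF layout must match the saved one
>     parallel::distributed::SolutionTransfer<dim,TrilinosWrappers::MPI::Vector>
>       transfer (dof_handler);
>     transfer.deserialize (solution);   // solution must be sized for the
>                                        // locally owned DoFs before this call
>   }
>
> If the save side is written correctly, then load()/deserialize() would have
> to be attaching the data to the wrong cells, which would match the misplaced
> solution I see.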
>
> So could you verify whether the prm file restarts fine on some of your
> machines? If it works, could you send me the package versions of
> deal.II/p4est/MPI that you used? If it is something I did wrong while
> building those packages, do you have any clues about what could lead to
> this problem?
>
> Thanks and regards,
>
> Siqi
>
> 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>>
>> Any news on this issue, Siqi?
>>
>> Can you experiment with the problem to find out when it happens? How many
>> processors do you need to see the problem? How often does it occur? Can you
>> maybe simplify the .prm to do one checkpoint after timestep 1 and then end,
>> to check whether that is enough?
>>
>>
>>
>> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
>> > Hi Timo,
>> >
>> > Thanks for your reply.
>> > The file system I was using for output in the previous test on our
>> > in-house cluster is just a remotely mounted drive, not a distributed
>> > file system. However, a different test on another Australian
>> > supercomputer, NCI_Raijin, which uses a Lustre file system, also
>> > produces a similar problem.
>> >
>> > My current p4est setup should have MPI_IO enabled; I will try to disable
>> > it and see if it changes the story.
>> >
>> > Regards,
>> >
>> > Siqi
>> >
>> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>> >>
>> >> Hey Siqi,
>> >>
>> >> I wonder if this could be related to the filesystem you are writing
>> >> the files to. If p4est is not using MPI_IO, it uses POSIX I/O in the
>> >> hope that that works reliably.
>> >>
>> >> Do you know what kind of filesystem your output directory is on? Is
>> >> there a different filesystem you can try?
>> >>
>> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang <siqi.zhang at mq.edu.au>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > Recently, I have encountered an issue while trying to restart from a
>> >> > checkpoint. It seems the solution is placed in the wrong location on
>> >> > restart (see the two figures). However, the problem is not always
>> >> > reproducible (sometimes it restarts fine), and it might be related to
>> >> > something wrong in deal.II or p4est.
>> >> >
>> >> > The versions I used to build ASPECT are:
>> >> > DEAL.II         8.2.1
>> >> > P4EST          0.3.4.2  (I encountered a similar problem with 1.1 as well)
>> >> > TRILINOS     11.12.1
>> >> > MPI                openmpi 1.8.3 (with gcc 4.4.7)
>> >> > and I am using the most recent development version of aspect
>> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
>> >> >
>> >> > I reproduced the problem with the attached prm file (using 2 nodes,
>> >> > 24 processors total). I wonder if any of you would like to give it a
>> >> > try to see whether it is a bug or just a bad installation on my
>> >> > machine?
>> >> >
>> >> > Regards,
>> >> >
>> >> > Siqi
>> >> >
>> >
>> >
>> >
>> >
>
>
>
>
> --
> Siqi Zhang
>
> Research Associate
> ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
> Department of Earth and Planetary Sciences
> Macquarie University
> NSW 2109
>
> Telephone: +61 2 9850 4727
> http://www.CCFS.mq.edu.au
> http://www.GEMOC.mq.edu.au
>
> _______________________________________________
> Aspect-devel mailing list
> Aspect-devel at geodynamics.org
> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel

-- 
Timo Heister
http://www.math.clemson.edu/~heister/

