[aspect-devel] Solution goes wrong after restart
siqi.zhang at mq.edu.au
Sun Apr 19 18:14:54 PDT 2015
I tried to troubleshoot this problem, still no clue. This thing just drives
Disabling or enabling MPI_IO in the p4est build doesn't change the result, and
reverting p4est from version 1.1 to 0.3.4.2 doesn't change it either. I also
tried the development version of deal.II; the problem still exists.
After setting "end time = 0" while keeping the refinement settings, and
restarting with the same settings:
The problem seems repeatable with 24 processors (across 2 nodes).
The problem seems repeatable with 24 processors (on 1 node).
The problem disappears with 12 processors (across 2 nodes).
The problem disappears after removing the initial refinement (predefined
refinement levels that depend on depth); I guess the grid needs to be complex
enough for this to happen.
The problem is not so random here. For a given prm file with a given number
of processors, the problem always seems to happen. But it may disappear when
the prm file or the number of processors is changed.
I also encounter a similar problem on other machines (using the same
package settings, built manually with a different compiler, a different MPI
version, and a different file system): the supercomputer NCI_Raijin (OpenMPI
1.6.3, intel compiler 22.214.171.1243, lustre file system), and our single-node
machine (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
It is also strange that I never encountered similar problems with some large
3D models running on NCI_Raijin (using more than 200 processors, restarted
quite a few times, and using similar mesh refinements).
The simulation itself runs fine; only the checkpoint gets corrupted. So I guess
it happens when saving/loading the distributed triangulation. The grid looks
fine to me; just some of the solution seems to be in the wrong place at restart.
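One way to narrow this down might be to compare the checkpoint files written by a run that restarts correctly with those from a run that does not. The following is only a minimal sketch in Python: the directory names and the `restart*` file pattern are assumptions, not something taken from the attached prm file, and the function names are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def checkpoint_digests(output_dir, pattern="restart*"):
    """Return {file name: SHA-256 hex digest} for files matching pattern.

    The default pattern "restart*" is an assumption about how the
    checkpoint files are named; adjust it to the actual output directory.
    """
    digests = {}
    for path in sorted(Path(output_dir).glob(pattern)):
        if path.is_file():
            digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_files(dir_a, dir_b, pattern="restart*"):
    """List names whose digests differ, or that exist in only one directory."""
    a = checkpoint_digests(dir_a, pattern)
    b = checkpoint_digests(dir_b, pattern)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


if __name__ == "__main__":
    # Hypothetical directory names for two runs of the same prm file.
    print(differing_files("output-good", "output-bad"))
```

Differing digests alone do not prove corruption, since two healthy runs are not guaranteed to write byte-identical files, but it would show which of the checkpoint files to look at first.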
So could you verify whether the prm file restarts fine on some of your machines?
If it works fine, could you send me some information on the package versions of
deal.II/p4est/MPI? If it is something I did wrong while building those
packages, do you have any clue what could lead to this problem?
Thanks and regards,
2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> Any news on this issue, Siqi?
> Can you experiment with the problem to find out when this problem
> happens? How many processors do you need to see the problem? How often
> does it occur? Can you maybe simplify the .prm to do one check point
> after timestep 1 and end to check if that is enough?
> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> > Hi Timo,
> > Thanks for your reply.
> > The file system I used for output in the previous tests on our in-house
> > cluster is just a remotely mounted drive, not a distributed file system.
> > However, a different test on another Australian supercomputer, NCI_Raijin,
> > which uses the lustre file system, also produces a similar problem.
> > My current p4est setup should have MPI_IO enabled; I will try disabling it
> > to see if it changes the story.
> > Regards,
> > Siqi
> > 2015-04-11 1:20 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
> >> Hey Siqi,
> >> I wonder if this could be related to the filesystem you are writing
> >> the files to. If p4est is not using MPI_IO it uses posix in the hope
> >> that this works reliably.
> >> Do you know what kind of filesystem your output directory is on? Is
> >> there a different filesystem you can try?
> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang <siqi.zhang at mq.edu.au>
> >> > Hi all,
> >> >
> >> > Recently, I have encountered an issue while trying to restart from a
> >> > checkpoint. It seems the solution ends up in the wrong place on
> >> > restart (see the two figures). However, the problem is not always
> >> > repeatable (sometimes it restarts fine), and it might be related to
> >> > something in deal.II or p4est.
> >> >
> >> > The versions I used to build ASPECT are:
> >> > DEAL.II 8.2.1
> >> > P4EST 0.3.4.2 (encountered similar problem on 1.1 as well)
> >> > TRILINOS 11.12.1
> >> > MPI openmpi 1.8.3 (with gcc 4.4.7)
> >> > and I am using the most recent development version of aspect
> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
> >> >
> >> > I reproduced the problem with the attached prm file (using 2 nodes, 24
> >> > processors in total). I wonder if any of you would like to give it a
> >> > try to see if it is a bug or just a bad installation on my machine?
> >> >
> >> > Regards,
> >> >
> >> > Siqi
> >> >
> >> > --
> >> > Siqi Zhang
> >> >
> >> > Research Associate
> >> > ARC Centre of Excellence for Core to Crust Fluid Systems (CCFS)
> >> > Department of Earth and Planetary Sciences
> >> > Macquarie University
> >> > NSW 2109
> >> >
> >> > Telephone: +61 2 9850 4727
> >> > http://www.CCFS.mq.edu.au
> >> > http://www.GEMOC.mq.edu.au
> >> >
> >> > _______________________________________________
> >> > Aspect-devel mailing list
> >> > Aspect-devel at geodynamics.org
> >> > http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel