[aspect-devel] Solution goes wrong after restart
Timo Heister
timo.heister at gmail.com
Sun Jun 7 02:05:52 PDT 2015
Thanks Siqi. Of course I switched output to vtu before testing and
therefore couldn't see the problem. Sorry about that. :-(
See https://github.com/geodynamics/aspect/pull/525
On Thu, Jun 4, 2015 at 4:01 AM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
> Hi Timo,
>
> I finally found a solution for this problem.
> The problem lies somewhere in the HDF5 output. The simulation is
> actually correct after the restart; only the HDF5 output has a problem.
> During the restart, the node numbering somehow changes in the HDF5
> output even though the mesh itself doesn't change (I'm not sure why; it
> may be related to the redundant points filter, I guess). However, the
> HDF5 output does not write a new mesh file if the mesh hasn't changed,
> so after a restart the solution looks like rubbish when read against
> the old mesh with the old numbering.
> After forcing ASPECT to rewrite the HDF5 mesh file after every restart,
> the problem is fixed.
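>
> In case the details help: as a minimal sketch (the flag and variable
> names here are illustrative, not the actual ASPECT code), the change
> amounts to passing write_mesh_file = true to deal.II's
> write_hdf5_parallel() whenever we have just resumed from a checkpoint,
> instead of only when the mesh changed:
>
>   #include <deal.II/base/data_out_base.h>
>   #include <deal.II/numerics/data_out.h>
>
>   // Rewrite the HDF5 mesh file if the mesh changed OR if we just
>   // resumed from a checkpoint, since the node numbering can differ
>   // after a restart even though the mesh itself is unchanged.
>   const bool write_mesh_file =
>     mesh_changed || just_resumed_from_checkpoint;
>
>   // The "redundant points" filter mentioned above corresponds to the
>   // first flag here.
>   dealii::DataOutBase::DataOutFilter data_filter(
>     dealii::DataOutBase::DataOutFilterFlags(
>       /*filter_duplicate_vertices=*/true,
>       /*xdmf_hdf5_output=*/true));
>   data_out.write_filtered_data(data_filter);
>   data_out.write_hdf5_parallel(data_filter, write_mesh_file,
>                                mesh_filename, solution_filename,
>                                mpi_communicator);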
>
> Regards,
>
> Siqi
>
> 2015-05-21 14:48 GMT+10:00 Siqi Zhang <siqi.zhang at mq.edu.au>:
>>
>> I changed nothing in the VM. I didn't recompile the code; I just used
>> the ~/aspect/aspect binary compiled there, which is in DEBUG mode.
>> It's so strange that it can produce different results. I tried a few
>> times and got the same wrong result every time.
>>
>> 2015-05-21 14:41 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>>>
>>> did you change anything else? Did you update ASPECT? Debug or release
>>> mode?
>>>
>>> On Wed, May 20, 2015 at 7:49 PM, Siqi Zhang <siqi.zhang at mq.edu.au> wrote:
>>> > Hi Timo,
>>> > Thanks for the test.
>>> > This is super strange. I got garbage results at step 1 doing
>>> > exactly the same thing in the vm.
>>> >
>>> > Siqi
>>> >
>>> > 2015-05-21 7:30 GMT+10:00 Timo Heister <heister at clemson.edu>:
>>> >>
>>> >> Siqi,
>>> >>
>>> >> so I should be able to (in the vm):
>>> >>
>>> >> 1. run the .prm with mpirun -n 24
>>> >> 2. wait until it ends
>>> >> 3. change: set Resume computation = true and set End time = 5e5
>>> >>    (see the snippet below)
>>> >> 4. run again with mpirun -n 24, stop after timestep 1
>>> >> 5. look at T of solution-00001 and see garbage
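>>> >>
>>> >> i.e. for step 3 the only .prm changes would be (a sketch; everything
>>> >> else in your .prm stays as-is):
>>> >>
>>> >>   set Resume computation = true
>>> >>   set End time           = 5e5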
>>> >>
>>> >> right? Because this works just fine for me.
>>> >>
>>> >>
>>> >> On Tue, May 19, 2015 at 8:46 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
>>> >> wrote:
>>> >> > Hi Timo,
>>> >> >
>>> >> > Thanks for your reply. Since it only happens with some settings, I
>>> >> > stopped chasing it for a while.
>>> >> > The problem still exists at my end, even with the most recent v1.3
>>> >> > version. I found the problem can also be reproduced with the
>>> >> > virtual machine you created on the ASPECT website (v11 with 24 MPI
>>> >> > processes), so I think this must be a bug. (However, it might be a
>>> >> > bug inside deal.II or p4est rather than ASPECT.)
>>> >> > And I found some additional information regarding this problem. I
>>> >> > managed to wire the dof_indices into the output, and I found that
>>> >> > the indices on some processes change during the restart (for a
>>> >> > 24-process restart test, the indices on processes 0, 2, 4, 6, 8,
>>> >> > ... are OK, while those on processes 1, 3, 5, 7, ... have
>>> >> > changed). I guess they should stay the same for the restart to
>>> >> > succeed.
>>> >> > It seems this problem is caused by the node numbering changing
>>> >> > during the restart rather than by the solution vector not being
>>> >> > stored properly.
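>>> >> >
>>> >> > For reference, the way I inspected this was roughly the following
>>> >> > (a sketch, assuming a deal.II DoFHandler named dof_handler and the
>>> >> > run's MPI communicator; not my exact code): dump each rank's
>>> >> > locally owned DoF indices to a per-rank file, once before
>>> >> > checkpointing and once after restart, and diff the files rank by
>>> >> > rank.
>>> >> >
>>> >> >   #include <deal.II/base/mpi.h>
>>> >> >   #include <deal.II/base/utilities.h>
>>> >> >   #include <deal.II/dofs/dof_handler.h>
>>> >> >   #include <fstream>
>>> >> >
>>> >> >   // Write this rank's locally owned DoF index set to its own file
>>> >> >   // so the numbering before and after the restart can be compared.
>>> >> >   const unsigned int rank =
>>> >> >     dealii::Utilities::MPI::this_mpi_process(mpi_communicator);
>>> >> >   std::ofstream out("dof_indices." +
>>> >> >                     dealii::Utilities::int_to_string(rank));
>>> >> >   dof_handler.locally_owned_dofs().print(out);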
>>> >> >
>>> >> > I attached the prm file again. I just start and restart with "End
>>> >> > time = 0". Hope this helps you reproduce it and figure out what
>>> >> > goes wrong.
>>> >> >
>>> >> > Regards,
>>> >> >
>>> >> > Siqi
>>> >> >
>>> >> > 2015-05-04 23:28 GMT+10:00 Timo Heister <heister at clemson.edu>:
>>> >> >>
>>> >> >> Hey Siqi,
>>> >> >>
>>> >> >> I cannot reproduce it on my workstation:
>>> >> >> - changed end time to 0, resume=false
>>> >> >> - ran with mpirun -n 24
>>> >> >> - waited until it stopped
>>> >> >> - set end time to 2e5, resume=true
>>> >> >> - ran with mpirun -n 24
>>> >> >> - output/solution-00001 looks fine
>>> >> >>
>>> >> >> Sorry, I have no idea what is going on and I don't think that this
>>> >> >> is
>>> >> >> a configuration problem (because you experience this on different
>>> >> >> machines).
>>> >> >>
>>> >> >> On Sun, Apr 19, 2015 at 9:14 PM, Siqi Zhang <siqi.zhang at mq.edu.au>
>>> >> >> wrote:
>>> >> >> > Hi Timo,
>>> >> >> >
>>> >> >> > I tried to troubleshoot this problem, but still no clue. This
>>> >> >> > thing just drives me crazy.
>>> >> >> > Disabling/enabling MPI_IO in the p4est build doesn't change the
>>> >> >> > result, and reverting the p4est version from 1.1 to 0.3.4.2
>>> >> >> > doesn't change it either. I also tried the development version
>>> >> >> > of deal.II; the problem still exists.
>>> >> >> >
>>> >> >> > After setting "End time = 0" while keeping the refinement
>>> >> >> > settings, and restarting with the same settings:
>>> >> >> > The problem seems repeatable with 24 processors (across 2 nodes).
>>> >> >> > The problem seems repeatable with 24 processors (on 1 node).
>>> >> >> > The problem disappears with 12 processors (across 2 nodes).
>>> >> >> >
>>> >> >> > The problem disappears after removing the initial refinement
>>> >> >> > (predefined refinement levels depending on depth); I guess the
>>> >> >> > grid needs to be complex enough for this to happen.
>>> >> >> >
>>> >> >> > The problem is not so random here. For a given prm file with a
>>> >> >> > given number of processors, the problem seems to always happen,
>>> >> >> > but it may disappear when changing the prm file and the number
>>> >> >> > of processors.
>>> >> >> > I also encountered a similar problem on other machines (same
>>> >> >> > package versions, manually built with a different compiler,
>>> >> >> > different MPI version, and different file system): the
>>> >> >> > supercomputer NCI_Raijin (OpenMPI 1.6.3, Intel compiler
>>> >> >> > 12.1.9.293, Lustre file system), and our single-node machine
>>> >> >> > (OpenMPI 1.8.4, Intel compiler 15.0.0, local disk).
>>> >> >> > It is also strange that I never encountered similar problems
>>> >> >> > with some large 3D models running on NCI_Raijin (using more
>>> >> >> > than 200 processors, restarted quite a few times, and using
>>> >> >> > similar mesh refinements).
>>> >> >> >
>>> >> >> > The simulation runs fine; just the checkpoint gets corrupted.
>>> >> >> > So I guess it happens when saving/loading the distributed
>>> >> >> > triangulation. The grid seems fine to me; just some of the
>>> >> >> > solution ends up in the wrong place at restart.
>>> >> >> >
>>> >> >> > So could you verify whether the prm file restarts fine on some
>>> >> >> > of your machines? If it works fine, could you send me some
>>> >> >> > information on the package versions of deal.II/p4est/MPI? If it
>>> >> >> > is something I did wrong while building those packages, do you
>>> >> >> > have any clue about what could lead to this problem?
>>> >> >> >
>>> >> >> > Thanks and regards,
>>> >> >> >
>>> >> >> > Siqi
>>> >> >> >
>>> >> >> > 2015-04-20 1:13 GMT+10:00 Timo Heister <timo.heister at gmail.com>:
>>> >> >> >>
>>> >> >> >> Any news on this issue, Siqi?
>>> >> >> >>
>>> >> >> >> Can you experiment with the problem to find out when it
>>> >> >> >> happens? How many processors do you need to see the problem?
>>> >> >> >> How often does it occur? Can you maybe simplify the .prm to do
>>> >> >> >> one checkpoint after timestep 1 and then end, to check if that
>>> >> >> >> is enough?
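>>> >> >> >>
>>> >> >> >> Something along these lines, maybe (parameter names from
>>> >> >> >> memory, so please double-check against the manual), plus an
>>> >> >> >> End time just past the first timestep:
>>> >> >> >>
>>> >> >> >>   subsection Checkpointing
>>> >> >> >>     set Steps between checkpoint = 1
>>> >> >> >>   end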
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Sun, Apr 12, 2015 at 9:32 PM, Siqi Zhang
>>> >> >> >> <siqi.zhang at mq.edu.au>
>>> >> >> >> wrote:
>>> >> >> >> > Hi Timo,
>>> >> >> >> >
>>> >> >> >> > Thanks for your reply.
>>> >> >> >> > The file system I was using for output in the previous tests
>>> >> >> >> > on our in-house cluster is just a remotely mounted drive, not
>>> >> >> >> > a distributed file system. However, a different test on
>>> >> >> >> > another Australian supercomputer, NCI_Raijin, which uses a
>>> >> >> >> > Lustre file system, also produces a similar problem.
>>> >> >> >> >
>>> >> >> >> > My current p4est setup should have MPI_IO enabled; I will try
>>> >> >> >> > to disable it and see if it changes the story.
>>> >> >> >> >
>>> >> >> >> > Regards,
>>> >> >> >> >
>>> >> >> >> > Siqi
>>> >> >> >> >
>>> >> >> >> > 2015-04-11 1:20 GMT+10:00 Timo Heister
>>> >> >> >> > <timo.heister at gmail.com>:
>>> >> >> >> >>
>>> >> >> >> >> Hey Siqi,
>>> >> >> >> >>
>>> >> >> >> >> I wonder if this could be related to the filesystem you are
>>> >> >> >> >> writing the files to. If p4est is not using MPI_IO, it uses
>>> >> >> >> >> POSIX I/O in the hope that this works reliably.
>>> >> >> >> >>
>>> >> >> >> >> Do you know what kind of filesystem your output directory is
>>> >> >> >> >> on?
>>> >> >> >> >> Is
>>> >> >> >> >> there a different filesystem you can try?
>>> >> >> >> >>
>>> >> >> >> >> On Fri, Apr 10, 2015 at 2:05 AM, Siqi Zhang
>>> >> >> >> >> <siqi.zhang at mq.edu.au>
>>> >> >> >> >> wrote:
>>> >> >> >> >> > Hi all,
>>> >> >> >> >> >
>>> >> >> >> >> > Recently, I have encountered an issue while trying to
>>> >> >> >> >> > restart from a checkpoint. It seems the solution is put
>>> >> >> >> >> > in the wrong place on restart (see the two figures).
>>> >> >> >> >> > However, the problem is not always repeatable (sometimes
>>> >> >> >> >> > it restarts fine), and it might be related to something
>>> >> >> >> >> > wrong in deal.II or p4est.
>>> >> >> >> >> >
>>> >> >> >> >> > The versions I used to build ASPECT are:
>>> >> >> >> >> > DEAL.II   8.2.1
>>> >> >> >> >> > P4EST     0.3.4.2 (encountered a similar problem on 1.1 as well)
>>> >> >> >> >> > TRILINOS  11.12.1
>>> >> >> >> >> > MPI       openmpi 1.8.3 (with gcc 4.4.7)
>>> >> >> >> >> > and I am using the most recent development version of ASPECT
>>> >> >> >> >> > (1b9c41713a1f234eba92b0179812a4d0b5e0c2a8)
>>> >> >> >> >> >
>>> >> >> >> >> > I reproduced the problem with the attached prm file (using
>>> >> >> >> >> > 2 nodes, 24 processors total). I wonder if any of you
>>> >> >> >> >> > would like to give it a try and see whether it is a bug or
>>> >> >> >> >> > just a bad installation on my machine?
>>> >> >> >> >> >
>>> >> >> >> >> > Regards,
>>> >> >> >> >> >
>>> >> >> >> >> > Siqi
>>> >> >> >> >> >
>>> >> >> >> >> > --
>>> >> >> >> >> > Siqi Zhang
>>> >> >> >> >> >
>>> >> >> >> >> > Research Associate
>>> >> >> >> >> > ARC Centre of Excellence for Core to Crust Fluid Systems
>>> >> >> >> >> > (CCFS)
>>> >> >> >> >> > Department of Earth and Planetary Sciences
>>> >> >> >> >> > Macquarie University
>>> >> >> >> >> > NSW 2109
>>> >> >> >> >> >
>>> >> >> >> >> > Telephone: +61 2 9850 4727
>>> >> >> >> >> > http://www.CCFS.mq.edu.au
>>> >> >> >> >> > http://www.GEMOC.mq.edu.au