[CIG-SHORT] Pylith dies after thousands of time steps

Brad Aagaard baagaard at usgs.gov
Thu Apr 30 07:37:03 PDT 2009


Tabrez-

The KSP residual indicates that this is not a convergence issue (I guessed 
wrong). Is there anything going on in the BC, fault interfaces, etc., that 
might be perturbing the residual/solution at the time step where it fails, 
compared to any other time step?

One possible explanation is that one processor is throwing an exception 
that is not caught, resulting in termination of the process on one 
processor but not the others. I think it would be very instructive to run 
the job on a single processor to see whether the process dies at the same 
time step and whether there is a more verbose error message. It may take a 
while to run, but it may save you and us a lot of run-around and 
speculative searching.
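
For reference, the serial run is just the same command without the parallel 
launch option, something like this (a sketch; the .cfg file name and node 
count are placeholders for whatever you are actually using):

  # parallel run that is failing
  pylith pylithapp.cfg --nodes=4

  # same job on a single processor
  pylith pylithapp.cfg

An uncaught C++ exception should then show up directly on stderr instead of 
being hidden behind the MPI abort.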

Brad
 

On Thursday 30 April 2009 6:42:37 am Tabrez Ali wrote:
> Matt
>
> > 1) Are you positive that it fails inside KSP? That is the most
> > bulletproof part of the code.
> >
> > 2) Does it always fail in the same place? If not, I can believe it
> > fails in KSP due to
> >
> >   a) Memory corruption somewhere else
> >
> >   b) Mismatched MPI calls hanging around
> >
> >   c) Someone holding on to MPI resources somewhere else
>
> Yes, it fails at the same point on both machines. However, if I use a
> different set of material properties then it fails much later, but again
> the time step at which it fails is the same on the two machines.
>
> > 3) Are the error messages identical on the two machines?
>
> Yes
>
> Tabrez
>
> > On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
> >> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
> >> Brad
> >>
> >> The solution at the last working step does converge and looks okay, but
> >> then nothing happens and it dies. I am, however, experimenting with
> >> time_step and will also try to use the debugger.
> >>
> >> Btw, do you know whether I can use --petsc.on_error_attach_debugger
> >> when the job is submitted via PBS, or should I just run it
> >> interactively?
> >>
> >> I do not understand why this is labeled a convergence issue, unless I
> >> am missing what you mean by "die". Non-convergence will result in a bad
> >> ConvergedReason from the solver, but nothing else. The code will
> >> continue to run.
> >>
> >> This looks like death from a signal. With the very little
> >> information in front of
> >> me, this looks like a bug in the MPI on this machine. If it was
> >> doing Sieve stuff,
> >> I would put the blame on me. But with PETSc stuff (10+ years old
> >> and used by
> >> thousands of people), I put the blame on MPI or hardware for this
> >> computer.
> >>
> >>   Matt
> >>
> >>
> >> ...
> >> ...
> >> 87 KSP Residual norm 3.579491816101e-07
> >> 88 KSP Residual norm 3.241876854223e-07
> >> 89 KSP Residual norm 2.836307394788e-07
> >>
> >> [cli_0]: aborting job:
> >> Fatal error in MPI_Wait: Error message texts are not available
> >> [cli_1]: aborting job:
> >> Fatal error in MPI_Wait: Error message texts are not available
> >> [cli_3]: aborting job:
> >> Fatal error in MPI_Wait: Error message texts are not available
> >> [cli_2]: aborting job:
> >> Fatal error in MPI_Wait: Error message texts are not available
> >> mpiexec: Warning: tasks 0-3 exited with status 1.
> >> --pyre-start: mpiexec: exit 1
> >> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/
> >> scratch96/s/stali/pylith/bin/nemesis: exit 1
> >>
> >> Tabrez
> >>
> >> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
> >> > Tabrez-
> >> >
> >> > You may want to set ksp_monitor=true so that you can see the
> >> > residual. If the residual increases significantly, the solution is
> >> > losing convergence. This can be alleviated a bit by using an absolute
> >> > convergence tolerance (ksp_atol). You probably need a slightly
> >> > smaller time step or a slightly higher quality mesh (improve the
> >> > aspect ratio of the most distorted cells).
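> >> >
> >> > Something along these lines in your parameter file should turn those
> >> > on (a sketch only; I am assuming the usual [pylithapp.petsc] section,
> >> > and the atol value is a placeholder you will want to tune):
> >> >
> >> > [pylithapp.petsc]
> >> > ksp_monitor = true
> >> > ksp_atol = 1.0e-9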
> >>
> >> > Brad
> >> >
> >> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
> >> >> Brad
> >> >>
> >> >> I think you were right. The elastic problem worked out fine. I will
> >> >> now try to play with the time step (for the viscous runs).
> >> >>
> >> >> Tabrez
> >> >>
> >> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
> >> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
> >> >>>> Also I don't see the error until ~9000 time steps with one set of
> >> >>>> material properties but get the error at around the 4000th time
> >> >>>> step with a different set of material properties (on the same mesh).
> >> >>>
> >> >>> This seems to indicate a time-integration stability issue. Does the
> >> >>> one that has an error after 4000 time steps have a smaller Maxwell
> >> >>> time? You might try running with purely elastic properties. If that
> >> >>> works, then you may need to reduce your time step.
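> >> >>>
> >> >>> As a rough check (made-up numbers, just to illustrate the scale):
> >> >>> the Maxwell time is roughly viscosity / shear modulus, so a
> >> >>> viscosity of 1.0e18 Pa*s with a shear modulus of 30 GPa gives about
> >> >>> 3e7 s (roughly a year), and the time step has to stay a small
> >> >>> fraction of that to remain stable.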
> >>
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which
> > their experiments lead.
> > -- Norbert Wiener



