[CIG-SHORT] Pylith dies after thousands of time steps (convergence issue)

Matthew Knepley knepley at mcs.anl.gov
Thu Apr 30 06:15:02 PDT 2009


On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:

> Brad
>
> The solution at the last working step does converge and looks okay but
> then nothing happens and it dies. I am however experimenting with
> time_step and will also try to use the debugger.
>
> Btw do you know if I can use --petsc.on_error_attach_debugger when the
> job is submitted via PBS or should I just run it interactively?


I do not understand why this is labeled a convergence issue, unless I
misunderstand what you mean by "die". Non-convergence will result in a bad
ConvergenceReason from the solver, but nothing else. The code will continue
to run.

This looks like death from a signal. With the very little information in
front of me, this looks like a bug in the MPI on this machine. If it were
doing Sieve stuff, I would put the blame on me. But with PETSc stuff (10+
years old and used by thousands of people), I put the blame on MPI or
hardware for this computer.

  Matt
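
One way to check the distinction above against the log quoted below is to
pull out the KSP residual history and any MPI abort messages separately. The
following is only a rough sketch in Python: the function name summarize_log
and the returned dictionary are hypothetical, it is not part of PyLith or
PETSc, and it assumes the excerpt comes from a single linear solve, since
ksp_monitor restarts its iteration count on every solve.

```python
import re

# Matches ksp_monitor lines such as "89 KSP Residual norm 2.836307394788e-07",
# tolerating leading "> " quote markers from an email.
KSP_LINE = re.compile(r"^[>\s]*(\d+)\s+KSP Residual norm\s+([0-9.eE+-]+)\s*$")
# Matches abort messages such as "Fatal error in MPI_Wait: ...".
MPI_FATAL = re.compile(r"Fatal error in (MPI_\w+)")


def summarize_log(text):
    """Collect the residual history and MPI fatal errors from a log excerpt.

    Hypothetical helper for illustration only.
    """
    residuals = []
    mpi_errors = []
    for line in text.splitlines():
        m = KSP_LINE.match(line)
        if m:
            residuals.append((int(m.group(1)), float(m.group(2))))
            continue
        m = MPI_FATAL.search(line)
        if m:
            mpi_errors.append(m.group(1))
    # True when each residual norm is no larger than the one before it.
    decreasing = all(b[1] <= a[1] for a, b in zip(residuals, residuals[1:]))
    return {
        "residuals": residuals,
        "decreasing": decreasing,
        "mpi_errors": mpi_errors,
    }


if __name__ == "__main__":
    excerpt = """\
87 KSP Residual norm 3.579491816101e-07
88 KSP Residual norm 3.241876854223e-07
89 KSP Residual norm 2.836307394788e-07

[cli_0]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
"""
    summary = summarize_log(excerpt)
    print("residual decreasing:", summary["decreasing"])
    print("MPI fatal errors in:", summary["mpi_errors"])
```

A residual that is still decreasing right up to an MPI abort points away from
the solver and toward the crash itself, which is the reading given above.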


>
> ...
> ...
> 87 KSP Residual norm 3.579491816101e-07
> 88 KSP Residual norm 3.241876854223e-07
> 89 KSP Residual norm 2.836307394788e-07
>
> [cli_0]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_1]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_3]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_2]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> mpiexec: Warning: tasks 0-3 exited with status 1.
> --pyre-start: mpiexec: exit 1
> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>
> Tabrez
>
> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>
> > Tabrez-
> >
> > You may want to set ksp_monitor=true so that you can see the residual.
> > If the residual increases significantly, the solution is losing
> > convergence. This can be alleviated a bit by using an absolute
> > convergence tolerance (ksp_atol). You probably need a slightly smaller
> > time step or a slightly higher quality mesh (improve the aspect ratio of
> > the most distorted cells).
> >
> > Brad
> >
> >
> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
> >> Brad
> >>
> >> I think you were right. The elastic problem worked out fine. I will
> >> now try to play with time step (for the viscous runs)
> >>
> >> Tabrez
> >>
> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
> >>>> Also I don't see the error until ~9000 time steps with one set of
> >>>> material properties but get the error at around the 4000th time step
> >>>> with a different set of material properties (on the same mesh).
> >>>
> >>> This seems to indicate a time-integration stability issue. Does the
> >>> one that has an error after 4000 time steps have a smaller Maxwell
> >>> time? You might try running with purely elastic properties. If that
> >>> works, then you may need to reduce your time step.
> >
> >
>
> _______________________________________________
> CIG-SHORT mailing list
> CIG-SHORT at geodynamics.org
> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
>



-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener