[CIG-SHORT] Pylith dies after thousands of time steps (convergence issue)

Matthew Knepley knepley at mcs.anl.gov
Thu Apr 30 10:44:17 PDT 2009


My gunzip says that file is corrupt.

  Matt
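
For readers hitting the same corrupt-archive problem: gzip can test a file's
integrity without extracting it, and re-creating the archive is a one-liner.
These are standard gzip commands, not something from this thread:

  $ gzip -t out.txt.gz && echo "archive OK"   # -t tests integrity, extracts nothing
  $ gzip -c out.txt > out.txt.gz              # regenerate the archive if the test fails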

On Thu, Apr 30, 2009 at 10:11 AM, Tabrez Ali <stali at purdue.edu> wrote:

> Matt
>
>>> Yes it fails at the same point on both machines. However if I use a
>>> different material property then it fails much later, but again the time
>>> step (at which it fails) is the same on the two machines.
>>
>> Not just same time step. Same iterate. Everything.
>
> I deleted my output file on one machine but will run it again.
>
>>>> 3) Are the error messages identical on the two machines?
>>>
>>> Yes
>>
>> I need the entire error message. I mean EXACTLY the same. Letter for letter.
>
> Here's the exact message. You can see out.txt at
> http://stali.freeshell.org/out.txt.gz
>
> $ pylith --nodes=4 --petsc.ksp_type=cg > out.txt
>
> [cli_0]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_1]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_3]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_2]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> mpiexec: Warning: tasks 0-3 exited with status 1.
> --pyre-start: mpiexec: exit 1
> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith:
> /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>
> Tabrez
>
>
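
A side note on the command above: "> out.txt" captures only stdout, while the
MPI abort messages usually go to stderr. Plain shell redirection (nothing
PyLith-specific) captures both in one file:

  $ pylith --nodes=4 --petsc.ksp_type=cg > out.txt 2>&1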
> On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>>>
>>> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
>>>
>>>> Brad
>>>>
>>>> The solution at the last working step does converge and looks okay, but
>>>> then nothing happens and it dies. I am, however, experimenting with
>>>> time_step and will also try to use the debugger.
>>>>
>>>> Btw, do you know if I can use --petsc.on_error_attach_debugger when the
>>>> job is submitted via PBS, or should I just run it interactively?
>>>
>>>
>>> I do not understand why this is labeled a convergence issue, unless I
>>> misunderstand what you mean by "die". Non-convergence will result in a
>>> bad ConvergenceReason from the solver, but nothing else; the code will
>>> continue to run.
>>>
>>> This looks like death from a signal. With the very little information in
>>> front of me, this looks like a bug in the MPI on this machine. If it were
>>> doing Sieve stuff, I would put the blame on me. But with PETSc stuff (10+
>>> years old and used by thousands of people), I put the blame on the MPI or
>>> the hardware of this computer.
>>>
>>>   Matt
>>>
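
On Tabrez's PBS question above: PETSc's on_error_attach_debugger option tries
to open the debugger in an xterm, which a non-interactive batch job usually
has no display for, so an interactive session is the easier route. A
hypothetical sketch; the qsub flags vary by site, so check your cluster's
documentation:

  $ qsub -I -l nodes=1:ppn=4       # request an interactive PBS session
  $ pylith --nodes=4 \
      --petsc.on_error_attach_debugger=gdb \
      --petsc.display=$DISPLAY     # X display for the debugger windows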
>>>
>>>>
>>>> ...
>>>> ...
>>>> 87 KSP Residual norm 3.579491816101e-07
>>>> 88 KSP Residual norm 3.241876854223e-07
>>>> 89 KSP Residual norm 2.836307394788e-07
>>>>
>>>> [cli_0]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_1]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_3]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_2]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> mpiexec: Warning: tasks 0-3 exited with status 1.
>>>> --pyre-start: mpiexec: exit 1
>>>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith:
>>>> /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>>>>
>>>> Tabrez
>>>>
>>>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>>>
>>>> > Tabrez-
>>>> >
>>>> > You may want to set ksp_monitor=true so that you can see the
>>>> > residual. If the
>>>> > residual increases significantly, the solution is losing
>>>> > convergence. This
>>>> > can be alleviated a bit by using an absolute convergence tolerance
>>>> > (ksp_atol). You probably need a slightly smaller time step or
>>>> > slightly higher
>>>> > quality mesh (improve the aspect ratio of the most distorted cells).
>>>> >
>>>> > Brad
>>>> >
>>>> >
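
A minimal sketch of wiring up Brad's suggestion, either on the command line or
in pylithapp.cfg (the tolerance below is only a placeholder; pick one suited
to your problem):

  $ pylith --petsc.ksp_monitor=true --petsc.ksp_atol=1.0e-12

  # or, equivalently, in the [pylithapp.petsc] section of pylithapp.cfg:
  [pylithapp.petsc]
  ksp_monitor = true
  ksp_atol = 1.0e-12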
>>>> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>>>> >> Brad
>>>> >>
>>>> >> I think you were right. The elastic problem worked out fine. I will
>>>> >> now try to play with the time step (for the viscous runs).
>>>> >>
>>>> >> Tabrez
>>>> >>
>>>> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>>>> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>>>> >>>> Also I don't see the error until ~9000 time steps with one set of
>>>> >>>> material properties but get the error around the 4000th time step
>>>> >>>> with a different set of material properties (on the same mesh).
>>>> >>>
>>>> >>> This seems to indicate a time-integration stability issue. Does the
>>>> >>> one that has an error after 4000 time steps have a smaller Maxwell
>>>> >>> time? You might try running with purely elastic properties. If that
>>>> >>> works, then you may need to reduce your time step.
>>>> >
>>>> >
>>>>
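
For readers following Brad's reasoning: the Maxwell relaxation time of a
linear viscoelastic material is its viscosity divided by its shear modulus,
so a stiffer or less viscous property set relaxes faster and needs a smaller
time step. With made-up example values:

  tau_M = eta / mu = 1.0e18 Pa·s / 3.0e10 Pa ≈ 3.3e7 s ≈ 1 year

so switching to eta = 1.0e17 Pa·s shrinks the Maxwell time tenfold, and a
time step that was stable for the first property set can be unstable for the
second.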
>>>> _______________________________________________
>>>> CIG-SHORT mailing list
>>>> CIG-SHORT at geodynamics.org
>>>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
>>>>
>>>
>


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener