[CIG-SHORT] Pylith dies after thousands of time steps (convergence issue)

Matthew Knepley knepley at mcs.anl.gov
Thu Apr 30 07:28:29 PDT 2009


On Thu, Apr 30, 2009 at 8:42 AM, Tabrez Ali <stali at purdue.edu> wrote:

> Matt
>
> 1) Are you positive that it fails inside KSP? That is the most bulletproof
> part of the code.
>
> 2) Does it always fail in the same place? If not, I can believe it fails in
> KSP due to
>
>   a) Memory corruption somewhere else
>
>   b) Mismatched MPI calls hanging around
>
>   c) Someone holding on to MPI resources somewhere else
>
>
> Yes, it fails at the same point on both machines. However, if I use a
> different material property, it fails much later, but again the time step
> (at which it fails) is the same on the two machines.
>


Not just the same time step. The same iterate. Everything.

>
>
> 3) Are the error messages identical on the two machines?
>
>
> Yes
>

I need the entire error message. And by identical I mean EXACTLY the same, letter for letter.

  Matt


> Tabrez
>
>
> On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>>
>> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
>>
>>> Brad
>>>
>>> The solution at the last working step does converge and looks okay, but
>>> then nothing happens and it dies. I am, however, experimenting with
>>> time_step and will also try to use the debugger.
>>>
>>> By the way, do you know if I can use --petsc.on_error_attach_debugger
>>> when the job is submitted via PBS, or should I just run it interactively?
>>
>>
>> I do not understand why this is labeled a convergence issue, unless I
>> misunderstand what you mean by "die". Non-convergence will result in a bad
>> ConvergedReason from the solver, but nothing else. The code will continue
>> to run.
>>
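>> To make that concrete, here is a minimal sketch of the generic PETSc
>> pattern I mean (not PyLith code; 'ksp', 'b', and 'x' are assumed to be
>> set up already):
>>
>>   #include <petscksp.h>
>>
>>   /* Solve and report divergence without aborting: a failed solve only
>>      yields a negative converged reason; the program keeps running. */
>>   PetscErrorCode SolveAndCheck(KSP ksp, Vec b, Vec x)
>>   {
>>     KSPConvergedReason reason;
>>     PetscErrorCode     ierr;
>>
>>     ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>>     ierr = KSPGetConvergedReason(ksp, &reason);CHKERRQ(ierr);
>>     if (reason < 0) {
>>       ierr = PetscPrintf(PETSC_COMM_WORLD,
>>                          "KSP diverged, reason = %d\n", (int) reason);CHKERRQ(ierr);
>>     }
>>     return 0;
>>   }
>>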
>> This looks like death from a signal. With the very little information in
>> front of me, it looks like a bug in the MPI implementation on this machine.
>> If it were failing in Sieve code, I would put the blame on me. But with
>> core PETSc code (10+ years old and used by thousands of people), I put the
>> blame on the MPI or the hardware of this machine.
>>
>>   Matt
>>
>>
>>>
>>> ...
>>> ...
>>> 87 KSP Residual norm 3.579491816101e-07
>>> 88 KSP Residual norm 3.241876854223e-07
>>> 89 KSP Residual norm 2.836307394788e-07
>>>
>>> [cli_0]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> [cli_1]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> [cli_3]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> [cli_2]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> mpiexec: Warning: tasks 0-3 exited with status 1.
>>> --pyre-start: mpiexec: exit 1
>>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/
>>> scratch96/s/stali/pylith/bin/nemesis: exit 1
>>>
>>> Tabrez
>>>
>>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>>
>>> > Tabrez-
>>> >
>>> > You may want to set ksp_monitor=true so that you can see the residual.
>>> > If the residual increases significantly, the solution is losing
>>> > convergence. This can be alleviated a bit by using an absolute
>>> > convergence tolerance (ksp_atol). You probably need a slightly smaller
>>> > time step or a slightly higher quality mesh (improve the aspect ratio
>>> > of the most distorted cells).
>>> >
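>>> > For reference, ksp_atol (and ksp_rtol) are just the option-based way of
>>> > setting the solver tolerances; underneath they correspond roughly to a
>>> > call like the sketch below (the 1.0e-9 value is only an example, not a
>>> > recommendation):
>>> >
>>> >   #include <petscksp.h>
>>> >
>>> >   PetscErrorCode SetSolverTolerances(KSP ksp)
>>> >   {
>>> >     PetscErrorCode ierr;
>>> >     /* rtol, abstol, dtol, maxits; PETSC_DEFAULT leaves a value unchanged */
>>> >     ierr = KSPSetTolerances(ksp, PETSC_DEFAULT, 1.0e-9, PETSC_DEFAULT,
>>> >                             PETSC_DEFAULT);CHKERRQ(ierr);
>>> >     return 0;
>>> >   }
>>> >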
>>> > Brad
>>> >
>>> >
>>> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>>> >> Brad
>>> >>
>>> >> I think you were right. The elastic problem worked out fine. I will
>>> >> now try to play with the time step (for the viscous runs).
>>> >>
>>> >> Tabrez
>>> >>
>>> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>>> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>>> >>>> Also, I don't see the error until ~9000 time steps with one set of
>>> >>>> material properties, but get the error at around the 4000th time
>>> >>>> step with a different set of material properties (on the same mesh).
>>> >>>
>>> >>> This seems to indicate a time-integration stability issue. Does the
>>> >>> one that has an error after 4000 time steps have a smaller Maxwell
>>> >>> time? You might try running with purely elastic properties. If that
>>> >>> works, then you may need to reduce your time step.
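>>> >>> As a back-of-the-envelope check (the numbers here are made up, not
>>> >>> taken from your model): the Maxwell time is viscosity over shear
>>> >>> modulus, and the time step should stay well below it.
>>> >>>
>>> >>>   #include <stdio.h>
>>> >>>
>>> >>>   int main(void)
>>> >>>   {
>>> >>>     double eta = 1.0e19;    /* viscosity, Pa*s (assumed value) */
>>> >>>     double mu  = 3.0e10;    /* shear modulus, Pa (assumed value) */
>>> >>>     double tau = eta / mu;  /* Maxwell time: ~3.3e8 s */
>>> >>>     double yr  = 3.156e7;   /* seconds per year */
>>> >>>     printf("Maxwell time = %g s = %g yr\n", tau, tau / yr); /* ~10.6 yr */
>>> >>>     printf("keep dt well below this, e.g. %g yr\n", 0.1 * tau / yr);
>>> >>>     return 0;
>>> >>>   }
>>> >>>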
>>> >
>>> >
>>>
>>>
>>
>


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener