[CIG-SHORT] Pylith dies after thousands of time steps (convergence issue)

Tabrez Ali stali at purdue.edu
Thu Apr 30 08:11:00 PDT 2009


Matt

> Yes, it fails at the same point on both machines. However, if I use a
> different material property then it fails much later, but again the
> time step (at which it fails) is the same on the two machines.
>
>
> Not just the same time step. Same iterate. Everything.

I deleted my output file on one machine, but will run it again.


>> 3) Are the error messages identical on the two machines?
>
> Yes
>
> I need the entire error message. I mean EXACTLY the same. Letter for letter.


Here's the exact message. You can see out.txt at http://stali.freeshell.org/out.txt.gz

$ pylith --nodes=4 --petsc.ksp_type=cg > out.txt

[cli_0]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
[cli_1]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
[cli_3]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
[cli_2]: aborting job:
Fatal error in MPI_Wait: Error message texts are not available
mpiexec: Warning: tasks 0-3 exited with status 1.
--pyre-start: mpiexec: exit 1
/usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1

Tabrez


>> On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>>
>>> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
>>> Brad
>>>
>>> The solution at the last working step does converge and looks okay, but
>>> then nothing happens and it dies. I am, however, experimenting with
>>> time_step and will also try to use the debugger.
>>>
>>> Btw, do you know if I can use --petsc.on_error_attach_debugger when the
>>> job is submitted via PBS, or should I just run it interactively?
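
[A side note on the debugger question above: PETSc's on_error_attach_debugger
option normally opens the debugger in an interactive terminal/X display, so
under PBS one common workaround is to request an interactive session first.
The qsub resource string and the choice of gdb below are only assumptions for
this sketch, not something tested in this thread.

$ qsub -I -l nodes=1:ppn=4,walltime=01:00:00
$ pylith --nodes=4 --petsc.ksp_type=cg \
    --petsc.on_error_attach_debugger=gdb --petsc.debugger_nodes=0
]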
>>>
>>> I do not understand why this is labeled a convergence issue, unless I
>>> miss what you mean by "die". Non-convergence will result in a bad
>>> ConvergenceReason from the solver, but nothing else. The code will
>>> continue to run.
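
[For reference, the ConvergenceReason Matt mentions can be printed at the end
of each solve by passing the corresponding PETSc option through PyLith, in the
same way ksp_type is passed above; a sketch of the command line, not taken
from this thread:

$ pylith --nodes=4 --petsc.ksp_type=cg --petsc.ksp_converged_reason=true > out.txt
]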
>>>
>>> This looks like death from a signal. With the very little information
>>> in front of me, this looks like a bug in the MPI on this machine. If it
>>> were doing Sieve stuff, I would put the blame on me. But with PETSc
>>> stuff (10+ years old and used by thousands of people), I put the blame
>>> on MPI or hardware for this computer.
>>>
>>>   Matt
>>>
>>>
>>> ...
>>> ...
>>> 87 KSP Residual norm 3.579491816101e-07
>>> 88 KSP Residual norm 3.241876854223e-07
>>> 89 KSP Residual norm 2.836307394788e-07
>>>
>>> [cli_0]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> [cli_1]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> [cli_3]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> [cli_2]: aborting job:
>>> Fatal error in MPI_Wait: Error message texts are not available
>>> mpiexec: Warning: tasks 0-3 exited with status 1.
>>> --pyre-start: mpiexec: exit 1
>>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>>>
>>> Tabrez
>>>
>>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>>
>>> > Tabrez-
>>> >
>>> > You may want to set ksp_monitor=true so that you can see the
>>> > residual. If the residual increases significantly, the solution is
>>> > losing convergence. This can be alleviated a bit by using an absolute
>>> > convergence tolerance (ksp_atol). You probably need a slightly smaller
>>> > time step or slightly higher quality mesh (improve the aspect ratio of
>>> > the most distorted cells).
>>> >
>>> > Brad
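
[Brad's suggestions correspond to PETSc options that can also live in a
PyLith .cfg file rather than on the command line. A minimal sketch, assuming
PyLith 1.x .cfg conventions; the tolerance value is purely illustrative:

[pylithapp.petsc]
ksp_monitor = true
ksp_atol = 1.0e-12
]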
>>> >
>>> >
>>> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>>> >> Brad
>>> >>
>>> >> I think you were right. The elastic problem worked out fine. I will
>>> >> now try to play with the time step (for the viscous runs).
>>> >>
>>> >> Tabrez
>>> >>
>>> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>>> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>>> >>>> Also I don't see the error until ~9000 time steps with one set of
>>> >>>> material properties but get the error at around the 4000th time step
>>> >>>> with a different set of material properties (on the same mesh).
>>> >>>
>>> >>> This seems to indicate a time-integration stability issue. Does the
>>> >>> one that has an error after 4000 time steps have a smaller Maxwell
>>> >>> time? You might try running with purely elastic properties. If that
>>> >>> works, then you may need to reduce your time step.
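
[The Maxwell time Brad refers to is the viscoelastic relaxation time
tau = eta/mu (viscosity over shear modulus); the stability concern is that
the time step has to stay well below it, so a material with a smaller Maxwell
time forces a smaller dt. A rough back-of-the-envelope check in Python, with
made-up material values:

# Illustrative only: the material values and the dt <= tau/5 rule of thumb
# below are assumptions for this sketch, not numbers from the thread.
year = 365.25 * 24.0 * 3600.0   # seconds per year
eta = 1.0e18                    # viscosity, Pa*s (assumed)
mu = 30.0e9                     # shear modulus, Pa (assumed)
tau = eta / mu                  # Maxwell relaxation time, in seconds
print("Maxwell time:     %.2f yr" % (tau / year))
print("Suggested max dt: %.2f yr" % (tau / (5.0 * year)))
]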
>>> >
>>> >
>>>
>>> _______________________________________________
>>> CIG-SHORT mailing list
>>> CIG-SHORT at geodynamics.org
>>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
