[CIG-SHORT] Pylith dies after thousands of time steps (convergence issue)

Tabrez Ali stali at purdue.edu
Thu Apr 30 12:26:01 PDT 2009


Matt

Here's the file

http://web.ics.purdue.edu/~stali/tmp/out.txt.gz

Thanks
Tabrez

On Apr 30, 2009, at 1:44 PM, Matthew Knepley wrote:

> My gunzip says that file is corrupt.
>
>   Matt
>
> On Thu, Apr 30, 2009 at 10:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
> Matt
>
>> Yes, it fails at the same point on both machines. However, if I use a
>> different material property then it fails much later, but again the
>> time step (at which it fails) is the same on the two machines.
>>
>>
>> Not just same time step. Same iterate. Everything.
>
> I deleted my output file on one machine but will run it again
>
>
>>> 3) Are the error messages identical on the two machines?
>>
>> Yes
>>
>> I need the entire error message. I mean EXACTLY the same. Letter
>> for letter.
>
>
> Here's the exact message. You can see out.txt at http://stali.freeshell.org/out.txt.gz
>
> $ pylith --nodes=4 --petsc.ksp_type=cg > out.txt
>
> [cli_0]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_1]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_3]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> [cli_2]: aborting job:
> Fatal error in MPI_Wait: Error message texts are not available
> mpiexec: Warning: tasks 0-3 exited with status 1.
> --pyre-start: mpiexec: exit 1
> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>
> Tabrez
>
>
>>> On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>>>
>>>> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu>  
>>>> wrote:
>>>> Brad
>>>>
>>>> The solution at the last working step does converge and looks
>>>> okay, but then nothing happens and it dies. I am, however,
>>>> experimenting with the time step and will also try to use the
>>>> debugger.
>>>>
>>>> Btw, do you know if I can use --petsc.on_error_attach_debugger
>>>> when the job is submitted via PBS, or should I just run it
>>>> interactively?
>>>>
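[Editor's note: a sketch of how an interactive run with the debugger
option might look. The queue request, node count, debugger choice, and
display host below are all assumptions, not details from this thread.]

```shell
# Sketch only: resource names and the display host are placeholders.
# on_error_attach_debugger needs a terminal (and an X display if it
# opens debugger xterms), so request an interactive PBS session
# rather than submitting a batch job:
qsub -I -l nodes=4

# Then run PyLith with the PETSc debugger option; --petsc.display
# tells PETSc where to open the debugger windows (replace myhost:0.0
# with your own display):
pylith --nodes=4 \
  --petsc.ksp_type=cg \
  --petsc.on_error_attach_debugger=gdb \
  --petsc.display=myhost:0.0
```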
>>>> I do not understand why this is labeled a convergence issue,
>>>> unless I miss what you mean by "die". Non-convergence will result
>>>> in a bad ConvergenceReason from the solver, but nothing else; the
>>>> code will continue to run.
>>>>
>>>> This looks like death from a signal. With the very little
>>>> information in front of me, this looks like a bug in the MPI on
>>>> this machine. If it were doing Sieve stuff, I would put the blame
>>>> on me. But with PETSc stuff (10+ years old and used by thousands
>>>> of people), I put the blame on MPI or hardware for this computer.
>>>>
>>>>   Matt
>>>>
>>>>
>>>> ...
>>>> ...
>>>> 87 KSP Residual norm 3.579491816101e-07
>>>> 88 KSP Residual norm 3.241876854223e-07
>>>> 89 KSP Residual norm 2.836307394788e-07
>>>>
>>>> [cli_0]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_1]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_3]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_2]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> mpiexec: Warning: tasks 0-3 exited with status 1.
>>>> --pyre-start: mpiexec: exit 1
>>>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>>>>
>>>> Tabrez
>>>>
>>>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>>>
>>>> > Tabrez-
>>>> >
>>>> > You may want to set ksp_monitor=true so that you can see the
>>>> > residual. If the residual increases significantly, the solution
>>>> > is losing convergence. This can be alleviated a bit by using an
>>>> > absolute convergence tolerance (ksp_atol). You probably need a
>>>> > slightly smaller time step or a slightly higher quality mesh
>>>> > (improve the aspect ratio of the most distorted cells).
>>>> >
>>>> > Brad
>>>> >
>>>> >
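[Editor's note: Brad's suggestions above can be expressed in a PyLith
configuration file. This is a minimal sketch; the section name follows
PyLith's Pyre conventions, and the ksp_atol value is a placeholder,
not a recommendation.]

```cfg
# Sketch only: the tolerance value is a placeholder assumption.
[pylithapp.petsc]
ksp_monitor = true
ksp_atol = 1.0e-9
```

The same settings can equivalently be passed on the command line as
--petsc.ksp_monitor=true --petsc.ksp_atol=1.0e-9, matching the
--petsc.ksp_type=cg usage earlier in this thread.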
>>>> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>>>> >> Brad
>>>> >>
>>>> >> I think you were right. The elastic problem worked out fine. I
>>>> >> will now try to play with the time step (for the viscous runs).
>>>> >>
>>>> >> Tabrez
>>>> >>
>>>> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>>>> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>>>> >>>> Also I don't see the error until ~9000 time steps with one
>>>> >>>> set of material properties, but get the error at around the
>>>> >>>> 4000th time step with a different set of material properties
>>>> >>>> (on the same mesh).
>>>> >>>
>>>> >>> This seems to indicate a time-integration stability issue.
>>>> >>> Does the one that has an error after 4000 time steps have a
>>>> >>> smaller Maxwell time? You might try running with purely
>>>> >>> elastic properties. If that works, then you may need to
>>>> >>> reduce your time step.
>>>> >
>>>> >
>>>>
>>>> _______________________________________________
>>>> CIG-SHORT mailing list
>>>> CIG-SHORT at geodynamics.org
>>>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
>
>
>
>
> -- 
> What most experimenters take for granted before they begin their  
> experiments is infinitely more interesting than any results to which  
> their experiments lead.
> -- Norbert Wiener
