[CIG-SHORT] Pylith dies after thousands of time steps (convergence issue)

Tabrez Ali stali at purdue.edu
Thu Apr 30 06:42:37 PDT 2009


Matt

> 1) Are you positive that it fails inside KSP? That is the most bulletproof
> part of the code.
>
> 2) Does it always fail in the same place? If not, I can believe it fails in
> KSP due to
>
>   a) Memory corruption somewhere else
>
>   b) Mismatched MPI calls hanging around
>
>   c) Someone holding on to MPI resources somewhere else

Yes, it fails at the same point on both machines. However, if I use a
different material property, it fails much later, but again the time step
at which it fails is the same on the two machines.

>
>
> 3) Are the error messages identical on the two machines?

Yes

Tabrez


> On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>
>> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
>> Brad
>>
>> The solution at the last working step does converge and looks okay, but
>> then nothing happens and it dies. I am, however, experimenting with
>> time_step and will also try to use the debugger.
>>
>> Btw, do you know if I can use --petsc.on_error_attach_debugger when the
>> job is submitted via PBS, or should I just run it interactively?
>>
>> I do not understand why this is labeled a convergence issue, unless I miss
>> what you mean by "die". Non-convergence will result in a bad
>> ConvergenceReason from the solver, but nothing else. The code will continue
>> to run.
>>
>> This looks like death from a signal. With the very little information in
>> front of me, this looks like a bug in the MPI on this machine. If it was
>> doing Sieve stuff, I would put the blame on me. But with PETSc stuff (10+
>> years old and used by thousands of people), I put the blame on MPI or
>> hardware for this computer.
>>
>>   Matt
>>
>>
>> ...
>> ...
>> 87 KSP Residual norm 3.579491816101e-07
>> 88 KSP Residual norm 3.241876854223e-07
>> 89 KSP Residual norm 2.836307394788e-07
>>
>> [cli_0]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_1]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_3]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_2]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> mpiexec: Warning: tasks 0-3 exited with status 1.
>> --pyre-start: mpiexec: exit 1
>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>>
>> Tabrez
>>
>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>
>> > Tabrez-
>> >
>> > You may want to set ksp_monitor=true so that you can see the residual.
>> > If the residual increases significantly, the solution is losing
>> > convergence. This can be alleviated a bit by using an absolute
>> > convergence tolerance (ksp_atol). You probably need a slightly smaller
>> > time step or a slightly higher quality mesh (improve the aspect ratio of
>> > the most distorted cells).
>> >
>> > Brad
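
Brad's suggestion to watch the residual with ksp_monitor can be automated with a small log filter. The following is a minimal sketch, not part of PyLith or PETSc: the function names and the growth_factor threshold are hypothetical, and it assumes monitor lines in the same format as the "KSP Residual norm" lines in the log quoted in this thread.

```python
import re

# Matches monitor lines such as "87 KSP Residual norm 3.579491816101e-07".
# search() is used so that mail-quote prefixes like ">> " are tolerated.
KSP_LINE = re.compile(r"(\d+)\s+KSP Residual norm\s+([0-9.eE+-]+)")


def parse_ksp_residuals(lines):
    """Return a list of (iteration, residual_norm) pairs found in the lines."""
    residuals = []
    for line in lines:
        match = KSP_LINE.search(line)
        if match:
            residuals.append((int(match.group(1)), float(match.group(2))))
    return residuals


def residual_is_growing(residuals, growth_factor=10.0):
    """Flag a solve whose residual rises well above its smallest value so far.

    growth_factor is an arbitrary illustrative threshold, not a PETSc option.
    """
    smallest = float("inf")
    previous_iteration = -1
    for iteration, norm in residuals:
        if iteration <= previous_iteration:
            # The iteration counter restarted, so a new linear solve began.
            smallest = float("inf")
        if norm > growth_factor * smallest:
            return True
        smallest = min(smallest, norm)
        previous_iteration = iteration
    return False


# The last three monitor lines from the log in this thread.
sample_log = [
    ">> 87 KSP Residual norm 3.579491816101e-07",
    ">> 88 KSP Residual norm 3.241876854223e-07",
    ">> 89 KSP Residual norm 2.836307394788e-07",
]

residuals = parse_ksp_residuals(sample_log)
print(residuals[-1])
print(residual_is_growing(residuals))
```

On the three lines above the residual is still decreasing, so the check does not flag them, which is consistent with the failure not being a loss of convergence.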
>> >
>> >
>> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>> >> Brad
>> >>
>> >> I think you were right. The elastic problem worked out fine. I will
>> >> now try to play with the time step (for the viscous runs).
>> >>
>> >> Tabrez
>> >>
>> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>> >>>> Also, I don't see the error until ~9000 time steps with one set of
>> >>>> material properties, but get the error at around the 4000th time
>> >>>> step with a different set of material properties (on the same mesh).
>> >>>
>> >>> This seems to indicate a time-integration stability issue. Does the
>> >>> one that has an error after 4000 time steps have a smaller Maxwell
>> >>> time? You might try running with purely elastic properties. If that
>> >>> works, then you may need to reduce your time step.
>> >
>> >
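
Brad's question about the Maxwell time can be checked with a quick calculation. For a linear Maxwell viscoelastic material the relaxation time is the viscosity divided by the shear modulus, so a lower viscosity gives a shorter Maxwell time and calls for a smaller time step. The sketch below is only an illustration: the two material property sets, the one-year time step, and the 0.2 safety fraction are hypothetical values, not taken from this thread.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def maxwell_time(viscosity, shear_modulus):
    """Maxwell relaxation time in seconds (viscosity in Pa*s, modulus in Pa)."""
    return viscosity / shear_modulus


def time_step_ok(dt, viscosity, shear_modulus, fraction=0.2):
    """Check dt against a fraction of the Maxwell time.

    fraction is an assumed rule-of-thumb value chosen for illustration.
    """
    return dt <= fraction * maxwell_time(viscosity, shear_modulus)


# Hypothetical material sets: (viscosity in Pa*s, shear modulus in Pa).
materials = {
    "higher_viscosity": (1.0e19, 3.0e10),
    "lower_viscosity": (1.0e18, 3.0e10),
}

dt = 1.0 * SECONDS_PER_YEAR  # assumed time step of one year

for name, (eta, mu) in materials.items():
    tau_years = maxwell_time(eta, mu) / SECONDS_PER_YEAR
    print(name, round(tau_years, 2), time_step_ok(dt, eta, mu))
```

With these assumed numbers the same time step passes the check for the higher-viscosity material and fails it for the lower-viscosity one, which is the kind of difference Brad is asking about between the two runs.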
>>
>> _______________________________________________
>> CIG-SHORT mailing list
>> CIG-SHORT at geodynamics.org
>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
>>
>>
>>
>> -- 
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
