[CIG-SHORT] Pylith dies after thousands of time steps (convergence issue)

Matthew Knepley knepley at mcs.anl.gov
Thu Apr 30 12:33:25 PDT 2009


I am looking at the file:

1) There is no error message here.

2) I cannot tell which time step is being solved. How do you determine that?

3) It appears to die after iterate 89; however, the previous solves all take
    89 iterates, so it could die anywhere between the end of this solve and
    the beginning of the next. This does not appear to die in the solve at
    all.

   Matt

On Thu, Apr 30, 2009 at 2:26 PM, Tabrez Ali <stali at purdue.edu> wrote:

> Matt
> Here's the file
> http://web.ics.purdue.edu/~stali/tmp/out.txt.gz
>
> Thanks
> Tabrez
>
> On Apr 30, 2009, at 1:44 PM, Matthew Knepley wrote:
>
> My gunzip says that file is corrupt.
>
>   Matt
>
> On Thu, Apr 30, 2009 at 10:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
>
>> Matt
>>
>>> Yes, it fails at the same point on both machines. However, if I use a
>>> different material property then it fails much later, but again the time
>>> step (at which it fails) is the same on the two machines.
>>>
>>
>>
>> Not just the same time step. The same iterate. Everything.
>>
>>
>> I deleted my output file on one machine, but will run it again.
>>
>>
>> 3) Are the error messages identical on the two machines?
>>>
>>>
>>> Yes
>>>
>>
>> I need the entire error message. I mean EXACTLY the same, letter for
>> letter.
>>
>>
>>
>> Here's the exact message. You can see out.txt at
>> http://stali.freeshell.org/out.txt.gz
>>
>> $ pylith --nodes=4 --petsc.ksp_type=cg > out.txt
>>
>> [cli_0]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_1]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_3]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_2]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> mpiexec: Warning: tasks 0-3 exited with status 1.
>> --pyre-start: mpiexec: exit 1
>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith:
>> /usr/rmt_share/scratch96/s/stali/pylith/bin/nemesis: exit 1
>>
>> Tabrez
>>
>>
>>  On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>>>>
>>>> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
>>>>
>>>>> Brad
>>>>>
>>>>> The solution at the last working step does converge and looks okay, but
>>>>> then nothing happens and it dies. I am, however, experimenting with
>>>>> time_step and will also try to use the debugger.
>>>>>
>>>>> By the way, do you know if I can use --petsc.on_error_attach_debugger
>>>>> when the job is submitted via PBS, or should I just run it interactively?
>>>>
>>>>
>>>> I do not understand why this is labeled a convergence issue, unless I
>>>> miss what you mean by "die". Non-convergence will result in a bad
>>>> ConvergedReason from the solver, but nothing else; the code will
>>>> continue to run.
>>>>
>>>> This looks like death from a signal. With the very little information in
>>>> front of me, it looks like a bug in the MPI implementation on this
>>>> machine. If it were doing Sieve stuff, I would put the blame on me. But
>>>> with PETSc code (10+ years old and used by thousands of people), I put
>>>> the blame on the MPI or hardware of this computer.
>>>>
>>>>   Matt
>>>>
>>>>
>>>>>
>>>>> ...
>>>>> ...
>>>>> 87 KSP Residual norm 3.579491816101e-07
>>>>> 88 KSP Residual norm 3.241876854223e-07
>>>>> 89 KSP Residual norm 2.836307394788e-07
>>>>>
>>>>> [cli_0]: aborting job:
>>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>>> [cli_1]: aborting job:
>>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>>> [cli_3]: aborting job:
>>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>>> [cli_2]: aborting job:
>>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>>> mpiexec: Warning: tasks 0-3 exited with status 1.
>>>>> --pyre-start: mpiexec: exit 1
>>>>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/
>>>>> scratch96/s/stali/pylith/bin/nemesis: exit 1
>>>>>
>>>>> Tabrez
>>>>>
>>>>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>>>>
>>>>> > Tabrez-
>>>>> >
>>>>> > You may want to set ksp_monitor=true so that you can see the
>>>>> > residual. If the residual increases significantly, the solution is
>>>>> > losing convergence. This can be alleviated a bit by using an absolute
>>>>> > convergence tolerance (ksp_atol). You probably need a slightly smaller
>>>>> > time step or a slightly higher-quality mesh (improve the aspect ratio
>>>>> > of the most distorted cells).
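[Editor's note: Brad's two suggestions can be passed through PyLith's `--petsc.` command-line pass-through, the same mechanism used for `--petsc.ksp_type=cg` earlier in the thread. A hedged sketch; the tolerance value is illustrative, not a recommendation from the thread:]

```shell
# Sketch: enable residual monitoring and an absolute convergence
# tolerance via PyLith's --petsc.* pass-through. These map to PETSc's
# standard -ksp_monitor and -ksp_atol options; the atol value below is
# an illustrative assumption, not a value suggested in the thread.
pylith --nodes=4 \
       --petsc.ksp_type=cg \
       --petsc.ksp_monitor=1 \
       --petsc.ksp_atol=1.0e-10 > out.txt
```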
>>>>> >
>>>>> > Brad
>>>>> >
>>>>> >
>>>>> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>>>>> >> Brad
>>>>> >>
>>>>> >> I think you were right. The elastic problem worked out fine. I will
>>>>> >> now try to play with time step (for the viscous runs)
>>>>> >>
>>>>> >> Tabrez
>>>>> >>
>>>>> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>>>>> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>>>>> >>>> Also, I don't see the error until ~9000 time steps with one set of
>>>>> >>>> material properties, but get the error at around the 4000th time
>>>>> >>>> step with a different set of material properties (on the same
>>>>> >>>> mesh).
>>>>> >>>
>>>>> >>> This seems to indicate a time-integration stability issue. Does the
>>>>> >>> one that has an error after 4000 time steps have a smaller Maxwell
>>>>> >>> time? You might try running with purely elastic properties. If that
>>>>> >>> works, then you may need to reduce your time step.
>>>>> >
>>>>> >
>>>>>
>>>>> _______________________________________________
>>>>> CIG-SHORT mailing list
>>>>> CIG-SHORT at geodynamics.org
>>>>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
>>>>>
>>>>
>>
>
>


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener