[CIG-SHORT] PyLith dies after thousands of time steps (convergence issue)

Matthew Knepley knepley at mcs.anl.gov
Thu Apr 30 06:30:58 PDT 2009


On Thu, Apr 30, 2009 at 8:24 AM, Tabrez Ali <stali at purdue.edu> wrote:

> Matt
> I have tried this on two different machines. One is a cluster with
> mpich2-1.0.5p4/gcc-4.2.1 and the other is an SMP machine with
> mpich2-1.0.8/gcc-4.1.2.
>
> What I haven't tried yet is running it on a single processor (there are
> 0.25 million elements, so it might take a while to run).
>

1) Are you positive that it fails inside KSP? That is the most bulletproof
part of the code.

2) Does it always fail in the same place? If not, I can believe it fails
in KSP due to one of the following (see the sketch after question 3):

  a) Memory corruption somewhere else

  b) Mismatched MPI calls hanging around

  c) Someone holding on to MPI resources somewhere else

3) Are the error messages identical on the two machines?
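
To make 2b and 2c concrete, here is a minimal sketch in plain C/MPI (an
illustration of the pairing rule only, not PyLith or PETSc code): every
nonblocking send or receive must be completed by exactly one wait. Dropping
the wait leaks the request (2c), and waiting on a request that was never
posted is exactly the kind of mismatch that aborts inside MPI_Wait (2b).

    /* Run with an even number of processes, e.g. mpiexec -n 4 ./pairing */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
      int rank, sendbuf, recvbuf = 0;
      MPI_Request reqs[2];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      sendbuf = rank;

      /* Pair up ranks: 0<->1, 2<->3, ... */
      int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

      MPI_Isend(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Irecv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

      /* Both requests started above are completed exactly once here.
         Omitting this call leaks the requests (2c); passing a request
         that was never initialized is the mismatch in 2b. */
      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

      MPI_Finalize();
      return 0;
    }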

  Matt


> Tabrez
>
>
> On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>
> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu> wrote:
>
>> Brad
>>
>> The solution at the last working step does converge and looks okay, but
>> then nothing happens and it dies. I am, however, experimenting with the
>> time_step setting and will also try to use the debugger.
>>
>> By the way, do you know if I can use --petsc.on_error_attach_debugger
>> when the job is submitted via PBS, or should I just run it interactively?
>
>
> I do not understand why this is labeled a convergence issue, unless I
> misunderstand what you mean by "die". Non-convergence will result in a bad
> ConvergedReason from the solver, but nothing else; the code will continue
> to run.
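>
> As a concrete sketch of what I mean (written against the raw PETSc C API
> rather than through the PyLith driver):
>
>     #include <petscksp.h>
>
>     /* After KSPSolve() returns, the caller must ask for the reason
>        explicitly; a diverged solve sets a negative reason but does
>        not kill the process. */
>     PetscErrorCode SolveAndCheck(KSP ksp, Vec b, Vec x)
>     {
>       KSPConvergedReason reason;
>       PetscErrorCode     ierr;
>
>       ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>       ierr = KSPGetConvergedReason(ksp, &reason);CHKERRQ(ierr);
>       if (reason < 0) {
>         /* e.g. KSP_DIVERGED_ITS: a bad ConvergedReason, nothing more;
>            execution continues normally. */
>         ierr = PetscPrintf(PETSC_COMM_WORLD,
>                  "Solve diverged, reason %d\n", (int)reason);CHKERRQ(ierr);
>       }
>       return 0;
>     }
>
> Contrast that with the MPI_Wait abort in your log, which is a hard process
> death, not a solver status.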
>
> This looks like death from a signal. With the very little information in
> front of me, this looks like a bug in the MPI on this machine. If it were
> dying in Sieve code, I would put the blame on me. But this is core PETSc
> code (10+ years old and used by thousands of people), so I put the blame
> on MPI or the hardware on this machine.
>
>   Matt
>
>
>>
>> ...
>> ...
>> 87 KSP Residual norm 3.579491816101e-07
>> 88 KSP Residual norm 3.241876854223e-07
>> 89 KSP Residual norm 2.836307394788e-07
>>
>> [cli_0]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_1]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_3]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> [cli_2]: aborting job:
>> Fatal error in MPI_Wait: Error message texts are not available
>> mpiexec: Warning: tasks 0-3 exited with status 1.
>> --pyre-start: mpiexec: exit 1
>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/
>> scratch96/s/stali/pylith/bin/nemesis: exit 1
>>
>> Tabrez
>>
>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>
>> > Tabrez-
>> >
>> > You may want to set ksp_monitor=true so that you can see the residual.
>> > If the residual increases significantly, the solution is losing
>> > convergence. This can be alleviated a bit by using an absolute
>> > convergence tolerance (ksp_atol). You probably need a slightly smaller
>> > time step or a slightly higher quality mesh (improve the aspect ratio
>> > of the most distorted cells).
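>> >
>> > For example, in a PyLith .cfg file the PETSc settings would look
>> > something like this (ksp_monitor, ksp_converged_reason, and ksp_atol
>> > are standard PETSc options; the tolerance value here is only a
>> > placeholder, so pick one appropriate for your problem):
>> >
>> >   [pylithapp.petsc]
>> >   ksp_monitor = true
>> >   ksp_converged_reason = true
>> >   ksp_atol = 1.0e-9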
>> >
>> > Brad
>> >
>> >
>> > On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>> >> Brad
>> >>
>> >> I think you were right. The elastic problem worked out fine. I will
>> >> now try to play with the time step (for the viscous runs).
>> >>
>> >> Tabrez
>> >>
>> >> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>> >>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>> >>>> Also I don't see the error until ~9000 time steps with one set of
>> >>>> material properties but get the error around the 4000th time step
>> >>>> with a different set of material properties (on the same mesh).
>> >>>
>> >>> This seems to indicate a time-integration stability issue. Does the
>> >>> one that has an error after 4000 time steps have a smaller Maxwell
>> >>> time? You might try running with purely elastic properties. If that
>> >>> works, then you may need to reduce your time step.
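>> >>>
>> >>> (For reference, the Maxwell relaxation time of a linear Maxwell
>> >>> material is, with \eta the viscosity and \mu the shear modulus,
>> >>>
>> >>>     \tau_M = \frac{\eta}{\mu}
>> >>>
>> >>> so a lower-viscosity or stiffer material set relaxes faster, and the
>> >>> time step generally needs to stay a small fraction of the shortest
>> >>> \tau_M in the model.)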
>> >
>> >
>>
>>
>
>
>


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener