[CIG-SHORT] Pylith dies after thousands of time steps

Tabrez Ali stali at purdue.edu
Thu Apr 30 08:10:46 PDT 2009


Brad

The model has no faults, just some velocity BCs. And yes, I will run
this on a single CPU to see if it fails.

Thanks
Tabrez

On Apr 30, 2009, at 10:37 AM, Brad Aagaard wrote:

> Tabrez-
>
> The KSP residual indicates that this is not a convergence issue (I
> guessed wrong). Is there anything going on in the BCs, fault
> interfaces, etc., that might be perturbing the residual/solution in
> the time step where it fails, compared to any other time step?
>
> One possible explanation is that one processor is throwing an  
> exception that
> is not caught, resulting in termination of the process on one  
> processor and
> not the other. I think it would be very instructive to run the job  
> on a
> single processor to see if the process dies at the same time step  
> and if
> there is a more verbose error message. It may take a while to run  
> but it may
> save you and us a lot of run-around and speculative searching.
>
> Brad
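
A single-processor run along the lines Brad suggests can be requested
directly from the PyLith command line. A minimal sketch, assuming a
hypothetical parameter file mysim.cfg (the --nodes option sets the
number of MPI processes):

  # Re-run the same simulation on one process to see whether it dies
  # at the same time step and whether a clearer error message appears.
  # mysim.cfg is a placeholder for the actual parameter file.
  pylith mysim.cfg --nodes=1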
>
>
> On Thursday 30 April 2009 6:42:37 am Tabrez Ali wrote:
>> Matt
>>
>>> 1) Are you positive that it fails inside KSP? That is the most
>>> bulletproof part of the code.
>>>
>>> 2) Does it always fail in the same place? If not, I can believe it
>>> fails in KSP due to
>>>
>>>  a) Memory corruption somewhere else
>>>
>>>  b) Mismatched MPI calls hanging around
>>>
>>>  c) Someone holding on to MPI resources somewhere else
>>
>> Yes, it fails at the same point on both machines. However, if I use
>> a different set of material properties it fails much later, but
>> again the time step (at which it fails) is the same on the two
>> machines.
>>
>>> 3) Are the error messages identical on the two machines?
>>
>> Yes
>>
>> Tabrez
>>
>>> On Apr 30, 2009, at 9:15 AM, Matthew Knepley wrote:
>>>> On Thu, Apr 30, 2009 at 8:11 AM, Tabrez Ali <stali at purdue.edu>  
>>>> wrote:
>>>> Brad
>>>>
>>>> The solution at the last working step does converge and looks
>>>> okay, but then nothing happens and it dies. I am, however,
>>>> experimenting with the time step and will also try to use the
>>>> debugger.
>>>>
>>>> Btw, do you know if I can use --petsc.on_error_attach_debugger
>>>> when the job is submitted via PBS, or should I just run it
>>>> interactively?
>>>>
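
On the debugger question, the PETSc debugger options can be passed
through PyLith's --petsc facility; a rough sketch (mysim.cfg is a
placeholder, and the noxterm value comes from the PETSc documentation
rather than from this thread):

  # Interactive run: on an error, attach a debugger in an xterm.
  pylith mysim.cfg --petsc.on_error_attach_debugger

  # Under PBS there is usually no display for an xterm, so the noxterm
  # variant, which runs the debugger in the existing terminal, is the
  # usual workaround; running interactively is often simpler.
  pylith mysim.cfg --petsc.on_error_attach_debugger=noxterm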
>>>> I do not understand why this is labeled a convergence issue,
>>>> unless I am missing what you mean by "die". Non-convergence will
>>>> result in a bad ConvergedReason from the solver, but nothing else.
>>>> The code will continue to run.
>>>>
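
As an aside, PETSc can be asked to report that ConvergedReason after
every solve. A minimal sketch, using the same flag-style --petsc
options that appear elsewhere in this thread (mysim.cfg is a
placeholder):

  # Print why each KSP solve stopped (converged or diverged, and why).
  pylith mysim.cfg --petsc.ksp_converged_reason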
>>>> This looks like death from a signal. With the very little
>>>> information in front of
>>>> me, this looks like a bug in the MPI on this machine. If it was
>>>> doing Sieve stuff,
>>>> I would put the blame on me. But with PETSc stuff (10+ years old
>>>> and used by
>>>> thousands of people), I put the blame on MPI or hardware for this
>>>> computer.
>>>>
>>>>  Matt
>>>>
>>>>
>>>> ...
>>>> ...
>>>> 87 KSP Residual norm 3.579491816101e-07
>>>> 88 KSP Residual norm 3.241876854223e-07
>>>> 89 KSP Residual norm 2.836307394788e-07
>>>>
>>>> [cli_0]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_1]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_3]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> [cli_2]: aborting job:
>>>> Fatal error in MPI_Wait: Error message texts are not available
>>>> mpiexec: Warning: tasks 0-3 exited with status 1.
>>>> --pyre-start: mpiexec: exit 1
>>>> /usr/rmt_share/scratch96/s/stali/pylith/bin/pylith: /usr/rmt_share/
>>>> scratch96/s/stali/pylith/bin/nemesis: exit 1
>>>>
>>>> Tabrez
>>>>
>>>> On Apr 29, 2009, at 4:26 PM, Brad Aagaard wrote:
>>>>> Tabrez-
>>>>>
>>>>> You may want to set ksp_monitor=true so that you can see the
>>>>> residual. If the
>>>>> residual increases significantly, the solution is losing
>>>>> convergence. This
>>>>> can be alleviated a bit by using an absolute convergence tolerance
>>>>> (ksp_atol). You probably need a slightly smaller time step or
>>>>> slightly higher
>>>>> quality mesh (improve the aspect ratio of the most distorted
>>>>> cells).
>>>>>
>>>>> Brad
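
The options Brad mentions map directly onto PETSc options and, as a
sketch, could be given on the pylith command line (mysim.cfg is a
placeholder and the tolerance value is purely illustrative); they can
also go in a [pylithapp.petsc] section of the .cfg file:

  # Monitor the residual at every KSP iteration and add an absolute
  # convergence tolerance (example value only).
  pylith mysim.cfg --petsc.ksp_monitor --petsc.ksp_atol=1.0e-10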
>>>>>
>>>>> On Wednesday 29 April 2009 1:13:21 pm Tabrez Ali wrote:
>>>>>> Brad
>>>>>>
>>>>>> I think you were right. The elastic problem worked out fine. I
>>>>>> will now try to play with the time step (for the viscous runs).
>>>>>>
>>>>>> Tabrez
>>>>>>
>>>>>> On Apr 29, 2009, at 1:19 PM, Brad Aagaard wrote:
>>>>>>> On Wednesday 29 April 2009 10:09:26 am Tabrez Ali wrote:
>>>>>>>> Also, I don't see the error until ~9000 time steps with one
>>>>>>>> set of material properties, but I get the error at around the
>>>>>>>> 4000th time step with a different set of material properties
>>>>>>>> (on the same mesh).
>>>>>>>
>>>>>>> This seems to indicate a time-integration stability issue. Does
>>>>>>> the one that has an error after 4000 time steps have a smaller
>>>>>>> Maxwell time? You might try running with purely elastic
>>>>>>> properties. If that works, then you may need to reduce your
>>>>>>> time step.
>>>>
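
For context on the Maxwell time (illustrative numbers, not from this
thread): for a linear Maxwell material the relaxation time is the
viscosity divided by the shear modulus,

  tau_M = eta / mu    (e.g., 1e18 Pa*s / 3e10 Pa ~ 3e7 s, about 1 yr)

so a material model with lower viscosity or higher shear modulus has a
shorter Maxwell time, and the viscoelastic time step generally needs
to be a fraction of the shortest Maxwell time in the model, which is
consistent with Brad's suggestion to reduce the time step.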
>>>> _______________________________________________
>>>> CIG-SHORT mailing list
>>>> CIG-SHORT at geodynamics.org
>>>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to
>>>> which their experiments lead.
>>>> -- Norbert Wiener
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which
>>> their experiments lead.
>>> -- Norbert Wiener
>
>


