Issue145

Title: example input/benchmark/extension.xml hangs forever
Priority: bug
Status: chatting
Superseder:
Nosy List: bill, walter
Assigned To: walter
Topics: Gale

Created on 2008-05-02.21:30:22 by bill, last changed 2008-05-12.20:30:01 by bill.

Messages
msg461 Author: bill Date: 2008-05-12.20:30:01
I did find:
    http://graal.ens-lyon.fr/MUMPS

I applied for a download.
msg460 Author: walter Date: 2008-05-12.19:15:01
"Bill Broadley \"Roundup Issue Tracker\"" <issue_tracker@geodynamics.org> wrote:
> 
> Bill Broadley <bill@cse.ucdavis.edu> added the comment:
> 
> I kind of expected some output from a fast machine (2.2 GHz Opteron) within
> 24 hours or so.
> 
> After 275 hours on a 4 CPU 2.2 GHz opteron I got:
>   mpirun -np 4 /share/apps/gale-1.2.2/bin/Gale `pwd`/input/benchmarks/extension.xml
> TimeStep = 1, Start time = 0 + 0 prev timeStep dt
> TimeStep = 1, Start time = 0 + 0 prev timeStep dt
> TimeStep = 1, Start time = 0 + 0 prev timeStep dt
> TimeStep = 1, Start time = 0 + 0 prev timeStep dt
> 3: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
> 0: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
> 2: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
> 1: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
> TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
> TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
> TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
> TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
> TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt
> TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt
> TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt
> TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt
> 
> Top reports:
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 23215 bill      25   0 91880  23m  13m R  101  0.3  14235:05 Gale
> 23216 bill      25   0 92272  24m  13m R  101  0.3  14235:56 Gale
> 23218 bill      25   0 91928  23m  13m R  101  0.3  14224:13 Gale
> 23214 bill      25   0 92168  24m  13m R   99  0.3  14234:13 Gale
> 
> Is that kind of performance expected?

Iterative solvers for this problem do not work.  You should use a
direct solver.  Basically, just add

  -pc_type lu -ksp_type preonly

to the command line in serial.  In parallel, you need to compile PETSc
with MUMPS and add

  -mat_type aijmumps -ksp_type preonly -pc_type lu

to the command line.  You will get MUCH faster results that way.
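
For reference, a sketch of how the two full command lines might look. The Gale path, input file, and process count are copied from earlier in this thread. Placing the solver options after the input file is an assumption, and the variable names are only illustrative.

```shell
# Paths copied from the thread; adjust for your installation.
GALE=/share/apps/gale-1.2.2/bin/Gale
INPUT=input/benchmarks/extension.xml

# Serial run with a direct LU solve (options quoted from the message above).
# Assumption: the options go after the input file.
SERIAL_CMD="$GALE $INPUT -pc_type lu -ksp_type preonly"

# Parallel run; needs a PETSc build that includes MUMPS.
PARALLEL_CMD="mpirun -np 4 $GALE $INPUT -mat_type aijmumps -ksp_type preonly -pc_type lu"

# Print both commands for review before running either one.
echo "$SERIAL_CMD"
echo "$PARALLEL_CMD"
```

Printing the commands first makes it easy to check the option spelling before committing to a long run.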

> Is the failed to converge message expected?

With an iterative solver, yes.  Try running it with a direct solver
and let me know if you still get the same problem.  It should complete
quickly (a few minutes), and, with the debugging output, you will always
see progress.

Cheers,
Walter Landry
walter@geodynamics.org
msg459 Author: bill Date: 2008-05-12.18:50:01
Walter Landry "Roundup Issue Tracker" wrote:
> Walter Landry <walter@geodynamics.org> added the comment:
> 
> Please uncomment the four lines near the end
> 
> <!--   <param name="journal.info">True</param> -->
> <!--   <param name="journal.debug">True</param> -->
> <!--   <param name="journal-level.info">2</param> -->
> <!--   <param name="journal-level.debug">2</param> -->
> 
> That will let me know exactly where it is getting stuck.  Also, does it work if
> you only run with one processor?
> 

I kind of expected some output from a fast machine (2.2 GHz Opteron) within
24 hours or so.

After 275 hours on a 4 CPU 2.2 GHz opteron I got:
  mpirun -np 4 /share/apps/gale-1.2.2/bin/Gale `pwd`/input/benchmarks/extension.xml
TimeStep = 1, Start time = 0 + 0 prev timeStep dt
TimeStep = 1, Start time = 0 + 0 prev timeStep dt
TimeStep = 1, Start time = 0 + 0 prev timeStep dt
TimeStep = 1, Start time = 0 + 0 prev timeStep dt
3: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
0: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
2: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
1: In func SystemLinearEquations_NonLinearExecute: Failed to converge after 500 iterations.
TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
TimeStep = 2, Start time = 0 + 110.965 prev timeStep dt
TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt
TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt
TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt
TimeStep = 3, Start time = 110.965 + 109.049 prev timeStep dt

Top reports:
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
23215 bill      25   0 91880  23m  13m R  101  0.3  14235:05 Gale
23216 bill      25   0 92272  24m  13m R  101  0.3  14235:56 Gale
23218 bill      25   0 91928  23m  13m R  101  0.3  14224:13 Gale
23214 bill      25   0 92168  24m  13m R   99  0.3  14234:13 Gale

Is that kind of performance expected?  Is the "failed to converge" message expected?
msg452 Author: walter Date: 2008-05-12.17:00:41
Please uncomment the four lines near the end

<!--   <param name="journal.info">True</param> -->
<!--   <param name="journal.debug">True</param> -->
<!--   <param name="journal-level.info">2</param> -->
<!--   <param name="journal-level.debug">2</param> -->

That will let me know exactly where it is getting stuck.  Also, does it work if
you only run with one processor?
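
If editing the file by hand is inconvenient, the four lines could also be uncommented with sed. This is only a sketch: it assumes each commented line has exactly the form quoted above, and `uncomment_journal` is a made-up helper name, not part of Gale.

```shell
# Hypothetical helper: print a copy of a Gale input file with the commented-out
# journal <param> lines uncommented. Other comments are left alone.
uncomment_journal() {
  sed -E 's#<!--[[:space:]]*(<param name="journal[^"]*">[^<]*</param>)[[:space:]]*-->#\1#' "$1"
}

# Example (writes an edited copy and leaves the original untouched):
#   uncomment_journal input/benchmarks/extension.xml > extension-debug.xml
```

Writing to a separate copy keeps the shipped benchmark file unchanged, so the original run can still be reproduced.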
msg449 Author: bill Date: 2008-05-02.21:30:22
petsc-2.3.2-p10 seems to work fine; make test reports:
C/C++ example src/snes/examples/tutorials/ex19 run successfully with 1 MPI process
C/C++ example src/snes/examples/tutorials/ex19 run successfully with 2 MPI processes
Graphics example src/snes/examples/tutorials/ex19 run successfully with 1 MPI process
Fortran example src/snes/examples/tutorials/ex5f run successfully with 1 MPI process
Completed test examples

I configured petsc with:
./config/configure.py --with-cc=/usr/bin/mpicc --with-fc=/usr/bin/mpif77 \
    --prefix=/share/apps/petsc-2.3.2-p10 --with-mpirun=/usr/bin/mpirun


When I compile Gale to use that petsc:
./configure.py --prefix=/share/apps/gale-1.2.2 \
    --with-petsc-dir=/share/apps/petsc-2.3.2-p10/ --with-mpirun=/usr/bin/mpirun

Then I run make and make install.

When I try an example like:
mpirun -np 4 /share/apps/gale-1.2.2/bin/Gale input/benchmarks/extension.xml 

I get:
TimeStep = 1, Start time = 0 + 0 prev timeStep dt
TimeStep = 1, Start time = 0 + 0 prev timeStep dt
TimeStep = 1, Start time = 0 + 0 prev timeStep dt
TimeStep = 1, Start time = 0 + 0 prev timeStep dt

It seems like it never makes progress. If I strace one of the processes, I just get:
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

It's running on a node with 4 Opteron cores (one process per core). Any idea how
long that job would take?

BTW, the example file is not where the installation documentation claims
(input/extension.xml), but in input/benchmarks/extension.xml.

Any ideas?
History
Date                 User    Action  Args
2008-05-12 20:30:01  bill    set     messages: + msg461
2008-05-12 19:15:02  walter  set     messages: + msg460
2008-05-12 18:50:02  bill    set     messages: + msg459
2008-05-12 17:00:42  walter  set     status: unread -> chatting; messages: + msg452
2008-05-02 22:55:56  tan2    set     topic: + Gale; nosy: + walter; assignedto: walter
2008-05-02 21:30:22  bill    create