[CIG-SHORT] PETSc error when running Pylith on a cluster

Matthew Knepley knepley at mcs.anl.gov
Thu Feb 16 13:29:44 PST 2012


On Thu, Feb 16, 2012 at 3:26 PM, Brad Aagaard <baagaard at usgs.gov> wrote:

> Hongfeng,
>
> The stdout file contains the error message:
>

It could be a bug in the Cholesky factorization for symmetric storage (it is
well tested, but who knows for sure). We can try to debug it by first testing
full storage:

[pylithapp.timedependent.formulation]
matrix_type = aij
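
A minimal way to try this without editing step14.cfg, assuming the usual
PyLith convention of layering multiple .cfg files on the command line (as in
the batch example later in this thread): put those two lines in a separate
override file, say aij.cfg (a hypothetical name), and add it to the run:

pylith step14.cfg aij.cfg

If the segfault in MatCholeskyFactorNumeric_SeqSBAIJ goes away with aij, that
points at the symmetric (sbaij) storage path.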

  Thanks,

     Matt


>
> [13]PETSC ERROR:
> ------------------------------------------------------------------------
> [13]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [13]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [13]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
> [13]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [13]PETSC ERROR: likely location of problem given in stack below
> [13]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [15]PETSC ERROR: likely location of problem given in stack below
> [15]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [11]PETSC ERROR: likely location of problem given in stack below
> [11]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [13]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [13]PETSC ERROR:       INSTEAD the line number of the start of the function
> [13]PETSC ERROR:       is given.
> [13]PETSC ERROR: [11]PETSC ERROR: Note: The EXACT line numbers in the
> stack are not available,
> [11]PETSC ERROR:       INSTEAD the line number of the start of the function
> [11]PETSC ERROR:       is given.
> [11]PETSC ERROR: [11]
> MatCholeskyFactorNumeric_SeqSBAIJ_1_NaturalOrdering line 1320
> src/mat/impls/sbaij/seq/sbaijfact.c
> [11]PETSC ERROR: [11] MatCholeskyFactorNumeric line 3043
> src/mat/interface/matrix.c
> [11]PETSC ERROR: [11] PCSetup_ICC line 13 src/ksp/pc/impls/factor/icc/icc.c
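>
> (Since the message above points to the PETSc valgrind FAQ, one way to follow
> that advice is to rerun the same mpinemesis command under valgrind; a minimal
> sketch, assuming valgrind is available on the compute nodes and keeping the
> rest of the command line unchanged:
>
> /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE -np $NPROCS valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.%p.log /home/username/pylith57/bin/mpinemesis [same --pyre-start arguments as above]
>
> Each MPI process then writes a valgrind.<pid>.log listing any memory errors
> it detects.)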
>
> I do not understand what you are describing here:
>  > The 1 node, 8 core jobs were 100% successful (24 passes)
>  > The 2 node, 16 core jobs were 33% successful (8 passes)
>  > The higher node/core count jobs all failed
>
> What does "24 passes" mean? What does 33% successful mean? What do you
> mean by "higher node/core count jobs"? What do you mean by "failed"? Are
> you getting the same error message (i.e., the one I mention above) in
> all cases?
>
> Note: There is a very easy way to submit a PyLith job to the PBS batch
> queue system. Create a batch.cfg file with the lines
>
> [pylithapp]
> scheduler = pbs
>
> [pylithapp.pbs]
> shell = /bin/bash
> qsub-options = -V -m bea -M MYEMAIL_ADDRESS
>
> [pylithapp.launcher]
> command = mpirun -np ${nodes} -machinefile ${PBS_NODEFILE}
>
> and then run PyLith via
>
> pylith batch.cfg step14.cfg --nodes=16 --scheduler.ppn=8
> --job.name=YOURJOBNAME --job.stdout=YOURJOBNAME.log
>
> This will run the job on 16 cores with 8 cores/node = 2 nodes and send
> an email (replace MYEMAIL_ADDRESS with your email address) when the job
> starts, ends, or aborts. PyLith will create the batch shell script and
> submit the job.
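>
> To combine this with the full-storage test suggested at the top of this
> thread, the same layering of .cfg files applies; a hypothetical example,
> reusing the aij.cfg override sketched earlier:
>
> pylith batch.cfg aij.cfg step14.cfg --nodes=16 --scheduler.ppn=8 --job.name=YOURJOBNAME --job.stdout=YOURJOBNAME.log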
>
> Brad
>
> On 02/16/2012 12:40 PM, Hongfeng Yang wrote:
> > Each node on the cluster has 8 cores.
> > We ran 24 tests for each of 5 configurations. (95 total runs)
> > 1 node, 2 nodes, 4 nodes, 6 nodes, 8 nodes.
> >
> > The 1 node, 8 core jobs were 100% successful (24 passes)
> > The 2 node, 16 core jobs were 33% successful (8 passes)
> > The higher node/core count jobs all failed
> >
> > Attached is the stdout file.
> >
> > The full run command is the following:
> >
> > /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE -np
> > $NPROCS /home/username/pylith57/bin/mpinemesis --pyre-start
> >
> /home/username/pylith57/bin:/home/username/pylith57/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith57/lib/python2.6/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/lib-dynload:/home/username/pylith57/lib/python2.6:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/lib/python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old:/home/username/pylith57/lib64/python/site-packages:/home/usern
>
> ame/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages::/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/lib/python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old
> > pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step14.cfg
> > --nodes=$NPROCS --petsc.start_in_debugger --launcher.dry --nodes=$NPROCS
> > --macros.nodes=$NPROCS --macros.job.name= --macros.job.id=8403>&
> > ./$PBS_JOBID.log
> >
> >
> > Thanks,
> >
> > Hongfeng
> >
> > On 02/16/2012 10:46 AM, Brad Aagaard wrote:
> >> Hongfeng,
> >>
> >> Please send everything that was written to stdout. Also please indicate
> >> what NPROCS is (how many processes you are using). It also helps when
> >> you state what command you entered on the command line so that we can
> >> see if we can reproduce what you did.
> >>
> >> The error message you list only indicates that one of the processes
> >> aborted because another process already aborted due to an error. The
> >> message associated with the real error should have been written earlier.
> >>
> >> Brad
> >>
> >>
> >> On 02/16/2012 07:26 AM, Hongfeng Yang wrote:
> >>> Hi All,
> >>>
> >>> The cluster is running CentOS 5.7. Options to build Pylith are
> >>>
> >>>
> >>> $HOME/src57/pylith/pylith-installer-1.6.2-0/configure \
> >>> --enable-python --with-make-threads=2 \
> >>> --with-petsc-options="--download-chaco=1 --download-ml=1
> >>> --download-f-blas-lapack=1 --with-debugging=yes" \
> >>> --prefix=$HOME/pylith
> >>>
> >>>
> >>> However, the following error message appears when running an example on
> >>> the cluster.
> >>>
> >>> [30]PETSC ERROR: Try option -start_in_debugger or
> >>> -on_error_attach_debugger
> >>>
> >>>
> >>> So, we have successfully built debugging into petsc, but it is not
> >>> enabled.
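> >>>
> >>> (One way to confirm that the installed PETSc really was built with
> >>> debugging is to look for PETSC_USE_DEBUG in the generated configuration
> >>> header; a sketch, assuming the installer placed the PETSc headers under
> >>> the install prefix given to configure:
> >>>
> >>> grep PETSC_USE_DEBUG $HOME/pylith/include/petscconf.h
> >>>
> >>> Note that -start_in_debugger is a runtime option passed to PETSc, not
> >>> something enabled at build time.)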
> >>>
> >>> Here is the full run command:
> >>>
> >>> /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE -np
> >>> $NPROCS /home/username/pylith57/bin/mpinemesis --pyre-start
> >>>
> /home/username/pylith57/bin:/home/username/pylith57/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith57/lib/python2.6/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/lib-dynload:/home/username/pylith57/lib/python2.6:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/l
> >>>
> >> ib
> >>>
> /python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages::/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/lib/python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old
> >>> pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step14.cfg
> >>> --nodes=$NPROCS --petsc.start_in_debugger --launcher.dry
> >>> --nodes=$NPROCS --macros.nodes=$NPROCS --macros.job.name=
> >>> --macros.job.id=8403>& ./$PBS_JOBID.log
> >>>
> >>> Here is the full error message which states that we are not in
> >>> debugging mode:
> >>>
> >>> [35]PETSC ERROR: --------------------- Stack Frames
> >>> ------------------------------------
> >>> [30]PETSC ERROR:
> >>>
> ------------------------------------------------------------------------
> >>> [30]PETSC ERROR: Caught signal number 15 Terminate: Some process (or
> >>> the batch system) has told this process to end
> >>> [30]PETSC ERROR: Try option -start_in_debugger or
> >>> -on_error_attach_debugger
> >>> [30]PETSC ERROR: or see
> >>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
> >>> [30]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to
> >>> find memory corruption errors
> >>> [30]PETSC ERROR: likely location of problem given in stack below
> >>> [30]PETSC ERROR: --------------------- Stack Frames
> >>> ------------------------------------
> >>>
> >>>
> >>>
> >>>
> >>> Anyone could help? Thanks!
> >>>
> >>> Hongfeng Yang
> >>>
>
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener