[CIG-SHORT] PETSc error when running Pylith on a cluster

Brad Aagaard baagaard at usgs.gov
Thu Feb 16 13:26:53 PST 2012


Hongfeng,

The stdout file contains the error message:


[13]PETSC ERROR: ------------------------------------------------------------------------
[13]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[13]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[13]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
[13]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[13]PETSC ERROR: likely location of problem given in stack below
[13]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[15]PETSC ERROR: likely location of problem given in stack below
[15]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[11]PETSC ERROR: likely location of problem given in stack below
[11]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[13]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[13]PETSC ERROR:       INSTEAD the line number of the start of the function
[13]PETSC ERROR:       is given.
[13]PETSC ERROR: [11]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[11]PETSC ERROR:       INSTEAD the line number of the start of the function
[11]PETSC ERROR:       is given.
[11]PETSC ERROR: [11] MatCholeskyFactorNumeric_SeqSBAIJ_1_NaturalOrdering line 1320 src/mat/impls/sbaij/seq/sbaijfact.c
[11]PETSC ERROR: [11] MatCholeskyFactorNumeric line 3043 src/mat/interface/matrix.c
[11]PETSC ERROR: [11] PCSetup_ICC line 13 src/ksp/pc/impls/factor/icc/icc.c
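
The stack frames point to the numeric incomplete Cholesky factorization done during setup of the ICC preconditioner, and the signal 11 message indicates the crash is probably an out-of-range memory access. As the error message itself suggests, one way to track this down is to run under valgrind. A rough sketch of how you might do that with your existing OpenMPI command (valgrind must be installed on the compute nodes; the mpinemesis arguments are the same ones you already pass, so treat them as a placeholder here):

/usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE -np $NPROCS \
    valgrind --tool=memcheck --log-file=valgrind.%p.log \
    /home/username/pylith57/bin/mpinemesis --pyre-start [same arguments as in your run command]

Each MPI process then writes its own valgrind.<pid>.log. Python itself produces some noise, so look for "Invalid read" or "Invalid write" reports near the MatCholeskyFactorNumeric frames.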

I do not understand what you are describing here:
 > The 1 node, 8 core jobs were 100% successful (24 passes)
 > The 2 node, 16 core jobs were 33% successful (8 passes)
 > The higher node/core count jobs all failed

What does "24 passes" mean? What does 33% successful mean? What do you 
mean by "higher node/core count jobs"? What do you mean by "failed"? Are 
you getting the same error message (i.e., the one I mention above) in 
all cases?

Note: There is a very easy way to submit a PyLith job to the PBS batch 
queue system. Create a batch.cfg file with the lines

[pylithapp]
scheduler = pbs

[pylithapp.pbs]
shell = /bin/bash
qsub-options = -V -m bea -M MYEMAIL_ADDRESS

[pylithapp.launcher]
command = mpirun -np ${nodes} -machinefile ${PBS_NODEFILE}

and then run PyLith via

pylith batch.cfg step14.cfg --nodes=16 --scheduler.ppn=8 
--job.name=YOURJOBNAME --job.stdout=YOURJOBNAME.log

This will run the job on 16 cores with 8 cores/node = 2 nodes and send 
an email (replace MYEMAIL_ADDRESS with your email address) when the job 
starts, ends, or aborts. PyLith will create the batch shell script and 
submit the job.
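
For reference, here is roughly what those settings do when the job runs (a sketch of the effect, not the exact batch script PyLith writes). The qsub-options become PBS directives: -V exports your environment, -m bea requests mail at begin/end/abort, and -M sets the address. In the launcher command, ${nodes} is replaced by the total process count and ${PBS_NODEFILE} is the machine file PBS provides, so the example above effectively runs

mpirun -np 16 -machinefile $PBS_NODEFILE [mpinemesis plus the arguments PyLith generates]

This keeps the mpirun process count and machine file consistent with what PBS actually allocates, rather than constructing that command by hand.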

Brad

On 02/16/2012 12:40 PM, Hongfeng Yang wrote:
> Each node on the cluster has 8 cores.
> We ran 24 tests for each of 5 configurations. (95 total runs)
> 1 node, 2 nodes, 4 nodes, 6 nodes, 8 nodes.
>
> The 1 node, 8 core jobs were 100% successful (24 passes)
> The 2 node, 16 core jobs were 33% successful (8 passes)
> The higher node/core count jobs all failed
>
> Attached is the stdout file.
>
> The full run command is the following:
>
> /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE -np
> $NPROCS /home/username/pylith57/bin/mpinemesis --pyre-start
> /home/username/pylith57/bin:/home/username/pylith57/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith57/lib/python2.6/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/lib-dynload:/home/username/pylith57/lib/python2.6:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/lib/python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages::/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/lib/python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old
> pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step14.cfg
> --nodes=$NPROCS --petsc.start_in_debugger --launcher.dry --nodes=$NPROCS
> --macros.nodes=$NPROCS --macros.job.name= --macros.job.id=8403>&
> ./$PBS_JOBID.log
>
>
> Thanks,
>
> Hongfeng
>
> On 02/16/2012 10:46 AM, Brad Aagaard wrote:
>> Hongfeng,
>>
>> Please send everything that was written to stdout. Also, please indicate
>> what NPROCS is (how many processes you are using). It also helps if you
>> state the command you entered on the command line so that we can see
>> whether we can reproduce what you did.
>>
>> The error message you list only indicates that one of the processes
>> aborted because another process already aborted due to an error. The
>> message associated with the real error should have been written earlier.
>>
>> Brad
>>
>>
>> On 02/16/2012 07:26 AM, Hongfeng Yang wrote:
>>> Hi All,
>>>
>>> The cluster is running CentOS 5.7. Options to build Pylith are
>>>
>>>
>>> $HOME/src57/pylith/pylith-installer-1.6.2-0/configure \
>>> --enable-python --with-make-threads=2 \
>>> --with-petsc-options="--download-chaco=1 --download-ml=1
>>> --download-f-blas-lapack=1 --with-debugging=yes" \
>>> --prefix=$HOME/pylith
>>>
>>>
>>> However, the following error message appears when running an example on
>>> the cluster.
>>>
>>> [30]PETSC ERROR: Try option -start_in_debugger or
>>> -on_error_attach_debugger
>>>
>>>
>>> So, we have successfully built debugging into PETSc, but it is not
>>> enabled.
>>>
>>> Here is the full run command:
>>>
>>> /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --hostfile $PBS_NODEFILE -np
>>> $NPROCS /home/username/pylith57/bin/mpinemesis --pyre-start
>>> /home/username/pylith57/bin:/home/username/pylith57/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith57/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith57/lib/python2.6/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/lib-dynload:/home/username/pylith57/lib/python2.6:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/lib/python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/lib/python/site-packages::/home/username/pylith57/lib/python/site-packages:/home/username/pylith57/lib64/python/site-packages:/home/username/pylith57/src/pylith/examples/3d/hex8:/home/username/pylith57/lib/python26.zip:/home/username/pylith57/lib/python2.6/plat-linux2:/home/username/pylith57/lib/python2.6/lib-tk:/home/username/pylith57/lib/python2.6/lib-old
>>> pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step14.cfg
>>> --nodes=$NPROCS --petsc.start_in_debugger --launcher.dry
>>> --nodes=$NPROCS --macros.nodes=$NPROCS --macros.job.name=
>>> --macros.job.id=8403>& ./$PBS_JOBID.log
>>>
>>> Here is the full error message which states that we are not in
>>> debugging mode:
>>>
>>> [35]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> [30]PETSC ERROR: ------------------------------------------------------------------------
>>> [30]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the batch system) has told this process to end
>>> [30]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [30]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
>>> [30]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [30]PETSC ERROR: likely location of problem given in stack below
>>> [30]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>
>>>
>>>
>>>
>>> Could anyone help? Thanks!
>>>
>>> Hongfeng Yang
>>>


