[CIG-SHORT] Running Pylith on a cluster

hyang at whoi.edu
Tue Dec 6 11:26:36 PST 2011


Here is an update with more information on our particular problem:

We can get parallel jobs to run on our cluster, but about 50% of the time one
of the mpinemesis processes exits with an error:

PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably
memory access out of range

(the other 50% of the time, the run is successful)
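
We have not yet caught the crash in a debugger; a minimal sketch of what we
plan to try, assuming PyLith forwards settings in the [pylithapp.petsc]
section straight to PETSc as the manual describes (the option names below are
the standard PETSc runtime options):

  # hypothetical debugging additions to a .cfg file passed on the command line
  [pylithapp.petsc]
  # attach gdb to the rank that receives the SEGV
  on_error_attach_debugger = gdb
  # alternatively, start every rank under a debugger without an xterm:
  # start_in_debugger = noxterm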

This is observed when running on multiple compute nodes, using the example
file 3d/hex8/step01.cfg.

The problem never occurs when running on a single machine, but it does occur
on both our Opteron cluster (8 cores per node) and our Xeon cluster (12 cores
per node). We also see the issue when the cores of a given node are not fully
utilized, e.g., using 4 cores on node70 when node70 actually has 12 cores
available.
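
For reference, the launcher setup in our mymachines.cfg (mentioned in the
earlier message quoted below) follows the manual's "running without a batch
system" example. A rough sketch, with hypothetical host names and ranges:

  [pylithapp.launcher]
  # host names are generated by substituting each nodelist entry into nodegen
  nodegen = node%03d
  nodelist = [070-075]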

For debugging purposes, we used the --launcher.dry flag to generate the mpirun
command, and we have been running that command directly:

time mpirun  --hostfile mpirun.nodes -np 24 
/home/username/pylith/bin/mpinemesis --pyre-start
/home/username/pylith/bin:/home/username/pylith/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith/lib/python2.6/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/lib-dynload:/home/username/pylith/lib/python2.6:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages::/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old
pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step01.cfg --nodes=24
--launcher.dry --nodes=24 --macros.nodes=24 --macros.job.name=
--macros.job.id=8403 >& c.log

real    0m8.842s
user    0m0.060s
sys     0m0.050s
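
For completeness, the dry run above was produced with an invocation along
these lines (mymachines.cfg is the launcher configuration described in the
quoted message below):

  cd examples/3d/hex8
  pylith step01.cfg mymachines.cfg --nodes=24 --launcher.dry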

We also see Open MPI warning that it is dangerous to call the fork() system
call, so it is possible that this is related to the failure of half of the
jobs.
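
For reference, that warning can be suppressed (though its underlying cause is
not addressed) with Open MPI's standard mpi_warn_on_fork MCA parameter; a
sketch, reusing the command above:

  mpirun --mca mpi_warn_on_fork 0 --hostfile mpirun.nodes -np 24 \
      /home/username/pylith/bin/mpinemesis ...  # rest of the arguments as above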

The cluster we are using is in good health, and several other users are
continuously running Open MPI/gcc jobs on it, so this does not appear to be a
basic problem with the cluster itself.

Any ideas would be greatly appreciated.

Hongfeng

Quoting hyang at whoi.edu:

> Hi all,
>
> I have successfully built the latest PyLith on a Linux cluster (CentOS 5.5
> with Open MPI). I then followed the instructions in the PyLith manual for
> "running without a batch system" by specifying the nodegen and nodelist
> settings in mymachines.cfg.
>
> However, running pylith examples.cfg mymachines.cfg only launches PyLith on
> the master node, not on the nodes I specified in the mymachines.cfg file.
> Indeed, the output file mpirun.nodes shows that the nodes have been
> recognized by PyLith, but somehow the job was not sent to them. Has anyone
> encountered this problem before, and could you show me how to fix it?
>
> Thanks,
>
> Hongfeng



