[CIG-SHORT] Running Pylith on a cluster
hyang at whoi.edu
Tue Dec 6 11:26:36 PST 2011
Here is an update with more information on our particular problem:
We can get parallel jobs to run on our cluster, but about 50% of the time one
of the mpinemesis processes exits with the error:
PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably
memory access out of range
(the other 50% of the time the run completes successfully.)
This is observed when running on multiple compute nodes, using the example
file 3d/hex8/step01.cfg.
The problem never occurs when running on a single node, but does occur on both
our Opteron cluster (8 cores per node) and our Xeon cluster (12 cores per
node). We also see the issue when the cores of a given node are not fully
utilized, e.g., using 4 cores on node70 when node70 actually has 12 cores
available.
For debugging purposes, we used the --launcher.dry flag to generate the
mpirun command, and we have been running that command directly:
time mpirun --hostfile mpirun.nodes -np 24
/home/username/pylith/bin/mpinemesis --pyre-start
/home/username/pylith/bin:/home/username/pylith/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith/lib/python2.6/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/lib-dynload:/home/username/pylith/lib/python2.6:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages::/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old
pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step01.cfg --nodes=24
--launcher.dry --nodes=24 --macros.nodes=24 --macros.job.name=
--macros.job.id=8403 >& c.log
real 0m8.842s
user 0m0.060s
sys 0m0.050s
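For reference, the mpirun.nodes file passed via --hostfile is a plain Open MPI
hostfile. A minimal sketch of its format (host names and slot counts here are
hypothetical, chosen to match the 12-core Xeon nodes described above):

```text
# mpirun.nodes -- Open MPI hostfile (hypothetical host names)
node70 slots=12
node71 slots=12
```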
We also see Open MPI warning that using the fork() system call is dangerous,
so it is possible that this warning is related to the failure of half of the
runs.
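(If that warning turns out to be noise while testing, Open MPI exposes it
through the mpi_warn_on_fork MCA parameter; a sketch, assuming a per-user MCA
parameter file -- the same setting can also be passed on the command line as
--mca mpi_warn_on_fork 0:)

```text
# ~/.openmpi/mca-params.conf -- silence the fork() warning during testing
mpi_warn_on_fork = 0
```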
The cluster we are using is in good health, and multiple other users are
continuously running openmpi/gcc jobs on it, so this does not appear to be a
basic issue with the cluster itself.
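For context, the node list in this setup comes from the nodegen/nodelist
settings in mymachines.cfg (see the quoted message below). A sketch of that
launcher section, with hypothetical node numbering and format string -- our
actual values differ:

```text
# mymachines.cfg -- launcher settings (hypothetical node names/range)
[pylithapp.launcher]
nodegen = node%d
nodelist = [70-73]
```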
Any ideas would be greatly appreciated.
Hongfeng
Quoting hyang at whoi.edu:
> Hi all,
>
> I have successfully built the latest PyLith on a Linux cluster (CentOS 5.5
> with Open MPI). I then followed the instructions in the PyLith manual for
> "running without a batch system" by specifying nodegen and nodelist in
> mymachines.cfg.
>
> But running pylith examples.cfg mymachines.cfg only launches pylith on the
> master node, not on the nodes I specified in the mymachines.cfg file.
> Indeed, the output file mpirun.nodes shows that the nodes were recognized by
> PyLith, but somehow the job was not sent to them. Has anyone encountered
> this problem before, and could you please show me how to fix it?
>
> Thanks,
>
> Hongfeng
>
>
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
>
> _______________________________________________
> CIG-SHORT mailing list
> CIG-SHORT at geodynamics.org
> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short
>
>