[CIG-SHORT] Running Pylith on a cluster

Matthew Knepley knepley at mcs.anl.gov
Tue Dec 6 11:36:17 PST 2011


On Tue, Dec 6, 2011 at 1:26 PM, <hyang at whoi.edu> wrote:

> Here is an update with more information on our particular problem:
>
> We can get parallel jobs to run on our cluster, but about 50% of the time,
> one of the mpinemesis processes exits with an error:
>
> PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably
> memory access out of range
>

It would be really helpful to get a stack trace of the crash using
--petsc.start_in_debugger. The error message above should have a process
number (did you strip it off?). You can use
--petsc.debugger_nodes=<proc number> to launch only one gdb.
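
For example, something along these lines should drop the crashing process
into gdb (a sketch; step01.cfg and the node count are taken from your
message, and <proc number> is the rank printed in the error message):

  pylith step01.cfg --nodes=24 --petsc.start_in_debugger \
      --petsc.debugger_nodes=<proc number>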

   Matt


> (the other 50% of the time, the run is successful)
>
> This is observed when running on multiple compute nodes, using the example
> file 3d/hex8/step01.cfg.
>
> The problem never occurs when running on a single computer, but it does
> occur both on our Opteron cluster (8 cores per node) and our Xeon cluster
> (12 cores per node). We also see the issue when the cores of a given node
> are not fully utilized, e.g., using 4 cores on node70 when node70 actually
> has 12 available cores.
> For the purposes of debugging, we used the --launcher.dry flag to generate
> an mpirun command, and we have been running the mpirun command directly:
>
> time mpirun  --hostfile mpirun.nodes -np 24
> /home/username/pylith/bin/mpinemesis --pyre-start
>
> /home/username/pylith/bin:/home/username/pylith/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith/lib/python2.6/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/lib-dynload:/home/username/pylith/lib/python2.6:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages::/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old
> pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step01.cfg --nodes=24
> --launcher.dry --nodes=24 --macros.nodes=24 --macros.job.name=
> --macros.job.id=8403 >& c.log
>
> real    0m8.842s
> user    0m0.060s
> sys     0m0.050s
>
> We also see Open MPI complaining that it is dangerous to use the fork()
> system call, so there is a possibility that this is related to the failure
> of half of the jobs.
>
> The cluster we are using is in good health and has multiple users
> continuously running openmpi/gcc jobs, so it doesn't seem like a basic
> issue with the cluster itself.
>
> Any ideas would be greatly appreciated.
>
> Hongfeng
>
> Quoting hyang at whoi.edu:
>
> > Hi all,
> >
> > I have successfully built the latest PyLith on a Linux cluster (CentOS 5.5
> > with Open MPI). I then followed the instructions in the PyLith manual,
> > "running without a batch system", by specifying nodegen and nodelist info
> > in mymachines.cfg.
> >
> > But running pylith examples.cfg mymachines.cfg only launches PyLith on the
> > master node, not on the nodes that I specified in the mymachines.cfg file.
> > Indeed, the output file mpirun.nodes shows that the nodes have been
> > recognized by PyLith, but somehow it did not send the job to them. Has
> > anyone encountered this problem before, and would you please show me how
> > to fix it?
> >
> > Thanks,
> >
> > Hongfeng



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener