On Tue, Dec 6, 2011 at 1:26 PM,  <span dir="ltr">&lt;<a href="mailto:hyang@whoi.edu">hyang@whoi.edu</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Here is an update with more information on our particular problem:<br>

<br>

We can get parallel jobs to run on our cluster, but about 50% of the time, one<br>

of the mpinemesis processes exits with an error:<br>

<br>

PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably<br>

memory access out of range<br></blockquote><div><br></div><div>It would be really helpful to get a stack trace of the crash using --petsc.start_in_debugger. The</div><div>error message above should have a process number (you stripped it off?). You can use</div>

<div>--petsc.debugger_nodes=&lt;proc number&gt; to only launch 1 gdb.</div><div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


(the other 50% of the time, the run is successful)<br>

<br>

This is observed when running on multiple compute nodes, using the example<br>

file: 3d/hex8/step01.cfg<br>

<br>

The problem never occurs when running on a single computer, but does occur both<br>

on our Opteron cluster (8 cores per node) and our Xeon cluster (12 cores per<br>

node).  We also see the issue occurring if the cores of a given node are not<br>

fully utilized, e.g., using 4 cores on node70, when node70 actually has 12<br>

available cores.<br>

<br>

For the purposes of debugging, we used the --launcher.dry flag to generate an<br>

mpirun command, and we have been running the mpirun command directly:<br>

<br>

time mpirun  --hostfile mpirun.nodes -np 24<br>

/home/username/pylith/bin/mpinemesis --pyre-start<br>

/home/username/pylith/bin:/home/username/pylith/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith/lib/python2.6/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/lib-dynload:/home/username/pylith/lib/python2.6:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages::/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old<br>


pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step01.cfg --nodes=24<br>

--launcher.dry --nodes=24 --macros.nodes=24 --macros.job.name=<br>

--macros.job.id=8403 &gt;&amp; c.log<br>

<br>

real    0m8.842s<br>

user    0m0.060s<br>

sys     0m0.050s<br>

<br>

We also see Open MPI complaining that it is dangerous to use the fork() system<br>

call, so there is a possibility that this is related to the failure of half of<br>

the jobs.<br>

<br>

The cluster we are using is in good health, and has multiple users also<br>

continuously running openmpi/gcc jobs, so it doesn&#39;t seem like a basic issue<br>

with the cluster itself.<br>

<br>

Any ideas would be greatly appreciated.<br>

<br>

Hongfeng<br>

<br>

Quoting <a href="mailto:hyang@whoi.edu">hyang@whoi.edu</a>:<br>

<br>

&gt; Hi all,<br>

&gt;<br>

&gt; I have successfully built the latest PyLith on a Linux cluster (CentOS 5.5<br>

&gt; with<br>

&gt; openmpi). Then I followed the instructions in the PyLith manual &quot;running<br>

&gt; without<br>

&gt; a batch system&quot; by specifying nodegen and nodelist info in mymachines.cfg.<br>

&gt;<br>

&gt; But running pylith examples.cfg mymachines.cfg only launches PyLith on the<br>

&gt; master<br>

&gt; node, not the nodes that I specified in the mymachines.cfg file. Indeed an<br>

&gt; output file mpirun.nodes shows that the nodes have been recognized by PyLith<br>

&gt; but somehow it did not send the job to them. Has anyone encountered this<br>

&gt; problem before, and would you please show me how to fix it?<br>

&gt;<br>

&gt; Thanks,<br>

&gt;<br>

&gt; Hongfeng<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; ----------------------------------------------------------------<br>

&gt; This message was sent using IMP, the Internet Messaging Program.<br>

&gt;<br>

&gt; _______________________________________________<br>

&gt; CIG-SHORT mailing list<br>

&gt; <a href="mailto:CIG-SHORT@geodynamics.org">CIG-SHORT@geodynamics.org</a><br>

&gt; <a href="http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short" target="_blank">http://geodynamics.org/cgi-bin/mailman/listinfo/cig-short</a><br>

&gt;<br>

&gt;<br>


</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>
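<div><br></div><div>For recovering the process number mentioned above, here is a minimal sketch (not part of PyLith or PETSc). It assumes the unstripped log has the usual PETSc form, where each error line is prefixed with the MPI rank in square brackets, e.g. "[3]PETSC ERROR: ...". The helper names crashed_ranks and debugger_flags and the sample text are hypothetical; the two option names are the ones given earlier in this message.</div>

```python
import re

# Assumption: PETSc error lines look like "[<rank>]PETSC ERROR: ...".
# Lines whose rank prefix was stripped will not match.
RANK_PREFIX = re.compile(r"^\s*\[(\d+)\]\s*PETSC ERROR:", re.MULTILINE)


def crashed_ranks(log_text):
    """Return the sorted, de-duplicated MPI ranks that reported a PETSc error."""
    return sorted({int(m.group(1)) for m in RANK_PREFIX.finditer(log_text)})


def debugger_flags(rank):
    """Build the two debugger options for a single rank, so only one gdb starts."""
    return [
        "--petsc.start_in_debugger",
        "--petsc.debugger_nodes=%d" % rank,
    ]


# Hypothetical sample in the assumed format; in practice read c.log instead.
sample_log = (
    "[3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably\n"
    "[3]PETSC ERROR: memory access out of range\n"
    "some unrelated output line\n"
)

ranks = crashed_ranks(sample_log)
if ranks:
    print(" ".join(debugger_flags(ranks[0])))
```

<div>To apply it to a real run, replace sample_log with the contents of c.log, e.g. open("c.log").read(), and append the printed options to the pylith command line.</div>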