[CIG-SHORT] Running Pylith on a cluster

Brad Aagaard baagaard at usgs.gov
Tue Dec 6 11:46:31 PST 2011


Hongfeng,

My guess is that either your MPI is not configured properly for your 
hardware or your PyLith MPI settings are not appropriate for your 
cluster. The first step is to verify that your MPI is configured 
properly outside of PyLith.

Which MPI are you using, and how is it configured? Did you build MPI with 
the PyLith installer, or are you using an MPI provided by a system 
administrator? Are you able to run other jobs in parallel on your system 
using this MPI? A really simple test is to run the command hostname on a 
bunch of nodes via mpiexec, for example:

mpiexec -n 24 INSERT_OTHER_MPI_PARAMS_HERE hostname
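
If the hostname test works, a slightly stronger check is a minimal MPI 
program compiled and run with the same MPI that PyLith uses. A sketch 
(the file name mpi_hello.c is just a placeholder):

/* mpi_hello.c -- minimal MPI sanity test, run outside of PyLith. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int rank, size, len;
  char name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);                 /* start MPI */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
  MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
  MPI_Get_processor_name(name, &len);     /* node this rank landed on */
  printf("rank %d of %d on %s\n", rank, size, name);
  MPI_Finalize();
  return 0;
}

Build and run it with the wrappers that match your PyLith build:

mpicc mpi_hello.c -o mpi_hello
mpiexec -n 24 INSERT_OTHER_MPI_PARAMS_HERE ./mpi_hello

Every rank should report a node from your node list. If ranks are missing, 
or they all report the head node, the problem is in the MPI setup rather 
than in PyLith.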

Brad


On 12/06/2011 11:26 AM, hyang at whoi.edu wrote:
> Here is an update with more information on our particular problem:
>
> We can get parallel jobs to run on our cluster, but about 50% of the time one
> of the mpinemesis processes exits with an error:
>
> PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably
> memory access out of range
>
> (the other 50% of the time, the run is successful)
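>
> A sketch of what we can try next to localize the crash: PETSc's on-error
> debugger hook, set in the .cfg file the same way the bundled examples set
> other PETSc options (assuming it is forwarded to PETSc like those settings):
>
> [pylithapp.petsc]
> on_error_attach_debugger = true
>
> so the failing rank stops in a debugger instead of dying.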
>
> This is observed when running on multiple compute nodes, using the example
> file 3d/hex8/step01.cfg.
>
> The problem never occurs when running on a single computer, but does occur both
> on our Opteron cluster (8 cores per node) and our Xeon cluster (12 cores per
> node).  We also see the issue if the cores of a given node are not fully
> utilized, e.g., using 4 cores on node70 when node70 actually has 12 available
> cores.
>
> For the purposes of debugging, we used the --launcher.dry flag to generate an
> mpirun command, and we have been running the mpirun command directly:
>
> time mpirun  --hostfile mpirun.nodes -np 24
> /home/username/pylith/bin/mpinemesis --pyre-start
> /home/username/pylith/bin:/home/username/pylith/lib/python2.6/site-packages/pythia-0.8.1.12-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg:/home/username/pylith/lib/python2.6/site-packages/merlin-1.7.egg:/home/username/pylith/lib/python2.6/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/lib-dynload:/home/username/pylith/lib/python2.6:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/lib/python/site-packages::/home/username/pylith/lib/python/site-packages:/home/username/pylith/lib64/python/site-packages:/home/username/pylith/src/pylith/examples/3d/hex8:/home/username/pylith/lib/python26.zip:/home/username/pylith/lib/python2.6/plat-linux2:/home/username/pylith/lib/python2.6/lib-tk:/home/username/pylith/lib/python2.6/lib-old
> pythia mpi:mpistart pylith.apps.PyLithApp:PyLithApp step01.cfg --nodes=24
> --launcher.dry --nodes=24 --macros.nodes=24 --macros.job.name=
> --macros.job.id=8403 >& c.log
>
> real    0m8.842s
> user    0m0.060s
> sys     0m0.050s
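>
> (For anyone reproducing this: the command above is what a dry run along
> the lines of
>
> pylith step01.cfg mymachines.cfg --launcher.dry
>
> prints; --launcher.dry emits the mpirun command line instead of executing
> it, which lets us rerun it by hand.)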
>
> We also see Open MPI complaining that it is dangerous to use the fork() system
> call, so it is possible that this is related to the failure of half of the
> jobs.
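>
> (If useful for triage: assuming Open MPI, the warning itself can be
> silenced, though of course not fixed, with an MCA parameter, e.g.
>
> mpirun --mca mpi_warn_on_fork 0 ...
>
> but silencing it would not address whatever is calling fork().)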
>
> The cluster we are using is in good health and has multiple other users
> continuously running Open MPI/GCC jobs, so it does not seem like a basic
> problem with the cluster itself.
>
> Any ideas would be greatly appreciated.
>
> Hongfeng
>
> Quoting hyang at whoi.edu:
>
>> Hi all,
>>
>> I have successfully built the latest PyLith on a Linux cluster (CentOS 5.5
>> with Open MPI). I then followed the instructions in the PyLith manual,
>> "Running without a batch system", by specifying the nodegen and nodelist
>> settings in mymachines.cfg.
>>
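>> A sketch of the relevant part of our mymachines.cfg (the node name pattern
>> and range here are hypothetical stand-ins for our actual compute nodes):
>>
>> [pylithapp.launcher]
>> nodegen = node%d
>> nodelist = [70-81]
>>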
>> But running pylith examples.cfg mymachines.cfg only launches PyLith on the
>> master node, not on the nodes that I specified in the mymachines.cfg file.
>> Indeed, the output file mpirun.nodes shows that the nodes were recognized by
>> PyLith, but somehow the job was not sent to them. Has anyone encountered this
>> problem before, and could you please show me how to fix it?
>>
>> Thanks,
>>
>> Hongfeng