Errors running PyLith
Errors when running PyLith.
Spatialdata
- Error:
RuntimeError: Error occurred while reading spatial database file 'FILENAME'. I/O error while reading SimpleDB data.
Make sure the num-locs value in the header matches the number of lines of data and that the last line of data ends with an end-of-line character.
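The two checks above can be automated. The following is a minimal sketch, not part of PyLith or spatialdata: the function name check_simpledb is hypothetical, and it assumes the header ends at the brace that closes "SimpleDB {", that comment lines start with "//", and that each location occupies exactly one line.

```python
import re


def check_simpledb(text):
    """Return a list of problems found in the text of a SimpleDB file.

    Hypothetical helper; it only checks the two conditions named above.
    """
    match = re.search(r"num-locs\s*=\s*(\d+)", text)
    if match is None:
        return ["no num-locs entry found in header"]
    expected = int(match.group(1))

    # Assumption: the header ends at the brace closing 'SimpleDB {'.
    # Nested blocks such as 'cs-data = cartesian { ... }' are skipped
    # by tracking the brace depth.
    depth = 0
    header_end = None
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                header_end = i + 1
                break
    if header_end is None:
        return ["header braces are not balanced"]

    # Count data lines, ignoring blank lines and '//' comments (assumed).
    data_lines = [
        line
        for line in text[header_end:].splitlines()
        if line.strip() and not line.strip().startswith("//")
    ]

    problems = []
    if len(data_lines) != expected:
        problems.append(
            f"num-locs = {expected}, but found {len(data_lines)} data lines"
        )
    if not text.endswith("\n"):
        problems.append("last line has no end-of-line character")
    return problems
```

To check a file, read it and pass its contents to the function, for example check_simpledb(open("FILENAME").read()). An empty list means neither problem was detected.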
Running on a Cluster
Issues related to running PyLith on a cluster or other parallel computer.
OpenMPI and Infiniband
- Segmentation faults when using OpenMPI with Infiniband
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[14]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
PETSC ERROR: to get more information on the crash.
PETSC ERROR: --------------------- Error Message ------------------------------------
PETSC ERROR: Signal received!
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: Petsc Development HG revision: 78eda070d9530a3e6c403cf54d9873c76e711d49 HG Date: Wed Oct 24 00:04:09 2012 -0400
PETSC ERROR: See docs/changes/index.html for recent updates.
PETSC ERROR: See docs/faq.html for hints about trouble shooting.
PETSC ERROR: See docs/index.html for manual pages.
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: /home/brad/pylith-1.8.0/bin/mpinemesis on a arch-linu named des-compute11.des by brad Tue Nov 13 10:44:06 2012
PETSC ERROR: Libraries linked from /home/brad/pylith-1.8.0/lib
PETSC ERROR: Configure run at Wed Nov 7 16:42:26 2012
PETSC ERROR: Configure options --prefix=/home/brad/pylith-1.8.0 --with-c2html=0 --with-x=0 --with-clanguage=C++ --with-mpicompilers=1 --with-debugging=0 --with-shared-libraries=1 --with-sieve=1 --download-boost=1 --download-chaco=1 --download-ml=1 --download-f-blas-lapack=1 --with-hdf5=1 --with-hdf5-include=/home/brad/pylith-1.8.0/include --with-hdf5-lib=/home/brad/pylith-1.8.0/lib/libhdf5.dylib --LIBS=-lz CPPFLAGS="-I/home/brad/pylith-1.8.0/include " LDFLAGS="-L/home/brad/pylith-1.8.0/lib " CFLAGS="-g -O2" CXXFLAGS="-g -O2 -DMPICH_IGNORE_CXX_SEEK" FCFLAGS="-g -O2" PETSC_DIR=/home/brad/build/pylith_installer/petsc-dev
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: User provided function() line 0 in unknown directory unknown file
This appears to be associated with how OpenMPI handles calls to fork() when PyLith starts up. Turn off Infiniband support for fork so that a normal fork call is made by setting the following environment variables (these parameters can also be set on the command line like other OpenMPI parameters):
export OMPI_MCA_mpi_warn_on_fork=0
export OMPI_MCA_btl_openib_want_fork_support=0
- Turn on processor and memory affinity by using the --bind-to-core command line argument for mpirun.
Submitting to batch systems
PBS/Torque
- pylithapp.cfg:
[pylithapp]
scheduler = pbs

[pylithapp.pbs]
shell = /bin/bash
qsub-options = -V -m bea -M johndoe@university.edu

[pylithapp.launcher]
command = mpirun -np ${nodes} -machinefile ${PBS_NODEFILE}
Command line arguments:
--nodes=NPROCS --scheduler.ppn=N --job.name=NAME --job.stdout=LOG_FILE

# NPROCS = total number of processes
# N = number of processes per compute node
# NAME = name of job in queue
# LOG_FILE = name of file where stdout will be written
Sun Grid Engine
- pylithapp.cfg:
[pylithapp]
scheduler = sge

[pylithapp.pbs]
shell = /bin/bash
pe-name = orte
qsub-options = -V -m bea -M johndoe@university.edu -j y

[pylithapp.launcher]
command = mpirun -np ${nodes}
# Use the options below if not using the OpenMPI ORTE Parallel Environment
#command = mpirun -np ${nodes} -machinefile ${PE_HOSTFILE} -n ${NSLOTS}
Command line arguments:
--nodes=NPROCS --job.name=NAME --job.stdout=LOG_FILE

# NPROCS = total number of processes
# NAME = name of job in queue
# LOG_FILE = name of file where stdout will be written
HDF5 and parallel I/O
The PyLith HDF5 data writers (DataWriterHDF5Mesh, etc.) use HDF5 parallel I/O to write files in parallel. As noted in the PyLith manual, this is not nearly as robust as the HDF5Ext data writers (DataWriterHDF5ExtMesh, etc.), which write raw binary files using MPI I/O, accompanied by an HDF5 metadata file. If jobs running on multiple compute nodes mysteriously hang, with or without HDF5 error messages, switching from the DataWriterHDF5 data writers to the DataWriterHDF5Ext data writers may fix the problem (if HDF5 parallel I/O is its source). The HDF5Ext writers produce one raw binary file per HDF5 dataset, so there are many more files that must be kept together.