== Errors running !PyLith ==
Errors when running !PyLith.
=== Spatialdata ===

 * Error:
{{{
RuntimeError: Error occurred while reading spatial database file 'FILENAME'.
I/O error while reading !SimpleDB data.
}}}
Make sure the ''num-locs'' value in the header matches the number of lines of data and that the last line of data ends with an end-of-line character.
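For reference, a minimal !SimpleDB file has the form sketched below (the values, units, and coordinate system are placeholders, and the exact header fields depend on your spatialdata version); here ''num-locs = 2'' must match the two data lines at the end of the file:
{{{
#SPATIAL.ascii 1
SimpleDB {
  num-values = 2
  value-names = density vs
  value-units = kg/m**3 m/s
  num-locs = 2
  data-dim = 1
  space-dim = 3
  cs-data = cartesian {
    to-meters = 1.0
    space-dim = 3
  }
}
0.0  0.0      0.0  2500.0  3000.0
0.0  0.0  -10.0e3  2800.0  3500.0
}}}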
----

=== Running on a Cluster ===
Issues related to running !PyLith on a cluster or other parallel computer.

==== !OpenMPI and Infiniband ====

 * Segmentation faults when using !OpenMPI with Infiniband
{{{
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
PETSC ERROR: to get more information on the crash.
PETSC ERROR: --------------------- Error Message ------------------------------------
PETSC ERROR: Signal received!
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: Petsc Development HG revision: 78eda070d9530a3e6c403cf54d9873c76e711d49 HG Date: Wed Oct 24 00:04:09 2012 -0400
PETSC ERROR: See docs/changes/index.html for recent updates.
PETSC ERROR: See docs/faq.html for hints about trouble shooting.
PETSC ERROR: See docs/index.html for manual pages.
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: /home/brad/pylith-1.8.0/bin/mpinemesis on a arch-linu named des-compute11.des by brad Tue Nov 13 10:44:06 2012
PETSC ERROR: Libraries linked from /home/brad/pylith-1.8.0/lib
PETSC ERROR: Configure run at Wed Nov 7 16:42:26 2012
PETSC ERROR: Configure options --prefix=/home/brad/pylith-1.8.0 --with-c2html=0 --with-x=0 --with-clanguage=C++ --with-mpicompilers=1 --with-debugging=0 --with-shared-libraries=1 --with-sieve=1 --download-boost=1 --download-chaco=1 --download-ml=1 --download-f-blas-lapack=1 --with-hdf5=1 --with-hdf5-include=/home/brad/pylith-1.8.0/include --with-hdf5-lib=/home/brad/pylith-1.8.0/lib/libhdf5.dylib --LIBS=-lz CPPFLAGS="-I/home/brad/pylith-1.8.0/include " LDFLAGS="-L/home/brad/pylith-1.8.0/lib " CFLAGS="-g -O2" CXXFLAGS="-g -O2 -DMPICH_IGNORE_CXX_SEEK" FCFLAGS="-g -O2" PETSC_DIR=/home/brad/build/pylith_installer/petsc-dev
PETSC ERROR: ------------------------------------------------------------------------
PETSC ERROR: User provided function() line 0 in unknown directory unknown file
}}}
This appears to be associated with how !OpenMPI handles the fork() calls !PyLith makes at startup. Set the following environment variables (they can also be passed on the command line like other !OpenMPI MCA parameters; see the example after this list) to turn off Infiniband fork support so that a normal fork() call is made:
{{{
export OMPI_MCA_mpi_warn_on_fork=0
export OMPI_MCA_btl_openib_want_fork_support=0
}}}
 * Turn on processor and memory affinity by using the ''--bind-to-core'' command-line argument for mpirun (also shown in the example below).
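As a sketch (assuming the ''[pylithapp.launcher]'' command form used in the batch-system examples below), both the MCA parameters and the affinity flag can be added directly to the mpirun command:
{{{
[pylithapp.launcher]
# Disable Infiniband fork support and bind each MPI process to a core.
command = mpirun -np ${nodes} --bind-to-core --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0
}}}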

==== Submitting to batch systems ====

===== PBS/Torque =====
 * pylithapp.cfg:
{{{
[pylithapp]
scheduler = pbs

[pylithapp.pbs]
shell = /bin/bash
qsub-options = -V -m bea -M johndoe@university.edu

[pylithapp.launcher]
command = mpirun -np ${nodes} -machinefile ${PBS_NODEFILE}
}}}
Command line arguments:
{{{
--nodes=NPROCS --scheduler.ppn=N --job.name=NAME --job.stdout=LOG_FILE

# NPROCS = total number of processes
# N = number of processes per compute node
# NAME = name of job in queue
# LOG_FILE = name of file where stdout will be written
}}}
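For instance (a sketch; the parameter file, job name, and process counts are placeholders), a run on 64 processes with 8 processes per compute node could be submitted with:
{{{
pylith step01.cfg --nodes=64 --scheduler.ppn=8 --job.name=step01 --job.stdout=step01.log
}}}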
===== Sun Grid Engine =====
 * pylithapp.cfg:
{{{
[pylithapp]
scheduler = sge

[pylithapp.sge]
shell = /bin/bash
pe-name = orte
qsub-options = -V -m bea -M johndoe@university.edu -j y

[pylithapp.launcher]
command = mpirun -np ${nodes}
# Use the command below if not using the !OpenMPI ORTE parallel environment
#command = mpirun -np ${nodes} -machinefile ${PE_HOSTFILE} -n ${NSLOTS}
}}}
Command line arguments:
{{{
--nodes=NPROCS --job.name=NAME --job.stdout=LOG_FILE

# NPROCS = total number of processes
# NAME = name of job in queue
# LOG_FILE = name of file where stdout will be written
}}}
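For instance (again a sketch with placeholder names), submitting a 64-process job under Sun Grid Engine might look like:
{{{
pylith step01.cfg --nodes=64 --job.name=step01 --job.stdout=step01.log
}}}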
==== HDF5 and parallel I/O ====

The !PyLith HDF5 data writers (!DataWriterHDF5Mesh, etc.) use HDF5 parallel I/O to write files in parallel. As noted in the !PyLith manual, this is not nearly as robust as the HDF5Ext data writers (!DataWriterHDF5ExtMesh, etc.), which write raw binary files using MPI I/O accompanied by an HDF5 metadata file. If jobs running on multiple compute nodes mysteriously hang, with or without HDF5 error messages, switching from the !DataWriterHDF5 data writers to the !DataWriterHDF5Ext data writers may fix the problem (if HDF5 parallel I/O is the source of the problem). This produces one raw binary file per HDF5 dataset, so there are many more files that must be kept together.
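As a sketch of the switch (assuming !PyLith 1.8 component names; the output component path shown here, ''problem.formulation.output.output'', and the filename are placeholders that depend on your simulation setup), the change can be made in pylithapp.cfg:
{{{
# Hypothetical example: use the MPI I/O based writer instead of the
# default HDF5 parallel I/O writer for a domain output.
[pylithapp.problem.formulation.output.output]
writer = pylith.meshio.DataWriterHDF5ExtMesh
writer.filename = output/step01.h5
}}}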