[CIG-MC] CitcomS crashing when reading tracer files

Dan Bower danb at gps.caltech.edu
Thu Mar 13 11:09:41 PDT 2014


Hi CIG,

In brief, I am using a modified version of CitcomS from svn (r16400).
To demonstrate my problem: when I read in 103345 tracers from a file
(using tracer_ic_method=1), the relevant part of stderr looks like the
following (I added extra debug output), and the model proceeds fine:

--------------------
 Beginning Mapping
Beginning Regtoel submapping
Mapping completed (26.341404 seconds)
tracer setup done
initial_mesh_solver_setup done
initialization time = 36.243746
Sum of Tracers: 103345
before find_tracers(E)
after j,k loop
before parallel_process_sync
after parallel_process_sync
after lost_souls
after free later arrays
after reduce_tracer_arrays
find_tracers(E) complete
--------------------

However, for a tracer file that contains about 10 times as many tracers
(1033449), the model crashes at the parallel_process_sync call:

---------------------------
Beginning Mapping
Beginning Regtoel submapping
Mapping completed (26.581470 seconds)
tracer setup done
initial_mesh_solver_setup done
initialization time = 35.774893
Sum of Tracers: 1033449
before find_tracers(E)
after j,k loop
before parallel_process_sync
[proxy:0:0 at compute-8-209.local] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN &
~POLLOUT & ~POLLHUP)) failed
[proxy:0:0 at compute-8-209.local] main (./pm/pmiserv/pmip.c:387): demux
engine error waiting for event
[mpiexec at compute-8-209.local] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated
badly; aborting
[mpiexec at compute-8-209.local] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error
waiting for completion
[mpiexec at compute-8-209.local] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting
for completion
[mpiexec at compute-8-209.local] main (./ui/mpich/mpiexec.c:548): process
manager error waiting for completion
/home/danb/cig/CitcomS-assim/bin/pycitcoms: /opt/intel/impi/4.0.3.008/bin/mpirun: exit 255
Connection to compute-3-109 closed by remote host.
Connection to compute-3-110 closed by remote host.
Connection to compute-3-111 closed by remote host.
Connection to compute-4-121 closed by remote host.
Connection to compute-4-122 closed by remote host.
Connection to compute-4-140 closed by remote host.
Connection to compute-4-145 closed by remote host.
Connection to compute-7-162 closed by remote host.
Connection to compute-7-166 closed by remote host.
Connection to compute-7-167 closed by remote host.
Connection to compute-7-168 closed by remote host.
Connection to compute-7-169 closed by remote host.
Connection to compute-7-170 closed by remote host.
Connection to compute-7-171 closed by remote host.
Connection to compute-7-179 closed by remote host.
Connection to compute-7-180 closed by remote host.
Connection to compute-7-183 closed by remote host.
Connection to compute-7-184 closed by remote host.
Connection to compute-7-189 closed by remote host.
Connection to compute-7-191 closed by remote host.
Connection to compute-7-192 closed by remote host.
Connection to compute-7-193 closed by remote host.
Connection to compute-7-194 closed by remote host.
Connection to compute-8-195 closed by remote host.
Connection to compute-8-196 closed by remote host.
Connection to compute-8-198 closed by remote host.
Connection to compute-8-201 closed by remote host.
Connection to compute-8-205 closed by remote host.
Connection to compute-8-206 closed by remote host.
Connection to compute-8-207 closed by remote host.
Connection to compute-8-208 closed by remote host.
Killed by signal 15.

--------------------------------

parallel_process_sync() is in Parallel_util.c and is simply a call to
MPI_Barrier:

-----------------------------
void parallel_process_sync(struct All_variables *E)
{

  MPI_Barrier(E->parallel.world);
  return;
}
-----------------------------
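
Since the barrier itself does nothing but wait, it may be that one rank dies
before reaching it, and it would help to know which one. A marker tagged with
the MPI rank could show this. The following is only a minimal sketch:
format_marker and debug_marker are hypothetical helpers, not part of CitcomS,
and the rank is assumed to be obtained elsewhere (for example from
MPI_Comm_rank on E->parallel.world).

```c
#include <stdio.h>

/* Hypothetical helper (not part of CitcomS): write "[rank N] label" into
   buf. Returns the length snprintf reports for the formatted string. */
int format_marker(char *buf, size_t size, int rank, const char *label)
{
    return snprintf(buf, size, "[rank %d] %s", rank, label);
}

/* Hypothetical helper: print the marker to stderr and flush explicitly,
   so the line is written out even if stderr has been set to buffered. */
void debug_marker(int rank, const char *label)
{
    char buf[256];

    format_marker(buf, sizeof buf, rank, label);
    fprintf(stderr, "%s\n", buf);
    fflush(stderr);
}
```

Calling debug_marker(rank, "before parallel_process_sync") on every process
would then print one line per rank, and any rank missing from the output is a
candidate for the process that terminated.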

Any ideas why this crashes when I have a larger input tracer file? I am
using the pyre version of CitcomS, built with the Intel 12 compilers and
Intel MPI 4.0 (intel/impi/4.0).

Any advice greatly appreciated.

Cheers,

Dan