[CIG-MC] CitcomS crashing when reading tracer files
Dan Bower
danb at gps.caltech.edu
Thu Mar 13 11:45:57 PDT 2014
Hi Eh,
I've grepped for 'Error', 'error', 'warn', and 'Warn' in the tracer_log and
.info. files and nothing shows up. I assume it's some kind of memory
allocation issue, since it is literally the total number of tracers that
determines whether the code crashes?
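In case it's useful, below is the kind of allocation guard I'm thinking of
adding around the tracer arrays. It is only a sketch, not CitcomS code:
allocate_tracer_array, ntracers, and ncols are made-up names.
--------------------
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: fail loudly if a tracer-array allocation returns NULL, instead of
   letting a later access segfault silently on one rank. */
double *allocate_tracer_array(size_t ntracers, size_t ncols)
{
    double *buf = malloc(ntracers * ncols * sizeof(double));
    if (buf == NULL) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        fprintf(stderr, "rank %d: malloc failed for %zu tracers\n",
                rank, ntracers);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    return buf;
}
--------------------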
Cheers,
Dan
On Thu, Mar 13, 2014 at 11:29 AM, tan2 <tan2tan2 at gmail.com> wrote:
>
> Dan,
>
> The problem does not occur inside parallel_process_sync(). It is likely
> that one of the MPI processes crashed while reading the tracer file. The
> remaining MPI processes were unaware of the crash and kept moving forward
> until they called parallel_process_sync().
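>
> If the reader is the culprit, one quick check is to make the read loop fail
> loudly on the rank that hits the problem. A minimal sketch only (the file
> handle and the three-column format are assumptions, not the actual CitcomS
> reader):
>
> -----------------------------
> #include <stdio.h>
> #include <mpi.h>
>
> /* Sketch: report a short or malformed tracer file from the rank that
>    encounters it, then abort all ranks instead of leaving the others
>    waiting at the next barrier. */
> static void read_tracer_line(FILE *fp, int rank, long line)
> {
>     double x, y, z;
>     if (fscanf(fp, "%lf %lf %lf", &x, &y, &z) != 3) {
>         fprintf(stderr, "rank %d: bad tracer record at line %ld\n",
>                 rank, line);
>         MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>     /* ... store (x, y, z) in the tracer arrays ... */
> }
> -----------------------------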
>
> Please check the contents of *.tracer_log.*. One of those files might hold
> the real error message from the crash.
>
> Cheers,
> Eh
>
>
> On Fri, Mar 14, 2014 at 2:09 AM, Dan Bower <danb at gps.caltech.edu> wrote:
>
>> Hi CIG,
>>
>> In brief, I am using a modified version of CitcomS from the svn (r16400).
>> To demonstrate the problem: when I read in 103345 tracers from a file
>> (using tracer_ic_method=1), the relevant part of stderr looks like the
>> following (I added extra debug output) and the model proceeds fine:
>>
>> --------------------
>> Beginning Mapping
>> Beginning Regtoel submapping
>> Mapping completed (26.341404 seconds)
>> tracer setup done
>> initial_mesh_solver_setup done
>> initialization time = 36.243746
>> Sum of Tracers: 103345
>> before find_tracers(E)
>> after j,k loop
>> before parallel_process_sync
>> after parallel_process_sync
>> after lost_souls
>> after free later arrays
>> after reduce_tracer_arrays
>> find_tracers(E) complete
>> --------------------
>>
>> However, for a tracer file that contains roughly 10 times more tracers
>> (1033449), the model crashes at the parallel_process_sync call:
>>
>> ---------------------------
>> Beginning Mapping
>> Beginning Regtoel submapping
>> Mapping completed (26.581470 seconds)
>> tracer setup done
>> initial_mesh_solver_setup done
>> initialization time = 35.774893
>> Sum of Tracers: 1033449
>> before find_tracers(E)
>> after j,k loop
>> before parallel_process_sync
>> [proxy:0:0 at compute-8-209.local] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN &
>> ~POLLOUT & ~POLLHUP)) failed
>> [proxy:0:0 at compute-8-209.local] main (./pm/pmiserv/pmip.c:387): demux
>> engine error waiting for event
>> [mpiexec at compute-8-209.local] HYDT_bscu_wait_for_completion
>> (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated
>> badly; aborting
>> [mpiexec at compute-8-209.local] HYDT_bsci_wait_for_completion
>> (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error
>> waiting for completion
>> [mpiexec at compute-8-209.local] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting
>> for completion
>> [mpiexec at compute-8-209.local] main (./ui/mpich/mpiexec.c:548): process
>> manager error waiting for completion
>> /home/danb/cig/CitcomS-assim/bin/pycitcoms: /opt/intel/impi/4.0.3.008/bin/mpirun: exit 255
>> Connection to compute-3-109 closed by remote host.
>> Connection to compute-3-110 closed by remote host.
>> Connection to compute-3-111 closed by remote host.
>> Connection to compute-4-121 closed by remote host.
>> Connection to compute-4-122 closed by remote host.
>> Connection to compute-4-140 closed by remote host.
>> Connection to compute-4-145 closed by remote host.
>> Connection to compute-7-162 closed by remote host.
>> Connection to compute-7-166 closed by remote host.
>> Connection to compute-7-167 closed by remote host.
>> Connection to compute-7-168 closed by remote host.
>> Connection to compute-7-169 closed by remote host.
>> Connection to compute-7-170 closed by remote host.
>> Connection to compute-7-171 closed by remote host.
>> Connection to compute-7-179 closed by remote host.
>> Connection to compute-7-180 closed by remote host.
>> Connection to compute-7-183 closed by remote host.
>> Connection to compute-7-184 closed by remote host.
>> Connection to compute-7-189 closed by remote host.
>> Connection to compute-7-191 closed by remote host.
>> Connection to compute-7-192 closed by remote host.
>> Connection to compute-7-193 closed by remote host.
>> Connection to compute-7-194 closed by remote host.
>> Connection to compute-8-195 closed by remote host.
>> Connection to compute-8-196 closed by remote host.
>> Connection to compute-8-198 closed by remote host.
>> Connection to compute-8-201 closed by remote host.
>> Connection to compute-8-205 closed by remote host.
>> Connection to compute-8-206 closed by remote host.
>> Connection to compute-8-207 closed by remote host.
>> Connection to compute-8-208 closed by remote host.
>> Killed by signal 15.
>>
>> --------------------------------
>>
>> parallel_process_sync() is in Parallel_util.c and is simply a call to
>> MPI_Barrier:
>>
>> -----------------------------
>> void parallel_process_sync(struct All_variables *E)
>> {
>>     MPI_Barrier(E->parallel.world);
>>     return;
>> }
>> -----------------------------
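>>
>> Would something along these lines help catch the failing rank earlier? It is
>> only a sketch, not the CitcomS routine: local_error would have to be set by
>> the tracer-reading code instead of letting that rank crash outright.
>>
>> -----------------------------
>> /* Sketch: reduce a per-rank error flag before the barrier so that, if any
>>    rank reports a failure, all ranks abort together rather than waiting on
>>    a process that has already died. */
>> void checked_process_sync(struct All_variables *E, int local_error)
>> {
>>     int global_error = 0;
>>
>>     MPI_Allreduce(&local_error, &global_error, 1, MPI_INT, MPI_MAX,
>>                   E->parallel.world);
>>     if (global_error)
>>         MPI_Abort(E->parallel.world, global_error);
>>
>>     MPI_Barrier(E->parallel.world);
>>     return;
>> }
>> -----------------------------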
>>
>> Any ideas why this crashes when I have a larger input tracer file? I am
>> using the pyre version of CitcomS, built with the intel-12 compilers and
>> Intel MPI (impi/4.0).
>>
>> Any advice greatly appreciated.
>>
>> Cheers,
>>
>> Dan
>>
>> _______________________________________________
>> CIG-MC mailing list
>> CIG-MC at geodynamics.org
>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>>
>
>