[CIG-MC] CitcomS crashing when reading tracer files

Dan Bower danb at gps.caltech.edu
Thu Mar 13 12:16:55 PDT 2014


Hi Eh, Thorsten,

Thanks for the pointers (no pun intended).  I have solved the problem by
increasing 'icushion' (originally set to 100), which is used to size the
memory allocation for the tracer array.  The relevant code snippet is below
for others who are interested:

    /* initially size tracer arrays to number of tracers divided by processors */

    icushion=10000;

    /* for absolute tracer method */
    E->trace.number_of_tracers = number_of_tracers;

    iestimate=number_of_tracers/E->parallel.nproc + icushion;
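
A minimal sketch of a capacity check that would make an undersized icushion
fail loudly instead of overrunning the tracer arrays; the function name and
message are illustrative and not part of CitcomS, and the variables simply
mirror the snippet above:

-----------------------------
/* Sketch only: verify that the number of tracers actually read on this
 * rank fits inside the cushioned estimate, and abort every MPI rank with
 * a clear message if it does not. */
#include <stdio.h>
#include <mpi.h>

void check_tracer_capacity(int tracers_read, int number_of_tracers,
                           int nproc, int icushion, MPI_Comm world)
{
    int iestimate = number_of_tracers / nproc + icushion;

    if (tracers_read > iestimate) {
        fprintf(stderr,
                "tracer array overflow: read %d tracers but only sized for "
                "%d (try a larger icushion)\n", tracers_read, iestimate);
        MPI_Abort(world, 1);  /* stop all ranks instead of hanging at MPI_Barrier */
    }
}
-----------------------------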

Cheers,

Dan


On Thu, Mar 13, 2014 at 12:07 PM, Thorsten Becker <thorstinski at gmail.com> wrote:

> I think the memory problem could be it. At some point, we started to put
> "safe" allocations in, i.e. allocs which complain if NULL is returned and
> then terminate all MPI jobs, but they are probably not everywhere yet.
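
A minimal sketch of what such a "safe" allocation might look like, assuming a
hypothetical wrapper named smalloc (this is not the actual CitcomS routine):

-----------------------------
/* Hypothetical "safe" allocation wrapper: complain if malloc returns NULL
 * and terminate all MPI ranks so none are left waiting at a barrier. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void *smalloc(size_t nbytes, const char *what, MPI_Comm comm)
{
    void *p = malloc(nbytes);
    if (p == NULL) {
        fprintf(stderr, "allocation of %zu bytes for %s failed\n",
                nbytes, what);
        MPI_Abort(comm, 1);
    }
    return p;
}
-----------------------------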
>
> Cheers
>
> T
>
> Thorsten W Becker - http://geodynamics.usc.edu/~becker
>
>
> On Thu, Mar 13, 2014 at 11:45 AM, Dan Bower <danb at gps.caltech.edu> wrote:
>
>> Hi Eh,
>>
>> I've grepped for 'Error', 'error', 'warn', and 'Warn' in the tracer_log
>> and .info. files and nothing shows up.  I assume it's some kind of memory
>> allocation issue, since it is literally the total number of tracers that
>> determines whether the code crashes.
>>
>> Cheers,
>>
>> Dan
>>
>>
>> On Thu, Mar 13, 2014 at 11:29 AM, tan2 <tan2tan2 at gmail.com> wrote:
>>
>>>
>>> Dan,
>>>
>>> The problem does not occur inside parallel_process_sync(). It is likely
>>> that one of the MPI processes crashed while reading the tracer file. The
>>> rest of the MPI processes were unaware of the crash and kept moving
>>> forward until they called parallel_process_sync().
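
A minimal sketch of the failure pattern described here, assuming a hypothetical
reader routine (not the actual CitcomS code): if one rank dies while reading
its tracer file without calling MPI_Abort, the surviving ranks block in
MPI_Barrier until the process manager tears the job down, which matches the
mpiexec output in the quoted message further below.

-----------------------------
/* Illustrative only: a tracer-file reader that fails loudly.  If this rank
 * instead exited (or crashed) without MPI_Abort, the other ranks would sit
 * in MPI_Barrier until mpiexec kills them. */
#include <stdio.h>
#include <mpi.h>

void read_tracer_file(const char *fname, MPI_Comm world)
{
    double x, y, z;
    FILE *fp = fopen(fname, "r");

    if (fp == NULL) {
        fprintf(stderr, "cannot open tracer file %s\n", fname);
        MPI_Abort(world, 1);     /* stop every rank, not just this one */
    }
    while (fscanf(fp, "%lf %lf %lf", &x, &y, &z) == 3) {
        /* ... store the tracer; a failed store should also abort ... */
    }
    fclose(fp);

    MPI_Barrier(world);          /* reached only if the read succeeded */
}
-----------------------------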
>>>
>>> Please check the contents of *.tracer_log.*. One of the files might hold
>>> the real error message from the crash.
>>>
>>> Cheers,
>>>  Eh
>>>
>>>
>>> On Fri, Mar 14, 2014 at 2:09 AM, Dan Bower <danb at gps.caltech.edu> wrote:
>>>
>>>>  Hi CIG,
>>>>
>>>> In brief, I am using a modified version of CitcomS from svn (r16400).
>>>> To demonstrate my problem: when I read in 103345 tracers from a file
>>>> (using tracer_ic_method=1), the relevant part of the stderr looks like
>>>> the following (I added extra debug output) and the model proceeds fine:
>>>>
>>>> --------------------
>>>>  Beginning Mapping
>>>> Beginning Regtoel submapping
>>>> Mapping completed (26.341404 seconds)
>>>> tracer setup done
>>>> initial_mesh_solver_setup done
>>>> initialization time = 36.243746
>>>> Sum of Tracers: 103345
>>>> before find_tracers(E)
>>>> after j,k loop
>>>> before parallel_process_sync
>>>> after parallel_process_sync
>>>> after lost_souls
>>>> after free later arrays
>>>> after reduce_tracer_arrays
>>>> find_tracers(E) complete
>>>> --------------------
>>>>
>>>> However, for a tracer file that contains, say, 10 times more tracers
>>>> (1033449), the model crashes at the parallel_process_sync call:
>>>>
>>>> ---------------------------
>>>> Beginning Mapping
>>>> Beginning Regtoel submapping
>>>> Mapping completed (26.581470 seconds)
>>>> tracer setup done
>>>> initial_mesh_solver_setup done
>>>> initialization time = 35.774893
>>>> Sum of Tracers: 1033449
>>>> before find_tracers(E)
>>>> after j,k loop
>>>> before parallel_process_sync
>>>> [proxy:0:0 at compute-8-209.local] HYDT_dmxu_poll_wait_for_event
>>>> (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN &
>>>> ~POLLOUT & ~POLLHUP)) failed
>>>> [proxy:0:0 at compute-8-209.local] main (./pm/pmiserv/pmip.c:387): demux
>>>> engine error waiting for event
>>>> [mpiexec at compute-8-209.local] HYDT_bscu_wait_for_completion
>>>> (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated
>>>> badly; aborting
>>>> [mpiexec at compute-8-209.local] HYDT_bsci_wait_for_completion
>>>> (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error
>>>> waiting for completion
>>>> [mpiexec at compute-8-209.local] HYD_pmci_wait_for_completion
>>>> (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting
>>>> for completion
>>>> [mpiexec at compute-8-209.local] main (./ui/mpich/mpiexec.c:548): process
>>>> manager error waiting for completion
>>>> /home/danb/cig/CitcomS-assim/bin/pycitcoms: /opt/intel/impi/
>>>> 4.0.3.008/bin/mpirun: exit 255
>>>> Connection to compute-3-109 closed by remote host.^M
>>>> Connection to compute-3-110 closed by remote host.^M
>>>> Connection to compute-3-111 closed by remote host.^M
>>>> Connection to compute-4-121 closed by remote host.^M
>>>> Connection to compute-4-122 closed by remote host.^M
>>>> Connection to compute-4-140 closed by remote host.^M
>>>> Connection to compute-4-145 closed by remote host.^M
>>>> Connection to compute-7-162 closed by remote host.^M
>>>> Connection to compute-7-166 closed by remote host.^M
>>>> Connection to compute-7-167 closed by remote host.^M
>>>> Connection to compute-7-168 closed by remote host.^M
>>>> Connection to compute-7-169 closed by remote host.^M
>>>> Connection to compute-7-170 closed by remote host.^M
>>>> Connection to compute-7-171 closed by remote host.^M
>>>> Connection to compute-7-179 closed by remote host.^M
>>>> Connection to compute-7-180 closed by remote host.^M
>>>> Connection to compute-7-183 closed by remote host.^M
>>>> Connection to compute-7-184 closed by remote host.^M
>>>> Connection to compute-7-189 closed by remote host.^M
>>>> Connection to compute-7-191 closed by remote host.^M
>>>> Connection to compute-7-192 closed by remote host.^M
>>>> Connection to compute-7-193 closed by remote host.^M
>>>> Connection to compute-7-194 closed by remote host.^M
>>>> Connection to compute-8-195 closed by remote host.^M
>>>> Connection to compute-8-196 closed by remote host.^M
>>>> Connection to compute-8-198 closed by remote host.^M
>>>> Connection to compute-8-201 closed by remote host.^M
>>>> Connection to compute-8-205 closed by remote host.^M
>>>> Connection to compute-8-206 closed by remote host.^M
>>>> Connection to compute-8-207 closed by remote host.^M
>>>> Connection to compute-8-208 closed by remote host.^M
>>>> Killed by signal 15.^M
>>>>
>>>> --------------------------------
>>>>
>>>> parallel_process_sync() is in Parallel_util.c and is simply a call to
>>>> MPI_Barrier:
>>>>
>>>> -----------------------------
>>>> void parallel_process_sync(struct All_variables *E)
>>>> {
>>>>
>>>>   MPI_Barrier(E->parallel.world);
>>>>   return;
>>>> }
>>>> -----------------------------
>>>>
>>>> Any ideas why this crashes when I use a larger input tracer file?  I
>>>> am using the pyre version of CitcomS, built with the intel-12 compilers
>>>> and Intel MPI (intel/impi/4.0).
>>>>
>>>> Any advice greatly appreciated.
>>>>
>>>> Cheers,
>>>>
>>>> Dan
>>>>