[aspect-devel] A confusing "not terminated" problem with multiple nodes.
sxliu at vt.edu
Fri Jun 19 08:54:00 PDT 2015
There is a problem that always confuses me. On our machine, each node has 16
processors. When I run ASPECT on 1 node (16 processors) and request a
walltime longer than the run time, the job terminates as soon as the case
finishes. However, when I run ASPECT on multiple nodes (more than 1), the
job does not terminate when the case finishes. It can only be terminated (killed)
once the requested walltime is exceeded. For example, if I run ASPECT
on 3 nodes (48 processors) and request 3 hours, the log file shows that the
case finishes after 10 minutes, but the job is not terminated at 10 minutes
when it has finished. Instead, it is terminated (killed) at 3 hours of walltime by the
machine. I also ran a test with deal.II and found that deal.II doesn't have this
multi-node "not terminated" problem. So I suppose something in the MPI part
of ASPECT may be incompatible with our machine. But why can single-node
cases terminate while multi-node cases cannot?
Although this problem doesn't affect the results, it makes debugging
very slow. Every time I run a multi-node case, I have to wait until the
requested walltime runs out. Any suggestions for solving this confusing problem?