[aspect-devel] A confused "not terminated" problem with multiple nodes.

Shangxin Liu sxliu at vt.edu
Sat Jun 20 19:16:10 PDT 2015


Hi Timo,

Normally, the cluster releases a job as soon as it finishes, which is
usually well before the requested walltime. However, when I run ASPECT on
our cluster with multiple nodes (more than 1), the program does not
terminate when the computation is done: the job keeps occupying its nodes
and is still shown in the "running" state on the cluster. It is only
released once the requested walltime is exceeded. I know I can kill the job
manually, but for large computations I don't know in advance exactly when
the run will finish, so I normally request a generous walltime. Since the
job does not terminate when it finishes, I have to check manually again and
again or wait until the walltime runs out. This makes debugging very
inefficient. What do you think causes this strange "not terminated"
problem?
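
As a stopgap for the repeated manual checks described above, a small
watcher script could poll the record file and ask the scheduler to delete
the job once the run has finished. The following is only a minimal Python
sketch under assumptions: the log path, the job ID, and the completion
marker are hypothetical values you would supply yourself (the marker must
be whatever line your run writes last), and qdel is assumed to be the
delete command, as on PBS systems. It does not address the underlying
cause of the hang.

```python
import subprocess
import time
from pathlib import Path


def run_is_finished(log_text: str, marker: str) -> bool:
    """Return True once the completion marker appears in the log text."""
    return marker in log_text


def watch_and_delete(log_path: str, job_id: str, marker: str,
                     poll_seconds: float = 60.0) -> None:
    """Poll the log file and call qdel on the job once the marker shows up."""
    log = Path(log_path)
    while True:
        if log.exists() and run_is_finished(log.read_text(errors="replace"), marker):
            # Assumption: qdel is the PBS delete command; other schedulers
            # use a different command.
            subprocess.run(["qdel", job_id], check=False)
            return
        time.sleep(poll_seconds)
```

Run from a login node, a call might look like
watch_and_delete("output/log.txt", "12345", "my final log line"), where all
three arguments are placeholders for your own values.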

Best,

Shangxin

On Sat, Jun 20, 2015 at 3:34 AM, Timo Heister <heister at clemson.edu> wrote:

> Hey Shangxin,
>
> what do you mean by "the case cannot be terminated"? Does ASPECT not
> stop when the computation is done? Is it not killed by the job
> scheduler on the cluster (then it is a bug in their scheduler)? What
> happens if you ask the scheduler to kill the job (using qdel on pbs
> systems)?
> You should be able to ssh into one of the nodes and kill ASPECT using
> "kill" manually.
>
>
> On Fri, Jun 19, 2015 at 11:54 AM, Shangxin Liu <sxliu at vt.edu> wrote:
> > Hi everyone,
> >
> > There is a problem that always confuses me. On our machine, each node has
> > 16 processors. When I run ASPECT with 1 node (16 processors) and request
> > a walltime larger than the finish time, the case terminates when it is
> > finished. However, when I run ASPECT with multiple nodes (more than 1),
> > the case does not terminate when it is finished. It is only terminated
> > (killed) when the time exceeds the requested walltime. For example, if I
> > run ASPECT with 3 nodes (48 processors), requesting 3 hours, the case
> > finishes at 10 minutes according to the record file, but it is not
> > terminated at that point. It is terminated (killed) at 10 hours walltime
> > by the machine. I also ran a test with deal.II and found that deal.II
> > doesn't have this multiple-node "not terminated" problem. So I suppose
> > there may be something in the MPI part of ASPECT that is incompatible
> > with our machine. But why can one-node cases terminate while
> > multiple-node cases cannot?
> >
> > Although this problem doesn't influence the results, it makes debugging
> > very slow. Every time I run a multiple-node case I have to wait until
> > the requested walltime. Any suggestions for solving this confusing
> > problem?
> >
> > Best regards,
> >
> > Shangxin
> >
> >
> >
> > _______________________________________________
> > Aspect-devel mailing list
> > Aspect-devel at geodynamics.org
> > http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>
>
>
> --
> Timo Heister
> http://www.math.clemson.edu/~heister/
>