[aspect-devel] A confused "not terminated"problem with multiple nodes.

Sun Jun 21 17:39:18 PDT 2015

Thanks for the detailed suggestions. I'll contact our system
administrators. By the way, there is another error on our cluster, and I'm
not sure whether it is related to this "not terminated" problem. Every time I
run an ASPECT job, the following error appears in the record file:

[mpiexec@br310] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
[mpiexec@br310] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@br310] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
[mpiexec@br310] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion

This error appears in both single-node and multiple-node runs, but it
doesn't prevent the results from being written. Our cluster uses the MVAPICH
MPI module and the mpicc/mpicxx compiler wrappers.

Although I'm not sure what this error means, the fourth message, "process
manager error waiting for completion", makes me worry that it has something
to do with the "not terminated" problem in the multiple-node case. What do
you think of this error?
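
For the check Jonathan suggested (looking for ASPECT in `ps aux` on a
compute node while the job is still shown as "running"), a minimal sketch
could look like the following. The helper name `count_procs` is made up for
illustration, and it assumes the ASPECT executable's name contains "aspect".
It reads `ps aux`-style text from standard input and prints how many lines
mention that name, skipping any line that mentions grep.

```shell
# Hypothetical helper (not part of ASPECT or PBS): count the lines of
# `ps aux`-style input on stdin that contain the name given as $1.
# Lines that mention "grep" are skipped so a search command is not counted.
count_procs() {
    awk -v name="$1" '
        index($0, name) && !index($0, "grep") { n++ }
        END { print n + 0 }   # print 0 rather than an empty line
    '
}
```

On a compute node this would be used as `ps aux | count_procs aspect`. A
result of 0 while the scheduler still lists the job as running would point
to the scheduler rather than to ASPECT, as Jonathan describes below.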

Cheers,

Shangxin

Date: Sat, 20 Jun 2015 19:44:07 -0700
> From: Jonathan Perry-Houts <jperryh2 at uoregon.edu>
> To: aspect-devel at geodynamics.org
> Subject: Re: [aspect-devel] A confused "not terminated"problem with
>         multiple nodes.
> Message-ID: <558624F7.1050309 at uoregon.edu>
> Content-Type: text/plain; charset="utf-8"
>
> Shangxin,
>
> That sounds like a problem with the scheduler, as Timo mentioned before.
> Try sshing in to one of the compute nodes after Aspect has finished, but
> while the job is still "running" and poke around. Does aspect show up in
> `ps aux | grep aspect`? What about `top`? If Aspect is actually done
> running, but the job remains active, then it is a problem with the
> cluster's scheduler. You need to talk to a sys admin about that, I guess.
>
> Cheers,
> Jonathan
>
> On 06/20/2015 07:16 PM, Shangxin Liu wrote:
> > Hi Timo,
> >
> > Normally, a job should be terminated by the cluster when it's finished.
> > Of course, this finish time is shorter than the requested walltime.
> > However, when I run ASPECT on our cluster with multiple nodes (more than
> > 1), the program is not terminated when it's finished, i.e., when the case
> > is finished, it still continues to occupy nodes and is displayed in the
> > "running" state on the cluster. It is only released from the cluster
> > when the time exceeds the requested walltime. I know I can manually kill
> > the job, but for large computations I don't know exactly when the job
> > will finish before submitting the job request, so I normally request a
> > long enough walltime. Since the job is not terminated when finished, I
> > have to check manually again and again or wait until the walltime. This
> > makes debugging very inefficient. So what do you think causes this weird
> > "not terminated" problem?
> >
> > Best,
> >
> > Shangxin
> >
> > On Sat, Jun 20, 2015 at 3:34 AM, Timo Heister <heister at clemson.edu> wrote:
> >
> >     Hey Shangxin,
> >
> >     what do you mean by "the case cannot be terminated"? Does ASPECT not
> >     stop when the computation is done? Is it not killed by the job
> >     scheduler on the cluster (then it is a bug in their scheduler)? What
> >     happens if you ask the scheduler to kill the job (using qdel on pbs
> >     systems)?
> >     You should be able to ssh into one of the nodes and kill ASPECT using
> >     "kill" manually.
> >
> >
> >     On Fri, Jun 19, 2015 at 11:54 AM, Shangxin Liu <sxliu at vt.edu> wrote:
> >     > Hi everyone,
> >     >
> >     > There is a problem that always confuses me. On our machine, each
> >     > node has 16 processors. When I run ASPECT with 1 node (16
> >     > processors) and request a walltime larger than the finish time,
> >     > the job is terminated when it finishes. However, when I run ASPECT
> >     > with multiple nodes (more than 1), the job is not terminated when
> >     > it finishes. It is only terminated (killed) when the time exceeds
> >     > the requested walltime. For example, if I run ASPECT with 3 nodes
> >     > (48 processors), requesting 3 hours, the record file shows that
> >     > the case finishes at 10 minutes, but the job is not terminated at
> >     > that point. It is terminated (killed) at 10 hours walltime by the
> >     > machine. I also ran a test with deal.II and found that deal.II
> >     > doesn't have this multiple-node "not terminated" problem. So I
> >     > suppose there may be something in the MPI part of ASPECT that is
> >     > incompatible with our machine. But why can one-node cases be
> >     > terminated while multiple-node cases cannot?
> >     >
> >     > Although this problem doesn't influence the results, it makes
> >     > debugging very slow. Every time I run a multiple-node case I have
> >     > to wait until the requested walltime. Any suggestions for solving
> >     > this confusing problem?
> >     >
> >     > Best regards,
> >     >
> >     > Shangxin
> >     >
> >
> >
> >
> >     --
> >     Timo Heister
> >     http://www.math.clemson.edu/~heister/
>
