[aspect-devel] Aspect-devel Digest, Vol 43, Issue 11

Shangxin Liu sxliu at vt.edu
Sun Jun 21 16:56:03 PDT 2015


Thanks for the detailed suggestions. I'll contact our system
administrators. By the way, there is another error on our cluster that I'm
not sure is related to this "not terminated" problem. Every time I run an
ASPECT job, the following error appears in the output file:

[mpiexec at br310] HYDT_bscd_pbs_wait_for_completion
(./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with
TM error 17002
[mpiexec at br310] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at br310] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for
completion
[mpiexec at br310] main (./ui/mpich/mpiexec.c:325): process manager error
waiting for completion

This error appears in both the single-node and multi-node cases, but it
doesn't prevent the results from being written. Our cluster uses an
MVAPICH MPI module with the mpicc/mpicxx compiler wrappers.

Although I'm not sure what this error means, the fourth message ("process
manager error waiting for completion") makes me worry that it has
something to do with the "not terminated" problem in the multi-node case.
What do you think of this error?

Cheers,

Shangxin

On Sun, Jun 21, 2015 at 3:00 PM, <aspect-devel-request at geodynamics.org>
wrote:

>
> Today's Topics:
>
>    1. Re: A confused "not terminated" problem with multiple nodes.
>       (Shangxin Liu)
>    2. Re: A confused "not terminated" problem with multiple nodes.
>       (Jonathan Perry-Houts)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 20 Jun 2015 22:16:10 -0400
> From: Shangxin Liu <sxliu at vt.edu>
> To: Timo Heister <heister at clemson.edu>
> Cc: aspect-devel <aspect-devel at geodynamics.org>
> Subject: Re: [aspect-devel] A confused "not terminated" problem with multiple nodes.
>
> Hi Timo,
>
> Normally, a job is terminated by the cluster when it finishes; this
> finish time is, of course, shorter than the requested walltime. However,
> when I run ASPECT on our cluster with multiple nodes (more than one),
> the program is not terminated when it finishes: the case completes, but
> the job still occupies its nodes and shows as "running" on the cluster.
> It is only released when the requested walltime expires. I know I can
> kill the job manually, but for large computations I don't know in
> advance how long they will take, so I normally request a generously long
> walltime. Since the job is not terminated when it finishes, I have to
> check manually again and again or wait out the walltime, which makes
> debugging very inefficient. What do you think causes this strange "not
> terminated" problem?
>
> Best,
>
> Shangxin
>
> On Sat, Jun 20, 2015 at 3:34 AM, Timo Heister <heister at clemson.edu> wrote:
>
> > Hey Shangxin,
> >
> > What do you mean by "the case cannot be terminated"? Does ASPECT not
> > stop when the computation is done? Is it not killed by the job
> > scheduler on the cluster (then it is a bug in their scheduler)? What
> > happens if you ask the scheduler to kill the job (using qdel on PBS
> > systems)? You should also be able to ssh into one of the nodes and
> > kill ASPECT manually with "kill".
> >
> >
> > On Fri, Jun 19, 2015 at 11:54 AM, Shangxin Liu <sxliu at vt.edu> wrote:
> > > Hi everyone,
> > >
> > > There is a problem that always confuses me. On our machine, each
> > > node has 16 processors. When I run ASPECT on 1 node (16 processors)
> > > and request a walltime longer than the run needs, the job is
> > > terminated when it finishes. However, when I run ASPECT on multiple
> > > nodes (more than one), the job is not terminated when it finishes;
> > > it is only killed when the requested walltime expires. For example,
> > > if I run ASPECT on 3 nodes (48 processors) requesting 3 hours, the
> > > output file shows the case finishing after 10 minutes, but the job
> > > is not terminated then; it is killed by the machine only when the
> > > 3-hour walltime runs out. I also ran a test with deal.II and found
> > > that deal.II does not have this multi-node "not terminated" problem,
> > > so I suspect something in the MPI part of ASPECT is incompatible
> > > with our machine. But why can single-node cases be terminated while
> > > multi-node cases cannot?
> > >
> > > Although this problem doesn't influence the results, it makes
> > > debugging very slow: for every multi-node case I have to wait until
> > > the requested walltime expires. Any suggestions for solving this
> > > confusing problem?
> > >
> > > Best regards,
> > >
> > > Shangxin
> > >
> >
> >
> >
> > --
> > Timo Heister
> > http://www.math.clemson.edu/~heister/
> >
>
> ------------------------------
>
> Message: 2
> Date: Sat, 20 Jun 2015 19:44:07 -0700
> From: Jonathan Perry-Houts <jperryh2 at uoregon.edu>
> To: aspect-devel at geodynamics.org
> Subject: Re: [aspect-devel] A confused "not terminated" problem with multiple nodes.
>
> Shangxin,
>
> That sounds like a problem with the scheduler, as Timo mentioned before.
> Try sshing into one of the compute nodes after ASPECT has finished, but
> while the job is still "running", and poke around. Does ASPECT show up
> in `ps aux | grep aspect`? What about in `top`? If ASPECT is actually
> done running but the job remains active, then it is a problem with the
> cluster's scheduler, and you'll need to talk to a sysadmin about it.
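> To check all of a job's nodes at once, something along these lines
> should work from the login node (the node names are placeholders; get
> the real list from `qstat -n <jobid>`):
>
>     for n in node042 node043 node044; do
>         # the [a] keeps grep from matching its own command line
>         ssh $n 'hostname; ps aux | grep [a]spect'
>     done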
>
> Cheers,
> Jonathan
>
>
>
> ------------------------------
>
> End of Aspect-devel Digest, Vol 43, Issue 11
> ********************************************
>



-- 
Shangxin Liu

PhD Student
Geodynamics Group
Department of Geosciences
Virginia Tech

