[aspect-devel] A confused "not terminated" problem with multiple nodes.
Jonathan Perry-Houts
jperryh2 at uoregon.edu
Sat Jun 20 19:44:07 PDT 2015
Shangxin,
That sounds like a problem with the scheduler, as Timo mentioned before.
Try SSHing into one of the compute nodes after ASPECT has finished, but
while the job is still "running", and poke around. Does aspect show up in
`ps aux | grep aspect`? What about `top`? If ASPECT is actually done
running but the job remains active, then it is a problem with the
cluster's scheduler, and you will need to talk to a sysadmin about that.
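
If it helps, the check above could be wrapped in a small script to run on
each compute node. This is only a rough sketch: the executable name
"aspect" is an assumption (use whatever your binary is called), and the
helper names `count_matching` and `report_node` are made up for
illustration. It matches the command name exactly, so the grep process
itself is not counted.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that exactly equal the
# command name given as $1 (e.g. "aspect").
count_matching() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c -x -- "$1" || true
}

# Print a one-line verdict for this node. "aspect" is an assumed
# executable name; pass a different one as the first argument.
report_node() {
    name="${1:-aspect}"
    n=$(ps -e -o comm= | count_matching "$name")
    if [ "$n" -gt 0 ]; then
        echo "$(uname -n): $n '$name' process(es) still running"
    else
        echo "$(uname -n): no '$name' process found"
    fi
}

# Only run the check when ps is available on this machine.
if command -v ps >/dev/null 2>&1; then
    report_node aspect
fi
```

If every node reports no such process while the job is still listed as
running, that points at the scheduler rather than at ASPECT. If some node
still shows the process, it may be an MPI rank that never exited.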
Cheers,
Jonathan
On 06/20/2015 07:16 PM, Shangxin Liu wrote:
> Hi Timo,
>
> Normally, the cluster terminates a job as soon as it finishes, which
> is of course before the requested walltime is used up. However, when
> I run ASPECT on our cluster with more than one node, the job is not
> terminated when the computation finishes: it keeps occupying the nodes
> and is shown in the "running" state. The nodes are only released once
> the requested walltime is exceeded. I know I can kill the job manually,
> but for large computations I don't know in advance when the run will
> finish, so I normally request a generous walltime. Since the job is
> not terminated when it finishes, I have to check manually again and
> again or wait until the walltime runs out. This makes debugging very
> inefficient. What do you think causes this weird "not terminated"
> problem?
>
> Best,
>
> Shangxin
>
> On Sat, Jun 20, 2015 at 3:34 AM, Timo Heister <heister at clemson.edu> wrote:
>
> Hey Shangxin,
>
> What do you mean by "the case cannot be terminated"? Does ASPECT not
> stop when the computation is done? Is it not killed by the job
> scheduler on the cluster (then it is a bug in their scheduler)? What
> happens if you ask the scheduler to kill the job (using qdel on PBS
> systems)?
> You should be able to ssh into one of the nodes and kill ASPECT
> manually using "kill".
>
>
> On Fri, Jun 19, 2015 at 11:54 AM, Shangxin Liu <sxliu at vt.edu> wrote:
> > Hi everyone,
> >
> > There is a problem that always confuses me. On our machine, each node
> > has 16 processors. When I run ASPECT on 1 node (16 processors) and
> > request a walltime longer than the run needs, the job is terminated
> > when it finishes. However, when I run ASPECT on more than one node,
> > the job is not terminated when it finishes. It is only terminated
> > (killed) when the requested walltime is exceeded. For example, if I
> > run ASPECT on 3 nodes (48 processors) and request 3 hours, the record
> > file shows that the case finishes after 10 minutes, but the job is not
> > terminated at that point. It is terminated (killed) by the machine at
> > 10 hours walltime. I also ran a test with deal.II and found that
> > deal.II does not have this multiple-node "not terminated" problem. So
> > I suppose something in the MPI part of ASPECT may be incompatible with
> > our machine. But why are one-node cases terminated while multiple-node
> > cases are not?
> >
> > Although this problem doesn't affect the results, it makes debugging
> > very slow: with every multiple-node case I have to wait until the
> > requested walltime. Any suggestions for solving this confusing problem?
> >
> > Best regards,
> >
> > Shangxin
> >
> >
> >
> > _______________________________________________
> > Aspect-devel mailing list
> > Aspect-devel at geodynamics.org <mailto:Aspect-devel at geodynamics.org>
> > http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>
>
>
> --
> Timo Heister
> http://www.math.clemson.edu/~heister/