[aspect-devel] A confused "not terminated" problem with multiple nodes.

Timo Heister heister at clemson.edu
Mon Jun 22 13:20:20 PDT 2015


> I use "mpirun -np $PBS_NP aspect file.prm" in my job script ($PBS_NP is
> the total number of processors). What does the "_rsh" suffix mean?

I don't know for certain, but when I searched for
HYDT_bscd_pbs_wait_for_completion I found
https://github.com/radical-cybertools/radical.pilot/issues/309#issuecomment-54909111
which mentions that you need to use mpirun_rsh on that machine (this
seems to be an MVAPICH thing).
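
For illustration, the body of a job script that uses that launcher might
look roughly like the sketch below. This is untested on your cluster. It
assumes the scheduler sets PBS_NP and PBS_NODEFILE for the job; the
fallback values (4 processes, hosts.txt) are hypothetical and only let
the script run as a dry run outside of PBS.

```shell
#!/bin/bash
# Sketch of a PBS job script body using MVAPICH's mpirun_rsh launcher.
# Assumption: the scheduler exports PBS_NP and PBS_NODEFILE for the job.

# Number of MPI processes; 4 is a hypothetical fallback for a dry run.
NP="${PBS_NP:-4}"
# File with one host name per line; hosts.txt is a hypothetical fallback.
HOSTFILE="${PBS_NODEFILE:-hosts.txt}"

# Same program and parameter file as in the original mpirun line.
LAUNCH_CMD="mpirun_rsh -np $NP -hostfile $HOSTFILE aspect file.prm"
echo "$LAUNCH_CMD"

# Only launch if mpirun_rsh is on PATH and the host file exists.
if command -v mpirun_rsh >/dev/null 2>&1 && [ -f "$HOSTFILE" ]; then
    $LAUNCH_CMD
fi
```

The main difference from the plain mpirun line is that mpirun_rsh is
given the host list explicitly through -hostfile. Whether this is the
right launcher for your machine is something your admins can confirm.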

> On Mon, Jun 22, 2015 at 12:56 AM, Timo Heister <timo.heister at gmail.com>
> wrote:
>>
>> > Thanks for the detailed suggestions. I'll contact our system
>> > administrators.
>> > Btw, there is another error on our cluster, and I'm not sure whether
>> > it is related to this "not terminated" problem. Every time I run an
>> > ASPECT job, the following error appears in the record file:
>> >
>> > [mpiexec@br310] HYDT_bscd_pbs_wait_for_completion
>> > (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
>>
>> It might be related, and it is something you should ask your admins about.
>>
>> > This error appears in both the single-node and multi-node cases, but
>> > it doesn't prevent the results from being written. Our cluster uses
>> > the MVAPICH MPI module and the mpicc/mpicxx compilers.
>>
>> Are you using mpirun_rsh in your job script?

-- 
Timo Heister
http://www.math.clemson.edu/~heister/
