[aspect-devel] A confused "not terminated" problem with multiple nodes.
Timo Heister
heister at clemson.edu
Mon Jun 22 13:20:20 PDT 2015
> I use mpirun -np $PBS_NP aspect file.prm in my job script ($PBS_NP is
> the total number of processors). What does the suffix "_rsh" mean?
I have no idea, but when I googled for
HYDT_bscd_pbs_wait_for_completion I found
https://github.com/radical-cybertools/radical.pilot/issues/309#issuecomment-54909111
which mentions that you need to use mpirun_rsh on that machine (this
seems to be an MVAPICH thing).
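
A minimal sketch of how the launch line in a job script might change,
assuming a Torque/PBS environment that sets PBS_NP and PBS_NODEFILE and
that MVAPICH's mpirun_rsh is on the PATH. The launch function and the
DRY_RUN switch are hypothetical helpers for illustration only; please
check the exact options with your admins:

```shell
#!/bin/bash
# Sketch only: launch ASPECT through MVAPICH's mpirun_rsh instead of mpirun.
# Assumes PBS sets PBS_NP (process count) and PBS_NODEFILE (host list).

# Hypothetical helper: run the launch line, or just print it if DRY_RUN=1.
launch() {
    local np="$1" hostfile="$2"
    set -- mpirun_rsh -np "$np" -hostfile "$hostfile" aspect file.prm
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Only launch when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "${PBS_O_WORKDIR:-.}" || exit 1
    launch "$PBS_NP" "$PBS_NODEFILE"
fi
```

Setting DRY_RUN=1 prints the command instead of running it, which makes
it easy to check the line before submitting the job.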
>
>
> On Mon, Jun 22, 2015 at 12:56 AM, Timo Heister <timo.heister at gmail.com>
> wrote:
>>
>> > Thanks for the detailed suggestions. I'll contact our system
>> > administrators.
>> > Btw, there is another error on our cluster, and I'm not sure whether it
>> > is related to this "not terminated" problem. Every time I run an ASPECT
>> > job, the following error appears in the record file:
>> >
>> > [mpiexec@br310] HYDT_bscd_pbs_wait_for_completion
>> > (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed
>> > with TM error 17002
>>
>> Might be related and something you should ask your admins.
>>
>> > This error appears in both the single-node and multi-node cases, but it
>> > doesn't prevent the results from being written. Our cluster uses the
>> > MVAPICH MPI module and the mpicc/mpicxx compilers.
>>
>> Are you using mpirun_rsh in your job script?
>
>
>
--
Timo Heister
http://www.math.clemson.edu/~heister/