Created on 2007-03-29.02:34:56 by leif, last changed 2007-04-03.21:25:01 by leif.
| msg325 (view) |
Author: leif |
Date: 2007-04-03.21:25:01 |
Joya Tetreault wrote:
> cool, i suspected as much with the python and citcomCU. I am using a
> 64 bit Linux machine, with Red Hat.
>
OK. The "signed integer is greater than maximum" error while downloading
dependencies is a bug in Python 2.3 (the bug only affects 64-bit
machines). So you should continue to use Python 2.5.
>
> and then when i try compiling it using python 2.5 and LAM-MPI, it does
> compile! now when i run it this is the error:
>
> lithos.unm.edu{joya}/home/joya 26: ./bin/citcoms examples/example0.cfg
> -----------------------------------------------------------------------------
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 13309 failed on node n0 (127.0.0.1) due to signal 11.
> -----------------------------------------------------------------------------
>
> --pyre-start: mpirun: exit 11
> ./bin/citcoms: /usr/local/CitcomS-2.1.0/bin/pycitcoms: exit 1
This is actually quite different from the error I'm getting ("exit 11"
vs. "exit -11" ... the minus sign makes all the difference). In your case,
'mpirun' itself ran, and it is reporting that one of the MPI processes
crashed with signal 11 (SIGSEGV).
The only 'mpirun' debugging option I see for LAM/MPI is "-tv":
    -tv    Launch processes under TotalView Debugger
You wouldn't happen to have TotalView installed?
--Leif
|
| msg324 (view) |
Author: joya |
Date: 2007-04-03.20:55:02 |
cool, i suspected as much with the python and citcomCU. I am using a 64
bit Linux machine, with Red Hat.
When compiling citcomS with python 2.3.4, it had trouble retrieving
certain dependencies. But using python 2.5, this is not a problem.
here is the error in the config.log, when i tried installing the older
citcomS-2.1.0 with python 2.3.4:
Downloading http://cheeseshop.python.org/packages/any/m/merlin/merlin-1.1.egg
Traceback (most recent call last):
  File "setup.py", line 3, in ?
    use_merlin()
  File "/usr/local/CitcomS-2.1.0/archimedes/__init__.py", line 56, in use_merlin
    bootstrap()
  File "/usr/local/CitcomS-2.1.0/archimedes/__init__.py", line 46, in bootstrap
    import merlin; merlin.bootstrap_install_from = egg
OverflowError: signed integer is greater than maximum
configure:2304: $? = 1
configure:2309: error: cannot download missing Python dependencies
and then when i try compiling it using python 2.5 and LAM-MPI, it does
compile! now when i run it this is the error:
lithos.unm.edu{joya}/home/joya 26: ./bin/citcoms examples/example0.cfg
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 13309 failed on node n0 (127.0.0.1) due to signal 11.
-----------------------------------------------------------------------------
--pyre-start: mpirun: exit 11
./bin/citcoms: /usr/local/CitcomS-2.1.0/bin/pycitcoms: exit 1
I will now try it with giving it node info, and tell you if that works.
p.s. this is citcomS-2.1.0 that i'm testing, so i'll go ahead and also
try citcomS-2.2.1...
joya
|
| msg323 (view) |
Author: leif |
Date: 2007-04-03.20:55:01 |
Leif Strand "Roundup Issue Tracker" wrote:
>I'll have to build Python with debugging info in order to pursue this
>further... the stack trace I get under GDB isn't helpful...
>
>
>
So I downloaded the latest & greatest -- Python 2.5 -- and built it with
debugging info. The crash still reproduced. Looks like an infinite
regress, but I'm not sure what to make of it yet. Here is the call stack:
#0 0xb7fb1c40 in pthread_getspecific () from /lib/tls/libpthread.so.0
#1 0x081a02e0 in malloc_atfork ()
#2 0x0819cf76 in malloc ()
#3 0x0819cf76 in malloc ()
#4 0x0819cf76 in malloc ()
[...]
#261542 0x0819cf76 in malloc ()
#261543 0x0819cf76 in malloc ()
#261544 0x0819cf76 in malloc ()
#261545 0x080d15eb in PyList_New (size=576) at Objects/listobject.c:110
#261546 0x080dff1a in dict_items (mp=0x8268604) at Objects/dictobject.c:1079
#261547 0x0812048a in call_function (pp_stack=0xbf86e688, oparg=0) at
Python/ceval.c:3550
#261548 0x0811da78 in PyEval_EvalFrameEx (f=0x83256ec, throwflag=0) at
Python/ceval.c:2269
#261549 0x0811eea2 in PyEval_EvalCodeEx (co=0x843cec0,
globals=0x83f4dfc, locals=0x0, args=0x8324e80, argcount=2,
kws=0x8324e88, kwcount=0,
defs=0x8451518, defcount=1, closure=0x0) at Python/ceval.c:2833
#261550 0x08120ae5 in fast_function (func=0x8442d14,
pp_stack=0xbf86e8b8, n=2, na=2, nk=0) at Python/ceval.c:3662
#261551 0x08120842 in call_function (pp_stack=0xbf86e8b8, oparg=2) at
Python/ceval.c:3587
#261552 0x0811da78 in PyEval_EvalFrameEx (f=0x8324d2c, throwflag=0) at
Python/ceval.c:2269
#261553 0x08120a2c in fast_function (func=0x8442ca4,
pp_stack=0xbf86ea58, n=1, na=1, nk=0) at Python/ceval.c:3652
#261554 0x08120842 in call_function (pp_stack=0xbf86ea58, oparg=1) at
Python/ceval.c:3587
#261555 0x0811da78 in PyEval_EvalFrameEx (f=0x8324a3c, throwflag=0) at
Python/ceval.c:2269
#261556 0x0811eea2 in PyEval_EvalCodeEx (co=0x8440848,
globals=0x83f4dfc, locals=0x0, args=0x8324a1c, argcount=2,
kws=0x8324a24, kwcount=0,
defs=0x84517b8, defcount=1, closure=0x0) at Python/ceval.c:2833
#261557 0x08120ae5 in fast_function (func=0x845725c,
pp_stack=0xbf86ec88, n=2, na=2, nk=0) at Python/ceval.c:3662
#261558 0x08120842 in call_function (pp_stack=0xbf86ec88, oparg=2) at
Python/ceval.c:3587
#261559 0x0811da78 in PyEval_EvalFrameEx (f=0x83248cc, throwflag=0) at
Python/ceval.c:2269
#261560 0x0811eea2 in PyEval_EvalCodeEx (co=0x8440920,
globals=0x83f4dfc, locals=0x0, args=0x8324888, argcount=2,
kws=0x8324890, kwcount=0,
defs=0x8451958, defcount=1, closure=0x0) at Python/ceval.c:2833
#261561 0x08120ae5 in fast_function (func=0x8457304,
pp_stack=0xbf86eeb8, n=2, na=2, nk=0) at Python/ceval.c:3662
#261562 0x08120842 in call_function (pp_stack=0xbf86eeb8, oparg=2) at
Python/ceval.c:3587
#261563 0x0811da78 in PyEval_EvalFrameEx (f=0x83246fc, throwflag=0) at
Python/ceval.c:2269
#261564 0x0811eea2 in PyEval_EvalCodeEx (co=0x8622530,
globals=0x861f46c, locals=0x0, args=0x8324508, argcount=3,
kws=0x8324514, kwcount=0,
defs=0x862a058, defcount=2, closure=0x0) at Python/ceval.c:2833
#261565 0x08120ae5 in fast_function (func=0x86afe64,
pp_stack=0xbf86f0e8, n=3, na=3, nk=0) at Python/ceval.c:3662
#261566 0x08120842 in call_function (pp_stack=0xbf86f0e8, oparg=3) at
Python/ceval.c:3587
#261567 0x0811da78 in PyEval_EvalFrameEx (f=0x83242e4, throwflag=0) at
Python/ceval.c:2269
#261568 0x0811eea2 in PyEval_EvalCodeEx (co=0x8622d58,
globals=0x861f46c, locals=0x0, args=0x88d6880, argcount=4,
kws=0x82c54d8, kwcount=1,
defs=0x8681df8, defcount=1, closure=0x0) at Python/ceval.c:2833
#261569 0x0816f2f5 in function_call (func=0x86b117c, arg=0x88d6874,
kw=0x8818604) at Objects/funcobject.c:517
#261570 0x080b9848 in PyObject_Call (func=0x86b117c, arg=0x88d6874,
kw=0x8818604) at Objects/abstract.c:1860
#261571 0x080c0674 in instancemethod_call (func=0x86b117c,
arg=0x88d6874, kw=0x8818604) at Objects/classobject.c:2493
#261572 0x080b9848 in PyObject_Call (func=0x8449e3c, arg=0x88d6874,
kw=0x8818604) at Objects/abstract.c:1860
#261573 0x08120e47 in do_call (func=0x8449e3c, pp_stack=0xbf86f5d8,
na=4, nk=1) at Python/ceval.c:3777
#261574 0x08120867 in call_function (pp_stack=0xbf86f5d8, oparg=260) at
Python/ceval.c:3589
#261575 0x0811da78 in PyEval_EvalFrameEx (f=0x832415c, throwflag=0) at
Python/ceval.c:2269
#261576 0x0811eea2 in PyEval_EvalCodeEx (co=0x8627020,
globals=0x861f46c, locals=0x0, args=0x88f75e8, argcount=4,
kws=0x88f75f8, kwcount=0,
defs=0x8686db8, defcount=2, closure=0x0) at Python/ceval.c:2833
#261577 0x08120ae5 in fast_function (func=0x86b12cc,
pp_stack=0xbf86f808, n=4, na=4, nk=0) at Python/ceval.c:3662
#261578 0x08120842 in call_function (pp_stack=0xbf86f808, oparg=3) at
Python/ceval.c:3587
#261579 0x0811da78 in PyEval_EvalFrameEx (f=0x88f7494, throwflag=0) at
Python/ceval.c:2269
#261580 0x0811eea2 in PyEval_EvalCodeEx (co=0x8622de8,
globals=0x861f46c, locals=0x0, args=0x88ead44, argcount=2,
kws=0x88ead4c, kwcount=0,
defs=0x862a918, defcount=1, closure=0x0) at Python/ceval.c:2833
#261581 0x08120ae5 in fast_function (func=0x86b11ec,
pp_stack=0xbf86fa38, n=2, na=2, nk=0) at Python/ceval.c:3662
#261582 0x08120842 in call_function (pp_stack=0xbf86fa38, oparg=1) at
Python/ceval.c:3587
#261583 0x0811da78 in PyEval_EvalFrameEx (f=0x88eabfc, throwflag=0) at
Python/ceval.c:2269
#261584 0x0811eea2 in PyEval_EvalCodeEx (co=0x8622e30,
globals=0x861f46c, locals=0x0, args=0x88d6f60, argcount=4, kws=0x0,
kwcount=0, defs=0x869ef88,
defcount=3, closure=0x0) at Python/ceval.c:2833
#261585 0x0816f2f5 in function_call (func=0x86b1224, arg=0x88d6f54,
kw=0x0) at Objects/funcobject.c:517
#261586 0x080b9848 in PyObject_Call (func=0x86b1224, arg=0x88d6f54,
kw=0x0) at Objects/abstract.c:1860
#261587 0x080c0674 in instancemethod_call (func=0x86b1224,
arg=0x88d6f54, kw=0x0) at Objects/classobject.c:2493
#261588 0x080b9848 in PyObject_Call (func=0x8449dec, arg=0x88d6bbc,
kw=0x0) at Objects/abstract.c:1860
#261589 0x080bf9bc in instance_call (func=0x88442cc, arg=0x88d6bbc,
kw=0x0) at Objects/classobject.c:2051
#261590 0x080b9848 in PyObject_Call (func=0x88442cc, arg=0x88d6bbc,
kw=0x0) at Objects/abstract.c:1860
#261591 0x081200f1 in PyEval_CallObjectWithKeywords (func=0x88442cc,
arg=0x88d6bbc, kw=0x0) at Python/ceval.c:3435
#261592 0x081413a3 in PyErr_PrintEx (set_sys_last_vars=1) at
Python/pythonrun.c:1073
#261593 0x08141000 in PyErr_Print () at Python/pythonrun.c:969
#261594 0x08140cf3 in PyRun_SimpleStringFlags (
command=0x81ded00 "import sys; path = sys.argv[1]; requires =
sys.argv[2]; entry = sys.argv[3]; path = path.split(':');
path.extend(sys.path); sys.path = path; from merlin import loadObject;
entry = loadObject(entry); e"..., flags=0x0) at Python/pythonrun.c:893
#261595 0x0806ba0e in main (argc=8, argv=0x0) at ../../bin/pycitcoms.c:89
(gdb)
|
| msg322 (view) |
Author: leif |
Date: 2007-04-03.20:05:01 |
Joya,
CitcomCU doesn't use Python, so the "signal 11" there is unrelated.
What error do you get while trying to compile CitcomS with LAM/MPI? What
is the problem with Python 2.3.4 and CitcomS? Is this a 64-bit machine?
--Leif
|
| msg321 (view) |
Author: joya |
Date: 2007-04-03.18:00:02 |
Hi everyone,
my signal error, when running it with LAM-MPI, is not -11, just 11. I had
to mess with it a bit to get this error--and this is with citcomCU. here
is my error message:
lithos.unm.edu{joya}/home/joya 129: mpirun n0 citcom.mpi input1
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 6983 failed on node n0 (127.0.0.1) due to signal 11.
-----------------------------------------------------------------------------
It could be a problem with python or LAM-MPI, because nowhere in the
makefile for citcomCU is it specified which python to use. And i think
everything automatically goes thru the old python (2.3.4) on our machine,
unless i specify it when compiling. but maybe not, since I was able to
get citcomCU running with MPICH, and not specifying the python path.
so perhaps it is a LAM-MPI problem. this signal error I received is from
mpirun, that is the LAM mpirun. with the MPICH mpirun I don't need to
specify the nodes used, which is strange, but citcomCU works.
I haven't been able to reproduce this LAM-MPI error with CitcomS, because
it isn't able to compile at all with LAM-MPI, even when i force it to use
python2.5. but it compiles no problem with MPICH2 (and python 2.5 NOT
python 2.3.4).
I don't know if this helps? But i'll keep messing with LAM-MPI and
citcomS to see if i can get it to at least compile.
Joya
|
| msg320 (view) |
Author: leif |
Date: 2007-04-03.01:25:01 |
Michael Aivazis "Roundup Issue Tracker" wrote:
>In case you didn't know, negative exit codes in Un*x imply
>death-by-signal, with the absolute value of the return code being the
>signal that caused the death.
>
>
>
I'm against mapping "death by signal" onto an exit code, because it
creates ambiguity. Python's 'os' module is doing
    if WIFSIGNALED(sts):
        return -WTERMSIG(sts)
which makes it impossible to tell the difference between
exit(-11)/exit(245) and SIGSEGV. If you're going to map signals onto exit
codes, you might as well be consistent with the shell and map them to
128 + sig. But in the context of an object-oriented language, why do it
at all? Is the language not rich enough to return something other than
an integer?
I hate module 'os' in general, because it was obviously written by
someone with a C mindset... e.g., was it really necessary to recreate
the entire family of exec* functions (execl, execle, execlp, execlpe,
execv, execve, execvp, and execvpe) in the Python realm? Unlike C,
Python has flexible argument processing: *args and **kwds... intelligent
use of *args and **kwds could have collapsed the entire exec* family
into one or two functions. But instead, they did a mindless one-to-one
mapping between Python and C. It is in keeping with the general quality
of the Python standard library, though.
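>It is curious that any exception in the child after fork causes SIGSEGV.
>Let me know what you uncover.
>
I will. Stay tuned to issue101 for details.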
--Leif
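To make the ambiguity argument above concrete, here is a minimal sketch of a richer return value that keeps "exited with code N" and "killed by signal N" apart. It is an illustration only, not what Python's 'os' module does; the names ChildResult, decode_wait_status and run_child are made up for this example, and it assumes a Unix system with a modern Python 3.

```python
import os
from typing import NamedTuple, Optional


class ChildResult(NamedTuple):
    """Outcome of a child process, keeping exit codes and signals apart."""
    exit_code: Optional[int]    # set only if the child called exit()
    term_signal: Optional[int]  # set only if a signal killed the child

    def shell_status(self) -> int:
        """Collapse to one integer the way the shell does: 128 + signal."""
        if self.term_signal is not None:
            return 128 + self.term_signal
        return self.exit_code


def decode_wait_status(sts: int) -> ChildResult:
    """Decode a raw status as returned by os.waitpid() (Unix only)."""
    if os.WIFSIGNALED(sts):
        return ChildResult(None, os.WTERMSIG(sts))
    if os.WIFEXITED(sts):
        return ChildResult(os.WEXITSTATUS(sts), None)
    raise ValueError("child was stopped or continued, not terminated")


def run_child(action) -> ChildResult:
    """Fork, run action() in the child, and report how the child ended."""
    pid = os.fork()
    if pid == 0:
        try:
            action()
            os._exit(0)      # action() returned normally
        except BaseException:
            os._exit(1)      # action() raised a Python exception
    _, sts = os.waitpid(pid, 0)
    return decode_wait_status(sts)
```

With this shape, run_child(lambda: os._exit(245)) reports an exit code of 245 and no signal, while a child killed by SIGSEGV would report term_signal 11 (139 in shell convention), so the caller never has to guess which case a bare integer stands for.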
|
| msg319 (view) |
Author: aivazis |
Date: 2007-04-02.23:45:01 |
Leif,
In case you didn't know, negative exit codes in Un*x imply
death-by-signal, with the absolute value of the return code being the
signal that caused the death.
It is curious that any exception in the child after fork causes SIGSEGV.
Let me know what you uncover. It sounds like a serious bug to me. Might
be file descriptor related...
-- Michael
|
| msg318 (view) |
Author: leif |
Date: 2007-04-02.23:35:02 |
Joya,
On my system, the "exit -11" seems to be a bug in Python itself. (I'm
using Python 2.4 on Debian.) Shortly after calling fork(), the Python
interpreter in the child process dies with SIGSEGV -- before it even
tries to launch 'mpirun'. The Python standard library code (in the
parent process) confusingly reports this as "exit -11" (even though it
didn't 'exit' -- it died with a signal).
I don't know if this is related or not, but I discovered by accident
that if the child process raises any Python exception at all after the
fork(), the interpreter dies with SIGSEGV.
For some reason, giving the full path to 'mpirun' avoids the bug. I have
no idea why we don't hit this bug with MPICH.
I'll have to build Python with debugging info in order to pursue this
further... the stack trace I get under GDB isn't helpful...
--Leif
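The full-path work-around can be automated. The sketch below resolves a program name on PATH and prints a [CitcomS.launcher] stanza with the absolute path filled in. It does not explain why the full path avoids the crash; it only saves typing. The helper names with_full_path and launcher_stanza are hypothetical, and shutil.which needs Python 3.3 or later, not the Python 2.x interpreters discussed in this thread.

```python
import os
import shutil


def with_full_path(argv):
    """Return a copy of argv whose program name is an absolute path."""
    program = shutil.which(argv[0])
    if program is None:
        raise FileNotFoundError("%s not found on PATH" % argv[0])
    return [os.path.abspath(program)] + list(argv[1:])


def launcher_stanza(argv):
    """Format a [CitcomS.launcher] section for a .cfg file."""
    return "[CitcomS.launcher]\ncommand = %s\n" % " ".join(with_full_path(argv))


if __name__ == "__main__":
    # ${nodes} is left as-is; it is expanded later by the launcher.
    print(launcher_stanza(["mpirun", "-np", "${nodes}"]))
```

If several MPI implementations are installed, check that the path printed is LAM's 'mpirun' and not MPICH's before pasting it into ~/.pyre/CitcomS/CitcomS.cfg.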
|
| msg317 (view) |
Author: joya |
Date: 2007-04-02.22:50:01 |
Yeah i think i got the same error--it said signal 11. but I have since
quit trying to install citcom with LAM-MPI, since we also have MPICH2.
And now i have both citcomCU and citcomS working with MPICH2. I can go
ahead and try installing citcomS with LAM-MPI again, and then give it the
node info--i'm also not using a cluster, so i can use the same flags that
you used. I'll keep you posted on what happens!
Joya
|
| msg316 (view) |
Author: leif |
Date: 2007-04-02.22:00:02 |
Hi Joya,
I'm curious what error message you were getting while trying to run
CitcomS with LAM/MPI.
I just installed LAM/MPI on my Linux workstation here at CIG. With the
default options, I did experience some difficulty:
leif@crust:~/dv/CitcomS/work$ citcoms example1.cfg
--pyre-start: mpirun: exit -11
citcoms: /home/leif/dv/CitcomS/build/bin/pycitcoms: exit 1
leif@crust:~/dv/CitcomS/work$
The default setting for the 'mpirun' command is the following:
[CitcomS.launcher]
command = mpirun -np ${nodes}
I discovered that if I added the following to my
~/.pyre/CitcomS/CitcomS.cfg file, everything worked:
[CitcomS.launcher]
command = /home/leif/opt/lam/bin/mpirun -np ${nodes}
Additionally, I could specify a node list (although since I'm not on a
cluster, this is a bit silly):
[CitcomS.launcher]
command = /home/leif/opt/lam/bin/mpirun -np ${nodes} n0,0,0,0,0,0,0
I'm not certain why adding the full path to LAM's 'mpirun' made any
difference -- in my setup, at least, the environment should have been
the same in all cases. I will investigate this further.
But it would be helpful if you copied & pasted the error message you are
getting into an e-mail, and sent it to us.
--Leif
Eh Tan wrote:
>Hi Joya,
>
>Adding support for LAM/MPI is on our bug tracker. There are several
>possible work-arounds, see
>http://geodynamics.org/roundup/issues/issue101
>
>Leif Strand is working on this issue. If you can help to test, it would
>be much appreciated.
>
>
>Eh
>
>
>Joya Tetreault wrote:
>
>
>
>>Hello Eh,
>>
>>I actually attended the CIG workshop on Tuesday, and you were
>>extremely helpful (and patient) with getting citcomS installed on my
>>Linux here at UNM. I don't know if you remember me--i was the one
>>trying to get citcomS to run but it was having issues with LAM-MPI so
>>we had to force it to find MPICH and run it thru that. So actually,
>>since the workshop i have edited citcomCU's workfile to go thru MPICH2
>>and I have been able to get citcomCU to run as well. for some reason,
>>LAM-MPI doesn't work with either version of citcom.
>>
>>Thanks for all of your help! And Eh, that workshop was great--it was
>>extremely helpful!
>>
>>Joya
>>
>>
>>On Fri, 30 Mar 2007, Eh Tan wrote:
>>
>>
>>
>>>Hi Joya,
>>>
>>>Signal 11 usually indicates a memory error. It could be due to either bad
>>>software or bad hardware. Since other people have run CitcomCU on their
>>>computers without problems, CitcomCU is unlikely to be the source of the
>>>error.
>>>
>>>Besides hardware error, there are two other possible sources of
>>>software error.
>>>
>>>One possibility is in the launching of your CitcomCU run. Sorry to ask a
>>>silly question. Did you run lamboot before launching CitcomCU?
>>>
>>>http://www.lam-mpi.org/tutorials/lam/boot.php
>>>
>>>
>>>The other possibility is the compiler/linker. What OS and compiler are
>>>you using? It would be useful if you could send me the Makefile.
>>>
>>>
>>>
>>>Eh
>>>
>>>
>>>
>>>
>>>
>>>
>>>>>Date: Sun, 25 Mar 2007 12:09:29 -0600 (MDT)
>>>>>From: Joya Tetreault <joya@unm.edu>
>>>>>Subject: questions about citcomCU and MPI
>>>>>To: Shijie Zhong <szhong@spice.colorado.edu>
>>>>>
>>>>>Hi Shijie, How are you doing? I hope you are having a good semester! I
>>>>>am starting to run citcomCU to calculate velocity fields and velocity
>>>>>gradients in the upper mantle for my post-doc project. However I am
>>>>>having trouble running citcom, and am quite unfamiliar with MPI. I am
>>>>>trying to run citcom on a dual core Linux machine (which has a total of 4
>>>>>processors). We have LAM-MPI installed on this machine. I am just
>>>>>trying to run the example input1 file, but every time i run mpirun, i
>>>>>get an error stating that one of the processes has exited with signal
>>>>>11. So i think i am oversubscribing mpi. But I really don't know enough
>>>>>about MPI to figure out how to solve this issue. If you could give me
>>>>>any suggestions about how to debug it, I would greatly appreciate it.
>>>>>
>>>>>Also, I will be up in Boulder the week of April 12th, and was hoping I
>>>>>could meet with you to talk more specifically about citcom. I would like
>>>>>to use citcom for a modified couette flow, where the upper boundary is a
>>>>>step function.
>>>>>
>>>>>Joya Tetreault
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>--
>>>Eh Tan
>>>Staff Scientist
>>>Computational Infrastructure for Geodynamics
>>>2750 E. Washington Blvd. Suite 210
>>>Pasadena, CA 91107
>>>(626) 395-1693
>>>http://www.geodynamics.org
>>>
>>>
>>>
>>>
>
>
>
|
| msg312 (view) |
Author: leif |
Date: 2007-03-29.02:34:55 |
At the staff meeting today, Eh pointed out that LAM/MPI support is missing from
CIG-Pyre. I wrote a launcher for LAM/MPI some time ago, but it got lost in the
shuffle while transitioning to CIG-Pyre. So now I have to go back, dust it off,
test it, and merge it into Pyre.
An old version of LauncherLAMMPI can be found in the CitcomS v2.0.x source:
http://geodynamics.org/wsvn/cig/mc/3D/CitcomS/tags/v2.0.1/pyre/Components/Launchers.py?op=file&rev=0&sc=0
As can be seen from the method _appendNodeListArgs(), LAM/MPI's mpirun expects a
nodelist on the command line (n101,102,103,104). As I recall, the nodelist
doesn't correspond to hostnames (as it does under MPICH); they are arbitrary
numbers which LAM/MPI assigns to different hosts.
A possible work-around is to use a fixed nodelist hard-coded in a .cfg file:
[CitcomS.launcher]
command = mpirun -np ${nodes} n101,102,103,104,105,106
Also, CIG-Pyre accepts arbitrary environment strings in 'command', so another
possible work-around is to use an environment variable:
[CitcomS.launcher]
command = mpirun -np ${nodes} ${MY_LAMMPI_NODES}
This feature was actually added to support ${PBS_NODEFILE}; see
http://www.geodynamics.org/cig/software/packages/cs/pythia/docs/batch
I'll attach an .odb file for LAM/MPI (for use in Pythia v0.8.1.2), once I have
one ready.
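As a rough illustration of the two work-arounds above, the sketch below builds a LAM-style nodelist argument and expands ${nodes} plus environment variables in a 'command' string. This is not CIG-Pyre's implementation: the functions lam_nodelist and expand_command are hypothetical, the expansion uses Python's string.Template as a stand-in, and the nodelist format simply follows the n101,102,103,104 example given above.

```python
import os
import shlex
from string import Template


def lam_nodelist(node_ids):
    """Format LAM node numbers as one mpirun argument, e.g. n101,102,103."""
    ids = [str(int(n)) for n in node_ids]
    if not ids:
        raise ValueError("at least one node is required")
    return "n" + ",".join(ids)


def expand_command(command, nodes, environ=None):
    """Expand ${nodes} and ${ENV_VAR} references, then split into argv."""
    values = dict(os.environ if environ is None else environ)
    values["nodes"] = str(nodes)
    # substitute() raises KeyError if a referenced variable is not set,
    # which is safer than silently launching with a missing nodelist.
    return shlex.split(Template(command).substitute(values))


if __name__ == "__main__":
    env = {"MY_LAMMPI_NODES": lam_nodelist([101, 102, 103, 104])}
    print(expand_command("mpirun -np ${nodes} ${MY_LAMMPI_NODES}", 4, env))
```

Failing loudly on an unset variable is a deliberate choice in this sketch; whether CIG-Pyre behaves the same way when ${MY_LAMMPI_NODES} is missing is not something this thread establishes.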
|
| Date | User | Action | Args |
| 2007-04-03 21:25:01 | leif | set | nosy: leif, aivazis, tan2, joya; messages: + msg325 |
| 2007-04-03 20:55:02 | joya | set | nosy: leif, aivazis, tan2, joya; messages: + msg324 |
| 2007-04-03 20:55:01 | leif | set | nosy: leif, aivazis, tan2, joya; messages: + msg323 |
| 2007-04-03 20:05:01 | leif | set | nosy: leif, aivazis, tan2, joya; messages: + msg322 |
| 2007-04-03 18:00:02 | joya | set | nosy: leif, aivazis, tan2, joya; messages: + msg321 |
| 2007-04-03 01:25:01 | leif | set | nosy: leif, aivazis, tan2, joya; messages: + msg320 |
| 2007-04-02 23:45:01 | aivazis | set | nosy: leif, aivazis, tan2, joya; messages: + msg319 |
| 2007-04-02 23:35:02 | leif | set | nosy: leif, aivazis, tan2, joya; messages: + msg318 |
| 2007-04-02 22:50:02 | joya | set | nosy: + joya; messages: + msg317 |
| 2007-04-02 22:00:04 | leif | set | status: unread -> chatting; nosy: leif, aivazis, tan2; messages: + msg316; title: LAM/MPI support is missing -> Re: questions about citcomCU and MPI |
| 2007-03-29 02:34:56 | leif | create | |