[CIG-LONG] Gale 2.0 in cluster

Walter Landry walter at geodynamics.org
Fri Mar 16 12:03:50 PDT 2012


Hi Dennis,

Cc'ing the list.

If you get a crash on the command line with a single core, run it in a
debugger and tell me where it is crashing.  You will probably want to
compile the code with debugging info and without optimization.  You
can also run it under valgrind and see if that points to any errors.
It should be valgrind clean except for two errors at the beginning due
to MPI.
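
For example, something along these lines (a sketch; the binary name,
input file, and build flags are placeholders for your installation):

    # rebuild with debugging info and without optimization,
    # e.g. CFLAGS='-g -O0' CXXFLAGS='-g -O0'

    # run the crashing case under gdb and get a stack trace
    gdb --args ./Gale yielding.json
    (gdb) run
    (gdb) backtrace

    # look for memory errors under valgrind
    valgrind --track-origins=yes ./Gale yielding.json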

Cheers,
Walter Landry

Dennis Michael <dennis at stanford.edu> wrote:
> 
> Walter,
> 
> Here are some observations.
> 
> All code is compiled using Intel 11.1. We use openmpi-1.4.3,
> petsc-3.0.0-p12, and hdf5-1.8.4. I've recompiled all of these
> packages.
> 
> I downloaded and tried to run the binary version, but it didn't work:
> we don't have glibc 2.7. We are currently running CentOS 5.7.
> 
> I downloaded the source code, compiled it, and got the errors. I then
> created a Mercurial clone of the latest dev code and compiled it, but
> got the same errors.
> 
> The yielding.json file segfaults almost immediately when it starts an
> MPI run on the master node. However, the same file runs serially on a
> single core. The XML versions of the input files seem to run OK. I
> specified the full paths.
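> 
> For reference, a sketch of the two invocations (binary name, core
> count, and paths are placeholders for our actual setup):
> 
>     # serial run on a single core: completes
>     ./Gale yielding.json
> 
>     # MPI run launched on the master node: segfaults almost
>     # immediately
>     mpirun -np 8 ./Gale yielding.json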
> 
> George Hilley has mentioned that, after convergence has been achieved
> across the finite element domain and the particles are to be advected
> in the model, the program crashes on his (XML) input file. When I run
> this input on a single core on one of the compute nodes or on the
> command line, I get the following error:
> 
>   Error running Underworld (revision unknown) - Signal 11 'SIGSEGV'
>   (Segmentation Fault).
>   This is probably caused by an illegal access of memory.
> 
> Any suggestions on where I should start looking?
> 
> Dennis
> 
> 
> 
> On 2/8/2012 5:04 PM, Walter Landry wrote:
>> Hi Leonardo,
>>
>> I am Cc'ing the list in case anyone else has similar problems.
>>
>> I do not remember doing anything special to the input files.  I am
>> attaching an input file that I have run on 16 cores on Lonestar at
>> TACC.  Does Gale write the xml version of the file in the output
>> directory, or does it segfault before that?  Do you get any error
>> messages?  Have you tried specifying the complete path to the input
>> file?
>>
>> Manually setting shadowDepth should not be necessary.
>>
>> If none of that works, can you try inserting print statements into
>>
>>    StGermain/Base/IO/src/IO_Handler.cxx
>>
>> after lines 373, 376, 377, 380, and 381, and tell me where it dies.
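>>
>> As a sketch, the cycle would look something like this (the build
>> command and trace text are placeholders; adjust to your checkout):
>>
>>     # add a trace print after each of those lines, e.g.
>>     #   fprintf( stderr, "IO_Handler: past line 373\n" ); fflush( stderr );
>>     $EDITOR StGermain/Base/IO/src/IO_Handler.cxx
>>
>>     # rebuild and rerun; the last trace printed before the segfault
>>     # shows where it dies
>>     make && ./Gale yielding.json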
>>
>> Cheers,
>> Walter Landry
>>
>> "Leonardo  Cruz"<leocruz at stanford.edu>  wrote:
>>> Hello Walter,
>>>
>>> We are having some issues running Gale 2.0 on the cluster. Dennis is
>>> getting segmentation-fault errors when running the .json files that
>>> come with the binaries package. Interestingly enough, he is able to
>>> run previous XML files with this new Gale build.
>>> Could you provide us with a JSON file that you have tested on your
>>> cluster, so we have something to test with?
>>>
>>> Thanks in advance for your help.
>>>
>>> Leonardo
>>>
>>> ----- Forwarded Message -----
>>> From: "Leonardo Cruz"<leocruz at stanford.edu>
>>> To: "Dennis Michael"<dennis at stanford.edu>
>>> Cc: "George Hilley"<hilley at stanford.edu>, "María Helga
>>> Guðmundsdóttir"<mariahg at stanford.edu>
>>> Sent: Monday, February 6, 2012 9:30:07 AM
>>> Subject: Re: Gale 2.0 in cluster
>>>
>>> Dennis,
>>> I will ask Walter Landry to provide one of the JSON files that he
>>> has tested in parallel, so we have something to start with.
>>>
>>> Thanks
>>> Leonardo
>>>
>>> ----- Original Message -----
>>> From: "Dennis Michael"<dennis at stanford.edu>
>>> To: "Leonardo Cruz"<leocruz at stanford.edu>
>>> Cc: "George Hilley"<hilley at stanford.edu>
>>> Sent: Monday, February 6, 2012 8:59:18 AM
>>> Subject: Re: Gale 2.0 in cluster
>>>
>>> Leo,
>>>
>>> The info about the parallel variable is interesting, since Gale2 is
>>> failing at the point when it starts to run MPI. The master node sets
>>> up the job on the various compute nodes, then gets a segmentation
>>> fault from Gale2 when the job starts.
>>>
>>> Dennis
>>>
>>> On 2/4/2012 7:41 AM, Leonardo Cruz wrote:
>>>> Dennis,
>>>>
>>>> I have run two more files from the Gale package (extension.json &
>>>> viscous.json) on my Mac (binary version) successfully.
>>>> I tried those files on the cluster unsuccessfully
>>>> (data/cees/temp1/leocruz1/gale). Then I added a variable
>>>> (shadowDepth) to extension.json that, according to the manual, needs
>>>> to be included when running in parallel:
>>>>
>>>> ..."shadowDepth: When running in parallel, every parameter only
>>>> computes quantities over a portion of the
>>>> grid. To do this, each processor must keep copies of points that
>>>> belong to other processors. This
>>>> parameter specifies how wide the region of copied points is. You should
>>>> never need to change this from
>>>> 1.....
>>>>
>>>> but still got errors (galejob_extension.err).
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Dennis Michael"<dennis at stanford.edu>
>>>> To: "Leonardo Cruz"<leocruz at stanford.edu>
>>>> Cc: "George Hilley"<hilley at stanford.edu>
>>>> Sent: Friday, February 3, 2012 3:10:42 PM
>>>> Subject: Re: Gale 2.0 in cluster
>>>>
>>>>
>>>> I've reinstalled everything, and the *.json input files still do
>>>> not work. I've tried different compilers and packages.
>>>>
>>>> Gale-2 runs fine on the Dragonsback*.xml file, so I don't think the
>>>> binary is at fault.
>>>>
>>>> Can we take a look at the *.json files? Are they known to be good?
>>>>
>>>> The only thing I can do now is download the latest development
>>>> version in the hope that there's a fix.
>>>>
>>>> Dennis
>>>>
>>>> On 2/1/2012 6:39 PM, Leonardo Cruz wrote:
>>>>> Dennis,
>>>>> I ran the file yielding.json using the script yielding1.sh
>>>>> (data/cees/temp1/leocruz1/gale) and got the error indicated in
>>>>> galejob_yielding.err.
>>>>>
>>>>> One thing I noticed is that you are running an XML file, while I am
>>>>> using a new JSON file from the tested input files that come with
>>>>> the package.
>>>>>
>>>>> Any help is appreciated as usual!
>>>>>
>>>>> Thanks
>>>>> Leonardo
>>>>>
>>>>> -------------
>>>>> Leonardo Cruz
>>>>> Dept. Geological and Environmental Sciences
>>>>> 450 Serra Mall
>>>>> Braun Hall, Building 320
>>>>> Stanford University
>>>>> Stanford, CA 94305-2115
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Leonardo Cruz"<leocruz at stanford.edu>
>>>>> To: "Dennis Michael"<dennis at stanford.edu>
>>>>> Cc: "George Hilley"<hilley at stanford.edu>
>>>>> Sent: Tuesday, January 31, 2012 10:53:04 AM
>>>>> Subject: Re: Gale 2.0 in cluster
>>>>>
>>>>> Thanks Dennis!
>>>>> I will run my files this afternoon.
>>>>> Leo
>>>>>
>>>>> -------------
>>>>> Leonardo Cruz
>>>>> Dept. Geological and Environmental Sciences
>>>>> 450 Serra Mall
>>>>> Braun Hall, Building 320
>>>>> Stanford University
>>>>> Stanford, CA 94305-2115
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Dennis Michael"<dennis at stanford.edu>
>>>>> To: "Leonardo Cruz"<leocruz at stanford.edu>
>>>>> Cc: "George Hilley"<hilley at stanford.edu>
>>>>> Sent: Tuesday, January 31, 2012 10:51:07 AM
>>>>> Subject: Re: Gale 2.0 in cluster
>>>>>
>>>>>
>>>>> I have Gale-2.0.0 running.  It's been running for almost 30
>>>>> minutes, so I think I fixed the segmentation faults.  I'm not sure
>>>>> how long it will run - it will be killed after 2 hours.
>>>>>
>>>>> Output is in /data/cees/dennis/Gale.  The script 'rundb.sh' shows
>>>>> how I ran it.
>>>>>
>>>>> The code is in /usr/local/Gale-2_0_0
>>>>>
>>>>> Dennis
>>>>>
>>>>> On 1/30/2012 9:05 AM, Leonardo Cruz wrote:
>>>>>> Hi Dennis,
>>>>>> I hope you have a speedy recovery and thanks for your help!
>>>>>>
>>>>>> Leo
>>>>>>
>>>>>> -------------
>>>>>> Leonardo Cruz
>>>>>> Dept. Geological and Environmental Sciences
>>>>>> 450 Serra Mall
>>>>>> Braun Hall, Building 320
>>>>>> Stanford University
>>>>>> Stanford, CA 94305-2115
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Dennis Michael"<dennis at stanford.edu>
>>>>>> To: "Leonardo Cruz"<leocruz at stanford.edu>
>>>>>> Cc: "George Hilley"<hilley at stanford.edu>
>>>>>> Sent: Monday, January 30, 2012 8:00:33 AM
>>>>>> Subject: Re: Gale 2.0 in cluster
>>>>>>
>>>>>>
>>>>>> P.S. Sorry for the late response.  I came down with a bad cold on
>>>>>> Friday and I'm still struggling.
>>>>>>
>>>>>> Dennis
>>>>>>
>>>>>> On 1/27/2012 5:59 PM, Leonardo Cruz wrote:
>>>>>>> Hi Dennis,
>>>>>>>
>>>>>>> I installed Gale 2.0 on my local machine and ran one of the
>>>>>>> cookbook files successfully a few minutes ago. I am using the same
>>>>>>> file to test it on the cluster, but I got some errors.
>>>>>>> Any help is appreciated, as usual. I am attaching the input, job
>>>>>>> script, output, and error files to this email.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Leonardo
>>>>>>>
>>>>>>> -------------
>>>>>>> Leonardo Cruz
>>>>>>> Dept. Geological and Environmental Sciences
>>>>>>> 450 Serra Mall
>>>>>>> Braun Hall, Building 320
>>>>>>> Stanford University
>>>>>>> Stanford, CA 94305-2115
>>>>>>>
>>> -- 
>>> Dennis Michael
>>> Manager, High Productivity Technical Computing
>>> Stanford Center for Computational Earth and Environmental Science
>>> (CEES)
>>> School of Earth Sciences
>>> Stanford University
>>> 397 Panama Mall Mitchell Building room 415
>>> http://cees.stanford.edu/
>>> phone # (650) 723 2014
>>>
> 
> -- 
> Dennis Michael
> Manager, High Productivity Technical Computing
> Stanford Center for Computational Earth and Environmental Science
> (CEES)
> School of Earth Sciences
> Stanford University
> 397 Panama Mall Mitchell Building room 415
> http://cees.stanford.edu/
> phone # (650) 723 2014
> 

