[CIG-MC] Fwd: MPI_Isend error

Magali Billen mibillen at ucdavis.edu
Wed Nov 18 11:05:34 PST 2009


Hello Eh (and everyone else following the saga),

We (I mean Bill Broadley) determined that this issue is somehow related
to the compilers, either openmpi-1.3.3 or gcc-4.4. Bill recompiled on the
new machine using openmpi-1.3.1 and gcc-4.3, and CitcomCU is no longer
hanging.

Apparently, we are not the first to hit this issue with the compiler
(there are reports on the web about it). I will send the specifics when I
hear back from Bill - he was up until 4 am debugging, so I don't have the
details yet.

Thanks for everyone's feedback and help!
And, thanks to Jinshui for identifying the other (unknown) race  
condition.

I've recapped below as some communication was done off the CIG list.

Magali

To recap:

Symptoms:
1) A CitcomCU test run on a new cluster using 8 processors hangs on
either MPI_Isend or MPI_Waitall, but in a random way: sometimes the first
time these commands are used, sometimes not until they have been used
many times. The commands are used in functions that exchange information
between processors, so they are called during initialization when the
mass_matrix is defined, and later during the solver iterations by the
gauss_seidel function. This exact same version of
CitcomCU runs on another cluster without issues.

2) CitcomS was compiled on the same cluster with the same compilers  
and it seems to run fine.

3) On the new cluster the code was compiled with openmpi-1.3 and gcc-4.4.
On the old cluster it was compiled with openmpi-1.2.6 and gcc-4.3.

Possibilities:
1) Race condition in code - always a possibility, but probably not the
source of this issue (see the sketch after this list):
	- MPI_Isend is always paired with an MPI_Irecv and is always
followed by an MPI_Waitall.
	- In the test we were running, no markers were being used, so
E->parallel.mst1, mst2, and mst3 were not being used (although it's
certainly good to have found this problem, and I will update my code).
E->parallel.mst is used, but this array is initialized in
parallel_domain_decomp1.
	- The mst array was also big enough (100 x 100) as the test was
only on 8 processors.

2) Machine hardware (chipset + gigabit ethernet) - ugh, daunting.

3) Compilers.
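
For concreteness, the communication pattern described in possibility 1
looks roughly like the sketch below. This is a minimal, hypothetical C
example, not CitcomCU's actual exchange_id_d20/exchange_node_f20 code;
the neighbor[] array stands in for the E->parallel.mst-style lookup
tables. If such a table entry were left uninitialized, MPI_Isend would be
handed a garbage destination rank, which is one way to get the
MPI_ERR_RANK failure quoted further below.

/* Minimal sketch (hypothetical, not CitcomCU source): non-blocking
 * neighbor exchange with MPI_Isend/MPI_Irecv completed by MPI_Waitall. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void exchange_with_neighbors(double *sendbuf, double *recvbuf, int n,
                             const int *neighbor, int nneighbors,
                             MPI_Comm comm)
{
    MPI_Request *req = malloc(2 * nneighbors * sizeof(MPI_Request));
    int size, k;

    MPI_Comm_size(comm, &size);

    for (k = 0; k < nneighbors; k++) {
        /* An uninitialized or out-of-range neighbor entry would fail
         * with MPI_ERR_RANK at the Irecv/Isend below. */
        if (neighbor[k] < 0 || neighbor[k] >= size) {
            fprintf(stderr, "invalid neighbor rank %d\n", neighbor[k]);
            MPI_Abort(comm, 1);
        }
        /* Every MPI_Isend is paired with an MPI_Irecv ... */
        MPI_Irecv(&recvbuf[k * n], n, MPI_DOUBLE, neighbor[k], 0, comm,
                  &req[2 * k]);
        MPI_Isend(&sendbuf[k * n], n, MPI_DOUBLE, neighbor[k], 0, comm,
                  &req[2 * k + 1]);
    }

    /* ... and followed by an MPI_Waitall, which is where the reported
     * hang occurs. */
    MPI_Waitall(2 * nneighbors, req, MPI_STATUSES_IGNORE);
    free(req);
}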



On Nov 18, 2009, at 12:13 AM, Eh Tan wrote:

> Hi Magali,
>
> Like Shijie said, the function exchange_id_d20() in CitcomCU is very
> similar to regional_exchange_id_d() in CitcomS. I don't have an
> immediate answer why one works but the other doesn't.
>
> BTW, in your earlier email, you mentioned that the code died inside
> function mass_matrix(). In this email, the code died inside function
> gauss_seidel(). Did the code die at different places randomly?
>
> Eh
>
>
>
> Magali Billen wrote:
>> Hello Eh,
>>
>> This is a run on 8 processors. If I print the stack I get:
>>
>> (gdb) bt
>> #0  0x00002b943e3c208a in opal_progress () from
>> /share/apps/openmpisb-1.3/gcc-4.4/lib/libopen-pal.so.0
>> #1  0x00002b943def5c85 in ompi_request_default_wait_all () from
>> /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
>> #2  0x00002b943df229d3 in PMPI_Waitall () from
>> /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
>> #3  0x0000000000427ef5 in exchange_id_d20 ()
>> #4  0x00000000004166f3 in gauss_seidel ()
>> #5  0x000000000041884b in multi_grid ()
>> #6  0x0000000000418c44 in solve_del2_u ()
>> #7  0x000000000041b151 in solve_Ahat_p_fhat ()
>> #8  0x000000000041b9a1 in solve_constrained_flow_iterative ()
>> #9  0x0000000000411ca6 in general_stokes_solver ()
>> #10 0x0000000000409c21 in main ()
>>
>> I've attached the version of Parallel_related.c that is used... I have
>> not modified it in any way from the CIG release of CitcomCU.
>>
>>
>>
>> Luckily, there are commented fprintf statements in just that part of
>> the code... we'll continue to dig...
>>
>> Oh, and just to eliminate the new cluster from suspicion, we
>> downloaded, compiled, and ran CitcomS example1.cfg on the same cluster
>> with the same compilers, and there was no problem.
>>
>> Maybe this is the sign that I'm supposed to finally switch from
>> CitcomCU to CitcomS... :-(
>> Magali
>>
>> On Nov 17, 2009, at 5:02 PM, Eh Tan wrote:
>>
>>> Hi Magali,
>>>
>>> How many processors are you using? If more than 100 processors are  
>>> used,
>>> you are seeing this bug:
>>> http://www.geodynamics.org/pipermail/cig-mc/2008-March/000080.html
>>>
>>>
>>> Eh
>>>
>>>
>>>
>>> Magali Billen wrote:
>>>> One correction to the e-mail below: we've been compiling CitcomCU
>>>> using openmpi on our old cluster, so the compiler on the new cluster
>>>> is the same. The big difference is that the new cluster is about
>>>> twice as fast as the 5-year-old cluster. This suggests that the
>>>> change to a much faster cluster may have exposed an existing race
>>>> condition in CitcomCU??
>>>> Magali
>>>>
>>>>
>>>> Begin forwarded message:
>>>>
>>>>> *From: *Magali Billen <mibillen at ucdavis.edu
>>>>> <mailto:mibillen at ucdavis.edu>>
>>>>> *Date: *November 17, 2009 4:23:45 PM PST
>>>>> *To: *cig-mc at geodynamics.org <mailto:cig-mc at geodynamics.org>
>>>>> *Subject: **[CIG-MC] MPI_Isend error*
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm using CitcomCU and am having a strange problem with the code
>>>>> either hanging (no error, it just doesn't go anywhere) or dying
>>>>> with an MPI_Isend error (see below). I seem to recall having
>>>>> problems with the MPI_Isend command and the lam-mpi version of MPI,
>>>>> but I've not had any problems with MPICH-2. On the new cluster we
>>>>> are compiling with openmpi instead of MPICH-2.
>>>>>
>>>>> The MPI_Isend error seems to occur during Initialization in the  
>>>>> call
>>>>> to the function mass_matrix, which then
>>>>> calls exchange_node_f20, which is where the call to MPI_Isend is.
>>>>>
>>>>> --snip--
>>>>> ok14: parallel shuffle element and id arrays
>>>>> ok15: construct shape functions
>>>>> [farm.caes.ucdavis.edu:27041] *** An error occurred in MPI_Isend
>>>>> [farm.caes.ucdavis.edu:27041] *** on communicator MPI_COMM_WORLD
>>>>> [farm.caes.ucdavis.edu:27041] *** MPI_ERR_RANK: invalid rank
>>>>> [farm.caes.ucdavis.edu:27041] *** MPI_ERRORS_ARE_FATAL (your MPI  
>>>>> job
>>>>> will now abort)
>>>>>
>>>>> Have these types of errors occurred for other versions of Citcom
>>>>> using MPI_Isend (it seems that CitcomS uses this command also)?
>>>>> I'm not sure how to debug this error, especially since sometimes
>>>>> the code just hangs with no error.
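
One generic way to see more than the MPI_ERRORS_ARE_FATAL abort quoted
above (a sketch under stated assumptions, not something done in the
original thread) is to switch MPI_COMM_WORLD to MPI_ERRORS_RETURN and
print the error string; checked_isend() below is a hypothetical wrapper
around the MPI_Isend calls:

/* Sketch only: report MPI errors instead of aborting immediately. */
#include <mpi.h>
#include <stdio.h>

int checked_isend(void *buf, int count, MPI_Datatype type, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    int err = MPI_Isend(buf, count, type, dest, tag, comm, req);

    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_Isend to rank %d failed: %s\n", dest, msg);
    }
    return err;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Without this, MPI_ERRORS_ARE_FATAL aborts before anything useful
     * can be logged. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    /* ... call checked_isend() wherever MPI_Isend is used ... */
    MPI_Finalize();
    return 0;
}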
>>>>>
>>>>> Any advice you have would be helpful,
>>>>> Magali
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>> -- 
>>> Eh Tan
>>> Staff Scientist
>>> Computational Infrastructure for Geodynamics
>>> California Institute of Technology, 158-79
>>> Pasadena, CA 91125
>>> (626) 395-1693
>>> http://www.geodynamics.org
>>>
>>
>>
>

-----------------------------
Associate Professor, U.C. Davis
Department of Geology/KeckCAVEs
Physical & Earth Sciences Bldg, rm 2129
Davis, CA 95616
-----------------
mibillen at ucdavis.edu
(530) 754-5696
-----------------------------
** Note new e-mail, building, office
     information as of Sept. 2009 **
-----------------------------
