[CIG-MC] Fw: Fwd: MPI_Isend error

Eh Tan tan2 at geodynamics.org
Wed Nov 18 10:46:49 PST 2009


Hi Jinshui,

Thanks for the code. This might fix the problem.

I just checked the code and found that E->parallel.mst1, E->parallel.mst2,
and E->parallel.mst3 are not initialized anywhere in the code, so these
arrays likely contain 0 for all elements. These arrays are used as MPI
tags. When a processor sends two MPI messages to another processor, the
messages are distinguished by their tags. If the tags are the same, the
receiving processor may match the messages to the wrong receives. This is
really a race condition.
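
Below is a minimal, self-contained sketch (not CitcomCU code; the tag
values are placeholders standing in for the mst arrays) of why the tags
matter: rank 0 posts two non-blocking sends to rank 1, and only the tags
tell the two messages apart. If both sends carried the same tag, MPI would
match them to the receives purely by posting order, so in a more
complicated exchange where sender and receiver post in different orders,
data could end up in the wrong buffers.
----------------------------------------------------------------------------
/* tag_demo.c -- compile with "mpicc tag_demo.c -o tag_demo",
 * run with "mpirun -np 2 ./tag_demo". */
#include <mpi.h>
#include <stdio.h>

#define TAG_A 1   /* plays the role of E->parallel.mst1[...] */
#define TAG_B 2   /* plays the role of E->parallel.mst2[...] */

int main(int argc, char **argv)
{
    int rank;
    double a = 1.0, b = 2.0;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Two messages to the same destination; only the tags
         * distinguish them. */
        MPI_Isend(&a, 1, MPI_DOUBLE, 1, TAG_A, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(&b, 1, MPI_DOUBLE, 1, TAG_B, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    else if (rank == 1) {
        double x, y;
        /* With distinct tags, x always receives a and y always receives
         * b, no matter which message arrives first. If TAG_A == TAG_B
         * (e.g. both 0, as with the uninitialized mst arrays), matching
         * falls back to posting order alone. */
        MPI_Irecv(&x, 1, MPI_DOUBLE, 0, TAG_A, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&y, 1, MPI_DOUBLE, 0, TAG_B, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        printf("x = %g, y = %g\n", x, y);
    }

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------------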


Eh


jshhuang wrote:
> Hi, all. Sorry, the added code should be:
> ---------------------------------------------------------------------------- 
>
>     for(j = 0; j < E->parallel.nproc; j++)
>         for(i = 0; i <= E->parallel.nproc; i++)
>             {
>                 E->parallel.mst1[j][i] = 1;
>                 E->parallel.mst2[j][i] = 2;
>
>                 E->parallel.mst3[j][i] = 3;
>             }
> ----------------------------------------------------------------------------
>
> a "2" should be "3".
>
>  
>
> Good luck!
>
>  
>
> Jinshui
>
>  
>
> ----- Original Message -----
> *From:* jshhuang <mailto:jshhuang at ustc.edu.cn>
> *To:* Magali Billen <mailto:mibillen at ucdavis.edu>
> *Cc:* tan2 at geodynamics.org <mailto:tan2 at geodynamics.org> ; Shijie 
> Zhong <mailto:Shijie.Zhong at Colorado.Edu> ; cig-mc at geodynamics.org 
> <mailto:cig-mc at geodynamics.org>
> *Sent:* Wednesday, November 18, 2009 6:01 PM
> *Subject:* Re: [CIG-MC] Fwd: MPI_Isend error
>
> Hi, Magali,
>  
> You can try adding the following to the subroutine
> void parallel_domain_decomp1(struct All_variables *E) in Parallel_related.c:
> ---------------------------------------------------------------------------- 
>
>     for(j = 0; j < E->parallel.nproc; j++)
>         for(i = 0; i <= E->parallel.nproc; i++)
>             {
>                 E->parallel.mst1[j][i] = 1;
>                 E->parallel.mst2[j][i] = 2;
>
>                 E->parallel.mst2[j][i] = 3;
>             }
> ----------------------------------------------------------------------------
>
>  
>
> I'm not sure if it works, but I thought it deserves a try. This is a
> machine-dependent issue.
>
>  
>
> Good luck!
>
>  
>
>  
>
> Jinshui Huang
> ---------------------------------------
> School of Earth and Space Sciences
> University of Science and Technology of China
> Hefei, Anhui 230026, China
> 0551-3606781
> ---------------------------------------
>
>     ----- Original Message -----
>     *From:* Magali Billen <mailto:mibillen at ucdavis.edu>
>     *To:* Eh Tan <mailto:tan2 at geodynamics.org>
>     *Cc:* cig-mc at geodynamics.org <mailto:cig-mc at geodynamics.org>
>     *Sent:* Wednesday, November 18, 2009 10:23 AM
>     *Subject:* [?? Probable Spam] Re: [CIG-MC] Fwd: MPI_Isend error
>
>     Hello Eh,
>
>     This is a run on 8 processors. If I print the stack I get:
>
>     (gdb) bt
>     #0  0x00002b943e3c208a in opal_progress () from
>     /share/apps/openmpisb-1.3/gcc-4.4/lib/libopen-pal.so.0
>     #1  0x00002b943def5c85 in ompi_request_default_wait_all () from
>     /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
>     #2  0x00002b943df229d3 in PMPI_Waitall () from
>     /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
>     #3  0x0000000000427ef5 in exchange_id_d20 ()
>     #4  0x00000000004166f3 in gauss_seidel ()
>     #5  0x000000000041884b in multi_grid ()
>     #6  0x0000000000418c44 in solve_del2_u ()
>     #7  0x000000000041b151 in solve_Ahat_p_fhat ()
>     #8  0x000000000041b9a1 in solve_constrained_flow_iterative ()
>     #9  0x0000000000411ca6 in general_stokes_solver ()
>     #10 0x0000000000409c21 in main ()
>
>     I've attached the version of Parallel_related.c that is used... I
>     have not modified it in any way from the CIG release of CitcomCU.
>
>     ------------------------------------------------------------------------
>     Luckily, there are commented-out fprintf statements in just that
>     part of the code... we'll continue to dig...
>
>     Oh, and just to eliminate the new cluster from suspicion, we
>     downloaded, compiled, and ran CitcomS example1.cfg on the same
>     cluster with the same compilers, and there was no problem.
>
>     Maybe this is the sign that I'm supposed to finally switch from
>     CitcomCU to CitcomS... :-(
>     Magali
>
>     On Nov 17, 2009, at 5:02 PM, Eh Tan wrote:
>
>>     Hi Magali,
>>
>>     How many processors are you using? If more than 100 processors
>>     are used,
>>     you are seeing this bug:
>>     http://www.geodynamics.org/pipermail/cig-mc/2008-March/000080.html
>>
>>
>>     Eh
>>
>>
>>
>>     Magali Billen wrote:
>>>     One correction to the e-mail below: we've been compiling CitcomCU
>>>     using openmpi on our old cluster, so the compiler on the new
>>>     cluster is the same. The big difference is that the new cluster is
>>>     about twice as fast as the 5-year-old one. This suggests that the
>>>     change to a much faster cluster may have exposed an existing race
>>>     condition in CitcomCU??
>>>     Magali
>>>
>>>
>>>     Begin forwarded message:
>>>
>>>>     *From: *Magali Billen <mibillen at ucdavis.edu
>>>>     <mailto:mibillen at ucdavis.edu>>
>>>>     *Date: *November 17, 2009 4:23:45 PM PST
>>>>     *To: *cig-mc at geodynamics.org <mailto:cig-mc at geodynamics.org>
>>>>     *Subject: **[CIG-MC] MPI_Isend error*
>>>>
>>>>     Hello,
>>>>
>>>>     I'm using CitcomCU and am having a strange problem: the job
>>>>     either hangs (no error, it just doesn't go anywhere) or dies with
>>>>     an MPI_Isend error (see below). I seem to recall having problems
>>>>     with MPI_Isend and the lam-mpi version of MPI, but I've not had
>>>>     any problems with mpich-2. On the new cluster we are compiling
>>>>     with openmpi instead of MPICH-2.
>>>>
>>>>     The MPI_Isend error seems to occur during initialization, in the
>>>>     call to the function mass_matrix, which then calls
>>>>     exchange_node_f20, which is where the call to MPI_Isend is.
>>>>
>>>>     --snip--
>>>>     ok14: parallel shuffle element and id arrays
>>>>     ok15: construct shape functions
>>>>     [farm.caes.ucdavis.edu:27041] *** An error occurred in MPI_Isend
>>>>     [farm.caes.ucdavis.edu:27041] *** on communicator MPI_COMM_WORLD
>>>>     [farm.caes.ucdavis.edu:27041] *** MPI_ERR_RANK: invalid rank
>>>>     [farm.caes.ucdavis.edu:27041] *** MPI_ERRORS_ARE_FATAL (your
>>>>     MPI job
>>>>     will now abort)
>>>>
>>>>     Has this type of error occurred for other versions of Citcom
>>>>     using MPI_Isend (it seems that CitcomS uses this call also)? I'm
>>>>     not sure how to debug this error, especially since sometimes it
>>>>     just hangs with no error.
>>>>
>>>>     Any advice you have would be helpful,
>>>>     Magali
>>>>
>>>>
>>>>     -----------------------------
>>>>     Associate Professor, U.C. Davis
>>>>     Department of Geology/KeckCAVEs
>>>>     Physical & Earth Sciences Bldg, rm 2129
>>>>     Davis, CA 95616
>>>>     -----------------
>>>>     mibillen at ucdavis.edu <mailto:mibillen at ucdavis.edu>
>>>>     (530) 754-5696
>>>>     -----------------------------
>>>>     ** Note new e-mail, building, office
>>>>          information as of Sept. 2009 **
>>>>     -----------------------------
>>>>
>>>>     _______________________________________________
>>>>     CIG-MC mailing list
>>>>     CIG-MC at geodynamics.org <mailto:CIG-MC at geodynamics.org>
>>>>     http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>>>
>>>     -----------------------------
>>>     Associate Professor, U.C. Davis
>>>     Department of Geology/KeckCAVEs
>>>     Physical & Earth Sciences Bldg, rm 2129
>>>     Davis, CA 95616
>>>     -----------------
>>>     mibillen at ucdavis.edu <mailto:mibillen at ucdavis.edu>
>>>     (530) 754-5696
>>>     -----------------------------
>>>     ** Note new e-mail, building, office
>>>          information as of Sept. 2009 **
>>>     -----------------------------
>>>
>>>     ------------------------------------------------------------------------
>>>
>>>     _______________________________________________
>>>     CIG-MC mailing list
>>>     CIG-MC at geodynamics.org
>>>     http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>>>
>>
>>     -- 
>>     Eh Tan
>>     Staff Scientist
>>     Computational Infrastructure for Geodynamics
>>     California Institute of Technology, 158-79
>>     Pasadena, CA 91125
>>     (626) 395-1693
>>     http://www.geodynamics.org
>>
>>     _______________________________________________
>>     CIG-MC mailing list
>>     CIG-MC at geodynamics.org
>>     http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>
>     -----------------------------
>     Associate Professor, U.C. Davis
>     Department of Geology/KeckCAVEs
>     Physical & Earth Sciences Bldg, rm 2129
>     Davis, CA 95616
>     -----------------
>     mibillen at ucdavis.edu <mailto:mibillen at ucdavis.edu>
>     (530) 754-5696
>     -----------------------------
>     ** Note new e-mail, building, office
>          information as of Sept. 2009 **
>     -----------------------------
>


