[CIG-MC] Fw: Fwd: MPI_Isend error
Eh Tan
tan2 at geodynamics.org
Wed Nov 18 10:46:49 PST 2009
Hi Jinshui,
Thanks for the code. This might fix the problem.
I just checked the code and found that E->parallel.mst1, E->parallel.mst2,
and E->parallel.mst3 are not initialized anywhere in the code, so these
arrays most likely contain 0 for every element. These arrays are used as
MPI tags. When a processor sends two MPI messages to another processor,
the messages are distinguished by their tags; if the tags are the same,
the receiving processor can mix them up. This is really a race condition.
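For illustration, here is a minimal sketch (not CitcomCU code; the function and
buffer names are made up) of why the tags matter. MPI matches a receive to a
send by (source, tag, communicator), so giving each kind of message its own tag
is what keeps two messages between the same pair of processors from being
delivered to the wrong receive.
----------------------------------------------------------------------------

#include <mpi.h>

/* Sketch only: post two nonblocking sends to the same destination.
 * If both sends carried the same tag, a receive intended for the second
 * buffer could match the first message; distinct tags (1 and 2) remove
 * that ambiguity. */
static void send_two_fields(double *a, double *b, int n, int dest)
{
    MPI_Request req[2];

    MPI_Isend(a, n, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(b, n, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
----------------------------------------------------------------------------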
Eh
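For reference, here is a self-contained version of the initialization Jinshui
suggests in the quoted messages below. This is only a sketch: it assumes that
mst1, mst2, and mst3 are 2-D int arrays inside struct All_variables (as the
discussion above implies), the loop bounds simply mirror Jinshui's snippet, and
the header named in the include is an assumption about the CitcomCU layout.
----------------------------------------------------------------------------

#include "global_defs.h"  /* assumed CitcomCU header declaring struct All_variables */

/* Sketch only: give each tag array a distinct, nonzero value so that the
 * different kinds of exchange messages carry different MPI tags. */
void init_message_tags(struct All_variables *E)
{
    int i, j;

    for (j = 0; j < E->parallel.nproc; j++)
        for (i = 0; i <= E->parallel.nproc; i++) {
            E->parallel.mst1[j][i] = 1;
            E->parallel.mst2[j][i] = 2;
            E->parallel.mst3[j][i] = 3;
        }
}
----------------------------------------------------------------------------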
jshhuang wrote:
> Hi, all. Sorry, the added code should be:
> ----------------------------------------------------------------------------
>
> for(j = 0; j < E->parallel.nproc; j++)
>     for(i = 0; i <= E->parallel.nproc; i++)
>     {
>         E->parallel.mst1[j][i] = 1;
>         E->parallel.mst2[j][i] = 2;
>         E->parallel.mst3[j][i] = 3;
>     }
> ----------------------------------------------------------------------------
>
> That is, in the code I sent earlier, the third assignment should set mst3,
> not mst2 again: the "2" should be "3".
>
>
>
> Good luck!
>
>
>
> Jinshui
>
>
>
> ----- Original Message -----
> *From:* jshhuang <mailto:jshhuang at ustc.edu.cn>
> *To:* Magali Billen <mailto:mibillen at ucdavis.edu>
> *Cc:* tan2 at geodynamics.org <mailto:tan2 at geodynamics.org> ; Shijie
> Zhong <mailto:Shijie.Zhong at Colorado.Edu> ; cig-mc at geodynamics.org
> <mailto:cig-mc at geodynamics.org>
> *Sent:* Wednesday, November 18, 2009 6:01 PM
> *Subject:* Re: [CIG-MC] Fwd: MPI_Isend error
>
> Hi, Magali,
>
> You can try adding the following to the subroutine void
> parallel_domain_decomp1(struct All_variables *E) in Parallel_related.c:
> ----------------------------------------------------------------------------
>
> for(j = 0; j < E->parallel.nproc; j++)
>     for(i = 0; i <= E->parallel.nproc; i++)
>     {
>         E->parallel.mst1[j][i] = 1;
>         E->parallel.mst2[j][i] = 2;
>         E->parallel.mst2[j][i] = 3;
>     }
> ----------------------------------------------------------------------------
>
>
>
> I'm not sure if it works, but I thought it was worth a try. This is a
> machine-dependent issue.
>
>
>
> Good luck!
>
>
>
>
>
> Jinshui Huang
> ---------------------------------------
> School of Earth and Space Sciences
> University of Science and Technology of China
> Hefei, Anhui 230026, China
> 0551-3606781
> ---------------------------------------
>
> ----- Original Message -----
> *From:* Magali Billen <mailto:mibillen at ucdavis.edu>
> *To:* Eh Tan <mailto:tan2 at geodynamics.org>
> *Cc:* cig-mc at geodynamics.org <mailto:cig-mc at geodynamics.org>
> *Sent:* Wednesday, November 18, 2009 10:23 AM
> *Subject:* [?? Probable Spam] Re: [CIG-MC] Fwd: MPI_Isend error
>
> Hello Eh,
>
> This is a run on 8 processors. If I print the stack I get:
>
> (gdb) bt
> #0 0x00002b943e3c208a in opal_progress () from
> /share/apps/openmpisb-1.3/gcc-4.4/lib/libopen-pal.so.0
> #1 0x00002b943def5c85 in ompi_request_default_wait_all () from
> /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
> #2 0x00002b943df229d3 in PMPI_Waitall () from
> /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
> #3 0x0000000000427ef5 in exchange_id_d20 ()
> #4 0x00000000004166f3 in gauss_seidel ()
> #5 0x000000000041884b in multi_grid ()
> #6 0x0000000000418c44 in solve_del2_u ()
> #7 0x000000000041b151 in solve_Ahat_p_fhat ()
> #8 0x000000000041b9a1 in solve_constrained_flow_iterative ()
> #9 0x0000000000411ca6 in general_stokes_solver ()
> #10 0x0000000000409c21 in main ()
>
> I've attached the version of Parallel_related.c that is used... I have not
> modified it in any way from the CIG release of CitcomCU.
>
> ------------------------------------------------------------------------
> Luckily, there are commented-out fprintf statements in just that part of
> the code... we'll continue to dig...
>
> Oh, and just to eliminate the new cluster from suspicion, we downloaded,
> compiled, and ran the CitcomS example1.cfg on the same cluster with the
> same compilers, and there was no problem.
>
> Maybe this is a sign that I'm supposed to finally switch from CitcomCU to
> CitcomS... :-(
> Magali
>
> On Nov 17, 2009, at 5:02 PM, Eh Tan wrote:
>
>> Hi Magali,
>>
>> How many processors are you using? If more than 100 processors
>> are used,
>> you are seeing this bug:
>> http://www.geodynamics.org/pipermail/cig-mc/2008-March/000080.html
>>
>>
>> Eh
>>
>>
>>
>> Magali Billen wrote:
>>> One correction to the e-mail below: we've been compiling CitcomCU using
>>> openmpi on our old cluster, so the compiler on the new cluster is the same.
>>> The big difference is that the new cluster is about twice as fast as the
>>> 5-year-old cluster. This suggests that the change to a much faster cluster
>>> may have exposed an existing race condition in CitcomCU?
>>> Magali
>>>
>>>
>>> Begin forwarded message:
>>>
>>>> *From: *Magali Billen <mibillen at ucdavis.edu
>>>> <mailto:mibillen at ucdavis.edu>>
>>>> *Date: *November 17, 2009 4:23:45 PM PST
>>>> *To: *cig-mc at geodynamics.org <mailto:cig-mc at geodynamics.org>
>>>> *Subject: **[CIG-MC] MPI_Isend error*
>>>>
>>>> Hello,
>>>>
>>>> I'm using CitcomCU and am having a strange problem: the run either hangs
>>>> (no error, it just doesn't go anywhere) or dies with an MPI_Isend error
>>>> (see below). I seem to recall having problems with the MPI_Isend call and
>>>> the lam-mpi version of MPI, but I've not had any problems with mpich-2.
>>>> On the new cluster we are compiling with openmpi instead of MPICH-2.
>>>>
>>>> The MPI_Isend error seems to occur during initialization, in the call to
>>>> the function mass_matrix, which then calls exchange_node_f20, where the
>>>> call to MPI_Isend is.
>>>>
>>>> --snip--
>>>> ok14: parallel shuffle element and id arrays
>>>> ok15: construct shape functions
>>>> [farm.caes.ucdavis.edu:27041] *** An error occurred in MPI_Isend
>>>> [farm.caes.ucdavis.edu:27041] *** on communicator MPI_COMM_WORLD
>>>> [farm.caes.ucdavis.edu:27041] *** MPI_ERR_RANK: invalid rank
>>>> [farm.caes.ucdavis.edu:27041] *** MPI_ERRORS_ARE_FATAL (your
>>>> MPI job
>>>> will now abort)
>>>>
>>>> Has this type of error occurred for other versions of Citcom that use
>>>> MPI_Isend? (It seems that CitcomS uses this call as well.) I'm not sure
>>>> how to debug this error, especially since sometimes the run just hangs
>>>> with no error at all.
>>>>
>>>> Any advice you have would be helpful,
>>>> Magali
>>>>
>>>>
>>>> -----------------------------
>>>> Associate Professor, U.C. Davis
>>>> Department of Geology/KeckCAVEs
>>>> Physical & Earth Sciences Bldg, rm 2129
>>>> Davis, CA 95616
>>>> -----------------
>>>> mibillen at ucdavis.edu <mailto:mibillen at ucdavis.edu>
>>>> (530) 754-5696
>>>> *-----------------------------*
>>>> *** Note new e-mail, building, office*
>>>> * information as of Sept. 2009 ***
>>>> -----------------------------
>>>>
>>>> _______________________________________________
>>>> CIG-MC mailing list
>>>> CIG-MC at geodynamics.org <mailto:CIG-MC at geodynamics.org>
>>>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>>>
>>> -----------------------------
>>> Associate Professor, U.C. Davis
>>> Department of Geology/KeckCAVEs
>>> Physical & Earth Sciences Bldg, rm 2129
>>> Davis, CA 95616
>>> -----------------
>>> mibillen at ucdavis.edu <mailto:mibillen at ucdavis.edu>
>>> (530) 754-5696
>>> *-----------------------------*
>>> *** Note new e-mail, building, office*
>>> * information as of Sept. 2009 ***
>>> -----------------------------
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> CIG-MC mailing list
>>> CIG-MC at geodynamics.org
>>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>>>
>>
>> --
>> Eh Tan
>> Staff Scientist
>> Computational Infrastructure for Geodynamics
>> California Institute of Technology, 158-79
>> Pasadena, CA 91125
>> (626) 395-1693
>> http://www.geodynamics.org
>>
>> _______________________________________________
>> CIG-MC mailing list
>> CIG-MC at geodynamics.org
>> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>
> -----------------------------
> Associate Professor, U.C. Davis
> Department of Geology/KeckCAVEs
> Physical & Earth Sciences Bldg, rm 2129
> Davis, CA 95616
> -----------------
> mibillen at ucdavis.edu <mailto:mibillen at ucdavis.edu>
> (530) 754-5696
> *-----------------------------*
> *** Note new e-mail, building, office*
> * information as of Sept. 2009 ***
> -----------------------------
>
> ------------------------------------------------------------------------
> _______________________________________________
> CIG-MC mailing list
> CIG-MC at geodynamics.org
> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>
More information about the CIG-MC mailing list