[CIG-MC] Fwd: MPI_Isend error
Shijie Zhong
Shijie.Zhong at Colorado.Edu
Wed Nov 18 10:16:07 PST 2009
Hi Jinshui,
Interesting. Did you have similar problems before? These arrays store MPI
messages, and they are rarely used for anything. Perhaps Magali can give it a try
to see whether this would fix the problems.
Since Magali's problems seem to occur almost at random, much like the
trouble I had a couple of years ago, I suspect the hardware (newer chipsets
and gigabit Ethernet cards). In our case, the computer company eventually
replaced the gigabit Ethernet cards with Myrinet, and everything has been
fine since. Given that CPUs are much faster these days, it may be more
cost-effective to buy Myrinet or InfiniBand for Citcom codes.
Shijie
Shijie Zhong
Department of Physics
University of Colorado at Boulder
Boulder, CO 80309
Tel: 303-735-5095; Fax: 303-492-7935
Web: http://anquetil.colorado.edu/szhong
---- Original message ----
>Date: Wed, 18 Nov 2009 18:01:07 +0800
>From: "jshhuang" <jshhuang at ustc.edu.cn>
>Subject: Re: [CIG-MC] Fwd: MPI_Isend error
>To: "Magali Billen" <mibillen at ucdavis.edu>
>Cc: <tan2 at geodynamics.org>, "Shijie Zhong" <Shijie.Zhong at Colorado.Edu>, <cig-mc at geodynamics.org>
>
> Hi, Magali,
>
> You can try adding the following to the function
> void parallel_domain_decomp1(struct All_variables *E) in Parallel_related.c:
> ----------------------------------------------------------------------------
>
> for(j = 0; j < E->parallel.nproc; j++)
>     for(i = 0; i <= E->parallel.nproc; i++)
>     {
>         E->parallel.mst1[j][i] = 1;
>         E->parallel.mst2[j][i] = 2;
>         E->parallel.mst3[j][i] = 3;
>     }
> ----------------------------------------------------------------------------
>
>
>
> I'm not sure whether it will work, but I thought it deserves a
> try. This is a machine-dependent issue.
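For anyone who wants to try the idea outside CitcomCU, here is a minimal, self-contained sketch of the same initialization. The names struct tag_tables, MAX_PROC and init_tag_tables are hypothetical, the third table mst3 is assumed, and the fixed table size of 100 is an assumption rather than something taken from the CitcomCU headers. It only illustrates giving every entry a known value up front and refusing a process count that would overrun the tables.

```c
#include <stddef.h>

#define MAX_PROC 100 /* assumed table dimension; hypothetical name */

/* Hypothetical stand-in for the message tables kept in E->parallel. */
struct tag_tables {
    int mst1[MAX_PROC][MAX_PROC + 1];
    int mst2[MAX_PROC][MAX_PROC + 1];
    int mst3[MAX_PROC][MAX_PROC + 1];
};

/* Fill rows 0..nproc-1 and columns 0..nproc with fixed values,
 * mirroring the loop bounds of the suggested patch.
 * Returns 0 on success, -1 if the arguments would overrun the tables. */
int init_tag_tables(struct tag_tables *t, int nproc)
{
    int i, j;

    if (t == NULL || nproc < 1 || nproc > MAX_PROC)
        return -1;

    for (j = 0; j < nproc; j++)
        for (i = 0; i <= nproc; i++) {
            t->mst1[j][i] = 1;
            t->mst2[j][i] = 2;
            t->mst3[j][i] = 3;
        }
    return 0;
}
```

Calling init_tag_tables(&t, 8) fills rows 0 to 7 and columns 0 to 8; asking for 101 processes returns -1 instead of writing past the end of the tables.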
>
>
>
> Good luck!
>
>
>
>
>
> Jinshui Huang
> ---------------------------------------
> School of Earth and Space Sciences
> University of Science and Technology of China
> Hefei, Anhui 230026, China
> 0551-3606781
> ---------------------------------------
>
> ----- Original Message -----
> From: Magali Billen
> To: Eh Tan
> Cc: cig-mc at geodynamics.org
> Sent: Wednesday, November 18, 2009 10:23 AM
> Subject: [?? Probable Spam] Re: [CIG-MC] Fwd: MPI_Isend error
> Hello Eh,
> This is a run on 8 processors. If I print the stack I get:
> (gdb) bt
> #0 0x00002b943e3c208a in opal_progress () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libopen-pal.so.0
> #1 0x00002b943def5c85 in ompi_request_default_wait_all () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
> #2 0x00002b943df229d3 in PMPI_Waitall () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
> #3 0x0000000000427ef5 in exchange_id_d20 ()
> #4 0x00000000004166f3 in gauss_seidel ()
> #5 0x000000000041884b in multi_grid ()
> #6 0x0000000000418c44 in solve_del2_u ()
> #7 0x000000000041b151 in solve_Ahat_p_fhat ()
> #8 0x000000000041b9a1 in solve_constrained_flow_iterative ()
> #9 0x0000000000411ca6 in general_stokes_solver ()
> #10 0x0000000000409c21 in main ()
> I've attached the version of Parallel_related.c
> that is used... I have not modified this in any way
> from the CIG release of CitcomCU.
>
> ------------------------------------------------
>
> Luckily, there are commented-out fprintf statements in
> just that part of the code... we'll continue to
> dig...
> Oh, and just to eliminate the new cluster from
> suspicion, we downloaded, compiled and ran CitcomS
> example1.cfg on the same cluster with the same
> compilers, and there was no problem.
> Maybe this is the sign that I'm supposed to finally
> switch from CitcomCU to CitcomS... :-(
> Magali
> On Nov 17, 2009, at 5:02 PM, Eh Tan wrote:
>
> Hi Magali,
>
> How many processors are you using? If more than 100 processors are used,
> you are seeing this bug:
> http://www.geodynamics.org/pipermail/cig-mc/2008-March/000080.html
>
> Eh
>
> Magali Billen wrote:
>
> One correction to the e-mail below: we've been compiling CitcomCU using
> openmpi on our old cluster, so the compiler on the new cluster is the
> same. The big difference is that the new cluster is about twice as fast
> as the 5-year-old cluster. This suggests that the change to a much faster
> cluster may have exposed an existing race condition in CitcomCU??
>
> Magali
>
> Begin forwarded message:
>
> From: Magali Billen <mibillen at ucdavis.edu>
> Date: November 17, 2009 4:23:45 PM PST
> To: cig-mc at geodynamics.org
> Subject: [CIG-MC] MPI_Isend error
>
> Hello,
>
> I'm using CitcomCU and am having a strange problem with the run either
> hanging (no error, just doesn't go anywhere) or dying with an MPI_Isend
> error (see below). I seem to recall having problems with the MPI_Isend
> command and the lam-mpi version of MPI, but I've not had any problems
> with mpich-2. On the new cluster we are compiling with openmpi instead of
> MPICH-2.
>
> The MPI_Isend error seems to occur during initialization in the call to
> the function mass_matrix, which then calls exchange_node_f20, which is
> where the call to MPI_Isend is.
>
> --snip--
> ok14: parallel shuffle element and id arrays
> ok15: construct shape functions
> [farm.caes.ucdavis.edu:27041] *** An error occurred in MPI_Isend
> [farm.caes.ucdavis.edu:27041] *** on communicator MPI_COMM_WORLD
> [farm.caes.ucdavis.edu:27041] *** MPI_ERR_RANK: invalid rank
> [farm.caes.ucdavis.edu:27041] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
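
MPI_ERR_RANK means the destination rank handed to the call was not a valid rank in the communicator. One cheap way to narrow this down, sketched here under the assumption that the neighbor ranks sit in a plain int array, is to validate them before any send is posted. The function name first_invalid_rank and its parameters are hypothetical and not part of CitcomCU or MPI; in a real run comm_size would come from MPI_Comm_size.

```c
/* Return the index of the first entry in neighbors[0..n-1] that is not
 * a usable destination rank, i.e. lies outside [0, comm_size).
 * Return -1 when every entry is in range. */
int first_invalid_rank(const int *neighbors, int n, int comm_size)
{
    int k;

    for (k = 0; k < n; k++)
        if (neighbors[k] < 0 || neighbors[k] >= comm_size)
            return k;
    return -1;
}
```

Printing the offending index and value on each process just before the exchange might show whether the bad rank comes from an uninitialized table entry, which would fit a failure that appears only on some runs.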
>
> Has this type of error occurred for other versions of Citcom using
> MPI_Isend (it seems that CitcomS uses this command also)? I'm not sure
> how to debug this error, especially since sometimes it just hangs with
> no error.
>
> Any advice you have would be helpful,
>
> Magali
>
> -----------------------------
> Associate Professor, U.C. Davis
> Department of Geology/KeckCAVEs
> Physical & Earth Sciences Bldg, rm 2129
> Davis, CA 95616
> -----------------
> mibillen at ucdavis.edu
> (530) 754-5696
> -----------------------------
> ** Note new e-mail, building, office
> information as of Sept. 2009 **
> -----------------------------
>
> _______________________________________________
> CIG-MC mailing list
> CIG-MC at geodynamics.org
> http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>
>
> --
> Eh Tan
> Staff Scientist
> Computational Infrastructure for Geodynamics
> California Institute of Technology, 158-79
> Pasadena, CA 91125
> (626) 395-1693
> http://www.geodynamics.org
>