[CIG-MC] Fwd: MPI_Isend error

Shijie Zhong Shijie.Zhong at Colorado.Edu
Tue Nov 17 19:09:22 PST 2009


Magali,

The exchange_id_d20() routines in Parallel_related.c for CitcomCU and CitcomS are 
quite similar.
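
Both are built around posting MPI_Isend messages to the neighboring processors and then 
waiting for the whole batch to complete, which is the PMPI_Waitall frame in the backtrace 
below. Here is a rough sketch of that kind of Isend/Irecv/Waitall exchange; the function 
name, buffers, and neighbor list are made up for illustration and are not the actual 
CitcomCU code:

    /* Illustrative neighbor exchange: post a send and a receive for each
       neighboring processor, then wait for all of them at once. */
    #include <mpi.h>

    #define MAX_NEIGHBORS 32

    void exchange_sketch(double *sendbuf[], double *recvbuf[],
                         const int *count, const int *neighbor, int nneighbors)
    {
        MPI_Request req[2 * MAX_NEIGHBORS];
        MPI_Status  stat[2 * MAX_NEIGHBORS];
        int i, nreq = 0;

        for (i = 0; i < nneighbors; i++) {
            /* If neighbor[i] is outside the range of ranks in MPI_COMM_WORLD,
               MPI_Isend fails with "MPI_ERR_RANK: invalid rank", the error
               shown in the log quoted below. */
            MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, neighbor[i],
                      1, MPI_COMM_WORLD, &req[nreq++]);
            MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, neighbor[i],
                      1, MPI_COMM_WORLD, &req[nreq++]);
        }

        /* If a matching message never arrives (wrong rank or tag on the other
           side, or a dropped message), the run blocks here forever; this is
           the same place the backtrace below shows the code waiting. */
        MPI_Waitall(nreq, req, stat);
    }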

Does your new cluster use Myrinet or just Gigabit Ethernet cards? I had some MPI 
problems a couple of years ago with our then-new cluster. The problems showed up 
nearly randomly. We worked with the company that sold us the machine and 
eventually figured out that the MPI problems were caused by the chipsets and 
Gigabit Ethernet.

Shijie


Shijie Zhong
Department of Physics
University of Colorado at Boulder
Boulder, CO 80309
Tel: 303-735-5095; Fax: 303-492-7935
Web: http://anquetil.colorado.edu/szhong

---- Original message ----
>Date: Tue, 17 Nov 2009 18:23:28 -0800
>From: cig-mc-bounces at geodynamics.org (on behalf of Magali Billen <mibillen at ucdavis.edu>)
>Subject: Re: [CIG-MC] Fwd:  MPI_Isend error  
>To: Eh Tan <tan2 at geodynamics.org>
>Cc: cig-mc at geodynamics.org
>
>   Hello Eh,
>
>   This is a run on 8 processors. If I print the stack I get:
>   (gdb) bt
>   #0  0x00002b943e3c208a in opal_progress () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libopen-pal.so.0
>   #1  0x00002b943def5c85 in ompi_request_default_wait_all () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
>   #2  0x00002b943df229d3 in PMPI_Waitall () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
>   #3  0x0000000000427ef5 in exchange_id_d20 ()
>   #4  0x00000000004166f3 in gauss_seidel ()
>   #5  0x000000000041884b in multi_grid ()
>   #6  0x0000000000418c44 in solve_del2_u ()
>   #7  0x000000000041b151 in solve_Ahat_p_fhat ()
>   #8  0x000000000041b9a1 in solve_constrained_flow_iterative ()
>   #9  0x0000000000411ca6 in general_stokes_solver ()
>   #10 0x0000000000409c21 in main ()
>   I've attached the version of Parallel_related.c that is used... I have
>   not modified it in any way from the CIG release of CitcomCU.
>   Luckily, there are commented-out fprintf statements in just that part
>   of the code... we'll continue to dig...
>
>   Oh, and just to eliminate the new cluster from suspicion, we downloaded,
>   compiled, and ran the CitcomS example1.cfg on the same cluster with the
>   same compilers, and there was no problem.
>
>   Maybe this is the sign that I'm supposed to finally switch from CitcomCU
>   to CitcomS... :-(
>
>   Magali
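
Since the hang sits inside MPI_Waitall in exchange_id_d20(), one quick way to see which 
message never completes is to temporarily swap the Waitall for a polling loop that 
reports the requests that are still pending. This is a debugging sketch only; 
waitall_with_report() and the polling threshold are made up and are not part of CitcomCU:

    #include <mpi.h>
    #include <stdio.h>

    /* Drop-in replacement for MPI_Waitall while debugging a hang: poll with
       MPI_Testall and, every so many polls, print which requests have still
       not completed (MPI_Request_get_status tests without freeing anything). */
    void waitall_with_report(int nreq, MPI_Request *req, MPI_Status *stat)
    {
        int flag = 0;
        long polls = 0;

        while (!flag) {
            MPI_Testall(nreq, req, &flag, stat);
            if (!flag && ++polls % 10000000 == 0) {   /* tune the threshold */
                int i, me, done;
                MPI_Comm_rank(MPI_COMM_WORLD, &me);
                for (i = 0; i < nreq; i++) {
                    MPI_Request_get_status(req[i], &done, MPI_STATUS_IGNORE);
                    if (!done)
                        fprintf(stderr, "rank %d: request %d still pending\n",
                                me, i);
                }
            }
        }
    }

Called in place of the MPI_Waitall in exchange_id_d20(), this would at least show which 
neighbor's message is missing on which rank.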
>   On Nov 17, 2009, at 5:02 PM, Eh Tan wrote:
>
>     Hi Magali,
>
>     How many processors are you using? If more than 100 processors are
>     used, you are seeing this bug:
>     http://www.geodynamics.org/pipermail/cig-mc/2008-March/000080.html
>
>     Eh
>
>     Magali Billen wrote:
>
>       One correction to the e-mail below: we've been compiling CitcomCU
>       using openmpi on our old cluster, so the compiler on the new cluster
>       is the same. The big difference is that the new cluster is about
>       twice as fast as the 5-year-old one. This suggests that the change
>       to a much faster cluster may have exposed an existing race condition
>       in CitcomCU??
>
>       Magali
>
>       Begin forwarded message:
>
>         From: Magali Billen <mibillen at ucdavis.edu>
>         Date: November 17, 2009 4:23:45 PM PST
>         To: cig-mc at geodynamics.org
>         Subject: [CIG-MC] MPI_Isend error
>
>         Hello,
>
>         I'm using CitcomCU and am having a strange problem: runs either
>         hang (no error, they just don't go anywhere) or die with an
>         MPI_Isend error (see below). I seem to recall having problems
>         with the MPI_Isend call under the LAM/MPI version of MPI, but
>         I've not had any problems with MPICH-2. On the new cluster we
>         are compiling with openmpi instead of MPICH-2.
>
>         The MPI_Isend error seems to occur during initialization, in the
>         call to the function mass_matrix, which then calls
>         exchange_node_f20, which is where the call to MPI_Isend is.
>
>         --snip--
>
>         ok14: parallel shuffle element and id arrays
>         ok15: construct shape functions
>
>         [farm.caes.ucdavis.edu:27041] *** An error occurred in MPI_Isend
>         [farm.caes.ucdavis.edu:27041] *** on communicator MPI_COMM_WORLD
>         [farm.caes.ucdavis.edu:27041] *** MPI_ERR_RANK: invalid rank
>         [farm.caes.ucdavis.edu:27041] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
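
For reference, MPI_ERR_RANK from MPI_Isend means the destination rank handed to the call 
is not a valid rank in MPI_COMM_WORLD, which usually points at the neighbor or 
processor-topology setup rather than at openmpi itself. Below is a hypothetical check 
that could be dropped in just before the MPI_Isend calls to print the bad value before 
the job aborts; check_neighbor() is illustrative and not a CitcomCU function:

    #include <mpi.h>
    #include <stdio.h>

    /* Print and abort cleanly if a destination rank lies outside the
       communicator; call just before each MPI_Isend while debugging. */
    static void check_neighbor(int dest, const char *where)
    {
        int me, np;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        if (dest < 0 || dest >= np) {
            fprintf(stderr, "rank %d: %s: bad destination rank %d (np = %d)\n",
                    me, where, dest, np);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }

If the printed rank turns out to be, say, -1 or equal to the number of processes started, 
the neighbor list is being filled incorrectly for that processor decomposition.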
>
>         Have these types of errors occurred for other versions of Citcom
>         that use MPI_Isend? (It seems that CitcomS uses this call as
>         well.) I'm not sure how to debug this error, especially since
>         sometimes the run just hangs with no error at all.
>
>         Any advice you have would be helpful,
>
>         Magali
>
>     --
>     Eh Tan
>     Staff Scientist
>     Computational Infrastructure for Geodynamics
>     California Institute of Technology, 158-79
>     Pasadena, CA 91125
>     (626) 395-1693
>     http://www.geodynamics.org
>
>     _______________________________________________
>     CIG-MC mailing list
>     CIG-MC at geodynamics.org
>     http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc
>
>   -----------------------------
>   Associate Professor, U.C. Davis
>   Department of Geology/KeckCAVEs
>   Physical & Earth Sciences Bldg, rm 2129
>   Davis, CA 95616
>   -----------------
>   mibillen at ucdavis.edu
>   (530) 754-5696
>   -----------------------------
>   ** Note new e-mail, building, office
>       information as of Sept. 2009 **
>   -----------------------------
>_______________________________________________
>CIG-MC mailing list
>CIG-MC at geodynamics.org
>http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc

