[CIG-MC] Fwd: MPI_Isend error

jshhuang jshhuang at ustc.edu.cn
Wed Nov 18 02:01:07 PST 2009


Hi, Magali,

You can try adding the following to the subroutine void parallel_domain_decomp1(struct All_variables *E) in Parallel_related.c:
---------------------------------------------------------------------------- 
    /* force the mst arrays to small fixed values */
    for(j = 0; j < E->parallel.nproc; j++)
        for(i = 0; i <= E->parallel.nproc; i++)
            {
                E->parallel.mst1[j][i] = 1;
                E->parallel.mst2[j][i] = 2;
                E->parallel.mst3[j][i] = 3;
            }
----------------------------------------------------------------------------
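
Separately, since the abort in Magali's original message below is "MPI_ERR_RANK: invalid rank" from MPI_Isend, here is a minimal, self-contained sketch (not CitcomCU code; the 1-D neighbour pattern and the variable names are made up) of the kind of destination-rank check that reports an out-of-range neighbour instead of letting the job abort:
----------------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

/* Guarded neighbour exchange: each destination rank is checked against
   [0, nproc) before MPI_Isend, so an out-of-range neighbour is reported
   instead of aborting.  "Send right / receive left" is only an example. */
int main(int argc, char **argv)
{
    int me, nproc, dest, src;
    double out, in = -1.0;
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    out = (double)me;

    dest = me + 1;                  /* hypothetical right neighbour */
    src  = me - 1;                  /* hypothetical left neighbour  */

    if (dest >= 0 && dest < nproc)
        MPI_Isend(&out, 1, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD, &req);
    else
        fprintf(stderr, "proc %d: destination %d out of range (nproc=%d)\n",
                me, dest, nproc);

    if (src >= 0 && src < nproc)
        MPI_Recv(&in, 1, MPI_DOUBLE, src, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* no-op when req is MPI_REQUEST_NULL */
    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------------
A similar fprintf of the destination rank just before the real MPI_Isend calls in exchange_node_f20 and exchange_id_d20 would show which send triggers the failure.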



I'm not sure if this initialization will work, but I thought it deserved a try. This is a machine-dependent issue.
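
One machine-dependent ingredient is the largest message tag the MPI library accepts. As a quick check you can ask the MPI library on the new cluster for that limit; this small program is not part of CitcomCU, just an illustration using the standard MPI_TAG_UB attribute:
----------------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

/* Print this MPI implementation's upper bound on message tags.
   The MPI standard only guarantees at least 32767; the actual
   limit is implementation- and machine-dependent. */
int main(int argc, char **argv)
{
    int flag;
    int *tag_ub;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    if (flag)
        printf("MPI_TAG_UB = %d\n", *tag_ub);
    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------------
If CitcomCU ever computes a message tag larger than the value printed here, and the mst arrays above do hold those tags, forcing them to small constants is worth trying.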



Good luck!





Jinshui Huang
---------------------------------------
School of Earth and Space Sciences
University of Science and Technology of China
Hefei, Anhui 230026, China
0551-3606781
---------------------------------------

  ----- Original Message ----- 
  From: Magali Billen 
  To: Eh Tan 
  Cc: cig-mc at geodynamics.org 
  Sent: Wednesday, November 18, 2009 10:23 AM
  Subject: [?? Probable Spam] Re: [CIG-MC] Fwd: MPI_Isend error


  Hello Eh,


  This is a run on 8 processors. If I print the stack I get:


  (gdb) bt
  #0  0x00002b943e3c208a in opal_progress () from
  /share/apps/openmpisb-1.3/gcc-4.4/lib/libopen-pal.so.0
  #1  0x00002b943def5c85 in ompi_request_default_wait_all () from
  /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
  #2  0x00002b943df229d3 in PMPI_Waitall () from
  /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
  #3  0x0000000000427ef5 in exchange_id_d20 ()
  #4  0x00000000004166f3 in gauss_seidel ()
  #5  0x000000000041884b in multi_grid ()
  #6  0x0000000000418c44 in solve_del2_u ()
  #7  0x000000000041b151 in solve_Ahat_p_fhat ()
  #8  0x000000000041b9a1 in solve_constrained_flow_iterative ()
  #9  0x0000000000411ca6 in general_stokes_solver ()
  #10 0x0000000000409c21 in main ()



  I've attached the version of Parallel_related.c that is used... I have not modified it in any way
  from the CIG release of CitcomCU.




------------------------------------------------------------------------------


  Luckily, there are commented-out fprintf statements in just that part of the code... we'll continue to dig...


  Oh, and just to eliminate the new cluster from suspicion, we downloaded, compiled and ran CitcomS
  example1.cfg on the same cluster with the same compilers, and there was no problem.


  Maybe this is the sign that I'm supposed to finally switch from CitcomCU to CitcomS... :-(
  Magali


  On Nov 17, 2009, at 5:02 PM, Eh Tan wrote:


    Hi Magali,

    How many processors are you using? If more than 100 processors are used,
    you are seeing this bug:
    http://www.geodynamics.org/pipermail/cig-mc/2008-March/000080.html


    Eh



    Magali Billen wrote:

      One correction to the e-mail below: we've been compiling CitcomCU using openmpi on our old
      cluster, so the compiler on the new cluster is the same. The big difference is that the new cluster
      is about twice as fast as the 5-year-old one. This suggests that the change to a much faster
      cluster may have exposed an existing race condition in CitcomCU??
      Magali





      Begin forwarded message:



        From: Magali Billen <mibillen at ucdavis.edu>
        Date: November 17, 2009 4:23:45 PM PST
        To: cig-mc at geodynamics.org
        Subject: [CIG-MC] MPI_Isend error



        Hello,



        I'm using CitcomCU and am having a strange problem: the run either hangs (no error, it just
        doesn't go anywhere) or dies with an MPI_Isend error (see below). I seem to recall having
        problems with the MPI_Isend command and the lam-mpi version of MPI, but I've not had any
        problems with MPICH-2.
        On the new cluster we are compiling with openmpi instead of MPICH-2.



        The MPI_Isend error seems to occur during Initialization in the call to the function
        mass_matrix, which then calls exchange_node_f20, which is where the call to MPI_Isend is.



        --snip--
        ok14: parallel shuffle element and id arrays
        ok15: construct shape functions
        [farm.caes.ucdavis.edu:27041] *** An error occurred in MPI_Isend
        [farm.caes.ucdavis.edu:27041] *** on communicator MPI_COMM_WORLD
        [farm.caes.ucdavis.edu:27041] *** MPI_ERR_RANK: invalid rank
        [farm.caes.ucdavis.edu:27041] *** MPI_ERRORS_ARE_FATAL (your MPI job
        will now abort)



        Have these types of errors occurred for other versions of Citcom using MPI_Isend (it seems
        that CitcomS uses this command also)? I'm not sure how to debug this error, especially since
        sometimes it just hangs with no error.

        Any advice you have would be helpful,
        Magali





        -----------------------------
        Associate Professor, U.C. Davis
        Department of Geology/KeckCAVEs
        Physical & Earth Sciences Bldg, rm 2129
        Davis, CA 95616
        -----------------
        mibillen at ucdavis.edu
        (530) 754-5696
        -----------------------------
        ** Note new e-mail, building, office
            information as of Sept. 2009 **
        -----------------------------



        _______________________________________________
        CIG-MC mailing list
        CIG-MC at geodynamics.org
        http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc



      -----------------------------
      Associate Professor, U.C. Davis
      Department of Geology/KeckCAVEs
      Physical & Earth Sciences Bldg, rm 2129
      Davis, CA 95616
      -----------------
      mibillen at ucdavis.edu
      (530) 754-5696
      -----------------------------
      ** Note new e-mail, building, office
          information as of Sept. 2009 **
      -----------------------------



      ------------------------------------------------------------------------



      _______________________________________________
      CIG-MC mailing list
      CIG-MC at geodynamics.org
      http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc




    -- 
    Eh Tan
    Staff Scientist
    Computational Infrastructure for Geodynamics
    California Institute of Technology, 158-79
    Pasadena, CA 91125
    (626) 395-1693
    http://www.geodynamics.org

    _______________________________________________
    CIG-MC mailing list
    CIG-MC at geodynamics.org
    http://geodynamics.org/cgi-bin/mailman/listinfo/cig-mc



  -----------------------------
  Associate Professor, U.C. Davis
  Department of Geology/KeckCAVEs
  Physical & Earth Sciences Bldg, rm 2129
  Davis, CA 95616
  -----------------
  mibillen at ucdavis.edu
  (530) 754-5696
  -----------------------------
  ** Note new e-mail, building, office
      information as of Sept. 2009 **
  -----------------------------





