[CIG-SHORT] PyLith: running MPI on a cluster

Charles Williams willic3 at gmail.com
Thu Feb 8 17:03:52 PST 2018


Hi Niloufar,

As Matt mentioned, this could be a solver issue.  For example, here are some typical settings I use for a linear problem with a fault:

[pylithapp]

# ----------------------------------------------------------------------
# PETSc
# ----------------------------------------------------------------------
# Set the solver options.
[pylithapp.problem.formulation]
# Split the solution into separate displacement and Lagrange multiplier
# fields so that each can be preconditioned separately.
split_fields = True
matrix_type = aij
# Use PyLith's custom preconditioner for the fault (Lagrange multiplier) block.
use_custom_constraint_pc = True

[pylithapp.petsc]
ksp_rtol = 1.0e-8
ksp_atol = 1.0e-20
ksp_max_it = 4000
ksp_gmres_restart = 100

ksp_monitor = true
# ksp_view = true
ksp_converged_reason = true

fs_pc_type = fieldsplit
fs_pc_use_amat = True
fs_pc_fieldsplit_type = multiplicative
# Algebraic multigrid (ML) on the displacement block and Jacobi on the
# Lagrange multiplier block, each applied once per iteration (preonly).
fs_fieldsplit_displacement_pc_type = ml
fs_fieldsplit_lagrange_multiplier_pc_type = jacobi
fs_fieldsplit_displacement_ksp_type = preonly
fs_fieldsplit_lagrange_multiplier_ksp_type = preonly

log_summary = true
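
With these settings saved in their own .cfg file (solver_fault.cfg is just an illustrative name), you would then run in parallel with something like:

  pylith yourfile.cfg solver_fault.cfg --nodes=10

Note that --nodes here is the number of MPI processes PyLith starts, not the number of cluster nodes.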


Also, I have found that it is generally not a good idea to use all the cores on a node.  I don't know what queuing system you are using, but there should be a way to request a certain number of cores for each node (see the sketch below).  Now that I know your problem size, it is actually not very large, so I'm not sure how much of a speedup you will get from using a lot of cores.  For something like this, I would try using just 10 cores on a single node and see what that does (after changing your solver settings).
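
For example, with SLURM (substitute the equivalent for your scheduler), requesting 10 cores on a single node looks something like this in the batch script:

  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=10

The PBS equivalent is something like "#PBS -l nodes=1:ppn=10".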

Cheers,
Charles

> On 9/02/2018, at 12:32 PM, Niloufar Abolfathian <niloufar.abolfathian at gmail.com> wrote:
> 
> Hi,
> 
> Thanks again for helping me with this code. This is work I am collaborating on with Chris Johnson; I have cc'd him on this email.
> As I explained, we are trying to run it for 10,000 years, but after ~1800 years the code crashes. That is why in the .cfg file I only run it for 1780 years!
> In addition, the code takes ~2 days to run. We want to run it with MPI so it is faster!
> 
> By problem size, what I mean is the size of your mesh (number of vertices and cells).  
> My mesh is three-dimensional, with 159,681 nodes and 150,000 elements.
> 
> 
> Also, the actual run log (not just the PETSc summary) would be helpful, as it shows us what is happening with convergence. 
> I am not really sure what the run log is or how I can find it.
> 
> 
> Also, did you run the problem on 24 nodes or 24 cores on the cluster?  If 24 nodes, how many cores per node?
> We tried to run it on
> i) 1 node, 24 cores, Linux server (shared memory), and
> ii) 2 nodes, 24 cores each, Linux server (MPI),
> but all of the runs took the same amount of time as running on my own Mac.
> 
> 
> If you send all of your .cfg files (including the one with your job submission information), that might help.
> I have attached a zip file including all my .cfg files and also my mesh model. If you try to run it for 2000 years, it will crash.
> 
> For running without MPI but on different nodes with 1 core each, we tried "pylith your.cfg --nodes=24".  We tried both the downloaded binaries and building from source.  No performance difference.
> 
> Best,
> Niloufar
> 
> 
> 
> On Thu, Feb 8, 2018 at 3:12 PM, Matthew Knepley <knepley at rice.edu> wrote:
> On Fri, Feb 9, 2018 at 4:58 AM, Charles Williams <willic3 at gmail.com> wrote:
> Hi Niloufar,
> 
> By problem size, what I mean is the size of your mesh (number of vertices and cells).  Also, the actual run log (not just the PETSc summary) would be helpful, as it shows us what is happening with convergence.  Also, did you run the problem on 24 nodes or 24 cores on the cluster?  If 24 nodes, how many cores per node?  If you send all of your .cfg files (including the one with your job submission information), that might help.
> 
> It is possible for 24 cores to be on a single node in a modern machine. The best thing to do would be to run the STREAMS benchmark on your compute machine, so we could see how much speedup we expect. However, the output from
> 
>   --petsc.log_view
> 
> would be an acceptable substitute.
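> 
> For example (assuming your simulation file is named mysim.cfg):
> 
>   pylith mysim.cfg --petsc.log_view >& run.log
> 
> would capture the convergence output and the PETSc log together in run.log.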
> 
>   Thanks,
> 
>      Matt
>  
> Cheers,
> Charles
> 
> 
>> On 8/02/2018, at 4:29 PM, Niloufar Abolfathian <niloufar.abolfathian at gmail.com> wrote:
>> 
>> Hi, thanks for your replies. Here are my answers to your questions. 
>> 
>> 1.  What size of problem are you running?
>> I am running a quasi-static model to simulate a vertically dipping strike-slip fault with static friction that is loaded by tectonic forces. The boundary conditions include a far-field velocity of 1 cm/yr and an initial displacement of 0.1 m applied normal to the fault surface to maintain a compressive stress on the fault. I want to run this simple model for thousands of years. The first issue is that the model gives a run-time error after ~1800 years. The second problem is that each run takes more than two days! That is why I am trying to use multiple cores, so it may run faster. From Matt's link, I understand that I should not expect the program to run faster when using multiple cores on my own Mac, but I have tried it on 24 nodes on the cluster and it took the same time as on my own Mac.
>> 
>> 2.  What solver settings are you using?
>> pylith.problems.SolverNonlinear
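>> (selected in my .cfg with something like:
>> 
>>   [pylithapp.timedependent.formulation]
>>   solver = pylith.problems.SolverNonlinear
>> )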
>> 
>> 3.  Is this a linear or nonlinear problem?
>> A nonlinear problem.
>> 
>> 4.  Is this a 2D or 3D problem?
>> A 3D problem.
>> 
>> 5.  What does the run log show?  This will include convergence information and a PETSc summary of calls, etc.
>> I did not save the PETSc summary for those runs. I made a new run for only 200 years, and the summary is attached as a text file. Here is my PETSc configuration:
>> 
>> # Set the solver options.
>> [pylithapp.petsc]
>> malloc_dump = 
>> 
>> # Preconditioner settings.
>> pc_type = asm
>> sub_pc_factor_shift_type = nonzero
>> 
>> # Convergence parameters.
>> ksp_rtol = 1.0e-8
>> ksp_atol = 1.0e-12
>> ksp_max_it = 500
>> ksp_gmres_restart = 50
>> 
>> # Linear solver monitoring options.
>> ksp_monitor = true
>> #ksp_view = true
>> ksp_converged_reason = true
>> ksp_error_if_not_converged = true
>> 
>> # Nonlinear solver monitoring options.
>> snes_rtol = 1.0e-8
>> snes_atol = 1.0e-12
>> snes_max_it = 100
>> snes_monitor = true
>> snes_linesearch_monitor = true
>> #snes_view = true
>> snes_converged_reason = true
>> snes_error_if_not_converged = true
>> 
>> 
>> Hope this information can help. Please let me know if I need to provide you with any other information.
>> 
>> Thanks,
>> Niloufar
>> 
>> 
>> 
>> On Wed, Feb 7, 2018 at 4:01 AM, Matthew Knepley <knepley at rice.edu> wrote:
>> On Wed, Feb 7, 2018 at 2:24 PM, Charles Williams <willic3 at gmail.com> wrote:
>> Dear Niloufar,
>> 
>> It is hard to diagnose your problem without more information.  Information that would be helpful includes:
>> 
>> 1.  What size of problem are you running?
>> 2.  What solver settings are you using?
>> 3.  Is this a linear or nonlinear problem?
>> 4.  Is this a 2D or 3D problem?
>> 5.  What does the run log show?  This will include convergence information and a PETSc summary of calls, etc.
>> 
>> There are probably other things it would be good to know, but this should get us started.
>> 
>> In addition to the points Charles makes, it is very useful to understand how performance is affected by architecture.
>> The advantages of multiple cores are very often oversold by vendors. Here is a useful reference:
>> 
>>   http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
>> 
>> I recommend running the streams program, which can be found in the PETSc installation.
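>> 
>> For example, from the top of the PETSc source tree (assuming PETSC_DIR points there):
>> 
>>   cd $PETSC_DIR
>>   make streams NPMAX=24
>> 
>> This measures the achievable memory bandwidth as the number of MPI processes grows, which bounds the parallel speedup you can expect.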
>> 
>>   Thanks,
>> 
>>      Matt
>>  
>> 
>> Cheers,
>> Charles
>> 
>> 
>>> On 7/02/2018, at 1:06 PM, Niloufar Abolfathian <niloufar.abolfathian at gmail.com> wrote:
>>> 
>>> Hi, 
>>> 
>>> I am trying to run my code on the cluster, but I have not seen any improvement when using multiple cores.
>>> 
>>> What I have tried:
>>> 
>>> Downloaded binaries for both Mac and Linux.  A single core and multiple cores (2 and 24 for Mac and Linux, respectively) take the same amount of time.
>>> 
>>> Compiled from source.  No speedup using either shared memory or MPI, even though the correct number of mpinemesis processes shows up on multiple nodes.
>>> 
>>> I would appreciate it if you could help me with running MPI on the cluster.
>>> 
>>> Thanks,
>>> Niloufar
>> 
>> 
>> <PETSc_log_summary.txt>
> 
> 
> <model1.zip>

Charles Williams | Geodynamic Modeler
GNS Science | Te Pū Ao
1 Fairway Drive, Avalon 5010, PO Box 30368, Lower Hutt 5040, New Zealand
Ph 0064-4-570-4566 | Mob 0064-22-350-7326 | Fax 0064-4-570-4600
http://www.gns.cri.nz/ | Email: C.Williams at gns.cri.nz

