[aspect-devel] Fwd: [cse.ucdavis.edu #13562] Fwd: Fwd: error when writing checkpoint files

Magali Billen mibillen at ucdavis.edu
Fri Jul 6 01:48:24 PDT 2018


Hi Timo (cc Bill Broadley),

It seems we’ve hit an incompatibility between how the cluster is set up and what the code wants to do.

I’m forwarding the e-mail from Bill Broadley. He is the IT support for my cluster (Ymir…an older cluster) 
as well as the much newer cluster that CIG uses at UC Davis, called Peleton.

Is there a way to write checkpoint files without using MPI-I/O?

Is the trickery involved in writing the checkpointing files such that I should ask Bill (cc’d on this email) to enable MPI-IO?
That is, even though Bill says it is generally incompatible with the NFS file system, should it work?

The reason I’d like checkpointing at this time is so I can run models using Newtonian rheology (which is fast) and
then restart them at different points in the evolution of the slab to test the non-linear rheology.  Of course, in the future
it would also be useful to be able to checkpoint longer runs, especially when using the non-linear rheology since they
take so much longer to run. 

Thanks for your help in figuring out what the options are at this point,
Magali



> Begin forwarded message:
> 
> From: "Bill Broadley via RT" <help at cse.ucdavis.edu>
> Subject: Re: [cse.ucdavis.edu #13562] Fwd: [aspect-devel] Fwd: error when writing checkpoint files
> Date: July 6, 2018 at 12:17:41 AM GMT+2
> To: mibillen at ucdavis.edu
> Reply-To: help at cse.ucdavis.edu
> 
> On 07/05/2018 02:26 PM, Magali Billen via RT wrote:
>> 
>> Thu Jul 05 14:26:08 2018: Request 13562 was acted upon.
>> Transaction: Ticket created by mibillen at ucdavis.edu
>>       Queue: CSE Help
>>     Subject: Fwd: [aspect-devel] Fwd: error when writing checkpoint files
>>       Owner: Nobody
>>  Requestors: mibillen at ucdavis.edu
>>          Cc: 
>> 
>>      Status: new
>> Ticket <URL: https://help.cse.ucdavis.edu/rt/Ticket/Display.html?id=13562 >
>> 
>> 
>> Hello, 
>> 
>> I’ve run into a strange error when trying to write output for a checkpoint in Aspect on Ymir. The error appears to be
>> related to MPI I/O. 
> 
> Ymir doesn't support MPI I/O.  It's generally incompatible with NFS.
> 
>> I e-mailed the developer list, and Timo has asked me some questions about write access, MPI I/O access,
>> and the type of file system that I’m writing to, which I don’t know how to answer.
> 
> Your /home is NFS mounted.  There's also /scratch available if you want to write
> checkpoints to the local disk.
> 
>> I am running the model using sbatch (and srun) and the output directory is in /home/billen.
>> I have not had any problems writing the visualization output (which also writes individual files for each node),
>> but it may be that this is handled differently for the checkpointing (I’ve asked Timo this question).
> 
> What was the job number?  What directory did you run sbatch in?  What arguments
> to sbatch, script, CPUs, etc. were used?
> 
>> Can you look at his questions below? I don’t think I have any quota or write access issues.
> 
> With a job number I can find out what nodes you landed on.
> 
>> But I don’t know how the MPI I/O access permissions work or what kind of file system is set up on Ymir.
>> Is there an option to write somewhere other than my home directory (scratch on the nodes? - would this make sense?)
> 
> Well the trick is, /scratch is faster, but if you run for a month and then a
> node dies you lose that node's /scratch.
> 
>> It would be helpful to be able to tell Timo how the cluster is set up so we can exclude “external” factors,
>> and then know whether to search for a bug in the code.
> 
> MPI-IO is generally used to increase performance on parallel filesystems like
> Lustre, Ceph, or BeeGFS.  It can lead to corruption on other filesystems.  The
> main problem is in NFS you should never write to the same file from two nodes,
> which is what MPI-IO does.
> 
> 
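
To make the distinction above concrete, here is a minimal sketch of the file-per-process pattern: each rank writes only its own file, so no file ever has two writers, and a finished file can then be copied from node-local disk to shared storage so it survives the loss of a node. This is an illustration only, not how ASPECT writes its checkpoints. The function names, the file naming scheme, the use of pickle, and the integer "rank" argument are all hypothetical, and MPI itself is left out so the example runs with the Python standard library alone.

```python
import os
import pickle
import shutil
import tempfile


def rank_checkpoint_path(directory, rank):
    """Return a file name unique to one rank (hypothetical naming scheme)."""
    return os.path.join(directory, f"checkpoint.rank{rank:05d}.pkl")


def write_rank_checkpoint(directory, rank, state):
    """Write one rank's state to its own file, so no file has two writers."""
    os.makedirs(directory, exist_ok=True)
    final_path = rank_checkpoint_path(directory, rank)
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename only after the data has been flushed, so an interrupted write
    # never leaves a truncated file under the final name.
    os.replace(tmp_path, final_path)
    return final_path


def read_rank_checkpoint(directory, rank):
    """Load the state that one rank saved earlier."""
    with open(rank_checkpoint_path(directory, rank), "rb") as f:
        return pickle.load(f)


def copy_to_shared(local_dir, shared_dir, rank):
    """Copy one rank's finished checkpoint from local disk to shared storage."""
    os.makedirs(shared_dir, exist_ok=True)
    src = rank_checkpoint_path(local_dir, rank)
    dst = rank_checkpoint_path(shared_dir, rank)
    tmp = dst + ".tmp"
    shutil.copyfile(src, tmp)
    os.replace(tmp, dst)
    return dst


if __name__ == "__main__":
    # Temporary directories stand in for a node-local disk and a shared
    # home directory; a loop over integers stands in for the MPI ranks.
    with tempfile.TemporaryDirectory() as root:
        local_dir = os.path.join(root, "scratch")
        shared_dir = os.path.join(root, "home")
        for rank in range(4):
            write_rank_checkpoint(local_dir, rank, {"rank": rank, "step": 100})
            copy_to_shared(local_dir, shared_dir, rank)
        print(sorted(os.listdir(shared_dir)))
        print(read_rank_checkpoint(shared_dir, 3))
```

The cost of this pattern is one file per rank and a restart that must use the same number of ranks (or redistribute the data itself), which is part of what a shared-file MPI-IO write avoids on a parallel file system.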

____________________________________________________________
Professor of Geophysics 
Earth & Planetary Sciences Dept., UC Davis
Davis, CA 95616
2129 Earth & Physical Sciences Bldg.
Office Phone: (530) 752-4169
http://magalibillen.faculty.ucdavis.edu

Currently on Sabbatical at Munich University (LMU)
Department of Geophysics (PST + 9 hr)

Avoid implicit bias - check before you submit: 
http://www.tomforth.co.uk/genderbias/
___________________________________________________________
