[aspect-devel] [cse.ucdavis.edu #13562] Fwd: Fwd: error when writing checkpoint files

Magali Billen mibillen at ucdavis.edu
Sat Jul 7 05:45:23 PDT 2018


Hi All,

Thanks for all the information.  It seems to me that I won’t be using checkpointing in ASPECT in the near term.

I am surprised that no other users have come across this issue… it suggests that very few people are using
their own small clusters to run models with ASPECT, or if they do, they don’t use checkpointing, which seems strange.
Is this true? Or do other people use smaller clusters and somehow make this work?  

I’ll be at CIDER the last two weeks of July, and I’ll try to talk to Rene in person about this issue to understand
more about what options might exist.  Since this is all handled by other libraries (p4est), there may be no real option.  I
don’t feel like I have the expertise or experience with ASPECT to wade into this on my own. Maybe after talking with Rene,
we can see about trying to compile p4est with MPI-IO and see what happens. 

In the meantime I’ll move ahead without checkpointing. However, I’m already looking at running 2D models
with Newtonian rheology that will need 3-5 days to run on my cluster using 128 processors, 
so I already see a need for this to avoid starting over if, for example, a node crashes. With non-Newtonian rheology,
it will be an even bigger problem for me.

Magali



> On Jul 7, 2018, at 9:08 AM, Bill Broadley <bill at cse.ucdavis.edu> wrote:
> 
> On 07/06/2018 08:23 PM, Wolfgang Bangerth wrote:
>> 
>> Magali & Bill,
>> 
>>> Is there a way to write checkpointing files without using MPI-IO?
>>> 
>>> Is the trickery involved in writing the checkpointing files such that I should
>>> ask Bill (cc’d on this email) to enable MPI-IO?
>>> That is, even though Bill says it is generally incompatible with the NFS file
>>> system, should it work?
>> 
>> I don't know whether we really want to support systems that don't have MPI-IO.
> 
> The vast majority of clusters don't support MPI-IO, which is why its support
> is so limited and off by default in many libraries, like say HDF.
> 
> Sure, national-lab-level clusters that use Lustre, Ceph, and BeeGFS support MPI-IO.
> 
>> You're the first person to report a cluster where this doesn't work. I have no
>> idea how MPI-IO is internally implemented (e.g., whether really every processor
>> opens the same file at the same time, using file system support; or whether all
>> MPI processes send their data to one process that then does the write), but the
>> only way to achieve scalability is to use MPI-IO.
> 
> MPI-IO allows multiple nodes to arrange access to stripes of a file to allow
> reading/writing in parallel.  But such storage systems start becoming reasonable
> at $100k just for the storage.  Typically they include things like 8 storage
> arrays, each doubly connected to two servers.  The array of 16 machines is the
> block store, then you buy a few other machines to be the metadata servers,
> typically with tons of RAM and SSDs.  If building a few-$million cluster, it's
> definitely the way to go.
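> 
> For anyone who hasn't seen the API, the collective-write pattern at issue
> looks roughly like this (an untested sketch of plain MPI-IO, not ASPECT's
> actual checkpoint code):
> 
>   #include <mpi.h>
>   #include <vector>
> 
>   int main(int argc, char **argv)
>   {
>     MPI_Init(&argc, &argv);
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>     // Each rank owns one contiguous slice of the checkpoint data.
>     const int n = 1024;
>     std::vector<double> data(n, rank);
> 
>     // Every rank opens the *same* file collectively...
>     MPI_File fh;
>     MPI_File_open(MPI_COMM_WORLD, "checkpoint.bin",
>                   MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
> 
>     // ...and writes its slice at its own offset, in parallel.  These
>     // concurrent writes to one shared file are exactly what NFS handles
>     // badly and what Lustre-style filesystems stripe across servers.
>     const MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
>     MPI_File_write_at_all(fh, offset, data.data(), n, MPI_DOUBLE,
>                           MPI_STATUS_IGNORE);
> 
>     MPI_File_close(&fh);
>     MPI_Finalize();
>     return 0;
>   }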
> 
> The minimum reasonable config is somewhere around one metadata server, 2 storage
> arrays, and 4 object stores.  Using our standard building blocks, that would be
> somewhere around 1.2PB of storage.
> 
> The most popular clusters are of course much smaller and have one to a few
> file servers, most often running NFS.  That's the typical default install for any
> cluster software I've seen, like Rocks, Warewulf, etc.
> 
> Is ASPECT really going to target $0.5M clusters and up?  Lustre manages
> this; it used to require a license to stay current, but now the licensing has changed.
> At the last HPC meeting I went to, I talked to a group of 6 or so faculty who had used
> Lustre, and the related horror stories consumed the first part of the meeting.
> 
> Seems kind of strange to write checkpoints to the slower central storage that's
> very expensive while ignoring the local dedicated disk that's not shared.
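> 
> Per-rank writes to node-local disk need no shared-filesystem coordination
> at all.  Roughly (again an untested sketch; "/scratch" stands in for
> whatever node-local path a cluster actually provides):
> 
>   #include <mpi.h>
>   #include <cstdio>
> 
>   int main(int argc, char **argv)
>   {
>     MPI_Init(&argc, &argv);
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>     // Each rank writes its own file on the node's local disk, so no
>     // two processes ever touch the same file.
>     char path[256];
>     std::snprintf(path, sizeof(path), "/scratch/checkpoint.%04d.bin", rank);
> 
>     double state[4] = {0.0, 1.0, 2.0, 3.0};  // stand-in for real data
>     std::FILE *f = std::fopen(path, "wb");
>     if (f != NULL)
>       {
>         std::fwrite(state, sizeof(double), 4, f);
>         std::fclose(f);
>       }
> 
>     MPI_Finalize();
>     return 0;
>   }
> 
> The catch is the restart: the job has to land on the same nodes again, or
> the per-rank files have to be gathered onto shared storage first.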
> 
> I believe just about every ASPECT run ever done at Davis has been without MPI-IO.
> It's not a learning issue; it's just that parallel filesystems are very
> expensive and haven't so far been justified.  I wouldn't rule them out; we just
> haven't had a big enough chunk of funding to spend at a single time with an
> I/O-heavy workload in mind.
> 
>> So what I'm trying to say is that our preference would be for your clusters to
>> learn how to use MPI-IO :-)
> 
> Seems pretty silly to talk about scaling when ultimately many clusters only have
> a single file server.

____________________________________________________________
Professor of Geophysics 
Earth & Planetary Sciences Dept., UC Davis
Davis, CA 95616
2129 Earth & Physical Sciences Bldg.
Office Phone: (530) 752-4169
http://magalibillen.faculty.ucdavis.edu

Currently on Sabbatical at Munich University (LMU)
Department of Geophysics (PST + 9 hr)

Avoid implicit bias - check before you submit: 
http://www.tomforth.co.uk/genderbias/
___________________________________________________________
