[aspect-devel] Missing 'restart.resume' files

Juliane Dannberg judannberg at gmail.com
Mon Jan 22 17:49:09 PST 2018


Hi Matt,

Sometimes the file is simply not written correctly because the job is 
cancelled at the exact moment it is writing the file (I've had the same 
problem before).

What you can do to restart your old runs in this case is to rename _all_ 
of the "old" restart files. There should be 3 of them:
restart.resume.z.old
restart.mesh.info.old
restart.mesh.old

If you remove the ".old" suffix from all three file names (overwriting 
the current restart files), you should be able to restart your 
computation from the point at which the old restart files were written.
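
In case it is useful, the renaming can be sketched as a small Python 
script. This is only an illustration, not part of ASPECT: the function 
name and the directory you pass in are hypothetical, and I am assuming 
that the three files listed above sit together in your run's output 
directory. The sketch copies instead of renaming, so the ".old" files 
stay around as a backup, and it refuses to touch anything unless all 
three ".old" files are present.

```python
import shutil
from pathlib import Path

# The three checkpoint file names listed above (without the ".old" suffix).
RESTART_FILES = ["restart.resume.z", "restart.mesh.info", "restart.mesh"]


def restore_old_checkpoint(output_dir):
    """Copy every '<name>.old' restart file over '<name>' in output_dir.

    'output_dir' is a hypothetical argument: the directory that holds
    the restart files of the run you want to resume.
    """
    out = Path(output_dir)
    pairs = [(out / (name + ".old"), out / name) for name in RESTART_FILES]

    # Stop before changing anything if the set of old files is incomplete,
    # so the directory never ends up with a mix of two checkpoints.
    missing = [old.name for old, _ in pairs if not old.is_file()]
    if missing:
        raise FileNotFoundError(
            "missing old restart files: " + ", ".join(missing)
        )

    restored = []
    for old, current in pairs:
        shutil.copy2(old, current)  # overwrites 'current' if it exists
        restored.append(current)
    return restored
```

You would call it as restore_old_checkpoint("output"), with "output" 
replaced by your own directory, and then resume the run as usual.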

Hope that helps, and please let us know if that doesn't work for you.

Cheers,
Juliane


On 01/23/2018 12:56 AM, Timo Heister wrote:
> Matt,
>
> when looking at the code I realized that we never check if the write
> of that file succeeded. I created a PR here
> https://github.com/geodynamics/aspect/pull/2066 that changes that. It
> probably won't help fixing the problem, but it should fail immediately
> instead of silently continuing.
>
> Assuming it is not a bug in our code, there can be several reasons why
> writes can fail: a weird/slow network filesystem (if you are running
> on a cluster, check that you are writing to the file system it
> recommends for output), not enough free disk space, quotas, etc.
>
>> At the moment I have 2 simulations
>> which I need to continue running, and have no way of resuming them.
> If the file doesn't exist, you won't be able to continue those runs.
> Sorry for not having a "solution". I would try experimenting to find
> out when this problem occurs.
>
> Best,
> Timo
>
>
> On Mon, Jan 22, 2018 at 9:03 AM, Matthew Lees <mlees0209 at gmail.com> wrote:
>> Hi all,
>>
>> I'm running simulations with checkpointing enabled, to allow computations to
>> be resumed. Normally it works just fine, but sometimes I find that a run
>> finishes and no 'restart.resume.z' file is created. I thought I might be
>> able to resolve this by renaming 'restart.resume.z.old' to
>> 'restart.resume.z' then attempting to resume the simulation, but this
>> doesn't seem to work either (it resumes but not from the right point).
>>
>> Any ideas on how this might be solved? At the moment I have 2 simulations
>> which I need to continue running, and have no way of resuming them.
>>
>> Many thanks,
>>
>> Matt Lees
>> Department of Geophysics
>> University of Cambridge
>>
>> _______________________________________________
>> Aspect-devel mailing list
>> Aspect-devel at geodynamics.org
>> http://lists.geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel
>
>

----------------------------------------------------------------------
Juliane Dannberg
Postdoctoral Fellow, Colorado State University
http://www.math.colostate.edu/~dannberg/ 

