I agree that for bigger machines your solution is better. <div>Most (if not all) of these machines do not even have local storage.<br><br>Yes, indeed: the low-tech solution I suggested/implemented is simply to execute a system call during the run that moves the files in the background from local scratch storage to the network drive.</div>
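A minimal sketch of that "system call" approach, assuming a POSIX shell on the compute node (all paths here are hypothetical placeholders; they default to temp directories so the sketch runs as-is):

```shell
# Hypothetical paths -- point these at your own node-local scratch and NFS mount.
SCRATCH="${SCRATCH:-$(mktemp -d)}"   # fast node-local scratch the solver writes to
NFS_DIR="${NFS_DIR:-$(mktemp -d)}"   # slow shared network drive

# Once a timestep's files are complete, kick off a detached mv so the copy
# to the slow drive proceeds while the computation keeps running.
nohup sh -c "mv \"$SCRATCH\"/*.vtu \"$NFS_DIR\"/ 2>/dev/null" >/dev/null 2>&1 &
```

The `nohup … &` detachment is the whole trick: the solver process never blocks on the slow NFS write.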
<div><br></div><div>What I suggest is to provide something that is good for large systems as well as something that will work on your average Joe's cluster.</div><div>With slow network storage and fast local storage, the MPI-IO solution will, I guess, be suboptimal.</div>
<div>Having the user mess around with symlinks pointing to drives on different nodes etc. is probably also not very failsafe.</div><div><br></div><div>Cheers,</div><div>Thomas</div><div><br></div><div><br><br><div class="gmail_quote">
On Tue, Feb 28, 2012 at 3:47 PM, Timo Heister <span dir="ltr"><<a href="mailto:heister@math.tamu.edu">heister@math.tamu.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
First: the solution to use MPI I/O and merge output files is the only<br>
way to scale to bigger machines. You cannot run with 10'000 cores and<br>
write out 10'000 files per time step.<br>
Second: the merging of files is optional. It is a runtime parameter<br>
you can set. You might want to generate one file per node (instead of<br>
one file per core, as now), or you can leave it as it is today.<br>
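The grouping arithmetic behind such a merge can be sketched as follows (the reduction factor `F` and the rank-to-file mapping are my assumptions for illustration, not the actual parameter):

```shell
# With a reduction factor F, MPI rank r writes into merged file r / F
# (integer division), so n ranks produce ceil(n / F) files instead of n.
n=10000; F=20
for r in 0 19 20 9999; do
  echo "rank $r -> file $(( r / F ))"
done
echo "total files: $(( (n + F - 1) / F ))"   # 500 files instead of 10000
```

Setting F equal to the number of ranks per node gives "one file per node"; F=1 reproduces today's one-file-per-core behavior.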
Third: In your setup I would just symlink the output/ directory to the<br>
local scratch space and copy the files to the central NFS at the end<br>
of the computation (you can do this in your jobfile). If you want to<br>
do both things at the same time, you can execute a shell script after<br>
the visualization that does the mv in the background.<br>
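A hedged sketch of that jobfile setup (paths and the solver invocation are hypothetical; the paths default to temp directories so the sketch runs as-is):

```shell
SCRATCH="${SCRATCH:-$(mktemp -d)}"   # fast node-local scratch
NFS_DIR="${NFS_DIR:-$(mktemp -d)}"   # central NFS target

mkdir -p "$SCRATCH"
ln -sfn "$SCRATCH" output            # output/ now points at local scratch
# mpirun -np 16 ./aspect params.prm  # hypothetical solver run; it writes
                                     # through the symlink to fast disk
cp -r "$SCRATCH"/. "$NFS_DIR"/       # one bulk copy at the end of the job
```

The solver only ever sees `output/`, so nothing in the code needs to change; the jobfile decides where the data physically lands.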
<div class="HOEnZb"><div class="h5"><br>
On Tue, Feb 28, 2012 at 8:32 AM, Thomas Geenen <<a href="mailto:geenen@gmail.com">geenen@gmail.com</a>> wrote:<br>
> I am not sure this will be very efficient on the type of cluster we have.<br>
><br>
> We have a cluster with a bunch of nodes with fast local I/O that are<br>
> interconnected with InfiniBand and have an Ethernet connection to the<br>
> master node for I/O etc. On the master node we have our network drive (a<br>
> large, slow beast: NFS).<br>
><br>
> In the proposed solution we will be using the InfiniBand for the I/O during<br>
> the computation (assuming the I/O happens in the background). How will that<br>
> affect the speed of the solver? How large are the MPI buffers needed for<br>
> this type of I/O, and is that pinned memory? Do we have enough left for the<br>
> application?<br>
><br>
> If it's not doing the I/O in the background, this will be a bottleneck for us,<br>
> since we have a very basic disk setup on the master node.<br>
><br>
> Locally (on the compute nodes) we have something like 500 MB/s throughput, so<br>
> for a typical run on 10-20 nodes we have an effective bandwidth of 5-10 GB/s.<br>
><br>
> I would be in favor of implementing a few I/O strategies and leaving it to the<br>
> user to pick the one that is most efficient for his/her hardware setup.<br>
><br>
> The low-tech option I proposed before (write to fast local storage and mv<br>
> the files in the background over the Ethernet connection to the slow network<br>
> drive) will probably work best for me.<br>
><br>
> cheers<br>
> Thomas<br>
><br>
><br>
><br>
> On Mon, Feb 27, 2012 at 6:40 PM, Wolfgang Bangerth <<a href="mailto:bangerth@math.tamu.edu">bangerth@math.tamu.edu</a>><br>
> wrote:<br>
>><br>
>><br>
>> > Oh, one thing I forgot to mention is that I am not sure we want one<br>
>> > big file per time step. It might happen that ParaView in parallel is<br>
>> > less efficient at reading one big file. One solution would be to write<br>
>> > something like n*0.05 files, where n is the number of compute<br>
>> > processes.<br>
>><br>
>> Yes, go with that. Make the reduction factor (here, 20) a run-time<br>
>> parameter so that people can choose whatever is convenient for them when<br>
>> visualizing. For single-processor visualization, one could set it equal<br>
>> to the number of MPI jobs, for example.<br>
>><br>
>> Cheers<br>
>> W.<br>
>><br>
>> ------------------------------------------------------------------------<br>
>> Wolfgang Bangerth email: <a href="mailto:bangerth@math.tamu.edu">bangerth@math.tamu.edu</a><br>
>> www: <a href="http://www.math.tamu.edu/~bangerth/" target="_blank">http://www.math.tamu.edu/~bangerth/</a><br>
>><br>
>> _______________________________________________<br>
>> Aspect-devel mailing list<br>
>> <a href="mailto:Aspect-devel@geodynamics.org">Aspect-devel@geodynamics.org</a><br>
>> <a href="http://geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel" target="_blank">http://geodynamics.org/cgi-bin/mailman/listinfo/aspect-devel</a><br>
><br>
><br>
><br>
><br>
<br>
<br>
<br>
</div></div><div class="im HOEnZb">--<br>
Timo Heister<br>
<a href="http://www.math.tamu.edu/~heister/" target="_blank">http://www.math.tamu.edu/~heister/</a><br>
</div><div class="HOEnZb"><div class="h5">
</div></div></blockquote></div><br></div>