[aspect-devel] update on memory pbs

Wolfgang Bangerth bangerth at math.tamu.edu
Fri Feb 15 11:57:19 PST 2013


Cedric,

> For instance, following your advice, we have been running models on
> one processor only as follows:
> 1) we book the whole node for ourselves with
>   #PBS -l nodes=node009.cm.cluster:ppn=32
> 2) we run a simulation with
> mpirun -np 1 ./lib/aspect convection-box.prm > opla
>
> However, running simple tests does not help us much, since their
> memory footprint is not large enough to show up in the ganglia
> readings (the scale of the graph makes it impossible to read small
> memory variations).
>
> What would be very helpful to us would be the amount of memory that
> the code uses on your machine for some of the cookbook input files,
> or the vkk.prm.

I've run your vkk.prm with global refinement 9 on my machine and I'm 
getting the following numbers (which remain constant over the run-time 
of the program):
   Total memory:    7.8 GB
   Resident memory: 4.4 GB
I get these numbers by just running 'top' on the same machine and 
looking at the VIRT and RES columns of its output.
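
If top's refresh rate makes those columns hard to catch, you can also take a
snapshot straight from /proc; this is only a sketch, and the pgrep pattern is
an assumption about how the binary shows up in the process list:

   pid=$(pgrep -f "lib/aspect")                 # find the running aspect process
   grep -E 'VmSize|VmRSS' /proc/$pid/status     # VmSize ~ VIRT, VmRSS ~ RES (kB)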

These numbers actually look rather reasonable to me. There are 4.4M 
unknowns, which boils down to a total memory consumption of under 1.8 kB 
per degree of freedom -- a number quite typical for complex 2d codes.
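(Back of the envelope: 7.8e9 bytes / 4.4e6 unknowns is about 1770 bytes, i.e. 
just under 1.8 kB per unknown.)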


> Since the grid is regular and neither coarsened nor refined, is it
> right to think that the memory needed to run the code should also be
> (really) constant?
> Indeed, when we run a big subduction model and set the CFL to zero,
> the code ultimately solves the same matrix over and over, so does it
> make sense that the memory used fluctuates?

Memory should be more or less constant between successive time steps. 
There may be fluctuation within time steps (the solver and 
preconditioners will have to allocate temporary memory that they release 
when they're done, as do other parts of the program), but at the end 
of each time step we should be back at the same level as at the end of 
the previous one. I don't know how many time steps the 
ganglia graph you show represents, but I think that within reasonable 
accuracy the graph represents a roughly constant memory consumption.
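
If the ganglia graph is too coarse to see this, one crude check is to record
the resident set size yourself at a fixed interval and compare it against the
time step boundaries in the screen output. A rough sketch (the pgrep pattern
is again just an assumption about the process name):

   pid=$(pgrep -f "lib/aspect")
   while kill -0 $pid 2>/dev/null; do
     echo "$(date +%s) $(ps -o rss= -p $pid)"   # timestamp and RSS in kB
     sleep 10
   done > aspect_rss.log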

So this data looks correct to me. What happens if you now run the same 
program with 'mpirun -np 2' on an otherwise empty machine?
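
That is, reuse the command line you quoted above, just with two processes,
for example:

   mpirun -np 2 ./lib/aspect convection-box.prm > opla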


> We have also tried to run the same input file using valgrind as you
> advised us to do.
> The code runs 'fine', but we do not get a massif file as expected and
> instead we get the following message.
>
> ==34454== Massif, a heap profiler
> ==34454== Copyright (C) 2003-2010, and GNU GPL'd, by Nicholas Nethercote
> ==34454== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
> ==34454== Command: ./lib/aspect convection-box.prm
> ==34454==
>
> valgrind: m_mallocfree.c:248 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
> valgrind: Heap block lo/hi size mismatch: lo = 253524616, hi = 0.
> This is probably caused by your program erroneously writing past the
> end of a heap block and corrupting heap metadata.  If you fix any
> invalid writes reported by Memcheck, this assertion failure will
> probably go away.  Please try that before reporting this as a bug.
>
> However, I am not sure we can trust this, since I get the same message
> with my Fortran-based code, which, when run on the cluster, does not
> show any sign of memory leakage.

I have no idea where that comes from. I guess it happens right at the 
beginning? I don't see this here on my machine, and if you get the same 
message with a completely unrelated program as well, it may be something 
that happens in one of your runtime libraries.
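
If you want to follow the hint in the message itself, you could try a plain
memcheck run first (that is what the assertion text refers to), and for
massif you can at least force the output file name explicitly. Roughly:

   valgrind --tool=memcheck ./lib/aspect convection-box.prm > opla
   valgrind --tool=massif --massif-out-file=massif.out ./lib/aspect convection-box.prm > opla

Whether either of these turns up anything useful on your system is of course
an open question.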


> Wim (our boss) suggested we could open an account on our cluster here.
> Would that be an option ?

I think that could be a last option. I don't doubt your data, I simply 
can't seem to reproduce things here.

I use the very same Trilinos version as you do, and likely also the same 
p4est version. The only thing we may differ in is the MPI system we 
have. Does your cluster allow you to select among different MPI 
implementations? If so, can you try that? You'll have to re-compile 
everything with the different MPI choice, so it'll probably be useful to 
do this in a completely separate, parallel directory tree so you can 
easily switch between the two builds.
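
I don't know how your cluster exposes this, but on many systems it goes
through environment modules; the module names below are only placeholders
for whatever is actually installed on your cluster:

   module avail 2>&1 | grep -i mpi      # list the MPI stacks that are installed
   module swap openmpi mvapich2         # placeholder names for the two stacks
   # then re-configure and re-build p4est, Trilinos and Aspect under a separate
   # prefix, e.g. $HOME/sw-other-mpi/, so that both builds can coexist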

Best
  Wolfgang

-- 
------------------------------------------------------------------------
Wolfgang Bangerth               email:            bangerth at math.tamu.edu
                                 www: http://www.math.tamu.edu/~bangerth/


