Journal Articles

Overload Journal #112 - December 2012 + Programming Topics
Browse in : All > Journals > Overload > o112 (6)
All > Topics > Programming (877)
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: Valgrind Part 5 – Massif

Author: Martin Moene

Date: 01 December 2012 20:49:20 +00:00 or Sat, 01 December 2012 20:49:20 +00:00

Summary: Poor performance can be caused by memory usage. Paul Floyd introduces Valgrind’s heap memory profiling tool Massif.

Body: 

This time there is a bit more good news. You’ve identified your CPU bottlenecks and have ironed them out. But now you have customers complaining that performance is poor when they are using very big projects. When you hear the word ‘big’, you should start thinking ‘memory’.

Using system profiling tools

Before you do any lengthy testing, you can get a quick impression of your application’s memory use by using 'top' (or a similar process monitor tool).

Here I’m running a simulator on a 48Gbyte server. You should be looking at the Mem: and Swap: lines in the header block and the VIRT (virtual memory) and RES (resident memory) in the columns in Figure 1.

top - 17:21:05 up 9 days,  2:16,  2 users,  load average: 1.09, 1.05, 1.01
Tasks: 232 total,   2 running, 230 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.6%us,  0.0%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49414760k total, 21724376k used, 27690384k free,   185660k buffers
Swap: 49151992k total,     2128k used, 49149864k free,  2669128k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12733 paulf     20   0 16.5g  16g  22m R 100.0 34.5 159:06.14 sim64.exe
23657 paulf     20   0 13252 1264  884 R  0.7  0.0   0:02.56 top
			
Figure 1

You can also get an idea what your system is doing by using a tool like vmstat (on UNIX-like systems). If you type "vmstat 5" then you’ll sample the system every 5 seconds (with the first line being the average since the system booted). You’ll see something like Figure 2.

vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0   2128 27686448 185312 2667364    0    0     0     3    0    4 11  3 84  1  0
 1  0   2128 27686876 185312 2667364    0    0     0     0 1136  134 12  0 88  0  0
 1  0   2128 27687488 185312 2667364    0    0     0     0 1088   74 13  0 87  0  0
			
Figure 2

swpd, free, buff and cache correspond to similarly named values in the Mem: and Swap: lines of top. When your performance is really poor, then it’s the si and so values that you should look at. They are the amount of memory that is being swapped in and out from disk per second. Swapping is bad news. Disk drives are many orders of magnitude slower than RAM.

Enough of system profiling, I’m here to write about Valgrind. In particular, Massif, Valgrind’s heap memory profiling tool. Massif will give you an overview of the memory used by your application over time. You can use it to help you to identify where and when your application allocates memory on the heap with a view to either reducing the amount of memory that you allocate or freeing it sooner. It can indicate that you have memory leaks, but it will only tell you about memory use, not memory abuse (like unreachable memory due to pointers going out of scope or being overwritten).

I’ll start with a very simple example that just allocates and frees some memory (Listing 1).

// m1.cpp
#include <unistd.h>

const size_t DATA_SIZE = 100U;
const size_t BLOCK_SIZE = 1000000;

int main()
{
  long **data = new long *[DATA_SIZE];

  for (size_t i = 0; i < DATA_SIZE; ++i)
  {
    data[i] = new long[BLOCK_SIZE];
    // do something with data[i]
    usleep(100000);
  }
  sleep(1);

  for (size_t i = 0; i < DATA_SIZE; ++i)
  {
    delete [] data[i];
  }
  delete [] data;
}
			
Listing 1

To select Massif, run valgrind with the --tool=massif option. In this case, since the example uses sleep() and usleep(), I'll add the --time-unit=ms option, the default being instructions executed. There are further options to profile the stack, control the call context depth recorded and the number of snapshots taken. There isn’t much output to the terminal, just the small piece of header information in Figure 3.

 valgrind --tool=massif --time-unit=ms ./m1
==12939== Massif, a heap profiler
==12939== Copyright (C) 2003-2012, and GNU GPL'd, by Nicholas Nethercote
==12939== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==12939== Command: ./m1
==12939== 
==12939==
			
Figure 3

The important information is written to a file named massif.out.<pid> (you can control the file name with the --massif-out-file option). This isn’t really meant for human consumption. For completeness, I’ll include an extract, see Figure 4.

desc: --time-unit=ms
cmd: ./m1
time_unit: ms
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=0
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
[content deleted]
#-----------
snapshot=49
#-----------
time=9830
mem_heap_B=784000800
mem_heap_extra_B=345752
mem_stacks_B=0
heap_tree=detailed
n2: 784000800 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
 n0: 784000000 0x400759: main (m1.cpp:12)
 n0: 800 in 1 place, below massif's threshold (01.00%)
[more content]
			
Figure 4

The output file is meant to be processed by ms_print. This outputs to the terminal. Firstly, there is a short summary, then there is an ASCII art plot of the memory use as a function of time. Figure 5 is what my example code produces.

    MB
763.3^                                                                       #
     |                                                             ::@:::::::#
     |                                                           ::: @:      #
     |                                                       ::::::: @:      #
     |                                                    :::::: ::: @:      #
     |                                                 :@::: ::: ::: @:      #
     |                                             :::::@::: ::: ::: @:      #
     |                                          ::::: ::@::: ::: ::: @:      #
     |                                      ::::::::: ::@::: ::: ::: @:      #
     |                                    :@: ::::::: ::@::: ::: ::: @:      #
     |                                 ::::@: ::::::: ::@::: ::: ::: @:      #
     |                             ::::: ::@: ::::::: ::@::: ::: ::: @:      #
     |                          :::::: : ::@: ::::::: ::@::: ::: ::: @:      #
     |                      :::::: ::: : ::@: ::::::: ::@::: ::: ::: @:      #
     |                   :::: :::: ::: : ::@: ::::::: ::@::: ::: ::: @:      #
     |                ::@: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @:      #
     |             :::: @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @:      #
     |          :@@:: : @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @:      #
     |      :::::@ :: : @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @:      #
     |   ::@:: ::@ :: : @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @:      #
   0 +----------------------------------------------------------------------->s
     0                                                                   11.13
			
Figure 5

As expected, the memory use rises monotonically for 10 seconds. The application allocates a bit over 800 million bytes, or 762.94MB. That’s quite close to the peak shown in the graph. After this, the ms_print output shows information about snapshots taken during the execution of the application. All of the snapshots show a one line summary of the heap use. By default, every tenth snapshot shows details of where the memory was allocated. Figure 6 is the end of the output from the m1 test application.

 39          7,825      624,275,992      624,000,800       275,192            0
99.96% (624,000,800B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.96% (624,000,000B) 0x400759: main (m1.cpp:12)
| 
->00.00% (800B) in 1+ places, all below ms_print's threshold (01.00%)

--------------------------------------------------------------------------------
  n       time(ms)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 40          7,926      632,279,520      632,000,800       278,720            0
 41          8,126      648,286,576      648,000,800       285,776            0
[content deleted]
 48          9,530      760,335,968      760,000,800       335,168            0
 49          9,830      784,346,552      784,000,800       345,752            0
99.96% (784,000,800B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.96% (784,000,000B) 0x400759: main (m1.cpp:12)
| 
->00.00% (800B) in 1+ places, all below ms_print's threshold (01.00%)

--------------------------------------------------------------------------------
  n       time(ms)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 50          9,931      792,350,080      792,000,800       349,280            0
 51         11,134      800,353,608      800,000,800       352,808            0
99.96% (800,000,800B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.96% (800,000,000B) 0x400759: main (m1.cpp:12)
| 
->00.00% (800B) in 1+ places, all below ms_print's threshold (01.00%)
			
Figure 6

We can see that the figure for the useful-heap matches what we expect. The extra-heap figure is the memory allocated for book-keeping. In a real world application, there would be an extensive tree of calls displayed showing where memory was allocated and how much (% of total and number of bytes). In this example, everything is done in main(), so there’s not much context to see. Now that we have this information, what do we want to do with it? Generally, two things: try to free memory earlier and try to find more efficient data structures.

Now, let’s go back to the small example and make some changes to improve the memory use. (See Listing 2.)

// m2.cpp
#include <unistd.h>

const size_t DATA_SIZE = 100U;
const size_t BLOCK_SIZE = 1000000;

int main()
{
  long **data = new long *[DATA_SIZE];

  for (size_t i = 0; i < DATA_SIZE; ++i)
  {
    data[i] = new long[BLOCK_SIZE];
    // do something with data[i]
    usleep(100000);
    delete [] data[i];
    usleep(100000);
  }

  sleep(1);

  delete [] data;
}
			
Listing 2

Now the memory is freed straight after it is used, rather than all at once at the end of the application. The ASCII art graph now looks like Figure 7.

    MB
7.634^#::@@::@::::::@::   @@:::::@@    : @:::     ::::@::    ::::::   :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
     |#: @ : @ : : :@::   @ ::: :@     : @::      ::: @:     :: ::    :  :    
   0 +----------------------------------------------------------------------->s
     0                                                                   21.15
			
Figure 7

Notice that the peaks where memory is allocated are not evenly spaced. This is an artefact of the sampling, and if your application does a lot of allocation and freeing, you may need to play with the --detailed-freq or --max-snapshots options.

As an alternative to the ASCII art, there is a separate tool, ‘massif-visualizer’ (Figure 8).

Figure 8

I installed massif-visualizer on Kubuntu with the package manager. It wasn’t available in the package manager of openSUSE, and I had problems with Cmake dependencies when trying to build it from source. The GUI, on top of being a bit prettier than the ASCII drawings, allows you to interact with the snapshots. However, it does not show stack memory if you use the --stacks=yes option. This leads to an example that does use stack profiling. Listing 3 is the code (from Bertrans Meyer’s A Touch of Class, p. 477, translated from Eiffel to C++). This is a nice example of stack use that is hard to analyze.

// bizarre.cpp
#include <iostream>
#include <cassert>

using std::cout;

long int bizarre(long int n)
{
  assert(n >= 1);
  if (n == 1)
  {
    return n;
  }
  else if (0L == n%2L)
  {
    return bizarre(n/2L);
  }
  else
  {
    return bizarre((3L*n + 1L)/2L);
  }
}

int main()
{
  for (long int i = 1L; i < 200000000L; ++i)
  {
    bizarre(i);
  }
}
			
Listing 3

Profiling this with valgrind --tool=massif --stacks=yes ./bizarre gives Figure 9.

   KB
18.77^                     #                                                  
     |                     #                                                  
     |                     #                                                  
     |                     #                                                  
     |                     #                                                  
     |                     #                                                  
     |    @                #                                                  
     |    @                #                                                  
     |    @                #                                                  
     |    @                #                                                  
     |    @                #                                                  
     |    @                #                                                  
     |    @                #                                                  
     |    @                #                                                  
     |    @                #                                        :    :    
     |    @                #                                        :  :::  ::
     |    @                #       ::                      :  @@    :  ::: :::
     |    @  ::    ::: ::: #       :  :::    :: :  :: @@ :::  @ ::  :::::::@::
     |    @  : :@@:::  ::  #     :::  :: ::: :  :  :  @ @: :  @ ::  : :::::@:@
     |   :@: : :@ :::  ::  #:::: ::: ::: : :::  : :: :@ @: :::@ ::  : :::::@:@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   548.2
			
Figure 9

In one last example, let's see what happens if we use some low level allocation like mmap. (Listing 4.)

// m3.cpp
#include <unistd.h>
#include <sys/mman.h>

const size_t DATA_SIZE = 100U;
const size_t BLOCK_SIZE = 1000000;

int main()
{
  long **data = new long *[DATA_SIZE];

  for (size_t i = 0; i < DATA_SIZE; ++i)
  {
    data[i] = reinterpret_cast<long *>(mmap(NULL,
       BLOCK_SIZE*sizeof(long),
       PROT_READ | PROT_WRITE,
       MAP_SHARED | MAP_ANONYMOUS, -1, 0));
    // do something with data[i]
    usleep(100000);
  }

  sleep(1);

  for (size_t i = 0; i < DATA_SIZE; ++i)
  {
    munmap(data[i], BLOCK_SIZE*sizeof(long));
  }

  delete [] data;
}
			
Listing 4

With this, all I see is the memory allocated for 'data'. (I haven’t included it here as it’s just an ASCII rectangle showing 808 bytes allocated and a time of 11.14s). Where have the other 763MB gone? By default, massif does not trace memory allocated in low level functions like mmap. If you want to see that memory (which might be the case if you are using a custom allocator that is based on mmap), then you will need to add the --pages-as-heap=yes option. If I add this option to the above example, then I see all of the memory being allocated (in fact, 777.5MB, since it now includes memory allocated when loading the application).

Note that normally on Linux, glibc will use mmap for malloc requests to allocate above a certain threshold, but Valgrind intercepts the call to malloc, so there is no need to use --pages-as-heap unless you are using mmap directly.

That just about wraps up this installment. As you can see, Massif is straightforward to use and it presents a simple view of the memory use of your application.

In my next article, I’ll cover two Valgrind tools for detecting thread hazards, Helgrind and DRD.

Notes: 

More fields may be available via dynamicdata ..