Overload Journal #144 - April 2018

Title: No News is Good News

Author: Paul Floyd

Date: Fri, 06 April 2018 16:54:50 +01:00

Summary: Using ‘new’ without care can be slow. Paul Floyd uses Godbolt’s compiler explorer to see what happens when you do.


There are two influences that have inspired me to write this article. The first is that I’ve been playing a lot with Compiler Explorer (https://godbolt.org). The second is that, a while back, I read Optimized C++ by Kurt Guntheroth, which contains a chapter on using dynamic memory (Chapter 6: Optimize Dynamically Allocated Variables).

I agree with a lot of what it says. In short, there is a description of the types of memory available in C++ (automatic, dynamic and static); a description of how this memory relates to variables in code; the APIs for dealing with dynamic memory; smart pointers; and many tips on optimizations related to dynamic memory. In this article I’m going to explore why you should be trying to optimize your use of new, by digging down to the machine code.

When I’m looking at production C++ code I see a lot of gratuitous uses of new, for instance:

  list<Handle>* handleList = new list<Handle>;
  ...
  processList(handleList);

I suspect that there are a few possible reasons for writing such code.

Obviously, the memory doesn’t need to be allocated dynamically here. The top-level object can perfectly well be on the stack. For instance, the above example could have been written:

  list<Handle> handleList;
  ...
  processList(&handleList);

In addition, if I could change the interface to the processList API, I would almost certainly change it to take a reference rather than a pointer. Furthermore, std::list is rarely a good choice when it comes to performance, so I’d probably also change that if I could.
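As a sketch of what I mean (this processList signature is hypothetical, since the real interface isn’t shown here):

  // Taking a reference documents that the argument cannot be null
  // and that no ownership is being transferred
  void processList(list<Handle>& handles);

  list<Handle> handleList;
  ...
  processList(handleList);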

Both versions of the code do the same thing. So, what is wrong with the first version?

Memory refresher

For those of you who are a bit rusty on the difference between dynamic and automatic memory (I’ll skip static), here is a quick refresher (Wikipedia has a longer description with diagrams [Wikipedia]). Firstly, they are commonly known by other names, referring to how they are usually implemented: dynamic allocation is also known as heap allocation, and automatic allocation as stack allocation. The stack is a large block of memory, referred to by CPU registers such as the stack pointer. Memory can be ‘allocated’ on the stack very quickly, simply by manipulating the stack registers. The main drawbacks of stack memory are that it does not persist beyond the current scope and that it can be quite limited in size. Heap memory is a separate block of memory, but this time it is controlled via functions like malloc [C++Ref-a] and operator new [C++Ref-b]. It’s not so limited in size and persists until explicitly deallocated.
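As a minimal illustration of the difference (Widget here is just a hypothetical type):

  struct Widget { int data; };

  void automaticVersusDynamic()
  {
    Widget a;                // automatic: 'allocated' by bumping the stack
    a.data = 1;              // pointer, destroyed at the end of the scope
    Widget* d = new Widget;  // dynamic: operator new finds a suitable
    d->data = 2;             // block on the heap
    delete d;                // must be freed explicitly, or it leaks
  }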

Problems with new

Performance

There are a couple of reasons why heap allocation has worse performance than stack allocation. Firstly, allocating on the stack is little more than an adjustment of the stack pointer, whereas a heap allocation involves a call into the allocator, which has to do bookkeeping to find and track a suitable block of memory (the jemalloc wiki gives a good description of what a modern allocator does [github-a]). Secondly, every heap allocation eventually has to be paired with a deallocation, which has a similar cost.

There is a third question, concerning the size of the code that gets generated. This complicates the picture because it isn’t always an apples-to-apples comparison. When you use stack allocation, it generally means that you are using RAII, and the compiler will generate the code necessary for clean-up at the end of the scope. When you use heap allocation with raw pointers, as above, then it’s up to you to ensure that resources get freed.

About the example code

In the examples that follow, I will continue to use handleList. In my testing I defined Handle to be

  class Handle {
  public:
    int data;
  };

It doesn’t matter what Handle is. The only thing of importance is that handleList itself is something that needs some memory. I’m going to stick with std::list in the examples for two reasons. Firstly, I want there to be a code smell. Secondly, and more seriously, though the examples presented here are trivial, I don’t want them to be so small that the compiler optimizes them to almost nothing. For the same reason, I’ve added calls to an externally defined populateList function.
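For reference, the examples assume populateList is declared along these lines (the signature is inferred from the call sites), with its definition in another translation unit so that the compiler cannot see into it:

  // Defined elsewhere; opaque to the optimizer
  void populateList(list<Handle>* l);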

The assembly was generated on Compiler Explorer using GCC 7.3 targeting 64-bit x86.

Comparison of allocation methods

Digging deeper into the different allocation methods, if we have a stack allocation function like this:

  void processStack()
  {
    list<Handle> handleList;
    populateList(&handleList);
  }

the optimized machine code that gets generated is shown in Table 1.

processStack():
 1       push r12                           A: Function entry prologue
 2       push rbp
 3       push rbx
 4       sub rsp, 48
 5       lea rbp, [rsp+16]
 6       mov QWORD PTR [rsp+32], 0          B: Inlined std::list construction
 7       mov QWORD PTR [rsp+8], rbp            of handleList on the stack
 8       mov rdi, rbp
 9       movq xmm0, QWORD PTR [rsp+8]
10       punpcklqdq xmm0, xmm0
11       movaps XMMWORD PTR [rsp+16], xmm0
12       call populateList()                C: Call function
13       mov rdi, QWORD PTR [rsp+16]        D: Check result and inlined
14       cmp rdi, rbp                          destructor
15       je .L1
16 .L3:  mov rbx, QWORD PTR [rdi]
17       call operator delete(void*)
18       cmp rbx, rbp
19       mov rdi, rbx
20       jne .L3
21 .L1:  add rsp, 48                        E: Function exit epilogue
22       pop rbx
23       pop rbp
24       pop r12
25       ret
26       mov rdi, QWORD PTR [rsp+16]        F: Stack unwind handling
27       mov rbx, rax
28 .L6:  cmp rdi, rbp
29       je .L5
30       mov r12, QWORD PTR [rdi]
31       call operator delete(void*)
32       mov rdi, r12
33       jmp .L6
34 .L5:  mov rdi, rbx
35       call _Unwind_Resume

Table 1

Note that for non-exceptional flow, only the instructions in blocks A to E are executed. Block F is only reached when an exception is thrown, via the stack unwinding mechanism.

On the other hand, for heap allocation that does not ensure clean-up if an exception is thrown, like this:

  void processHeap()
  {
    list<Handle>* handleList = new list<Handle>;
    populateList(handleList);
    delete handleList;
  }

the machine code that gets generated is in Table 2.

processHeap():
 1       push rbp                           A: Function entry prologue
 2       push rbx
 3       mov edi, 24
 4       sub rsp, 8
 5       call operator new(unsigned long)   B: Dynamic allocation and inlined
 6       mov rbx, rax                          std::list constructor
 7       mov rdi, rax
 8       mov QWORD PTR [rax+16], 0
 9       mov QWORD PTR [rax], rax
10       mov QWORD PTR [rax+8], rax
11       call populateList()                C: Call function
12       mov rdi, QWORD PTR [rbx]           D: Inlined std::list destructor,
13       cmp rbx, rdi                          function exit epilogue and
14       je .L12                               tail-call optimized delete
15 .L13: mov rbp, QWORD PTR [rdi]
16       call operator delete(void*)
17       cmp rbx, rbp
18       mov rdi, rbp
19       jne .L13
20 .L12: add rsp, 8
21       mov rdi, rbx
22       mov esi, 24
23       pop rbx
24       pop rbp
25       jmp operator delete(void*, unsigned long)

Table 2

Thus, it has lost exception safety but made the generated code slightly shorter.

To regain exception safety with heap allocation, still using raw pointers, we would need to write something like Listing 1.

  void processHeapNoleak()
  {
    list<Handle>* handleList = nullptr;
    try
    {
      handleList = new list<Handle>;
      populateList(handleList);
      delete handleList;
    }
    catch (...)
    {
      delete handleList;
      throw;
    }
  }

Listing 1

That doesn’t look too pretty. Please don’t do this at home. The generated machine code for this is in Table 3.

processHeapNoleak():
 1       push rbp                           A: Function entry prologue
 2       push rbx
 3       mov edi, 24
 4       sub rsp, 8
 5       call operator new(unsigned long)   B: Dynamic allocation and inlined
 6       mov rbx, rax                          std::list constructor
 7       mov QWORD PTR [rax+16], 0
 8       mov rdi, rax
 9       mov QWORD PTR [rbx], rax
10       mov QWORD PTR [rbx+8], rax
11       call populateList()                C: Call function
12       mov rdi, QWORD PTR [rbx]           D: Inlined std::list destructor and
13       cmp rbx, rdi                          tail-call optimized delete
14       je .L20
15 .L21: mov rbp, QWORD PTR [rdi]
16       call operator delete(void*)
17       cmp rbx, rbp
18       mov rdi, rbp
19       jne .L21
20 .L20: add rsp, 8
21       mov rdi, rbx
22       mov esi, 24
23       pop rbx
24       pop rbp
25       jmp operator delete(void*, unsigned long)
26       mov rdi, rax                       E: Exception handling, inlined
27       call __cxa_begin_catch                std::list destructor, delete
28 .L23: call __cxa_rethrow                    and rethrow
29       mov rdi, rax
30       call __cxa_begin_catch
31       mov rdi, QWORD PTR [rbx]
32 .L19: cmp rbx, rdi
33       je .L24
34       mov rbp, QWORD PTR [rdi]
35       call operator delete(void*)
36       mov rdi, rbp
37       jmp .L19
38       mov rbx, rax
39       call __cxa_end_catch
40       mov rdi, rbx
41       call _Unwind_Resume
42 .L24: mov esi, 24
43       mov rdi, rbx
44       call operator delete(void*, unsigned long)
45       jmp .L23

Table 3

Moving on quickly, let’s consider a fourth alternative, using a smart pointer.

  void processHeapSmartPtr()
  {
    auto handleList = make_unique<list<Handle>>();
    populateList(handleList.get());
  }

OK, that’s code that I could live with.

Note that if you want to pass the smart pointer itself rather than the underlying raw pointer, you would have to rewrite or overload populateList. When I tried this, I noticed that passing a reference to the unique_ptr prevented the compiler from inlining and optimizing the pointer use, resulting in more register use and larger code. Furthermore, the CppCoreGuidelines discourage passing references to smart pointers in cases like this [github-b]. When get() is used, there is no interface issue and the code size is barely any larger than the stack version.
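To illustrate, here are the two signatures side by side (both are hypothetical variants of populateList):

  // Discouraged [github-b]: ties populateList to one ownership model,
  // and in my tests inhibited inlining of the pointer use
  void populateList(unique_ptr<list<Handle>>& l);

  // Fine: the caller keeps ownership and just lends out the pointer
  void populateList(list<Handle>* l);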

The code flow in this case is quite similar to processStack.

The machine code for this function is in Table 4.

processHeapSmartPtr():
 1       push r12                           A: Function entry prologue
 2       push rbp
 3       mov edi, 24
 4       push rbx
 5       call operator new(unsigned long)   B: Dynamic allocation and inlined
 6       mov rbx, rax                          std::list constructor
 7       mov QWORD PTR [rax+16], 0
 8       mov rdi, rax
 9       mov QWORD PTR [rbx], rax
10       mov QWORD PTR [rbx+8], rax
11       call populateList()                C: Call function
12       mov rdi, QWORD PTR [rbx]           D: Inlined std::list destructor and
13       cmp rbx, rdi                          tail-call optimized delete
14       je .L33
15 .L34: mov rbp, QWORD PTR [rdi]
16       call operator delete(void*)
17       cmp rbx, rbp
18       mov rdi, rbp
19       jne .L34
20 .L33: mov rdi, rbx
21       mov esi, 24
22       pop rbx
23       pop rbp
24       pop r12
25       jmp operator delete(void*, unsigned long)
26       mov rdi, QWORD PTR [rbx]           E: Stack unwind handling, inlined
27       mov rbp, rax                          std::list destructor
28 .L37: cmp rbx, rdi
29       je .L36
30       mov r12, QWORD PTR [rdi]
31       call operator delete(void*)
32       mov rdi, r12
33       jmp .L37
34 .L36: mov rdi, rbx
35       mov esi, 24
36       call operator delete(void*, unsigned long)
37       mov rdi, rbp
38       call _Unwind_Resume

Table 4

You may have noticed that the non-exception paths of the three versions using the heap are identical (lines 1 to 25, blocks A to D, in the assembly tables). The only thing that is different is how they handle exceptions.

Here is a summary of the size of the generated code; the byte sizes were obtained using nm.

Performance

I did some measurements of these four functions. I wrote a main() with a loop that calls the four functions a million times, using an empty stub populateList; a sketch is shown below.
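The harness was nothing more elaborate than this (the stub populateList is defined in a separate translation unit so that it isn’t inlined away):

  int main()
  {
    for (int i = 0; i < 1000000; ++i)
    {
      processStack();
      processHeap();
      processHeapNoleak();
      processHeapSmartPtr();
    }
  }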

With Valgrind callgrind, I got the following numbers of op-codes executed per call.

Function              Op-codes executed
processStack          23
processHeap           27
processHeapNoleak     27
processHeapSmartPtr   29

These are the exclusive counts, i.e., only for the functions themselves. Whilst callgrind counts every instruction, by default it doesn’t report functions that account for less than 1% of the total, so adding a loop is an easy way to ensure that they are included in the output. I got a slightly higher count for processHeapSmartPtr because I ran these tests with GCC trunk; I expect that with GCC 7.3 the count would have been the same as for the other two heap functions.

This is pretty much what I was expecting:

  1. processStack is the fastest but not the smallest, due to the exception handling.
  2. processHeap is the smallest because it does no exception handling.
  3. All of the functions using the heap execute similar numbers of machine instructions.

The picture is very different for the inclusive counts, that is, the functions plus all of their callees.

Function              Op-codes executed
processStack          24
processHeap           360
processHeapNoleak     360
processHeapSmartPtr   362

There are two things that stand out. Firstly, almost all of the cost of the heap versions lies in their callees, operator new and operator delete, rather than in the functions themselves. Secondly, the three heap versions cost essentially the same; the smart pointer adds virtually nothing.

As usual, you may get different results on different platforms and compilers; I tried a couple of others and the results were broadly similar. Furthermore, this test case just uses stub functions. With real functions that actually do something, the cost of the heap allocation would be relatively smaller. That said, the stack allocation here is around 15 times faster.

Conclusions

I think that the case is more or less settled. Use stack allocation where you can. It’s safer, faster and requires writing the least code. Obviously, there are times when you do need heap allocation: when the object has to outlive the scope that creates it, or when it is too large (or of a size not known until runtime) to fit comfortably on the stack.

When you have to use heap allocation, use smart pointers. There is a small code-size overhead, but if you use make_unique (or make_shared) then the difference in runtime performance compared to raw pointers is negligible, and you get the benefit of not having to worry about resource leaks.

Acknowledgements

Thanks to the reviewers for pointing out my omissions and inconsistencies.

Thanks also to Matt Godbolt for providing Compiler Explorer.

References

[C++Ref-a] malloc: http://en.cppreference.com/w/cpp/memory/c

[C++Ref-b] operator new: http://en.cppreference.com/w/cpp/memory/new

[github-a] jemalloc: https://github.com/jemalloc/jemalloc/wiki/Background

[github-b] CppCoreGuidelines: http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#r30-take-smart-pointers-as-parameters-only-to-explicitly-express-lifetime-semantics

[Wikipedia] Data segment: https://en.wikipedia.org/wiki/Data_segment
