    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: No News is Good News</title>
        <link>https://members.accu.org/index.php/articles/2489</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + Overload Journal #144 - April 2018</span></div>

<table border="0" cellpadding="1" cellspacing="0">
    <tbody>
    <tr>
        <td valign="top">
            Browse in :
       </td>
       <td valign="top">

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c13/">Topics</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c65/">Programming</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c76/">Journals</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c78/">Overload</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c384/">o144</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/c65-384/">Any of these categories</a>

                    -                        <a href="https://members.accu.org/index.php/articles/c65+384/">All of these categories</a>
<br />
</td>
   </tr>
   </tbody>
</table>




<div class="xar-error">
   <p>
 <strong>Note:</strong> when you create a new publication type,
the articles module will automatically use the templates
<em>user-display-[publicationtype].xt</em>
and <em>user-summary-[publicationtype].xt</em>.
If those templates do not exist when you try to preview or display a new article,
you'll get this warning :-)  Please place your own templates in themes/<em>yourtheme</em>/modules/articles . The templates will get the extension .xt there. </p>
</div>
<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;No News is Good News</h1>
<p><strong>Author:</strong>&nbsp;Bob Schmidt</p>
<p>
<strong>Date:</strong> 06 April 2018 16:54:50 +01:00 or Fri, 06 April 2018 16:54:50 +01:00</p>
<p><strong>Summary:</strong>&nbsp;Using â€˜newâ€™ without care can be slow. Paul Floyd uses Godboltâ€™s compiler explorer to see what happens when you do.</p>
<p><strong>Body:</strong>&nbsp;<p>There are two influences that have inspired me to write this article. The first is that Iâ€™ve been playing a lot with Compiler Explorer (<a href="https://godbolt.org">https://godbolt.org</a>). Secondly, a while back I read <em>Optimized C++</em> by Kurt Guntheroth. It contains a chapter on using dynamic memory (Chapter 6: Optimize Dynamically Allocated Variables).</p>

<p>I agree with a lot of what is said. In short, there is a description of the types of memory available in C++ (automatic, dynamic and static); a description of how this memory relates to variables in code; APIs to deal with dynamic memory; smart pointers and many tips on optimizations related to dynamic memory. In this article Iâ€™m going to explore why you should be trying to optimize use of <code>new</code> by digging down to the machine code.</p>

<p>When Iâ€™m looking at production C++ code I do see a lot of gratuitous uses of <code>new</code>, for instance</p>

<pre class="programlisting">
  list &lt;handle&gt;* handleList = new list &lt;handle&gt;;
  ...
  processList(handleList);</pre>
  
<p>I suspect that there are a few possible reasons for writing such code:</p>

<ul>
	<li>the influence of Java</li>
	<li>the assumption that where an API takes a pointer, you have to pass it a pointer allocated on the heap</li>
	<li>not realizing that, at least for standard library containers, the object itself is quite small and that the bulk of the memory (what is contained) is dynamically allocated. A <code>std::list</code> is only 24 bytes, for instance (on 64bit Linux with GCC).</li>
</ul>

<p>Obviously, in the second case the memory doesnâ€™t need to be allocated dynamically. The top-level object can perfectly well be on the stack. For instance, the above example could have been written:</p>

<pre class="programlisting">
  list &lt;handle&gt; handleList;
  ...
  processList(&amp;handleList);</pre>
  
<p>In addition, if I could change the interface to the <code>processList</code> API, I would almost certainly change it to take a reference rather than a pointer. Furthermore, <code>std::list</code> is rarely a good choice when it comes to performance, so Iâ€™d probably also change that if I could.</p>

<p>Both versions of the code do the same thing. So, what is wrong with the first version?</p>

<h3>Memory refresher</h3>

<p>For those of you who are a bit rusty on what the difference is between dynamic and automatic memory (Iâ€™ll skip static), here is a quick refresher (Wikipedia has a longer description with diagrams [<a href="#[Wikipedia]">Wikipedia</a>]). Firstly, they are also commonly also known by other names, referring to how they are often implemented. Dynamic allocation is also known as heap allocation, and automatic allocation is known as stack allocation. The stack is a large block of memory. It is referred to by CPU registers such as the Stack Pointer. Memory can be â€˜allocatedâ€™ on the stack very quickly simply by manipulating the stack registers. The main drawbacks of stack memory are that it does not persist beyond the current scope and it can be quite limited in size. Heap memory is a separate block of memory, but this time it is controlled via functions like <code>malloc</code> [<a href="#[C++Ref-a]">C++Ref-a</a>] and <code>operator new</code> [<a href="#[C++Ref-b]">C++Ref-b</a>]. Itâ€™s not so limited in size and persists until explicitly deallocated.</p>

<h2>Problems with new</h2>

<h3>Performance</h3>

<p>There are a couple of reasons why heap allocation has worse performance than stack allocation.</p>

<ul>
	<li>More memory is required â€“ the allocator generally needs a small amount of memory for housekeeping in addition to the memory requested by the caller. In addition, the amount of memory might get rounded up to a larger size like the nearest power of 2 whilst stack allocation probably only rounds up to the nearest machine word size. On 64bit Linux, allocations have a 16byte minimum size and there is an 8byte housekeeping overhead.</li>
	
	<li>The biggest difference is that functions like <code>new</code> and <code>delete</code> are relatively slow. Much effort by library and OS writers has been made to make them as fast as possible (for instance, see the history of jemalloc [<a href="#[github-a]">github-a</a>]).</li>
</ul>

<p>There is a third question, concerning the size of code that gets generated. This complicates the picture because it isnâ€™t always an apples and pears comparison. When you use stack allocation, it generally means that you are using RAII and the compiler will generate the code necessary for clean-up at the end of the scope. When you use heap allocation with raw pointers as above then itâ€™s up to you to ensure that resources get freed.</p>

<h3>About the example code</h3>

<p>In the examples that follow, I will continue to use <code>handleList</code>. In my testing I defined <code>Handle</code> to be</p>

<pre class="programlisting">
  class Handle {
  public:
    int data;
  };</pre>
  
<p>It doesnâ€™t matter what <code>Handle</code> is. The only thing of importance is to consider that <code>handleList</code> itself is something that needs some memory. Iâ€™m going to stick with the <code>std::list</code> in the examples for two reasons. Firstly, I want there to be a code smell. Secondly and more seriously, though the examples presented here are trivial I donâ€™t want them to be so small that the compiler optimizes them to almost nothing. Also, for that reason Iâ€™ve added calls to an externally defined <code>populateList</code> function.</p>

<p>The assembly was generated on Compiler Explorer using GCC 7.3 64bit.</p>

<h4>Comparison of allocation methods</h4>

<p>Digging deeper into the different allocation methods, if we have a stack allocation function like this:</p>

<pre class="programlisting">
  void processStack()
  {
    list&lt;Handle&gt; handleList;
    populateList(&amp;handleList);
  }</pre>
  
<p>the optimized machine code that gets generated is in Table 1</p>

<table class="sidebartable">
	<tr>
		<td>
			<table class="journaltable">
	<tr>
		<th colspan="3">processStack():</th>
	</tr>
	
	<tr>
		<td>1</td>
		<td>push r12</td>
		<td rowspan="5">A:<br />
			Function entry prologue</td>
	</tr>
	
	<tr>
		<td>2</td>
		<td>push rbp</td>
	</tr>
	
	<tr>
		<td>3</td>
		<td>push rbx</td>
	</tr>
	
	<tr>
		<td>4</td>
		<td>sub rsp, 48</td>
	</tr>
	
	<tr>
		<td>5</td>
		<td>lea rbp, [rsp+16]</td>
	</tr>
	
	<tr>
		<td>6</td>
		<td>mov QWORD PTR [rsp+32], 0</td>
		<td rowspan="6">B:<br />
			Inlined std::list construction of handleList on stack</td>
	</tr>
	
	<tr>
		<td>7</td>
		<td>mov QWORD PTR [rsp+8], rbp</td>
	</tr>
	
	<tr>
		<td>8</td>
		<td>mov rdi, rbp</td>
	</tr>
	
	<tr>
		<td>9</td>
		<td>movq xmm0, QWORD PTR [rsp+8]</td>
	</tr>
	
	<tr>
		<td>10</td>
		<td>punpcklqdq xmm0, xmm0</td>
	</tr>
	
	<tr>
		<td>11</td>
		<td>movaps XMMWORD PTR [rsp+16], xmm0</td>
	</tr>
	
	<tr>
		<td>12</td>
		<td>call populateList()</td>
		<td>C: Call function</td>
	</tr>
	
	<tr>
		<td>13</td>
		<td>mov rdi, QWORD PTR [rsp+16]</td>
		<td rowspan="7">D:<br />
			Check result and inlined destructor</td>
	</tr>
	
	<tr>
		<td>14</td>
		<td>cmp rdi, rbp</td>
	</tr>
	
	<tr>
		<td>15</td>
		<td>je .L1</td>
	</tr>
	
	<tr>
		<td>16</td>
		<td>.L3:  mov rbx, QWORD PTR [rdi]</td>
	</tr>
	
	<tr>
		<td>17</td>
		<td>call operator delete(void*)</td>
	</tr>
	
	<tr>
		<td>18</td>
		<td>cmp rbx, rbp</td>
	</tr>
	
	<tr>
		<td>19</td>
		<td>mov rdi, rbx</td>
	</tr>
	
	<tr>
		<td>20</td>
		<td>jne .L3</td>
		<td rowspan="6">E:<br />
			Function exit epilogue</td>
	</tr>
	
	<tr>
		<td>21</td>
		<td>.L1: add rsp, 48</td>
	</tr>
	
	<tr>
		<td>22</td>
		<td>pop rbx</td>
	</tr>
	
	<tr>
		<td>23</td>
		<td>pop rbp</td>
	</tr>
	
	<tr>
		<td>24</td>
		<td>pop r12</td>
	</tr>
	
	<tr>
		<td>25</td>
		<td>ret</td>
	</tr>
	
	<tr>
		<td>26</td>
		<td>mov rdi, QWORD PTR [rsp+16]</td>
		<td rowspan="10">F:<br />
			Stack Unwind Handling</td>
	</tr>
	
	<tr>
		<td>27</td>
		<td>mov rbx, rax</td>
	</tr>
	
	<tr>
		<td>28</td>
		<td>.L6: cmp rdi, rbp</td>
	</tr>
	
	<tr>
		<td>29</td>
		<td>je .L5</td>
	</tr>
	
	<tr>
		<td>30</td>
		<td>mov r12, QWORD PTR [rdi]</td>
	</tr>
	
	<tr>
		<td>31</td>
		<td>call operator delete(void*)</td>
	</tr>
	
	<tr>
		<td>32</td>
		<td>mov rdi, r12</td>
	</tr>
	
	<tr>
		<td>33</td>
		<td>jmp .L6</td>
	</tr>
	
	<tr>
		<td>34</td>
		<td>.L5: mov rdi, rbx</td>
	</tr>
	
	<tr>
		<td>35</td>
		<td>call _Unwind_Resume</td>
	</tr>
</table>
		</td>
	</tr>
	<tr>
		<td class="title">Table 1</td>
	</tr>
</table>

<p>Note that for non-exceptional flow, only items in blocks A to F in column 3 are executed. Block F only gets called when exceptions are thrown via the stack unwinding mechanism.</p>

<p>On the other hand, for heap allocation that does not ensure proper clean-up, like this:</p>

<pre class="programlisting">
  void processHeap()
  {
    list&lt;Handle&gt;* handleList = new list&lt;Handle&gt;;
    populateList(handleList);
    delete handleList;
  }</pre>
  
<p>the machine code that gets generated is in Table 2.</p>

<table class="sidebartable">
	<tr>
		<td>
			<table class="journaltable">
					<tr>
		<th colspan="3">processHeap():</th>
	</tr>
	
	<tr>
		<td>1</td>
		<td>push rbp</td>
		<td rowspan="4">A:<br />
			Function Entry Prologue</td>
	</tr>
	
	<tr>
		<td>2</td>
		<td>push rbx</td>
	</tr>
	
	<tr>
		<td>3</td>
		<td>mov edi, 24</td>
	</tr>
	
	<tr>
		<td>4</td>
		<td>sub rsp, 8</td>
	</tr>
	
	<tr>
		<td>5</td>
		<td>call operator new(unsigned long)</td>
		<td rowspan="6">B:<br />
			Dynamic allocation and inlined std::list constructor</td>
	</tr>
	
	<tr>
		<td>6</td>
		<td>mov rbx, rax</td>
	</tr>
	
	<tr>
		<td>7</td>
		<td>mov rdi, rax</td>
	</tr>
	
	<tr>
		<td>8</td>
		<td>mov QWORD PTR [rax+16], 0</td>
	</tr>
	
	<tr>
		<td>9</td>
		<td>mov QWORD PTR [rax], rax</td>
	</tr>
	
	<tr>
		<td>10</td>
		<td>mov QWORD PTR [rax+8], rax</td>
	</tr>
	
	<tr>
		<td>11</td>
		<td>call populateList()</td>
		<td>C: Call function</td>
	</tr>
	
	<tr>
		<td>12</td>
		<td>mov rdi, QWORD PTR [rbx]</td>
		<td rowspan="14">D:<br />
			Inlined std::list destructor, function exit epilogue and tail optimized delete</td>
	</tr>
	
	<tr>
		<td>13</td>
		<td>cmp rbx, rdi</td>
	</tr>
	
	<tr>
		<td>14</td>
		<td>je .L12</td>
	</tr>
	
	<tr>
		<td>15</td>
		<td>.L13: mov rbp, QWORD PTR [rdi]</td>
	</tr>
	
	<tr>
		<td>16</td>
		<td>call operator delete(void*)</td>
	</tr>
	
	<tr>
		<td>17</td>
		<td>cmp rbx, rbp</td>
	</tr>
	
	<tr>
		<td>18</td>
		<td>mov rdi, rbp</td>
	</tr>
	
	<tr>
		<td>19</td>
		<td>jne .L13</td>
	</tr>
	
	<tr>
		<td>20</td>
		<td>.L12: add rsp, 8</td>
	</tr>
	
	<tr>
		<td>21</td>
		<td>mov rdi, rbx</td>
	</tr>
	
	<tr>
		<td>22</td>
		<td>mov esi, 24</td>
	</tr>
	
	<tr>
		<td>23</td>
		<td>pop rbx</td>
	</tr>
	
	<tr>
		<td>24</td>
		<td>pop rbp</td>
	</tr>
	
	<tr>
		<td>25</td>
		<td>jmp operator delete(void*, unsigned long)</td>
	</tr>
			</table>
		</td>
	</tr>
	<tr>
		<td class="title">Table 2</td>
	</tr>
</table>

<p>Thus, it has lost exception safety but made the generated code slightly shorter.</p>

<p>To regain exception safety with heap allocation, still using raw pointers, we would need to write something like Listing 1.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
void processHeapNoleak()
{
    list&lt;Handle&gt;* handleList = nullptr;
    try
    {
        handleList = new list&lt;Handle&gt;;
        populateList(handleList);
        delete handleList;    
    }
    catch (...)
    {
        delete handleList;
        throw;
    }
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 1</td>
	</tr>
</table>

<p>That doesnâ€™t look too pretty. Please donâ€™t do this at home. The generated machine code for this is in Table 3.</p>

<table class="sidebartable">
	<tr>
		<td>
			<table class="journaltable">
<tr>
	<th colspan="3">processHeapNoleak():</th>
</tr>

<tr>
	<td>1</td>
	<td>push rbp</td>
	<td rowspan="4">A:<br />
		Function Entry Prologue</td>
</tr>

<tr>
	<td>2</td>
	<td>push rbx</td>
</tr>

<tr>
	<td>3</td>
	<td>mov edi, 24</td>
</tr>

<tr>
	<td>4</td>
	<td>sub rsp, 8</td>
</tr>

<tr>
	<td>5</td>
	<td>call operator new(unsigned long)</td>
	<td rowspan="6">B:<br />
		Dynamic allocation and inlined std::list constructor</td>
</tr>

<tr>
	<td>6</td>
	<td>mov rbx, rax</td>
</tr>

<tr>
	<td>7</td>
	<td>mov QWORD PTR [rax+16], 0</td>
</tr>

<tr>
	<td>8</td>
	<td>mov rdi, rax</td>
</tr>

<tr>
	<td>9</td>
	<td>mov QWORD PTR [rbx], rax</td>
</tr>

<tr>
	<td>10</td>
	<td>mov QWORD PTR [rbx+8], rax</td>
</tr>

<tr>
	<td>11</td>
	<td>call populateList()</td>
	<td>C: Call function</td>
</tr>

<tr>
	<td>12</td>
	<td>mov rdi, QWORD PTR [rbx]</td>
	<td rowspan="14">D:<br />
		Inlined std::list destructor and tail optimized delete</td>
</tr>

<tr>
	<td>13</td>
	<td>cmp rbx, rdi</td>
</tr>

<tr>
	<td>14</td>
	<td>je .L20</td>
</tr>

<tr>
	<td>15</td>
	<td>.L21: mov rbp, QWORD PTR [rdi]</td>
</tr>

<tr>
	<td>16</td>
	<td>call operator delete(void*)</td>
</tr>

<tr>
	<td>17</td>
	<td>cmp rbx, rbp</td>
</tr>

<tr>
	<td>18</td>
	<td>mov rdi, rbp</td>
</tr>

<tr>
	<td>19</td>
	<td>jne .L21</td>
</tr>

<tr>
	<td>20</td>
	<td>.L20: add rsp, 8</td>
</tr>

<tr>
	<td>21</td>
	<td>mov rdi, rbx</td>
</tr>

<tr>
	<td>22</td>
	<td>mov esi, 24</td>
</tr>

<tr>
	<td>23</td>
	<td>pop rbx</td>
</tr>

<tr>
	<td>24</td>
	<td>pop rbp</td>
</tr>

<tr>
	<td>25</td>
	<td>jmp operator delete(void*, unsigned long)</td>
</tr>

<tr>
	<td>26</td>
	<td>mov rdi, rax</td>
	<td rowspan="20">E:<br />
		Exception handling, inlined std::list destructor delete and rethrow</td>
</tr>

<tr>
	<td>27</td>
	<td>call __cxa_begin_catch</td>
</tr>

<tr>
	<td>28</td>
	<td>.L23: call __cxa_rethrow</td>
</tr>

<tr>
	<td>29</td>
	<td>mov rdi, rax</td>
</tr>

<tr>
	<td>30</td>
	<td>call __cxa_begin_catch</td>
</tr>

<tr>
	<td>31</td>
	<td>mov rdi, QWORD PTR [rbx]</td>
</tr>

<tr>
	<td>32</td>
	<td>.L19: cmp rbx, rdi</td>
</tr>

<tr>
	<td>33</td>
	<td>je .L24</td>
</tr>

<tr>
	<td>34</td>
	<td>mov rbp, QWORD PTR [rdi]</td>
</tr>

<tr>
	<td>35</td>
	<td>call operator delete(void*)</td>
</tr>

<tr>
	<td>36</td>
	<td>mov rdi, rbp</td>
</tr>

<tr>
	<td>37</td>
	<td>jmp .L19</td>
</tr>

<tr>
	<td>38</td>
	<td>mov rbx, rax</td>
</tr>

<tr>
	<td>39</td>
	<td>call __cxa_end_catch</td>
</tr>

<tr>
	<td>40</td>
	<td>mov rdi, rbx</td>
</tr>

<tr>
	<td>41</td>
	<td>call _Unwind_Resume</td>
</tr>

<tr>
	<td>42</td>
	<td>.L24: mov esi, 24</td>
</tr>

<tr>
	<td>43</td>
	<td>mov rdi, rbx</td>
</tr>

<tr>
	<td>44</td>
	<td>call operator delete(void*, unsigned long)</td>
</tr>

<tr>
	<td>45</td>
	<td>jmp .L23</td>
</tr>
			</table>
		</td>
	</tr>
	<tr>
		<td class="title">Table 3</td>
	</tr>
</table>

<p>Moving on quickly, letâ€™s consider a fourth alternative, using a smart pointer.</p>

<pre class="programlisting">
  void processHeapSmartPtr()
  {
    auto handleList = make_unique&lt;list&lt;Handle&gt;&gt;();
    populateList(handleList.get());
  }</pre>
  
<p>OK, thatâ€™s code that I could live with.</p>

<p>Note that if you want to use only the smart pointer and not the underlying raw pointer you would have to rewrite or overload <code>populateList</code>. When I tried this, I noticed that passing a reference to the <code>unique_ptr</code> prevented the compiler from inlining and optimizing the pointer use resulting in more register use and larger code. Furthermore, the CppCoreGuidelines discourage passing references to smart pointers in cases like this [<a href="#[github-b]">github-b</a>]. When <code>get()</code> is used, there is no interface issue and the code size is barely any larger than the stack version.</p>

<p>The code flow in this case is quite similar to <code>processStack</code>. </p>

<p>The machine code for this function is in Table 4.</p>

<table class="sidebartable">
	<tr>
		<td>
			<table class="journaltable">
	<tr>
		<th colspan="3">processHeapSmartPtr():</th>
	</tr>
	
	<tr>
		<td>1</td>
		<td>push r12</td>
		<td rowspan="4">A:<br />
			Function Entry Prologue</td>
	</tr>
	
	<tr>
		<td>2</td>
		<td>push rbp</td>
	</tr>
	
	<tr>
		<td>3</td>
		<td>mov edi, 24</td>
	</tr>
	
	<tr>
		<td>4</td>
		<td>push rbx</td>
	</tr>
	
	<tr>
		<td>5</td>
		<td>call operator new(unsigned long)</td>
		<td rowspan="6">B:<br />
			Dynamic allocation and inlined std::list constructor</td>
	</tr>
	
	<tr>
		<td>6</td>
		<td>mov rbx, rax</td>
	</tr>
	
	<tr>
		<td>7</td>
		<td>mov QWORD PTR [rax+16], 0</td>
	</tr>
	
	<tr>
		<td>8</td>
		<td>mov rdi, rax</td>
	</tr>
	
	<tr>
		<td>9</td>
		<td>mov QWORD PTR [rbx], rax</td>
	</tr>
	
	<tr>
		<td>10</td>
		<td>mov QWORD PTR [rbx+8], rax</td>
	</tr>
	
	<tr>
		<td>11</td>
		<td>call populateList()</td>
		<td>C: Call function</td>
	</tr>
	
	<tr>
		<td>12</td>
		<td>mov rdi, QWORD PTR [rbx]</td>
		<td rowspan="14">D:<br />
			Inlined std::list destructor and tail optimized delete</td>
	</tr>
	
	<tr>
		<td>13</td>
		<td>cmp rbx, rdi</td>
	</tr>
	
	<tr>
		<td>14</td>
		<td>je .L33</td>
	</tr>
	
	<tr>
		<td>15</td>
		<td>.L34: mov rbp, QWORD PTR [rdi]</td>
	</tr>
	
	<tr>
		<td>16</td>
		<td>call operator delete(void*)</td>
	</tr>
	
	<tr>
		<td>17</td>
		<td>cmp rbx, rbp</td>
	</tr>
	
	<tr>
		<td>18</td>
		<td>mov rdi, rbp</td>
	</tr>
	
	<tr>
		<td>19</td>
		<td>jne .L34</td>
	</tr>
	
	<tr>
		<td>20</td>
		<td>.L33: mov rdi, rbx</td>
	</tr>
	
	<tr>
		<td>21</td>
		<td>mov esi, 24</td>
	</tr>
	
	<tr>
		<td>22</td>
		<td>pop rbx</td>
	</tr>
	
	<tr>
		<td>23</td>
		<td>pop rbp</td>
	</tr>
	
	<tr>
		<td>24</td>
		<td>pop r12</td>
	</tr>
	
	<tr>
		<td>25</td>
		<td>jmp operator delete(void*, unsigned long)</td>
	</tr>
	
	<tr>
		<td>26</td>
		<td>mov rdi, QWORD PTR [rbx]</td>
		<td rowspan="13">E:<br />
			Stack unwind handling, inlined std::list destructor</td>
	</tr>
	
	<tr>
		<td>27</td>
		<td>mov rbp, rax</td>
	</tr>
	
	<tr>
		<td>28</td>
		<td>.L37: cmp rbx, rdi</td>
	</tr>
	
	<tr>
		<td>29</td>
		<td>je .L36</td>
	</tr>
	
	<tr>
		<td>30</td>
		<td>mov r12, QWORD PTR [rdi]</td>
	</tr>
	
	<tr>
		<td>31</td>
		<td>call operator delete(void*)</td>
	</tr>
	
	<tr>
		<td>32</td>
		<td>mov rdi, r12</td>
	</tr>
	
	<tr>
		<td>33</td>
		<td>jmp .L37</td>
	</tr>
	
	<tr>
		<td>34</td>
		<td>.L36: mov rdi, rbx</td>
	</tr>
	
	<tr>
		<td>35</td>
		<td>mov esi, 24</td>
	</tr>
	
	<tr>
		<td>36</td>
		<td>call operator delete(void*, unsigned long)</td>
	</tr>
	
	<tr>
		<td>37</td>
		<td>mov rdi, rbp</td>
	</tr>
	
	<tr>
		<td>38</td>
		<td>call _Unwind_Resume</td>
	</tr>
			</table>
		</td>
	</tr>
	<tr>
		<td class="title">Table 4</td>
	</tr>
</table>

<p>You may have noticed that the non-exception path for the three versions using the heap are the same (lines 1 to 25 and blocks A to D in the tables of assembler). The only thing that is different is how they handle exceptions.</p>

<p>Here is a summary of the size of the code generated. The byte sizes were obtained by <code>nm</code></p>

<h2>Performance</h2>

<p>I did some measurements of these 4 functions. I just wrote a <code>main()</code> with a loop running a million times calling the 4 functions with empty stub <code>populateList</code> functions.</p>

<p>With Valgrind callgrind, I got the following numbers of op-codes executed per call.</p>

<table>
	<tr>
		<th>Function</th>
		<th>Op-codes executed</th>
	</tr>
	
	<tr>
		<td>processStack</td>
		<td>23</td>
	</tr>
	
	<tr>
		<td>processHeap</td>
		<td>27</td>
	</tr>
	
	<tr>
		<td>processHeapNoLeak</td>
		<td>27</td>
	</tr>
	
	<tr>
		<td>processHeapSmartPointer</td>
		<td>29</td>
	</tr>
</table>

<p>These are the exclusive counts i.e., only for the functions themselves. Whilst callgrind counts every instruction, by default it doesnâ€™t record functions that take less than 1% of the total. So, adding a loop is an easy way to ensure that they are included in the output. I get a slightly higher count for <code>processHeapSmartPointer</code> as I was doing these tests with GCC trunk, I expect that if Iâ€™d used GCC 7.3 the count would have been the same as the other two Heap functions.</p>

<p>This is pretty much what I was expecting</p>

<ol>
	<li><code>processStack</code> is the fastest but not the smallest due to the exception handling</li>
	<li><code>processHeap</code> is the smallest because it does no exception handling</li>
	<li>All of the functions using the heap execute similar numbers of machine instructions.</li>
</ol>

<p>The picture is very different for the inclusive counts, that is the functions plus any callees.</p>

<table>
	<tr>
		<th>Function</th>
		<th>Op-codes executed</th>
	</tr>
	
	<tr>
		<td>processStack</td>
		<td>24</td>
	</tr>
	
	<tr>
		<td>processHeap</td>
		<td>360</td>
	</tr>
	
	<tr>
		<td>processHeapNoLeak</td>
		<td>360</td>
	</tr>
	
	<tr>
		<td>processHeapSmartPointer</td>
		<td>362</td>
	</tr>
</table>

<p>There are two things that stand out</p>

<ul>
	<li>The number of op-codes executed hasnâ€™t change for the stack version, other than the call to the empty stub.</li>
	
	<li>The 3 heap versions have essentially the same count.</li>
</ul>

<p>As usual, you may get different results on different platforms/compilers â€“ I tried a couple of others and the results were broadly similar. Furthermore, this test case just uses stub functions. With real functions that actually do something the cost of the heap allocation would become relatively smaller. That said, the stack allocation here is around 15 times faster.</p>

<h3>Conclusions</h3>

<p>I think that the case is more or less settled. Use stack allocation where you can. Itâ€™s safer, faster and requires writing the least code. Obviously, there are times when you need heap allocation:</p>

<ul>
	<li>when you have huge data</li>
	<li>when you need extended object lifetime</li>
	<li>when you have self-referential data structures like graphs</li>
	<li>when you want to decouple interfaces like with the pimpl idiom</li>
	<li>when you have deeply recursive functions.</li>
</ul>

<p>When you have to use heap allocation, use smart pointers. There is a small code size overhead, but if you use <code>make_unique</code> (or <code>make_shared</code>) then the difference in time performance is negligible compared to using smart pointers with the benefit of not having to worry about resource leaks.</p>

<h2>Acknowledgements</h2>

<p>Thanks to the reviewers for pointing out my omissions and inconsistencies.</p>

<p>Thanks also to Matt Godbolt for providing Compiler Explorer.</p>

<h2>References</h2>

<p class="bibliomixed"><a id="[C++Ref-a]"></a>[C++Ref-a] malloc: <a href="http://en.cppreference.com/w/cpp/memory/c">http://en.cppreference.com/w/cpp/memory/c</a></p>

<p class="bibliomixed"><a id="[C++Ref-b]"></a>[C++Ref-b] operator new: <a href="http://en.cppreference.com/w/cpp/memory/new">http://en.cppreference.com/w/cpp/memory/new</a></p>

<p class="bibliomixed"><a id="[github-a]"></a>[github-a] jemalloc:<a href="https://github.com/jemalloc/jemalloc/wiki/Background">https://github.com/jemalloc/jemalloc/wiki/Background</a></p>

<p class="bibliomixed"><a id="[github-b]"></a>[github-b] CppCoreGuidelines: <a href="http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#r30-take-smart-pointers-as-parameters-only-to-explicitly-express-lifetime-semantics">http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#r30-take-smart-pointers-as-parameters-only-to-explicitly-express-lifetime-semantics</a></p>

<p class="bibliomixed"><a id="[Wikipedia]"></a>[Wikipedia] Data segment: <a href="https://en.wikipedia.org/wiki/Data_segment">https://en.wikipedia.org/wiki/Data_segment</a></p>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
<p><em>More fields may be available via dynamicdata ..</em></p>
</div>
</channel>
</rss>
