    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: â€˜Speedy Gonzalesâ€™ Serializing (Re)Actors via Allocators</title>
        <link>https://members.accu.org/index.php/articles/2425</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + Overload Journal #141 - October 2017</span></div>

<table border="0" cellpadding="1" cellspacing="0">
    <tbody>
    <tr>
        <td valign="top">
            Browse in :
       </td>
       <td valign="top">

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c13/">Topics</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c65/">Programming</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c76/">Journals</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c78/">Overload</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c378/">o141</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/c65-378/">Any of these categories</a>

                    -                        <a href="https://members.accu.org/index.php/articles/c65+378/">All of these categories</a>
<br />
</td>
   </tr>
   </tbody>
</table>




<div class="xar-error">
   <p>
 <strong>Note:</strong> when you create a new publication type,
the articles module will automatically use the templates
<em>user-display-[publicationtype].xt</em>
and <em>user-summary-[publicationtype].xt</em>.
If those templates do not exist when you try to preview or display a new article,
you'll get this warning :-)  Please place your own templates in themes/<em>yourtheme</em>/modules/articles . The templates will get the extension .xt there. </p>
</div>
<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;â€˜Speedy Gonzalesâ€™ Serializing (Re)Actors via Allocators</h1>
<p><strong>Author:</strong>&nbsp;Bob Schmidt</p>
<p>
<strong>Date:</strong> 05 October 2017 19:28:12 +01:00 or Thu, 05 October 2017 19:28:12 +01:00</p>
<p><strong>Summary:</strong>&nbsp;More speed! Sergey Ignatchenko completes his (Re)Actor allocator series with Part III.</p>
<p><strong>Body:</strong>&nbsp;<p class="quote">Start the reactor. Free Marsâ€¦<br />~ Kuato from Total Recall</p>

<p class="EditorIntro">Disclaimer: as usual, the opinions within this article are those of â€˜No Bugsâ€™ Hare, and do not necessarily coincide with the opinions of the translators and <em>Overload</em> editors; also, please keep in mind that translation difficulties from Lapine (like those described in <a href="#[Loganberry04]">[Loganberry04]</a>) might have prevented an exact translation. In addition, the translator and <em>Overload</em> expressly disclaim all responsibility from any action or inaction resulting from reading this article.</p>

<p>As we briefly discussed in Part I of this mini-series <a href="#[NoBugs17a]">[NoBugs17a]</a>, message-passing technologies such as (Re)Actors (a.k.a. Actors, Reactors, ad hoc FSMs, and event-driven programs) have numerous advantages - ranging from being debuggable (including post-factum production debugging), to providing better overall performance.</p>

<p>In <a href="#[NoBugs17a]">[NoBugs17a]</a> and <a href="#[NoBugs17b]">[NoBugs17b]</a>, we discussed an approach to handling allocations for (Re)Actors, and then were able to achieve a safe dialect of C++ (that is, as long as weâ€™re following a set of well-defined local rules). Now, letâ€™s take a look at another task which can be facilitated by per-(Re)Actor allocators - specifically, at the task of serializing (Re)Actors that are later going to be de-serialized by the same executable. While one solution for this task was provided in <a href="#[Ignatchenko-Ivanchykhin16]">[Ignatchenko-Ivanchykhin16]</a>, the proposed â€˜ultra-fastâ€™ serialization is rather cumbersome to maintain, and in most cases it can be beaten performance-wise by serializing at the allocator level.</p>

<h2>#define (Re)Actors</h2>

<p>To make this article self-contained and make sure that weâ€™re all on the same page with terminology, letâ€™s repeat the definition of our (Re)Actors from <a href="#[NoBugs17a]">[NoBugs17a]</a>.</p>

<p>Letâ€™s name a common denominator for all our (Re)Actors a <code>GenericReactor</code>. <code>GenericReactor</code> is just an abstract class â€“ and has a pure virtual function, <code>react()</code>:</p>

<pre class="programlisting">
  class GenericReactor {
    virtual void react(const Event&amp; ev) = 0;
    virtual ~GenericReactor() {}
  }</pre>
  
<p>Letâ€™s name the piece of code which calls <code>GenericReactor</code>â€™s <code>react()</code> <em>Infrastructure Code</em>. Quite often this call will be within a so-called â€˜event loopâ€™ (see Listing 1).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
std::unique_ptr&lt;GenericReactor&gt; r 
  = reactorFactory.createReactor(...);

while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  r-&gt;react(ev);
}

			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 1</td>
	</tr>
</table>

<p>Letâ€™s note that the <code>get_event()</code> function can obtain events from wherever we want; from <code>select()</code> (which is quite common for servers) to libraries such as <strong>libuv</strong> (which is common for clients). </p>

<p>Also letâ€™s note that an event loop such as the one above is by far <em>not</em> the only way to call <code>react()</code>: Iâ€™ve seen implementations of Infrastructure Code ranging from the one running multiple (Re)Actors within the same thread, to another one which deserialized the (Re)Actor from a database (DB), then called <code>react()</code>, and then serialized the (Re)Actor back to the DB. Whatâ€™s important, though, is that even if <code>react()</code> can be called from different threads, it <strong>must</strong> be called <em>as if</em> it is one single thread. In other words, if necessary, all thread sync should be done <strong>outside</strong> of our (Re)Actor, so <code>react()</code> doesnâ€™t need to bother about thread sync regardless of the Infrastructure Code in use. </p>

<p>Finally, letâ€™s name any specific derivative from Generic Reactor (which actually implements our <code>react()</code> function) a <code>SpecificReactor</code>:</p>

<pre class="programlisting">
  class SpecificReactor : public GenericReactor {
    void react(const Event&amp; ev) override;
  };</pre>

<p>In addition, letâ€™s observe that whenever the (Re)Actor needs to communicate with another (Re)Actor then â€“ adhering to the â€˜Do not communicate by sharing memory; instead, share memory by communicatingâ€™ principle â€“ it merely sends a message, and it is only this message which will be shared between (Re)Actors. In turn, this means that we can (and <em>should</em>) use single-threaded allocation for <em>all</em> (Re)Actor purposes â€“ except for allocation of those messages intended for inter-(Re)Actor communications. </p>

<h2>Task: Serialization for the same executable</h2>

<p>Now, letâ€™s define what weâ€™re going to do with our (Re)Actor in this article (and <em>why</em>). Basically, as discussed in <a href="#[Ignatchenko-Ivanchykhin16]">[Ignatchenko-Ivanchykhin16]</a>, when dealing with (Re)Actors, we often need to serialize the current state of our (Re)Actor so that we can deserialize it in <em>exactly the same executable</em> (though this executable can run on a completely different computer etc.). Applications of such â€˜serialization for exactly the same executableâ€™ are numerous; in particular, it is useful for (see, for example, <a href="#[NoBugs17c]">[NoBugs17c]</a>):</p>

<ul>
	<li>Migrating our (Re)Actors (for example, to load-balance them)</li>
	<li>Low-Latency Fault Tolerance for (Re)Actors</li>
	<li>Production post-mortem debugging (serializing the state, plus all inputs after that state, of the (Re)Actor, and then transferring them to developerâ€™s box in case of a crash)</li>
</ul>

<h2>â€˜Speedy Gonzalesâ€™ (Re)actor-level serialization</h2>

<p>Now, as we have both our (Re)Actors and our task at hand more-or-less defined, letâ€™s see how well we can do with our per-(Re)Actor allocator. </p>

<p>Actually, it is fairly simple: </p>

<ul>
	<li>We re-define global allocator (for example, <code>malloc()</code>/<code>free()</code>, though depending on compiler specifics and options YMMV) so that it acts as a bunch of per-(Re)Actor allocators (or at least per-thread allocators) â€“ more on it below
		<p>This means that within our (Re)Actor, we do <strong>not</strong> need to specify allocators for each call to â€˜newâ€™ and for each collection &lt;phew! /&gt;</p>
	</li>
	
	<li>Within our per-(Re)-Actor allocator, we allocate/deallocate OS pages ourselves (via calls such as <code>VirtualAllocEx()</code>or <code>mmap()</code>); of course, we also keep a list of all the OS pages weâ€™re using.</li>
	
	<li>Whenever we need to serialize our (Re)Actor, we simply dump all the pages used by the allocator of this (Re)Actor (with an address of each page serialized) â€“ thatâ€™s it!</li>
	
	<li>When we need to deserialize our (Re)Actor, we try to allocate OS pages at exactly the same (virtual) addresses as they were originally allocated
		
		<p>If such allocation <strong>succeeds</strong> for all our serialized pages (which is common â€“ though strictly speaking, not guaranteed â€“ when weâ€™re deserializing a (Re)Actor into a dedicated-for-this-(Re)Actor process, which in turn is common for debugging), we just need to copy pages back from the serialized form into allocator (thatâ€™s it)</p>
		
		<p>If allocation at the same address <strong>isnâ€™t possible</strong> for whatever reason, we have to use a process which is essentially similar to relocation discussed in <a href="#[NoBugs17a]">[NoBugs17a]</a>. Very briefly:</p>
		
		<ul>
			<li>We have a <em>relocation map</em>, which gives the mapping between â€˜oldâ€™ page addresses and â€˜newâ€™ page addresses.</li>
			
			<li>At OS level, we make sure the pages with â€˜oldâ€™ addresses are unreadable.</li>
			
			<li>We run <em>traversing</em> (as discussed in <a href="#[NoBugs17a]">[NoBugs17a]</a>) over the state of our (Re)Actor. During this traversing process, we merely try to access all the elements of the state of our (Re)Actor. Whenever any pointer within our (Re)Actor state happens to point to the â€˜oldâ€™ page, such access will fail (causing an <em>access violation</em> exception). We catch each such an exception, and update the pointer which caused the exception to the â€˜newâ€™ value within the corresponding â€˜newâ€™ page (calculating the â€˜newâ€™ value using the relocation map).
				<p>Note that while the traversing of our <em>own</em> collections can easily be handled along the same lines as above, traversing and fixing <em>standard</em> collections can be outright impossible without adding a few lines to them &#9785;. How to handle this in practice? It depends, but one way to do it is to take a <em>cross-platform</em> STL implementation (such as EASTL), and to add the few lines implementing traversing for each collection you require (it is NOT rocket science for any <em>specific</em> STL).</p>
			</li>
		</ul>
	</li>
</ul>

<p>Bingo! After such traversing of the <em>whole</em> state of our (Re)Actor is completed, we can be sure that <em>all</em> the pointers to the heap within our (Re)Actor are updated with the new values. In turn, it means that <em>all</em> the pointers to the state are already updated, so all the relocations due to serialization are already handled, and we can proceed with normal (Re)Actor processing.</p>

<p>One very important property of such serialization is that it is extremely difficult to beat this kind of serialization performance-wise. Deserialization is a slightly different story due to potential relocation, but it is still pretty fast. Also, for several of the use cases mentioned in the â€˜Task: Serialization for the same Executableâ€™ section above, it is only the performance of <em>serialization</em> which really matters. Indeed, all we need to do is <code>memcpy()</code> for large chunks of RAM, and with <code>memcpy()</code> speeds being of the order of 5-10 gigabytes/second at least for x64 (see, for example, <a href="#[B14]">[B14]</a>), this means that even to serialize a (Re)Actor which has 100MB state, weâ€™re talking about times of the order of 10-20ms. Serializing the same thing using conventional serialization methods (even really fast ones, such as the one discussed in <a href="#[Ignatchenko-Ivanchykhin16]">[Ignatchenko-Ivanchykhin16]</a>) is going to be significantly slower. The exact numbers depend on the specifics of the organization of the data, but if we have a randomly filled 100MB <code>std::map&lt;int&gt;</code>, just iterating over it without any serializing is going to take the order of 100ms, almost an order of magnitude longer (!).</p>

<h2>Parallel serialization aided by (kinda-)copy-on-write</h2>

<p>For those apps where even 10-20ms of additional latency per 100MB of state is too much, it might be possible to reduce it further. </p>

<p>One of the implementations would work along the following lines (which are ideologically similar, though not identical to, classic Copy-on-Write):</p>

<p>When we want to serialize, we (in our â€˜mainâ€™ (Re)Actor processing thread):</p>

<ul>
	<li>create a list of pages to be serialized</li>
	<li>pre-allocate space in some other area of RAM where we want to serialize</li>
	<li>for all the pages to be serialized, set an â€˜already serializedâ€™ parameter to <code>false</code></li>
	<li>mark all the pages to be serialized as â€˜no-writeâ€™ (using <code>VirtualAllocEx()</code> or <code>mprotect()</code>).</li>
	<li>start another â€˜serializationâ€™ thread (use an always-existing dedicated serialization thread, take it from thread pool, etc.)</li>
	<li>continue processing the (Re)Actor messages in the main thread</li>
</ul>

<p>The â€˜serializationâ€™ thread just takes pages from the list to be serialized one by one, and for each such page:</p>

<ul>
	<li>checks if it is already serialized: if yes, skips; and if not, marks it as â€˜already serializedâ€™ (which should be done using an atomic CAS-like operation to prevent potential races with the main thread)</li>
	
	<li>if it wasnâ€™t already serialized:
		
		<ul>
			<li>serializes the page into pre-allocated space</li>
			<li>removes the â€˜no-writeâ€™ protection from the page</li>
		</ul>
	</li>
</ul>

<p>Whenever we have write access to one of the â€˜no-writeâ€™ pages, we catch the appropriate CPU-level exception, and within the exception handler:</p>

<p>Check if the page being accessed is already being serialized (this can happen due to races with the â€˜serializationâ€™ thread); this should be done in an atomic manner similar to the â€˜serializationâ€™ thread as described above</p>

<p>If the page isnâ€™t serialized yet:</p>

<ul>
	<li>serialize it into pre-allocated space</li>
	<li>remove the â€˜no-writeâ€™ protection from the page, so future writes no longer cause any trouble.</li>
</ul>

<p>Thatâ€™s it. While with such a processing we <em>do</em> have to copy <em>some</em> of the pages within the main thread (causing <em>some</em> latency), for typical access patterns, this will happen relatively rarely, significantly reducing overall serialization latency observed within the â€˜mainâ€™ (Re)Actor thread. For example, if out of our 100MB (~=25K pages) (Re)Actor state, only 1000 pages are modified during our 20ms serialization â€“ then the latency cost of the serialization will drop approximately by a factor of 25x, bringing observable latency to around 1ms (which is acceptable for the <em>vast majority</em> of the systems out there, even for first-person-shooter games).</p>

<h2>Per-(re)actor allocators in a usable manner</h2>

<p>Up to now, we are merely assuming that our allocators can be made per-(Re)Actor; one obvious way of doing this is to have our (Re)Actor code specify our own allocator for each and every allocation within our (Re)Actor (weâ€™ll need to cover both explicit calls to new, and all implicit allocations such as collections).</p>

<p>While such a naÃ¯ve approach would work in theory, it is way too inconvenient to be used in practice. Fortunately, changing an allocator to a per-(Re)Actor one happens to be possible <em>without any changes to the (Re)Actor code</em>. In particular, it can be done along the following lines.</p>

<p>First, we replace <code>malloc()</code>/<code>free()</code> (<strong>Important</strong>: make sure that your global <code>::operator new</code>/<code>::operator delete</code>, and your default <code>std::allocator</code> also use the replaced functions (!). The latter might be rather tricky unless your <strong>std</strong> library already uses <code>::operator new()</code>/<code>::operator delete()</code>, but usually it can be take care of; in particular, for GCC, see <a href="#[GCC]">[GCC]</a>) and the <code>--enable-libstdcxx-allocator</code> option for <code>./configure</code> of <strong>libstdc++.</strong>)</p>

<p>To implement our own <code>malloc()</code>, weâ€™re going along the lines of Listing 2. (Of course, <code>free()</code> should go along the same lines.)</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
thread_local OurAllocator* current_allocator 
  = nullptr;

void* malloc(size_t sz) {
  if( current_allocator )
    return current_allocator-&gt;malloc(sz);
  else
    return non_reactor_malloc(sz);
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 2</td>
	</tr>
</table>

<p>The point here is that our Infrastructure Code (the one which calls our (Re)Actor) sets the <code>current_allocator</code> pointer before every call to <code>GenericReactor::react()</code> (see Listing 3).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
current_allocator = create_new_allocator();
std::unique_ptr&lt;GenericReactor&gt; r
  = reactorFactory.createReactor
  (current_allocator,...);
current_allocator = nullptr;

while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  current_allocator = r-&gt;allocator;
  r-&gt;react(ev);
  current_allocator = nullptr;
}

current_allocator = r-&gt;allocator;
r.reset();
current_allocator = nullptr;
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 3</td>
	</tr>
</table>

<p>Of course, this is a kind of trick â€“ but it will work. Very briefly: first, we confine our <code>current_allocator</code> variable to the current thread by using <code>thread_local</code>, and then within this single thread, we can easily control which allocator is currently used by simple assignments within our Infrastructure Code. One thing to remember when using this way is to make sure that we set <code>current_allocator</code> before <em>each and every</em> method call of our (Re)Actor (including its constructor and destructor(!)).</p>

<p>Thatâ€™s it: weâ€™ve made our (Re)Actor use a per-(Re)Actor allocator â€“ and without changing a single line <em>within</em> our (Re)Actorâ€™s code too &#9786;.</p>

<h2>Summary</h2>

<p>To summarize this part III of the mini-series on â€˜Allocators for (Re)Actorsâ€™:</p>

<ul>
	<li>Allocator-based serialization for (Re)Actors is both
		<ul>
			<li>Easy to implement in a very generic manner, and</li>
			<li>Extremely fast (for x64 â€“ around tens of ms per 100MB of state)</li>
		</ul>
		<p>If necessary, parallel serialization may further reduce latencies (in some cases, down to â€“ very roughly â€“ a single digit ms of latency per 100MB of the state).</p>
	</li>

	<li>The Allocators can be replaced to make them per-(Re)Actor, without any changes to the (Re)Actor code(!)</li>
</ul>

<p>And as this part III <em>concludes</em> this mini-series, letâ€™s summarize all our findings (from all three parts).</p>

<h3>Part I</h3>

<ul>
	<li>(Re)Actor-based allocators allow for very efficient allocation, with three potential modes:
		
		<ul>
			<li>â€˜Fastâ€™ mode (no protection, but faster than regular <code>malloc()</code>)</li>
			<li>â€˜kinda-Safeâ€™ mode (with protection from <em>some</em> of the memory corruptions)
				<p>Here, we introduced a potentially novel method of implementing â€˜danglingâ€™ pointer detection in runtime â€“ the one based on ID comparison. Compared to traditional â€˜tombstonesâ€™ it has better locality, and will usually outperform it.</p></li>
			<li>â€˜kinda-Safe with Relocationâ€™ mode, with added ability to relocate heap objects (this, in turn, allows to avoid dreaded external fragmentation, which tends to eat <em>lots</em> of RAM in long-running programs).</li>
		</ul>
	</li>
	
	<li>Simple â€˜traversingâ€™ interface is sufficient to ensure that all the pointers in the (Re)Actor state are updated</li>
</ul>

<h3>Part II</h3>

<p>By adding a few more of easily understood guidelines, we can extend our â€˜kinda-Safeâ€™ mode from Part I into â€˜really safeâ€™ C++ dialect.</p>

<p>All the guidelines/rules we need to follow are <em>local</em>, which enables reasonable tool-aided enforcement and allows to keep code maintainable.</p>

<h3>Part III</h3>

<ul>
	<li>Custom (Re)Actor-based allocator can be used for the all-important for (Re)Actors serialization for the same executable. It is (a) very easy to maintain for (Re)Actor code, and (b) extremely fast.</li>
	<li>Per-(Re)Actor allocators can be implemented without <em>any</em> changes within (Re)Actor itself (i.e. all the necessary changes can be confined to Infrastructure Code). </li>
</ul>

<p>Phew. It was rather long mini-series, but I hope I have made my points about the significant advantages of allocators specialized for (Re)Actor purposes reasonably clear.</p>

<p><img src="/content/images/journals/ol141/Ignatchenko/Ignatchenko_01.png" /></p>

<h2>References</h2>

<p class="bibliomixed"><a id="[B14]"></a>[B14] Thomas B, â€˜Single-threaded memory performance for dual socket Xeon E5-* systemsâ€™, https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/509237</p>

<p class="bibliomixed"><a id="[GCC] "></a>[GCC] â€˜The GNU C++ Library Manualâ€™, Chapter 6, <a href="https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html#allocator.ext">https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html#allocator.ext</a></p>

<p class="bibliomixed"><a id="[Ignatchenko-Ivanchykhin16]"></a>[Ignatchenko-Ivanchykhin16] Sergey Ignatchenko and Dmytro Ivanchykhin, â€˜Ultra-fast Serialization of C++ Objectsâ€™, <em>Overload</em> #136, Dec 2016</p>

<p class="bibliomixed"><a id="[Loganberry04]"></a>[Loganberry04] David â€˜Loganberryâ€™, Frithaes! â€“ an Introduction to Colloquial Lapine!, <a href="http://bitsnbobstones.watershipdown.org/lapine/overview.html">http://bitsnbobstones.watershipdown.org/lapine/overview.html</a></p>

<p class="bibliomixed"><a id="[NoBugs17a]"></a>[NoBugs17a] â€˜No Bugsâ€™ Hare, â€˜Allocator for (Re)Actors with Optional Kinda-Safety and Relocationâ€™, <em>Overload</em> #139, Jun 2017</p>

<p class="bibliomixed"><a id="[NoBugs17b]"></a>[NoBugs17b] â€˜No Bugsâ€™ Hare, â€˜A Usable C++ Dialect that is Safe Against Memory Corruptionâ€™, <em>Overload</em> #140, Aug 2017</p>

<p class="bibliomixed"><a id="[NoBugs17c]"></a>[NoBugs17c] â€˜No Bugsâ€™ Hare, â€˜Deterministic Components for Interactive Distributed Systemsâ€™, ACCU2017, available at <a href="http://ithare.com/deterministic-components-for-interactive-distributed-systems-with-transcript/">http://ithare.com/deterministic-components-for-interactive-distributed-systems-with-transcript/</a></p>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
<p><em>More fields may be available via dynamicdata ..</em></p>
</div>
</channel>
</rss>
