Title: â€˜Speedy Gonzalesâ€™ Serializing (Re)Actors via Allocators

Author: Bob Schmidt

Date: 05 October 2017 19:28:12 +01:00 or Thu, 05 October 2017 19:28:12 +01:00

Summary: More speed! Sergey Ignatchenko completes his (Re)Actor allocator series with Part III.

Body:

Start the reactor. Free Marsâ€¦
~ Kuato from Total Recall

Disclaimer: as usual, the opinions within this article are those of â€˜No Bugsâ€™ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

As we briefly discussed in Part I of this mini-series [NoBugs17a], message-passing technologies such as (Re)Actors (a.k.a. Actors, Reactors, ad hoc FSMs, and event-driven programs) have numerous advantages - ranging from being debuggable (including post-factum production debugging), to providing better overall performance.

In [NoBugs17a] and [NoBugs17b], we discussed an approach to handling allocations for (Re)Actors, and then were able to achieve a safe dialect of C++ (that is, as long as weâ€™re following a set of well-defined local rules). Now, letâ€™s take a look at another task which can be facilitated by per-(Re)Actor allocators - specifically, at the task of serializing (Re)Actors that are later going to be de-serialized by the same executable. While one solution for this task was provided in [Ignatchenko-Ivanchykhin16], the proposed â€˜ultra-fastâ€™ serialization is rather cumbersome to maintain, and in most cases it can be beaten performance-wise by serializing at the allocator level.

#define (Re)Actors

To make this article self-contained and make sure that weâ€™re all on the same page with terminology, letâ€™s repeat the definition of our (Re)Actors from [NoBugs17a].

Letâ€™s name a common denominator for all our (Re)Actors a GenericReactor. GenericReactor is just an abstract class â€“ and has a pure virtual function, react():

  class GenericReactor {
    virtual void react(const Event& ev) = 0;
    virtual ~GenericReactor() {}
  }

Letâ€™s name the piece of code which calls GenericReactorâ€™s react() Infrastructure Code. Quite often this call will be within a so-called â€˜event loopâ€™ (see Listing 1).

std::unique_ptr<GenericReactor> r 
  = reactorFactory.createReactor(...);

while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  r->react(ev);
}

Listing 1

Letâ€™s note that the get_event() function can obtain events from wherever we want; from select() (which is quite common for servers) to libraries such as libuv (which is common for clients).

Also letâ€™s note that an event loop such as the one above is by far not the only way to call react(): Iâ€™ve seen implementations of Infrastructure Code ranging from the one running multiple (Re)Actors within the same thread, to another one which deserialized the (Re)Actor from a database (DB), then called react(), and then serialized the (Re)Actor back to the DB. Whatâ€™s important, though, is that even if react() can be called from different threads, it must be called as if it is one single thread. In other words, if necessary, all thread sync should be done outside of our (Re)Actor, so react() doesnâ€™t need to bother about thread sync regardless of the Infrastructure Code in use.

Finally, letâ€™s name any specific derivative from Generic Reactor (which actually implements our react() function) a SpecificReactor:

  class SpecificReactor : public GenericReactor {
    void react(const Event& ev) override;
  };

In addition, letâ€™s observe that whenever the (Re)Actor needs to communicate with another (Re)Actor then â€“ adhering to the â€˜Do not communicate by sharing memory; instead, share memory by communicatingâ€™ principle â€“ it merely sends a message, and it is only this message which will be shared between (Re)Actors. In turn, this means that we can (and should) use single-threaded allocation for all (Re)Actor purposes â€“ except for allocation of those messages intended for inter-(Re)Actor communications.

Task: Serialization for the same executable

Now, letâ€™s define what weâ€™re going to do with our (Re)Actor in this article (and why). Basically, as discussed in [Ignatchenko-Ivanchykhin16], when dealing with (Re)Actors, we often need to serialize the current state of our (Re)Actor so that we can deserialize it in exactly the same executable (though this executable can run on a completely different computer etc.). Applications of such â€˜serialization for exactly the same executableâ€™ are numerous; in particular, it is useful for (see, for example, [NoBugs17c]):

Migrating our (Re)Actors (for example, to load-balance them)
Low-Latency Fault Tolerance for (Re)Actors
Production post-mortem debugging (serializing the state, plus all inputs after that state, of the (Re)Actor, and then transferring them to developerâ€™s box in case of a crash)

â€˜Speedy Gonzalesâ€™ (Re)actor-level serialization

Now, as we have both our (Re)Actors and our task at hand more-or-less defined, letâ€™s see how well we can do with our per-(Re)Actor allocator.

Actually, it is fairly simple:

We re-define global allocator (for example, malloc()/free(), though depending on compiler specifics and options YMMV) so that it acts as a bunch of per-(Re)Actor allocators (or at least per-thread allocators) â€“ more on it below
This means that within our (Re)Actor, we do not need to specify allocators for each call to â€˜newâ€™ and for each collection <phew! />
Within our per-(Re)-Actor allocator, we allocate/deallocate OS pages ourselves (via calls such as VirtualAllocEx()or mmap()); of course, we also keep a list of all the OS pages weâ€™re using.
Whenever we need to serialize our (Re)Actor, we simply dump all the pages used by the allocator of this (Re)Actor (with an address of each page serialized) â€“ thatâ€™s it!
When we need to deserialize our (Re)Actor, we try to allocate OS pages at exactly the same (virtual) addresses as they were originally allocated
If such allocation succeeds for all our serialized pages (which is common â€“ though strictly speaking, not guaranteed â€“ when weâ€™re deserializing a (Re)Actor into a dedicated-for-this-(Re)Actor process, which in turn is common for debugging), we just need to copy pages back from the serialized form into allocator (thatâ€™s it)

If allocation at the same address isnâ€™t possible for whatever reason, we have to use a process which is essentially similar to relocation discussed in [NoBugs17a]. Very briefly:
- We have a relocation map, which gives the mapping between â€˜oldâ€™ page addresses and â€˜newâ€™ page addresses.
- At OS level, we make sure the pages with â€˜oldâ€™ addresses are unreadable.
- We run traversing (as discussed in [NoBugs17a]) over the state of our (Re)Actor. During this traversing process, we merely try to access all the elements of the state of our (Re)Actor. Whenever any pointer within our (Re)Actor state happens to point to the â€˜oldâ€™ page, such access will fail (causing an access violation exception). We catch each such an exception, and update the pointer which caused the exception to the â€˜newâ€™ value within the corresponding â€˜newâ€™ page (calculating the â€˜newâ€™ value using the relocation map).
  Note that while the traversing of our own collections can easily be handled along the same lines as above, traversing and fixing standard collections can be outright impossible without adding a few lines to them ☹. How to handle this in practice? It depends, but one way to do it is to take a cross-platform STL implementation (such as EASTL), and to add the few lines implementing traversing for each collection you require (it is NOT rocket science for any specific STL).

Bingo! After such traversing of the whole state of our (Re)Actor is completed, we can be sure that all the pointers to the heap within our (Re)Actor are updated with the new values. In turn, it means that all the pointers to the state are already updated, so all the relocations due to serialization are already handled, and we can proceed with normal (Re)Actor processing.

One very important property of such serialization is that it is extremely difficult to beat this kind of serialization performance-wise. Deserialization is a slightly different story due to potential relocation, but it is still pretty fast. Also, for several of the use cases mentioned in the â€˜Task: Serialization for the same Executableâ€™ section above, it is only the performance of serialization which really matters. Indeed, all we need to do is memcpy() for large chunks of RAM, and with memcpy() speeds being of the order of 5-10 gigabytes/second at least for x64 (see, for example, [B14]), this means that even to serialize a (Re)Actor which has 100MB state, weâ€™re talking about times of the order of 10-20ms. Serializing the same thing using conventional serialization methods (even really fast ones, such as the one discussed in [Ignatchenko-Ivanchykhin16]) is going to be significantly slower. The exact numbers depend on the specifics of the organization of the data, but if we have a randomly filled 100MB std::map<int>, just iterating over it without any serializing is going to take the order of 100ms, almost an order of magnitude longer (!).

Parallel serialization aided by (kinda-)copy-on-write

For those apps where even 10-20ms of additional latency per 100MB of state is too much, it might be possible to reduce it further.

One of the implementations would work along the following lines (which are ideologically similar, though not identical to, classic Copy-on-Write):

When we want to serialize, we (in our â€˜mainâ€™ (Re)Actor processing thread):

create a list of pages to be serialized
pre-allocate space in some other area of RAM where we want to serialize
for all the pages to be serialized, set an â€˜already serializedâ€™ parameter to false
mark all the pages to be serialized as â€˜no-writeâ€™ (using VirtualAllocEx() or mprotect()).
start another â€˜serializationâ€™ thread (use an always-existing dedicated serialization thread, take it from thread pool, etc.)
continue processing the (Re)Actor messages in the main thread

The â€˜serializationâ€™ thread just takes pages from the list to be serialized one by one, and for each such page:

checks if it is already serialized: if yes, skips; and if not, marks it as â€˜already serializedâ€™ (which should be done using an atomic CAS-like operation to prevent potential races with the main thread)
if it wasnâ€™t already serialized:
- serializes the page into pre-allocated space
- removes the â€˜no-writeâ€™ protection from the page

Whenever we have write access to one of the â€˜no-writeâ€™ pages, we catch the appropriate CPU-level exception, and within the exception handler:

Check if the page being accessed is already being serialized (this can happen due to races with the â€˜serializationâ€™ thread); this should be done in an atomic manner similar to the â€˜serializationâ€™ thread as described above

If the page isnâ€™t serialized yet:

serialize it into pre-allocated space
remove the â€˜no-writeâ€™ protection from the page, so future writes no longer cause any trouble.

Thatâ€™s it. While with such a processing we do have to copy some of the pages within the main thread (causing some latency), for typical access patterns, this will happen relatively rarely, significantly reducing overall serialization latency observed within the â€˜mainâ€™ (Re)Actor thread. For example, if out of our 100MB (~=25K pages) (Re)Actor state, only 1000 pages are modified during our 20ms serialization â€“ then the latency cost of the serialization will drop approximately by a factor of 25x, bringing observable latency to around 1ms (which is acceptable for the vast majority of the systems out there, even for first-person-shooter games).

Per-(re)actor allocators in a usable manner

Up to now, we are merely assuming that our allocators can be made per-(Re)Actor; one obvious way of doing this is to have our (Re)Actor code specify our own allocator for each and every allocation within our (Re)Actor (weâ€™ll need to cover both explicit calls to new, and all implicit allocations such as collections).

While such a naÃ¯ve approach would work in theory, it is way too inconvenient to be used in practice. Fortunately, changing an allocator to a per-(Re)Actor one happens to be possible without any changes to the (Re)Actor code. In particular, it can be done along the following lines.

First, we replace malloc()/free() (Important: make sure that your global ::operator new/::operator delete, and your default std::allocator also use the replaced functions (!). The latter might be rather tricky unless your std library already uses ::operator new()/::operator delete(), but usually it can be take care of; in particular, for GCC, see [GCC]) and the --enable-libstdcxx-allocator option for ./configure of libstdc++.)

To implement our own malloc(), weâ€™re going along the lines of Listing 2. (Of course, free() should go along the same lines.)

thread_local OurAllocator* current_allocator 
  = nullptr;

void* malloc(size_t sz) {
  if( current_allocator )
    return current_allocator->malloc(sz);
  else
    return non_reactor_malloc(sz);
}

Listing 2

The point here is that our Infrastructure Code (the one which calls our (Re)Actor) sets the current_allocator pointer before every call to GenericReactor::react() (see Listing 3).

current_allocator = create_new_allocator();
std::unique_ptr<GenericReactor> r
  = reactorFactory.createReactor
  (current_allocator,...);
current_allocator = nullptr;

while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  current_allocator = r->allocator;
  r->react(ev);
  current_allocator = nullptr;
}

current_allocator = r->allocator;
r.reset();
current_allocator = nullptr;

Listing 3

Of course, this is a kind of trick â€“ but it will work. Very briefly: first, we confine our current_allocator variable to the current thread by using thread_local, and then within this single thread, we can easily control which allocator is currently used by simple assignments within our Infrastructure Code. One thing to remember when using this way is to make sure that we set current_allocator before each and every method call of our (Re)Actor (including its constructor and destructor(!)).

Thatâ€™s it: weâ€™ve made our (Re)Actor use a per-(Re)Actor allocator â€“ and without changing a single line within our (Re)Actorâ€™s code too ☺.

Summary

To summarize this part III of the mini-series on â€˜Allocators for (Re)Actorsâ€™:

Allocator-based serialization for (Re)Actors is both
- Easy to implement in a very generic manner, and
- Extremely fast (for x64 â€“ around tens of ms per 100MB of state)
If necessary, parallel serialization may further reduce latencies (in some cases, down to â€“ very roughly â€“ a single digit ms of latency per 100MB of the state).
The Allocators can be replaced to make them per-(Re)Actor, without any changes to the (Re)Actor code(!)

And as this part III concludes this mini-series, letâ€™s summarize all our findings (from all three parts).

Part I

(Re)Actor-based allocators allow for very efficient allocation, with three potential modes:
- â€˜Fastâ€™ mode (no protection, but faster than regular malloc())
- â€˜kinda-Safeâ€™ mode (with protection from some of the memory corruptions)
  Here, we introduced a potentially novel method of implementing â€˜danglingâ€™ pointer detection in runtime â€“ the one based on ID comparison. Compared to traditional â€˜tombstonesâ€™ it has better locality, and will usually outperform it.
- â€˜kinda-Safe with Relocationâ€™ mode, with added ability to relocate heap objects (this, in turn, allows to avoid dreaded external fragmentation, which tends to eat lots of RAM in long-running programs).
Simple â€˜traversingâ€™ interface is sufficient to ensure that all the pointers in the (Re)Actor state are updated

Part II

By adding a few more of easily understood guidelines, we can extend our â€˜kinda-Safeâ€™ mode from Part I into â€˜really safeâ€™ C++ dialect.

All the guidelines/rules we need to follow are local, which enables reasonable tool-aided enforcement and allows to keep code maintainable.

Part III

Custom (Re)Actor-based allocator can be used for the all-important for (Re)Actors serialization for the same executable. It is (a) very easy to maintain for (Re)Actor code, and (b) extremely fast.
Per-(Re)Actor allocators can be implemented without any changes within (Re)Actor itself (i.e. all the necessary changes can be confined to Infrastructure Code).

Phew. It was rather long mini-series, but I hope I have made my points about the significant advantages of allocators specialized for (Re)Actor purposes reasonably clear.

References

[B14] Thomas B, â€˜Single-threaded memory performance for dual socket Xeon E5-* systemsâ€™, https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/509237

[GCC] â€˜The GNU C++ Library Manualâ€™, Chapter 6, https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html#allocator.ext

[Ignatchenko-Ivanchykhin16] Sergey Ignatchenko and Dmytro Ivanchykhin, â€˜Ultra-fast Serialization of C++ Objectsâ€™, Overload #136, Dec 2016

[Loganberry04] David â€˜Loganberryâ€™, Frithaes! â€“ an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[NoBugs17a] â€˜No Bugsâ€™ Hare, â€˜Allocator for (Re)Actors with Optional Kinda-Safety and Relocationâ€™, Overload #139, Jun 2017

[NoBugs17b] â€˜No Bugsâ€™ Hare, â€˜A Usable C++ Dialect that is Safe Against Memory Corruptionâ€™, Overload #140, Aug 2017

[NoBugs17c] â€˜No Bugsâ€™ Hare, â€˜Deterministic Components for Interactive Distributed Systemsâ€™, ACCU2017, available at http://ithare.com/deterministic-components-for-interactive-distributed-systems-with-transcript/

Notes:

More fields may be available via dynamicdata ..