    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: An MWSR Queue with Minimalist Locking</title>
        <link>https://members.accu.org/index.php/articles/2467</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Design of applications and programs + Overload Journal #143 - February 2018</span></div>

<table border="0" cellpadding="1" cellspacing="0">
    <tbody>
    <tr>
        <td valign="top">
            Browse in :
       </td>
       <td valign="top">

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c13/">Topics</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c67/">Design</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c76/">Journals</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c78/">Overload</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c382/">o143</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/c67-382/">Any of these categories</a>

                    -                        <a href="https://members.accu.org/index.php/articles/c67+382/">All of these categories</a>
<br />
</td>
   </tr>
   </tbody>
</table>




<div class="xar-error">
   <p>
 <strong>Note:</strong> when you create a new publication type,
the articles module will automatically use the templates
<em>user-display-[publicationtype].xt</em>
and <em>user-summary-[publicationtype].xt</em>.
If those templates do not exist when you try to preview or display a new article,
you'll get this warning :-)  Please place your own templates in themes/<em>yourtheme</em>/modules/articles . The templates will get the extension .xt there. </p>
</div>
<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;An MWSR Queue with Minimalist Locking</h1>
<p><strong>Author:</strong>&nbsp;Bob Schmidt</p>
<p>
<strong>Date:</strong> 07 February 2018 16:45:17 +00:00 or Wed, 07 February 2018 16:45:17 +00:00</p>
<p><strong>Summary:</strong>&nbsp;Multithreaded queues come in many flavours. Sergey Ignatchenko describes his implementation of a multiple writer single reader queue.</p>
<p><strong>Body:</strong>&nbsp;<p class="EditorIntro">Disclaimer: as usual, the opinions within this article are those of â€˜No Bugsâ€™ Hare, and do not necessarily coincide with the opinions of the translators and <em>Overload</em> editors; also, please keep in mind that translation difficulties from Lapine (like those described in [<a href="#[Loganberry04]">Loganberry04)</a> might have prevented an exact translation. In addition, the translator and <em>Overload</em> expressly disclaim all responsibility from any action or inaction resulting from reading this article.</p>

<p>In [<a href="#[NoBugs17]">NoBugs17</a>], we discussed the theory behind using CAS (Re)Actors to build multithreaded non-blocking primitives. A very brief recap:</p>

<ul>
	<li>Whenever we want non-blocking processing, we <strong>have</strong> to use CAS (Compare and Swap) operations</li>
	
	<li>The idea of CAS (Re)Actors is to treat a CAS block (up to 128 bits in size on modern CPUs) as a state of the (Re)Actor; in other words, all weâ€™re doing within one specific (Re)Actor always fits into the following pattern:
		<ul>
			<li>We read the state</li>
			<li>We modify it if necessary</li>
			<li>We write it back</li>
		</ul>
	</li>
	<li>In the context of CAS (Re)Actors, writing the state back is tricky: it requires CAS operations, which can fail due to some other thread interfering with us. If we fail, we simply drop the whole result and start anew; actually, this is a very typical pattern in CAS-based primitives. Another way to look at it is to consider it an incarnation of optimistic concurrency control.</li>
</ul>

<p>As was mentioned in [<a href="#[NoBugs17]">NoBugs17</a>], the benefit provided by CAS (Re)Actors is <strong>not</strong> about the sequence of CPU operations weâ€™re issuing; in theory, <em>exactly</em> the same thing can be produced without (Re)Actors at all. The key benefit is about <em>how weâ€™re thinking about our multithreaded primitive</em>, which tends to provide significant benefits in the complexity that we can handle. Today weâ€™ll demonstrate a practical use of CAS (Re)Actors using one very specific example.</p>

<table class="sidebartable">
	<tr>
		<td class="title">Optimistic concurrency control</td>
	</tr>
	<tr>
		<td>
			<table class="journaltable">
				<tr>
					<td>
						<p>OCC assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. [<a href="#Wikipedia">Wikipedia</a>]</p>
					</td>
				</tr>
			</table>
		</td>
	</tr>
</table>

<h2>The task at hand</h2>

<p>In quite a few rather serious real-world interactive distributed systems (such as stock exchanges and games), I found myself in need of a really fast queue, which had to have to have the following characteristics:</p>

<ul>
	<li>It should be an MWSR queue (allowing Multiple Writers, but only a Single Reader).</li>
	
	<li>It should have flow control. In other words â€“ if queue writers are doing better than queue readers â€“ the queue <em>must not</em> grow infinitely. Instead, at <em>some</em> point (when the queue grows over a certain pre-defined limit), our queue <em>must</em> start blocking writers.</li>
	
	<li>This means that our queue <em>cannot possibly be</em> 100% non-blocking. However, it should <em>only</em> be blocking. In other words, unless the queue is full, write should be non-blocking, and unless the queue is empty, read should be non-blocking too. <em>NB: as this is a performance requirement, complying with it a mere 99.9% of the time is good enough.</em></li>
	
	<li>As a side â€“ but occasionally rather important â€“ property, the queue should be at least almost-fair. 100% fair queues (the ones which <em>guarantee</em> first come, first served behaviour) are rather difficult to achieve, but an <em>almost</em>-fair property (~=â€˜there wonâ€™t be <em>too much</em> reordering in most of the scenariosâ€™) is much easier to obtain. In particular, if weâ€™re speaking about re-orderings caused by CAS failures (followed by an immediate retry along the lines of CAS (Re)Actors), then in practice, in 99.(9)% of the cases, unfairness will be limited to single-digit microseconds, which is OK for the vast majority of the use cases out there. In fact, most of the time, usually there are <em>much</em> more severe and unpredictable delays than those single-digit microseconds caused by CAS reorderings, so overall system behaviour will be pretty much indistinguishable from the behaviour of a perfectly fair system.</li>
</ul>

<p>As much as I was in need of such a queue, it turned out to be an <em>extremely</em> difficult task, so that I wasnâ€™t able to devise a system which avoids <em>all</em> the races. I did try to do it three or four times, but each time I found myself going into a vicious cycle of solving one race merely to create another one, and going into this â€˜robbing Peter to pay Paulâ€™ mode until I realized the futility of my efforts in that direction :-(. It was even worse as I was <em>sure</em> that a solution did exist â€“ it is just that I wasnâ€™t able to find it.</p>

<h2>Enter CAS (Re)Actors</h2>

<p>After several attempts at it, writing this queue-with-flow-control became a kind of personal obsession of mine, so there is no surprise that last year I took another shot at it. And as by that time I was spending quite a bit of time formalizing and generalizing my experiences with (Re)Actors, the idea of CAS-size (Re)Actors fortunately came to my mind. Letâ€™s see how the CAS (Re)Actors did allow me to write that elusive MWSR queue with flow-control (we have to be sketchy here, but the whole supposedly-working implementation is available on Github [<a href="#[NoBugs18]">NoBugs18</a>]).</p>

<h3>(Re)Actors</h3>

<p>Letâ€™s say that our queue is based on two CAS (Re)Actors, <code>EntranceReactor</code> and <code>ExitReactor</code>:</p>

<ul>
	<li>The primary goal of <code>EntranceReactor</code> is to handle writers, providing slots in the queue and instructing them to block when no such slots are available.
	
		<p><code>EntranceReactor</code>â€™s state consists of:</p>
	
		<ul>
			<li><code>firstIDToWrite</code> â€“ the first ID in the queue which is available for writing. <em>NB: we consider all IDs as non-wrappable because (as was discussed in [</em><a href="#[NoBugs17]">NoBugs17</a><em>]) it would take hundreds of years to wrap-around a 64-bit counter). The position of the first ID in the queue buffer can be calculated as a simple</em> <code>ID%QueueSize</code>.</li>
		
			<li><code>lastIDToWrite</code> â€“ the last ID which is available for writing (of course, we <em>do</em> want to allow more than one concurrent write to our fast-performing queue).</li>
		
			<li><code>lockedThreadCount</code> â€“ the number of writers which are currently locked (because the queue is full).</li>
		</ul>

		<p>In accordance with CAS (Re)Actor doctrine, all operations over (Re)Actor are inherently atomic. For <code>EntranceReactor</code>, we define the following atomic operations:</p>
	
		<ul>
			<li><code>allocateNextID()</code> â€“ allocates the next ID to the caller, <em>and</em> indicates whether the caller should lock for a while.</li>
		
			<li><code>unlock()</code> â€“ indicates that the thread is unlocked.</li>
		
			<li><code>moveLastToWrite(lastW)</code> â€“ here weâ€™re telling our <code>EntranceReactor</code> that reader has already read everything up to <code>lastW</code>, so that it can allow more writes. <code>moveLastToWrite()</code> returns whether we should unlock one or more writers (which is an expensive operation so we want to avoid it as long as possible).</li>
		</ul>
	</li>
	
	<li>Our second (Re)Actor is <code>ExitReactor</code>, which handles our only reader; in particular, it maintains information about completed writes, so it can tell when information is available for reading.
	
		<p>The state of <code>ExitReactor</code> consists of:</p>
	
		<ul>
			<li><code>firstIDToRead</code> â€“ the first ID which is not read yet.</li>
		
			<li><code>completedWritesMask</code> â€“ as there can be several concurrent writes, they can finish in an arbitrary order, so we have to account for them with a mask.</li>
		
			<li><code>readerIsLocked</code> â€“ a simple flag, with semantics similar to <code>lockedThreadCount</code> (but as we have only one reader, a simple boolean flag is sufficient here).</li>
		</ul>
	
		<p>As for <code>ExitReactor</code>â€™s atomic operations, we define them as follows:</p>
	
		<ul>
			<li><code>writeCompleted(ID)</code> â€“ indicates that the write of specific ID is completed.</li>
		
			<li><code>startRead()</code> â€“ called by reader to start read, and either returns an ID to read, or indicates that we should lock instead.</li>
		
			<li><code>readCompleted(ID)</code> â€“ indicates that the reader is done with reading; as it usually frees some space in the buffer, it returns a new ID where the write can be done.</li>
		</ul>
	</li>
</ul>

<p>For sizes of the fields, please refer to [<a href="#[NoBugs18]">NoBugs18</a>]; however, it should be noted that, under the CAS (Re)Actors paradigm, we can easily use bit fields, so the only thing we care about is the <em>combined bit size</em> of the fields, which shouldnâ€™t exceed the magic number of 128 (that is, for modern x64 CPUs which support the CMPXCHG16B instruction â€“ and that is pretty much any Intel/AMD CPU produced over last 10 years or so).</p>

<h3>On the ABA problem</h3>

<p>As is well-known in MT programming, and was briefly discussed in [<a href="#[NoBugs17]">NoBugs17</a>], the so-called ABA problem is one of those things which can easily kill the correctness of a multithreaded primitive. However, it <em>seems</em> that our (Re)Actors are free from ABA-related issues; very briefly:</p>

<ul>
	<li>As our IDs are monotonically increased and wraparound-free, all the writes which update <em>at least one</em> of the IDs are free from ABA problems (see also discussion on it in [<a href="#[NoBugs17]">NoBugs17</a>]).</li>
	
	<li>The fields <code>lockedThreadCount</code> and <code>readerIsLocked</code> have semantics with the property â€˜it is <strong>only</strong> the current value which matters, and no history is relevantâ€™, which also means that their updates are ABA-problem free.</li>
	
	<li>This leaves <code>completedWritesMask</code> as the only potentially ABA-dangerous field. However, we can observe that if we consider the <em>tuple</em> (<code>firstIDToRead</code>,<code>completedWritesMask</code>) and take into account the logic behind these fields, <em>this whole tuple</em> is monotonically increased and wraparound-free; this, in turn, means that there is no potential for ABA problems here either &lt;phew /&gt;.</li>
</ul>

<p>Overall, from what I can see, our (Re)Actors are ABA-problem free; still, as an ABA problem is one of those bugs which can sit there for ages before manifesting itself, I would certainly appreciate somebody more skilled than me having another look at it. </p>

<h2>Locking primitives</h2>

<p>In addition to (Re)Actors, we have two locking primitives, both built more or less along the lines of Listing 1.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
class LockedSingleThread {
private:
  int lockCount = 0;//MAY be both &gt;0 and &lt;0
  std::mutex mx;
  std::condition_variable cv;
public:
  void lockAndWait() {
    std::unique_lock&lt;std::mutex&gt; lock(mx);
    assert(lockCount == -1 || lockCount == 0);
    lockCount++;
    while (lockCount &gt; 0) {
      cv.wait(lock);
    }
  }
  void unlock() {
    std::unique_lock&lt;std::mutex&gt; lock(mx);
    lockCount--;
    lock.unlock();
    cv.notify_one();
  }
};
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 1</td>
	</tr>
</table>

<p>It is a rather simple (but quite interesting) primitive, with the idea being that whenever some of our (Re)Actors return, we should lock. The corresponding thread calls <code>lockAndWait()</code> and waits on a conditional variable until some other thread calls <code>unlock()</code>. It is important to note that our locking primitives <strong>must</strong> unlock properly regardless of potential races between a thread being locked and <code>unlock()</code>. In other words, it should work regardless of whether <code>unlock()</code> comes <em>before</em> or <em>after</em> the thread scheduled to be locked reaches <code>lockAndWait()</code>.</p>

<h2>Putting it all together</h2>

<p>Having all four building blocks (two (Re)Actors and two locking primitives), we can write our <code>MWSRQueue</code> (see Listing 2). </p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
template&lt;class QueueItem&gt;
class MWSRQueue {
  static constexpr size_t QueueSize = 64;

private:
  QueueItem items[QueueSize];
  MT_CAS entrance;
  MWSRQueueFC_helpers::LockedThreadsList
    lockedWriters;
  MT_CAS exit;
  MWSRQueueFC_helpers::LockedSingleThread
    lockedReader;

public:
  MWSRQueue();
  void push(QueueItem&amp;&amp; item) {
    EntranceReactorHandle ent(entrance);
    std::pair&lt;bool, uint64_t&gt; ok_id =
      ent.allocateNextID();
    if (ok_id.first) {
      lockedWriters.lockAndWait(ok_id.second);
      ent.unlock();
    }
    size_t idx = index(ok_id.second);
    items[idx] = std::move(item);
    ExitReactorHandle ex(exit);
    bool unlock =
      ex.writeCompleted(ok_id.second);
    if (unlock)
      lockedReader.unlock();
  }
  QueueItem pop() {
    while (true) {
      ExitReactorHandle ex(exit);
      std::pair&lt;size_t, uint64_t&gt; sz_id =
        ex.startRead();
      size_t sz = sz_id.first;
      assert(sz &lt;= QueueSize);
      if (!sz) {
        lockedReader.lockAndWait();
        // unlocking ex is done by
        // ex.writeCompleted()
        continue;//while(true)
      }
      uint64_t id = sz_id.second;
      size_t idx = index(id);
      QueueItem ret = std::move(items[idx]);
      uint64_t newLastW =
        ex.readCompleted(sz,id);
      EntranceReactorHandle ent(entrance);
      bool shouldUnlock =
        ent.moveLastToWrite(newLastW);
      if (shouldUnlock)
        lockedWriters.unlockAllUpTo(id +
          sz - 1 +QueueSize);
      return ret;
    } //while(true)
  }
private:
  size_t index(uint64_t i) {
    return i % QueueSize; //should be fast as
             // long as QueueSize is power of 2
  }
};
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 2</td>
	</tr>
</table>

<p>As we can see, after we have defined our (Re)Actors (including their operations), the whole thing is fairly simple. Within our <code>push()</code> function, we merely:</p>

<ul>
	<li>request an ID from <code>EntranceReactor</code> (and <code>lockAndWait()</code> if weâ€™re told to do so)</li>
	
	<li>write to the slot which corresponds to the ID. Note that while weâ€™re actually writing, weâ€™re <em>not holding any</em> locks, which is certainly a Good Thingâ„¢ concurrency-wise.</li>
	
	<li>Inform <code>ExitReactor</code> that the write is completed (so reader can start reading the ID we just wrote)</li>
</ul>

<p>As for our <code>pop()</code> function, it is only marginally more complicated:</p>

<ul>
	<li>We ask <code>ExitReactor</code> whether it is OK to read; if not, weâ€™re locking (and re-trying from scratch later)</li>
	
	<li>We read the data (again, at this point weâ€™re <em>not</em> holding any kind of locks(!)).</li>
	
	<li>Weâ€™re telling our <code>ExitReactor</code> that weâ€™re done reading â€“ and in response it may want to inform <code>EntranceReactor</code> that there is some room available. (This is implemented via a <code>newLastW</code> variable, but actually corresponds to sending a message â€“ containing this one variable â€“ from <code>ExitReactor</code> to <code>EntranceReactor</code>.)</li>
</ul>

<p>Thatâ€™s it! Weâ€™ve got our <code>MWSRQueue</code>, and with all the desired properties too. In particular, it <em>is</em> an MWSR queue, it <em>does</em> provide flow control, it locks <em>only</em> when it is necessary (on the queue being empty or full), and it <em>is</em> almost-fair (as IDs are assigned in the very first call to the <code>allocateNextID()</code>, the most unfairness which can possibly happen is limited to CAS retries, which are never long in practice).</p>

<p>However, IMNSHO the most important property of the queue is that it was observed to be easily debuggable. After I finished writing the code (which is around 700 LoC of heavily-multithreaded code, and is next-to-impossible to test until the whole thing is completed) and ran the simplistic tests found in the <span class="filename">/test/</span> folder within [<a href="#[NoBugs18]">NoBugs18</a>], there were, of course, bugs (like a dozen of them). And for multithreaded programs in general, debugging is a well-known nightmare (in particular, because (a) a bug manifests itself in a different place on different runs, and (b) adding tracing can <em>easily</em> change things too much so the bug wonâ€™t manifest itself anymore (!)). However, this specific queue happened to be debuggable very easily:</p>

<p class="blockquote"><em>I was able to debug it within half a day.</em></p>

<p>I contend that anybody who has tried to debug multithreading programs of comparable complexity will realize <em>how fast half-a-day is for this kind of not-so-trivial multithreading</em>.</p>

<h2>Maintainability</h2>

<p>After it started to work, I ran some experiments, and found that with one single writer, it performed great: <em>with real 128-bit CAS, I measured an upper bound performance of this queue at about 130 nanoseconds per </em><code>push()</code><em>+</em><code>pop()</code><em> pair</em>. However, with more than one writer, performance was observed to degrade very severely (around 50 Ã—(!)).</p>

<p>After thinking about it for a few hours, I realized that, actually, the code used in the examples above can be improved <em>a lot</em> â€“ in particular, we can (and <strong>should</strong>) avoid going into thread-sync stuff <em>on each and every call to </em><code>pop()</code>. Indeed, as our queue can handle up to 64 slots at a time, we can read all of them into some kind of a â€˜read cacheâ€™ (with proper synchronization), but then in subsequent calls to <code>pop()</code> we can easily read all the cached values without any thread sync involved. This optimization allowed me to improve performance in tests with two writers by over 50 Ã— (so that performance with two writers became about the same as performance with one single writer). BTW, if you want to see the code with this â€˜read cacheâ€™ optimization, it is a part of current implementation in [<a href="#[NoBugs18]">NoBugs18</a>].</p>

<p>However, my main point here is <em>not</em> about the performance of this particular queue. What I want to emphasize is that:</p>

<ul>
	<li>In spite of the queue being rather complicated, it was <em>easy</em> to reason about it</li>
	<li>After I realized what I want to do, implementing the whole thing (writing + debugging) took less than two hours(!). Once again, this is <em>extremely</em> fast for writing/debugging a significant change for <em>reliably-working multithreaded programs</em>.</li>
</ul>

<p>Of course, a lot of further optimizations are possible for this queue (in particular, I am thinking of introducing â€˜write cachesâ€™ along the lines of the â€˜read cacheâ€™ above); still, even the current (not perfectly optimized) version in [<a href="#[NoBugs18]">NoBugs18</a>] <em>seems</em> to perform pretty well under close-to-real-world usage patterns. On the other hand, <em>please treat the code in [</em><a href="#[NoBugs18]">NoBugs18</a><em>] as highly experimental, and be sure to test it very thoroughly before using it in any kind of production; multithreading bugs are sneaky, and there is always a chance that one of them did manage to hide within, in spite of all the reliability improvements provided by CAS (Re)Actors.</em></p>

<h2>Conclusions</h2>

<p>We have demonstrated how the real-world task of â€˜creating an MWSR queue with flow control and minimal lockingâ€™ can be implemented using the concept of CAS (Re)Actors (which was discussed in detail in [<a href="#[NoBugs17]">NoBugs17</a>]).</p>

<p>In the process, it was also observed that</p>

<p class="blockquote"><em>Not only do CAS (Re)Actors allow us to write multithreaded programs very easily (by the standards of multithreaded programs, that is), but also CAS (Re)Actor-based programs are easily maintainable and easily optimizable.</em></p>

<p>As a nice side effect ;-), we also wrote a practically-usable MWSR queue with flow control and minimalistic locking, which can take as little as 120 nanoseconds per <code>push()</code>+<code>pop()</code> pair :-).</p>

<p><img src="/content/images/journals/ol143/Ignatchenko/Ignatchenko_01.png" /></p>

<h2>References</h2>

<p class="bibliomixed"><a id="[Loganberry04]"></a>[Loganberry04] David â€˜Loganberryâ€™, Frithaes! â€“ an Introduction to Colloquial Lapine!, <a href="http://bitsnbobstones.watershipdown.org/lapine/overview.html">http://bitsnbobstones.watershipdown.org/lapine/overview.html</a></p>

<p class="bibliomixed"><a id="[NoBugs17]"></a>[NoBugs17] â€˜No Bugsâ€™ Hare, CAS (Re)Actor for Non-Blocking Multithreaded Primitives, <em>Overload</em> #142, December 2017</p>

<p class="bibliomixed"><a id="[NoBugs18]"></a>[NoBugs18] â€˜No Bugsâ€™ Hare, mtprimitives, <a href="https://github.com/ITHare/mtprimitives/tree/master/src">https://github.com/ITHare/mtprimitives/tree/master/src</a></p>

<p class="bibliomixed"><a id="[Wikipedia]"></a>[Wikipedia] Optimistic concurrency control (OCC)<a href="https://en.wikipedia.org/wiki/Optimistic_concurrency_control">https://en.wikipedia.org/wiki/Optimistic_concurrency_control</a></p>

<p class="bibliomixed"></p>

<h2>Acknowledgement</h2>
<p>Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague</p>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
<p><em>More fields may be available via dynamicdata ..</em></p>
</div>
</channel>
</rss>
