Title: From Mechanism to Method

Author:

Date: 02 June 2003 22:57:04 +01:00 or Mon, 02 June 2003 22:57:04 +01:00

Summary:

Body:

Introduction

Distinctly Qualified

The standard library string type is a product of elegance and sufficiency.

OK, OK, just kidding. If it looks like the product of design by committee it's because it was. If you appreciate minimalist design its baroqueness is certain to disappoint.

The road to Hell is paved with good intentions, and the standard library string is full of good intentions. The standard basic_string class template started life as a simple class and reached its middle age as an over-parameterized class template with a bad name and a bloated interface. The process of standardization added somewhat more than two cents worth: What about generalization for internationalization? What about copy optimization through reference counting? What about customizing its memory allocation? What about safe indexing? What about reverse search operations that mirror each forward search operation? What about support for STL? And support for similar index-based operations? You see, good intentions every one of them. But too much compromise in design leads to a compromised design.

In spite of this criticism, I use the standard string type. For one thing, it's standard, and for another it satisfies more of my string needs than a vanilla char *. More positively, inside this behemoth is a small class (or two) struggling to get out. There is a sense of obligation to try to release and realize it. In this article I don't want to address each and every last detail of such a redesign, but I would like to outline an approach. In particular, I want to declutter the interface a little to clear the path to a better understanding of something that has haunted string classes for the last decade or so: the specter of copy optimization through reference counting and copy-on-write.

The issues thrown up by reference counting can be tackled head on, with limited success, or tackled laterally. The resolution lies in a combination of restraint and substitutability principles [Henney2000], and in particular treating const qualification as a form of type separation [Henney2001a, Henney2001b].

Minimalism

The open secrets of good design practice include the importance of knowing what to keep whole, what to combine, what to separate, and what to throw away.

We can start with the name. Names are important. Hiding the templatedness of strings does not actually help the reader in any way, except perhaps to discourage them to think of or use strings as templated abstractions. The reason we are left with the common typedefed names of string and wstring, and the cumbersome underlying names of basic_string<char> and basic_string<wchar_t>, is historical. Originally, in the pre-STL era, the proposed standard library was light on templates - in fact it was quite light, period - and string and wstring were classes. In modern C++ programming we are more familiar with template usage, and would not be quite so reticent about the names or accepting of the supposed benefit of hiding templated usage. If we consider strings to be templated with respect to their character type, then so be it: string<char> and string<wchar_t> are clearer, more direct, and proud to be templated.

Orthogonality

Designing a string class is more of a challenge than many people appreciate. If you have not already done so (several times) it's worth a try: It's a good C++ work out - memory management, operator overloading, optimization, etc. The problem is not so much in language features or implementation, but in interface. What problem is a string designed to solve? Without a clear focus you discover that everyone has a slightly different view of what a string should be: a self-managed array of characters; a wrapper for char * and <cstring>; a small piece of text that requires simple access and concatenation; a potentially large stretch of text that requires efficient slicing and rearrangement operations; regular-expression searchable text; and so on. The problem is one of choice: All of these suggestions are reasonable, but satisfying them all simultaneously is not. Any attempt to create a single string class that does so cannot succeed. We have enough existence proofs to back this up.

There is already an excellent example of how to solve this design problem in the standard library: the STL. The STL stands head and shoulders above other container libraries because it is not a container library: It is a specification that defines independent, open-ended families of algorithms, function objects, iterators, and containers along with the requirements that allow them to be combined. Independent ideas are expressed independently, with algorithms separated from underlying container representation through iterators and from functional specifics through function objects. This orthogonal design separates concerns quite clearly. The added bonus is that you get some predefined algorithms, function objects, iterators, and containers thrown into the deal.

There is no single container class that satisfies all of our needs, so we have requirements and exemplars in the library. Given that we now know that asking how to write a single string class is the wrong design question, we can see how the cleaner STL-style solution can be applied to strings. In the first instance, we can consider strings to be sequences. Their bit-copyable elements, common use of null termination, conversions, concatenation, and I/O further characterize them. If we start with this, we end up with a minimal interface that can be satisfied by lightweight char * wrappers and scalable string implementations - such as SGI's rope template [SGI] - and even std::basic_string. The more complex functionality associated with different string types is then expressed orthogonally through algorithms, so that new algorithms can act on all strings and new strings can take advantage of old algorithms. Now, should we want an all-singing, all-dancing, killerapp, last-one-you'll-ever-need string, we can write our own - so long as it satisfies the base requirements of what is means to be a string.

Choice

An interface should represent reasonable goals and present its user with reasonable choices - overachieving interfaces are weaker and more complex, not stronger and simpler.

Consider, for instance, the issue of subscripting. operator[] is not required to perform bounds checking whereas at is. Indexing out of bounds causes undefined behavior for operator[] and an out_of_range exception for at. On the surface, this looks like a reasonable choice: You get to choose the quality of failure for yourself. The problem is that such an option is utterly useless and cannot be reasonably exercised. When would you consciously choose to write code that needed at rather than operator[]? If you make the choice, you have already anticipated the bug, and can therefore prevent it.

If you have a choice, it should be reasonable to exercise it. It should also be possible to exercise it. Take allocators. Please. The world of containers would be far simpler without them and very, very few people would miss them. People that actually need to customize their memory allocation - for shared memory, for persistence, for the sake of it - find themselves working against rather than with the allocator model. Trying effective memory management of a container whose representation and management is not fully open to you is like eating with a knife and fork... held with chopsticks... through mittens. If you are serious about managing the allocation of a container, then get serious and manage it: Write your own container type. It is simpler, more likely to work, and is very much in the extensible spirit of the STL - more so than limiting yourself to the handful of default container implementations in std.

Consistency

Another property of a well-designed interface is consistency. Some functions in basic_string throw exceptions on failure, whereas others do not. This is already inconsistent, but is made more so by the presence of both operator[] and at. If operator[] is shadowed by the exception-throwing at, then where are the exception-throwing doubles for iteration? If an exception-throwing access operator is considered reasonable, you should expect - indeed, demand - safe and unsafe variants for other forms of access. After all, what is good for the goose is good for the gander. However, we have established that at is not reasonable, so there is no need to clutter up the string interface any further.

Mad COW Disease

Strings get copied. Fact of life. Copy assignment and construction afford strings their value-based behavior. But strings are not lightweight classes. They encapsulate a heap allocated representation, and copying could be expensive, especially if the copied string is never modified:

template<typename char_type>
class string 
{
  ...
private:
  ...

  size_t used, reserved; // current
                         // length and allocated space
  char_type *text; // allocated and
                   // deallocated representation
};

The compiler is entitled to a number of optimizations. For instance, the following:

std::string cow = "Woof!";

Is equivalent to:

std::string cow = std::string("Woof!");

But can be - and normally is - optimized to:

std::string cow("Woof!");

For assignment, overloading operator= to take a const char * prevents a conversion to a temporary that is then used with the ordinary copy assignment operator.

The result of string concatenation is a temporary string object:

std::string loud_cow = cow + "!!";

Here operator+ returns a temporary std::string object that is used to initialize loud_cow. Depending on how the called function is written, the named return value optimization (NRV) allows a compiler to construct directly into loud_cow [Ellis-1990, Lippman1996] rather than create an additional temporary object. This optimization applies only to copy construction, not copy assignment: If loud_cow is assigned the result of the concatenation, a temporary is created and then discarded. Similarly, in the following initialization two temporaries are created, only one of which can be optimized away by the NRV:

std::string loud_cow = cow + " " + cow;

Because value objects of class type are commonly passed around by const reference, copying typically happens through assignment, data member initialization, and return values. We can see that the compiler already has considerable license to optimize, and that techniques such as overloading to prevent conversions and preferring initialization to construction help reduce the temporary burden, so to speak. But in complex expressions and initialization of data members we can also see that there may still be the need to amortize the cost of copying.

What is required of an optimization? Transparency - so it is substitutable for the unoptimized version - and optimization - many optimizations aren't. In particular, the requirement of transparency means that users should not be entertained by new and interesting bugs.

Counting the Bodies

The most common copy optimization is to share the representation of a string when a copy is made rather than make a deep copy that results in heap allocation. This means that copying is simple and cheap. Only when the string is going to be modified does the 'real' copy occur to avoid aliasing surprises. This lazy, just-in-time model - commonly referred to as copy on write - defers the cost of allocation until the point it is absolutely needed. If it is never needed, the cost is not paid. However, few things in life are for free: The sharing is not without overhead. For a start, it must be managed, which increases the complexity of the code. The referencing must also be tracked so that when - as a result of assignment or destruction - a string's text body is no longer referenced it is properly deallocated, and when only a single string handle refers to a text body, redundant deep copies are not made.

There are five ways in which references held by string handles to text bodies can be sensibly tracked, each with its own particular tradeoffs:

Hold separate pointers to the reference count and the actual text. This means that the footprint of the string object is a little larger and that we are paying for the allocation of two heap objects. The allocation means that it is unlikely that we recoup our investment unless a text body is shared by more than two string handles. For a single reference, this is a not an optimization. Holding a static reference count of 1, and only allocating a dynamic count when the figure rises above that can reduce the overhead in this case. This will complicate the implementation, but if the majority of strings are never copied this will be a saving. If, on the other hand, the string handle's footprint is a concern, the information duplicated between sharing handles can also be associated with the count, reducing the footprint to two pointers:
```
template<typename char_type>
class string 
{
  ...
private:
  ...
  struct shared 
  {
    size_t used, reserved, count;
  };
  shared *info;
  char_type *text;
};
```
Hold a single pointer to an object that contains the reference count, the pointer to the shared text, and the text size information. This always results in the allocation of two objects on the heap, and there is an extra level of indirection to reach the actual text. For some designs this could provide an additional benefit of allowing the actual text to be reallocated or virtualized in some way, e.g. to disk, without affecting the handle objects. In the common case, the main benefits of this approach are a little more restricted. The string handle's footprint has now been reduced to a single pointer and, if you want to add a constructor and destructor to the shared body, the management of the text memory can be hidden from the string handle. In its simplest form we can see the basic rearrangement is a proper handle-body configuration [Coplien1992, Gamma-1995]:
```
template<typename char_type>
class string 
{
  ...
private:
  ...
  struct shared 
  {
    size_t used, reserved, count;
    char_type *text;
  };
  shared *body;
};
```
Hold a single pointer to memory that contains both the information about the string text - including the reference count - and the string text itself. The information is held as a prefix to the char_type array. Only a single pointer is held in the handle, only a single allocation is performed, and treating the space before the text as a different type allows access to the string information. Although this solution is at a slightly lower level, it can be very effective [Henney1998], especially when encapsulated within the string handle. The drawbacks to this approach are that any resize must also involve reallocating and copying the information prefix, and also the intent of the code and connections between the data structures is less obvious:
```
template<typename char_type>
class string 
{
  ...
private:
  ...
  struct shared 
  {
    size_t used, reserved, count;
  };
  char_type *text; // reinterpret_cast
                   // <shared *>(text) - 1
};
```
Link copied objects together in a doubly-linked list and hold a pointer to the string text. The information about the string text can be held duplicated in each string handle or as a prefix of the text body's memory. When the links going to the previous and next string handle are both null (or, in a circular configuration, pointing to this) the text body is uniquely owned. This style of reference accounting (it is not really reference counting because there is no explicit count) is perhaps least appropriate for strings because there are no operations that require traversal of all handles. Each string handle will have a larger footprint than the other solutions considered so far, although only a single allocation is required per text body:
```
template<typename char_type>
class string 
{
  ...
private:
  ...
  size_t used, reserved;
  char_type *text;
  string *previous, *next;
};
```
Hold the string text in a managed lookup table and retain some kind of reference into the table. The information about the string can be held alongside the actual text in the table. This approach is suitable when the aim is not simply to reduce copy cost, but also to eliminate any duplicate strings. It is effectively a symbol table. The cost of initialization from a raw string is increased because of an initial search and a possible initial insertion, and there is increased space overhead per text body that exists. Strings can be held uniquely so that some string features, such as reserved capacity, are no longer appropriate. For strings, the typical implementation is to hold a static repository, which introduces its own issues as far as initialization and finalization ordering. This is typically not a suitable design for general purpose strings:
```
template<typename char_type>
class string 
{
  ...
private:
  ...
  struct less {...}; // function object
                     // type for comparison
  struct info 
  {
    size_t used, count;
  };

  typedef map<const char_type *, info, less> string_map;
  static string_map strings;
  string_map::iterator entry;
};
```

Clearly, there are many ways to skin a cow. For general-purpose, copy-on-write strings, the first three techniques are the most appropriate and most common.

Trying to be Smart

It seems clear that non-const operations such as operator+= and resize require a string handle to operate on its own copy of the text body. It also seems clear that const operations, such as size and compare, can operate without ill effect on a shared representation. This seems to divide operations in the string world neatly into two type types. However, there is a grey territory in between. What about non-const operator[]? This operator may be used for both reading from and writing to a string:

string<char> cow = "Woof!", ghost = cow;
ghost[3] = cow[1];

Both of these calls result in a call to the non-const operator[], but for assignment we want to assure that a deep copy happens, but for reading a deep copy would be wasteful. There is no way to distinguish between these uses within operator[]. What we need is a smarter reference to do the work for us:

template<typename char_type>
class string 
{
public:
  class reference 
  {
  public:
    ...
    char &operator=(char); // perform
                           // deep copy before write
    operator char() const; // use shared
                           // representation
  private:
    string *target;
    size_t index;
  };
  reference operator[](size_t);
  ...
};

This smart reference approach works in most cases. However, a smart reference is not totally substitutable for a real reference. The following fails to compile because std::swap expects real references:

swap(cow[3], ghost[1]);

There are other problems with the smart reference approach for strings [Meyers1996, Sutter1998a], some of which are related to dubious practice - holding the address of a returned reference - and others to do with constraints in the standard - the reference type is required to be a real reference, no smart references allowed.

And don't think that the problem is just confined to operator[]: It also applies to the iterator type, which may be used for both reading and writing. Therefore, for reference-counted strings, iterator must be a smart pointer rather than raw pointer type for the reference-counting optimization to be fully effective.

Pessimism

The outlook is pessimistic. As a copy optimization the effectiveness of copy-on-write reference counting has been reduced to a few cases. In other cases it may be quite the opposite of an optimization, regardless of the investment and increase in code complexity.

The only workable evaluation model for these problem functions is a pessimistic one: You don't know whether the user is going to read or write through the returned reference, and you have to just accept that and assume the worst. You may also consider catching some of the corner cases for undefined behavior, such as holding onto the address of a returned reference. In these cases you have to prevent any future sharing, so that if the current string is used as the source for a copy it causes a deep copy rather than sharing:

template<typename char_type>
class string 
{
public:
  typedef char_type *iterator;
  iterator begin() 
  {
    reserve();
    return text;
  }
  void reserve(); // reserve
                  // representation exclusively
  ...
private:
  ...
  char_type *text;
};

All in all, this further reduces the effectiveness of copy optimization to a few corner cases. For non-const cases there appears little to be gained from considering this a general-purpose optimization.

Threadbare

The final body blow comes with the introduction of multithreading. Sharing a reference-counted text body becomes unnecessarily interesting when the sharing is between threads. The gut instinct of programmers new to threaded programming is that a mutex or equivalent synchronization primitive will solve the problem. For instance:

template<typename char_type>
class string 
{
  ...
private:
  ...
  struct shared 
  {
    size_t used, reserved, count;
    mutex guard;
    char_type *text;
  };
  shared *body;
};

Synchronization primitives are operating system resources, and as such may be potentially scarce and costly to obtain. The temptation is then to share a common mutex for all string objects:

template<typename char_type>
class string 
{
  ...
private:
  ...
  struct shared 
  {
    size_t used, reserved, count;
    char_type *text;
  };
  static mutex guard;
  shared *body;
};

In addition to the initialization and finalization issues, you now have another problem: performance. First of all, locking and unlocking mutexes for all data accesses comes with a measurable overhead. And second, all string objects are now serialized through the same mutex, creating a potential bottleneck. Given that the aim of copy-on-write reference counting is to optimize - and taken with all the other issues raised previously - a mutex-based approach is not even on the radar.

If you look carefully at what you need to lock, you will see that the locking revolves around the reference count. Many operating systems provide you with lock-free synchronization primitives for incrementing and decrementing integers, e.g. InterlockedIncrement and InterlockedDecrement on Win32. With careful coding it is now possible to ensure that no shared text body is ever compromised by race conditions. But note that these primitives still incur a performance penalty - few things in life are free.

Separation of Concerns

There is a question we have to ask ourselves: Is it all worth it? The assumption has always been there that this is a good general-purpose optimization, from the early days of standardization [Teale1991] to the current standard [ISO1998]. At every stage, accommodating this style of implementation has caused headaches, even without the threading issues. The concern is not a recent one [Murray1993]: "A use-counted class is more complicated than a non-usecounted equivalent, and all of this horsing around with use counts takes a significant amount of processing time. If the time spent copying values is small enough (either because the values are small and cheap to copy or they are not copied very often), changing the class to do use counting may make programs slower. Always do some performance measurements when making this kind of change to convince yourself that this optimization is not really a pessimization!"

With multithreading the issues become even more involved [Sutter1998b] and the horsing around becomes a full-blown stampede (but hopefully not a race condition…). This simply reinforces an earlier conclusion: It is not possible to design a single string implementation that satisfies all uses. Thus the default implementation that causes the fewest surprises (bugs) - either in use or in implementation - is to avoid copy-on-write reference counting. Avoiding it, or providing explicit information on how to disable it, is the approach now adopted by many libraries [Dinkum, SGI].

So deeply rooted is the idea that copy-on-write reference counting is mandatory for strings that many developers are shocked - and sometimes go into denial - when they discover that the return on investment in this technique is often negligible and sometimes negative. Also have a look at Laguiole steak knives. The long-standing belief in this old practice is, however, younger than faith in another more fundamental software engineering principle: separation of concerns. And hey, do we have concerns.

A Qualified Difference

Listen to the code, it is trying to tell you something: Mixing reference counting with mutability causes problems. Period. However, if you listen closely, you can hear a leading question, the whisper of a solution: What if you don't mix reference counting with mutability? What if we are dealing with two related but distinct types?

From an interface perspective, we can see that we can use a string either as something that is read-mostly information or as a read-and-write space. From an implementation perspective, problems with reference counting arise only with mutability. Previously we explored the idea of const qualification being a form of subtype relationship [Henney2001a] and one that can be reflected in inheritance [Henney2001b]. For value types we can define separate classes, not related through inheritance, and provide substitutability through conversions [Henney2000].

Consider a design where string covers the general case and something like const_string covers the immutable case. const_string has a subset of the operations of string: the const ones plus some that effect a rebinding of handle to text body, such as operator=. const_string is different to const string, which prevents all modification but still comes with any baggage not relevant to const, e.g. reserved capacity. It is more like the relationship between iterator and const_iterator.

Not only do string and const_string differ in interface, but they can also differ in implementation: string should not be reference counted but const_string may be. const_string has none of the concerns that plagued copyon-write for a mutable string, and thread safety can be catered for by atomic increment and decrement operations.

Before you get too attached to the names string and const_string - and assuming that your compiler fully supports partial template specialization - consider one last refinement that uses template specialization and lets us keep a single name:

template<typename char_type>
class string 
{
  ...
private:
  ...
  size_t used, reserved;
  char_type *text; // unshared
};

template<typename char_type>
class string<const char_type> 
{
  ...
private:
  ...
  struct shared 
  {
    const size_t used;
    size_t count;
  };
  char_type *text; // reinterpret_cast
                   // <shared *>(text) - 1
};

With this approach string<char> is a common, writeable string and string<const char> is the idiom used to work with the read-only variant.

Conclusion

What do you get when cross a string class with copy-on-write reference counting? A problem. What do you get when you cross that with separation according to qualification? A solution.

The road to optimization is full of potholes. Trying to shoe horn many interface and implementation possibilities into a single type leads to twisted back roads. Separating core representation from algorithmic abstraction can clarify and clean up a string interface. Separation according to qualification is also a simplifying decision, both for interface and implementation. A cure - in strings at least - for Mad COW Disease^[1].

References

[Coplien1992] James Coplien, Advanced C++: Programming Styles and Idioms, Addison-Wesley, 1992.

[Dinkum] The Dinkum C++ Library, Dinkumware Ltd, http://www.dinkumware.com/.

[Ellis-1990] Margaret Ellis and Bjarne Stroustrup, The Annotated C++ Reference Manual, Addison-Wesley, 1990.

[Gamma-1995] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.

[Henney1998] Kevlin Henney, "Counted Body Techniques", Overload 25, April 1998, also available from http://www.curbralan.com.

[Henney2000] Kevlin Henney, "From Mechanism to Method: Substitutability", C++ Report 12(5), May 2000, also available from http://www.curbralan.com.

[Henney2001a] Kevlin Henney, "From Mechanism to Method: Good Qualifications", C/C++ Users Journal C++ Experts Forum, January 2001, http://www.cuj.com/experts/1901/henney.html.

[Henney2001b] Kevlin Henney, "From Mechanism to Method: Total Ellipse", C/C++ Users Journal C++ Experts Forum, March 2001, http://www.cuj.com/experts/1903/henney.html.

[ISO1998] International Standard: Programming Language - C++, ISO/IEC 14882:1998(E), 1998.

[Lippman1996] Stanley Lippman, Inside the C++ Object Model, Addison-Wesley, 1996.

[Meyers1996] Scott Meyers, More Effective C++: 35 New Ways to Improve Your Programs and Designs, Addison-Wesley, 1996.

[Murray1993] Robert Murray, C++ Strategies and Tactics, Addison-Wesley, 1993.

[SGI] SGI Standard Template Library Programmer's Guide, http://www.sgi.com/tech/stl/.

[Sutter1998a] Herb Sutter, "GotW #44: Reference Counting - Part II", September 1998, http://www.gotw.ca/gotw/044.htm.

[Sutter1998b] Herb Sutter, "GotW #45: Reference Counting - Part III", October 1998, http://www.gotw.ca/gotw/045.htm.

[Teale1991] Steve Teale, "Proposing a C++ String Class Standard", Dr. Dobb's Journal, October 1991.

^[1] Thanks to Andrei Alexandrescu's original use of the Mad COW term on comp.std.c++

Notes:

More fields may be available via dynamicdata ..