ACCU Home page ACCU Conference Page
Search Contact us ACCU at Flickr ACCU at GitHib ACCU at Facebook ACCU at Linked-in ACCU at Twitter Skip Navigation

pincopy_on_write_ptr<type>

Overload Journal #31 - Apr 1999 + Programming Topics   Author: Jon Jagger

Recap

In my first counted_ptr<> article (way back now) I created a flawed string class that reference counted shared state. The flaw in this deliberately naïve attempt was that it failed to consider an important question: what happens when you change shared state? For my fledgling string class it was an internal implementation detail. This meant the string modifier-methods had to create a personal, unshared copy of the state before doing their modification. My second counted_ptr<> article focused on just one of the string modifier-methods; the subscript operator. At the end of the second article the string class looked like this

namespace accu {
 class string {
 public: // types
  class char_reference;
  typedef char_reference reference;
  ... ... ...
 public: // access, idiomatic and primitive
  char_reference operator[](size_t index);
  char operator[](size_t index) const;

  void assign(size_t index, char new_value); 

 private: // plumbing
  void bounds_check(size_t index) const;
  void unshare_state();

 private: // representation
  char * text;
  size_t * count;
 };

 class string::char_reference {
 public:
  char_reference(string * target,
                 size_t index);
  ... ... ...
  char_reference operator=(char new_value)
  operator char() const;

 private:
  string * const target;
  size_t const index;
 };
}

The essence of the second article was that in a code fragment like this

string theory("hello");
theory[0] = 'J';   // statement 1
cout << theory[0]; // statement 2

statement 1 is syntactic shorthand for

string::char_reference __temp = theory[0];
__temp.operator=('J');// delegates to theory.assign(0, 'J')

and statement 2 is syntactic shorthand for

string::char_reference __temp1 = theory[0];
char __temp2 = __temp1.operator char();
cout << __temp2;

Caching in

Suppose we want to cache the length of the string. We could design the cache as, say, a size_t. A cache value of zero indicates a "stale" cache. The length query simply refreshes a stale cache (if it needs to) by calling strlen. The cache will need to be declared mutable. All pretty standard stuff. But hang on: the cache will need to be mutable? Will it? The text is shared, and the cache value is the length of text, so shouldn't the cache be shared too?

namespace accu {
 class string {
 public:
  string(const char * literal = "");
  ... ... ...
  char_reference operator[](size_t index);
  void assign(size_t index, char new_value);
  size_t length() const;
  ... ... ...
 private:
  char * text;
  size_t * cached_length; // == strlen(text)
  size_t * count;
 };
}

The need for mutable vanishes because const-ness does not traverse a pointer (I will return to this). The string::length() query cannot change cached_length (which is a pointer), but it can change what cached_length points to. Hence we arrive at

namespace accu {

 size_t string::length() const {
  if (*cached_length == 0)
  {
    *cached_length = strlen(text);
  }
  return *cached_length;
 }
 ... ... ...
 void string::assign(size_t index,
                     char new_value) {
  bounds_check(index);
  unshare_state();
  text[index] = new_value;
  *cached_length = 0;
 }
}

This is fine as far as it goes, but it goes too far. It's doing more work than it needs to. Consider the two cases. First, what if new_value (of assign) is not the null-character? Answer: *cached_length is needlessly being reset. That's easy to fix though.

namespace accu {
 void string::assign(size_t index,
                     char new_value)
 {
  bounds_check(index);
  unshare_state();
  text[index] = new_value;
  if (new_value == '\0')
  {
    *cached_length = 0;
  }
 }
}

Second, what if new_value (of assign) is the null-character? Answer: *cached_length does (surely?) need to be reset. Yes, but does it need to reset it to zero? Follow it through; what will string::length recalculate the cache value as? Well, to strlen(text). Yes, but what's that? Answer: the value of index in the previous assign where new_value was the null-character! Hence

namespace accu {
 size_t string::length() const
 {
  return *cached_length;
 }
 ... ... ...
 void string::assign(size_t index,
                     char new_value)
 {
  bounds_check(index);
  unshare_state();
  text[index] = new_value;
  if (new_value == '\0')
  {
    *cached_length = index;
  }
 }
}

Neat eh?

Putting the counted_ptr<> back in again

The string class is getting messy; at least two private methods (not great on a user class) and three data members (if we leave the cache in) that are all pointers. Time to simplify. Time to put the counted_ptr<> back.

#include "counted_ptr.hpp"

namespace accu {

 class string {
 public:
  size_t length() const;
  ... ... ...
  void assign(size_t index, char new_value);
 private:
  void unshare_state();

 private:
  class body;
  counted_ptr<body> ptr;
 };
}
namespace accu {
 struct string::body {
  char * text; 
  size_t cached_length; // strlen(text)
  ... ... ...
  body(const char * literal);
  void assign(size_t index, char new_value);
  ... ... ...
 };
}

Rethinking public vs. private. Again

You may recall in one of my articles I um'ed-and-ah'ed about the pros and cons of making string::assign public or private. Making it private was attractive because it meant there was a single syntax for assignment. It also shrank the public interface a little. And a small public interface is a Good Thing. Small things tend to be cohesive. Also, a mistake on a private signature is going to affect user code a lot less than a mistake on a public signature. In this case, there's real bite to that advice. I was looking through the spec for std::string in my copy of the standard. I happened to notice a method called, you've guessed it, assign taking two parameters: a size_t and a char. And of course, the semantics are completely different. Oops [1]. However, as is so often the case with mistakes, this proved quite lucky. It forced me to consider another option. One I hadn't even considered before: remove the method altogether! The insight is that methods of string::char_reference don't have to delegate to methods of string (which in turn delegate to methods of string::body). They could delegate directly to methods of string::body. I can move assign out of string and into string::body. And while I'm at it, I'll give it a better name; replace[1].

namespace accu {

 class string {
 public:
  class body;
  class char_reference;
  ... ... ...
  char_reference operator[](size_t index);
  char operator[](size_t index) const;
  ... ... ...
 private: // representation
  counted_ptr<body> ptr;
 };

 class string::char_reference {
 public:
  char_reference(
   counted_ptr<string::body> & ptr, 
   size_t index);
  char_reference & operator=(char new_value);
  operator char() const;

 private:
  counted_ptr<string::body> & ptr;
  size_t const index;
 };
}

namespace accu {
 struct string::body {
  size_t cached_length; // strlen(text)
  char * text;
  ... ... ...
  void replace(size_t index, char new_value);
  ... ... ...
 };
}
namespace accu {
 ... ... ...
 string::char_reference
 string::operator[](size_t index)
 {
  return char_reference(ptr, index);
 }

 char string::operator[](size_t index) const
 {
  return ptr->text[index];
 }
 ... ... ...
}

namespace accu {
 ... ... ...
 string::char_reference
 string::char_reference::operator=(
  char new_value)
 {
  ptr->replace(index, new_value);
  return *this;
 }
 string::char_reference::operator char() const
 {
  return ptr->text[index];
 }
 ... ... ...
}

namespace accu {
 size_t string::length() const
 {
  return ptr->cached_length;
 }
 ... ... ...
}

I like this. But something is not quite right. A code fragment with a mistake somehow niggles at my subconscious till I spot it. There is a mistake. It's so easy to make -- we've lost the call to string::unshare_state(). Before delegating across operator->() the modifier must make its own personal copy of the state. But this is not too hard to fix. One way is to make string::unshare_state() private, and grant string::char_reference friendship to string.

namespace accu {

 class string {
 public:
  class char_reference;
  friend class char_reference;
  ... ... ...
 private:
  void unshare_state();
  ... ... ...
 };
}


namespace accu {
 string::char_reference
 string::char_reference::operator=(
  char new_value)
 {
  ptr->unshare_state();
  ptr->replace(index, new_value);
  return *this;
 }
}

This friendship feels okay; string and string::char_reference are very closely related; by nesting. But just because it's a reasonable solution doesn't mean it's the best solution. There's almost never only one solution. Here's another.

namespace accu {
 ... ... ...
 string::char_reference
 string::char_reference::operator=(]
  char new_value)
 {
  counted_ptr<string::body>
   unshared(new body(ptr->text)); 
  ptr = unshared;
  ptr->replace(index, new_value);
  return *this;
 }
}

This is bit more complicated, but it has one thing going for it. It enables us to remove unshare_state() from the string class. That in turn means we can remove the friendship. The string class is thinning out nicely. The only thing to remember is to do this indirect unsharing in every modifier. That's not much to ask. But it is duplication. And duplication is boring. And when your mind gets bored it makes mistakes. So again, let's not rest on this solution, let's see if there is a better one. But this time, let's try a different approach. Ask yourself: "what's an ideal solution?" In a sense we already know an answer to this. It's ...

namespace accu {
 ... ... ...
 string::char_reference
 string::char_reference::operator=(
  char new_value)
 {
  ptr->replace(index, new_value);
  return *this;
 }
}

At this point you're probably thinking "Eh? - Jagger's lost it this time. Isn't this the version with the error again?" Yes. It is! But I maintain this version has a lot going for it. Number one, it was the first version so you could argue it's the most obvious "solution". Number two, it's the simplest "solution" since it's the shortest[2]. Nevertheless it isn't a solution since it doesn't work. Let's take a step back and take a look at counted_ptr<>

namespace accu {

 template<typename type>
 class counted_ptr {
 public:
  ... ... ...
  type * operator->() const;
  type & operator*() const;
 private:
  type * ptr;
  size_t * count;
 };
}

Both operators are const queries, but both give non-const access to the target object. This is perfectly reasonable; the counted pointer is behaving exactly like a raw pointer. What we have here is the hidden handle/body idiom. The user uses the handle (string) directly as a value, and the body (string::body) is effectively invisible. Hidden handle/body means the body is hidden from the user of the handle. In this case, the counted_ptr<string::body> is the link between the string-handle and the string::body-body. More specifically, counted_ptr<string::body>::operator->() is the usual way the string handle traverses to it's body. For example

namespace accu {
 size_t string::length() const
 {
  return ptr->cached_length;
 }
 ... ... ...
 string::char_reference
 string::char_reference::operator=(
  char new_value)
 {
  ptr->replace(index, new_value);
  return *this;
 }
}

What we'd really like is for the expression ptr-> to do different things in different cases. For the query, to just return a read-only raw pointer. And for a modifier to first make a deep copy and then return a read-write raw pointer. And, as it turns out we can do this since we can, once again, overload on const.

cow_ptr<type> - Copy On Write pointer

namespace accu {
 template<typename type>
 class cow_ptr { // moo
 public:
  ... ... ...
        type * operator->();
  const type * operator->() const;
 private:
  type * ptr;
  size_t * count;
 };
}

Let's look at our two examples again, this time using cow_ptr<string::body> instead of counted_ptr<string::body>. Here they are again. Remember ptr is a cow_ptr<string::body>

namespace accu {
 size_t string::length() const
 {
  return ptr->cached_length;
 }
}

This is a query. The compiler is forced to resolve ptr-> as the const version of cow_ptr<string::body>::operator->(), which can simply return a read-only raw pointer. Recall that counted_ptr<string::body>::operator->() returns a read-write raw pointer. In other words this

namespace accu {
 size_t string::length() const
 {
  ptr->cached_length = 42;
  ...
 }
}

will compile quite happily if ptr is a counted_ptr<string::body>, which is not a Good Thing. But it will generate a compiler error if ptr is a cow_ptr<string::body>, which is a Good Thing. The cow_ptr<> version is type-safer. When you split a class into a handle and a body it's easy to forget this subtlety: that the const-ness of a raw pointer does not traverse the pointer.

namespace accu {
 string::char_reference
 string::char_reference::operator=(
  char new_value)
 {
  ptr->replace(index, new_value);
  return *this;
 }
}

This is a modifier. The compiler is forced to resolve ptr-> as the non-const version of cow_ptr<string::body>::operator->() which can make a deep copy and then return a read-write raw pointer. The apparently missing call to make a deep copy is no longer an error; it has been refactored inside operator->() itself. There is a single point of definition, and no duplication. And the code in the handle-methods has a common look and feel too. Here's a first cut at cow_ptr<>.

namespace accu {
 template<typename type>
 class cow_ptr {
 public: // create/copy/destroy
  explicit cow_ptr(type * ptr);
  cow_ptr(const cow_ptr & rhs);
  cow_ptr & operator=(const cow_ptr & rhs);
  ~cow_ptr();

 public: // access
        type * operator->();
  const type * operator->() const;
        type & operator*();
  const type & operator*() const;
  ... ... ...
 private: // plumbing
  void copy();
  type * checked_ptr() const;

 private: // state
  type * ptr;
  size_t * count;
 };
}
namespace accu {
 // cow_ptr<> - create/copy/destroy
 template<typename type>
 cow_ptr<type>::cow_ptr(type * p)
  : ptr(p), count(new size_t(1)) 
 {
  // all done
 }

 template<typename type>
 cow_ptr<type>::cow_ptr(const cow_ptr& rhs)
  : ptr(rhs.ptr), count(rhs.count)
 {
  ++*count;
 }

 template<typename type>
 cow_ptr<type> & 
 cow_ptr<type>::operator=(const cow_ptr& rhs) 
 {
  ++*rhs.count;
  if (--*count == 0)
  {
   delete count;
   delete ptr;
  }
  ptr = rhs.ptr;
  count = rhs.count;
  return *this;
 }

 template<typename type>
 cow_ptr<type>::~cow_ptr()
 {
  if (--*count == 0)
  {
   delete count;
   delete ptr;
  }
 }
}

namespace accu { // cow_ptr<> - access
 template<typename type>
 type * 
 cow_ptr<type>::operator->()
 {
  copy();
  return checked_ptr();
 }

 template<typename type>
 const type * 
 cow_ptr<type>::operator->() const
 {
  return checked_ptr();
 }

 template<typename type>
 type & 
 cow_ptr<type>::operator*()
 {
  copy();
  return *checked_ptr();
 }

 template<typename type>
 const type & 
 cow_ptr<type>::operator*() const
 {
  return *checked_ptr();
 }
}

namespace accu { // cow_ptr<> - plumbing
 template<typename type>
 void
 cow_ptr<type>::copy() 
 {
  if (*count > 1)
  {
   cow_ptr<type> unshared(new type(*ptr)));
   *this = unshared;
  }
 }

 template<typename type>
 type *
 cow_ptr<type>::checked_ptr() const
 {
  if (!ptr)
  {
   throw logic_error("cow_ptr<>: null");
  }
  return ptr;
 }
}

And we need only ensure string::body has a copy constructor which makes a deep copy. Remember, thanks to the string::char_reference proxy class, the string::body copy constructor will only be called if we are actually making a change.

namespace accu {
struct string::body {
 char * text;
 size_t cached_length; // strlen(text)
 ... ... ...
 body(const char * literal);
 body(const body & rhs);
 ... ... ...
};

string::body::body(const body & rhs)
 : text(new char[rhs.cached_length + 1])
 , cached_length(rhs.cached_length)
{
 strcpy(text, rhs.text);
}
... ... ...
}

My thanks, as ever, to Kevlin for his comments and guidance. That's all for now.



[1] It turns out that around the time I wrote the article Kevlin was working on some slides for QA Training that covered this very topic. Kevlin had seen and avoided this exact problem; his method was called replace from the start.

[2] I'm the first to admit an element of sophistry in this.

Overload Journal #31 - Apr 1999 + Programming Topics