Title: A Simple Model for Object Persistence Using the Standard Library

Author:

Date: 26 June 1999 17:50:53 +01:00 or Sat, 26 June 1999 17:50:53 +01:00

Summary:

Body:

Abstract

Serialising systems of objects poses some challenges in terms of the preservation of inter-object associations, and object id allocation and management. For a large-scale project, heavyweight databases are the accepted solution, but for smaller projects simpler homegrown solutions are possible. I present one such system here.

Introduction

There are many possible requirements for a persistence framework. Persistence could involve object load on demand, cross platform support, multithreaded or multiprocess access, robust transactional storage, etc., etc. This article covers none of these issues. Instead, I shall focus on the bottom line: preserving a system of objects, along with their interconnections, to permanent storage, with a view to restoring the entire system to memory at a later date.

Single Objects

Serialising simple attribute types is easy. We can pretty much take the memory image of such data and copy it verbatim to disk. This reversible process allows the data to be restored just as easily (see figure 1).

Figure 1. Simple attributes are easy to serialise.

Although more complex objects typically contain simple attribute types, they are more difficult to serialise because they may contain links to other objects. These links are often represented, in one way or another, as pointers or disguised pointers (smart pointers, look-up tables, etc.). Pointers provide extremely efficient and low-level access to the pointee, but they suffer from the problem that their values are not preserved across successive incarnations of an application.

Thus, association information needs special handling in any persistence framework. We need to be able to identify the object pointed to in an invariant manner in order to store it to disk and unambiguously restore it again (see figure 2).

Figure 2. Associations needs to be preserved when serialising interconnected objects.

Object Ids (OIDs)

One way to manage object references is to allocate each object a unique object ID (OID). If each object (or each object of a given type) can be uniquely referenced using its OID, and if these OIDs are stored to file when the system is persisted, then the pointer-based object associations can be calculated when restored by using a look-up table mapping IDs to objects.

The drawback is that allocating and managing these OIDs can be tricky and expensive.

ID Management

There are many different techniques for allocating IDs to objects. If the total number of objects ever created is small, you can just give objects successive integer values. All you need to do is keep track of the last ID allocated, and allocate the following one next time. You need to be careful that frequently allocated temporary objects don't exhaust the range of IDs and cause the allocated IDs to 'wrap around.'

Alternatively, you can create and manage a pool of available IDs, and allocate them from the pool, returning IDs to the pool as their owners expire. This is less time-efficient, and can also impose a larger memory overhead, but is a more robust solution for a long-lived application.

Object factories can simply create objects with IDs when requested, or they can additionally keep track of the objects themselves, allowing an object to be located given only its ID. In this case it is useful to make the factory responsible for destroying objects as well, in order to preserve the integrity of the ID pool.

OIDs are just a resource, and depending upon the scheme employed, typical resource management dangers can arise. There may be no way to determine if a given OID is valid or not. You must plan to handle exhausting your pool of IDs. Unless resources are managed carefully (or automatically, perhaps with some form of 'smart' references) you could leak OIDs (a handle leak) by not releasing expired IDs, or you could get 'dangling IDs' instead of dangling pointers, if references are not correctly updated.

Commercial Databases

One way to manage ID allocation as well as persistence is to use a commercial database of some sort. If you create a mapping between your objects and a RDBMS, the database can manage the allocation of unique ids that serve as primary keys for each table of objects, as well as the physical storage and retrieval of the objects' data, their integrity, backup, sharing, replication, etc. These days an OODBMS can be used, which will additionally save you mapping your objects to flat tables.

But where's the challenge in that?! ☺ And besides, for small projects, or where a database system or its associated distributable runtime are unavailable or unsuitable, this may not be an option (or at least not the first choice). So let's see if we can get some way towards a simple solution ourselves.

The Ideal Object ID

Let's consider OIDs again, because referencing objects is the crux of the persistence problem. An ideal OID should be unique between objects. If it were additionally unique between classes of objects then that might be a 'nice-to-have.' We need quick and efficient access to an object via its OID. We also require a large pool of OIDs to ensure that we never run out of them. Efficient reuse of OIDs is necessary so that OIDs that are no longer needed can be recycled effectively. Finally, they need to be invariant over serialisation.

Sounds like a tall order. But what about memory pointers? They fulfil all these requirements apart from the fact that they are not preserved during serialisation - i.e. an object's 'this' pointer is not the same when you reload it from disk.

Pointers?

In order to serialise a system to disk and then restore it again at a later date, all we need to do is to store the original pointer values to disk. These uniquely identify the original objects pointed to. The trick is that we then need to re-map the object associations when we load them back, in order to match pointers with the correct objects, and all will be well!

We can't, in general, specify the this pointer of an object when it is created, so we must re-map the pointers to each object - the source, not the target, end of each association.

John Merrells suggested mangling the pointer values as we serialise them (and unmangling them as we restore them) so they look less like memory pointers during debugging. It is easy to incorporate this.

Persistence Using Pointers - the Details

To save a system to file, we must save each object, complete with the mangled pointer values for all the other objects to which it points. In addition to this, every object that can be pointed to must save its current this pointer (similarly mangled).

As our objects come back in from file, we note the original mangled this pointer for each object from the last time it was in memory, and the new this pointer for the current incarnation of the object, and store this mapping in a look-up table. Then, for every pointer we read in from file, we simply read the map to find the new location of the referenced object. This also unmangles the pointers' values automatically.

Pointers to the Future

That might sound easy so far, but there's one more problem awaiting us. At the point that we read in an object, it is quite possible (quite likely in fact) that some of the other objects that it references have not yet been restored. We therefore do not yet know what their new memory locations are. An obvious example of this is a circular reference, in which one pointer is guaranteed to point to a future object. In general, however, you can get this problem without any explicit circular references. In these situations, careful scheduling of the streaming of the objects to disk can avoid all cases of references to future objects, but for complex systems it is often not trivial, or even possible, to determine such an order.

Figure 3. A complex system of interconnected objects, including cyclical navigable associations.

To solve this we need a two-stage procedure to mirror the two operations we are performing.

Load in the objects, and
Patch up the pointers.

There are a number of ways to achieve this.

We could have two explicit calls to each object in order to restore it from disk - one to load all its non-pointer data and map old to new this pointers, and the other (called after all objects have been constructed) to patch up pointer references. At the point of the second pass, all objects have been created, and so all pointers can be patched up.

Another way to do this would be to simply create all the objects in pass one, and in pass two serialise all the objects' attributes - pointer and non-pointer alike.

Queuing Association Requests

A less obtrusive procedure is to automatically patch up associations from outside of the objects. This allows a single serialisation pass for each object, which is more common in typical serialisation schemes. We can reload all the objects, complete with their original attributes and pointer values (mangled), and then have an external serialisation manager correct all the inter-object associations at the end,1 without the objects' knowledge. This is quite a neat solution, but does rely on none of the objects exercising any of their associations before the end of serialisation (which can usually be arranged).

Will it Fly?

So, does this work in practice? Well, I got it to work quite simply, and this is what I did.

The Object Map - Mapping '`this`' Pointers

First, we need some form of map for mapping the old to the new this pointer values. For this I created a class called object_map:

class object_map
{
public:
   void register_object(void *old_this,
                        void *new_this); 
   bool lookup(void *old_ptr,
               void *&new_ptr) const;
   void clear();

private:
   // ... (disable copying, etc.)
   typedef std::map<void *, void *> ptr_map;
   ptr_map  m_pointer_map;
};

This is a concrete class that just manages a pointer-to-pointer map. I'm mapping void * pointers here because all pointers can safely be cast to this type and back on any platform. The pointer values also have to be stored to disk anyway, and so if they have just come back from disk, a void * is as good a type as any in which to hold them.

As each object is loaded, its previous this pointer is read in and mapped to its current this pointer value using the register_object() method. At a later date, if another object wants to point to this one, it calls lookup() with its previous pointer value, and the latest pointer value is returned, if present.

The Request Queue - Handling Future Objects

The next thing we need to do is handle all those future object references. For this I created a class called request_queue to 'queue-up' all the pending map requests that could not be fulfilled in the first instance.

class request_queue
{
public:
   template <typename T>
   push_request(T *&pointer);

   void map_unmapped(object_map &om);
   void clear();

private:
   // ...
   std::stack<map_request *> m_requests;
};

This class manages a bunch^[1] of pending requests. Pointers that need mapping are added to the queue with the push_request() template method. At the end of serialisation, the map_unmapped() method is called, which processes all the queued requests, patching up all the pointers.

Map Requests for Different Pointer Types

So, how are all these pointers, all of different types, held in the request queue? I wanted to hold their types as well as their values (rather than casting them all to void *s) so that they can be patched up correctly. Remember that objects may hold base class pointers to objects of derived types, and we need to ensure that any necessary pointer conversions are handled correctly. To do this, I utilised the external polymorphism pattern. This pattern allows types that have no polymorphic relationship (like our pointers) to be treated polymorphically.

I created an interface that I called map_request. I needed to store a whole bunch of outstanding requests in the request_queue container, yet I wanted to preserve the type information for each one. In other words, I wanted to store a request to map a pointer to a wotsit in the same container that I store a request to map a pointer to a thingamy. If I imposed a common base class on all objects, then I would be home and dry, but in general, wotsits won't be related to thingamies by any normal inheritance relationship. The way the external polymorphism pattern works is by defining a base class or interface, and then deriving classes from this base class for each of the types you want to handle.

The map_request interface looks like this:

class map_request
{
public:
   virtual ~map_request() {}

   virtual void *old_this() const = 0;
   virtual void map_to(void *new_this) = 0;
};

Request objects (those derived from map_request) store the previous location of the pointer, along with its type, and allow the pointer value at this location to be queried and set. A template class handles pointers to all types we could ever want in one fell swoop:

template <typename T>
class map_request_type : public map_request
{
public:
   explicit map_request_type(T *&ptr)
      : m_ptr(ptr)   {}

   virtual void *old_this() const
      {return m_ptr;}
   virtual void map_to(void *new_this)
      {m_ptr = static_cast<T *>(new_this);}

private:
   T   *&m_ptr;
};

If you have an object of type map_request_type<T>, the old_this() method allows the current pointer value to be accessed for look-up purposes, and the map_to() method casts the void * pointer passed in to the correct type, in this case a pointer to type T. This is a safe cast because we know the static type of the pointer (i.e. at compile time).

The procedure for saving a pointer to file is thus

Cast the T * pointer to a void * pointer.
Store this void * pointer to disk.

To reload the pointer from file we do this:

Read the void * pointer from file.
Look up this pointer value in the map.
1. If present, cast the pointer value to a T * pointer and use that pointer value.
2. If the pointer is not present in the map, create a map_request_type object of the correct type, and add it to the queue.
At the end of serialisation, repeat step (ii) alone on all the pending pointers in the queue.

This process avoids a lot of unnecessary and dangerous casts in client code. It also ensures that if a base pointer is used to reference a derived object, the correct pointer conversions will be applied, and the pointer values adjusted accordingly.

Mangling Pointer Values

In order to mangle the pointer values as they are saved to disk, and unmangle them if required when they are restored, I created the pointer_mangler class:

class pointer_mangler
{
public:
   explicit pointer_mangler(
      unsigned long mangle_style = ~0UL)
     : m_style(mangle_style)   {}

   void *mangle(void *ptr) const;
   void *unmangle(void *ptr) const;

private:
   unsigned long m_style;
};

This class simply mangles void * pointers. The ctor has a parameter that allows you to specify how pointers are to be mangled. In principle this would allow different pointer types to be mangled in different ways, if required for debugging purposes, by using a range of different pointer_mangler objects. My implementation simply casts the pointers to unsigned longs, and xors the result with the style value. It is therefore symmetric, with mangle() doing the same thing as unmangle()^[2].

Note that this class is platform dependent, and assumes a void * pointer can be cast to an unsigned long and back without loss of information (which I imagine should be true in most cases).

The Persistence Manager

I brought the main classes together in the persistence_manager class:

class persistence_manager
{
public:
   void initialise();
   bool terminate();

public:
   void register_me(void *old_this,
                    void *new_this);

   template <typename T>
   void unpersist_pointer(T *&pointer);

private:
   // no copying, etc...
   object_map     m_object_map;
   request_queue  m_requests;
};

This class contains an object_map and a request_queue. The idea is that, at the start of serialisation you create an instance of this class. For each object that you read in, you register it using the register_me() method, passing its previous and current this pointers. Every pointer that you come across is deserialised using the unpersist_pointer() method. This template method looks up the pointer in the map, and adjusts its value as necessary. If it is not found in the map, it pushes a request onto the request_queue and then continues. At the end of serialisation, you call the terminate() method, which then processes the pending requests in the queue.

A Serialisation Example using `fstream`s

Finally, to combine this class with a particular serialisation method, as an example, I created the following persistence class. This class is responsible for containing the persistence_manager object, the persistence stream and a pointer_mangler object:

class persistence
{
public:
   explicit persistence(std::ofstream &os);
   explicit persistence(std::ifstream &is);
   ~persistence();

   // Example serialisation for ints
   // Could use insertion/extraction
   //  operator overloading if preferred
   void persist_int(int &val);
   // Add other common types too…

   // persist this pointer
   void persist_me(void *my_this);

   // persist inter-object association
   template <typename T>
   void persist_pointer(T *&pointer);

private:
   // Low-level void * serialisation
   // Handle pointer mangling, etc.
   void write_pointer(void *ptr);
   void *read_pointer();

private:
   persistence_manager  m_persist;
   pointer_mangler      m_mangler;
   std::ios             *m_pStream;
   bool                 m_saving;
};

I killed two birds with one stone here and made the class handle both saving and loading. This allows the serialisation methods of persistent objects to be simpler. A class diagram for this arrangement is shown in figure 4. The basic idea is as follows. When storing an object:

If the object can be pointed to, use the persist_me() method to store the current this pointer to stream as a mangled void *.
Store attributes to the stream as usual.
Store pointers to other objects using the persist_pointer() method, which stores pointers as mangled void *s.

When reloading the object, we use the following procedure:

If the object can be pointed to, call persist_me(). This now reads the previous this pointer value from file as a void *, and registers the object's current location in the object map.
Read attributes from the stream as usual.
Re-establish object associations using the persist_pointer() method. This looks up the object pointers in the object map, and lodges a request for that object if it has not yet been restored.
The dtor of the persistence object calls persistence_manager::terminate(), which processes all pending requests.

Figure 4. The example persistence system.

This will probably be clearer from an example. If we have an object to be serialised that looks like this:

class B
{
public:
   // ...
   void persist(persistence &p);

private:
   int  m_b;
   A    *m_pA;
};

We can write its persist() method (which handles both saving and loading) as follows:

void B::persist(persistence &p)
{
   p.persist_me(this);      (1)
   p.persist_int(m_b);      (2)
   p.persist_pointer(m_pA); (3)
}

Line 1 is necessary if this object can ever be pointed to by another persistent object. When saving, it stores the this pointer to file, and when loading it registers the object in the map. Line 2 saves and loads a simple attribute type. Line 3 stores a pointer to file when saving, and reloads and re-maps that pointer when loading. An example system of objects of types A and B can now be stored and restored as follows:

void data::save(ostream &os)
{
   persistence p(os);
   for (int i = 0; i < 5; ++i) {
      a[i].persist(p);
      b[i].persist(p);
   }
}

void data::load(istream &is)
{
   persistence p(is);
   for (int i = 0; i < 5; ++i) {
      a[i].persist(p);
      b[i].persist(p);
   }
}

where a[0..4] and b[0..4] are objects of type A and B respectively. It's as simple as that. Persistent classes A and B neither have to share a common base class nor contain any special data members in order to support serialisation. All they need is a single symmetrical persistence method. They simply hand temporary responsibility for their object associations over to the persistence object and concentrate on what they do best - being A's and B's!

Points to Note

If a particular class fails to register itself in the map, then pointers to those objects cannot be corrected at the end of the loading stage. If this happens, your persistence class may want to throw an exception, and to avoid this happening, you can safely register all objects even if they are never pointed to. It just produces a little more storage overhead if you do this.

Note that in order for a pointer to be found, you must ensure that the pointer is in the map. This may sound silly, but if you have complex inheritance hierarchies, you must ensure that different values for an object's this pointer are all stored, depending on the static type of references to the object. For simple inheritance they will usually be the same, but for multiple inheritance and especially when you have virtual base classes, you can find that the effective value of an object's this pointer depends on the type of the pointer used to access it. In general, and if in doubt, you can always re-register a derived object even if a parent object has already registered. If it is already in the map, it will just get replaced, and if the object address is genuinely different from the previous entry, then all your bases are covered.

Finally note that if you already have object IDs allocated to all the objects you can point to, then you can use the above framework in a slightly different arrangement. Instead of storing pointers to file you can store the OIDs. When you register an object, you can add the current this pointer to the object_map keyed on the OID instead of the previous this pointer. Then, as each OID comes in, you can look this up in the map instead, in order to determine the location of the referenced object. This allows fast pointer-based associations between objects, whilst utilising the extant OIDs to persist these associations.

Conclusion

The persistence framework outlined above is limited to systems that are loaded from and stored to file in one go. It should be useful for simple systems, however, and should cope with arbitrary inheritance hierarchies and arbitrary object association patterns in an unobtrusive manner. With a little care, it can also be used in systems using smart pointers for their object reference mechanism of choice.

^[1] If they could not be done at the time.

^[2] Although in the current examples unmangle() is not used, as a convenient side effect of the remapping process is to make unmangling unnecessary.

Notes:

More fields may be available via dynamicdata ..