Title: File Format Conversion Using Templates and Type Collections

Date: Mon, 02 December 2002 21:57:50 +00:00

A recent project involved upgrading some files from an old format to a new one (and possibly back again), as a result of changes to the data types being stored. Several possible implementations were considered. The final solution made use of template methods, and type-collection classes, and supported forward and backward file format conversion with no code duplication and minimal overhead.

Requirements

Many years ago, one of our projects was converted from a 16-bit application running on Windows 3.11 to a 32-bit one running on Win32. Most of the code was ported at the time, but some changes were not made because they would have required a change to the file formats. In particular, 16-bit identifier values were being stored in a file. Changing the file format was seen as too much of an upheaval (especially at a time when so many other changes were being made). And besides, 16 bits should be enough for anyone...

Time passed. Suddenly, 16 bits were no longer enough for everyone. The file format needed to be upgraded. Discussions were had, and the following requirements emerged:

The old version of the software would not be required to read the new file format (i.e. no forwards compatibility - see [Blundell00]).
The new version of the software was required to use the new format (obviously) but only had to recognise the old format, and prompt the user to upgrade (i.e. limited backwards compatibility - see [Blundell00]).
An upgrade utility would convert from the old format to the new format. A 'downgrade' facility would be a 'nice-to-have' (just in case users were running both software versions on site and upgraded the wrong site by mistake) but was not a necessity.
The interfaces of the data classes should be changed as little as possible.
Any solution should support future changes (we don't want to have to re-implement everything when it comes to 64-bits).

Initial suggestions

Support for the new format in the software, and for both formats in the upgrade utility, required old and new versions of the persistence code for the data types involved, as well as some form of user-interface for the upgrade utility, and logic for converting the files as a whole. Suggestions were put forward for tackling the serialisation issues:

1. Copy the old serialisation source code to the upgrade tool project, modify the original code to use the new format so the application can read and write the new files, and include this modified code in the upgrade utility as well. The upgrade utility would therefore have code for both the old and new formats, and the application would have only the new code.
2. Append methods supporting the new formats to all the affected data classes. The application would use the new format only, and the upgrade utility would use both.
3. Modify the serialisation methods to handle both formats, determining which one to use with some form of flag or version number.

Drawbacks

The first suggestion set warning bells ringing left, right and centre. Every time I have ever copied code around it has come back to haunt me. When the same, or similar, code is in two places you have twice as much code to manage. Changes need to be made in two places instead of one, which is highly error-prone. Furthermore, people inevitably forget about one or other of the copies, and so it gets out of date, it doesn't get built properly, documentation stagnates, and it causes endless confusion to new team members when they stumble across it. Re-use good; copy-and-paste bad!

However, criticisms were levelled at the second suggestion too. The application would need to cart around both old and new serialisation code, despite only ever using the new code. Small classes would find the majority of their source code comprising multiple persistence methods. Changes and fixes would still need to be made to both versions. Even if they sit right next to each other in the source file it is easy to miss one when editing the code through a tiny keyhole source code window^[1].

Finally, the third suggestion leads to spaghetti serialisation code, with huge conditional blocks based on ever more complicated version dependencies. In later versions you have a mess of if blocks checking for file formats that have not been supported for years [Blundell00]. As with the previous suggestion, lean classes become fat with persistence methods.

Types, typedefs and templates

In our project we were making no changes other than the types and sizes of various data values. Instead of a version flag, why not parameterise the persistence methods on the relevant types? This way we can support a whole raft of file formats using different types all with the same code. Simple wrapper methods can then be written to forward to the parameterised method with the correct types.

As a rather trivial example, consider the code for a class that stores an array of id values (see [Blundell99]).

  // id_array.h
  class id_array {
    ...
    short m_size; // should be plenty...
    short *m_ids; // should be wide enough
  };

  // id_array.cpp
  void id_array::extract(out_file &f) const {
    f << m_size; // raw write, 16-bits
    for (short i = 0; i != m_size; ++i)
      f << m_ids[i];
  }

  void id_array::build(in_file &f) {
    short size;
    f >> size; // raw read of 16-bits
    resize(size);
    for (short i = 0; i != size; ++i)
      f >> m_ids[i];
  }

Parameterising these methods requires very little change to the code. The two methods, now named extractT() and buildT(), are each prefixed with a template declaration naming the required type T, and this type is then used inside the methods. One point worth noting here is that the type must be used in any overloaded function calls rather than the data members from the class itself. Writing f << m_size; will output m_size as the type defined in the class itself, rather than the required type T. Hence you must write T size = m_size; f << size; instead. Easy to overlook, that one (he says from experience :-)^[2].
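
As a minimal sketch of the single-type version described above (not the original code), the following compiles on its own. The in-memory out_file and in_file, the std::vector storage, the fixed-width integer types and the ids() accessor are all assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for out_file and in_file: an in-memory byte
// buffer with raw reads and writes of exactly sizeof(V) bytes.
struct out_file { std::vector<unsigned char> bytes; };
struct in_file  { std::vector<unsigned char> bytes; std::size_t pos = 0; };

template <typename V>
out_file &operator<<(out_file &f, const V &v) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(&v);
  f.bytes.insert(f.bytes.end(), p, p + sizeof(V));
  return f;
}

template <typename V>
in_file &operator>>(in_file &f, V &v) {
  assert(f.pos + sizeof(V) <= f.bytes.size());
  std::memcpy(&v, f.bytes.data() + f.pos, sizeof(V));
  f.pos += sizeof(V);
  return f;
}

class id_array {
public:
  id_array() = default;
  explicit id_array(const std::vector<std::int32_t> &ids)
    : m_size(static_cast<std::int32_t>(ids.size())), m_ids(ids) {}
  const std::vector<std::int32_t> &ids() const { return m_ids; }

  template <typename T>
  void extractT(out_file &f) const {
    // Convert to T first. Writing f << m_size here would emit
    // sizeof(m_size) bytes whatever T happens to be.
    T size = static_cast<T>(m_size);
    f << size;
    for (std::int32_t i = 0; i != m_size; ++i) {
      T value = static_cast<T>(m_ids[i]);
      f << value;
    }
  }

  template <typename T>
  void buildT(in_file &f) {
    T size;
    f >> size;  // reads sizeof(T) bytes
    resize(static_cast<std::size_t>(size));
    for (T i = 0; i != size; ++i) {
      T value;
      f >> value;
      m_ids[i] = value;
    }
  }

private:
  void resize(std::size_t n) {
    m_ids.resize(n);
    m_size = static_cast<std::int32_t>(n);
  }
  std::int32_t m_size = 0;  // in-memory width, independent of T
  std::vector<std::int32_t> m_ids;
};
```

Calling extractT<std::int16_t>() on a three-element array writes a 2-byte count followed by three 2-byte ids, whereas extractT<std::int32_t>() writes 4-byte fields throughout. The sketch does no range checking when narrowing.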

Explosion of types

It soon becomes clear that, strictly, we should have parameterised the methods on both the capacity type and the contained type, because these are not necessarily the same. Thus, our methods are now parameterised on two types:

  template <typename Count, typename T>
  void id_array::extractT(out_file &f) const {
    Count size = m_size;
    f << size;
    for (Count i = 0; i != size; ++i) {
      T value = m_ids[i];
      f << value;
    }
  }

  template <typename Count, typename T>
  void id_array::buildT(in_file &f) {
    Count size;
    f >> size;
    resize(size);
    for (Count i = 0; i != size; ++i) {
      T value;
      f >> value;
      m_ids[i] = value;
    }
  }

More complicated data structures may have even more types, and when you have many such low-level data types you can end up with a huge number of types and a huge number of different parameters to each method. It gets nasty very quickly.

Classes of types

What we really want is to be able to say, "My old file format used types t1, t2, ..., tn, whereas in my new format I use types T1, T2, ..., Tn." It would be nice to be able to group these relevant types together so you can just say "new format" or "old format" rather than "short, unsigned short, int and short" to one method and something else to another. Enter the class as a method of naming things as a group:

  // format_types.h
  class old_types {
  public:
    typedef short count_t;
    typedef short my_id_t;
    ... // lots more follow, if nec.
  };
  class new_types {
  public:
    typedef size_t count_t;
    typedef int my_id_t;
    ... // lots more...
  };

Now, rather than passing in as many parameters as each class requires, persistence methods can be parameterised solely on a single format type. These methods then pull out whatever named types they require from the file format 'types class':

  template <typename Format>
  void id_array::extractT(out_file &f) const {
    // typename is required: count_t and my_id_t depend on Format
    typename Format::count_t size = m_size;
    f << size;
    for (typename Format::count_t i = 0;
         i != size; ++i) {
      typename Format::my_id_t value = m_ids[i];
      f << value;
    }
  }

  template <typename Format>
  void id_array::buildT(in_file &f) {
    typename Format::count_t size;
    f >> size;
    resize(size);
    for (typename Format::count_t i = 0;
         i != size; ++i) {
      typename Format::my_id_t value;
      f >> value;
      m_ids[i] = value;
    }
  }
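
To see what a types class buys in practice, here is a small hedged sketch that writes the same ids under two formats and compares the resulting sizes. It uses a free function, extract_ids(), instead of a member of id_array, and fixed-width typedefs in place of short, int and size_t so that the byte counts are predictable. Those choices, together with the in-memory out_file and the encoded_size() helper, are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for out_file: raw writes into a byte buffer.
struct out_file { std::vector<unsigned char> bytes; };

template <typename V>
out_file &operator<<(out_file &f, const V &v) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(&v);
  f.bytes.insert(f.bytes.end(), p, p + sizeof(V));
  return f;
}

// One types class per file format (widths assumed for this sketch).
struct old_types {
  typedef std::int16_t count_t;
  typedef std::int16_t my_id_t;
};
struct new_types {
  typedef std::uint32_t count_t;
  typedef std::int32_t my_id_t;
};

// A single algorithm, parameterised on the format as a whole.
template <typename Format>
void extract_ids(out_file &f, const std::vector<std::int32_t> &ids) {
  // typename tells the compiler these dependent names are types.
  typedef typename Format::count_t count_t;
  typedef typename Format::my_id_t my_id_t;
  count_t size = static_cast<count_t>(ids.size());
  f << size;
  for (count_t i = 0; i != size; ++i) {
    my_id_t value = static_cast<my_id_t>(ids[i]);
    f << value;
  }
}

// Bytes that extract_ids<Format>() produces for n ids.
template <typename Format>
std::size_t encoded_size(std::size_t n) {
  return sizeof(typename Format::count_t) +
         n * sizeof(typename Format::my_id_t);
}
```

With four ids, extract_ids<old_types>() produces 10 bytes and extract_ids<new_types>() produces 20, yet both come from the same function body.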

Forwarding functions

We did not want to alter the interfaces of the data classes more than necessary. In particular, we wanted persistence from our main application to work exactly as before. To achieve this we created one more typedef for the types currently in use:

  // format_types.h
  // current_types points to new_types
  // now (not old_types)
  typedef new_types current_types;
  ...

and wrote forwarding functions to call the buildT() and extractT() template methods with the correct types:

  // id_array.h
  class id_array {
  public:
    // these are the original method names
    void extract(out_file &f) const; 
    void build(in_file &f);
    // these are new forwarding methods
    void extract_old(out_file &f) const;
    void extract_new(out_file &f) const;
    void build_old(in_file &f);
    void build_new(in_file &f);

  private:
    // These are the implementations
    template<typename Format>
    void extractT(out_file &f) const;
    template<typename Format>
    void buildT(in_file &f);
  };

We then implemented these forwarding methods:

  void id_array::extract(out_file &f) const {
    extractT<current_types>(f);
  }

  void id_array::build(in_file &f) {
    buildT<current_types>(f);
  }

  void id_array::extract_old(out_file &f) const {
    extractT<old_types>(f);
  }
  ... // etc.

These are all just one-liners, making it trivial to implement and maintain.
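
With the forwarding methods in place, the core of an upgrade (or downgrade) step reduces to reading with one format and writing with another. The sketch below is one way this could look, not the original utility. The upgrade() and downgrade() functions, the byte_buffer typedef, the in-memory file types, the public ids member and the fixed-width typedefs are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

typedef std::vector<unsigned char> byte_buffer;

// Hypothetical in-memory stand-ins for out_file and in_file.
struct out_file { byte_buffer bytes; };
struct in_file  { byte_buffer bytes; std::size_t pos = 0; };

template <typename V>
out_file &operator<<(out_file &f, const V &v) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(&v);
  f.bytes.insert(f.bytes.end(), p, p + sizeof(V));
  return f;
}
template <typename V>
in_file &operator>>(in_file &f, V &v) {
  assert(f.pos + sizeof(V) <= f.bytes.size());
  std::memcpy(&v, f.bytes.data() + f.pos, sizeof(V));
  f.pos += sizeof(V);
  return f;
}

struct old_types { typedef std::int16_t count_t; typedef std::int16_t my_id_t; };
struct new_types { typedef std::uint32_t count_t; typedef std::int32_t my_id_t; };
typedef new_types current_types;

class id_array {
public:
  std::vector<std::int32_t> ids;  // public only to keep the sketch short

  // Original interface: always the current format.
  void extract(out_file &f) const { extractT<current_types>(f); }
  void build(in_file &f) { buildT<current_types>(f); }
  // Format-specific forwarding methods for the upgrade utility.
  void extract_old(out_file &f) const { extractT<old_types>(f); }
  void extract_new(out_file &f) const { extractT<new_types>(f); }
  void build_old(in_file &f) { buildT<old_types>(f); }
  void build_new(in_file &f) { buildT<new_types>(f); }

private:
  template <typename Format>
  void extractT(out_file &f) const {
    typename Format::count_t size =
      static_cast<typename Format::count_t>(ids.size());
    f << size;
    for (std::size_t i = 0; i != ids.size(); ++i) {
      // Narrowing is unchecked here; a real downgrade would need a range test.
      typename Format::my_id_t value =
        static_cast<typename Format::my_id_t>(ids[i]);
      f << value;
    }
  }
  template <typename Format>
  void buildT(in_file &f) {
    typename Format::count_t size;
    f >> size;
    ids.resize(static_cast<std::size_t>(size));
    for (std::size_t i = 0; i != ids.size(); ++i) {
      typename Format::my_id_t value;
      f >> value;
      ids[i] = value;
    }
  }
};

// Read in the old format, write in the new one.
byte_buffer upgrade(const byte_buffer &old_bytes) {
  in_file in;
  in.bytes = old_bytes;
  id_array a;
  a.build_old(in);
  out_file out;
  a.extract_new(out);
  return out.bytes;
}

// The 'nice-to-have' reverse direction uses the same two algorithms.
byte_buffer downgrade(const byte_buffer &new_bytes) {
  in_file in;
  in.bytes = new_bytes;
  id_array a;
  a.build_new(in);
  out_file out;
  a.extract_old(out);
  return out.bytes;
}
```

Neither conversion function contains any format-specific logic of its own, so both directions come from the same pair of template methods.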

New formats

If a new format is required in the future (64-bits, etc.) supporting it is simple:

1. Add code to the unit test class to check that the new format works OK.
2. Add a new types class, really_new_types, containing the relevant typedefs.
3. Add one-line forwarding methods to each class to pass this types class in.
4. Update current_types to point to the new types class, really_new_types.
5. Build and check that your unit tests pass, to ensure the single persistence methods are sufficiently generalised to support the new types.
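
The steps above can be sketched roughly as follows. The 64-bit widths chosen for really_new_types, the free functions extract_ids(), build_ids() and extract_current(), the id_list typedef and the round_trips() check are all assumptions for illustration. The in-memory id type is widened to 64 bits here so that it can hold values from any of the formats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory stand-ins for out_file and in_file.
struct out_file { std::vector<unsigned char> bytes; };
struct in_file  { std::vector<unsigned char> bytes; std::size_t pos = 0; };

template <typename V>
out_file &operator<<(out_file &f, const V &v) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(&v);
  f.bytes.insert(f.bytes.end(), p, p + sizeof(V));
  return f;
}
template <typename V>
in_file &operator>>(in_file &f, V &v) {
  assert(f.pos + sizeof(V) <= f.bytes.size());
  std::memcpy(&v, f.bytes.data() + f.pos, sizeof(V));
  f.pos += sizeof(V);
  return f;
}

struct old_types { typedef std::int16_t count_t; typedef std::int16_t my_id_t; };
struct new_types { typedef std::uint32_t count_t; typedef std::int32_t my_id_t; };

// Step 2: a new types class (the widths are assumed).
struct really_new_types {
  typedef std::uint64_t count_t;
  typedef std::int64_t my_id_t;
};

// Step 4: the current format now refers to the newest types class.
typedef really_new_types current_types;

typedef std::vector<std::int64_t> id_list;  // wide enough for every format

template <typename Format>
void extract_ids(out_file &f, const id_list &ids) {
  typename Format::count_t size =
    static_cast<typename Format::count_t>(ids.size());
  f << size;
  for (std::size_t i = 0; i != ids.size(); ++i) {
    typename Format::my_id_t value =
      static_cast<typename Format::my_id_t>(ids[i]);
    f << value;
  }
}

template <typename Format>
id_list build_ids(in_file &f) {
  typename Format::count_t size;
  f >> size;
  id_list ids(static_cast<std::size_t>(size));
  for (std::size_t i = 0; i != ids.size(); ++i) {
    typename Format::my_id_t value;
    f >> value;
    ids[i] = value;
  }
  return ids;
}

// Step 3: a one-line forwarding function for the current format.
inline void extract_current(out_file &f, const id_list &ids) {
  extract_ids<current_types>(f, ids);
}

// Steps 1 and 5: one round-trip check, reusable for every format.
template <typename Format>
bool round_trips(const id_list &ids) {
  out_file out;
  extract_ids<Format>(out, ids);
  const std::size_t expected = sizeof(typename Format::count_t) +
                               ids.size() * sizeof(typename Format::my_id_t);
  in_file in;
  in.bytes = out.bytes;
  return out.bytes.size() == expected && build_ids<Format>(in) == ids;
}
```

Because the check is itself parameterised on the format, supporting another format in the tests costs one extra call.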

If you want, you can omit step 3 and expose public templated serialisation methods. That way, clients can use any file format they choose by calling the method with the correct types class. We did not do this, (a) to control access to the different formats more closely, and (b) because our compiler, Visual C++ 7 (the latest .NET version), requires template methods to be implemented inline, which we did not want to do. Some of our persistence methods were quite involved. Implementing them in the header files could have introduced additional compilation dependencies from extra #include directives being required.

Our workaround involved declaring a private friend helper class at the top of each data class:

  // id_array.h
  class id_array {
    class persister;
    friend class id_array::persister;
  public:
  ...
  };

Class persister then simply had two methods, namely the two template persistence methods moved from the main class:

  // id_array.cpp
  class id_array::persister {
  public:
    template<typename Format>
    static void extractT(const id_array &a,
                         out_file &f) {
      ... // inline because of VC++7
    }
    template<typename Format>
    static void buildT(id_array &a,
                       in_file &f) {
      ... // inline because of VC++7
    }
  };

The use of this private helper class allowed us to move the inline implementations of these template methods out of the header file. Making it a nested class avoided name clashes because we were not polluting the scope of our data classes with additional names (and therefore each class could use the same nested class name, persister). The forwarding methods within each data class could now simply forward to the static methods of class persister, passing in a reference to themselves:

  // id_array.cpp
  void id_array::extract(out_file &f) const {
    persister::extractT<current_types>(*this, f);
  }
  ... // etc.

Alas, we were not quite in the clear yet. Another weakness of VC++7 is that it does not support explicit specification of template parameters for template methods/functions. We had to work around this one as well by passing in a dummy object to each method and letting the compiler sort out which function to call:

  // id_array.cpp
  class id_array::persister {
  public:
    template<typename Format>
    static void extractT(const Format &,
                         const id_array &a,
                         out_file &f) {
      ...
    }
    ...
  };
  ...

  void id_array::extract(out_file &f) const {
    id_array::persister::extractT(current_types(), *this, f);
  }
  ...
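
Putting the nested helper class and the dummy parameter together, a complete, compilable version might look like the sketch below. Selecting an implementation through the type of an otherwise unused argument is often called tag dispatch. The in-memory file types, the fixed-width typedefs, the std::vector storage, the constructor and the ids() accessor are assumptions added so that the example stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory stand-ins for out_file and in_file.
struct out_file { std::vector<unsigned char> bytes; };
struct in_file  { std::vector<unsigned char> bytes; std::size_t pos = 0; };

template <typename V>
out_file &operator<<(out_file &f, const V &v) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(&v);
  f.bytes.insert(f.bytes.end(), p, p + sizeof(V));
  return f;
}
template <typename V>
in_file &operator>>(in_file &f, V &v) {
  assert(f.pos + sizeof(V) <= f.bytes.size());
  std::memcpy(&v, f.bytes.data() + f.pos, sizeof(V));
  f.pos += sizeof(V);
  return f;
}

struct old_types { typedef std::int16_t count_t; typedef std::int16_t my_id_t; };
struct new_types { typedef std::uint32_t count_t; typedef std::int32_t my_id_t; };
typedef new_types current_types;

// --- what would live in id_array.h ---
class id_array {
  class persister;  // private nested helper, defined in the .cpp file
  // Since C++11 a nested class can already access the enclosing class's
  // private members, so this declaration only matters on older compilers.
  friend class id_array::persister;
public:
  id_array() {}
  explicit id_array(const std::vector<std::int32_t> &ids) : m_ids(ids) {}
  const std::vector<std::int32_t> &ids() const { return m_ids; }
  void extract(out_file &f) const;
  void extract_old(out_file &f) const;
  void build(in_file &f);
  void build_old(in_file &f);
private:
  std::vector<std::int32_t> m_ids;
};

// --- what would live in id_array.cpp ---
class id_array::persister {
public:
  // The unnamed first parameter exists only so Format can be deduced.
  template <typename Format>
  static void extractT(const Format &, const id_array &a, out_file &f) {
    typename Format::count_t size =
      static_cast<typename Format::count_t>(a.m_ids.size());
    f << size;
    for (std::size_t i = 0; i != a.m_ids.size(); ++i) {
      typename Format::my_id_t value =
        static_cast<typename Format::my_id_t>(a.m_ids[i]);
      f << value;
    }
  }
  template <typename Format>
  static void buildT(const Format &, id_array &a, in_file &f) {
    typename Format::count_t size;
    f >> size;
    a.m_ids.resize(static_cast<std::size_t>(size));
    for (std::size_t i = 0; i != a.m_ids.size(); ++i) {
      typename Format::my_id_t value;
      f >> value;
      a.m_ids[i] = value;
    }
  }
};

void id_array::extract(out_file &f) const {
  persister::extractT(current_types(), *this, f);
}
void id_array::extract_old(out_file &f) const {
  persister::extractT(old_types(), *this, f);
}
void id_array::build(in_file &f) {
  persister::buildT(current_types(), *this, f);
}
void id_array::build_old(in_file &f) {
  persister::buildT(old_types(), *this, f);
}
```

The types classes are empty apart from their typedefs, so constructing a temporary such as old_types() for each call should cost little or nothing.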

Conclusion

Classes were used as a scope to package up the whole set of types, used when serialising to a given file format, into a 'types class'. A typedef was provided to allow current_types always to refer to the primary types class, and hence the current file format. Template serialisation methods were used to localise a single serialisation algorithm for each class in a single place to aid implementation and maintenance. One-line (non-template) forwarding methods were used to provide an easy interface to the current, old, and new file formats. And finally the use of a private nested friend class and dummy template function parameters allowed us to work around various weaknesses in the Microsoft C++ compiler and to move our templated persistence methods out of the header files.

None of these choices were rocket science, but the end result was a seamless implementation of multi-format persistence with very little overhead, either overall (just the format classes were needed) or in each of the persisted classes.

References

[Blundell99] Blundell, R.P., "A Simple Model for Object Persistence Using the Standard Library," Overload 32, June 1999

[Blundell00] Blundell, R.P., "Automatic Object Versioning for Forward and Backward File Format Compatibility," Overload 35, January 2000

^[1] which is all the space you seem to be left with, these days, in between the project windows, watch windows, output windows, toolbars, palette windows, etc., of the modern IDE.

^[2] But fortunately one that is easy to spot when the automated unit tests, which of course you wrote first, fall over.
