Recently, a colleague approached me with an interesting problem; he had two containers with corresponding elements, so the n-th entry of container A was related to the n-th entry of container B, and he needed to sort these containers so the elements of A were "in order", without losing the correspondence property, as in the figure below:
There are several solutions to this, each of which has its merits and disadvantages, such as:
-
Create a third container holding either the numeric index, or some form of iterator or pointer into the containers. This container can then be sorted using a special comparison function that references the original containers. Code that processes the data can then either use the new container to index into the originals, or the original containers can be re-shuffled to match the order specified in the new container.
-
Copy the values into a container of pairs and sort that, possibly using the self-sorting property of the Standard Associative Containers, if appropriate
It was a third solution that really interested me - what my colleague conceptually had was a container of pairs of values, even if it was physically stored as a pair of containers of values; why then couldn't we treat the data as a container of pairs? This would allow sorting in place, and intuitive access to the data. The answer is: we can - just write an iterator adaptor that iterates through both containers simultaneously, and returns a pair when dereferenced; the pair of containers can thus be viewed as a sequence of pairs of values. The rest of this article covers the complexities hidden in that "just".
To cover the original problem (sorting), we need only worry about random access iterators, since std::sort requires random access. However, the problem had me hooked, and I wanted a general solution, with maximum flexibility, and all the complexities that involved.
For maximum flexibility, we want our adapted iterator to be as capable as possible, but no more - we cannot efficiently provide more facilities than the underlying iterators. Therefore, we must choose the most basic iterator category of the underlying iterators. Since the iterator category tags other than output_iterator_tag form an inheritance hierarchy, we can use the implicit conversions applied by the ternary conditional operator ?: to determine the most basic category - the return type of a conditional expression where the two result expressions are pointers is a pointer to the common base class of the pointed-to classes, so the type of the expression
false ? (std::forward_iterator_tag*)0 : (std::random_access_iterator_tag*)0
is std::forward_iterator_tag*, for example. We can then code to handle output iterators separately - if either of the iterators is an output iterator, the result is an output iterator, unless the other is an input iterator, in which case it is an error. This is all handled by the CommonCategory class template, shown in listing 1 - the categoryCheck functions and CategoryMap class templates are shown in listing 2.
Having chosen our iterator category, we need to implement the appropriate operations to fulfil the Standard Iterator Requirements from Section 24.1 of the C++ Standard [ISOCpp]. Most operations can easily be implemented by forwarding to the corresponding operations on the underlying iterators - the pre-increment operator can be implemented by incrementing the underlying iterators, for example. It is the dereference operator (operator*) and the choice of value_type which is complicated, as it depends on the iterator category. Input Iterators may return by value, so if either of the underlying iterators is an Input Iterator, we need to copy the result of dereferencing the underlying iterators. On the other hand, the only operation permitted on the result of dereferencing an Output Iterator is to assign to it, so we cannot store a copy - the dereference operator must return something which, when assigned to, assigns to the result of dereferencing the underlying iterators. Finally, for all other iterators, we need to return a reference that can be used to access and update the elements of the underlying sequences.
The ValueForCategory class template assists us with our choice - the PairIt iterator template just delegates to ValueForCategory, once the appropriate iterator_category has been determined, and this is specialized for Input Iterators and Output Iterators, leaving the primary template to handle the other cases.
Implementing the dereference operator for Input Iterators and Output Iterators is actually quite straightforward. For Input Iterators, the value_type can be a plain pair of values, the elements of which are the value_types of the underlying iterators, and the dereference operator can just copy the values from the underlying iterators into a pair held within the iterator, and return a reference to that pair.[1] For Output Iterators, the value_type is void, but the result of the dereference operator is something quite different - an OutputPair that contains references to the underlying iterators, and which dereferences and writes to the iterators when assigned to. OutputPairs cannot be copied, so our iterator's dereference operator should return a reference to an internal instance of OutputPair. The definition of OutputPair is shown in listing 3.
Supporting Forward Iterators, Bidirectional Iterators and Random Access Iterators is more complicated - the dereference operator must return a reference to the value_type, which must hold real references to the elements in the sequences covered by the original iterators. Just to add complexity, we really want the value_type to be Copy-Constructible and Assignable, and to copy the values of the elements, not the references, as users wouldn't expect modifying copies of the values to affect the originals; this implies that objects of the same class sometimes contain references to data held elsewhere, and sometimes hold the data directly. For this purpose, we define the OwningRefPair class, which has references for its public data members, and contains an internal buffer for the values - the references can either point to external data, in which case the buffer is empty, or they can point to the buffer, in which case the buffer contains instances of the appropriate objects. The objects are stored in an internal buffer, rather than on the heap, to avoid the overhead of dynamic memory allocation; however, this does require care to ensure that the objects are properly destructed, and to ensure that the buffer is correctly aligned.
For alignment, we use a union of an appropriately-sized array of char, and an instance of align_t. align_t is itself a union of all the fundamental types, and structs containing them. On most platforms, this will have the most rigorous alignment of any type, so (on most platforms)[2] the union of align_t and the array of char is guaranteed to be correctly aligned for any type that has a sizeof less than or equal to the size of the array.[3] The RawMem template union uses the sizeof the template parameter as the size of the char array.
We can then cast the address of the char array in our RawMem union to a pointer of the required type, and use it with placement new to construct an instance of the specified type. At the appropriate point, we can also manually invoke the destructor to clean up the object - i.e. in the destructor of our object, we check to see if the buffer contains an object or not, and invoke the destructor if it does, since this is a fixed property of the OwningRefPair object - either it contains the referred-to objects in its buffer, or it doesn't, this property doesn't change during its lifetime. In this case, the owned object is an OwnedPair; the details are shown in listing 5. Since our value_types are distinct, and have different construction syntax, we delegate the actual task of construction and destruction to the ValueForCategory template, to give a uniform interface to PairIt.
Having pinned down the value_type for our iterator, and what we get when we dereference it, we can put together a basic version of our PairIt, as in listing 6. This highlights a couple of issues. Firstly, we delegate all the type selection to the PairItHelper template, so we can inherit from an appropriate instance of std::iterator<> without having to specify all the types explicitly. Secondly, even though std::iterator<> defines all the required typedefs, we have to repeat them here, so we can use them within the class definition; the base class is a dependent name, so it isn't searched during resolution of unqualified names within the class. This begs the question of whether or not we need to inherit from std::iterator<> at all; some existing code expects all iterators that aren't raw pointers to inherit from std::iterator<>, and doing so causes no harm. We also can reuse the memory management from OwningRefPair, so we don't have to rely on any particular properties of the value_types of the underlying iterators, except this time we delegate the construction and destruction to PairItHelper as well. Note also that all the members are mutable - this is because we don't want to pass on any requirements that the encapsulated iterators be non-const for specific operations, and the cache needs to be updated in response to dereferencing the iterator, which is a const operation.
Now that our iterator supports the basic operation of dereferencing, we need to cover the remaining iterator requirements from the C++ Standard. For Input Iterators, the relevant section is 24.1.1, and table 72. This requires that in addition to dereferencing, we also require:
-
Copy-construction,
-
Assignment,
-
Equality and Inequality operators, and
-
Pre- and post-increment operators.
These operations are also sufficient for Output Iterators, as they cover all the requirements from section 24.1.2 and table 73.
In all cases, we can just defer to the underlying iterators, and perform the operations on them. However, there is a consequence for exception safety - since we know nothing about the effects of the operations on the underlying iterators, including whether or not they through exceptions, and whether or not the iterator types support a non-throwing swap operation, we have to add a disclaimer to the usage of our iterator - if an operation on a PairIt throws an exception, then the iterator is to be considered to have become invalid. If we don't add this disclaimer, then it is possible that the state of the iterator may become confused, as (for example) one of the underlying iterators may have advanced, and the other one not.
Another point to make is that copy-construction and assignment should only copy the iterators, not the cache. This is to avoid unnecessary copying of the cached data, which would only provide an additional source of exceptions for no gain - the cache must be regenerated every time the iterators are dereferenced anyway to support Input Iterators that automatically advance when read (which may not actually be allowed anyway). When dealing with non-Input Iterators, the cached value is only a couple of references, so this should add little performance penalty. The alternative is to have every function that modifies the underlying iterators call emptyCache(), and only call initCache() if the cache is not initialised.
For the cases where both the underlying iterators are at least Forward Iterators, we need to meet additional requirements, for PairIt to also work as a Forward Iterator. These are given by section 24.1.3 and table 74 of the Standard, and are actually mostly semantic constraints, rather than operational constraints, so these are implemented automatically if the underlying iterators are themselves Forward Iterators. The only additional operation required is that PairIt must be Default-constructible, which is trivially implemented.
If our underlying iterators are at least Bidirectional Iterators, we should also implement the pre- and post-decrement operators to maintain that level, as detailed in section 24.1.4 and table 75 of the Standard. As for pre- and post-increment, these can be implemented merely by forwarding to the underlying iterators:
PairIt& operator-() { -it1; -it2; return *this; }
If our underlying iterators are both Random Access Iterators, then we have a whole swathe of additional requirements to support, as detailed in section 24.1.5 and table 76 of the Standard. These are:
-
The arithmetic operators + and -,
-
The arithmetic assignment operators += and -=,
-
The comparison operators <, >, <= and >=, and
-
The subscripting operator [].
The arithmetic operators are trivial - just forward to the underlying iterators. The comparison operators require a bit more thought - what do we do if the first iterator is less than its partner, but the second isn't? - but the problems can be defined out of existence; iterators can only be compared if they are in the same range. This implies that one is reachable from the other. If the first and second underlying iterators don't give the same results when compared against their partners in the PairIt we are comparing against, then this can't be the case, and the issue can be avoided - in fact, we could get away with just comparing the first iterator in each pair, so the comparison operators are also trivial. The subscripting operator is easy, too - it[n] is just syntactic shorthand for *(it+n), so we can implement it that way.
However, a bit more thought reveals that doing things the "simple" way requires that all these operations are either member functions or friends of PairIt, yet some could be implemented in terms of others. For example, the idiomatic way of implementing + is to use += on a copy, as follows:
template<typename I1,typename I2> PairIt<I1,I2> operator+(PairIt<I1,I2> temp, std::ptrdiff_t n) { temp+=n; return temp; }
The same applies to the comparison operators - all the others can be implemented in terms of operator<. In fact, operator< itself can then be implemented in terms of subtraction, since these are Random Access Iterators, so the list of friends and member functions is now down to:
-
operator+=,
-
operator-=,
-
operator- where both operands are iterators, and
-
operator[], which is required to be a member function.
Sometimes we don't want to have to specify the precise type of our iterators explicitly, because we're creating a temporary object to pass to an algorithm, and doing so requires excessive typing, if it is possible at all. For this reason, we also provide a helper template function makePairIterator that takes two iterators as parameters, and returns a PairIt containing them. This simple function and its use is shown in listing 7.
Of course, as written, this will copy the elements of the containers to do the comparison, as the value_type of the pair iterators is a custom pair type, as described above, so a temporary std::pair has to be constructed. It is therefore more efficient to write functor class with a template function call operator:
struct MyPairComparisonFunctor { template<typename PairType> bool operator()(const PairType&, const PairType&) const; };
You could, of course, write a comparison function to take the precise custom pair type in question, but this varies depending on the iterators, so is not as straightforward as it may seem.
Implementing an iterator adapter to treat a pair of sequences as a sequence of pairs is not a trivial task, though some of the individual parts are; the biggest headache is deciding the value_type and the return type for the dereference operator.
However, it provides a genuinely useful service, and in combination with other iterator adapters and function objects, can be used to access data in intuitive ways, however it is stored.
[Alex] Andrei Alexandrescu. Generic<Programming>: Discriminated unions (II). C/C++ User's Journal, 20(6), June 2002. Available online at http://www.cuj.com/experts/2006/alexandr.htm.
Overload Journal #51 - Oct 2002 + Programming Topics
Browse in : |
All
> Journals
> Overload
> 51
(8)
All > Topics > Programming (877) Any of these categories - All of these categories |