Title: Mixing Strings in C++

Author:

Date: 03 August 2003 13:15:59 +01:00 or Sun, 03 August 2003 13:15:59 +01:00

Summary:

This article is not going into the realm of how characters are represented and conversions between character representations, instead the interaction between different string implementations are investigated.

Body:

Every (sane) programming language has some mechanism to handle text strings. Text strings are basically a sequence of characters. But this sequence can be implemented in different ways, which are not always interchangeable in a simple and efficient way.

Strings in C

There is only one string type in C. This makes life easy as there is no confusion in the choice of string type to use. C uses a linear sequence of characters terminated by a null character. This sequence is either kept in an array or some other memory pointed to by a pointer. An array of characters is easily converted to a pointer to characters and this makes it easy to pass around pointers to characters in function parameters and data structures. The pointers are so often used to represent strings that they often are referred to as strings and not pointers to strings.

Working with these C strings is not always easy, as the memory for the character sequence must be managed by the programmer. To append a string to another you must reserve a piece of memory to store the new string, then copy the original string and append the second string to this. Then you can use a pointer to this new memory as the result string. Just don't forget to free the memory of the original string to avoid memory leakage, but not if anyone else is still using a pointer to it! This handling gets even worse when you need to merge several strings, for example when adding a file name to a directory name:

char * dirname = ...;
char * filename = ...;
char * tmpdirname;

// Length of new string includes directory
// separator and null character
int newlen = strlen(dirname) +
             strlen(filename) + 2;
tmpdirname = (char *) malloc(newlen);
if (tmpdirname == NULL) {
  // Handle memory error.
}
strcpy(tmpdirname, dirname);
strcat(tmpdirname, "/");
strcat(tmpdirname, filename);
free(dirname);
dirname = tmpdirname;

Of course, most developers wrap such functionality in utility libraries. However, the amount of variations of semantics makes such libraries difficult to use. In this example, we append a filename to a directory name that involves a third string temporarily. Yes, a simple function to append a string to another could be used within this function but then there would be two calls to that function and thus two memory allocations. So two functions are needed, one for simple string append and one for file name append. For every special semantic variation needed, the number of functions increases.

What if the directory name referred to by dirname cannot be freed? It might be a pointer to a string literal or a character sequence stored in an array on the stack and cannot be freed. Or the string is still used somewhere else in the program. We need to have two variations each of the functions in the library, one that frees the original string and one that leaves it intact. Sometimes the variants that free memory are left out and the user of the functions has to remember to free the memory.

Strings in C++

C++ provides the capability to create a string class that manages the underlying memory and makes strings easier to use. Standard C++ [ ISO-IEC] introduced a string class in the library to encapsulate the semantics of strings. Appending a file name to a directory name using the C++ string type is much simpler:

std::string dirname = ...;
std::string filename = ...;

dirname = dirname + "/" + filename;

Note that the error handling is done with exceptions here. This may look like cheating. But exceptions are intended to let code avoid error handling in the cases where nothing can be done other than pass the error on to a higher level.

So using a string class makes life easier for the programmer. Sadly, this is not the only way strings are used in C++. C++ has inherited much from C, including its string concept. Many libraries use C strings (pointers) for function parameters and data structures so both C and C++ programs can use them. One such library many C++ programmers use is the C library. C strings are also used in some parts of the C++ library where exceptions cannot be thrown.

Conversions between String Types

C++ strings are very convenient to use. C strings are relatively easy to use if ownership of the memory is managed properly. However, mixing them can cause a few headaches.

C strings from C++ strings

When you have a C++ string and want to pass it to a function that takes a C string, you can easily get a const char* using the c_str() member of std::string.

// A function that promises not to modify
// the string.
extern void f(const char *);
std::string myString = ...;

f(myString.c_str());

However, the pointer you get points to some data that is managed by the C++ string object. This data will only be valid as long as the string object is not modified in any way. If f() only uses the pointer in its body to do some processing or copying then everything will be fine. But, if f() stores the pointer in some data structure that lives on after f() has returned there is no guarantee that that pointer points to valid data when it is looked at again. So keep your tongue nicely in your mouth and all will be fine. Right?

Well, there are some more situations you have to watch out for. What if myString is modified by another thread while f() is still executing? The behaviour will differ depending on your library implementation and how the string is modified. In some situations, you may get away with this because the memory for where the C string is stored is not overwritten immediately. You may end up with a debugging nightmare instead where your string looks OK for a while and then suddenly changes contents or gets corrupt.

You are not safe in a single-threaded system either. What if f() calls some other function that modifies the same string object that was used in the call to f()?

C++ strings from C strings

Going from C strings to C++ strings is not dangerous but there are still a few things to keep in mind for efficiency.

The std::string class has a convenient conversion from C strings that can be used for implicit conversions from pointer to characters.

// A function that promises not to modify
// the string.
void g(std::string const &);
char* myString = ...;

g(myString);

The call is clean here, no trickery is required to call g(). However, you must be aware that a temporary std::string object is created implicitly. This object keeps a copy of the character sequence allocated on the heap and there are invisible calls to the constructor and the destructor of std::string. In some projects, this cost cannot be afforded.

A simple way to avoid this temporary object is to provide an overloaded g(const char*) at the cost of a second function body.

Custom String Classes

To make things worse, there are several custom string class implementations around. Some predate the C++ standard and some add various features needed in their projects. These non-standard string classes usually work in a similar way to std::string but have some minor additions or omissions that make them difficult to replace with std::string.

My most recent project uses two custom string types. The first is a home-made string dating from the days before the C++ standard that also adds a hash value for the string (called xstring). This hash value was used in various tables to find the string quickly.

The other custom string type is a convenience class, a wrapper for std::string (called U_String). It is derived from std::string and thus behaves exactly like it, but can be forward declared without any #include. This is very useful for header files that only declare string parameters passed by const-reference. (Where the type is only used by name.) The header file <string> includes <iostream> in some libraries, which can be very heavy for the compiler, especially when precompiled headers cannot be used. Compile times were reduced significantly when the U_String class was introduced because the project consisted of a large number of small translation units.

Older parts of the project use the xstring class while newer parts use U_String. Converting between them is trivial to code but requires creation of a temporary object. Usage of the xstring class is slowly being replaced by U_String where the hash value isn't used. The intent is to remove the xstring class completely. The need for the hash value will be replaced by a flyweight string (see below).

String Semantics

The C++ string is a powerful tool. It does a very good job to represent a string. However, there are some problems with it for power-users.

In most library implementations, each string object keeps a copy of the character sequence in a piece of memory it owns. This may be expensive if the same string is copied many times.

Some implementations have solved this by sharing exact copies between copied string objects by keeping a reference count for the memory that contains the character sequence. The memory is shared until a string object is modified. It then gets a bit of memory for itself. This is called COW-strings. (Copy-On-Write) This technique reduces the memory consumption drastically in some situations. It also cuts down the time for allocating new memory on the heap and copying the character sequence. However, this technique may cause problems in threaded systems and it is not used in most modern library implementations.

My recent project is a C++ parser that tokenises its input. Each token contains a file name, line and column numbers. Of course, there are many tokens from the same file. The project uses two different compilers with different library implementations. The memory consumption differed dramatically between the two implementations. Naturally, we wanted the best performance with both implementations.

Read-only strings

The first approach was to create a read-only string class. File names used in tokens are not likely to change. So a reference counting technique could be used. We had seen the benefit of this in one of the library implementations. If we removed all modification methods of the string class, we would eliminate the logic required to keep the read-only strings consistent. To make the implementation simple we just created a class that contained a reference counted pointer to std::string and added the methods we required.

One benefit of this read-only class we saw immediately was that when two filenames were compared, we could compare the pointers first and if they were equal, there was no need to compare each character in the strings.

Flyweight (read-only) strings

The thought of using pointer comparisons for string comparisons was very appealing. However the read-only strings could be created from different sources and thus there would exist equivalent strings that did not have equal pointers.

The next step was to use the Flyweight Pattern, see [Gamma-et-al]. Each file name was now represented by a handle. File name strings were created by a factory to guarantee that equivalent strings used the same handle.

The Flyweight string class used policy classes that specify how the strings are stored and a usage domain. The domain was introduced so that different kinds of strings couldn't be compared. Separating strings in domains also keeps the numbers of strings in each domain down, therefore the time for inserting the string is lower. We had one domain for filenames, another for identifiers etc. The compiler would complain if we tried to compare filenames and identifiers.

The storage parameter specifies the internal data structure for the strings and what type the handle is. The initial implementation used a set of strings for storage and the handles were iterators into the set.

Conversions Again

Now we suddenly have a project with six different types of strings. (More if each specialisation of the specialised flyweight strings are counted.) Care had to be taken when there was conversion between them. However, it turned out that having separate types for each kind of string was more helpful than confusing. Having separate types gave the compiler a chance to help us in finding inconsistent usages of string types. There wasn't that much conversion between the new string types and ordinary strings so the small cost was acceptable.

Conclusion

Strings are used for many purposes. Although a string is a simple concept, there are many ways to use strings, and many types of strings tailored for these usages. You may have many different string types in your project depending on how the strings are used and on the history of your project. If your project uses strings heavily, you may optimise your code by looking at how different string types are used and interact with each other and decide on which type of string is best suited for each usage.

References

[ISO-IEC] International Standard: Programming Languages - C++ ISO/IEC 14882:1998(E), 1998.

[Gamma-et-al] [2] Gamma et al. Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.

Notes:

More fields may be available via dynamicdata ..