    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Mixing Strings in C++</title>
        <link>https://members.accu.org/index.php/articles/1233</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + CVu Journal Vol 15, #4 - Aug 2003</span></div>

<table border="0" cellpadding="1" cellspacing="0">
    <tbody>
    <tr>
        <td valign="top">
            Browse in :
       </td>
       <td valign="top">

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c13/">Topics</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c65/">Programming</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c76/">Journals</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c77/">CVu</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c107/">154</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/c65-107/">Any of these categories</a>

                    -                        <a href="https://members.accu.org/index.php/articles/c65+107/">All of these categories</a>
<br />
</td>
   </tr>
   </tbody>
</table>




<div class="xar-error">
   <p>
 <strong>Note:</strong> when you create a new publication type,
the articles module will automatically use the templates
<em>user-display-[publicationtype].xt</em>
and <em>user-summary-[publicationtype].xt</em>.
If those templates do not exist when you try to preview or display a new article,
you'll get this warning :-)  Please place your own templates in themes/<em>yourtheme</em>/modules/articles . The templates will get the extension .xt there. </p>
</div>
<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Mixing Strings in C++</h1>
<p><strong>Author:</strong>&nbsp;</p>
<p>
<strong>Date:</strong> 03 August 2003 13:15:59 +01:00 or Sun, 03 August 2003 13:15:59 +01:00</p>
<p><strong>Summary:</strong>&nbsp;<p>This article is not going into the realm of how characters are represented and conversions between character representations, instead the interaction between different string implementations are investigated.</p></p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e20" id="d0e20"></a></h2>
</div>
<p>Every (sane) programming language has some mechanism to handle
text strings. Text strings are basically a sequence of characters.
But this sequence can be implemented in different ways, which are
not always interchangeable in a simple and efficient way.</p>
<p>This article is not going into the realm of how characters are
represented and conversions between character representations,
instead the interaction between different string implementations
are investigated.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e26" id="d0e26"></a>Strings in
C</h2>
</div>
<p>There is only one string type in C. This makes life easy as
there is no confusion in the choice of string type to use. C uses a
linear sequence of characters terminated by a null character. This
sequence is either kept in an array or some other memory pointed to
by a pointer. An array of characters is easily converted to a
pointer to characters and this makes it easy to pass around
pointers to characters in function parameters and data structures.
The pointers are so often used to represent strings that they often
are referred to as strings and not pointers to strings.</p>
<p>Working with these C strings is not always easy, as the memory
for the character sequence must be managed by the programmer. To
append a string to another you must reserve a piece of memory to
store the new string, then copy the original string and append the
second string to this. Then you can use a pointer to this new
memory as the result string. Just don't forget to free the memory
of the original string to avoid memory leakage, but not if anyone
else is still using a pointer to it! This handling gets even worse
when you need to merge several strings, for example when adding a
file name to a directory name:</p>
<pre class="programlisting">
char * dirname = ...;
char * filename = ...;
char * tmpdirname;

// Length of new string includes directory
// separator and null character
int newlen = strlen(dirname) +
             strlen(filename) + 2;
tmpdirname = (char *) malloc(newlen);
if (tmpdirname == NULL) {
  // Handle memory error.
}
strcpy(tmpdirname, dirname);
strcat(tmpdirname, &quot;/&quot;);
strcat(tmpdirname, filename);
free(dirname);
dirname = tmpdirname;
</pre>
<p>Of course, most developers wrap such functionality in utility
libraries. However, the amount of variations of semantics makes
such libraries difficult to use. In this example, we append a
filename to a directory name that involves a third string
temporarily. Yes, a simple function to append a string to another
could be used within this function but then there would be two
calls to that function and thus two memory allocations. So two
functions are needed, one for simple string append and one for file
name append. For every special semantic variation needed, the
number of functions increases.</p>
<p>What if the directory name referred to by <tt class=
"varname">dirname</tt> cannot be freed? It might be a pointer to a
string literal or a character sequence stored in an array on the
stack and cannot be freed. Or the string is still used somewhere
else in the program. We need to have two variations each of the
functions in the library, one that frees the original string and
one that leaves it intact. Sometimes the variants that free memory
are left out and the user of the functions has to remember to free
the memory.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e42" id="d0e42"></a>Strings in
C++</h2>
</div>
<p>C++ provides the capability to create a string class that
manages the underlying memory and makes strings easier to use.
Standard C++ [ <a href="#ISO-IEC">ISO-IEC</a>] introduced a string
class in the library to encapsulate the semantics of strings.
Appending a file name to a directory name using the C++ string type
is much simpler:</p>
<pre class="programlisting">
std::string dirname = ...;
std::string filename = ...;

dirname = dirname + &quot;/&quot; + filename;
</pre>
<p>Note that the error handling is done with exceptions here. This
may look like cheating. But exceptions are intended to let code
avoid error handling in the cases where nothing can be done other
than pass the error on to a higher level.</p>
<p>So using a string class makes life easier for the programmer.
Sadly, this is not the only way strings are used in C++. C++ has
inherited much from C, including its string concept. Many libraries
use C strings (pointers) for function parameters and data
structures so both C and C++ programs can use them. One such
library many C++ programmers use is the C library. C strings are
also used in some parts of the C++ library where exceptions cannot
be thrown.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e56" id="d0e56"></a>Conversions
between String Types</h2>
</div>
<p>C++ strings are very convenient to use. C strings are relatively
easy to use if ownership of the memory is managed properly.
However, mixing them can cause a few headaches.</p>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e61" id="d0e61"></a>C strings from C++
strings</h3>
</div>
<p>When you have a C++ string and want to pass it to a function
that takes a C string, you can easily get a <span class=
"type">const char*</span> using the <tt class=
"methodname">c_str()</tt> member of <tt class=
"classname">std::string</tt>.</p>
<pre class="programlisting">
// A function that promises not to modify
// the string.
extern void f(const char *);
std::string myString = ...;

f(myString.c_str());
</pre>
<p>However, the pointer you get points to some data that is managed
by the C++ string object. This data will only be valid as long as
the string object is not modified in any way. If <tt class=
"function">f()</tt> only uses the pointer in its body to do some
processing or copying then everything will be fine. But, if
<tt class="function">f()</tt> stores the pointer in some data
structure that lives on after <tt class="function">f()</tt> has
returned there is no guarantee that that pointer points to valid
data when it is looked at again. So keep your tongue nicely in your
mouth and all will be fine. Right?</p>
<p>Well, there are some more situations you have to watch out for.
What if <tt class="varname">myString</tt> is modified by another
thread while <tt class="function">f()</tt> is still executing? The
behaviour will differ depending on your library implementation and
how the string is modified. In some situations, you may get away
with this because the memory for where the C string is stored is
not overwritten immediately. You may end up with a debugging
nightmare instead where your string looks OK for a while and then
suddenly changes contents or gets corrupt.</p>
<p>You are not safe in a single-threaded system either. What if
<tt class="function">f()</tt> calls some other function that
modifies the same string object that was used in the call to
<tt class="function">f()</tt>?</p>
</div>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e104" id="d0e104"></a>C++ strings from
C strings</h3>
</div>
<p>Going from C strings to C++ strings is not dangerous but there
are still a few things to keep in mind for efficiency.</p>
<p>The <tt class="classname">std::string</tt> class has a
convenient conversion from C strings that can be used for implicit
conversions from pointer to characters.</p>
<pre class="programlisting">
// A function that promises not to modify
// the string.
void g(std::string const &amp;);
char* myString = ...;

g(myString);
</pre>
<p>The call is clean here, no trickery is required to call
<tt class="function">g()</tt>. However, you must be aware that a
temporary <tt class="classname">std::string</tt> object is created
implicitly. This object keeps a copy of the character sequence
allocated on the heap and there are invisible calls to the
constructor and the destructor of <tt class=
"classname">std::string</tt>. In some projects, this cost cannot be
afforded.</p>
<p>A simple way to avoid this temporary object is to provide an
overloaded <tt class="methodname">g(const char*)</tt> at the cost
of a second function body.</p>
</div>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e132" id="d0e132"></a>Custom String
Classes</h2>
</div>
<p>To make things worse, there are several custom string class
implementations around. Some predate the C++ standard and some add
various features needed in their projects. These non-standard
string classes usually work in a similar way to std::string but
have some minor additions or omissions that make them difficult to
replace with <tt class="classname">std::string</tt>.</p>
<p>My most recent project uses two custom string types. The first
is a home-made string dating from the days before the C++ standard
that also adds a hash value for the string (called <tt class=
"classname">xstring</tt>). This hash value was used in various
tables to find the string quickly.</p>
<p>The other custom string type is a convenience class, a wrapper
for <tt class="classname">std::string</tt> (called <tt class=
"classname">U_String</tt>). It is derived from <tt class=
"classname">std::string</tt> and thus behaves exactly like it, but
can be forward declared without any <tt class=
"literal">#include</tt>. This is very useful for header files that
only declare string parameters passed by const-reference. (Where
the type is only used by name.) The header file <tt class=
"filename">&lt;string&gt;</tt> includes <tt class=
"filename">&lt;iostream&gt;</tt> in some libraries, which can be
very heavy for the compiler, especially when precompiled headers
cannot be used. Compile times were reduced significantly when the
<tt class="classname">U_String</tt> class was introduced because
the project consisted of a large number of small translation
units.</p>
<p>Older parts of the project use the <tt class=
"classname">xstring</tt> class while newer parts use <tt class=
"classname">U_String</tt>. Converting between them is trivial to
code but requires creation of a temporary object. Usage of the
<tt class="classname">xstring</tt> class is slowly being replaced
by <tt class="classname">U_String</tt> where the hash value isn't
used. The intent is to remove the <tt class=
"classname">xstring</tt> class completely. The need for the hash
value will be replaced by a flyweight string (see below).</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e185" id="d0e185"></a>String
Semantics</h2>
</div>
<p>The C++ string is a powerful tool. It does a very good job to
represent a string. However, there are some problems with it for
power-users.</p>
<p>In most library implementations, each string object keeps a copy
of the character sequence in a piece of memory it owns. This may be
expensive if the same string is copied many times.</p>
<p>Some implementations have solved this by sharing exact copies
between copied string objects by keeping a reference count for the
memory that contains the character sequence. The memory is shared
until a string object is modified. It then gets a bit of memory for
itself. This is called COW-strings. (Copy-On-Write) This technique
reduces the memory consumption drastically in some situations. It
also cuts down the time for allocating new memory on the heap and
copying the character sequence. However, this technique may cause
problems in threaded systems and it is not used in most modern
library implementations.</p>
<p>My recent project is a C++ parser that tokenises its input. Each
token contains a file name, line and column numbers. Of course,
there are many tokens from the same file. The project uses two
different compilers with different library implementations. The
memory consumption differed dramatically between the two
implementations. Naturally, we wanted the best performance with
both implementations.</p>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e196" id="d0e196"></a>Read-only
strings</h3>
</div>
<p>The first approach was to create a read-only string class. File
names used in tokens are not likely to change. So a reference
counting technique could be used. We had seen the benefit of this
in one of the library implementations. If we removed all
modification methods of the string class, we would eliminate the
logic required to keep the read-only strings consistent. To make
the implementation simple we just created a class that contained a
reference counted pointer to std::string and added the methods we
required.</p>
<p>One benefit of this read-only class we saw immediately was that
when two filenames were compared, we could compare the pointers
first and if they were equal, there was no need to compare each
character in the strings.</p>
</div>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e203" id="d0e203"></a>Flyweight
(read-only) strings</h3>
</div>
<p>The thought of using pointer comparisons for string comparisons
was very appealing. However the read-only strings could be created
from different sources and thus there would exist equivalent
strings that did not have equal pointers.</p>
<p>The next step was to use the Flyweight Pattern, see [<a href=
"#Gamma-et-al">Gamma-et-al</a>]. Each file name was now represented
by a handle. File name strings were created by a factory to
guarantee that equivalent strings used the same handle.</p>
<p>The Flyweight string class used policy classes that specify how
the strings are stored and a usage domain. The domain was
introduced so that different kinds of strings couldn't be compared.
Separating strings in domains also keeps the numbers of strings in
each domain down, therefore the time for inserting the string is
lower. We had one domain for filenames, another for identifiers
etc. The compiler would complain if we tried to compare filenames
and identifiers.</p>
<p>The storage parameter specifies the internal data structure for
the strings and what type the handle is. The initial implementation
used a set of strings for storage and the handles were iterators
into the set.</p>
</div>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e217" id="d0e217"></a>Conversions
Again</h3>
</div>
<p>Now we suddenly have a project with six different types of
strings. (More if each specialisation of the specialised flyweight
strings are counted.) Care had to be taken when there was
conversion between them. However, it turned out that having
separate types for each kind of string was more helpful than
confusing. Having separate types gave the compiler a chance to help
us in finding inconsistent usages of string types. There wasn't
that much conversion between the new string types and ordinary
strings so the small cost was acceptable.</p>
</div>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e222" id=
"d0e222"></a>Conclusion</h2>
</div>
<p>Strings are used for many purposes. Although a string is a
simple concept, there are many ways to use strings, and many types
of strings tailored for these usages. You may have many different
string types in your project depending on how the strings are used
and on the history of your project. If your project uses strings
heavily, you may optimise your code by looking at how different
string types are used and interact with each other and decide on
which type of string is best suited for each usage.</p>
</div>
<div class="bibliography">
<div class="titlepage">
<h2><a name="d0e227" id="d0e227"></a>References</h2>
</div>
<div class="bibliomixed"><a name="ISO-IEC" id="ISO-IEC"></a>
<p class="bibliomixed">[ISO-IEC] International Standard:
Programming Languages - C++ ISO/IEC 14882:1998(E), 1998.</p>
</div>
<div class="bibliomixed"><a name="Gamma-et-al" id=
"Gamma-et-al"></a>
<p class="bibliomixed">[Gamma-et-al] [2] Gamma et al. Design
Patterns: Elements of Reusable Object-Oriented Software,
Addison-Wesley, 1995.</p>
</div>
</div>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
<p><em>More fields may be available via dynamicdata ..</em></p>
</div>
</channel>
</rss>
