    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: String Tokenization - A Programmer's Odyssey</title>
        <link>https://members.accu.org/index.php/articles/438</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + Overload Journal #44 - Aug 2001</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;String Tokenization - A Programmer's Odyssey</h1>
<p>
<strong>Date:</strong> Sun, 26 August 2001 17:46:07 +01:00</p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e18" id="d0e18"></a></h2>
</div>
<p>This is an article that I have been writing and rewriting over a
considerable period of time. While debating how to best present it,
I realised that it was as much an article about my development as a
C++ programmer, as about the tokenization of strings.</p>
<p>One of the common idioms required in programming is the
extraction of tokens from textual information. Over time I have
used various methods of tokenizing strings from use of strtok in C
(and C++) through to the tokenizer class presented in this article.
This article discusses the evolution of this class and how it
tracks my development and understanding of the C++ language and the
associated STL.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e24" id="d0e24"></a>Early C++
Years</h2>
</div>
<p>After first introductions to OO concepts in Object Pascal, my
first contact in developing a major application in C++ used an
early version of Visual C++. That application used the familiar and
comfortable strtok function to tokenize strings. The first step along
the path to the current form of the tokenizer class was the desire to
move away from such C hangovers and use something more
suitable for the brave new world of OO development. The known
issues with the C <tt class="function">strtok</tt>
function, such as its lack of re-entrancy, had an obvious impact on this desire.
This first incarnation used a simple iterator style interface
offering the following tokenization loop style.</p>
<pre class="programlisting">
for(Tokenizer iter(my_string); !iter.IsDone(); iter.Next()){
// do something with token
  cout &lt;&lt; iter.Token();
}
</pre>
<p>The class provided the following constructors:</p>
<pre class="programlisting">
Tokenizer(const CString&amp; string, CString separator = _T(&quot;,&quot;),BOOL removeSpaces = TRUE);
Tokenizer(const CString&amp; string, CString separator, TCHAR delimiter, 
                                  BOOL removeSpaces = TRUE);
</pre>
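<p>As a rough illustration of this style of interface, the sketch below
implements an IsDone/Next/Token loop over a <tt class=
"classname">std::string</tt>. It is not the original class: the name
<tt class="classname">SimpleTokenizer</tt>, the use of
<tt class="classname">std::string</tt> in place of CString, and the
decision to report empty tokens are all assumptions made for the
example.</p>

```cpp
#include <string>

// Hypothetical sketch of an IsDone/Next/Token style tokenizer.
// std::string stands in for MFC's CString; this is not the article's class.
class SimpleTokenizer {
public:
    explicit SimpleTokenizer(const std::string& text,
                             const std::string& separators = ",",
                             bool removeSpaces = true)
        : text_(text), separators_(separators),
          removeSpaces_(removeSpaces), pos_(0), done_(false) {
        Next();  // load the first token
    }

    bool IsDone() const { return done_; }
    const std::string& Token() const { return token_; }

    // Load the next token, or mark the sequence as finished.
    void Next() {
        if (pos_ > text_.size()) {  // the previous token ran to the end of the text
            done_ = true;
            return;
        }
        std::string::size_type end = text_.find_first_of(separators_, pos_);
        if (end == std::string::npos) end = text_.size();
        token_ = text_.substr(pos_, end - pos_);
        pos_ = end + 1;  // step past the separator (or past the end)
        if (removeSpaces_) Trim(token_);
    }

private:
    // Strip leading and trailing spaces in place.
    static void Trim(std::string& s) {
        std::string::size_type first = s.find_first_not_of(' ');
        if (first == std::string::npos) { s.clear(); return; }
        std::string::size_type last = s.find_last_not_of(' ');
        s = s.substr(first, last - first + 1);
    }

    std::string text_;
    std::string separators_;
    bool removeSpaces_;
    std::string::size_type pos_;
    bool done_;
    std::string token_;
};
```

<p>Because all of the state lives in the object, two such tokenizers can
be advanced independently, which is exactly what the hidden static
state inside strtok prevents.</p>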
<p>One of the defects in this first attempt was the lack of a
pointer style access interface, but this was easily rectified by
providing the appropriate <tt class="methodname">operator*</tt> and
<tt class="methodname">operator-&gt;</tt> methods. Methods for
<tt class="methodname">operator++</tt> (both prefix and postfix)
followed in quick succession to further refine the interface.
Further refinements to the class included the ability to handle set
tokenization (where each token set was separated by a different
separator to that which separated the tokens within a set, e.g.
"1, red, car: 2, yellow, lorry").</p>
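<p>Set tokenization amounts to two nested levels of splitting, which is
also the case where strtok's single hidden position is most awkward. A
minimal sketch, assuming one separator character per level and using
the hypothetical helpers <tt class="function">trim</tt>,
<tt class="function">split</tt> and <tt class=
"function">split_sets</tt> (none of which come from the original
class), might look like this:</p>

```cpp
#include <string>
#include <vector>

// Hypothetical helper: strip leading and trailing spaces.
std::string trim(const std::string& s) {
    std::string::size_type first = s.find_first_not_of(' ');
    if (first == std::string::npos) return std::string();
    std::string::size_type last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split text on one separator character,
// trimming each piece. Empty pieces are kept.
std::vector<std::string> split(const std::string& text, char sep) {
    std::vector<std::string> out;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type end = text.find(sep, start);
        if (end == std::string::npos) {
            out.push_back(trim(text.substr(start)));
            break;
        }
        out.push_back(trim(text.substr(start, end - start)));
        start = end + 1;
    }
    return out;
}

// Split into sets first, then split each set into its tokens.
std::vector<std::vector<std::string> > split_sets(const std::string& text,
                                                  char setSep, char tokenSep) {
    std::vector<std::vector<std::string> > sets;
    std::vector<std::string> raw = split(text, setSep);
    for (std::vector<std::string>::size_type i = 0; i < raw.size(); ++i) {
        sets.push_back(split(raw[i], tokenSep));
    }
    return sets;
}
```

<p>Calling <tt class="function">split_sets</tt> on the example string
with ':' as the set separator and ',' as the token separator gives two
sets of three tokens each.</p>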
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e49" id="d0e49"></a>STL
Intervention</h2>
</div>
<p>Striving to keep up with C++ developments, I was relieved when
I was finally able to start using a version of Visual
C++ that shipped with the STL. A major aim of mine is generating
code that is as
generic as possible; as a result I started using <tt class=
"classname">basic_string</tt> more and more in preference to the
MFC CString class. To support this I simply migrated the original
tokenizer class from using the MFC <tt class=
"classname">CString</tt> to <tt class=
"classname">basic_string</tt>.</p>
<p>As my understanding of the logic behind the STL and C++
templates improved (I had previously used generics in Ada so
already understood the basic concepts), I became convinced of the
benefits of ensuring that STL extension classes use a syntax similar to
that of the STL itself. This required two steps: firstly, converting the
class to a template so as to match the underlying basic_string
definition, and secondly, matching the STL iterator style interface.
After these changes a loop looked like this:</p>
<pre class="programlisting">
basic_tokenizer&lt;basic_string&lt;char&gt; &gt; tokenizer(my_string);
typedef basic_tokenizer&lt;basic_string&lt;char&gt; &gt;::iterator iterator;
for (iterator iter(tokenizer.begin()); iter != tokenizer.end(); ++iter){
// do something with token
  cout &lt;&lt; *iter;
}
</pre>
<p>The class constructors were declared as follows:</p>
<pre class="programlisting">
template &lt;class T&gt;
class basic_tokenizer{
  basic_tokenizer(const T&amp; string, T separator = _T(&quot;,&quot;),  bool removeSpaces = true);
  basic_tokenizer(const T&amp; string, T separator, TCHAR delimiter, 
                        bool removeSpaces = true);
    :
    :
</pre>
<p>Using templates had the immediate advantage that the class now
supported the use of any string type class that conformed to the STL
<tt class="classname">basic_string</tt> interface, such as the SGI
rope class.</p>
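<p>To make the benefit concrete, here is a minimal sketch of a
tokenizer templated on the string type, with a nested iterator
and <tt class="function">begin()</tt>/<tt class=
"function">end()</tt> members. The name <tt class=
"classname">token_range</tt> and every implementation detail are
assumptions for illustration; the only requirement placed on
<tt class="classname">T</tt> is a small part of the <tt class=
"classname">basic_string</tt> interface (<tt>size_type</tt>,
<tt>npos</tt>, <tt>find_first_of</tt> and <tt>substr</tt>).</p>

```cpp
#include <string>

// Hypothetical sketch, not the article's basic_tokenizer.
template <class T>
class token_range {
public:
    typedef typename T::size_type size_type;

    class iterator {
    public:
        iterator(const token_range* owner, size_type pos)
            : owner_(owner), pos_(pos), next_(T::npos) { load(); }

        const T& operator*() const { return token_; }
        iterator& operator++() { pos_ = next_; load(); return *this; }
        bool operator==(const iterator& rhs) const { return pos_ == rhs.pos_; }
        bool operator!=(const iterator& rhs) const { return pos_ != rhs.pos_; }

    private:
        // Extract the token starting at pos_ and remember where the next one starts.
        void load() {
            if (pos_ == T::npos) return;  // end position: nothing to load
            size_type end = owner_->text_.find_first_of(owner_->seps_, pos_);
            if (end == T::npos) {
                token_ = owner_->text_.substr(pos_);
                next_ = T::npos;
            } else {
                token_ = owner_->text_.substr(pos_, end - pos_);
                next_ = end + 1;
            }
        }

        const token_range* owner_;
        size_type pos_;
        size_type next_;
        T token_;
    };

    token_range(const T& text, const T& seps) : text_(text), seps_(seps) {}

    iterator begin() const { return iterator(this, 0); }
    iterator end() const { return iterator(this, T::npos); }

private:
    T text_;
    T seps_;
};
```

<p>The same template can be instantiated with <tt class=
"classname">std::string</tt> or <tt class=
"classname">std::wstring</tt> without change, which is the point the
paragraph above makes about string types that follow the
<tt class="classname">basic_string</tt> interface.</p>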
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e76" id="d0e76"></a>STL
Conformance</h2>
</div>
<p>One remaining issue with this interface gave me cause for
concern: as the tokenizer was effectively an iterator, it
should be possible to use it with the standard STL algorithm
functions such as <tt class="function">copy</tt>.</p>
<p>While considering the issues I happened across two articles
describing alternative implementations of tokenizer
classes<sup>[<a name="d0e86" href="#ftn.d0e86" id=
"d0e86">1</a>]</sup> and at this point I considered abandoning my
class and using one of these in preference. On reflection I
believed that these both had problems of their own with their STL
iterator syntax use, so I examined ways to incorporate the desired
improvements in the next iteration of my own class.</p>
<p>In summary the design goals for the next version of the class
were:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>Support any basic_string conformant type</p>
</li>
<li>
<p>Use only standard methods for implementation</p>
</li>
<li>
<p>Support STL iterator syntax</p>
</li>
<li>
<p>Support STL algorithms (only as input iterator)</p>
</li>
<li>
<p>Use functors for the tokenization function to allow for
replacement tokenization methods</p>
</li>
</ul>
</div>
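<p>The last goal in this list, a replaceable tokenization functor, can
be pictured with a small sketch. The names <tt class=
"classname">char_finder</tt>, <tt class=
"classname">space_finder</tt> and <tt class=
"function">split_with</tt> are hypothetical, and the interface chosen
here (a call operator that returns the position where the current token
ends) is an assumption rather than the interface used by the class in
this article.</p>

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical finder: a token ends at the next occurrence of one character.
struct char_finder {
    explicit char_finder(char sep) : sep_(sep) {}
    template <class Iter>
    Iter operator()(Iter first, Iter last) const {
        return std::find(first, last, sep_);
    }
    char sep_;
};

// Hypothetical replacement finder: a token ends at any whitespace character.
struct space_finder {
    template <class Iter>
    Iter operator()(Iter first, Iter last) const {
        for (; first != last; ++first) {
            if (std::isspace(static_cast<unsigned char>(*first))) break;
        }
        return first;
    }
};

// The splitting loop knows nothing about separators; it asks the
// finder where each token ends.
template <class Finder>
std::vector<std::string> split_with(const std::string& text, Finder find_end) {
    std::vector<std::string> tokens;
    std::string::const_iterator pos = text.begin();
    for (;;) {
        std::string::const_iterator end = find_end(pos, text.end());
        tokens.push_back(std::string(pos, end));
        if (end == text.end()) break;
        pos = end + 1;  // skip the single separator character
    }
    return tokens;
}
```

<p>Swapping <tt class="classname">char_finder</tt> for <tt class=
"classname">space_finder</tt> changes the tokenization rule without
touching the loop, which is the kind of substitution the design goal
describes.</p>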
<p>The first goal is achieved simply by ensuring that the template
definition does not use any prior knowledge of the string type, but
only requires that the template parameter conform to the
standard STL <tt class="classname">basic_string</tt> interface. The
class can then use the standard nested types of the supplied string
class (such as value_type) internally. Additionally the class requires the
provision of a token finder 'functor' class, which is used to
extract the tokens from the string being tokenized (the details
are beyond the scope of this article but can be found in
the source code).</p>
<pre class="programlisting">
template &lt;class T, class F = basic_finder&lt;typename T::value_type&gt; &gt;
class basic_tokenizer : public std::iterator&lt;std::forward_iterator_tag, T&gt;...
</pre>
<p>The class is derived from the base STL iterator class so that it
behaves much more like the STL iterators themselves. A
tokenization loop then looks as follows (<span class=
"emphasis"><em>note the tokenizer is designed to allow the test for
loop end to be made against the string end function
directly</em></span>).</p>
<pre class="programlisting">
typedef sfx::util::basic_tokeniser&lt;std::string&gt; tokeniser;
for (tokeniser tok = test1.begin(); tok != test1.end(); ++tok){
  cout &lt;&lt; *tok &lt;&lt; endl;
}
</pre>
<p>In supporting use of the tokenizer in STL algorithms it needs to
be understood that the tokenizer class provides an iterator
adapter. The tokenizer is then used to adapt both the begin and end
iterators from the underlying string class for use in an STL
algorithm, like this:</p>
<pre class="programlisting">
copy(tokenizer(my_str.begin()), tokenizer(my_str.end()), output_iterator);
</pre>
<p>Note that the tokenizer is specifically derived from a
<tt class="classname">const_iterator</tt> to ensure that it can
only be used in algorithms where input iterators are allowed, as
this is the only form that makes sense for a tokenization
sequence.</p>
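<p>For readers who want to experiment with the adapter idea, the
following is a minimal sketch of an input iterator over tokens that can
be handed to <tt class="function">copy</tt>. It is not the class
described above: the name <tt class="classname">token_iterator</tt> is
hypothetical, it is fixed to <tt class="classname">std::string</tt>
and a single separator character, and its begin form takes the string's
end iterator as a second argument, which is an assumption made to keep
the sketch short.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical input-iterator adapter; not the article's tokenizer.
class token_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::string value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::string* pointer;
    typedef const std::string& reference;

    // Begin form: adapt [first, last) and load the first token.
    token_iterator(std::string::const_iterator first,
                   std::string::const_iterator last, char sep = ',')
        : cur_(first), next_(first), last_(last), sep_(sep),
          exhausted_(false), at_end_(false) {
        advance();
    }

    // End form: adapt the string's end iterator into an end-of-tokens marker.
    explicit token_iterator(std::string::const_iterator last)
        : cur_(last), next_(last), last_(last), sep_(','),
          exhausted_(true), at_end_(true) {}

    reference operator*() const { return token_; }
    pointer operator->() const { return &token_; }
    token_iterator& operator++() { advance(); return *this; }
    token_iterator operator++(int) {
        token_iterator old(*this);
        advance();
        return old;
    }

    friend bool operator==(const token_iterator& a, const token_iterator& b) {
        return a.at_end_ == b.at_end_ && (a.at_end_ || a.cur_ == b.cur_);
    }
    friend bool operator!=(const token_iterator& a, const token_iterator& b) {
        return !(a == b);
    }

private:
    // Load the next token, or become the end marker once input is used up.
    void advance() {
        if (exhausted_) { at_end_ = true; return; }
        std::string::const_iterator stop = std::find(next_, last_, sep_);
        token_.assign(next_, stop);
        cur_ = next_;
        if (stop == last_) exhausted_ = true;
        else next_ = stop + 1;
    }

    std::string::const_iterator cur_;   // start of the current token
    std::string::const_iterator next_;  // start of the following token
    std::string::const_iterator last_;  // end of the underlying string
    char sep_;
    bool exhausted_;  // the current token is the last one
    bool at_end_;     // iterator has moved past the last token
    std::string token_;
};
```

<p>With this in place, a call such as
<tt>std::copy(token_iterator(s.begin(), s.end()),
token_iterator(s.end()), std::back_inserter(v))</tt> fills a
<tt class="classname">std::vector&lt;std::string&gt;</tt> named
<tt>v</tt> with the tokens of <tt>s</tt>. Declaring the category as
<tt>std::input_iterator_tag</tt> is what restricts the adapter to
single-pass, read-only algorithms.</p>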
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e133" id="d0e133"></a>Final
Words</h2>
</div>
<p>After completing this article I came across updates to
one of the tokenizer classes I had previously investigated, within
the C++ Boost community. I have not had the chance to study this in
detail yet but decided that the article was worthwhile even in
light of this development.</p>
<p>During the long gestation of this class I have learnt enormous
amounts about template concepts and their effective use, and how
the STL fits together and can be extended naturally. It is critical
when designing STL compatible extension classes to ensure that both
class syntax and use are familiar to STL users. If you have
comments on the design or other aspects of the class then please
drop me an email at the address below. Source code for the class
can (shortly) be found on my website <a href=
"http://www.wilsonsonline.org" target=
"_top">http://www.wilsonsonline.org</a>.</p>
</div>
<div class="footnotes"><br />
<hr class="c2" width="100" />
<div class="footnote">
<p><sup>[<a name="ftn.d0e86" href="#d0e86" id=
"ftn.d0e86">1</a>]</sup> A Generic Iterator for Strings, David
Lorde, C/C++ Users Journal April 1999. The Token Iterator, John R.
Bandela, <a href="http://www.codeproject.com/cpp/tokeniterator.asp"
target=
"_top">http://www.codeproject.com/cpp/tokeniterator.asp</a></p>
</div>
</div>
</p>
</div>
</channel>
</rss>
