    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Mini-project to Decode A Mini-language - Part Two</title>
        <link>https://members.accu.org/index.php/articles/251</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + Overload Journal #64 - Dec 2004</span></div>

<table border="0" cellpadding="1" cellspacing="0">
    <tbody>
    <tr>
        <td valign="top">
            Browse in :
       </td>
       <td valign="top">

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c13/">Topics</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c65/">Programming</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c76/">Journals</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c78/">Overload</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c148/">64</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/c65-148/">Any of these categories</a>

                    -                        <a href="https://members.accu.org/index.php/articles/c65+148/">All of these categories</a>
<br />
</td>
   </tr>
   </tbody>
</table>




<div class="xar-error">
   <p>
 <strong>Note:</strong> when you create a new publication type,
the articles module will automatically use the templates
<em>user-display-[publicationtype].xt</em>
and <em>user-summary-[publicationtype].xt</em>.
If those templates do not exist when you try to preview or display a new article,
you'll get this warning :-)  Please place your own templates in themes/<em>yourtheme</em>/modules/articles . The templates will get the extension .xt there. </p>
</div>
<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Mini-project to Decode A Mini-language - Part Two</h1>
<p><strong>Author:</strong>&nbsp;</p>
<p>
<strong>Date:</strong> 12 December 2004 15:57:10 +00:00 or Sun, 12 December 2004 15:57:10 +00:00</p>
<p><strong>Summary:</strong>&nbsp;</p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e18" id="d0e18"></a></h2>
</div>
<p>Part 1 of this two-part article [<a href="#Guest">Guest</a>]
described the preliminary stages of a mini-project to write a codec
for a mini-language, delivering:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>a rough specification of the codec,</p>
</li>
<li>
<p>a suite of test data,</p>
</li>
<li>
<p>some prototype code,</p>
</li>
<li>
<p>three implementation strategies.</p>
</li>
</ul>
</div>
<p>Part 2 continues the project and presents the final
implementation.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e40" id="d0e40"></a>Motivation</h2>
</div>
<p>Part 1 of this article drew inspiration from &quot;The Art of UNIX
Programming&quot;, by Eric Raymond [<a href="#Raymond2">Raymond2</a>].
Part 2 continues to draw from this same source, which applies as
readily to implementation as it did to design.</p>
<p>At this point, I can reveal a second motivating source, &quot;The
Tale of a Struggling Template Programmer&quot;, Stefan Heinzmann
[<a href="#Heinzmann">Heinzmann</a>], which served to remind me how
frustrating software development can be: sometimes the tools are to
blame, sometimes the languages appear faulty, and sometimes the
poor programmer takes a wrong turn. More personally, it reminded me
that I ought to experiment with modern C++<sup>[<a name="d0e53"
href="#ftn.d0e53" id="d0e53">1</a>]</sup>.</p>
<p>Anyone familiar with both sources will appreciate there's a
degree of tension between them. In what follows, I document my
attempts to resolve this tension. Along the way, we shall revisit
the world of MPEG video encoding and get started with the Boost
Spirit library.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e59" id="d0e59"></a>Project
Recap</h2>
</div>
<p>To briefly recap, then, our goal is to write a tool to convert
the binary format used in MPEG-2 digital video broadcasting into a
textual form and back again - to write a <tt class=
"literal">dvbcodec</tt>. For example, we would like to convert a
section of the Program Association Table (PAT), whose syntax is as
follows:</p>
<pre class="programlisting">
program_association_section() {
  table_id                   8
  section_syntax_indicator   1
  '0'                        1
  reserved                   2
  section_length            12
  transport_stream_id       16
  reserved                   2
  version_number             5
  current_next_indicator     1
  section_number             8
  last_section_number        8
  for(i=0; i&lt;N; i++) {
    program_number          16
    reserved                 3
    if(program_number == '0') {
      network_PID           13
    }
    else {
      program_map_PID       13
    }
  }
  CRC_32                    32
}
</pre>
<p><span class="bold"><b>[ISO/IEC 13818-1] Table 2-26 - Program
association section</b></span></p>
<p>The numerical values here represent field widths in bits: the
first byte of the section encodes the <tt class=
"varname">table_id</tt>, the next bit the <tt class=
"varname">section_syntax_indicator</tt>, and so on until the final
four bytes which encode the cyclic redundancy check.</p>
<p>The PAT is just one of the tables we would like to decode. There
are many others, the next three most important being the the
Conditional Access Table, the Program Map Table and the Network
Information Table (CAT, PMT and NIT).</p>
<p>The textual output format we decided on should reflect the
syntax description as follows:</p>
<pre class="programlisting">
program_association_section() {
  table_id                   8 = 0x0
  section_syntax_indicator   1 = 0x1
  '0'                        1 = 0x0
  ...
  CRC_32             32 = 0xcae52d9f
}
</pre>
<p>We came up with three possible implementation strategies for our
dvbcodec:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>Implement a pat-codec. Then implement a cat-codec, then a
pmt-codec, etc.</p>
</li>
<li>
<p>Implement a general codec which understands the full bitstream
syntax and can use it to parse an arbitrary section format. All
that then remains is to prime this codec with the required section
formats.</p>
</li>
<li>
<p>Devise a code generator which, given a section format, will
generate a program to encode/decode that particular format.</p>
</li>
</ul>
</div>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e98" id="d0e98"></a>Towards a
Solution</h2>
</div>
<p>The first strategy holds little appeal: it risks being a recipe
for cut-and-paste code and boring repetition. We reject it.</p>
<p>The second and third strategies look to have more going from
them, particularly since we have restricted our scope to a subset
of the full bitstream syntax. Although these strategies appear
rather different, they both require us to parse syntax descriptions
of the general form exemplified by the <tt class=
"varname">program_association_section</tt>.</p>
<p>So, we need a parser. We need one capable of handling
conditionals and loops: one capable, that is, of handling a
Turing-complete mini-language. Raymond [<a href=
"#Raymond2">Raymond2</a>] can advise. In general terms, he
suggests:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>Where possible, reuse. Look for a proven, documented, supported,
portable, parser. (He argues these criteria pretty much imply an
open source solution.)</p>
</li>
<li>
<p>Prefer scripting languages such as Python and Perl. These
facilitate rapid development and are less prone to memory
management bugs. You may not need the raw performance offered by
C/C++, and the library support offered by these languages is often
superior.</p>
</li>
</ul>
</div>
<p>On the more specific subject of parsers, Raymond recommends
<tt class="literal">lex</tt> and <tt class="literal">yacc</tt>
though, in keeping with the Unix philosophy of documenting
weaknesses, he admits these tools are not perfect. He also
suggests:</p>
<div class="blockquote">
<blockquote class="blockquote">
<p>&quot;If you can implement your parser in a higher-level language
than C (which we recommend you do ...) then look for equivalent
facilities like Python's PLY ...&quot;</p>
</blockquote>
</div>
<p>I tend to agree with Raymond, but I'm not convinced PLY is the
way to go here. Of course, it won't get me very far with my aim of
finding out about modern C++, but it's also not part of the
standard Python distribution. In fact, a web search reveals several
other Python parser frameworks - it's unclear which will
prevail.</p>
<p>The C++ standard library doesn't provide a parser either. We
might make some progress tokenising our data with <tt class=
"function">std::strtok</tt> or even <tt class=
"function">std::sscanf</tt>, but this won't suffice. lex and yacc
are of course a tried and tested solution, but I'd rather not have
to learn two more mini-languages.</p>
<p>The next place to look is in the next best thing to the C++
standard library, namely Boost [<a href="#Boost">Boost</a>]. Three
clicks from the front page takes us to the Spirit parser, which
claims to be a scalable parser framework written in C++. We trust
the source, the documentation is good, the examples compile: let's
try some code.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e146" id="d0e146"></a>Getting
Started With Spirit</h2>
</div>
<p>The code below is a complete program to recognise lines of the
form:</p>
<pre class="programlisting">
reserved 2 = 0x3
</pre>
<p>this being the format we arrived at for fields of our text
sections.</p>
<pre class="programlisting">
#include &lt;boost/spirit/core.hpp&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;
using namespace boost;
/**
 * @brief Parse a string representing a field 
 * @returns True if the field matches the
 * format: &lt;field_name&gt; &lt;bitwidth&gt; = &lt;value&gt;,
 * false otherwise.
 */
bool parseField(std::string const &amp; str) {
  return spirit::parse(
      str.begin(), 
      str.end(),
      spirit::lexeme_d[+spirit::graph_p]
      &gt;&gt; spirit::uint_p
      &gt;&gt; '='
      &gt;&gt; spirit::hex_p,
      spirit::space_p).full;
}
int main() {
  std::cout &lt;&lt; &quot;Enter text.\n&quot;
        &lt;&lt; &quot;Lines will be matched against: \n&quot;
        &lt;&lt; &quot;&lt;field_name&gt; &lt;bitwidth&gt; = &quot;
        &lt;&lt; &lt;hexvalue&gt;\n&quot;
        &lt;&lt; &quot;Type 'q' to quit\n&quot;;
  std::string str;
  std::string const quit(&quot;q&quot;);
  while(std::getline(std::cin, str) &amp;&amp;
        str != quit) {
    std::cout &lt;&lt; (parseField(str)
                          ? &quot;hit&quot; : &quot;miss&quot;)
              &lt;&lt; std::endl;
  }
  return 0;
}
</pre>
<p>Here, the action is concentrated in the function <tt class=
"function">parseField()</tt>, which wraps a call to <tt class=
"methodname">spirit::parse()</tt>. <tt class=
"methodname">spirit::parse()</tt> accepts as arguments:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>two iterators marking the start and end of the data to be
parsed,</p>
</li>
<li>
<p>a parser,</p>
</li>
<li>
<p>a skip parser.</p>
</li>
</ul>
</div>
<p>We have used <tt class="constant">spirit::space_p</tt> directly
as our skip parser: this primitive parser recognizes whitespace and
tells <tt class="methodname">spirit::parse()</tt> which characters
it should skip past in the input. A more sophisticated skip parser
might be used to skip comments.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e186" id="d0e186"></a>Operator
Overloading</h2>
</div>
<p>The parser itself is a sequence of sub-parsers which can be
read: recognise input consisting of a block one or more printable
characters, followed by an unsigned integer, followed by the equals
sign, followed by a hexadecimal integer.</p>
<p>Operator overloading is used by Spirit to make such expressions
into readable approximations of EBNF syntax descriptions (see also
[<a href="#Alexandrescu">Alexandrescu</a>] for more on this
technique). Here, we see that <tt class=
"methodname">operator&lt;&lt;()</tt> has been overloaded as a
sequencing operator, prefix <tt class="methodname">operator+()</tt>
has been overloaded to mean &quot;one or more of&quot;, and <tt class=
"methodname">operator[]()</tt> is overloaded to adapt the behaviour
of a sub-parser - in this case using <tt class=
"constant">spirit::lexeme_d</tt> to turn off whitespace
skipping.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e208" id="d0e208"></a>Parser
Generators</h2>
</div>
<p>I should also mention that the '=' sub-parser is a shorthand for
<tt class="literal">spirit::ch_p('=')</tt>, which in turn is a
parser generator returning the character literal parser <tt class=
"literal">spirit::chlit&lt;char&gt;('=')</tt>.</p>
<p>Similarly, <tt class="methodname">spirit::hex_p</tt> and
<tt class="methodname">spirit::uint_p</tt> are parser generator
functions which return suitable specialisations of the <tt class=
"methodname">spirit::uint_parser</tt> template struct. The full
template parameters of this struct are as follows:</p>
<pre class="programlisting">
template&lt;typename T = unsigned,
         int Radix = 10,
         unsigned MinDigits = 1,
         int MaxDigits = -1&gt;
struct uint_parser { /* */ };
</pre>
<p>The helper functions <tt class="methodname">hex_p</tt> and
<tt class="methodname">uint_p</tt> are often good enough, but it's
also useful to have the full flexibility of the base parser. For
example, if we need to match larger hex values, and <tt class=
"type">long long</tt> is available, we could create an alternative
hex parser:</p>
<pre class="programlisting">
uint_parser&lt;unsigned long long, 16&gt; const
  long_long_hex_p
    = uint_parser&lt;unsigned long long, 16&gt;();
</pre>
<p>In fact, the <tt class="classname">uint_parser</tt> should work
with any user defined scalar type.</p>
<p>(You've probably noticed I'm now working in the <tt class=
"literal">boost::spirit</tt> namespace. I continue to do so for the
remainder of this article.)</p>
<p>One thing I cannot do with the hex parser, unfortunately, is get
it to accept the <tt class="literal">0x</tt> we've used to prefix
hex digits. We can fix the bug in our program by introducing a new
parser rule.</p>
<pre class="programlisting">
with_base_hex_p
  =  lexeme_d
     [
       as_lower_d[&quot;0x&quot;]
       &gt;&gt; hex_p
     ];
</pre>
<p>Note here:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>the <tt class="literal">as_lower_d</tt> directive, which
converts all characters from the input to lowercase, and therefore
recognising both <tt class="literal">0x</tt> and <tt class=
"literal">0X</tt>.</p>
</li>
<li>
<p>the rather unusual code layout. I have tried to follow the
Spirit style guide [<a href="#Spirit3">Spirit3</a>] when writing
parser grammars. This will become increasingly more important when
we develop a more substantial grammar.</p>
</li>
<li>
<p>the string literal &quot;<tt class="literal">0x</tt>&quot;, which in this
context becomes yet another parser.</p>
</li>
</ul>
</div>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e289" id="d0e289"></a>Semantic
Actions</h2>
</div>
<p>Simply recognising fields is not enough: we need to act on their
contents. That is, we must associate semantic actions with the
sub-parsers. This can be done using another overload of <tt class=
"methodname">operator[]()</tt>, which enables us to link an action
to a parser.</p>
<p>Here, then is an encoder which will convert text versions of
sections to binary. I have omitted <tt class=
"literal">#include</tt> directives etc. for brevity. The full
implementation is available with the source distribution for this
article [<a href="#Homepage">Homepage</a>].</p>
<pre class="programlisting">
typedef std::string::const_iterator iter;
/**
 * @brief Put the input value to the output
 *  stream using the specified bitwidth
 */
void putBits(std::ostream &amp;, unsigned w,
             unsigned v) { /* */ }
/**
 * @brief Parse a field of the form:
 *  &lt;field_name&gt; &lt;bitwidth&gt; = &lt;value&gt;
 */
bool parseField(iter const &amp; begin,
                iter const &amp; end,
                unsigned &amp; bitwidth,
                unsigned &amp; value) {
  return parse(
      begin, 
      end,
      lexeme_d[+graph_p]
      &gt;&gt; uint_p[assign_a(bitwidth)]
      &gt;&gt; '='
      &gt;&gt; lexeme_d
         [
            ! as_lower_d[&quot;0x&quot;]
            &gt;&gt;  hex_p[assign_a(value)]
         ]
      ,
      space_p).full;
int main() {
  std::string str;
  int line = 0;
  try {
    while(std::getline(std::cin, str)) {
      ++line;
      unsigned bitwidth, value;
      if(parseField(str.begin(), str.end(),
                    bitwidth, value)) {
        putBits(std::cout, bitwidth, value);
      }
    }
  }
  catch(std::exception &amp; exc) {
    std::cerr &lt;&lt; &quot;Error parsing line &quot; &lt;&lt; line 
              &lt;&lt; &quot;\n&quot; &lt;&lt; str &lt;&lt; &quot;\n&quot;
              &lt;&lt; exc.what() &lt;&lt; std::endl;
    return -1;
  }
  return 0;
}
</pre>
<p>Note here:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>I have used typedefs for the iterators passed into the parser.
This will ease switching to another forward iterator type, if
required.</p>
</li>
<li>
<p>I decided to make the 0x preceding hexadecimal values optional,
using Spirit's overload of <tt class=
"methodname">operator!()</tt></p>
</li>
<li>
<p>The use of the <tt class="literal">assign_a</tt> actor for our
semantic action. We could have used any function accepting an
<tt class="type">unsigned integer</tt> or any functor providing
<tt class="methodname">operator()(unsigned int)</tt>. Again, it's
simpler to use one of Sprit's off-the-peg actors.</p>
</li>
<li>
<p>The program implements a classic Unix filter. This makes it
suitable for use in a Unix pipeline. See [<a href=
"#Raymond2">Raymond2</a>] for more on this. Unfortunately, I'm not
sure this is a great idea for binary output: I haven't found a
portable way to reset <tt class="classname">std::cout</tt> for
binary.</p>
</li>
</ul>
</div>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e339" id="d0e339"></a>Exceptions in
Parsers</h2>
</div>
<p>Another important point to note about our simple encoder is the
way it handles failure conditions using C++ exceptions rather than
C-style error codes. There are plenty of failure conditions to
handle: a value might not fit in the available bitwidth, the output
stream might not be in a suitable state, and so on.</p>
<p>In this simple parser we might equally well have passed error
codes around, but a more complex parser is likely to involve
recursion and/or nested function calls. Exceptions perform well in
both the simple and the complex case, offering a scalable
solution.</p>
<p>The Spirit parser framework itself uses exceptions internally
for similar reasons. To quote the documentation:</p>
<div class="blockquote">
<blockquote class="blockquote">
<p>&quot;C++'s exception handling mechanism is a perfect match for
Spirit due to its highly recursive functional nature. C++
Exceptions are used extensively by this module for handling
errors.&quot;</p>
</blockquote>
</div>
<p>Like our program, Spirit should not leak any such exceptions to
its users.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e353" id=
"d0e353"></a>Weaknesses</h2>
</div>
<p>The simple encoder presented above follows Postel's
prescription, to a degree [<a href="#Postel">Postel</a>]. It
doesn't mind too much about whitespace; it allows any sequence of
printable characters as a field name; and it isn't fussy about the
presentation of hexadecimal numbers.</p>
<p>Its main flaw is that it does not look at the text format of our
sections as a whole: it simply skips the lines which close blocks
or start loops, for example. This means the encoding will quietly
do the wrong thing given input where a new-line has gone missing,
or where the data has been truncated. This is dangerous. It also
means the encoder cannot check the integrity of our text data - for
example, to confirm the <tt class="varname">section_length</tt>
field contains the actual section length, or to validate a CRC.</p>
<p>When we start thinking along these lines, we realise that
perhaps the encoder should calculate the values of these fields for
us. We'll need a CRC generator anyway - why not embed it in the
encoder?</p>
<p>These are important points. However, we never considered data
validation when we planned our codec and I'm not going to worry
about it just yet - I need to get started on the decoder. Data
validation, though valuable, would need to be optional since an
encoder must let us generate broken data for test purposes.</p>
<p>Also, Raymond [<a href="#Raymond2">Raymond2</a>] encourages us
to limit options whenever possible: if we can release code earlier
then our users can tell us which options they really want. Ideally,
he suggests we make the release open-source, and allow users to
(submit patches which) implement these options.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e375" id="d0e375"></a>Progress
Review</h2>
</div>
<p>We've used Spirit to write a micro-parser to drive the encoder.
We're ready to start on the decoder. Spirit's scalability will be
tested.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e380" id="d0e380"></a>The
Decoder</h2>
</div>
<p>I decided to attempt the second implementation strategy: to
develop a codec which understands the bitstream syntax and can use
it to parse an arbitrary section format. I had no good reason for
preferring this to the code-generator strategy.</p>
<p>As already noted, this is a parsing task. We will use Spirit to
define the grammar used by the bitstream syntax. We can then parse
our static program data - the section formats we're interested in -
which gives us the basis we need to parse the run-time program
inputs, that is, actual instances of binary encoded sections.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e387" id="d0e387"></a>Grammar
Definitions and Parse Trees</h2>
</div>
<p>I do not propose to dwell on the practical use of Spirit for
much longer: we've already seen enough of what it can do, so for
full implementation details please refer to [<a href=
"#Spirit2">Spirit2</a>] and the codec source distribution [<a href=
"#Homepage">Homepage</a>].</p>
<p>For the decoder, note that simply parsing the data once and
associating semantic actions to the various lexical elements is not
enough. For instance, to process descriptor loops we need to
revisit the same parser node several times. Spirit provides
abstract syntax trees for exactly this purpose.</p>
<p>I do think it is worth presenting here a portion of the section
grammar. To me, this is a quite remarkable application of C++. For
even more remarkable transcriptions of EBNF syntax definitions into
Spirit grammars - including a C++ tokenizer and a C parser - I
would recommend a visit to the Spirit Applications Repository
[<a href="#Spirit">Spirit</a>].</p>
<pre class="programlisting">
/**
 * @brief MPEG-2 Section grammar defined
 * using Boost Spirit.
 * Reference: - ISO/IEC 13818-1, MPEG-2
 *              Transport Stream
 */
struct Section :
  public boost::spirit::grammar&lt;Section&gt; {
  template &lt;typename ScannerT&gt;
  struct definition {
    definition(Section const &amp; /*self*/) {
      section_
        =  section_ref_
           &gt;&gt; section_body_
        ;
      section_ref_
        =  text_id_
           &gt;&gt;  '('
           &gt;&gt;  ')'
        ;
      text_id_
        =  leaf_node_d[
             lexeme_d[
               alpha_p
               &gt;&gt; * (alnum_p | '_')
             ]
           ]
        ;
      quoted_binary_
        =  leaf_node_d[
             lexeme_d[
               '\''
               &gt;&gt;  bin_p
               &gt;&gt; '\''
             ]
           ]
        ;
      section_body_
        =  ch_p('{')
           &gt;&gt; *(    field_
                 |  loop_
                 |  conditional_
                 |  section_ref_
               )
           &gt;&gt; '}'
        ;
      field_
        =  identifier_
           &gt;&gt;   uint_p
        ;
      identifier_
        =  text_id_
           |  quoted_binary_
        ;
      conditional_
        =  str_p(&quot;if&quot;)
           &gt;&gt; condition_
           &gt;&gt; section_body_
           &gt;&gt;  ! (str_p(&quot;else&quot;)
                  &gt;&gt;  section_body_)
        ;
      condition_
        =  inner_node_d['('
             &gt;&gt; text_id_
             &gt;&gt; &quot;==&quot;
             &gt;&gt; quoted_binary_
             &gt;&gt; ')'
           ]
        ;
      loop_
        =  loop_control_
           &gt;&gt; section_body_
        ;
      loop_control_
        =  leaf_node_d[str_p(&quot;for&quot;)
             &gt;&gt;  '('
             &gt;&gt;  * (anychar_p - ')')
             &gt;&gt;  ')'
           ]
        ;

    }
    ...
    boost::spirit::rule&lt;ScannerT&gt; const &amp;
    start() const {
       return section_;
    }
  };
};
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e407" id="d0e407"></a>Decisions
Taken</h2>
</div>
<p>Many of the decisions taken when writing our naive encoder scale
up to the decoder: limited use options; exceptions used in
preference to error codes; Spirit style guide for grammar
definitions; typedefs for iterators.</p>
<p>Some decisions were ones we haven't yet faced. The main one was:
where should we put section format definitions (for the PAT, CAT,
PMT and NIT)? There are two obvious alternatives:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>create a C++ source file containing these definitions - perhaps
as an array of string literals,</p>
</li>
<li>
<p>place them in a text file in a known location, and have this
file read when the decoder starts up.</p>
</li>
</ul>
</div>
<p>The second alternative is perhaps most faithful to our original
aims. Program logic and program data are nicely separated, and
extending the decoder to handle other sections is a simple matter
of editing the text file. No recompilation necessary.</p>
<p>Despite these attractions, I went for the first option - partly
because it's easier to implement and partly because I didn't want
to work out where to put the text file.</p>
<p>The other corner I cut concerns determining how and when to exit
loops. The issue is fully described in the first part of this
article (see the subsection headed &quot;Complications&quot;). My resolution
was to notice that loops always exit when we've used up the number
of bytes encoded in a length field - with the single exception of
the outermost loop, which may end four bytes early in order to
leave space for a CRC. So, the decoder maintains a stack of length
fields, testing the top value after each loop iteration, popping it
on loop exit. The first item to be stacked may need adjusting to
allow for the four byte CRC. Again, the source distribution should
make this clear.</p>
<p>I could find no official statement regarding what could be used
as a field name in the section format definitions: inspection
suggested that these names were rather similar to C identifiers,
with the important addition of quoted binary values for fixed
fields (e.g. the '0' which is the third named field in the
<tt class="varname">program_association_section</tt> format
definition).</p>
<p>A few more parse tree node directives might have resulted in a
leaner decoder, but I wanted the syntax grammar to be as simple as
possible. I am inexperienced in writing grammars and preferred a
small amount of extra code in my application. The application is
quite compact anyway.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e434" id="d0e434"></a>Ship Happens
[<a href="#Alexandrescu">Alexandrescu</a>]</h2>
</div>
<p>Raymond [<a href="#Raymond2">Raymond2</a>] has lots of practical
advice on how to ship a source code distribution, going down to
details of file naming conventions. Some of the suggestions I have
followed are:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>choose a suitable license</p>
</li>
<li>
<p>include a README</p>
</li>
<li>
<p>set up a project homepage [<a href="#Homepage">Homepage</a>]</p>
</li>
<li>
<p>include a BUGS file, listing known defects and limitations</p>
</li>
<li>
<p>include platform/compiler details</p>
</li>
<li>
<p>include self-test code</p>
</li>
</ul>
</div>
<p>The BUGS file is a strangely satisfying thing to write,
particularly if you've ever delivered software which doesn't admit
to defects, let alone limitations (or indeed if you've ever used
such software). In this case it is essential to document both.</p>
<p>Version 0.1 of the dvbcodec features the naive encoder described
in the preceding - really only of use for system testing (we can
check that decoding then encoding a file recreates the original
file). The decoder is rather better - in fact, I've extended it
beyond the original specification to handle a few more section
formats: <tt class="literal">dvbcodec -l</tt> gives details.</p>
<p>Having done the hard work and shipped our beta release, the rest
of this article is dedicated to some closing thoughts.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e476" id="d0e476"></a>Is C++ the
Right Tool for the Job?</h2>
</div>
<p>My criteria for language selection were somewhat artificial. If
I had been allowed a free hand I almost certainly would have been
biased towards Python [<a href="#Python">Python</a>]. However,
having gone the C++ route - the modern C++ route, even - it would
seem a good point to step back and review my selection.</p>
<p>Raymond [<a href="#Raymond2">Raymond2</a>] has reservations
about C++, which I summarise here:</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>By not automating memory management it fails to address C's
biggest shortcoming; and backwards compatibility with C has
compromised the language's design.</p>
</li>
<li>
<p>Object oriented software design isn't all it's cracked up to be.
All too often it leads to shaky object hierarchies, unnecessary
abstractions and code which is hard to maintain,</p>
</li>
<li>
<p>C++ is so complex that no one programmer can be expected to know
it all,</p>
</li>
<li>
<p>If C++ really were superior to C, it would now dominate it.</p>
</li>
</ol>
</div>
<p>(Incidentally, as already mentioned, Raymond is not knocking C++
to promote C. His recommendation is to adopt scripted and mixed
language solutions.)</p>
<p>To fully address all these points is outside the scope of this
article. Instead I shall look at each in the context of the
development of our codec:</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>By using standard library containers - <tt class=
"classname">map</tt>, <tt class="classname">vector</tt>, <tt class=
"classname">stack</tt>, <tt class="classname">string</tt> etc - I
have avoided a single direct call to operator new. If I've got my
exception handling and my use of Spirit right, there should be no
leaks. Regarding C compatibility, I have benefited from the C
standard library in a few places. Converting from C string literals
to C++ strings is convenient.</p>
</li>
<li>
<p>The application code (as opposed to the Spirit framework code)
uses only two concrete classes. I have resisted the temptation to
make these generic, or to make them derive from an abstract class.
The RAII class idiom is usefully employed. The Spirit framework
itself has moved with the times: what was &quot;implemented with
run-time polymorphic classes&quot; is now &quot;a complete rewrite ... using
expression templates and static polymorphism&quot;.</p>
</li>
<li>
<p>Agreed! Spirit's fine documentation includes examples which have
served as a basis for my own application. When I deviated from
these examples too far the result was a barely comprehensible
torrent of compiler errors. Typeless programming in a strongly
typed language can be tough<sup>[<a name="d0e528" href=
"#ftn.d0e528" id="d0e528">2</a>]</sup>.</p>
</li>
<li>
<p>C is far more portable. I believe our codec is standards
compliant, so maybe in the long term it will be portable. However,
at the moment (September 2004) the list of compilers which cope
with Spirit is small. So our codec isn't really portable. Not one
of the compilers I use at work could cope with this program. This
reflects my experience with C++ over the years: to get the good
stuff, either you wait, or you work around compiler limitations.
Bear in mind too that of two types of compiler bugs - incorrect
error messages, no object code; no error messages, incorrect object
code - the second is far more insidious and dangerous: and the
presence of the first naturally leads a programmer to suspect the
existence of the second.</p>
</li>
</ol>
</div>
<p>Despite all this, Spirit has proven itself flexible, scalable,
capable of expressing grammars clearly and of writing efficient
parsers without the need to step outside C++. Indeed, it could
never have been done without C++. I am sure I will use the Spirit
parser framework again.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e550" id="d0e550"></a>Open
Source</h2>
</div>
<p>The future of Unix is Linux is open source. Raymond is
passionate about software quality and argues convincingly that open
source the best way to deliver the highest quality software. I do
not propose to rehearse these arguments here: Raymond's writings
are available both in print form and online. (See, for example
[<a href="#Raymond">Raymond</a>]).</p>
<p>What does interest me is that I cannot see how the full power of
Spirit could be realised using anything other than a full source
code distribution. It's all done with header files. Maybe with some
reworking the implementation could be delivered in pre-built
libraries, multiplied up by the various operating system, platform,
version permutations - but wouldn't this make it even harder for
compilers to build applications based on Spirit?</p>
<p>Of course, open source means more than just access to source:
but it's still notable that this style of C++ favours open source
distribution<sup>[<a name="d0e562" href="#ftn.d0e562" id=
"d0e562">3</a>]</sup>.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e566" id="d0e566"></a>And
Finally</h2>
</div>
<p>I'm still not sure if it would have been better to write a code
generator, to generate our codec from the section formats.</p>
<p>Maybe I'll try using Spirit and C++ to generate a C codec.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e573" id="d0e573"></a>Credits</h2>
</div>
<p>This article and the accompanying source code was developed
using the GNU emacs integrated development environment (GNU emacs,
GNU make, GCC, find, grep, etags, PCL-CVS etc), JASSPA Microemacs
(for its superb text mode and binary editor), all running on the -
sorry Eric, thanks Cygwin - Microsoft Windows operating system.</p>
</div>
<div class="bibliography">
<div class="titlepage">
<h2><a name="d0e578" id="d0e578"></a>References</h2>
</div>
<div class="bibliomixed"><a name="Alexandrescu" id=
"Alexandrescu"></a>
<p class="bibliomixed">[Alexandrescu] I borrowed this section
header from Andrei Alexandrescu's homepage. <span class=
"bibliomisc"><a href="http://www.moderncppdesign.com/main.html"
target="_top">http://www.moderncppdesign.com/main.html</a></span>.
It's funny because it's true.</p>
</div>
<div class="bibliomixed"><a name="Antonsen" id="Antonsen"></a>
<p class="bibliomixed">[Antonsen] Frank Antonsen, &quot;Stream-Based
Parsing in C++&quot;, <span class="citetitle"><i class=
"citetitle">Overload 56</i></span>, August 2003</p>
</div>
<div class="bibliomixed"><a name="Boost" id="Boost"></a>
<p class="bibliomixed">[Boost] Boost: <span class=
"bibliomisc"><a href="http://www.boost.org" target=
"_top">http://www.boost.org</a></span></p>
</div>
<div class="bibliomixed"><a name="Raymond" id="Raymond"></a>
<p class="bibliomixed">[Raymond] Eric Raymond, <span class=
"bibliomisc"><a href=
"http://www.catb.org/~esr/writings/cathedral-bazaar/" target=
"_top">http://www.catb.org/~esr/writings/cathedral-bazaar/</a></span></p>
</div>
<div class="bibliomixed"><a name="Colvin" id="Colvin"></a>
<p class="bibliomixed">[Colvin] Greg Colvin, <span class=
"citetitle"><i class="citetitle">In the Spirit of C</i></span>,
<span class="bibliomisc"><a href=
"http://www.artima.com%20/cppsource/spiritofc.html" target=
"_top">http://www.artima.com
/cppsource/spiritofc.html</a></span></p>
</div>
<div class="bibliomixed"><a name="DVB" id="DVB"></a>
<p class="bibliomixed">[DVB] Digital Video Broadcasting (DVB);
Specification for Service Information (SI) in DVB systems</p>
</div>
<div class="bibliomixed"><a name="Heinzmann" id="Heinzmann"></a>
<p class="bibliomixed">[Heinzmann] S. Heinzmann, &quot;The Tale of a
Struggling Template Programmer&quot;, <span class="citetitle"><i class=
"citetitle">Overload 61</i></span>, June 2004</p>
</div>
<div class="bibliomixed"><a name="Homepage" id="Homepage"></a>
<p class="bibliomixed">[Homepage] Homepage: <span class=
"bibliomisc"><a href="http://homepage.ntlworld.com/thomas.guest"
target=
"_top">http://homepage.ntlworld.com/thomas.guest</a></span></p>
</div>
<div class="bibliomixed"><a name="ISO" id="ISO"></a>
<p class="bibliomixed">[ISO] INFORMATION TECHNOLOGY - GENERIC
CODING OF MOVING PICTURES AND ASSOCIATED AUDIO: SYSTEMS
Recommendation H.222.0 ISO/IEC 13818-1</p>
</div>
<div class="bibliomixed"><a name="PLY" id="PLY"></a>
<p class="bibliomixed">[PLY] PLY: <span class="bibliomisc"><a href=
"http://systems.cs.uchicago.edu/ply/" target=
"_top">http://systems.cs.uchicago.edu/ply/</a></span></p>
</div>
<div class="bibliomixed"><a name="Postel" id="Postel"></a>
<p class="bibliomixed">[Postel] The canonical form of Postel's
prescription, according to the Jargon File (<span class=
"bibliomisc"><a href="http://www.catb.org/~esr/jargon/" target=
"_top">http://www.catb.org/~esr/jargon/</a></span>) is: &quot;Be liberal
in what you accept, and conservative in what you send.&quot;</p>
</div>
<div class="bibliomixed"><a name="Python" id="Python"></a>
<p class="bibliomixed">[Python] Python: <span class=
"bibliomisc"><a href="http://www.python.org" target=
"_top">http://www.python.org</a></span></p>
</div>
<div class="bibliomixed"><a name="Raymond2" id="Raymond2"></a>
<p class="bibliomixed">[Raymond2] Eric S. Raymond, <span class=
"citetitle"><i class="citetitle">The Art of UNIX
Programming</i></span>, Addison-Wesley 0-13-142901-9</p>
</div>
<div class="bibliomixed"><a name="Spirit" id="Spirit"></a>
<p class="bibliomixed">[Spirit] The Spirit Applications Repository
is available at <span class="bibliomisc"><a href=
"http://spirit.sourceforge.net" target=
"_top">http://spirit.sourceforge.net</a></span></p>
</div>
<div class="bibliomixed"><a name="Guest" id="Guest"></a>
<p class="bibliomixed">[Guest] Thomas Guest, &quot;A Mini-project to
Decode a Mini-language - Part One&quot;, <span class=
"citetitle"><i class="citetitle">Overload 63</i></span>, October
2004</p>
</div>
<div class="bibliomixed"><a name="Spirit2" id="Spirit2"></a>
<p class="bibliomixed">[Spirit2] Spirit: <span class=
"bibliomisc"><a href="http://www.boost.org/libs/spirit/index.html"
target=
"_top">http://www.boost.org/libs/spirit/index.html</a></span></p>
</div>
<div class="bibliomixed"><a name="Spirit3" id="Spirit3"></a>
<p class="bibliomixed">[Spirit3] Spirit Style Guide: <span class=
"bibliomisc"><a href=
"http://www.boost.org/libs/spirit/doc/style_guide.html" target=
"_top">http://www.boost.org/libs/spirit/doc/style_guide.html</a></span></p>
</div>
</div>
<div class="footnotes"><br>
<hr class="c2" width="100">
<div class="footnote">
<p><sup>[<a name="ftn.d0e53" href="#d0e53" id=
"ftn.d0e53">1</a>]</sup> My job involves writing portable C++ to
run on embedded platforms. The compilers supplied for these
platforms often do not support &quot;modern&quot; C++ features such as
templates.</p>
</div>
<div class="footnote">
<p><sup>[<a name="ftn.d0e528" href="#d0e528" id=
"ftn.d0e528">2</a>]</sup> The &quot;Techniques&quot; section of the Spirit
documentation [<a href="#Spirit2">Spirit2</a>] describes an
extraordinary method for obtaining an object's type: &quot;... Try to
compile. Then, the compiler will generate an obnoxious error
message ... THERE YOU GO! You got its type! I just copy and paste
the correct type.&quot;</p>
<p>Elsewhere, the Spirit documentation mentions Dave Abrahams'
proposal to reuse the <tt class="literal">auto</tt> keyword for
type deduced variables.</p>
<p>See also Colvin [<a href="#Colvin">Colvin</a>] for a radical
take on this issue.</p>
</div>
<div class="footnote">
<p><sup>[<a name="ftn.d0e562" href="#d0e562" id=
"ftn.d0e562">3</a>]</sup> The case for access to source code is
even stronger for the libraries built using popular scripting
languages such as Python and Perl.</p>
</div>
</div>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
<p><em>More fields may be available via dynamicdata ..</em></p>
</div>
</channel>
</rss>
