Title: A Mini-project to Decode a Mini-language - Part One

Author:

Date: 01 October 2004 16:25:38 +01:00 or Fri, 01 October 2004 16:25:38 +01:00

Summary:

Body:

This article appears in two parts. Part One - this part - describes the first stages of a mini-project to write a codec for a mini-language. (If you don't know what codecs and mini-languages are, don't worry. Read on!) Part two will describe the later stages of this project and present an actual implementation of the codec.

Keen readers will, then, have the opportunity to try out an implementation of their own before part two appears, based on the specification arrived at during the course of the following paragraphs. As encouragement, I provide a (link to a) data suite which can be used for test purposes.

Motivation

The motivation for this article comes from "The Art of UNIX Programming" by Eric Raymond ([Raymond]). This is one of the most inspiring books I've read on how to write software: although firmly rooted in the traditions of the UNIX operating system, the culture and philosophy it describes applies far more widely. It has reinforced my belief that software development can indeed be an art.

Having read this book, I wanted to put some of its ideas into practice. So I set myself a mini-project.

An Idea For a Mini-project

As a starting point, I'd like to summarise two of [Raymond]'s most important lessons

Data structures, not algorithms, are central to programming.
Prefer text file formats: they're human-readable and extensible. If you must use a binary format then invest in a tool which converts from this format to a textual one and back again. This will facilitate working with your data.

I work in a domain where the need for compression requires the use of binary file formats: namely digital television (DTV). Digital video is typically MPEG-2 encoded. This is a highly compressed encoding designed to squeeze as much content as possible into a limited bandwidth. MPEG-2 encoding also allows video and audio content to be combined with metadata: for example, a television programme might be accompanied by a text description of itself and of what's showing next^[1].

To get to grips with this metadata a conversion tool is required. The specific task I set myself, then, was to write a tool to convert MPEG-2 metadata from binary to text, and, if required, to reverse the process. Such an encode/decode tool is commonly referred to as a 'codec'.

My project, then, was to implement a digital television codec.

The MPEG-2 Bit Stream Syntax: An Example of a Mini-language

The metadata format is specified using the MPEG-2 bit stream syntax which is defined in [ISO-IEC13818-1]. Data items are described by name and length in bits using a C-like procedural syntax. An example makes this clear:

TS_program_map_section() {
  table_id                 8
  section_syntax_indicator 1
  '0'                      1
  reserved                 2
  section_length          12
  program_number          16
  reserved                 2
  version_number           5
  current_next_indicator   1
  section_number           8
  last_section_number      8
  reserved                 3
  PCR_PID                 13
  reserved                 4
  program_info_length     12
  for(i=0; i<N; i++) {
    descriptor()
  }
  for(i=0;i<N1;i++) {
    stream_type            8
    reserved               3
    elementary_PID        13
    reserved               4
    ES_info_length        12
    for(i=0; i<N2; i++) {
      descriptor()
    }
  }
  CRC_32                  32
}

What we have here is the bit stream syntax for a section of the Program Map Table. The first 8 bits give the table id (which happens to be 2, for this particular table); the single bit which follows provides the section syntax indicator; the next bit is always set to zero; the next two bits are reserved (and should each be set to one); and so on, until we get to the 32 bit CRC^[2].

To provide a little context: the Program Map Table (PMT) supplies basic information about the digital television services present in an MPEG-2 transport stream. A section of this table - as shown above - defines the elementary streams which comprise a single television service. For example, the PMT for the BBC1 digital television service consists of a video stream, an audio stream, a subtitle stream and some data streams. Of course, [ISO-IEC13818-1] defines the format of many other tables and sections, and related specifications - such as [ETSI-EN300468] define many more.

This textual specification of a binary format is an example of what [Raymond] terms a mini-language. In fact we have a Turing-complete mini-language: that is, it allows for loops and conditionals. The particular example shown here does not include any conditionals, though we do have nested loops. Note also the referenced descriptor() items. To fully parse the TS_program_map_section() we'll need the descriptor() format specified too:

descriptor() {
  descriptor_tag       8
  descriptor_length    8
  for(i=0; i<N; i++) {
    private_data_byte  8
  }
}

Complications

The syntax is easy to read, particularly to anyone familiar with C. However, if we look more closely at the for-loop in Example 2, although it's apparent that i must be an integral loop counter, it's less clear where N is defined. Similarly, in Example 1, how do we find the values of N1 and N2?

[ISO-IEC13818-1] answers these questions. In Example 2, the descriptor_length data element encodes an unsigned integer which tells us how many bytes of data are to follow: so N is simply the value obtained by decoding descriptor_length. Example 1 is not quite so simple. N is easy enough - it's as many variable-length descriptors as it takes to fill the total length specified by program_info_length. N2 is similarly the number of descriptors to fill the length specified by the most recent occurrence of ES_info_length. For N1 however, we have to keep decoding elementary streams until the following is true:

Sum of elementary stream lengths (in bytes)
   == 'section_length' - 
      - (2 + 5 + 1 + 8 + 8 
         + 3 + 13 + 4 + 12) / 8
      - 'program_info_length'
      - 32 / 8

We have to divide by 8 since field widths of values within the PMT section are given in bits - but length fields give values in bytes.

Despite these complications, the encoding is well-designed: we can parse these binary data structures sequentially without needing to look ahead; and we can skip over any bits we're not interested in.

In fact, the more closely we inspect our examples, the more we notice. This is good. Recall that data structures, not algorithms, are central to programming. Already we're getting stuck into the data.

While we're in this positive frame of mind, let's review the full range of control structures required by the [ISO-IEC13818-1] bitstream syntax:

while(condition) {
  data_element
  ...
}
do {
  data_element
  ...
}
while(condition)
if(condition) {
  data_element
  ...
}
else {
  data_element
  ...
}
for(i=0;i<n;i++) {
  data_element
  ...
}

So, we've got pretty much C's control structures, excepting switch, break, return and continue.

This is starting to look alarming. How complex can a condition be? How shall we handle three different looping constructs? Our mini-project has become rather bigger than we imagined.

Back to the Data

The thing to do at this point is to shelve these concerns and get back to specifics. So, I got hold of some PMT section data and parsed it by hand. I used two types of data:

PMT sections pulled out of recorded digital TV broadcasts
a simple PMT section synthesised by hand.

I shall spare you the details. Note though that in parsing by hand we're already starting work on our text output format. For example, given the binary contents of a synthesised descriptor:

0a 04 0a 0b 0a 0b

and recalling the descriptor syntax:

descriptor() {
  descriptor_tag        8
  descriptor_length     8
  for (i=0; i<N; i++) {
    private_data_byte   8
  }
}

a suitable output might be:

descriptor() {
  descriptor_tag      8 = 0x0a
  descriptor_length   8 = 0x04
  for (i=0; i<N; i++) {
    private_data_byte 8 = 0x0a
    private_data_byte 8 = 0x0b
    private_data_byte 8 = 0x0a
    private_data_byte 8 = 0x0b
  }
}

I have deliberately chosen an output format which closely resembles the syntax definition. The loop has been unrolled, but I have retained the loop control to indicate the structure and origin of the data. I have chosen a hexadecimal representation for the data values - always a good choice for binary data - and explicitly indicate the numeric base used by prefixing these values with the string 0x. Finally, I have retained the bit widths for convenience: this will mean that when converting from text to binary, there will be no need to refer back to the descriptor syntax.

Referrring back to our motivating reference, we see we have instinctively followed one of [Raymond]'s recommendations:

"when filtering, never throw away information you don't need to".

(Here, the term "filter" is used in its Unix sense, and applies well to a codec; and the reasoning is that any discarded information can never be used in any program further down the Unix pipeline). In our example, we can see that the output includes all the information carried by both the descriptor syntax definition and by the example descriptor.

Handling Failures

Suppose our descriptor was too short:

0a 04 0a 0b 0a

What should our codec make of such data?

Maybe something like this:

descriptor() {
  descriptor_tag      8 = 0x0a
  descriptor_length   8 = 0x04
  for (i=0; i<N; i++) {
    private_data_byte 8 = 0x0a
    private_data_byte 8 = 0x0b
    private_data_byte 8 = 0x0a
  >>> ERROR: end of data reached. descriptor() incomplete

It's perhaps premature to tie down how errors should be handled, other than to say that they should draw attention to themselves, that they shouldn't crash the codec, and that they should provide useful diagnostics. But it certainly isn't premature to include some malformed data in our test set.

Another point [Raymond] makes about data conversion tools is that they should be generous in what they accept (as input) but rigorous in what they emit (as output). In our case, this means that a user might change the layout of a text version of a descriptor() to read like this:

descriptor() 
    {   descriptor_tag 8 = 0xA
        descriptor_length 8 = 0x4
        for (i = 0; i < N; i++) 
        {    private_data_byte 8 = 0xA
             private_data_byte 8 = 0xB
             private_data_byte 8 = 0xA
             private_data_byte 8 = 0xB
        }
    }

and still expect the codec to convert this to binary as:

0a 04 0a 0b 0a 0b

One of the great benefits of having a codec is that we can generate binary data from an easy-to-edit textual form: it would be a severe limitation if the encoding process was over-sensitive about whitespace, layout, or the capitalisation of hexadecimal numbers.

Reducing Project Scope

Whilst tinkering around with my test data set, I've also been paging through the MPEG-2 bitstream syntax. The bad news is that the expressions which appear in conditionals may be quite complex, making use of all the usual C arithmetic, bitwise, logical and relational operators as well as a few domain-specific additions.

The good news is that we can make good progress if we restrict our scope as follows:

Restriction 1: restrict the control structures to for() {...} and if (condition) {...} else {...}
Restriction 2: restrict conditions to the form field == value

These restrictions will not make a lot of sense to end users. In end user terms, we can aim for a first release of our codec which will only support sections from the following four tables:

Program Association Table (PAT)
Conditional Access Table (CAT)
Network Information Table (NIT)
Program Map Table (PMT)

This reduced scope may seem rather limiting.

Note however that these four tables - collectively termed the Program Specific Information (PSI) Tables - "contain the necessary and sufficient information to demultiplex and present programs" ([ISO-IEC13818-1]).

Note further that our syntactic restrictions will not stop us from extending our codec to handle the complete set of DVB Service Information (SI) tables ([ETSI-EN300468]), which contain just about all of the metadata used in European digital broadcasts. Note finally that we remain faithful to our aims: [Raymond] emphasises the need to release early and often, so that users can drive (and implement, even, in an open source world) future developments. By reducing scope, we allow for this early feedback. We must take care, though, to follow another Unix maxim, and keep our design extensible.

A Prototype Descriptor Decoder

/**
 * @brief Decode a descriptor.
 * @param begin The start of the descriptor data
 * @param end One past the end of the descriptor
 *            descriptor data
 * @param out Output stream for the decoded data
 */
bool decodeDescriptor(desc_iter begin,
         desc_iter end, std::ostream & out) {
  out << "descriptor() {\n";
  if(begin != end) {
    out << "    descriptor_tag 8 = "
        << *begin++ << "\n";
  }

  // We don't know yet how much data we need to
  // decode.  Use a special non-zero value to
  // indicate this.
  unsigned int to_decode = 0xff;

  if(begin != end) {
    to_decode = *begin++;
    out << "    descriptor_length 8 = "
        << to_decode << "\n";
    out << "    for(i=0; i<N; i++) {\n";
    while(begin != end && to_decode != 0) {
      out << "        "
          << "private_data_byte 8 = "
          << *begin++ << "\n";
      --to_decode;
    }
    out << "    }\n"; 
  }
  if(begin != end) {
    out << "ERROR: descriptor() too long.\n";
  }
  else if (to_decode != 0) {
    out << "ERROR: end of data reached. "
        << "descriptor() incomplete.\n";
  }
  else {
    out << "}\n";
  }
  return begin == end && to_decode == 0;
}

The function above is a direct first attempt at writing a descriptor decoder. Whilst there may be some mileage in this approach, some weaknesses are apparent:

The indentation is fixed. This won't do if we are decoding a descriptor in the broader context of a PMT section, when it can appear at two different levels.
The error handling is clumsy.
Data - in this case, the descriptor's syntax - has become muddled with control flow.

Now is not the time to deal with these weaknesses. We shall simply note that the first is simple to fix and the second can easily be improved on. It's the third weakness which, in the longer term, will lead to code which is harder to understand, maintain and extend. On the other hand, this function demonstrably works on our test data set, which is encouraging; and it's not hard to see how the approach taken could be extended to decode PAT, CAT, NIT and PMT sections - which is all we've decided to do.

Design Alternatives

We are now in a good position to consider the design of our dtv-codec. Three alternatives spring to mind:

Implement a descriptor-codec, a pmt-codec, and so on as required. Here, each mini-codec understands its own designated part of the syntax. Then the dtv-codec simply farms out work as appropriate. This extends the direct approach described above.
Implement a general dtv-codec which understands the full bitstream syntax and can use it to parse an arbitrary section format. All that then remains is to prime this codec with the required section formats.
Devise a code generator which, given a section format, will generate a program to code/decode that particular format.

All three alternatives are good, and all seem in line with our motivating aims. The third, in particular, exemplifies [Raymond]'s "Rule of Generation: Avoid hand-hacking; write programs to write programs when you can."

In choosing which route to take, we should remember the [XP] mantra: "Do the simplest thing possible"; which, in UNIX-speak, becomes the more prosaic: "Keep it Simple, Stupid!"

More Details

For those who want to implement their own codec I have posted a zip archive to my website [HomePage]. This archive contains:

binary PAT, CAT, NIT and PMT sections
synthesised text sections
alternative text versions
malformed binary sections
the relevant section syntax definitions
table_id values
a README

For those who'd prefer to see my attempt; you'll have to wait for part two of this article.

Conclusions

Progress has been made, and without the need to compromise our artistic aims. Even before we've completed the project, we've started to receive the main benefit: of understanding our data.

The second part of this article is where we compromise, get our hands get dirty, bite off more than we can chew - that is, we write some code.

References

[Raymond] Eric S. Raymond, The Art of UNIX Programming, Addison-Wesley 0-13-142901-9

[ISO-IEC13818-1] Information Technology - Generic Coding Of Moving Pictures And Associated Audio: Systems Recommendation H.222.0 ISO/IEC 13818-1

[ETSI-EN300468] Digital Video Broadcasting (DVB); Specification for Service Information (SI) in DVB systems

[XP] Extreme Programming, http://www.extremeprogramming.org

[HomePage] http://homepage.ntlworld.com/thomas.guest

^[1] The proper term for these particular items of metadata is Event Information, present and following. Since this article is not primarily about digital television I shall avoid such jargon as far as possible.

^[2] I assume the 'reserved' parts of the section are included to provide room for a degree of future extensibility. But not much room. This is one of the reasons why [Raymond] advocates text file formats: "if you need a larger value in a text format, just write it."

Notes:

More fields may be available via dynamicdata ..