Journal Articles

CVu Journal Vol 27, #6 - January 2016 + Student Code Critiques from CVu journal.
Browse in : All > Journals > CVu > 276 (7)
All > Journal Columns > Code Critique (70)
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: Code Critique Competition 97

Author: Martin Moene

Date: 02 January 2016 21:20:07 +00:00 or Sat, 02 January 2016 21:20:07 +00:00

Summary: Set and collated by Roger Orr. A book prize is awarded for the best entry.

Body: 

Participation in this competition is open to all members, whether novice or expert. Readers are also encouraged to comment on published entries, and to supply their own possible code samples for the competition (in any common programming language) to scc@accu.org.

Note: If you would rather not have your critique visible online, please inform me. (We will remove email addresses!)

Last issue’s code

I have written some code to read in a CSV file and handle quoted strings but I seem to get an extra row read at the end, not sure why.

If I make a file consisting of one line:

--- Book1.csv ---

word,a simple phrase,"we can, if we want, embed commas"

--- ends ---

I get this output from processing the file:

    Rows: 2
    Cells: 3
      word
      a simple phrase
      "we can, if we want, embed commas"
    Cells: 1

What have I done wrong?

The code is in Listing 1.

// Reading CSV with quoted strings.
#include <iostream>
#include <string>
#include <vector>

typedef std::string cell;
typedef std::vector<cell> row;
typedef std::vector<row> table;

table readTable()
{
  char ch;
  table table; // the table
  row *row = 0; // the row
  cell *cell = 0; // the cell
  char quoting = '\0';
  while (!std::cin.eof())
  {
    char ch = std::cin.get();
    switch (ch)
    {
    case '\n':
    case ',':
      if (!quoting) {
        cell = 0;
        if (ch == '\n') {
          row = 0;
        }
        break;
    case '\'':
	case '"':
      if (quoting == ch) {
        quoting = '\0';
        }
      else if (!cell) {
        quoting = ch;
        }
    }
    default:
      if (!row) {
        table.push_back({});
        row = &table.back();
      }
      if (!cell) {
        row->push_back({});
        cell = &row->back();
      }
      cell->push_back(ch);
      break;
    }
  }
  return table;
}

int main()
{
  table t = readTable();

  std::cout << "Rows: " << t.size() << "\n";
  for (int r = 0; r != t.size(); ++r) {
    std::cout << "Cells: "
      << t[r].size() << "\n";
    for (int c = 0; c != t[r].size(); ++c)
    {
      std::cout << "  " << t[r][c] << "\n";
    }
  }
}
			
Listing 1

Critiques

James Holland

When I compiled the student’s code I received three warnings and no errors. It is always a good idea to produce code that compiles with no warnings and so I set about seeing what the compiler was complaining about. A brief investigation showed the cause of the problems. There is a redundant declaration of ch at the start of readTable(). The declaration should be deleted. (The ch that is used by the program is the one declared at the start of the while loop within readTable().)

The other warnings concern the nested for loops within main(). The problem is that size() returns an unsigned type that is being compared with a signed type. Mixing types in this way can produce unexpected results and so is best avoided. Changing the type of variables r and c to size_t will keep the compiler happy.

I see the student talks about a file named ‘Book1.csv’ and yet the supplied program receives input from the console. I assume the student was experimenting in an attempt to debug the code. [Ed: Apologies that I failed to make that clear – the student was using command-line redirection from the file.] I found it more convenient to input data from the file and so I modified the code accordingly but keeping the student’s use of the eof() function. Running the amended code did not result in an extra blank line reported by the student. I suspect the student’s Book1.csv file consisted of a carriage return on the end of the first line and it is this that caused of the extra blank row in table. In any case, the student’s use of eof() is incorrect.

The student used a while loop to read characters from the file. The body of the loop will repeatedly execute until the end of file marker is set. The marker is set only when an attempt has been made to read a character from beyond the end of the file. The student’s code reads characters at the start of the loop body and this is where the problem lies. Within the loop, characters are read up to and including the last character of the file. At this point the end of file marker has not been set and so the loop is executed one more time. An attempt is made to read another character and the end of file marker is set. Unfortunately it is set too late, the loop body has already started to execute. Instead of returning a character from the file, get() returns EOF. The program then processes this value (-1 on my machine) in the same way it would a character from the file. This is clearly not what the student intended. To cure this problem, the body of the loop must not be allowed to execute after an attempt has been made to read beyond the last character of the file. Fortunately this is easy to arrange.

What is required is for the while loop to attempt to read a character from the file and then check for the end of file marker being set all within the predicate of the loop. The code shown below achieves the required result.

  char ch;
  while (file.get(ch))
  {
    …
  }

In fact, the file.get() returns the file stream and so the predicate tests, with an implicit conversion to bool, whether the stream is in a good state. Attempting to read one character beyond the end of the file renders the stream not in a good state and so the body of the loop will not be executed, as required.

The program will now work as expected. However, one cannot help but notice that the function readTable() is quite involved. It consists of switch statement where one case falls through to the next, there are case labels within if statements. In fact, there are a total of six if statements within the switch statement. All of this wrapped up in a while loop. As far as I can determine, the code works correctly. It is, however, very difficult to reason about. It is just far too complicated. There has to be another way. Fortunately there is.

The function readTable() is mostly concerned with searching for patterns within text strings. Whenever searching for patterns is mentioned, regular expressions should come to mind. C++11 has a regular expression library that is based on the Boost library. Using regular expressions greatly simplifies readTable() as is shown below.

  table readTable()
  {
    boost::regex regular_expression(
      "('.*')|(\".*\")|[^, ]+( *[^, ]+)*");
    boost::sregex_iterator end;
	
    table table;
    std::ifstream file("Book1.csv");
    for (std::string record;
    getline(file, record);)
    {
      boost::sregex_iterator position(
        record.cbegin(), record.cend(),
        regular_expression);
      row row;
      for (; position != end; ++position)
      {
        row.push_back(position->str());
      }
      table.push_back(row);
    }
    return table;
  }

The first statements of the replacement readTable() contains the regular expression that defines what is being searched for and consists of three parts that are separated by the ‘or’ symbol |. The first part, ('.*'), defines a search for any number of characters starting with a single quote mark and ending with a single quote mark. The next part of the expression defines a search of any number of characters starting with a double quote mark and ending with a double quote mark. The third part of the expression is a little more complicated. It defines a search for a string starting with any character that is not a comma and is not a space. The string may then contain any number of spaces and any number of characters that are not a comma and not a space. This, in effect, defines a word of a phrase. The expression then goes on to define that a phrase can contain any number of words.

It has to be said that using the regular expression grammar effectively requires a little practice. Once mastered, however, regular expressions are a concise way of defining search patterns that do not require the programmer to construct complicated and hard to fathom state machines.

The rest of readTable() is fairly conventional. It reads the file a line at a time and then, using the regular expression, searches for the words and phrases within the line. readTable() then assembles the words and phrases ready to be pushed onto table.

Commentary

There were a number of problems with the supplied code including the ‘presenting problem’ of the extra empty row. As James correctly points out, this one is caused by the erroneous use of the eof() method in the program. In my experience, code that uses an explicit call to eof() is often incorrect. It is generally easier to use the implicit conversion to bool as this seems to naturally help to create better handling of end of input.

I do not recommend the naming convention used for table, row and cell in this program where the variables have the same name as their type as this can lead to hard-to-understand code. Additionally, since the meaning of the symbol depends on the scope, moving code around during maintenance or refactoring can lead to obscure breakages.

People are often surprised to come across the interactions between switch and if demonstrated in this program. Again, I would not recommend this style unless there are some overwhelming reasons why it needs to be used (such as a hard performance requirement if this technique can be shown to be measurably faster than a more structured alternative). The reason is the additional complexity that intermingling case statements and if statements leads to. It becomes hard to reason correctly about the code as most programmers in C or C++ have a natural tendency to unconsciously treat each case statement separately and so fail to fully take into consideration the ‘dangling’ portion of the earlier if statements.

While I was expecting to receive critiques of the code that attempted to make the structure of the state machine clearer, I think that James’ approach of using a pre-written standard solution has a lot to recommend it. There may be an initial conceptual barrier of understanding the syntax of the regular expression being used – but since this is using a standard regular expression grammar this will be a transferable skill! Once the regular expression is checked, the rest of the code is simple enough to be clearly correct.

The main downside of using regular expression matching is likely to be a performance one since it is likely to be slower than a tailored state machine solution. As always though, even when performance is relevant, it is important to measure rather than rely on a guess. I did a naive measurement comparing the original code and James’ solution; the regular expression code only took 4.4 seconds to read a quarter of a million rows which, while slower than the original code at 1.2 seconds, is likely to be fast enough in many cases.

The winner of CC 96

There was only one critique this time – perhaps the unstructured use of control flow put people off!

As I mentioned in my commentary, I liked James’ approach of raising the abstraction level of the problem and using the power of the regular expressions handling in the standard C++ library to solve the problem.

His solution is written in terms of boost, but you can simply replace boost with std for compilers supporting C++11 or above.

So, despite being the only entrant, I consider that James deserves the award of the prize for this issue’s critique.

Code critique 97

(Submissions to scc@accu.org by February 1st)

I have a simple template class that holds a collection of values but sometimes code using the class crashes. I’ve written a simple test for the class, which works, but I do get a warning about a signed/unsigned mismatch on the for loop. I thought using auto would stop that. Is that anything to do with the crash? I’ve commented out all the other methods apart from add and remove.

The code is in Listing 2 (values.h) and Listing 3 (test.cpp).

#include <utility>
#include <vector>

#pragma once

// An unordered collection of values
template <typename T>
class Values
{
public:
  void add(T t);
  bool remove(T t);
  std::vector<T> const & values() const
  { return v; }
private:
  std::vector<T> v;
};

// Add a new item
template <typename T>
void Values<T>::add(T t)
{
   v.push_back(t);
}

// Remove an item
// Returns true if removed, false if not present
template <typename T>
bool Values<T>::remove(T t)
{
  bool found(false);
  for (auto i = 0; i != v.size(); ++i)
  {
    if (v[i] == t)
    {
      v[i] = std::move(v.back());
      v.pop_back();
      found = true;
    }
  }
  return found;
}
			
Listing 2
#include <iostream>
#include "values.h"

void test()
{
  Values<int> vi;
  for (int i = 0; i != 10; ++i)
  {
    vi.add(i);
  }

  if (!vi.remove(1))
  {
     std::cout << "Can't remove 1\n";
  }

  if (vi.remove(1))
  {
     std::cout << "Can remove 1 twice\n";
  }

  if (!vi.remove(9))
  {
    std::cout << "Can't remove 9\n";
  }

  std::cout << "size: " << vi.values().size()
    << std::endl;
}

int main()
{
  test();
}
			
Listing 3

You can also get the current problem from the accu-general mail list (next entry is posted around the last issue’s deadline) or from the ACCU website (http://www.accu.org/journals/). This particularly helps overseas members who typically get the magazine much later than members in the UK and Europe.

Notes: 

More fields may be available via dynamicdata ..