Title: A Little String Thing

Author:

Date: Sat, 03 August 2002 13:15:52 +01:00

Summary:

Body:

It is sometimes useful to be able to tidy up a string before further processing in a software system e.g. at the point where the user has entered the string via the user-interface. From this point on an assumption may be safely made that the string contains no formatting "surprises". Perhaps one of the hardest things to spot with strings is leading or trailing white space. As unlikely as it may seem this can cause systems to fail e.g. a table in a database cannot be found because leading white space in a table name is present. Trying to find such issues can really be like looking for a needle in a haystack.

By removing all non-significant white space (by that I mean leading and trailing white space) at the point where the string has just been entered by a user, such problems should not occur. Performing the operation at the point of entry of data into a system means that no further checks for leading/trailing white space need to be made in other parts of the code and provides for a single point of maintenance.

The project I have in mind uses the C++ standard library string class. Despite criticisms from some quarters that the string class has too many methods (a "Swiss army knife" of a string class, if you will) it certainly doesn't have any methods for removing leading and/or trailing white space. This means we have to code it ourselves. Well, the string class does provide "find" type methods so we could use those to locate specific portions of the string and extract the "real" part of the string. Below is some sample code:

#include <string>
#include <iostream>

namespace {
void rem_space(std::string& str) {
  typedef std::string::size_type pos_t;
  pos_t start = str.find_first_not_of(' ');
  pos_t end = str.find_first_of(' ', start);
  str = str.substr(start, end - start);
}

void show(const std::string& str) {
  std::cout << "Text is: *" << str << '*' << std::endl;
}

void test_it(std::string& str) {
  show(str);
  rem_space(str);
  show(str);
  std::cout << std::endl;
}
} // anonymous namespace

int main() {
  using std::string;

  string test1("   abc   ");
  test_it(test1);

  string test2("   abc");
  test_it(test2);
    
  string test3("abc   ");
  test_it(test3);
    
  string test4("abc");
  test_it(test4);

  return 0;
}

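On the face of it, the code seems to do the trick. The function rem_space in the anonymous namespace is really the heart of it and I guess the three lines of code speak for themselves (note that the clear names of the methods on the std::string class make the code almost self-documenting here). The start and end positions are not checked against std::string::npos here as such checks are unnecessary for the four test strings above: each contains at least one non-space character, and an end of npos simply makes substr take everything up to the end of the string. A string that is empty or contains nothing but spaces is another matter, as start would then be npos and substr would throw std::out_of_range. (The positions should also be checked for start > end - more later.)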

Still, there's plenty wrong with the code above! For a start, it only checks for a ' ' as white space. What about tab (\t) and newline (\n) and all other such characters? A more subtle problem with the code is if you give it a string such as " abc abc ". What would you expect the result to be? What would you want the result to be? (Unfortunately, the two don't always coincide!)

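This can easily be fixed by changing the line

pos_t end = str.find_first_of(' ', start);

to

pos_t end = str.find_last_not_of(' ');

and, because end now holds the index of the last significant character rather than the position one past it, changing the length passed to substr from end - start to end - start + 1.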
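Putting the corrected search together, the find-based version might look like the sketch below. The name rem_space_find is hypothetical, and the guard against std::string::npos is an assumption added here rather than part of the original listing: it covers empty and all-space input, for which substr would otherwise throw.

```cpp
#include <string>

// Hypothetical find-based trim; only ' ' counts as white space here.
void rem_space_find(std::string& str) {
  typedef std::string::size_type pos_t;

  const pos_t start = str.find_first_not_of(' ');
  if (start == std::string::npos) {
    // Empty or all-space input: nothing significant remains.
    str.clear();
    return;
  }

  // A non-space character exists, so end is valid and end >= start.
  const pos_t end = str.find_last_not_of(' ');

  // end indexes the last character to keep, hence the + 1.
  str = str.substr(start, end - start + 1);
}
```

With this version the string " abc abc " becomes "abc abc" rather than "abc", since the embedded space is no longer treated as the end of the text.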

The work to fix the check for white space becomes a little more involved. We could use a string of characters to look for (a character class for those of you into regular expressions) e.g.

str.find_first_not_of(" \t\n");

Are we sure we've got all our white space characters? A quick look at an ASCII table will provide several more so we're definitely lacking here. Alternatively, we could use an already-provided function called isspace() which - guess what - checks to see if the character passed to it is a white space character. It even does more than that - it will also use locale information to interpret what should be classed as white space. I won't go into locales here as it's beyond the scope of this article^[1].

We end up with a slight problem here: you can't pass a function or functor to the std::string "find" methods. Looks like we need to iterate over the string and provide some means of processing it on a character by character basis. std::find_if is one way of iterating over a container and applying an arbitrary "find" predicate to each of the values in the container. Fortunately, std::string behaves like a container class in that it supports iterators - in fact, random access iterators - so we can use the general algorithms such as std::find_if. So here's a slightly modified version:

void rem_space(std::string& str) {
  typedef std::string::iterator str_it;

  str_it str_start =
    std::find_if(str.begin(),
                 str.end(),
                 std::not1(isspace));

  str_it str_end = 
    std::find_if(str.rbegin(),
                 str.rend(),
                 std::not1(isspace)).base();

  str = (str_start <= str_end) ?
         std::string(str_start, str_end) : "";
}

The additional includes are:

#include <cctype> // for the one-argument isspace()
#include <algorithm> // for std::find_if
#include <functional> // for std::not1

Before you try it: no, it won't compile, but first let's take a look at what we've got here.

Instead of using std::string::find_nnn() we are now using std::find_if. This is so that we can apply an arbitrary predicate - in our case, isspace(). Note that the one-argument isspace() called here consults the current C locale; it is the two-argument std::isspace() declared in <locale> that belongs to the set of global convenience functions using facets and locales under the covers.
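For comparison, the locale-aware overload takes the locale as an explicit second argument. The following is only a minimal sketch: the helper name is_space_in is hypothetical, and defaulting to a copy of the global locale is an assumption made for illustration.

```cpp
#include <locale>

// Hypothetical helper wrapping the two-argument std::isspace from <locale>.
// By default it consults a copy of the current global locale.
bool is_space_in(char c, const std::locale& loc = std::locale()) {
  return std::isspace(c, loc);
}
```

A caller that needs a particular locale's idea of white space can pass that locale instead of relying on the default.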

We are now using std::string::iterator types instead of std::string::size_type, as find_if works with iterators whereas the std::string::find_nnn methods return numeric positions rather than iterators. In brief, the algorithm looks like this:

find the first location in the string where we do not have a space
find the last location in the string which is non white space
construct a temporary string from the start and end iterators and assign this to the input string

Again, no checks need to be made on whether the iterators returned from std::find_if are equivalent to str.end() as the code will work fine if the start and end iterators are equivalent to str.end(). However, there is a possibility that the start iterator will end up with a value greater than the end iterator: consider the case of a string containing nothing but white space. The forward iterator will go all the way to the end of string and return str.end(). The reverse iterator will start at the end and go all the way to the beginning, returning str.begin()! The ternary operator is used to cope with this case.
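The crossing of the two iterators can be observed directly. The sketch below uses a plain predicate function (hypothetically named not_space) so that it does not depend on any function adapter; iterators_cross is likewise a hypothetical name used only for this illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical predicate: true for any character that is not white space.
bool not_space(char c) {
  return std::isspace(static_cast<unsigned char>(c)) == 0;
}

// Reports whether the forward result lies beyond the reverse result.
bool iterators_cross(const std::string& str) {
  // Runs to str.end() when every character is white space.
  std::string::const_iterator first =
    std::find_if(str.begin(), str.end(), not_space);

  // Runs to str.rend() in the same situation; its base() is str.begin().
  std::string::const_iterator last =
    std::find_if(str.rbegin(), str.rend(), not_space).base();

  return first > last;
}
```

Only a non-empty string consisting entirely of white space makes the function return true; for an empty string both iterators coincide.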

So, the str_start iterator is set to the first character that is not white space, starting from the beginning of the string. str_end is set using the last character that is not white space, starting from the end of the string - but what is the .base() tacked on to the str_end line of code? Well, to start from the end of the string we used the std::string::reverse_iterator objects given by rbegin() and rend(). The type of iterator returned by std::find_if will also be a reverse_iterator in this case. We want a forward iterator here so we can easily construct our sub-string object from the start and end iterators using the range constructor form of std::string. Calling "base()" on a reverse_iterator performs the conversion of a reverse_iterator into a regular iterator for you^[2]. The resulting iterator refers to the position one past the character the reverse_iterator designates, which is exactly the exclusive end the range constructor expects.
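To see base() at work in isolation, here is a small sketch that strips trailing spaces only. Both names, not_blank and drop_trailing, are hypothetical, and for brevity only ' ' is treated as white space.

```cpp
#include <algorithm>
#include <string>

// Hypothetical predicate: true for anything other than a plain space.
bool not_blank(char c) { return c != ' '; }

// Hypothetical helper: copies str without its trailing spaces.
std::string drop_trailing(const std::string& str) {
  // The reverse search stops on the last non-space character...
  std::string::const_reverse_iterator rit =
    std::find_if(str.rbegin(), str.rend(), not_blank);

  // ...and base() refers to the position just after that character,
  // so it can serve as the exclusive end of the range.
  return std::string(str.begin(), rit.base());
}
```

When the search fails, rit equals str.rend(), whose base() is str.begin(), so the range is empty and an empty string comes back.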

Even so, it still won't compile. The problem is to do with the fact that we are passing a "normal" function into the find_if as its predicate. Apart from an efficiency consideration^[3], the issue is that the free function isspace() is not adaptable. This means you cannot apply function adapters to it - such as not1, bind1st, bind2nd etc. What we need to do is provide our own function object (aka "functor") that is adaptable. Here's the code for the functor:

struct is_space : public
  std::unary_function<std::string::value_type, bool> {
  result_type operator()(const argument_type& val) const {
    // The only global isspace() I have access to returns int
    // - which isn't conformant but is programming life!
    // The cast avoids undefined behaviour for negative char values.
    return isspace(static_cast<unsigned char>(val)) != 0;
  }
};

Inheriting from std::unary_function doesn't do much other than give the struct a few typedefs - yet these are the typedefs that make the functor adaptable. I've used a couple of the typedefs provided by std::unary_function in my own operator(), namely "result_type" and "argument_type". These just pick up the types I specified in the template parameters to the base unary_function class - if I change the types later on, the rest of the code will follow, which means less typing for me! Perhaps more importantly, though, it also means once again a single point of maintenance: I don't need to change the argument types in two - albeit closely physically related - places if I use the typedefs provided for me by unary_function.

Modifying the code yet again to use my is_space functor we end up with this:

void rem_space(std::string& str) {
  typedef std::string::iterator str_it;

  str_it str_start =
    std::find_if(str.begin(),
                 str.end(),
                 std::not1(is_space()));

  str_it str_end = 
    std::find_if(str.rbegin(),
                 str.rend(),
                 std::not1(is_space())).base();

  str = (str_start <= str_end) ?
         std::string(str_start, str_end) : "";
}
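Readers with a C++11 or later compiler may prefer a lambda as the predicate, which removes the need for both the functor and std::not1 (std::unary_function was deprecated in C++11 and removed in C++17; std::not1 was deprecated in C++17). This is only a sketch of the same algorithm under that assumption, not the version in the attached source, and the name rem_space_lambda is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical C++11 variant of rem_space using a lambda predicate.
void rem_space_lambda(std::string& str) {
  auto not_space = [](char c) {
    // Cast first: passing a negative char to isspace is undefined.
    return std::isspace(static_cast<unsigned char>(c)) == 0;
  };

  auto first = std::find_if(str.begin(), str.end(), not_space);
  auto last = std::find_if(str.rbegin(), str.rend(), not_space).base();

  // All-white-space input leaves first at end() and last at begin().
  str = (first < last) ? std::string(first, last) : std::string();
}
```

The structure is unchanged: a forward search, a reverse search converted with base(), and a guard for the case where the two results cross.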

A full version of the final source code is attached as a zip file which the ACCU are free to publish on their website for download if they so wish.

Of course, if you can do it better or have suggestions to improve this version then write in! I'm sure James would welcome your input.

(Absolutely! - ed)

^[1] If you're interested in locales, Josuttis's "C++ Standard Library" has a good section on the subject, whilst the piece de resistance, at least in my opinion, is Klaus Kreft and Angelika Langer's "C++ I/O Streams and Locales"

^[2] More on this in, amongst other works, Scott Meyers's "Effective STL"

^[3] A function pointer passed as a predicate is called by de-referencing the function pointer whereas a function object (a.k.a. "functor") can usually be made inline
