Journal Articles

CVu Journal Vol 14, #1 - Feb 2002 + Programming Topics

Browse in :

All > Journals > CVu > 141 (8)
All > Topics > Programming (877)
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: String Comparison with Wild Characters.

Author: Administrator

Date: 03 February 2002 13:15:49 +00:00 or Sun, 03 February 2002 13:15:49 +00:00

Summary:

I sometimes find that algorithms are like buses. I spend ages looking for a good algorithm to solve a problem, and then two come along at once.

Body:

I sometimes find that algorithms are like buses. I spend ages looking for a good algorithm to solve a problem, and then two come along at once.

At least, that is what happened when I wanted an algorithm to compare a string against a pattern having wild card characters. Then as the two solutions used different approaches: one recursive and one iterative, the question arose as to which was better, so I did some timings. And thought I would write about it for you.

Now this is a fair amount of ground to cover in the space available to me, so I hope you will excuse me if I go at a fair pace. I can always expand on an aspect later on if someone asks.

What are 'Iteration' and 'Recursion'?

Most functions solve a problem by doing a simple step many times. How that step is repeated is the difference between 'iteration' and 'recursion'.

If one uses some form of loop (e.g., a for a while statement, though it may be a crude if and goto statement pair), to repeat the step, then one is iterating. But if the function calls itself to apply the step to the next bit of the problem before it has finished the current step, then one is using 'recursion'.

Many, possibly all, problems can be solved by both iterative and recursive algorithms. But you will usually find that a particular algorithm is more easily expressed one way round than another. For instance, a compiler is normally written using recursion, but could be written as an iterative program. And finding the factorial of a number is sometimes expressed as a recursive routine, but is best evaluated iteratively.

Which if the two approaches should be adopted will also depend on the computer language used. FORTRAN does not traditionally support recursion, principally because the early computers did not provide the sort of instructions that would make its implementation easy. So FORTRAN programmers usually restrict themselves to iterative algorithms. C (and most other computer languages) have no such restriction: their programmers can use either recursion or iteration as appropriate.

To illustrate iteration and recursion, let me give the factorial example. The factorial operator, '!' is encountered in probability theory and n! is the product of all the numbers from n down to 1. So 1! is 1 and 6! is 6 * 5 * 4 * 3 * 2 * 1 which equals 720. As a special case, 0! is defined as 1.

An iterative function to evaluate this might be:

unsigned int factorial_iterative(unsigned int x){
  unsigned int  i, r;
  if(x <= 1) return(1);
  r = 1;
  for (i = 2; i <= x; ++i) r *= i;
  return (r);
}

and the equivalent recursive function:

unsigned int factorial_recursive(unsigned int x){
  if(x <= 1) return (1);
  return (x * factorial_recursive(x - 1));
}

A common mistake when writing recursive functions is forgetting to make sure that the recursion actually stops. This is akin to getting the loop test wrong in the for-statement of an iterative function. So as a good practise I make sure that my recursive functions begin with test that stops the recursion. It is then clear what the terminal condition on the recursion is and it is not hidden in the middle of the detail of the function.

What are Wild Characters?

Let me start by saying what my problem was. It was a string comparison problem, but not the simple problem already solved by strcmp. That library function directly compares two strings and says whether or not they are the same. That is, whether each character in the first string, or pattern, is matched exactly by the corresponding character in the second string, or test. (As an aside, strcmp also indicates whether the test string is greater or less than the pattern string, but the standard does not say if "a" is greater or less than "b".)

In my problem, I wanted my pattern to contain wild cards. These are characters that are interpreted as matching not just themselves, but a whole group of characters in the test string.

I wanted two specific wild characters, with two slightly different purposes. The first was what I call the 'SINGLE' character whose appearance in the pattern string causes exactly one test character to be matched, irrespective of what it is. I usually use the '%' character for this. So whilst a pattern character of 'a' can match only a test character of 'a', the '%' character can match not only '%' and 'a', but also 'Z' and any other character in the character set.

I also wanted a 'MANY' character, for which I use '*'. This is like 'SINGLE', save that whereas 'SINGLE' matches exactly one test character, whatever it is, 'MANY' can match any number of test characters, whatever they are, including none.

And so that I can force a '%' or '*' to match only themselves, I also specified an 'ESC' character, which changes, or escapes, the meaning of the following character so that it is matched only by itself. For this, because it is the common choice in C, I used '\\'.

I decided to follow the logic of strcmp in my choice of return value: a zero indicates a match and non-zero a mismatch. Even if you think that this logic is backwards, it is always worth using an established interface. It makes life a lot easier to expand a program's capabilities by employing one's own function instead of a library function, or even revert to the standard library function if everything goes wrong and the additional functionality must be removed. And if the two interfaces are the same, you will not use one when you meant the other.

I wanted this function because I was writing a program that picked up all the disk files whose names had a simple form when expressed with wild cards. For instance, on many systems the pattern "*.c" will select all the C source code files.

Another use would be in a text editor's 'find' command. Indeed, many text editors support something called 'regular expressions' which are a greatly enhanced version of what I am trying to do here.

The Recursive Solution.

I find it easier to describe the recursive, than the iterative, solution, so that is where I will start.

First, the name of the function. C has reserved all names starting with 'str' plus another alphabetic character for its own future use, so whilst one might like to use something like 'strwcmp', the function has to be named wstrcmp1, and the '1' distinguishes this from the iterative solution which I call wstrcmp2.

I am sure that you could write the simple string comparison used by strcmp. (Go on, give it a go. Do not be scared because it is a library function, the solution takes only six simple lines.)

So if you look at wstrcmp1 and take my switch statement as a complicated if statement in which only the '\0' and 'default' cases are of interest, you should recognise your solution for strcmp. From there, adding the 'SINGLE' wild card character of '%' is easy. It matches everything other than the end of string character, '\0'.

The 'ESC' is also easy: one does have to advance to find the escaped pattern character, but this is then tested as though it were an ordinary pattern character handled by the '\0' or default cases.

The recursion in the function comes with the 'MANY' character of '*'. Remember that it has to match an arbitrary number, including none, of test characters. The problem is that the only way the function can determine how many characters to match is to let it match all possible character counts in turn. The fun starts if, as is permitted, the pattern contains more than one '*'.

The solution is to consider the two cases of the 'MANY' character matching zero and one test characters. If 'MANY' matches zero test characters, then the result of the overall comparison is the same as the comparison between the remainder of the pattern after the '*' and the test string starting with its current character. If 'MANY' matches one test string character, the result is the same as the comparison between the pattern starting with the '*' and the test string starting just after the current character.

Both these comparisons must also be wild card comparisons in case there are further 'SINGLE' or 'MANY' characters in the pattern. But in either case one of the arguments is shorter, though only by one character. So I can use my own function to do that comparison. That is, I call wstrcmp1 from the middle of wstrcmp1, even though the original call has not yet completed. When the second call to wstrcmp1 completes, I can complete my original call. And that is what makes this function recursive.

So wstrcmp1 has to try both possibilities and returns a match if either possibility also returns a match. A '&&' operator is used rather than a '||' operator at this point because of the inverted logic that wstrcmp1 (and strcmp) use for their result.

If you want to see this happening, insert some code at the start of wstrcmp1 to print out its arguments every time it is entered, and a line before every return statement to flag the function's exit and say what the returned value is. For every '*' in your pattern string, you will see a whole series of lines saying that the function has been re-entered before any matching exits appear.

A difficulty some have with understanding recursion is how a function can be used to solve a problem before writing that function has been completed. The way to look at this is to see that when the function is re-entered it is always with a smaller problem than the original one. Here, at worst, either the pattern or the test string is one character shorter. (At best, both will be substantially shorter.) So the new problem must be easier to solve, and even though two of them must be solved this is still easier than solving the original problem.

Also the authors must have the confidence that when completed their function will solve the problem they set out to solve. This is just the same as using a library function: one must have the confidence that it does what it says, even if one does not know how it does it. The difference here is that one must write the function oneself. One must have the confidence that one can write a function that does what one wants it to do.

The Iterative Solution.

Turn now to the iterative version of the string comparison in wstrcmp2. The first difference is that it is twice the length of wstrcmp1. It is often found that of the iterative and recursive approaches, the recursive approach is shorter (but not necessarily better).

How does wstrcmp2 work? Well whereas wstrcmp1 stepped through each character in the pattern string seeing if it matched the test string, wstrcmp2 steps through each character in the test string and sees which of the pattern characters it might possibly match based on what has been matched so far.

The array pointed to by cur_state says which of the pattern characters the current test character must match if the test string is to match the pattern string. And next_state is set as a result of each comparison to say which pattern characters the next test character must be tested against in the next iteration. The for statement steps through all the pattern string characters for each test character in turn.

Because the length of the pattern string is not known, the storage actually used by cur_state and next_state must be dynamically allocated by malloc (and tested to ensure that malloc was able to allocate the requested storage). The length of the storage requested is one more than the pattern length because this function will mark the pattern's terminal '\0' as a potential candidate for a match.

Because the function dynamically allocates storage, it must also be careful to return that storage before it completes. I found it easiest here to do this with the static function free_mem that I have to take great care to call before every return statement. Otherwise I would cause a memory leak, which is when the system thinks I am using memory, so will not reallocate it, whilst my program has lost all references to it and so cannot release it.

From the above, I expect you can see what the switch statement is doing. In conjunction with the for statement, it compares the current test character against all the pattern characters it might match. If successful, as indicated by matched, it marks the next pattern character as a potential match for the next test character.

Again, the trick is knowing how to handle the 'MANY' character. The function ensures that the current 'MANY' character is a potential match for the next test character by marking the next_state array and that the next pattern character is a potential match for the current test character by marking the cur_state array. This corresponds to the 'MANY' character matching one and no characters of the test string respectively. And repeated application of this rule results in the function eventually determining the exact number of characters a given 'MANY' character should match.

The variable running is used to ensure that there is always at least one potential pattern match tested for the current test character. Because if not, all possibilities have been exhausted and the test string does not match the pattern.

If you want to see wstrcmp2 operating, print out the pattern and test strings on entry, together with the current test character and the array pointed to by next_state at the beginning of the while loop. You will see the potential matches marching along the pattern string and proliferating as they pass through any 'MANY' characters it contains, only to die out as they fail to match the current test character.

Which Function is Best?

Having got two solutions to my problem, I have to ask "which is best?" And it goes without saying that if one solution works and the other does not, the answer is easy. So if you find something wrong, please let me know. Or write an article for C Vu saying how you fixed it.

There are other indications as to which is best. If you had to extend the functions, which would you find easier to modify? Perhaps that is the best one to use.

But the usual two deciding factors are size and speed. Let me say immediately that the following figures are based on my own venerable machine. So if you get code sizes half of those I get, or run times one thousand times quicker, I would not be at all surprised.

Generally, speed is more important than size due to the low price of memory. But for those writing embedded code for micro-controllers with no memory expansion capability (or a very low target price), the reverse can be true. When a company has orders for a million units, production engineers will go to great lengths to remove a single 1p resistor from the circuit board. So the removal of a 75p memory chip will attract much of the company's development resources.

I have already noted that the source code size of the recursive version is half that of its iterative counterpart. The same ratio applies to the size of the generated code: 326 versus 694 bytes decimal respectively. But be warned that greater code size often hides a better or faster algorithm, though this will prove not to be the case here.

Another size aspect is the amount of data memory necessary to run the function. This comprises space for static variables, stack space for local variables and heap space for dynamically allocated variables.

For iterative functions, the data space required is fairly accurately known. For wstrcmp2, a maximum string length could be defined, so twice that is needed in the heap plus space for a few local variables on the stack.

The situation is less clear for recursive functions. Certainly every call will create a new set of local variables on the stack, but the compiler may also copy the entire register set to the stack as well.

For wstrcmp1 one could again specify a maximum pattern length, but perhaps every character in the pattern will turn out to be the 'MANY' character. I know several successive 'MANY' characters mean the same as a single 'MANY' character, but users do that sort of thing. And each 'MANY' character will result in another level of recursion. This could prove expensive on stack space.

Which gets me to the topic of function speed and timing algorithms.

The `clock` Function.

Algorithm timings is another topic whose pitfalls probably demand an entire article in their own right.

Let me first talk about the clock function in the standard C library and the associated header file, clock.h. clock returns the total processor time your job has used so far as an arithmetic type. And that immediately opens up a number of problems.

Firstly, when did that clock start running? The standard talks of 'program invocation', but quite a lot goes on between you pressing 'carriage return' or clicking on an icon and the first line of your program being executed. So assume that clock will never return a value of zero, and always look at the difference between the values returned by two calls to clock which surround the code section you want to time.

clock returns a value of the arithmetic type clock_t. 'arithmetic type' means that you can do arithmetic (add, divide, etc) on the values returned by clock. So you can subtract the values returned by two calls to clock and store that difference in a variable of type clock_t.

But whether that is an integral or floating point type you cannot tell. Looking in time.h is not allowed because when you have to move to the next version of your compiler you may find its author has changed what they did. So if you want a result of type double, as I will, an explicit cast is required.

Next, what are the units of clock_t? Is the time used returned as milliseconds, minutes or what? Here, the macro CLOCKS_PER_SEC is supplied, division by which will convert the value returned by clock to seconds. I suspect that here the standard is trying to cater for those systems which return the processor time used as an integer number of internal clock ticks of unknown duration as well as those which return a double precision value which has already been converted to seconds.

Look at the value of CLOCKS_PER_SEC if you want, but do not expect it to be a simple value. It could turn out to be a variable whose value is set by the compiler install sequence depending on the computer's speed and real time clock chip type. So do not expect the value to be unchanged if you move your program to another machine.

The type returned by CLOCKS_PER_SEC is not specified, so another cast to double may be required if truncation errors are not to dominate the result of its division.

Finally, for the paranoid, there is no indication of how long clock will run for before its return value wraps around. But I think you could reasonably expect this to be several days, if not weeks.

How Does One Time the Functions?

Let me move on now to time_wstrcmp.c which finds the speed of wstrcmp1 and wstrcmp2.

And the first problem is that I expect that both functions will run much faster than one tick of clock: that is, faster than 1/CLOCKS_PER_SEC. So to get a reasonable timing estimate, I use a for loop to call the function under test many times so that several clock ticks occur.

You will see the calls to wstrcmp1 and wstrcmp2 in the middle of the listing, together with their enclosing for loops and the surrounding calls to clock to find out how long each loop ran for. I have different loop limit macros because I quickly found that the two functions ran at very different speeds so warranted different loop counts for sensible timing estimates. And because I want to see how the two functions perform with different strings, I take the two strings needed from the argument list given to 'main' when it starts up.

The next problem is that I have measured the time for not only multiple runs of my chosen function, but also the loop to run it multiple times, and then also the time it takes to run clock. So I also time a for loop repeating a simple assignment.

Subtraction of successive calls to clock gives me the time to run the intervening loops and their contents. Division by the loop count gives the time for one pass through the loop together with the loop overhead. The loop overhead is estimated by f and subtracted to give the time to execute the function of interest.

Those who have paid close attention will have noticed that I have not quite cancelled out the time take to run clock because of my use of different loop counts. So I should strictly also time clock and bring that into the calculations. I leave that to you.

There are a few system related points on timings which you just have to hope have little effect. The first is that the system does not decide to swap out your process to virtual memory, that is, a disk file. Because if it does, you can be quite sure that the machine will charge your process for the machine time used even if the cause of that swap is a job somewhere else on the machine. So for accurate, repeatable, timings, run your timing job when the machine is lightly loaded, and close all your unnecessary windows as well.

Another problem is that many systems charge the cost of running an interrupt to the process they interrupted rather than the process they serve. This is because interrupts should be fast so sorting out the accounting could take more time than was used in the first place. Again, make your timings when few others are using the machine.

How Fast are the Functions?

The first thing my timings showed was that the overhead of a for loop and assuagement of a constant is 38ìs (microseconds). Not as bad as I had feared, but not as good as I had hoped either.

I looked at the speed of the two solutions when given simple matching strings of increasing length containing no wild characters. I got these figures which you might like to plot on a graph.

Pattern	Test	wstrcmp1 Recursive (ms)	wstrcmp2 Iterative (ms)
a	a	0.30	2.6
ab	ab	0.45	3.8
abc	abc	0.61	5.4
abcd	abcd	0.77	7.3
abcde	abcde	0.92	9.7
abcdef	abcdef	1.07	12.4

The speed of the library function strcmp would be of interest for these cases as well. Perhaps you could add them.

The results for the recursive solution in wstrcmp1 look quite reasonable. It takes 0.30ms to set up and compare the first character and then a 0.15ms for each further character.

The iterative solution in wstrcmp2 immediately looks poor at 2.6ms to set up and check the first character (9 times slower than wstrcmp1) which increases to 12 times slower when comparing a 6 character string. The reason is that for each extra character, wstrcmp2 not only has to compare that extra character, but it has an extra pattern character to check for all the other characters in the test string. So you should have got a parabola for the wstrcmp2 timings if you did plot the numbers.

Although I have not tried it, I would not expect these timings to vary much had the pattern contained the 'SINGLE' character, '%'.

So the iterative solution of wstrcmp2 is poor on patterns that do not contain the 'MANY' character, but then it was not designed for that case.

I then tried the functions on a simple pattern containing the 'MANY' character, '*', again on patterns of increasing length. In all cases, the pattern matches the test string.

Pattern	Test	wstrcmp1 Recursive (ms)	wstrcmp2 Iterative (ms)
a*b	ab	0.58	4.7
a*b	abb	1.00	5.8
a*b	abbb	1.42	6.9
a*b	abbbb	1.85	8.0
a*b	abbbbb	2.27	9.1
a*b	acb	0.86	5.7
a*b	accb	1.14	6.8
a*b	acccb	1.42	7.8
a*b	accccb	1.70	8.8

The recursive solution of wstrcmp1 is still always faster than the iterative solution of wstrcmp2, but there are some other interesting trends.

wstrcmp2 takes 1.1ms to match each extra character, irrespective of the test string length. This is expected as the pattern length is constant at three characters.

The recursive solution, however, is sensitive to whether the characters the 'MANY' must accept are, or are not, the same as the following character in the pattern. If they are not, it performs a lot better (0.28ms per character matched) than if they are (0.42ms), which is not an effect seen by the iterative solution.

I guess that we are seeing the effect of the definition of the C '&&' operator which says that the right hand operand should not be evaluated when the operation result is obvious from evaluating just the left-hand operand. Were the return expression for the 'MANY' case in wstrcmp1

return(wstrcmp1 (pattern, test + 1) &&
            wstrcmp1(pattern + 1, test));

I think the test results might also be reversed.

Finally, I looked at the speed of matching a pattern containing two 'MANY' characters. Again, all the test strings match the pattern.

Pattern	Test	wstrcmp1 Recursive (ms)	wstrcmp2 Iterative (ms)
abz	abz	0.86	7.7
abz	abbz	1.14	9.2
abz	abbbz	1.42	10.8
abz	abbbbz	1.71	12.3
abz	abbbbbz	1.98	14.0
abz	abcz	1.14	9.2
abz	abccz	1.42	10.8
abz	abcccz	1.70	12.4
abz	abccccz	1.98	13.9

Probably because of the choice of test strings, the recursive solution still runs at 0.28ms per test character whereas the iterative solution has now slowed to 1.6ms per character.

So Which Version Should I Use?

In this case, is there any doubt? The recursive version has the shorter source file, the smaller generated code size and the faster execution time. It wins on all counts, so use that.

So where have I got to? I started with a brief description of iteration and recursion. I described a non-trivial problem and went on to give two solutions: one recursive and one iterative. From there, I wondered which was the best solution to the original problem and measured the performance of the two functions. And I concluded with a recommendation as to which of my two functions you should use. As I said at the beginning, I have covered a lot of ground.

But what I have covered are the sort of things a good software developer will think of when writing his latest creation. Having described a problem, I found more than one potential solution. So I wrote them both, and asked which was better. And then tried them both to find out. When I started, I did not expect such a clear-cut result as I got; I actually thought the iterative solution might prove faster. But that is why one performs an experiment.

Next time you have a function to write, ask "Is there another way?" Find one, and try it. You may be surprised at the result, but you will certainly be a better, more experienced, person as a result.

// code listing: wstrcmp1.c
#include  <stdio.h>
#define    SINGLE  '%'
#define    MANY  '*'
#define    ESC  '\\'
int wstrcmp1(char *pattern, char *test){
  char    p;
  char    t;
  while (1) {
    p = *pattern;
    t = *test;
    switch(p) {
    case '\0': return t;
    case SINGLE:
      if(t == '\0') return 1;
      else break;
    case MANY:
      if(t == '\0')return(*(pattern + 1) != '\0');
      else return(wstrcmp1(pattern + 1,test) 
          && wstrcmp1(pattern, test + 1));
case ESC:
      p = * ++pattern;
      if(p == '\0') return 1;
      if(p != t) return 1;
      else break;
    default:
      if(p != t) return 1;
    }
    ++pattern;
    ++test;
  }
}

// code listing: wstrcmp2.c
#include  <stdio.h>
#include  <string.h>
#include  <stdlib.h>
#define    SINGLE  '%'
#define    MANY  '*'
#define    ESC  '\\'
static char  *cur_state;
static char  *next_state;
static void free_mem (void) {
  free (cur_state);
  free (next_state);
}
int wstrcmp2 (char *pattern, char *test) {
  char    *csp;
  char    *nsp;
  int    pat_len;
  char    *pp;
  int    t;
  int    matched;
  int    running;
  int    i;
  pat_len = strlen (pattern) + 1;
  cur_state = (int *) malloc (pat_len * sizeof (char));
  next_state = (int *) malloc ((pat_len + 1) * sizeof (char));
  if((cur_state == NULL) || (next_state == NULL)) {
    fputs("Out of memory in wstrcmp2.\n", stderr);
    free_mem();
    exit(EXIT_FAILURE);
  }
  memset(next_state, 0, (size_t) pat_len);
  *next_state = 1;
  while(1) {
    memcpy(cur_state, next_state, 
                  (size_t) pat_len);
    *next_state = 0;
    running = 0;
    t = *test++;
    nsp = next_state;
    csp = cur_state;
    pp = pattern;
    for (i = 0; i < pat_len; i += 1) {
      matched = 0;
      if(*csp) {
        switch(*pp) {
        case SINGLE:
          matched = 1; 
          break;
        case '\0':
          if(t == '\0') {
            free_mem ();
            return 0;
          }
          break;
        case MANY:
          *(csp + 1) = 1;
          *nsp = 1;
          break;
        case ESC:
          matched = * ++pp == t;
          csp++;
          * ++nsp = 0;
          break;
        default:
          matched = *pp == t;
          break;
        }
        running = 1;
      }
      pp++;
      csp++;
      * ++nsp = matched;
    }
    }
    if(!running || (t == '\0')) {
      free_mem ();
      return 1;
    }
  }
}

Editorial Note

The text books state that tail recursion (i.e. the return statement calls the function recursively) can always be replaced by iteration. In addition they normally add the advice that the latter will be more efficient.

A second consideration is the potential for stack overflow. Some systems have virtually infinite stack sizes (via paging all the way down to tape drives if necessary) but most do not. Recursive functions are very stack hungry.

Perhaps readers would like to look carefully at the above code and consider how deep the recursion could be. Some might like to attempt to improve the iterative solution, if possible by removing those expensive requests for heap memory.

Notes:

More fields may be available via dynamicdata ..