Title: Taking A Chance

Author:

Date: 06 April 1996 13:15:27 +01:00 or Sat, 06 April 1996 13:15:27 +01:00

Summary:

Body:

Bottom Slice by Francis

In the five years that I have edited CVu no previous item has ever caused such a large response. Clearly ACCU members have been infected with lottery fever along with the rest of the population. What am I referring to? That little item in the last issue of CVu about a broken program to calculate lottery entries.

I am going to provide the bread to make up a sandwich while letting The Harpist provide the filling. I hope you will find the result helpful. I hope the experts among you will take the time to read this item through to the end. This is for two reasons. The first is that I would like your feedback on the material. The second is that I suspect that you may learn something - I certainly did.

I would very much like the rest of you to hunt out your own questions and problems, because if they trigger even a small fraction of what this one has then all of us will become better programmers.

Before I deal with the actual bugs in the program let me introduce you to your English dictionary. Go and find it and look up the word meretricious and you will know why all your meritorious efforts have gone unrewarded. OK, it was a nasty trick to play and as a result it has produced a single indisputable winner, the only competitor to identify what metric was being used even if there was rather too much merit in his response. Was there a point to this? Well I wanted to remind everyone that you should look up what you do not know in the appropriate reference book, not just guess at a suitable meaning from context. Anyone who knew the meaning of 'meretricious' but assumed that I had confused it with 'meritorious' should remember that I was once a teacher and words are the tools of my trade.

No, I am not telling you what it means - look it up.

The Bugs

Everyone found the major bug - the under dimensioned array of five ints to store six results. Several speculated as to why it bit in the way it did. A single member not only explained it but also proposed a change to the program to demonstrate the problem consistently on all compilers. I would have felt very guilty about not giving him the prize had it not been that his own program would have won a prize for the most obfuscated algorithm, and I am still not sure what it does in all circumstances.

To save you hunting for the last issue, here is the problem code again.

#include <stdio.h> #include <time.h> int main() { int loop=0, count=0; int num[5]; int test = 0; srand((unsigned)time(NULL)); num[loop] = (1+rand()%49); printf("Random number no %d is %3d\n",loop+1,num[loop]); loop++; while (loop <= 5 ) { count = 0; test = 0; num[loop] = (1+rand()%49); while (count < loop) { if(num[loop] == num[count]){ test = 1; count = loop; } else count++; } if(test==0) { NULL printf("Rand number no %d is %3d\n", loop+1, num[loop]); loop++; } } return 0; }

The significant feature is the optimiser for Turbo C will almost certainly put count into a register (as well as test and loop). Unoptimised code, and presumably this was the situation with GNU C, may have locate count contiguous to the array num. In such circumstances count may coincide with the notional location of num[5]. This will not matter until you generate the sixth value at which point one of two things happens, the sixth value is less than six and gets incremented by each pass of the inner while loop. Or it is greater than five and the while loop terminates immediately. Sometimes this value will be the same as one of the first five.

If you want to demonstrate this with your current compiler of choice, correctly declare num as an array of six ints and then use num[5] instead of count throughout the rest of the program.

So that is the obvious bug, and the explanation for its working. There is another one lurking that most of you missed (and that includes some very experienced professional programmers - no names, you know who you are). The writer clearly intends to include the correct header files, so where is stdlib.h? Without it the compiler makes a best guess at the parameters and return values for rand() and srand(). It will get those for srand() wrong as it will set up code to pass an int and return an int. The code the linker will find in the library will expect an unsigned int and return a void. Did I hear a mutter of 'Well, what does it matter? The code will still work.' That kind of attitude won't cut it. Of course a C++ compiler will spot the problem, and so will the kind of type-safe linker that comes with Clarion's compilers.

A couple of people wanted to grumble about the 'return 0' as they wanted it replaced with 'return EXIT_SUCCESS'. This is an example of a magic number (the '0') that most experienced programmers are quite happy with. What about the other magic numbers? The '49' and the '5'. May be the programmer would have got it right if he had used manifest constants, probably not but ... There are also some nasty problems lurking for the C++ programmer who uses NULL. The consensus of opinion these days is to simply use 0 when you want a null pointer and avoid the use of NULL.

I think all the points in the last paragraph are relatively small and are the kind of thing that some programmers will waste endless hours arguing about.

The final point I have on the code itself is that I am less than happy with it being written as one single monolithic function, a feeling shared by most of those writing about it. Remember the Harpist's guiding rule, 'A function should (conceptually) only do one thing.' The other 'rule of thumb' that has been broken is 'If you find yourself writing something twice, it should have been done once as a function.'

Deeper Problems

I am going to leave The Harpist to write about many of the problems associated with using rand() but there is one that needs some explanation now (The Harpist missed the problem when we were brainstorming on the subject). This is the problem of ensuring that your algorithm will terminate. Let me use a simpler problem. Suppose that you are using a (pseudo-)random number generator to simulate tossing a pair of coins. The core code might look something like this:

enum Pair {TT, TH, HT, HH}; enum Pair result toss( void ){ int coin1 = rand()% 2; int coin2 = rand()% 2; return (coin1 + 2 * coin2) }

What are the possible return values? Well you might reasonably expect a fairly equal distribution of 0 (TT), 1 (TH), 2 (HT) and 3 (HH). Most current implementations of rand() will generate that kind of result but there is at least one infamous implementation that generates alternate odd and even values with the result that the above code can only return 1 or 2. This particular problem is so blatant that the poor quality of the implementation is quickly noticed. However despite all expectations to the contrary even worse implementations are (in my opinion - if you can quote text to the contrary please send it in) conforming. As the ISO C Standard does not define a pseudo-random number (a pretty difficult thing to do anyway) the following seems to be conforming:

static unsigned int next =1; void srand(unsigned int seed) { next = seed;} int rand () { return abs(next); }

Yes, I know that this is plain crazy but it is only an extreme. How good is the implementation of rand() that your Standard C Library provides? Under what circumstances will it break your program because it fails to meet your implicit expectations?

This is a quality of implementation issue, but most programmers just assume that the implementation they are using is good enough. One odd aspect of rand() is that it is explicitly prohibited from generating genuine random numbers because setting the seed requires that the subsequent generated sequence is fully determined. Even not calling srand() does not help because that is defined as equivalent to calling srand(1). I'll leave The Harpist to explore the consequences of this in the current context.

Alternative Algorithms

The first criterion for acceptance is that whatever algorithm you use should demonstrably terminate in a predictable time. My instinctive choice was to apply a standard shuffle routine. The heart of this is a small function that exchanges the last item in the list for a randomly chosen item (possibly itself).

void choose_last(int array[], int entries){ int item = rand()%entries; int temp = array[item]; array[item]=array[entries-1]; array[entries-1]=temp; return; }

Now we can use this function to do a complete shuffle of an array:

void shuffle(int array[], int size) { int last; for(last=size; last>2; last--) choose_last(array, last); return; }

Set up an array of 49 ints and initialise it to contain each of the integer values 1-49 (how you do this is irrelevant, the obvious iteration of numbers[i] = i+1; will do fine), call shuffle() with the array name and 49, now use the entries in the first six (last six or whatever rule takes your fancy) as your selection.

This was such an obvious solution for me that I was amazed to be offered an entirely different one, which I had never consciously seen before but is perfectly reasonable. The basis of this solution was to apply a 'weight' to each entry. First we need a struct to provide pairs of values and weights:

typedef struct { int value; int wt; } weighted_value;

And next an ordering function:

int compare(weighted_value i, weighted_value j){return (i.wt-j.wt);}

And finally we can write our shuffle function:

void shuffle(weighted_value array[], int size){ int count; for(count=0; count<size; count++) array[count].wt = rand(); qsort(array, size, sizeof(weighted_value), compare); return; }

You may have to play around with some nasty casts to pass that function address to qsort() as many compilers are very picky. The correct cast is:

(int (*)(const void *, const void *))

Many books never bother with detailing this because they are still thinking in terms of writing code without prototypes. In such cases, what the compiler usually does works. Once you have prototypes around the compiler can see that the type of the function address being passed does not match the type of the function pointer declared as the fourth parameter of qsort(). I've known quite a few programmers abandon qsort() because they could not work out the correct cast - function types are not the easiest thing to declare. This shuffle is used very like the previous one.

Any other shuffling algorithms? Perhaps I should warn you that many of the ones people invent for themselves are badly biased. Don't let that deter you because we can all learn from broken solutions as well as working ones.

Time to hand over to The Harpist. I'll be back when he has finished to give you the prize winner as well as my solution to writing a lottery selection generator.

The Filling - A Fair Chance by The Harpist

We all have a sense of what we mean by random. We have some very firm expectations. The trouble is there are a surprising number of tarpits for even the experts (I'll not be surprised if I manage to fall into a couple).

Let me give you a small example. You toss a coin 20 times and it lands heads every time. You are offered a bet on the next toss. Do you back heads or tails? Those who know nothing about coins and probability usually want to back heads again. Those with superficial knowledge back tails because they reckon it is about time that tails came up to start balancing the results. The statisticians go to great pains to explain that the past history of results is irrelevant (the coin lacks memory and self-awareness) so the chance is 50-50 either way. So what is your answer?

Notice that all you know is that you tossed the coin, you know nothing about the 'fairness' of it. The only evidence you have for this coin is that it has shown a very strong tendency to land heads. If you look at the coin you might even find that it was two-headed. We know that biased coins exist, even after checking that it is not double-headed we still do not know if this coin is unbiased. Indeed all the evidence we have suggests that it is biased towards landing heads. We should always use the best evidence available. Before we toss the coin at all, that is our knowledge of the general behaviour of coins. Once we have tossed it a few times we have should take into account the specific behaviour of this coin. So back it to land heads again. If it isn't biased you still have your 50-50 chance, but if it is biased it is almost certainly biased towards heads.

What Is Random?

Well what is your answer? I think we generally mean that all potential selections have an equal chance. You can use a single six-sided die to select one of six items. But notice that using two standard dice to generate totals between 2 and 12 does not fill that definition of randomness - there is a strong bias towards the middle numbers. In other words you can use a die to generate a random number but you cannot use one to generate a random total. I know that this is simplistic, but I am not writing an in depth article on Statistics (something that Francis maintains is one of the worst understood items in common use - he used to have endless battles with other maths teachers and examiners who got their answers wrong.)

Now what are we intending to choose at random when we select lottery numbers? Surely not the individual numbers. Think that through. Any single number is not much use to us. To win the big one we need to select a specific six numbers. The selection mechanism used by the lottery is designed to ensure that all the possible selections are equally likely. If you want to work out how many different selections are possible evaluate:

(49 x 48 x 47 x 46 x 45 x 44) / (6 x 5 x 4 x 3 x 2)

This is 13983816, or almost 14 million. Keep that number in mind.

Pseudo-Random Number Generators

It is very difficult to define one of these. What we want is a deterministic mechanism that will select numbers from a particular range without any apparent relationship between one number and the next choice. Of course there is a connection or else the mechanism would not be deterministic. In a sense, the one thing that a pseudo-random number generator cannot do is generate random numbers. What we make do with is some form of randomising the starting point for a function that generates a sequence of numbers that does not appear to follow any well defined pattern.

Note very carefully that the only function that can possibly use a random number (which must be supplied by some outside device) is srand(). Unless you call srand() again, the sequence generated by a particular implementation of rand() is fully determined, and each time you run your program, supplying the same seed value to srand() you will get the same results. This means that you cannot get more results from your program than there are starting seeds. (We will see in a moment that this isn't quite true)

How many starting points does C allow you? Look in limits.h. Don't make Francis' mistake and look at RAND_MAX (found in stdlib.h). The critical value is UINT_MAX because srand() takes an unsigned int argument. That gives you the maximum number of results that your program can generate if srand() is only called once (calling it several times only helps if you can provide some external mechanism to provide a random seed. The system clock won't cut it)

Now on the 16-bit systems most of you are using UINT_MAX is usually 65535. That is a long way from 14 million. In other words the kind of program that most write to generate lottery selections are biased. We could actually generate all the possible solutions by iterating our program for all the 65536 (including 0) seeds available. Assuming that you have a good quality pseudo-random number generator and a 32-bit unsigned int you have a reasonable chance that your program can generate all selections with roughly equal probability.

This twin-problem of good quality pseudo-random number generators with a sufficient number of seeds to ensure that all selections are possible is not just an academic quibble. There are already several commercial programs on the market. If one of these suffers from the limited seed problem (more than likely) then you may find that you are sharing your 'big one' with a large number of other people.

Cycles

As well as the number of different starting points (seeds) for a pseudo-random number there is the issue of how long the cycle of values is before it repeats. This depends on the algorithm you are using. Some algorithms start with lead ins of a hundred or so values that are dependant on the initial seed before rapidly entering a short cycle of a dozen or fewer values. If you are testing a pseudo-random number generator it is well worth generating a few thousand values which are discarded, before generating the actual test values. This will help weed out these poor quality algorithms.

Many algorithms only cycle through a subset of the possible values, do not confuse this with the length of the cycle itself. While it is true that many of the simpler pseudo-random number generators do not have values repeat in a cycle this does not have to be true. The reason why some people think this is an essential feature of cycles is that they are thinking of pseudo-random number generators that reseed themselves from the last value generated.

Let me give you a simple example (not very good as a pseudo-random number generator, but it does generate a cycle with repeated values.

static unsigned int next =1; void srand(unsigned int seed) { next = seed;} int rand(){ next++; return (next * next) % 10; }

This generates the cycle 0-1-4-9-6-5-6-4-9-1-0-1-4...

While the common problem is short cycles it is worth remembering that the best pseudo-random number generators can have very long cycles, many times longer than RAND_MAX. In other words there may be a much richer texture of patterns available if only we were not limited by srand() taking an unsigned int argument. Is there anyway to capitalise on these longer cycles?

Persistence

This term refers to data that survives the execution of a program so that it can be used again. Sadly the C Standard makes no provision for a persistent pseudo-random number generator. Any reasonable pseudo-random number generator will have some internal state variables that control the next value generated. There is no way of accessing these (well a kind implementor could provide you the information, but I do not know of one that does). The best you can do is to store the starting seed and the running total of the number of calls to rand(). At the restart of your program you would have to seed the pseudo-random number generator and then iterate through all the previous calls to rand() (discarding the return values) to get back to the place in the sequence that you last used. The obvious problem with this is that you have an ever increasing start-up overhead (try benchmarking a million calls to rand())

Considering our current problem, we can use a different bit of persistent data to widen the range of end selections. Assuming that we are using a shuffle technique as the core of our selection generator, we can store the final order in a data file. Subsequent uses of the program could use the most recent stored values to initialise the array. That is, instead of:

int i; int num[49]; for(i=0; i<49; i++) num[i]=i+1; /* program */

Use something like:

int i; int num[49]; FILE *numbers = fopen("last", "r"); for(i=0; i<49; i++) fscanf(numbers, "%d", num+i); fclose(numbers) /* rest of program */ numbers = fopen("last", "w"); for(i=0; i<49; i++) fprintf(numbers, "%d", num[i]);

(How much better I find the C++ io! Look at that horrible lack of symmetry between fscanf and fprintf.)

I think that this technique will provide a reasonable random selection of lottery numbers even on a system with 16-bit unsigned ints. Anyone got anything better? Come to that, are there any approaches that do not rely on some form of shuffle?

One last thing before I hand you back to Francis. Anyone who wants to learn more about pseudo-random number generators could do much worse than read chapter 3 of 'The Art of Computer Programming'. It's the first chapter of volume 2 (Seminumerical Algorithms). Originally written in 1969 with a second edition in 1981 it is still one of the most authoritative works available. The maths gets fairly heavy at times, but don't let that put you off, there is much there that you will understand even without following all the math.

Top Slice - Getting Objective by Francis Glassborow

One contributor thought about using some object orientation. I'll leave his name out because I think his actual solution was far below his normal exemplary standards. But he actually set my thinking onto a new path.

Often we use pseudo-random number generators because we want repeatability, and the seed mechanism guarantees that. But this is exactly what we do not want for this problem. We want to generate a selection of 6 numbers from 49 with only about one chance in 14 million that the same selection will be made by any other use of the same program.

Let me use a bit of O-O. I want to emulate two things, a continuous shuffling of the available numbers. The choice of the one that is at the selection point at a random moment. Now human time runs at a very different rate from our modern machine time. We can capitalise on this.

Consider the following little function:

int new_top(int array[], int entries){ int item = rand()%entries; int temp = array[item]; array[item]=array[entries-1]; array[entries-1]=temp; return kbhit(); }

And using it something like:

int main(){ int size=49; int num[49]; int count; for(count=0; count<size; count++) num[count]=count+1; srand((unsigned)time(NULL)); printf("Press a key to get each number \n"); for(count=0; count<6; count++){ while(!new_top(num, size)); /* shuffle till a key is hit */ printf("%d ", num[--size]); /* then output current end entry */ } printf("\nWell thats your six. Best of luck.\n"); return 0; }

You may want to do some inlining. You may also want to suppress all output till the end. The purpose is to get the machine shuffling as fast as possible so that even those trying to repeat cannot possibly duplicate the precise timing of keypresses that would require.

If you are willing to put up with a little tedium you could try this variation of a shuffle function that could replace the one that I started with (in the other slice).

void shuffle(int array[], int size) { int last; printf("Make %d keypresses", size-1); for(last=size; last>2; last--) while(!new_top(array, last); return; }

I think that this is about the most random shuffle you are likely to get from current hardware. Of course this is not a Windows program because it relies on kbhit(). Which leads to two variations for those that want to try their hands at this idea. First convert this shuffling idea to run effectively on an event driven GUI system (MS Windows for most of you). Second redesign it and implement it for a multi-threading system with one thread shuffling and the other thread watching the keyboard, mouse or whatever. This is actually quite hard, because you will have to handle the reducing array for each selection event. You also want the time slices to be kept small else the granularity will make repeatability possible. I feel that this could be quite a nice introduction to MT programming.

The Winner

Tony Sumner was the only contributor to notice my malicious qualification on entries (thank goodness there was one, and only one) so he wins a copy of the Anniversary edition of The Mythical Man-Month.

I would like to thank all the following whose contributions added much to an enjoyable brainstorming between me and The Harpist. I promise you that none of us got it all right.

Contributors:

Nigel Armstrong (who, I think noticed that the prize was for a meretricious submission, but his wasn't),

Raymond Butler (who kept things simple),

U P Cheah (who suggests that such misunderstandings of array size is particularly prevalent among those converting from BASIC to C),

Geoff (trouble with email is that surnames often get sliced off, perhaps it is just as well, because he thought that the algorithm in the original was OK)

Alan Griffiths (The editor: The ISDF Newsletter) provided a rather comprehensive list of minor points re the published code as follows:

There are a number of other problems with the code, the first couple of the following list are appropriate even to novices (as I presume the author to be), the latter ones are more related to the ideal of "quality code".

If the random number generator were to repeat the same five (or fewer) numbers then the program would not terminate. (Although this unlikely to be a problem it makes getting a result a matter of luck.)
srand() and rand() are accessed without prototypes, the header <stdlib.h> should be included.
The choice of variable names is poor - count is not used as a count and test does not indicate what the test is for.
The use of "magic numbers" throughout code is a frequent source of errors - especially when maintaining code. For example, if the range of numbers were to change to 1-99 then it is necessary to check the whole program to locate the code that needs changing.
Because the algorithm is poorly structured it is necessary to repeat (almost) the code to output the numbers. Again, this increases the effort needed to understand or change the program.
The variables count and test are declared with a scope that extends far beyond their use. This can lead to "dual purpose" variables with misleading names.

In a lengthy commentary Derek Johnson writes:

Assuming that rand() produces something approximating to an even distribution, then there is zero probability that the algorithm fails to terminate. Even if the same algorithm were used to find an order for all 49 numbers, then the expected time is still finite - it might be rather large, but it will still finish some time. Maybe what you were thinking of is that the algorithm has no guaranteed time at which it terminates, which is quite another thing.

And later

Under Sun Unix (version 4, at least) rand produces odd & even numbers alternately!

In other words he knows that there are very bad generators out there, but does not believe it will effect this program. Somewhere I have a book on Murphy's Law. Programmers have to be less trusting.

Alec Ross included a good clear analysis of the reason why the problem manifested the way it did.

Kevin Howe produced an alternative selection mechanism that effectively used an array of flags. He then used the generated value to count through the so far unselected choices. As he mentions, this means that no sorting is necessary if you want to produce your list in numerical order, his version does not disturb the implicit order.

Peter Wippell came up with a pure C++ based solution relying on using a set container to keep out duplicates. Sadly this is vulnerable to the quality of implement-ation of the pseudo-random number generator.

There are a couple of other anonymous contributions sitting in my directory tree. Try to remember to include your name in the body of material that you send in electronic form. I know I should be efficient and tag all incoming files but we all know I'm not.

Thanks for all the input. I think that it has been very worthwhile (at least I have learnt something) now who has the next little bit to trigger another tutorial?

Notes:

More fields may be available via dynamicdata ..