Journal Articles

CVu Journal Vol 13, #5 - Oct 2001: Professionalism in Programming column

Title: Professionalism in Programming #10

Author: Administrator

Date: Fri, 05 October 2001 13:15:47 +01:00

Summary: 

Software testing

Body: 

Write as much code as you like; there's one thing you can be sure of: it won't work perfectly first time, no matter how long you sit down and carefully specify it beforehand. Software faults have a creepy ability to work their way into any program. The more code you write, the more faults you'll introduce. The faster you write, the more you'll introduce. I've yet to meet a really prolific coder who created anything near bug-free code.

So what do we do about this? We test the code. We do this to find the problems that exist, and once we've fixed them we use the tests to give us confidence in the quality of the code we've created.

If you were a car manufacturer, you'd be obliged to physically crash your new products in various ways to measure and ensure that the applicable safety standards are met. We do something like that with our software (although, on the whole, we try not to crash it!). Perhaps a great cause of the problems with contemporary software development and deployment is that there are no relevant "safety standards" that we must adhere to.

The what and why of testing

Such simple questions as 'What is testing?' and 'Why do we have to test?' seem almost too obvious to answer. Yet all too often adequate software testing is not performed - or it is not performed at the appropriate stage of production. Getting your testing procedures correct is a skill. Actually having testing procedures is even more significant. The mere mention of the term testing procedure is enough to bring some engineers out in hives.

During the lifetime of a software development process various artefacts get tested. A large number of documents will go through a testing stage (more commonly known as a review process). Doing this ensures, for example, that the requirements specification correctly models the customer's needs, that the functional specification implements the requirements specification, that the various sub-system specifications are complete enough to fulfil the functional specification, and so on.

Naturally, the implementation code gets tested too. It gets tested at several levels: on the developer's machine (possibly outside the target environment in which it will eventually run), at an integration level where the code is glued together with the other parts of the system, and at a product level where the entire end-product is validated. Whilst this last level of testing will (or should) indirectly exercise all the code components developed, that is not its focus.

In the rest of this article we will focus entirely on how as software developers we test the code we write.

Effective testing

For our software testing to be effective we need to be clear why we're testing and what we're trying to do. As software developers our testing procedure exists for two reasons (this is the why):

  1. To help us to find faults and fix them, then

  2. To ensure the faults don't reappear in later versions.

To understand what we're doing we need to be clear exactly what software we're testing and the method we're using to test (there are several). Finally, one of the hardest and most important questions to answer is: how do we know when we've finished testing?

Once we're satisfied we've tested thoroughly and have removed all the faults, we can consider our software bullet proof[1]. Of course, your confidence in the software is only as good as your tests. If you haven't tested your software at all, any confidence you have in it running correctly is clearly unfounded (unless, perhaps, it is the simplest of one-line "hello world" programs). The more separate tests you run, the more confidence you can have that your code is correct. The higher the quality of the tests (and just how do we measure this?), again, the higher your level of confidence.

In order to do this, then, what form do our tests take? When we write software we create individual functions at the lowest level, then classes, then whole systems (for example, a reusable component, program, device driver or shared library). Each level should be tested thoroughly. And each level requires a different sort of test.

Tests can generally only ensure that the correct output is generated for a chosen set of valid inputs, and that the correct failure behaviour is produced for a small set of the possible invalid inputs[2]. This is hard, and we'll see why below.

Sadly we have to accept that we still can't produce fault free software - some errors will get through even the most 'exhaustive' testing. With this in mind, for effective testing we need to focus on the key tests that will likely capture the majority of software defects.

Testing mustn't be put off for too long - experience shows us that the maintenance and testing phases of software development are at least as big as the development phase. Some studies show that typically more than 50% of the development time is spent in testing. Real testing is really hard. We need to start as soon as possible - even during (or perhaps before) serious software development.

Why is testing hard?

We said above that to test a particular piece of code we need to ensure it generates correct output for the set of all valid inputs and produces the correct failure cases for the set of all invalid inputs. That sounds innocuous enough, but for all but the simplest function (with no reliance on any external part of the system) it is just plain impossible to test exhaustively.

For example, this is a fairly easy C++ function to test:

bool logical_not(bool b)
{
  if (b)
    return false;
  else
    return true;
}

The set of inputs is of size two. However, what about the following function (let's not consider its elegance at the moment) - it has a very large set of inputs and multiple paths of execution. How do you exhaustively test it?

bool is_prime(int val)
{
  for (int i = 2; i*i <= val; ++i)
  {
    int remainder = val - ((val/i)*i);
    if(remainder == 0)
      return false;
  }
  return true;
}
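
You can test logical_not exhaustively with two assertions; the best you can do for is_prime is sample its input space. A minimal sketch of that difference, assuming a plain assert-based approach rather than any particular test framework:

#include <cassert>

void test_logical_not()
{
  // Exhaustive: the entire input space is just two values.
  assert(logical_not(true) == false);
  assert(logical_not(false) == true);
}

void test_is_prime()
{
  // A sample only: we cannot feasibly try every possible int,
  // but even these few cases exercise several paths through the loop.
  assert(is_prime(2));
  assert(is_prime(7));
  assert(!is_prime(9));
  assert(!is_prime(100));
}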

And that's just for a single small function. When software components get glued together and start relying on each other, the complexity of the software expands exponentially. It becomes not just difficult, but technically infeasible, to test the software exhaustively. The time and resources do not exist to generate the test data and run the software over all sets of inputs and stimuli. Software is inherently complex and can fail in a multitude of exciting and bizarre ways. You don't just have invalid output to contend with - there are also crashes, GPFs (general protection faults), and more to guard against.

This isn't the only reason software testing is hard. Software evolves, and this evolution tends to break tests. If the requirements are not pinned down early on, the tests written early in the project will probably be invalid: by the time you come to deliver, the APIs have changed, the functionality is completely different, and the full set of tests has never been created because development never stood still long enough. If external interfaces are not stable, we can't test against them.

Bugs can be caused by errors in software and also by errors in hardware. When working in embedded environments you are generally more likely to run into hardware errors. Hardware faults can sometimes be an order of magnitude more difficult to diagnose and fix.

So good testing is just plain hard. That's why it's often not done properly - or at all - leading to the release of code that is simply not suitable or saleable. The more complicated a code base becomes, the harder it is to test properly, and the easier it is to ignore this. Programmers tend to neglect their own development testing and rely on the company's QA[3] department to find their bugs for them. This is not just wrong, it is downright dangerous. "The single most important rule of testing is to do it." [Kernighan]

How do you test?

So if we can't test exhaustively, what do we do? We need to test as effectively and as fully as we can[4]. This kind of testing is more than just debugging: we write a suite of tests to run - the tests are extra software[5]. The question is what goes into this test suite, and how do we determine which tests are necessary? Presume that the 'simple stuff' - running memory checkers, static code validators and the like - has already been done during development (almost as a part of debugging). Having test cases that check for memory leaks is by no means unnecessary - it's just not the main focus here.

What tests are required naturally depends on the software you are writing. The focus may be on timely execution, on correct memory usage, certainly on correct outputs for correct inputs - and often on all of the above. The complexity of the test suite grows incredibly quickly as the software under test gains a more complex and convoluted API.

We're not going to take a textbook approach to describing testing so much as a practical one. There is a lot of theory behind the various types of testing, and this article will not replace a good study of the real background material. There are some links at the end that provide good jumping-off points for further investigation.

We do, though, need to get the nomenclature straight - people differ in their definitions of the terms bug, error, fault and defect. They are all precise and important. Generally a fault exists in our software that leads to it exhibiting a defect or error. The term bug is a colloquialism: according to folklore, the first computer bug was an actual bug. In 1947 a moth trapped between two electrical relays of Harvard's Mark II Aiken Relay Calculator caused the machine to malfunction; Grace Hopper's team found it and taped it into the log book.

Tests are easier to write when the code is written thoughtfully. If a section of code is self-contained and doesn't have endless undocumented and tenuous dependencies on the outside world, then it will be much easier to test. For example, if a function relies heavily on global variables, then before running a test you have to ensure that all of those variables are set to a particular state. However, if the variables it accesses are gathered into a state structure that is passed as a parameter, then the environment of the test is much better defined, and the test can be written with a much higher degree of confidence. You don't want a test to succeed one day and then fail the next depending on which way the wind is blowing.
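
As a hypothetical illustration of the point about global state (the names here are invented for the example), compare a function whose result silently depends on a global with one that takes its environment as an explicit parameter:

#include <cassert>

// Hard to test: the result depends on a global that every test must
// remember to set up (and reset) beforehand.
int g_tax_rate_percent;

int price_with_tax_global(int net_price)
{
  return net_price + (net_price * g_tax_rate_percent) / 100;
}

// Easier to test: the environment is an explicit, self-contained
// parameter, so each test constructs exactly the state it needs.
struct PricingState
{
  int tax_rate_percent;
};

int price_with_tax(const PricingState &state, int net_price)
{
  return net_price + (net_price * state.tax_rate_percent) / 100;
}

void test_price_with_tax()
{
  PricingState state;
  state.tax_rate_percent = 20;
  assert(price_with_tax(state, 100) == 120);   // no hidden setup required
}

The second form also documents the function's real dependencies in its signature, which is exactly the kind of testable structure discussed below.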

Good code should be designed from the outset to be testable. As a general rule, when you structure your code for testability you end up structuring it in a sensible, understandable and maintainable way. That can't be a bad thing.

Different sorts of tests

There are a number of different kinds of testing; each covers different considerations, and each can catch a different set of faults. All are needed. This is a whistle-stop tour of the world of testing.

There are two main approaches to devising tests, and several testing scenarios. First we'll look at the two approaches, black box and white box testing.

  • Black box testing

    This is also known as functional testing. It is testing from the point of view of the user, not the designer. Black box testing compares actual functionality against intended functionality, taking a fault-based approach to finding problems.

    Black box testing is concerned only with meeting the software's specification - not with ensuring that all parts of the implementation are exercised. Without a clear specification it is therefore very hard to write the tests. The internal workings of the code are not known by the tester: the code is seen as a black box. This means that the test is unbiased - the designer and the tester are (or should be) independent of one another. How often they actually are is another matter.

    Black box test cases can be designed as soon as the software specification is complete. Since testing every possible input is unrealistic, many program paths may go untested, and the tester will have no idea what kind of code coverage has been achieved when the tests are complete.

    Black box testing does rely on the specification being correct in the first place, and on it not being radically altered after the tests have been devised.

  • White box testing

    This is also known as structural testing. It is a code-coverage based approach. If black box testing is concerned with faults of omission, white box testing discovers faults of commission. Each line of code is scrutinised systematically to ensure correctness. Where you couldn't see into the 'black box' before, you now can and do - for this reason white box testing is sometimes called glass box testing. It is really only concerned with testing the software as produced, and doesn't guarantee that it meets the specification in the way that black box testing does.

    There are static and dynamic methods of white box testing. In the static method the code is not run; it is inspected and walked through to ensure that it does in fact represent a valid solution. The dynamic method is largely concerned with path and branch testing - running the code in such a way as to execute every line, and every possible decision.

    White box testing is much more expensive than black box testing, and is correspondingly done a lot less. It needs the source code to have been written before the white box tests can be planned (black box testing will typically have been done beforehand), and it is laborious. The consequence of a failure at this stage can also be much more expensive: you'd have to code the fix, black box test again, then redesign and re-run the white box tests.

    White box testing aims at some degree of exhaustion: statement coverage (we should have executed every line at least once) and branch coverage (we should have traversed every branch option, though not necessarily every permutation of branch options) - both are sketched below. It can require modifying the code to force it down certain paths, rather than engineering test cases that will do so. Without tool support white box testing can make your head explode.

    If you do modify the source code, then you're not actually testing the final executable, which is not always a good thing. White box testing can also be hard to apply to large code bases that use third party libraries, where not all of the source is available.

    There are a number of source-patching test tools available. For example, Rational Purify (www.rational.com), and BugTrapper from Mutek (www.mutek.com).
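
To make the coverage terms above concrete, here is a hypothetical two-branch function: full statement and branch coverage needs only one test case where the condition is true and one where it is false - which still says nothing about, say, how it behaves for negative limits.

#include <cassert>

int clamp_to_limit(int value, int limit)
{
  if (value > limit)   // branch coverage needs one test where this
    return limit;      // condition holds and one where it does not
  return value;
}

void test_clamp_to_limit()
{
  assert(clamp_to_limit(10, 5) == 5);  // condition true
  assert(clamp_to_limit(3, 5) == 3);   // condition false
}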

Along with these approaches to testing, there is a set of different test scenarios. They correspond to the different stages of software development. The relevant tests can be developed in a black or white box method.

  • Unit testing

    The term unit test is commonly used to mean testing a "module" of code (a library, driver, stack layer), but really it describes testing the atomic units of construction: each class or function. Each untrusted unit that the code under test interfaces with is replaced with a stub or simulator (see the sketch after this list) - this ensures we only trap bugs in this unit, not bugs caused by outside influences. Unit testing is performed in strict isolation.

  • Component testing

    This validates the combination of one or more units into a full component - often this is what people mean by 'unit test'. The units are tested together, with all stubs and simulators replaced by the real things.

  • Integration testing

    This is testing the combination of components - ensuring that they interconnect properly and function as a whole according to specification.

  • Regression testing

    This is re-testing after fixes or modifications are made to the software or its environment. You regression test to ensure the software is still working as well as it did before, and that in making a modification you haven't broken something else. When you work with brittle software a change in one place can cause a strange fault to appear elsewhere. Regression testing guards against this.

    It can be difficult to determine how much re-testing is needed, especially near the end of the development cycle. Automated testing tools can be especially useful for this type of testing.
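
As a sketch of the stub idea mentioned under unit testing above (the interface and names are invented for the illustration), the unit under test talks to its untrusted dependency through an abstract interface, so a test can substitute a stub with completely predictable behaviour:

#include <cassert>

// The untrusted unit (here, the system clock) sits behind an interface...
class Clock
{
public:
  virtual ~Clock() {}
  virtual int seconds_since_midnight() const = 0;
};

// ...so a unit test can replace it with a stub that always returns a
// known value.
class StubClock : public Clock
{
public:
  explicit StubClock(int fixed_time) : time(fixed_time) {}
  int seconds_since_midnight() const { return time; }
private:
  int time;
};

// The unit under test depends only on the interface, never on the real
// clock, so its tests are repeatable and isolated.
bool is_happy_hour(const Clock &clock)
{
  const int five_pm = 17 * 60 * 60;
  const int six_pm = 18 * 60 * 60;
  const int now = clock.seconds_since_midnight();
  return now >= five_pm && now < six_pm;
}

void test_is_happy_hour()
{
  assert(is_happy_hour(StubClock(17 * 60 * 60)));
  assert(!is_happy_hour(StubClock(12 * 60 * 60)));
}

Component testing would then swap StubClock back out for a real clock implementation.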

This is still not the end of the story. There are other forms of testing which we don't have the space to delve into fully. You may need to run performance tests - if performance is a requirement of your product. Maybe you need to volume test, to ensure that your code can handle a huge amount of data; this can unearth problems related to the efficiency of a system (e.g. incorrect buffer sizes, too much memory required). Stress testing is similar to volume testing: it throws a huge amount of data at the code within a short space of time, and is often used for high availability systems. Recovery testing ensures that data can be recovered after a system breakdown, and that it is still correct; this can be a priority for high reliability systems. There are also various forms of end-user tests, often performed in 'usability labs' under very controlled and scripted conditions, as well as field trials.

What about 'alpha tests' and 'beta tests'? In reality these terms mean little, and are largely employed as milestones on a development schedule. Each company will have its own definition of what it means by software in an alpha or beta state. 'Alpha' usually implies that most of the functionality is present but may be completely unreliable. In a 'beta' state the software should be mostly usable with very few remaining problems - beta testing is used in the run-up to final release candidates.

What is required for good testing?

We're going to be spending a long time testing. We need to make sure we're not wasting our time. How do we ensure it's effective? If we want our testing to have maximum impact (and minimum pain): test early, test often. This is embodied in the Extreme Programming approach by writing the test cases first of all, before any production code. I've never done this but I can see that it gives a real confidence in the code being written[6].

For good testing we need to ensure test cases are exhaustive (or, more realistically, 'as complete as possible'). How do we ensure test cases are exhaustive? It's even fairly hard to define what exhaustive means. Perhaps it means that you've covered the execution of every line of code (in a large code base that's quite an undertaking - and what about lines that have been written but can never be reached?). Perhaps it means that every API has been exercised in the most common environments. Perhaps it's feeding the software every possible set of inputs, good and bad (quite a brute-force method, and often far too resource intensive to be a realistic approach).

Good test cases often go hand-in-hand with good test data. If you can generate the test data automatically then this is very handy: any form of automation increases the speed at which you can test, allowing you to test harder in the same period of time and increasing the likelihood of improving your software quality.

How do we determine what test data to generate? If we can't cover every single possible input we need to select a handful of pertinent inputs (a sketch follows this list). More often than not these will be:

  • A certain number of well chosen good inputs to ensure that the software works properly in the 'normal' case.

  • A certain number of well chosen bad inputs - to check just the reverse: that the software is robust in the face of invalid input.

  • All the boundary cases - identify the highest and lowest inputs that are valid (or whatever the natural boundaries are). For each of these test the boundary on the 'good' side and the 'bad' side to ensure the software works at the extremes and fails exactly when expected.

  • Test randomly generated sets of data, to eliminate guesswork - this can be a surprisingly effective test strategy.

  • If numeric data is the input, always test for zero cases. Programmers always seem to fail to think properly about zero for some reason.

  • If appropriate, then perform stress testing. Overload the program with inputs to determine its capacity. This is especially pertinent in threaded or real time systems.
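
A sketch of that selection applied to the is_prime function from earlier (note that the zero and one boundary cases expose a genuine fault: the function as written reports both as prime):

#include <cassert>
#include <cstdlib>

void test_is_prime_data()
{
  // Well-chosen 'normal' good and bad inputs.
  assert(is_prime(7));
  assert(!is_prime(8));

  // Boundary cases: 2 is the smallest prime, 4 the smallest composite.
  assert(is_prime(2));
  assert(!is_prime(4));

  // Zero and one: these checks currently fail - the function above
  // happily reports both values as prime.
  // assert(!is_prime(0));
  // assert(!is_prime(1));

  // Randomly generated inputs, checked against a slow but obviously
  // correct reference implementation.
  for (int n = 0; n < 100; ++n)
  {
    const int val = std::rand() % 10000 + 2;
    bool expected = true;
    for (int i = 2; i < val; ++i)
      if (val % i == 0) { expected = false; break; }
    assert(is_prime(val) == expected);
  }
}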

In order to foster good testing in a development organisation there must be a culture of writing concrete specifications up front. This allows test specifications to be created and worked with. It is also vital to develop good working relationships twixt the QA department and the software developers. All too often there is a rivalry in which the testing department is seen as a bunch of people who aim to get in the way of developers and hinder the path to release, rather than a team who help to build a stable product.

How do you know when to stop testing? The task is potentially endless. The end point can be difficult to determine. Many modern software applications are so complex, and run in such an interdependent environment, that complete testing can never be done. Common factors in deciding when to stop are:

  • Deadlines (release deadlines, testing deadlines, etc.),

  • Test cases completed with certain percentage passed (and no major show-stoppers remaining),

  • Test budget depleted (a very sad criterion for stopping),

  • Coverage of code/functionality/requirements reaches a specified point,

  • Exhibited bug rate falls below a certain level,

  • Beta or alpha testing period ends.

Test harnesses

We said that automating the software testing process is a Good Thing. The common and effective way of doing this is to use a test harness. This is a scaffold in which to place the tests we write, in order to marshal the test execution and gather the results in a single place. The harness monitors which tests have been run (and the more complex ones maintain a history of test results over time).
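
The commercial and free harnesses listed under 'Resources' below do this far more thoroughly, but a minimal hand-rolled sketch of the idea - register the tests, run them all, gather the results in one place - might look something like this:

#include <cstddef>
#include <cstdio>
#include <vector>

typedef bool (*TestFunction)();   // each test returns true on success

struct TestCase
{
  const char *name;
  TestFunction run;
};

// The harness marshals execution and gathers the results centrally.
int run_all_tests(const std::vector<TestCase> &tests)
{
  int failures = 0;
  for (std::size_t i = 0; i < tests.size(); ++i)
  {
    const bool passed = tests[i].run();
    std::printf("%-30s %s\n", tests[i].name, passed ? "passed" : "FAILED");
    if (!passed)
      ++failures;
  }
  std::printf("%d of %d tests failed\n", failures,
              static_cast<int>(tests.size()));
  return failures;   // a non-zero result flags a problem
}

Rerunning exactly the same suite after every change then gives the automatic 'yes or no' regression answer described below.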

Testing is costly, and automation cuts down both time and cost. Clearly, though, not all tests can be automated. End-user testing can be hard to automate: for example, how do you emulate mouse clicks or ensure that the correct sound clip is playing?

A high level of automation comes into its own when regression testing. If you make a later modification to the code and want to ensure that you haven't broken anything, you can just run the set of tests over it again automatically; out at the end pops a yes or no. Of course, the regression test result is only as good as the tests put into the harness.

Many software companies have developed their own test harnesses. There are also a number of commercial and free harnesses available which merit a look. Some of these are listed in the 'Resources' section at the end. There is a huge benefit in using a company standard (or other standard) test harness. Rather than spell it out, I'll leave that as an exercise for the reader.

Conclusion

We have seen the necessity, problems with, approaches to and practical ways to test. Are we performing the full level of testing for which we are responsible? Do we even know for sure what level of testing we are responsible for? Do we have full confidence in the quality of our code, and if not what steps can we take to rectify this?

Resources

Here is a selection of links to useful material on the internet.

Test harnesses

CppUnit

cppunit.sourceforge.net

JUnit

www.junit.org

Jtest and C++Test from Parasoft

www.parasoft.com

liveCODE from Applied Microsystems

www.applied-microsystems.com

Aprobe from OC Systems

www.ocsystems.com

References

[Kernighan] Brian Kernighan and Rob Pike. The Practice of Programming. Addison-Wesley, 1999. ISBN: 0-201-61586-X.



[1] Sadly, in the real world testing will rarely ensure that software is bullet proof - merely that it is adequate. As we'll see later, testing is hard and we can only do as much as we can afford.

[2] The set of invalid inputs is almost always much larger than the set of valid inputs.

[3] 'Quality Assurance' - the name often given to the group of people who perform software testing and have the final sign-off on whether or not it is of a sufficiently high quality for release.

[4] We're more likely to if we are made accountable for our testing. This is a major plus point for testing specifications. They ensure the visibility of the test procedure, but are often scorned because they are seen to 'add work'.

[5] You'll get a headache if you start thinking about testing the testing code.

[6] Clearly you can only be writing black box tests here.
