
The Eternal Battle Against Redundancies, Part I

Overload Journal #106 - December 2011 | Programming Topics | Author: Christoph Knabe
The drive to remove redundancies is widely seen as a good thing. Christoph Knabe sees how it has influenced programming languages.

Since the beginning of programming, redundancies in source code have hampered maintenance and reuse. By ‘redundancy’ we mean that the same concept is expressed in several locations in the source code. Over the last 50 years the efforts to avoid redundancies [Wikipedia] have inspired a large number of programming constructs. This relationship is often not obvious to the everyday programmer. Examples include relative addressing, symbolic addressing, formula translation, parameterizable subroutines, control structures, middle-testing loops, symbolic constants, preprocessor features, array initialization, user-defined data types, information hiding, genericity, exception handling, inheritance, dynamic dispatch, aspect-oriented programming, functional programming, and even program generators and relational databases. These constructs are discussed using examples from 14 widely used programming languages. Whoever understands the common underlying concept is well equipped for the future.

What can the Zuse computer tell us today?

Architecture of the Zuse 22

The design of the Z22 was finished by about 1955, and 55 machines of this type were produced. It shaped a whole generation of computer specialists in central Europe. The Z22 was characterized by:

  • hardware logic implemented by 600 tubes
  • working storage: a magnetic drum of 8192 words @ 38-bit
  • registers: a core memory of 14 words @ 38-bit
  • peripheral storage: 5-hole punched paper tape
  • console I/O: push buttons, glow-lamps, teletype with paper tape reader
  • operating frequency: 3 kHz

An instruction consisted of 38 bits: 2 with the value 10, then 5 for conditions, 13 for operations, 5 for a register address, and 13 for a working storage address. Each of the condition and operation bits was programmed by a specific letter and switched a specific gate.

The registers could be accessed by their address, but some were used for special purposes by certain operations. Access to a register always took one CPU cycle; a drum word could be accessed that quickly only when words were accessed in sequence.

Some Registers of the Z22:

Number  Special usage
2       Testable by P or Q if positive or negative
3       Overflow area for the accumulator, last bit testable by Y
4       Accumulator, filled by B, added to by A, etc., testable by PP etc.
5       Stores the return address for subroutines, filled by F

The Zuse 23 of 1961 was logically equivalent, but was implemented using transistors. It had 40-bit words and up to 255 registers.

In 1971 my high school inherited a 12-year-old Zuse 22, and I learned programming on it. In the fast-moving computer field we would usually consider using such obsolete technology a waste of time, but it has provided me with the background for a survey of the programming techniques developed over the last 50 years. The Zuse 22 of the German computer pioneer Konrad Zuse was one of the first mass-produced computers in the world (55 machines was a lot in those days!). It had a highly economical construction and was programmed in a machine-level language: the Freiburgian Code. The example program in table 1 sums the natural numbers from n down to 1 and prints the result (tested with the Z22 simulator of Wolfgang Pavel [Pavel]). In practice only the contents of the Instruction column were punched onto paper tape and read by the computer as a program. The Address column indicates into which word each instruction was stored, and the Comment column corresponds to comments in contemporary languages.

Address  Instruction  Comment
         T2048T       Transport the following to words 2048 ff.
2048     10'          i: initial value for i is n, here the natural number 10.
2049     0'           sum: initial value is the natural number 0.
2050     B2049        Bring the sum into the accu(mulator).
2051     A2048        Add i to the accu.
2052     U2049        Store (Umspeichern) accu to sum.
2053     B2048        Bring i into the accu.
2054     SC1          Subtract the Constant value 1 from the accu.
2055     U2048        Store (Umspeichern) accu to i.
2056     PPE2050      If accu Positive, Execute from (go to) 2050.
2057     B2049        Bring sum into the accu.
2058     D            Print (Drucke) accu.
2059     Z0           Stop.
         E2050E       Execute now from 2050.

Table 1

The instructions can have symbolic, combinable operation letters: B=Bring, A=Add, S=Subtract, T=Transport, U=Umspeichern (store to), C=Constant value, PP=if Positive, E=Execute from (go to), D=Drucken (print), Z=Stop. But the addressing was purely numeric, with absolute storage addresses. Here the variables i and sum are stored at the addresses 2048 and 2049 respectively. The algorithm itself is stored from address 2050, to which we jump back with the instruction PPE2050 if the value of i is still positive.

Redundancies appear here in the addresses: 2048 for i appears 4 times, 2049 for sum 3 times, and 2050, the beginning of both the program and the loop, twice explicitly and once implicitly (two cells after the address where the tape content is stored). As a consequence the program is neither relocatable in working storage nor easily extendable, which makes its maintenance difficult.

Relative and symbolic addressing

Progress came later, for the transistorized Zuse 23, with the development of ‘relative addressing’. This enabled a programmer to write a subroutine as if it were located at address 0. A special prefix instruction told the loading program to store the actual start address in a load-time base register, usually register 26. Appending A26 to an address caused the loading program to add the content of register 26 to the value, forming an absolute address, before storing the instruction to be executed later. So with relative addressing the conditional jump to the beginning of the program in table 1 would be PPE2A26 instead of PPE2050. By this means the program becomes relocatable. Relative addressing was still very economical: it needed no resources beyond register 26 at load time.
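
The loader’s task can be pictured in a few lines of modern code. Here is a minimal sketch in C (the struct and all names are hypothetical, invented for illustration): every address field that was marked as relative gets the load-time base address added.

  #include <stddef.h>

  /* Hypothetical instruction word: an address field plus a flag
     marking it as relative (the role of the A26 suffix). */
  struct Instruction {
      int address;
      int isRelative;
  };

  /* At load time the base address (the content of register 26)
     is added to every relative address field. */
  void relocate(struct Instruction *code, size_t n, int base) {
      for (size_t i = 0; i < n; i++) {
          if (code[i].isRelative) {
              code[i].address += base;
          }
      }
  }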

True, relative addressing facilitates relocating a subroutine in working storage, but inside the subroutine it is as inflexible as absolute addressing. If we wanted to extend the example by inserting a prefix action before the calculation loop, we would have to shift the relative jump target 2A26, too. Thus ‘symbolic addressing’ was introduced with the Zuse 23 (sold from 1961); see the German programming manual for the Z23 [Zuse23], p. 65ff. The Z23 loading program substituted each bracketed identifier of up to 5 characters by its address. The program from table 1 could be rewritten with the symbolic addresses (I), (SUM), and (BEGIN) as in listing 1.

T2048T
(I) 10'
(SUM) 0'
(BEGIN) B(SUM)
A(I)
U(SUM)
B(I)
SC1
U(I)
PPE(BEGIN)
B(SUM)
D
Z0
E(BEGIN)E
			
Listing 1

Now it is possible to insert further instructions at any place without destroying the program. This was such a big improvement that the assembler for the Siemens 2002 was named after the technique: PROSA (Programming with Symbolic Addresses). The necessary resources for symbolic addressing were an addressing program and a symbol table.
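
The mechanics can be sketched as a symbol table with one operation per loader pass: the first pass records each label definition together with its address, the second pass replaces each use of a label by the recorded address. A minimal sketch in C; all names and sizes are illustrative, only the 5-character label limit is taken from the Z23.

  #include <string.h>

  #define MAX_SYMBOLS 100

  /* One symbol table entry: a label of up to 5 characters
     (as on the Z23) and the address of the cell it marks. */
  struct Symbol { char name[6]; int address; };

  static struct Symbol table[MAX_SYMBOLS];
  static int symbolCount = 0;

  /* Pass 1: record a label definition such as (BEGIN). */
  void defineLabel(const char *name, int address) {
      strncpy(table[symbolCount].name, name, 5);
      table[symbolCount].name[5] = '\0';
      table[symbolCount].address = address;
      symbolCount++;
  }

  /* Pass 2: look up each used label and substitute its address. */
  int resolveLabel(const char *name) {
      for (int i = 0; i < symbolCount; i++) {
          if (strcmp(table[i].name, name) == 0) {
              return table[i].address;
          }
      }
      return -1;   /* undefined label */
  }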

We are now acquainted with the most common procedure for redundancy elimination: the redundant code part is given a name, and that name is used in place of each former, redundant occurrence.

Formula translation

In the beginning, technical and scientific calculations dominated computer applications. But as we can see in words 2050…2053 of the Z22 example, a simple addition needed three instructions (bring, add, store). For the formula (a+b)*(a-b) we would need about 7 instructions. So the need quickly arose to simplify formula calculations. This was met by formula notation with variable identifiers, literals, operator signs, operator precedence, and parentheses. The programming language FORTRAN got its name from this feature (Formula Translator). Fortran I was defined in 1956. At that time there was no possibility of defining your own operators.
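
To see the gain, compare the single formula line with a hand-compiled version in the bring/operate/store style of the Z22, sketched here as C functions (the names and the exact instruction count are illustrative):

  /* With formula translation: one line. */
  int differenceOfSquares(int a, int b) {
      return (a + b) * (a - b);
  }

  /* Without it: roughly seven bring/operate/store steps,
     written as C statements mimicking machine instructions. */
  int differenceOfSquaresByHand(int a, int b) {
      int acc, t1, t2;
      acc = a;           /* bring a      */
      acc += b;          /* add b        */
      t1 = acc;          /* store to t1  */
      acc = a;           /* bring a      */
      acc -= b;          /* subtract b   */
      t2 = acc;          /* store to t2  */
      return t1 * t2;    /* multiply     */
  }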

Subroutines

If you needed an instruction sequence several times, on the Zuse 22 you could jump to it using the call instruction F from different program locations. Besides performing the actual jump, this instruction loaded a ‘jump back’ instruction into register 5. That is why each subroutine had to begin by copying the contents of register 5 to the end of the subroutine using a U-instruction. This ensured a jump back to the calling location when the end of the subroutine was reached.

But often you need not identical, but only similar, processing. For this case Freiburgian Code had the convention of reserving cells before the subroutine for arguments and results. These had to be filled by the caller before the call instruction, and the results retrieved afterwards. The caller then jumped to the first instruction of the subroutine, which always had to be B5, followed by a U-instruction holding the address of the subroutine’s last instruction cell. The program from listing 1, converted into a Zuse 23 subroutine with entry address SUMUP for summing up the integer numbers from 1 to n, is shown in listing 2.

T2048T
(N) 10'
(SUM) 0'
(SUMUP) B5
U(BACK)
B(SUM)
A(N)
U(SUM)
B(N)
SC1
U(N)
PPE(SUMUP)
(BACK)Z0
			
Listing 2

In order to print the sum of the numbers from 1 to 20, you could call SUMUP as follows:

  BC20   U(N)   F(SUMUP)   B(SUM)   D

While on the Zuse 23 this had to be obeyed as a convention, nowadays it is automated in all higher programming languages by the concept of a subroutine call with an argument list and return value. FORTRAN II and Algol introduced the concept of named, parameterized subroutines around 1958. The possibilities for redundancy elimination were enormous and gave rise to the style of procedural programming. A subroutine in FORTRAN II for summing up the numbers from 1 to n by a counting loop is shown in listing 3. Identifiers starting with I to N are considered integers.

C     COMPUTES THE SUM OF THE NUMBERS FROM 1 TO N
      FUNCTION ISUMUP(N)
      ISUM = 0
      DO 99 I = 1, N
  99  ISUM = ISUM + I
      ISUMUP = ISUM
      RETURN
      END
			
Listing 3

Control structures

The building blocks that gave programs their high flexibility and reusability were conditional jumps such as the PPE instruction of the Zuse 22. Conditional forward jumps could be used to implement conditional branches. A conditional backward jump could be used to implement repetition, which would terminate when the condition no longer held. By combining these facilities you could construct arbitrarily complex algorithms. As conditional branches and limited repetitions with or without a control variable were needed frequently, the popular programming languages introduced such constructs. In FORTRAN I (1957) you could sum up the integer numbers from 1 to n by the following counting loop:

      ISUM = 0
      DO 99 I = 1, N
   99 ISUM = ISUM + I
C  HERE ISUM CONTAINS THE SUM OF THE NUMBERS
C  FROM 1 TO N

FORTRAN had half-symbolic addressing. A statement could be identified by a freely chosen number, a so-called ‘label’. The statement DO 99 I = 1, N incremented the control variable I from 1 to N, executing all statements up to and including the statement labeled 99 on each pass. For branching, FORTRAN I offered the arithmetic IF:

IF (expression) negativeLabel,zeroLabel,positiveLabel

This statement jumps to one of the three enumerated labels depending on the sign of the expression result.

Algol 60 had already introduced the control structures that are common today:

  1. A cascadable, multi-branch IF: if cond then expr else expr
  2. A universal loop with control variable, for example:
    • for i := 1 step 1 until 100 do print(i) prints the numbers from 1 to 100
    • for i := 1, i*2 while i<2000 do print(i) prints the powers of 2 from 1 to 1024.

Modern control structures, with the purpose of making it possible to avoid jump instructions completely, were introduced by Pascal (1970). Pascal distinguished the pre-testing WHILE-DO loop, the post-testing REPEAT-UNTIL loop, and the counting FOR-DO loop. Nevertheless Pascal still contained the GOTO statement, as non-local termination could not be expressed otherwise.
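
The three Pascal loop forms have direct counterparts in C, which enters our story a little later; a minimal sketch, summing the numbers from 1 to n with each form:

  /* Sums the numbers from 1 to n three times over,
     once with each loop form. */
  int sumThreeWays(int n) {
      int sum, i;

      /* Pre-testing loop (Pascal WHILE-DO): body may run zero times. */
      sum = 0; i = 1;
      while (i <= n) { sum += i; i++; }

      /* Post-testing loop (Pascal REPEAT-UNTIL): body runs at least once. */
      sum = 0; i = 1;
      do { sum += i; i++; } while (i <= n);

      /* Counting loop (Pascal FOR-DO). */
      sum = 0;
      for (i = 1; i <= n; i++) sum += i;

      return sum;
  }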

Unfortunately the WHILE-loop, intended for repetitions where the number of iterations is not known in advance, implies redundancies in the following use case, which occurs extremely often in practice: we want to process an unknown number of elements, maybe even none. The code in listing 4 reads positive numbers and prints their sum. The procedure readNumber is understood to deliver -1 if it cannot find any further number. The keywords are written in upper case, although this is not significant in Pascal.

VAR
    x, sum: integer;
BEGIN
    sum := 0;
    readNumber(x);
    WHILE x>0 DO BEGIN
        sum := sum + x;
        readNumber(x);
    END;
    writeln;
    writeln('The sum is ', sum, '.');
END
			
Listing 4

We see that the procedure readNumber has to be called redundantly: once before the WHILE-loop, and a second time at the end of the loop body, in order to prepare the variable x for the next test of the WHILE-condition.

That is why C (1973), Modula-2 (1978), and Ada (1980) introduced the possibility of leaving an arbitrary loop, in particular an endless one, by a special jump instruction. Using this we can solve the above task without redundancies (C syntax, listing 5).

int sum=0, x;
for(;;){
    readNumber(&x);
    if (x<=0) break;
    sum += x;
}
			
Listing 5

This corresponds to the general pattern for processing a number of records that is unknown in advance, using a middle-testing loop (listing 6).

initialize
for(;;){
  retrieve
  if (notSuccessful) break;
  process
}
			
Listing 6

Recursion: Recursion occurs when a function calls itself, either directly or indirectly. The ability to do so was introduced by LISP and Algol at around the same time, but the then-predominant languages FORTRAN and COBOL did not allow recursion. Some tasks, e.g. traversing trees, are most elegantly expressed by recursion. Recursion does not directly contribute to the elimination of redundancies. If the recursive call is the last statement of a function, the compiler can replace it by a simple, storage-efficient loop. Compilers of functional programming languages regularly perform this optimization.
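
As an illustration, here is the summing task in recursive form, sketched in C with a hypothetical accumulator parameter. The recursive call is the function’s last action, so a compiler may replace it by a jump back to the start; C compilers typically do this only at higher optimization levels and do not guarantee it, in contrast to functional-language compilers.

  /* Sums the numbers from 1 to n. Because nothing remains to be
     done after the recursive call, it is a tail call and can be
     turned into a loop. */
  static int sumUp(int n, int acc) {
      if (n <= 0) {
          return acc;
      }
      return sumUp(n - 1, acc + n);   /* tail call */
  }

  /* Usage: sumUp(10, 0) yields 55. */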

User-defined control structures

The ability for programmers to define their own control structures was born in LISP (1958), since in this language instructions and data are notated in the same form, the so-called S-expressions. They also share the same internal representation. For example, (TIMES N 7) on the one hand means the multiplication of N by 7; on the other hand it is simply a list with the elements TIMES, N, and 7. A function in LISP I that was defined as an FEXPR instead of the usual EXPR had a special property: it did not evaluate its arguments before executing its function body, but deferred this until the explicit use of the function EVAL in the function body. So a function body could arbitrarily control the frequency and order of evaluation of its arguments.

Later, when the power of this concept was recognized, LISP dialects introduced convenient notations for defining FEXPRs. For example, in MacLISP you could define an if-then-else construct, which was not contained in early LISP, in terms of the traditional COND as follows:

  (defun IF fexpr (args)
    (let ((predicate (car args))
          (then (cadr args))
          (else (caddr args)))
     (cond 
      ((eval predicate) (eval then))
      (t (eval else)))))

The IF function gets all its arguments unevaluated, as a list with the name ARGS. LET binds the individual arguments to the variables PREDICATE, THEN, and ELSE. COND evaluates THEN or ELSE depending on the evaluation of PREDICATE. (Example from Steele and Gabriel [LISP].)

As a lover of redundancy-free solutions I missed middle-testing loops in the modern object-functional language Scala (2001). But I could define one myself (listing 7).

def loopGetTestProcess[Item]
    (getItem: => Item)
    (shouldContinue: Item=>Boolean)
    (processItem: Item=>Unit)
{
    var item = getItem
    while(shouldContinue(item)){
        processItem(item)
        item = getItem
    }
}
			
Listing 7

The function loopGetTestProcess has 3 parameter lists. In the first it expects an expression getItem for obtaining the next item, in the second a boolean function shouldContinue for judging whether the retrieval succeeded, and in the third a function processItem for processing the obtained item. The generic parameter [Item] ensures that the types fit together. The redundant call of getItem is well encapsulated in the function body and is not visible in the calling source code.

As syntactic sugar, in Scala you can write an argument list of length 1 with curly brackets, enabling the following usage:

  def printLines(in: java.io.BufferedReader){
    loopGetTestProcess(in.readLine())(_!=null){
      println
    }
  }

In Java this could be done with break, as shown in listing 8.

void printLines(final java.io.BufferedReader in)
        throws java.io.IOException {
    for(;;){
      final String line = in.readLine();
      if(line==null)break;
      System.out.println(line);
    }
}
			
Listing 8

The big achievement in Scala is that the programmer has defined this control structure himself and can define others too.

Constants

In working storage every cell is modifiable. That is why there were no constants in Freiburgian Code. Around 1960 the predominant higher programming languages FORTRAN, Algol, and COBOL had only variables and literals usable in expressions, e.g. the literal 2 in the FORTRAN expression A**2 + 2*A*B + B**2, with ** meaning ‘to the power of’. A redundancy problem arises if you must use the same literal several times. In COBOL 66 you could also use literals to dimension a variable, for example a West German ZIP code, which consisted of four digits: ZIP PICTURE 9(4). But if you had done this at several locations in your program code, the reorganisation of the German postal system in 1993 obliged you to change all these locations to five digits: ZIP PICTURE 9(5). In the early higher programming languages there was no way to declare variable or array sizes free of redundancy.

Pascal (1970) solved this problem very cleanly with declarable symbolic constants, which could be used for dimensioning arrays as well. E.g.:

  CONST zipLength = 4;
  TYPE ZipCode = PACKED ARRAY[1..zipLength] OF char;
  VAR clientZip: ZipCode;
      i: integer;
  BEGIN
    FOR i := 1 TO zipLength DO write(clientZip[i]);

Pascal also introduced the concept of user-defined data types (here the type ZipCode). Along with the symbolic dimensioning constants this enabled you to define sizes for a whole software system free of redundancies.

C (1973) solved the dimensioning problem less elegantly, as you had to use preprocessor macros instead of symbolic constants. On the other hand, you could even do without explicit size constants if you used the sizeof operator. The same thing expressed in C:

  typedef char ZipCode[4];
  ZipCode clientZip;
  int i=0;
  for(; i<sizeof(clientZip); i++){
    putchar(clientZip[i]);
  }

C++ (1983) introduced the ability to declare frozen variables at any location in the program code with the keyword const, and thereby to build calculated ‘constants’. Beginning with Java (1995) it became normal to dimension arrays only at runtime, or to work with growable collections.
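
A minimal sketch of such a calculated ‘constant’, shown here in C, where const at block scope behaves much the same (the names are illustrative):

  int rectangleArea(int width, int height) {
      /* A frozen variable: its value is computed at run time,
         but it may not be modified afterwards. */
      const int area = width * height;
      return area;
  }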

Preprocessor features

A preprocessor was often used to avoid redundancies. It makes it possible to include declarations that are needed in identical wording in several compilation units from one single file. This technique was used in COBOL (COPY), FORTRAN (INCLUDE), and C (#include). In this way you could build a data abstraction module in C. You had to write the function declarations of the module in its header file, and the function definitions, along with the managed module variables, in its implementation file. As an example, see the header file stack.h for a stack of characters. The lines starting with # contain preprocessor directives.

  #ifndef stack_h
  #define stack_h
  
  #define MAX_STACK_SIZE 10
  void stack_push(char item);
  char stack_pop(void);
  
  #endif
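
The matching implementation file stack.c could look as follows; this is a minimal sketch with illustrative error handling. The keyword static hides the managed module variables from all other compilation units.

  #include "stack.h"

  /* The managed module variables: invisible to clients. */
  static char items[MAX_STACK_SIZE];
  static int top = 0;

  void stack_push(char item) {
      if (top < MAX_STACK_SIZE) {
          items[top++] = item;
      }
      /* a real module would report overflow here */
  }

  char stack_pop(void) {
      if (top > 0) {
          return items[--top];
      }
      return '\0';   /* illustrative underflow result */
  }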

Every client of this stack has to include the header file stack.h; it can then call the functions declared therein:

  #include "stack.h"
  ...
  stack_push('X');

From about 1985 to 1995 this was the predominant modularization technique in industry, before it was superseded by object orientation.

Stringize Operator: The C/C++ preprocessor contains the # operator, which delivers the text of a given macro argument as a string. With this you can program simple dialogs without redundancies, e.g. for test purposes. In C++ you can use the following macro PROMPT_READ to print the name of the passed variable and then read a value into it:

  #define PROMPT_READ(var) \
  { cout << #var << "? ";    cin >> var;}

Using this macro you can program a console dialog as in listing 9.

  string name;
  int age;
  PROMPT_READ(name);
  PROMPT_READ(age);
  cout << name << " is " << age <<
     " years old." << endl;
			
Listing 9

A typical run of the dialog could look as follows:

  name? Ulf
  age? 22
  Ulf is 22 years old.

Generally, preprocessors are considered an inelegant solution. That is why the majority of the cases for which you used the preprocessor in C or C++ are solvable in Java by its own language constructs, without a preprocessor. However, in Java the following are not possible:

  • determination of the name of a variable, as with the stringize operator in C
  • delegation of the early termination of a method execution, together with some accompanying action, e.g. errno=FAILURE; return;, to another method
  • delegation of a certain pattern of exception handling to another method.

    If we had C-like macros in Java we could write

    TRANSACTION(myAction1(); myAction2())

    which would be useful to transform a statement sequence into a database transaction by

    #define TRANSACTION(stmts) \
      {try{stmts; commit();}catch(Exception e) \
      {rollback(e);}}

Array initialization

Pascal introduced redundancy-free array dimensioning, but initialization still had a problem. As an example, we want to declare and initialize an array in Pascal which contains codes for three allowed actions. Even in modern GNU Pascal you have to state the array size 3:

  VAR actions: ARRAY[1..3] OF char = ('G','P','D');

But the size is redundant with respect to the number of initialization elements. The more actions you need, the more error-prone the manual counting becomes.

C introduced the ability to derive an array’s size from the length of its initialization list: static char actions[] = {'G','P','D'};. Java inherited this useful feature.
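
The derived size then also keeps the iterating code free of redundancy, via the common element-count idiom; a small sketch in C:

  #include <stdio.h>

  static char actions[] = {'G', 'P', 'D'};

  /* The element count is derived from the initialization list;
     adding a fourth action changes nothing below. */
  enum { ACTION_COUNT = sizeof actions / sizeof actions[0] };

  void printActions(void) {
      for (int i = 0; i < ACTION_COUNT; i++) {
          putchar(actions[i]);
      }
  }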

Summary and prospects

The early higher programming languages introduced the possibility of avoiding redundancies by extraction, parameterization, and naming of a repeated code snippet. Such code snippets could be addresses, values, size indications, declarations, or statement sequences. Some code patterns were invoked by syntactic units of the programming language, e.g. loops. In the better case the programmer could give a freely chosen name to a code pattern. If a programming language helped to eliminate redundancies better than a competing language, this was a relevant advantage in the battle for adoption.

In the second part of this article we will deal with the more modern techniques of information hiding, genericity, exception handling, object-oriented, aspect-oriented, and functional programming, as well as domain-specific languages and relational databases, and how they all contribute to redundancy elimination.

References

[LISP] Steele/Gabriel: ‘The Evolution of LISP’, p. 9: http://www.dreamsongs.com/Files/HOPL2-Uncut.pdf

[Pavel] http://www.wpavel.de/zuse/simu/

[Wikipedia] http://en.wikipedia.org/wiki/Don%27t_repeat_yourself

[Zuse23] http://www.weblearn.hs-bremen.de/risse/RST/WS04/Zuse23/Z23Programmierungsanleitung.pdf
