Title: Incremental Design: A Case Study of a Compiler

Author:

Date: 02 October 2005 05:00:00 +01:00 or Sun, 02 October 2005 05:00:00 +01:00

Summary:

Body:

Introduction

Agile development is great for keeping control of details, but how well does it deal with development involving heavy design? This article is about using a very agile approach to write a dynamic compiler. Compilers are probably the best area to do design in because of the large body of compiler literature and theory.

I'll discuss my experience of writing a compiler using an agile approach. The key challenge is working a lot of design, often from papers not personal experience, into an agile development process. Agile incremental development works very well for writing a compiler because it's a very effective way to learn from other people's written experience. It's a very effective way to learn while developing and get the learning into the code quickly.

The compiler, Exupery, is a hobby project. It compiles Squeak Smalltalk bytecodes into x86 machine code which is then executed inside the same process that is running the compiler. One of the original goals was to learn something general about software engineering by personally writing a technically difficult piece of software. The other original goal was to produce a useful open source project. Hopefully, you'll gain some insight from this article.

Compilers

A minimally useful compiler is a large project and a fully optimising compiler doing everything imaginable is impossibly large (a trip to the library or CiteSeer^[1] makes it even bigger).

There are several diferent compiler designs that could make sense for an object oriented dynamically typed language. The possible compiler designs run from compiling as quickly as an interpreter interprets [May] [Deutsch-] to very slow compilers that produce very high quality code. [Appel]

Most JIT (Just-in-Time) compilers compile fast but not as fast as interpreters, they stick to linear time worst case algorithms because compilation happens at run time where noticable pauses are not acceptable. Most mature batch compilers tend to be very slow, aiming at fast execution times (at least at higher optimisation levels) though a lot are simpler to make them buildable.

JITs are a great strategy for very fast compiling compilers because they can recoup on the second execution. But they are much less appealing when execution time is important because the compile time pauses will get longer (potentially a lot longer, as compilers often use O(N²) - or greater - algorithms).

Compilers offer some strong benefits for up front design because of the rich body of literature, but they also have some strong disadvantages. It's likely that there are no successful projects working in the same design space: language implementation; optimisation style (fast compilation or fast execution); and basic framework (SSA, dataflow, simple tree walkers, one pass). It is also very likely that no-one on the team has done anything similar.

Judging design literature without having personally implemented similar projects is hard, if not impossible, but many key decisions have to be made early. These are really project-level design, not technical, but they'll influence the choice of algorithm and intermediate forms.

Register Allocation

Register allocation is the process of assigning all the variables (including ones holding intermediate values) to a fixed number of physical registers. There are several different approaches to register allocation. The main two are colouring register allocators and linear scan allocators. Colouring register allocators are slower but produce better code. Linear scan allocators have seen a lot of work recently because they are more suitable for JIT compilers.

Colouring Register Allocators

while hasSpills
  build interference graph
  simplify and move removal
  hasSpills := select registers
  if hasSpills
    insert spill code
  else
    assign variables to registers

A colouring register allocator works by first building an inteference graph. Nodes represent the variables, and edges mean the variables can be live at the same time. If two variables have an edge connecting them then they cannot be assigned to the same register.

Simplification is the process of removing nodes where their degree (number of edges) is less than the number of registers available. While simplifying we also remove moves to and from registers where possible (coalescing). As variables are simplified they are pushed onto a stack. If no variable can be simplified then one is chosen and pushed onto the stack as a spill candidate.

Selecting registers involves popping variables off the stack then assigning them to a register that isn't used by any of their neighbours. It will be possible to find a register for it because when it was selected it had less neighbours than the number of registers . If a variable was pushed as a spill candidate then we try and find a register for it if some of its neighbours were allocated to the same register but if we can't it will be spilled (stored in memory rather than a register).

Colouring register allocation was first practically formulated in 1981 by Chaitin [Chaitin]. Briggs introduced a few key improvements in 1992 [Briggs] including making final spill decisions during selection, which is more efficient because some of a variable's neighbours might have been allocated the same register, so spilling would have been overly pessimistic. George and Appel in 1996 [George-] extended Briggs' work by coalescing moves while simplifying, rather than as a separate phase, which allows a lot more moves to be removed. There's a lot more work on colouring register allocators - Google scholar finds over 500 documents. It would be stupid not to do a serious literature search before implementing in a domain which has been studied so much.

Squeak and Why It's Challenging to Implement Efficiently

Squeak is the main open source implementation of Smalltalk. Smalltalk is a pure dynamically-typed object oriented programming language. Everything is an object including Contexts (stack frames), Blocks (closures), and integers. SmallIntegers fit into a machine word with a tag and will overflow into large integers.

The core language is very pure. Even conditionals ifTrue: and ifFalse: (the equivalent of if statements in C) are implemented in the library, but Squeak, and most other Smalltalks, cheat here for speed. SmallIntegers are only slightly different from other objects - it's possible to add methods to SmallIntegers but not to subclass them. Squeak, and most other implementations, provides special bytecodes that perform SmallInteger operations. If the arguments are not SmallIntegers or the operation fails then a normal message send will be performed.

Every expression needs type checking, which is the cost of being a pure dynamically typed language. This costs about 2 cycles to type check per operand. Theoretically type checking can be removed in most cases.

Tagging and detagging integers in Squeak is slower than necessary because the integer tag is 1 rather than 0. Small integer operations need to remove the 1 tag before operating then add it afterwards.

Booleans are real objects. The only restriction is there can only be one true object and one false object in the system. This means that the expression a < b ifTrue: [do something] naively needs to perform the comparison, convert the control flow into a boolean object (a little if then else sequence), then convert the boolean back into control flow (another if object is true then else sequence). That's a lot of instructions and also an extra conditional branch for the processor to (potentially) mispredict.

Sends (function calls) are very common. Theoretically all expressions are sends and most are in practice. While loops, if expressions, and small integer operations are theoretically sends but are expanded when generating bytecodes.

Smalltalk runs in an image which hosts both the IDE and the program being developed. The entire system is written in itself and modifiable live. Smalltalk systems host their own development tools. It's possible, and normal, to change how the development system works live during normal development. The debugger, profilers, and editors (called browsers) are normal code that can be inspected and changed.

Compilation is done incrementally. When developing, individual methods are compiled as they are edited and will be used for every call after they have compiled. It's also possible to modify the system programmatically. This is how the programming environment is implemented. Any method may be modified at any time including by code. Any object may be swapped with any other object at any time (become:), all references to the objects get changed by this operation. A lot can be changed during normal program execution.

Squeak is currently implemented by an interpreter. The interpreter executes a bytecode in about 10 clocks which is the branch misspredict penalty. That's a fairly tough target to beat with a naive compiler but a well designed simple compiler could do it.

Smalltalk Syntax

Smalltalk syntax is very simple. There are three types of messages, blocks, and literal expressions.

`Factorial printString`	Unary messages are a single word.
`* + ** <=`	Binary messages look like operators in other languages. They have an argument before and after them.
Receiver printOn: 'hello'. Receiver at: 'a key' ifAbsent: [self doSomething]	Key word messages are a sequence of words ending in :. Each argument always has a keyword before it ending in :
`[1+ 20] [:each\| each + 10]`	Blocks are bits of code that can be executed later. They look like `[anExpression]` or `[:each\| self aMessage: each]`. They can take arguments, each argument begins with `a :`, after the last argument there is `a \|`.
`\|a b c\|`	Defines three local variables called `a`, `b`, and `c`. Definitions can only occur at the start of methods or the start of blocks.

Precedence is unary messages then binary messages then keyword messages. So aStream nextPutAll: 10 * 200 factorial fully bracketed is: aStream nextPutAll: (10 * (200 factorial)). The result is to print a very large number on aStream.

Control flow is logically implemented using message sends and blocks. a < b ifTrue: [self doSomething] executes the block [self doSomething] if a < b. [a < b]. whileTrue: [a := a + 1] is a basic while loop. It first executes [a := a + 1] while executing [a < b] returns true. #at:ifAbsent: used as an example of a keyword message will execute its block argument if it couldn't find the key (the first argument). Reading [] brackets as if they were brackets from C is tolerable but Smalltalk's [] are semantically richer because they return a first class object.

Smalltalk is a natural environment for JITs (which it's had since '83). Writing a traditional JIT inside the language poses problems. A JIT normally stops execution until it has compiled a method then jumps into it. What happens if it's written and running in the environment it's compiling and needs to compile part of itself to finish compiling a method?

Initial Planning

When starting a large project, the first problem is deciding if the project is worthwhile, which involves enough design to estimate the costs of a minimally useful system. A minimally and generally useful compiler is probably around 10-20kloc which is a large system to write as a hobby. It's possible to write a compiler in less, but to be worth the cost of maintenance it must be noticably faster than the interpreter. Simple compilers can be slower than tuned interpreters. Exupery is a big project for a single person's hobby, so ideally it should also have big motivating goals. The end goal is to be as fast as C or Fortran for most programs.

Exupery is designed to become faster than any previous Smalltalk implementation by combining a solid classical optimiser with inlining driven by dynamic type feedback. That's too much to achieve in one step but it still helps to know that the end goals are possible.

To completely eliminate the cost of Smalltalk's dynamic power we need to:

Remove integer tagging and detagging.
Remove boxing and deboxing of Booleans and Floats
Remove all send overhead including calling blocks (higher order functions)
Optimise the low level intermediate to an equivalent level to C

This could be done by:

Having solid back end producing good code. In practice this just means a good register allocator and some care. This was the biggest source of remaining waste noticed in Self [Holzle]
Using dynamic type feedback to discover what types are used for each send site then inlining commonly called messages.
Using type analysis to remove unnecessary type checks and boxing/ unboxing of Booleans and Floats
Induction variable analysis will allow the removal of most type checks of loop counters and array range checks

Dynamic type feedback provides information on what types were actually used. The solid optimisation framework can then remove redundant tagging and detagging in the common case code^[2] and moving unnecessary code out of loops (e.g. the code that figures out how big an object is to range check it. The array is almost certainly the same object across the entire loop).

So long as compilation happens in a background thread and never causes execution to halt, it's possible to combine very fast generated code with no pauses. Background compilation also makes it possible to run the compiler in the same environment that's being compiled. The advantages are: there are no compilation pauses, slow optimisations can be used, and the compiler can be written in Smalltalk as a normal program. The key disadvantage is there must be another way of executing Smalltalk but we already have an interpreter.

Task Breakdown

The vision of a compiler above is too big to commit to building. So let's break it down into three different projects. First the full compiler meeting the initial ambitious goals. This is the goal that's really motivating. Second the simplest possible compiler that would be useful. This is the first version that should get widespread use. Third the simplest possible compiler that can be built and tested. This compiler is the first stepping stone to the second project.

The project's current high level breakdown (release plan) is:

Get it compiling a basic program. Just to compile an iterative factorial
Make that faster than the interpreter.
Compile the bytecode benchmark in the same process as the interpreted compiler
Make that faster than the interpreter
Add support for most of the language. (minus blocks)
Optimise sends with PICs
Optimise bytecodes. Make it faster than VisualWorks, the fastest commercial Smalltalk
Make it practical. This is the current task
Make sends fast. Full message and block inlining
Build a basic optimiser. Just to remove integer tagging and untagging
Extend the optimiser to handle classical optimisations such as common sub expression elimination and code hoisting. Moving #at: overhead and write barrier checks out of loops is a specialised case of code hoisting which may double the bytecode performance
Extend the optimiser down to replace the low level optimiser. This will allow the classical optimisations to remove redundancy that was hidden inside the tagging and boxing code.
Add floating point support. To be worthwhile, native compiled floating point support needs to remove most of the boxing and deboxing of floats. This may require the entire optimiser in this plan or it might be worthwhile just with tree traversal based optimisations and dynamic type feedback. Measurements on real code could answer this question.
Induction variables. Extend the optimiser to optimise induction variables (a fancy name for loop counters and friends). This should allow all range checks using looping variables to be fully optimised and to remove the overflow checks when incrementing loop counters. It should remove the last remaining overhead in common array loops.

The original release plan was for five major releases 1) a basic compiler, 2) decent bytecode performance (at least twice the interpreter's) and compiling inside the image, 3) add most of the bytecodes, 4) decent send performance (probably inlining), 5) make it practical. Each release is a few months' full time work.

Implementing the release plan up to "Make it practical" or "Make sends fast" would produce a roughly minimal compiler that was practically useful. This is where the compiler is up to now, it's four times faster for the bytecode benchmark and twice as fast for the send benchmark than the interpreter. The only algorithm that's more complex than a tree traversal is the colouring register allocator. That such a simple compiler could be so much faster than the interpreter was easy to predict initially by looking at the performance of traditional JITs, which cannot use any complex optimisations because they cannot afford long compile time pauses.

From the minimal practical compiler, I broke out a minimal testable compiler which was the first thing built. The minimal testable compiler just compiles an iterative factorial method into assembly which was then assembled and linked to a C driver routine. The next iteration added a register allocator and some performance tuning to instruction selection to use some addressing modes. Then I started compiling directly to machine code and running the generated code inside the same process that was running the compiler.

This little compiler could then be extended by adding language features and optimisations. It is the core of the minimal useful compiler. The overall architecture is there, all the phases exist in a nontrivial form. Building up towards the basic minimal compiler is now just work and keeping control of the details. It's a recursive process that's re-entered in the section "Later Design".

Breaking down large tasks is critical. There are too many details to keep straight. Stuff that isn't important at the strategic level because it definitely can be made to work can still cause crashes or wrong answers. Making a task breakdown is often enough design to safely begin implementation.

At the beginning a brief project plan is useful to plan just enough to see that the project is worth starting, however, justifying a large project by its end goals is a mistake. Ideally each iteration should pay for itself. [Beck-]

Testing and Small Steps

The main benefit of tests is the ability to work in very small chunks. It's possible to write (or modify) tests one evening then modify the code later. It's very easy to pick up coding on a broken test.^[3]

Exupery is written using two separate test suites; each has almost complete test coverage. The programmer tests are unit tests with some code integration tests thrown in, they will never crash the development environment. The customer tests compile and run example fragments^[4] which can crash the development environment if they generate faulty programs (programs that corrupt memory etc).

When writing code to extend the language coverage, first I write some customer tests while figuring out what all the special cases are for the new statement. Then I'll run them, they break. Then it's time to write programmer tests to drive the implementation. The programmer tests drive the implementation of the language feature for each component. For a new integer operation this will start with byte code reading, then intermediate generation, then adding the new instruction to the back end. One the tests have been written, it's possible to focus on the details without needing to keep the entire picture in mind all the time, if a critical detail is ignored the tests break. Tests specify behaviour precisely enough to allow it to be safely ignored, this reduces the amount that I need to keep in my head, which allows me to work with smaller chunks of time.

Just having a customer test suite that compiles and runs programs is not enough. There are too many ways to implement, the test that drives development should provide the direction to develop. The test will often be written one evening, and the code a few days later, therefore the test must carry the subtlties to be implemented because short term memory cannot.

Debugging and Reliability

Debugging can be a major problem because bugs usually show up when you run a compiled program and it crashes (or produces the wrong answer). The first stage is figuring out why the generated program is crashing. Then figuring out where the bug is in the compiler, which is non-trivial because one stage may be failing because a previous stage did something unanticipated.

Reliability is a big problem when writing a compiler, there is a lot of functionality, and every user visible action involves most of the compiler, including several complex algorithms . Thus tracking down a bug can be a lot of work, because first it must be found in the compiled code, then it must be traced back though the compiler. Even when compilation fails it's often not clear which stage is at fault, is the register allocator failing because it's complex and still buggy, or is the input invalid because of an upstream bug?

Reliability is a key issue for compilers, not just because it's better not to have real compiler bugs, but also because it's easy to lose control of quality. They are too complex to debug into shape if the quality ever drops.

Non-deterministic bugs are surprisingly easy to create. First because analysis algorithms may only exhibit a bug on one ordering (iterating over hash tables can make order nondeterministic) - the order isn't important for correctness, but it can expose or hide bugs. Having a large automated test suite and running it frequently makes intermittent bugs obvious as the suite will only sometimes fail or crash.

Large test suites help to ratchet up the quality. By keeping all the tests passing it's easy(ish) to avoid introducing new bugs. Adding more optimisations however adds more ways for subtle bugs to creep in as there are more ways for features to interact. Optimisations require more customer/integration tests to get the same level of coverage.

Originally I thought that I'd need to run my acceptance tests in a separate forked image to make it safe. The customer tests run generated machine code which can crash the development environment. Having your development enviroment which includes the editors and debuggers crash regularly during development is painful. Forking separate processes hasn't been necessary, manually identifying the crashing tests from the stack trace when it crashes and throwing an exception at the beginning of the crashing tests has proved enough. It's very unusual for a lot of tests to be broken for a long time. The programmer tests keep the quality high enough that unexpected customer test crashes are sufficiently rare.

It's important to be able to quickly fix broken tests. The code would get very brittle because it would take too long to debug if any test started failing. Knowing that all the tests passed recently reduces the amount of code that could have introduced a bug and thus the time required to fix it.

A Little Duplication

Often there isn't a clear right way to factor (erm, design) some code but it is painfully clear that the current structure is inadequate. Sometimes there are obvious small improvements that we can make but there are still cases where there isn't an obvious way to improve the structure. There it's worthwhile to introduce some duplication. A common case is test code, what should vary between the tests? What should be factored into helper methods and hidden? There's a cost, more context is needed to read the method, if it's reused this is a big saving. Copying and pasting the first test 10 times, changing the parts needed, then refactoring away the duplication has worked well. Intermediate creation and assembling are good examples in the production code, both have an involved detail driven design which evolved through refactoring. The initial duplication provided the environment to find the right structure, then refactoring consolidated the winning design features.

One big problem with intermediate creation was finding a design that was both clean and easy to test without duplication in the tests. The obvious way to test intermediate creation is to compile short methods into intermediate then check that the intermediate generated is correct. The problem is the basic intermediate building blocks such as integer tagging are repeated, making them very hard to change without needing to change a lot of tests.

My first approach was to break the intermediate generator into two objects, the first handling the higher level logic and the second handling the repeated low level operations. This helped, but there was often too much code in each test. The next insight was to make the intermediate emitter mock itself by having an instance variable that normally just impersonated self (this in C++). This provides a flexible recursive way to test intermediate emission. A message can be sent either to self (included in the test) or to the mockable instance variable (which is normally the same as self). This provides enough flexibility to test any part of intermediate generation in isolation. Testing components in isolation requires very decoupled design.

A little duplication is often a good thing, especially when there isn't an obviously right answer and a better design is obviously needed. The duplication allows the code to evolve in different ways without being overly constrained with the requirements from elsewhere. Duplication and copy and paste programming makes experiments with design cheap. Refactoring will remove the duplication later, once the kernel of a design has been found. Tests over the duplicated code makes it safe to refactor once a decent structure is found. Refactoring away the duplication is likely to refine the design as it needs to generalise to deal with more solutions.

Refactoring and Architectural Correctness

Large refactoring often involves moving temporarily away from architectural correctness.

For instance, there are two different, and both correct, ways of generating expressions in a simple intermediate language. Either expressions are written in trees representing the abstract syntax tree, or individual sub-trees are written out sequentially as a list. Eventually the code will end up as a sequential list, but the tree form provides a lot of cheap and simple manipulative power (the tree optimisers). Either has obviously correct semantics but mixing them arbitrarily will produce bugs when side effecting expressions are executed in a different order.

Who is responsible for unique execution of expressions is a similar problem. The expression array at: anExpression uses anExpression multiple times. First to type check it (is it an integer), then to lower range check it (is it greater or equal to 1), then to upper range check it, then finally to index into the array. Is anExpression responsible for returning a register in case it's used multiple times or is #at: responsible for placing the result in a register before using it multiple times?

In both cases I've switched from one architecturally correct style to the other. The simplest way to work is to generate code adding it directly to a basic block and make each expression responsible for returning a register which is given to any expression which receives its result. This is very easy to test - check that all results from expressions are either a literal or a register. However tree optimisations require expressions in trees so they can optimise operation sequences across bytecodes (high level tree optimisations). Also it's slightly more optimal to make the user responsible for uniqueness because then larger trees are formed to feed into instruction selector.

The problem is, when refactoring from one style to another, there will be a noticeable time when the entire system is broken. Either format in both cases is correct but half and half is always wrong although it may work sometimes. But to refactor, we should be moving in small steps keeping a working system. Breaking the system provides a lot of tactical guidance (fix what's broke until everything works again).

Here, there's a key difference between having strong guarantees and having all tests pass. The strong guarantee is the reason to believe the current tests are sufficient. If all tests are passing and they are good enough to provide a strong guarantee then the program should be bug free. Finding strong guarantees is definitely a heavy thought activity, closer to Dijkstra than TDD is normally portrayed. An architectural refactoring will often need the the test suite to be changed, not to make it pass again, but to make it cover the new design well enough.

Later Design

Design after the project has started is building the understanding needed to make decisions when required. It's best to make major decisions by deferring them to the last possible moment, but that's the wrong time to build the understanding required to make them.

Later design involves exploring how different infrastructure investments produce different performance improvements. The tactical decisions will be driven by specific benchmarks, but choosing which benchmarks (and tests) will drive is important. Also looking for powerful combinations that enable a lot of later features to be easily implemented is important.

When key decisions need to be made early it pays to make them by default, say by ignoring the cost of moves and registers knowing a colouring register allocator can clean this up. It's important to do the thought experiments to know that there are good ways to solve the problems when they come up.

Many design decisions are easy to make at the right time but some are not. The best that I can do is guess which decisions need thought then start thinking and reading about them preferably well before they really need to be made.

Often the key long term decisions are attempts to find optimal and minimal sets of optimisations - to find the different hills. Key questions worth thinking about are:

What is the minimal infrastructure needed to compete with C in speed for most programs?
What is the minimal infrastructure required for high speed floating point and what is required for a useful floating point speed improvement?
What is the minimal system that is practically useful and worth the risk of using for real production use for any major market segment?

There are some features that are worth investing in because they will take us to a better hill. Dynamic inlining is the ideal solution to common message sends as it makes common sends nearly free. Building a decent optimisation framework will make a lot of serious classical optimisations very cheap to build. Often key pieces of infrastructure can not be justified by their first use (register allocation could be) but are still worth building.

Register Allocation, the First Big Piece of Infrastructure

Why choose to use a colouring register allocator?

Removes complexity from the front end
Hides two address operations from the front end which both simplifies the front end and makes it more portable
Provides an efficient general solution
Not optimal for the x86 without using very sophisticated solutions because of register pressure due to only having 6-7 usable registers, however, move coalsecing is very useful to avoid the front end needing to know about 2 operand addressing.

I chose to implement a naive register allocator first then evaluate whether it was worthwhile to use a complex colouring allocator. I also made design decisions heavily favouring a colouring register allocator by ignoring the cost of moves and registers in the front end. The rest of the code was designed to work well with a colouring register allocator but debugged with a very simple allocator. This meant that when writing the allocator it was much easier to debug because the rest of the initial compiler was written and tested. Delaying the final decision to implement meant that when I did implement I was sure it was necessary rather than just relying on a (correct) guess.

The colouring register allocator was the first, and currently only, complex algorithm in Exupery. It was in the initial design but not implemented until the second iteration. The first iteration compiler was very slow, it assumed that there was a good register allocator that would remove unnecessary moves and efficiently allocate the remaining registers. The colouring register allocator was needed to make Exupery faster than the interpreter.

The register allocator serves to hide most of the oddness of the x86 from the compiler's front end. It cleans up the moves introduced to fake 3 operand instructions and removes most of the short lived temporaries used to by general code sequences in the front end.

At what stage is it justified to introduce a complex register allocator? The choice is either a single complex register allocator or adding a lot of complexity to the front end of the compiler to minimise the use of registers and move instructions generated.

Design Lock In, Assumptions Can Spread

Good design is keeping each decision in its own module so that each design decision can be changed independently. Unfortunately, a very usefully made decision can leak as the rest of the system starts to rely on it.

A separate version of the #at: primitive is now compiled for each receiver. This enables some code to be calcalated at compile time rather than run time. This means that the versions of the #at: primitive that are used for getting characters out of strings are compiled into different methods to the versions used to get objects out of Arrays. Squeak only uses one primitive for all common #at: calls.

Compiling each form of #at: efficiently is fairly easy but creating a single implementation that deals with all cases is hard. A case statement would be needed for the type of the receiver which adds to the overhead. As the compiler inlines calls to #at: it would mean that every call would have code for every kind of optimised #at: implementation. Compiling a different method for each receiver also allows the compiler to calculate the offset where indexed variables start. Objects in Squeak start with some named instance variables then may have indexed (array) instance variables afterwards. This way a simple array can be implemented as a single object, the named instance variables contain any bookkeeping the class needs, and the indexed instance variables contain the array data. It's a space optimisation from the '70s, changing it would be a lot of work and would also make it much harder for people to use Exupery.

The array #at: primitive in Squeak looks in the class to find out how many named instance variables the class has, looks at the object to find out how big it is, range checks, then does the lookup. By specialising for each receiver class we can avoid all overhead from having a variable number of named instance variables. The primitive code can also ignore all the other implementations of #at: in the system because they are sent to other classes, a bytecode implementation needs to deal with this type checking. However, having a separate compiled representation for each receiver type opens up the possibility of inlining self sends with no run time overhead, because the receiver type is known at compile time. As more code starts to rely on the receiver type being known at compile time then this decision will slowly get locked in.

The register allocator has been locked in by a similar process. The front end assumes that registers and move instructions are free. Switching to a simplier register allocator would involve a lot of rework for the rest of the compiler to avoid generating wildly inefficient code.

How Should Compiled Code Interact with Objects?

Compiled code often needs to refer to objects in garbage collected object memory (the image). This is tricky because objects may be moved on a full garbage collect as the garbage collector will compact memory.

The current solution is naive but it works. There is an array of objects which the VM has a pointer to. Compiled code accesses objects indirectly through this array. This way there is only one pointer that needs to be coordinated with the compacting garbage collector. It does add an extra level of indirection to every object access. These accesses happen during the method entry code and also during an inlined primitive.^[5]

Ideally, it would be nice to be able to hard code a pointer to an object in the compiled code. However if there is a garbage collect then the object may have been moved leaving a dangling pointer, unless something is done. There are two options, either update all the pointers to objects in compiled code, or discard all the code in the code cache (flushing it). Code can be regenerated dynamically, so throwing it away is a simple and viable solution to managing dependencies.

If there weren't pointers from objects into the code cache then discarding all the code would be a very simple operation. Unfortunately CompiledContexts (stack frames) contain pointers into the code cache which make it very quick to perform returns. The pointer is the machine code return address. There is also a dictionary (hash table) used to look up compiled methods. Another option to speed up sends and returns would be a send cache [Deutsch-] which is more complex but the contents could easily be discarded. Luckily, Squeak objects have a few common classes encoded directly into the header in a 5 bit field. This means that type checks will be fast for common operations such as arrays. Having three different encodings for classes (SmallIntegers, compact classes using the bit field, and classes requiring a pointer to the class) is unfortunate when implementing fast sends or general type lookups.

My first cut simple approach of using indirection through an array turned out to be good enough especially with a simple optimisation relying on the compact class encoding. This will make it a little harder to eliminate compact classes (it could be done in a few lines of Smalltalk). A better solution may be needed sometime in the future but there is no urgency.

Making Exupery Practical, the Current Design Problem

Making Exupery practical is the current major strategic problem. The basic requirements have been met, Exupery is four times faster for the bytecode benchmark and two times faster for the send benchmark. Most of the implementation is solid and the last few pieces of scaffolding (early implementations that were too naive to be left in but allowed other parts to be built properly) have been removed.

The goal is to provide a good useful speed improvement for normal code. Fast benchmark performance is nice, but it doesn't guarantee a useful speed improvement for general code especially when only a few benchmarks are used. That a lot of the inner loop methods have been rewritten in Slang^[6] or C doesn't help, because it removes a lot of the easy methods which Exupery could have optimised.

The issues stopping usefulness are bugs, driving the compiler, and missing features. When compiling code outside of the test suite it's still too common to run into bugs, to be generally useful it must be trusted, preferably to compile anything at any time. Driving dynamic inlining manually is not pleasant, you need to specify what sends will be inlined and with what receivers. There are also probably a few missing features, for example, blocks (closures) might be needed, or support for more of the super extended bytecodes (bytecodes that have an extended argument field).

There are a host of small issues, including bugs when compiling things that have never been compiled before, and small features that will really matter if the compiler is running in a background thread automatically compiling methods. For instance, compiled code does not listen for interrupts, so it can not be aborted and compiled code can not currently be single stepped.

Most of these issues are small, the key decision is figuring out what needs to be done for the compiler to become practical. That's best done incrementally by fixing the current most obviously limiting problem until it isn't a significant problem. These kinds of problems are best solved with plenty of feedback by fixing one then re-evaluating.

Inlining and Polymorphic Inling Caching

Send performance is critical to high performance Smalltalk. One current goal is improving the general send performance. This is building towards full dynamic inlining. Dynamic inlining is strategic, because it changes both how the compiler is used and also the overall cost structure. Dynamic inlining makes common sends free, this means a lot of optimisations to work around slow sends can be removed.

The phases are:

Polymorphic Inline Caches
Specialised Primitives
Inlined Specialised Primitives
Inlined General Methods

The breakdown was to get a useful speed improvement as early as possible and also to make the individual sections as short as possible.

Polymorphic inline caches provide two things, first they provide a fast form of sends, second they provide a way to get dynamic type information. Dynamic type information may be required to drive inlining^[7].

Specialised primitives are compiled versions of a method containing a primitive. They are compiled for each different receiver type, this allows the compiler to generate custom code for each implementor. They need PICs to make the sends fast enough to compete with local implementations.

While having fast primitives with fast sends to them is nice, it's still not as quick as a specialised version of the message (say #at: and #at:put:). One way to speed them up further would be to inline the primitives. Specialised inlined primitives are very nice because they provide a generalised framework to speed up many primitives to the same levels that the basic integer operations enjoy. The type feedback means that your method dynamically gets the right primitives inlined rather than a specific one chosen by the VM implementor. It's also a building block towards full inlined primitives.

Specialised inlined primitives was the last thing added to the breakdown. They change the shape of the breakdown because they remove the need for PICs to get inlined primitives working. There's still a practical issue that non-inlined primitives will be slower, compiling shouldn't make code slower, even temporarily (at least until compilation is automatic). The idea of specialised inlined primitives didn't occur to me until after I'd implemented PICs and specialised primitives. Insight is nice but it doesn't necessarily come at the right time.

It could be argued that it would have been better to go straight for profile driven inlining rather than using PICs. With hindsight, and original foresight, just using inlining might have been better than beginning with PICs, because inlined primitives were needed to regain enough #at: performance. This would have involved finding a way to inline #at: methods without using PICs to get type information, instrumenting a compiled non-PICed specialised primitive might have been possible. The problem is it would have been much slower after the first compile but before inlining. Whether this would have been tolerable is hard to say (without having dynamically driven inlining to perform real empirical experiments)

Design as Reading, Writing and Discussion

A lot of initial design work involved reading around the field, there's a lot to know - from modern CPU architecture to the two main compiler approaches that it's based on (Self and classical/SSA optimising compilers).

Moving into new areas also will bring on a lot of design by reading. Adding a solid optimiser will involve hitting the books. My current feeling is using SSA from the beginning makes sense, it's not that much more complex (and probably simpler including a basic optimiser) than building a traditional dataflow analyser.

Documentation is something that you shouldn't do too much of on an agile project because demanding documentation increases the cost of change. Agility is producing a low cost of change instead of guaranteeing a perfect result first time. But some is helpful.

One group of documents that's always worth having is those that pay for themselves in writing. Writing is a good way to regain the big picture. Writing can be a good way to have a conversation without anyone to talk to, it's a form of documentation that's more valuable for single person projects than larger projects where there are other people to talk to.

Open source projects also have interesting documentation requirements. The teams are dispersed with different goals and sources of funds (often people's hobby and education time). Being developer driven, documentation often suffers (though commercial development is often undocumented for very similar reasons). But being made up of geographically and organisationally diverse groups, having decent documentation is a serious asset. A key reason is to minimise the amount of time needed to come up to speed on a project. There are a lot more people who have a few weekends available than a few months. But there have to be sub-projects available that could be done by a new person in a few weekends.

Critical reading is essential, the only way to work effectively is to build from experience in literature. There is the trap of implementing something that only works on paper and not in code. Writing is very useful for a single person project because it replaces conversations with peers who are knowledgable about the projects specifics.

Conclusion

Both theory and incremental design are invaluable tools when writing a compiler. Incremental design provides a great way to understand how to apply theory, and also the means to escape from theory's traps where things work brilliantly, but only on paper.

Thinking and doing at the same time, or very closely interwoven, is a critical technique when working on software in an area with a lot of design literature, especially when lacking first hand experience.

Using a simple solution to a sub-problem at a minimum enables everything else to be debugged separately to the complex solution. Often the simple solution is enough. Breaking large problems down into smaller ones that can be solved individually is vital to keep the risks low [Henney] and also because, often, the experience of building the component is required to understand how to design it.

The major goals of the project were valuable for providing consistency and focus, but have yet to become useful when justifying the time spent so far because they are too far away. The short term rewards have been enough to make it worthwhile but they do change, at the beginning working on a hard problem really deepens understanding, as the project progresses community membership becomes more important.

References

[May] May, C. Mimic: a fast System/370 simulator., 1987 Symposium on Interpreters and Interpretive techniques.

[Deutsch-] Deutsch, L. Peter, Schiffman, Allan M. Efficient Implementation of the Smalltalk-80 System, Principles of Programming Languages ? 1983

[Appel] Appel, Andrew, Modern Compiler Implementation in C, 1998, Cambridge University Press

[Holzle] Holzle, Urs, Adaptive Optimization for Self: Reconciling high performance with Exploratory Programming, 1994, PhD Thesis

[Chaitin] Chaitin, M.A., Register allocation via coloring, 1981, Computer Languages 6, pp. 47-57

[Briggs] Briggs, P., Register Allocation via Graph Coloring, 1992, PhD from Rice University

[George-] George, L. and Appel, A. W., Iterated register coalescing, 1996, ACM Trans. on Programming Languages and Systems 18(3), pp. 300-324

[Henney] Henney, K., Stable Intermediate Forms: A foundation Pattern for Derisking the Process of Change, Overload issue 65 February 2005

[Beck-] Beck, K. and Fowler, M., Planning Extreme Programming, 2001, Addison-Wesley

^[1] http://citeseer.ist.psu.edu/

^[2] Common case might just mean any code path that's ever been called. There is a lot of code generated for cases that are not expected to happen but theoretically might. For eample, the integer addition code includes a full message send just in case the arguments are not integers or the addition overflows. Moving type checks and tagging out of the common case code and into the return from an uncommon send is a simple and effective optimisation, it just needs the level of abstraction.

^[3] This is vital for hobby projecs where 2 full days in a row is rare (there's always something else to do in a weekend) and very nice for professional work where distractions from coding are often unavoidable (live customer problems, meetings, whatever).

^[4] There are also performance tests, but their use is separate, they do not check correctness.

^[5] A send to some primitive operation that isn't implemented in Smalltalk, either because it is a basic operation or for performance.

^[6] Slang is the language that the Squeak interpreter is written in. It is a cross between Smalltalk's syntax and C's semantics. One advantage of Slang is it's possible to debug the interpreter in Squeak before compiling down to C. It also provides a clean way of writing primitives to speed up slow loops from code that was originally Smalltalk.

^[7] It can also be obtained by profiling if the profiler records a method and its receiver as well as its sender. this may not be enough for some primitives which is partially why I originally decided to implement PICs.

Notes:

More fields may be available via dynamicdata ..