
Editorial: Many Hands Make Light Work

Overload Journal #107 – February 2012 – Journal Editorial. Author: Ric Parkin
Some people say the parallel revolution is coming. Ric Parkin argues it’s in full swing.

So what do you know about Moore’s Law [Moore]? Despite its remarkable fame for a technical ‘law’, it is widely misquoted or confused with related corollaries. The underlying idea is interesting both for its simplicity and for the fact that it has held true for so long.

The original formulation is surprisingly technical [Intel] and deals with the change in the minima of the cost/density curve of packing transistors onto a single wafer. Put simply, there is a ‘sweet spot’ for transistor density: pack in fewer transistors and the process is more reliable but you get less capability per chip, while packing in more reduces yields. The cost curve therefore has a minimum where you get the best complexity for your money. Moore’s observation was that this best-value complexity was doubling every year, and looked set to continue at roughly that rate for at least the immediate future; this was in 1965, so it’s gone on for quite a while! Further refinements from data changed this to a doubling every two years, and a minor refinement by David House, taking into account the improvement in speed due to smaller transistors, tweaked this to 18 months. This is more subtle than the usual formulations: the most common is simply that the number of transistors doubles. That isn’t a bad approximation, though, if you consider how much you get at a constant ‘middle’ price-point.

Increase in clock speed

Another formulation that has fallen out of favour was that processing speed doubled. This was roughly true for a long time, partly due to the performance increase that House noted, but it depended heavily on the increase in clock speed, which juddered to a halt shortly after 2000, as noted by Herb Sutter in ‘The Free Lunch is Over’ [Sutter1]. It’s interesting to note that a major contributor to this halt was the problem of dissipating the heat generated by switching transistors – a problem Moore himself noted in his paper!

But the halt of the almost-free performance boosts due to clock speed didn’t stop the underlying increase in transistor density. You had more to work with, but things no longer got faster as if by magic. Instead these extra transistors were used to implement ingenious shortcuts – things like:

  • Pre-fetching instructions that might be needed in the near future.
  • Speculative execution, where the processor works out what those instructions would do, only committing the results once it knows they really would have been executed. (This is why having lots of branches can hurt performance – it interferes with this optimisation. Hence tricks such as loop unrolling, sketched in the code after this list.)
  • Data caching in on-chip memory, to get around the comparatively slow communication between the chip and main memory by keeping the data you’re working on in fast memory on the chip itself.
  • Memory write reordering, where the hardware changes the order of writes to main memory so that they can be committed more efficiently, for example in a single pass.
  • Parallel execution of independent instructions.
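
To make the loop-unrolling trick mentioned above concrete, here is a minimal C++ sketch (my own illustration, not from a particular compiler’s output): summing an array four elements at a time means the processor sees one loop branch per four additions rather than one per addition. Modern compilers will often perform this transformation themselves at higher optimisation levels.

    #include <cstddef>
    #include <iostream>

    // Sum an array four elements at a time: one loop-condition branch
    // per four additions, rather than one per addition.
    double sum(const double* data, std::size_t n)
    {
        double total = 0.0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)
            total += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
        for (; i < n; ++i)          // mop up the remaining elements
            total += data[i];
        return total;
    }

    int main()
    {
        double values[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        std::cout << sum(values, 9) << '\n';    // prints 45
    }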

But we’ve now run out of a lot of those clever tricks. Instead those extra transistors are being used to create extra cores that can truly run multiple threads of execution simultaneously, rather than switching from one thread to another as happens when multi-threaded programs run on a single core. However, this had an odd effect on some already-compiled programs that used multiple threads: they slowed down! Understanding why is instructive. Remember how on-chip caches were used to avoid the slow round-trip to main memory and back? Well, a similar thing happens between cores: if you share data between two cores that are running communicating threads, then keeping their data in sync with each other takes time, and the locking of that data can cause a core to stall completely while someone else is modifying it. This is not an efficient use of the hardware.

So one thing to learn is to be very careful about how many threads you spawn, and how much they need to communicate or share data. It’s all too easy to assume that you’re the only thing running, spawn more threads than there are cores, and end up slowing everything down due to contention. In fact, one efficient arrangement for many standard applications is to keep your program single threaded! That way you play ‘nicely’ with other applications that may be competing for resources. A slight variant is to have all the user interface running on one core and the real ‘work’ algorithms running on a second core, only occasionally interacting with the UI. Indeed, this is how some operating systems work, in order to keep their user interfaces slick and responsive.

Another hard-learnt lesson is to avoid sharing data – it’s often, counter-intuitively, faster to pass a deep copy to a worker thread than to share the data. (Functional languages have a big advantage here: they can guarantee that data won’t be changed under you, and can copy efficiently when needed.)
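As an illustration of that ‘copy, don’t share’ advice, here is a minimal sketch using the new C++11 thread library (the function and data names are invented for the example): the worker thread receives its own copy of the data, so no locks are needed and the cores never contend over the same memory.

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // The worker takes its argument by value: std::thread copies the
    // vector for it, so the worker owns its data outright and needs
    // no locking.
    void analyse(std::vector<double> samples)
    {
        double mean = std::accumulate(samples.begin(), samples.end(), 0.0)
                    / samples.size();
        std::cout << "mean: " << mean << '\n';
    }

    int main()
    {
        std::vector<double> samples(1000, 42.0);
        std::thread worker(analyse, samples);   // deep copy made here
        // ... the main thread is now free to modify 'samples' ...
        worker.join();
    }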

Specialised chips

Another long-established trend is to use extra specialised chips to support some computing-intensive operations and relieve the more general-purpose CPU. My first experience of such chips was in early PCs, where you could add a separate floating point unit, such as the Intel 80287 [80287], to boost the performance of maths-intensive programs. Otherwise floating point had to be done by the main CPU using slow emulation libraries. (I remember stepping through the Borland Pascal libraries and spotting that the start of their maths library checked for the presence of an FPU, and then patched all the functions’ code either to use it directly or to fall through to the software version. This neatly avoided either a check on every call or a level of indirection, but such self-modifying code is frowned upon nowadays.)
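The modern, better-behaved equivalent of that trick is to probe for the hardware once and route later calls through a function pointer – accepting one level of indirection rather than rewriting code. A hypothetical sketch (the probe and both implementations are stand-ins I’ve invented, not any real library’s API):

    #include <cmath>
    #include <iostream>

    static bool has_fpu() { return true; }  // stand-in for a real probe, e.g. CPUID

    static double hardware_sine(double x) { return std::sin(x); }

    static double software_sine(double x)   // crude emulation, purely illustrative
    {
        double x2 = x * x;
        return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42)));
    }

    static double first_call(double x);

    // Every call goes through this pointer; the very first call resolves it.
    static double (*sine_impl)(double) = first_call;

    static double first_call(double x)
    {
        sine_impl = has_fpu() ? hardware_sine : software_sine;
        return sine_impl(x);
    }

    double sine(double x) { return sine_impl(x); }

    int main() { std::cout << sine(1.0) << '\n'; }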

Another major use for separate specialised chips is for boosting graphics performance. Large displays need a lot of memory, and modern effects and applications need to modify it quickly in often simple but repetitive ways. Taking such a burden off the CPU and giving it to a dedicated chip with a large amount of on-chip memory, and many small cores that can work on different areas of the screen, can make a huge difference. In the last few years such chips have become enormously powerful, and tweaked so that their often idle computing power can be harnessed for more general purpose processing, helped by specially written libraries. The performance increase can be remarkable for suitable applications, such as databases [Alenka], or even parts of an OS [KGPU].

As with many such special-purpose chips, increases in transistor density often mean that such facilities get built into the main CPU chip – think of Intel’s multimedia instruction sets (a small taste below), or the Cell processor used in the PlayStation 3 [Cell]. However, this can only go so far. While Moore’s law still holds, the effort needed to make ever smaller transistors in the face of thermal and quantum effects is rising, and ultimately there are hard physical limits – although you can get pretty small [Physorg].
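For a taste of what those built-in multimedia instructions look like from C++, here is a minimal sketch using Intel’s SSE intrinsics (it assumes an SSE-capable x86 processor and a compiler that provides xmmintrin.h): a single instruction adds four floats at once.

    #include <xmmintrin.h>  // SSE intrinsics
    #include <iostream>

    int main()
    {
        // _mm_set_ps takes its arguments in reverse element order.
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 sum = _mm_add_ps(a, b);  // four additions in one instruction

        float out[4];
        _mm_storeu_ps(out, sum);        // unaligned store back to memory
        std::cout << out[0] << ' ' << out[1] << ' '
                  << out[2] << ' ' << out[3] << '\n';   // prints: 11 22 33 44
    }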

Remote processing

So how do you keep increasing computing power? Well, we’ve already seen one way – add more transistors. But they need not be on the same chip: GPUs and FPUs were originally on separate chips, and we can continue in that tradition by adding multiple chips to each board, more boards to the computer, or even more computers. Herb Sutter has again mapped out this trend in an excellent overview [Sutter2]. Put simply, we no longer have the traditional single CPU and single thread at our disposal – we now have multiple threads on multi-core chips, in computers with several chips of various types, connected to a network that can contain millions of other computers, many of which could be ours to use, or rented from the cloud on an ongoing or ad hoc basis. Examples of what this makes possible are Apple’s Siri voice-recognition service and Amazon’s Silk browser: rather than relying on the rather puny computing resources of a mobile phone, these applications get servers in the cloud to do the hard work. In Siri’s case the voice stream is sent over the network, and a server does the recognition and works out what to do. In Silk, the web request is actually made by Amazon’s servers, and the final page image is rendered there and returned for display.

Looking to the future

So future applications will often be highly complex clusters of algorithms, split between a relatively low-powered display and entry device and an amorphous cloud of computing nodes that can come and go as needed. Of course there is a problem with reliability – if the network connection goes, so does your computing power!

Does this sound complex? It sure is, as anyone who’s tried designing algorithms for multithreading will know. This is even harder – we’ll have a huge amount of unpredictable hardware available, and our software will have to adapt to what it finds. Ultimately this has to mean that we no longer deal with low-level threading, locking and so on, and instead build upon a higher-level platform that works out how best to distribute the computing load. Some languages and platforms, such as Erlang, have already made progress in this direction. Others, such as C++, have only just reached the low-level thread and hardware stage – the time is ripe for higher abstractions and language features to give us the vocabulary to describe our algorithms correctly for the future.
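C++11 does already contain the seed of such an abstraction: std::async lets you say what could run concurrently and leaves the decision of whether to spawn a thread to the implementation. A minimal sketch (fibonacci here is just a stand-in for some expensive computation):

    #include <future>
    #include <iostream>

    long fibonacci(int n)   // deliberately slow: stands in for real work
    {
        return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);
    }

    int main()
    {
        // Describe *what* to compute; the runtime decides whether to run
        // it on another thread or defer it until the result is needed.
        std::future<long> answer = std::async(fibonacci, 40);

        // ... other work can happen here ...

        std::cout << answer.get() << '\n';  // blocks until the result is ready
    }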

References

[80287] http://en.wikipedia.org/wiki/80287

[Alenka] http://sourceforge.net/projects/alenka

[Cell] http://en.wikipedia.org/wiki/Cell_(microprocessor)

[Moore] http://en.wikipedia.org/wiki/Moore's_law

[Intel] ftp://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf

[KGPU] http://code.google.com/p/kgpu/

[Physorg] http://www.physorg.com/news193896845.html

[Sutter1] http://www.gotw.ca/publications/concurrency-ddj.htm

[Sutter2] http://herbsutter.com/2011/12/29/welcome-to-the-jungle/
