Sunday, October 7, 2007

Parallel Programming, Math, and the Curse of the Algorithm

The CPU, a Necessary Evil

The universe can be seen as the ultimate parallel computer. I say ‘ultimate’ because, instead of having a single central processor that processes everything, every fundamental particle is its own little processor that operates on a small set of properties. The universe is a reactive computer as well because every action performed by a particle is a reaction to an action by another particle. Ideally, our own computers should work the same way. Every computer program should be a collection of small reactive processors that perform elementary actions (operations) on their assigned data in response to actions by other processors. In other words, an elementary program is a tiny behaving machine that can sense and effect changes in its environment. It consists of at least two actors (a sensor and an effector) and a changeable environment (data variable). In addition, the sensor must be able to communicate with the effector. I call this elementary parallel processor the Universal Behaving Machine (UBM).

More complex programs can have an indefinite number of UBMs and a sensor can send signals to more than one effector. Unfortunately, even though computer technology is moving in the general direction of our ideal parallel computer (one processor per elementary operator), we are not there yet. And we won’t be there for a while, I’m afraid. The reason is that computer memory can be accessed by only one processor at a time. Until someone finds a solution to this bottleneck, we have no choice but to use a monster known as the CPU, a necessary evil that can do the work of a huge number of small processors. We get away with it because the CPU is very fast. Keep in mind that understanding the true purpose of the CPU is the key to solving the parallel programming problem.
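For readers who think in code, here is a rough Python sketch of the idea (the class names and the little watchdog example are arbitrary illustrations of the concept, not a working UBM implementation or part of any existing system):

```python
# A rough sketch of a single UBM plus a sensor that signals two effectors.
# All names here are arbitrary illustrations, not an existing API.

class Environment:
    """The changeable environment: a single shared data variable."""
    def __init__(self, value):
        self.value = value

class Effector:
    """Performs one elementary operation on the environment when signalled."""
    def __init__(self, env, operation):
        self.env = env
        self.operation = operation

    def fire(self):
        self.env.value = self.operation(self.env.value)

class Sensor:
    """Watches the environment and signals its effectors when a change occurs."""
    def __init__(self, env, condition, effectors):
        self.env = env
        self.condition = condition
        self.effectors = effectors            # one sensor, possibly many effectors
        self._last = condition(env.value)

    def sense(self):
        state = self.condition(self.env.value)
        if state and not self._last:          # react to the change itself
            for effector in self.effectors:
                effector.fire()
        self._last = state

# The environment, one sensor ("value has gone negative") and two effectors.
env = Environment(5)
announce = Effector(env, lambda v: print("went negative:", v) or v)  # print returns None, value unchanged
reset = Effector(env, lambda v: 0)
watchdog = Sensor(env, lambda v: v < 0, [announce, reset])

env.value = -3       # something changes in the environment...
watchdog.sense()     # ...the sensor detects it and signals both effectors
print(env.value)     # -> 0
```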

Multicore and The Need for Speed

Although the CPU is fast, it is never fast enough. The reason is that the number of operations we want it to execute in a given interval keeps growing. This has been the main driving force behind CPU research. Over the last few decades, technological advances ensured a steady stream of ever-faster CPUs, but the technology has reached a point where we can no longer make them work much faster. The solution, of course, is a no-brainer: just add more processors into the mix and let them share the load, the more the better. Multicore processors have thus become all the rage. Unsurprisingly, we are witnessing an inexorable march toward our ideal computer in which every elementary operator in a program is its own processor. It's exciting.

Mathematicians and the Birth of the Algorithmic Computer

Adding more CPU cores to a processor should have been a relatively painless evolution of computer technology, but it turned out to be a real pain in the ass, programming-wise. Why? To understand the problem, we must go back to the very beginning of the computer age, more than a hundred and fifty years ago, when an Englishman named Charles Babbage designed the world's first general-purpose computer, the Analytical Engine. Babbage was a mathematician and, like most mathematicians of his day, he longed for a time when he would be freed from the tedium of performing long calculation sequences. All he wanted was a reasonably fast calculator that could reliably execute mathematical sequences, or algorithms. The idea of using a single fast central processor to emulate the behaviors of multiple small parallel processors was the furthest thing from his mind. Indeed, the very first program written for the Analytical Engine by Babbage's friend and fellow mathematician, Lady Ada Lovelace, was a table of instructions meant to calculate the Bernoulli numbers, a sequence of rational numbers. Neither Babbage nor Lady Ada should be faulted for this, but modern computers are still based on Babbage's sequential model. Is it any wonder that the computer industry is having such a hard time making the transition from sequential to parallel computing?

Square Peg vs. Round Hole

There is a big difference between our ideal parallel computer model, in which every element is a parallel processor, and the mathematicians' model, in which elements are steps in an algorithm to be executed sequentially. Even if we are forced to use a single fast CPU to emulate the parallel behavior of a huge number of parallel entities, the two models require different frames of mind. For example, in a true parallel programming model, parallelism is implicit but sequential order is explicit; that is to say, sequences must be explicitly specified by the programmer. In the algorithmic model, by contrast, sequential order is implicit and parallelism must be explicitly specified. But the difference is even more profound than this. Whereas an element in an algorithm can send a signal to only one other element (its successor in the sequence) at a time, an element in a parallel program can send a signal to as many successors as necessary. This is what is commonly referred to as fine-grain or instruction-level parallelism, which is highly desirable but impossible to obtain in an MIMD execution model using current multicore CPU technology.
The image above represents a small parallel program. A signal enters at the left and a ‘done’ signal is emitted at the right. We can observe various elementary parallel operators communicating with one another. Signals flow from the output of one element (small red circle) to the input of another (white or black circle). The splitting of signals into multiple parallel streams has no analog in an algorithmic sequence or thread. Notice that parallelism is implicit but sequential order is explicit. But that's not all. A true parallel system that uses signals to communicate must be synchronous, i.e., every operation must execute in exactly one system cycle. This ensures that the system is temporally deterministic; otherwise, signal timing quickly gets out of step. Temporal determinism is icing on the parallel cake because it solves a whole slew of problems related to reliability and security.
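Here is a rough Python sketch of the kind of behavior the figure depicts (the network, node names and cycle count are made up for illustration; this is only an emulation of the idea on a sequential machine, not how a real signal-based system would be built):

```python
# Illustrative sketch of a synchronous, signal-driven network: parallelism is
# implicit, sequencing is explicit (an edge from A to B means "B fires on the
# cycle after A"), every operation takes exactly one cycle, and a node may
# signal any number of successors.

def run(network, start_nodes, cycles):
    """network: dict mapping node name -> (action, [successor names])."""
    pending = set(start_nodes)                 # signals to deliver this cycle
    for cycle in range(cycles):
        if not pending:
            break
        next_pending = set()
        for node in pending:                   # all of these fire "in parallel"
            action, successors = network[node]
            action(cycle)
            next_pending.update(successors)    # a signal can fan out to many nodes
        pending = next_pending                 # advance to the next system cycle

# A small program: 'split' fans out to two operators that run concurrently,
# both of which signal 'done'.
network = {
    "split":  (lambda c: print(c, "split"),  ["double", "negate"]),
    "double": (lambda c: print(c, "double"), ["done"]),
    "negate": (lambda c: print(c, "negate"), ["done"]),
    "done":   (lambda c: print(c, "done"),   []),
}
run(network, ["split"], cycles=10)
# Cycle 0: split; cycle 1: double and negate (same cycle); cycle 2: done.
```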

It should be obvious that using Babbage's and Lady Ada's 150-year-old computing model to program a parallel computer is like trying to fit a square peg into a round hole. One would think that, by now, the computer industry would have figured out that there is something fundamentally wrong with the way it builds and programs computers but, unfortunately, the mathematicians are at it again. The latest trend is to use functional languages like Erlang for thread-based parallel programming. Thread-based, coarse-grain parallelism is a joke, in my opinion. There is a way to design a fine-grain, self-balancing multicore CPU for an MIMD execution environment that does not use threads. Threaded programs are error-prone, hard to write, and difficult to understand. Decidedly, the notion of a computer as a calculating machine will die hard. It is frustrating, to say the least. When are we going to learn?

Lifting the Curse of the Algorithm

To solve the parallel programming problem, we must lift the curse of the algorithm. We must abandon the old model and switch to a true parallel model. To do so, we must reinvent the computer. What I mean is that we must change, not only our software model, but our hardware model as well. Current CPUs were designed and optimized for the algorithmic model. We need a new processor architecture (both single core and multicore) that is designed from the ground up to emulate non-algorithmic, synchronous parallelism. It’s not rocket science. We already know how to emulate parallelism in our neural networks and our cellular automata. However, using current CPUs to do so at the instruction level would be too slow. The market wants super fast, fine-grain, self-balancing and auto-scalable multicore processors that use an MIMD execution model. It wants parallel software systems that are easy to program and do not fail. Right now there is nothing out there that fits the bill.
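As a trivial illustration of what I mean by emulating synchronous parallelism, here is a Python sketch of a one-dimensional cellular automaton (the rule number and grid size are arbitrary). The standard trick is double buffering: every cell's next state is computed from the current generation before any cell is overwritten, so all cells appear to change in the same cycle:

```python
# Illustrative sketch of how synchronous parallelism is emulated on a
# sequential machine: a one-dimensional cellular automaton (rule 110)
# updated with two buffers so that every cell appears to change at once.

RULE = 110

def step(cells):
    """Compute the next generation from the current one (double buffering)."""
    n = len(cells)
    nxt = [0] * n                       # separate buffer: no cell is overwritten
    for i in range(n):                  # a real parallel machine would do these
        left = cells[(i - 1) % n]       # n updates simultaneously; the CPU
        mid = cells[i]                  # merely emulates that, one at a time
        right = cells[(i + 1) % n]
        pattern = (left << 2) | (mid << 1) | right
        nxt[i] = (RULE >> pattern) & 1
    return nxt

cells = [0] * 31
cells[15] = 1                           # a single live cell in the middle
for _ in range(15):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```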

The Next Computer Revolution

It remains to be seen who, among the various processor manufacturers, will be the first to see the light. Which nation will be the standard bearer of the new computing paradigm? When will the big switch happen? Who knows? But when it does, it will be the dawning of the next computer revolution, one which will make the first one pale in comparison. We will be able to build super fast computers and programs of arbitrary complexity that do not fail. It will be the true golden age of automation. I can’t wait.

[This article is part of my downloadable e-book on the parallel programming crisis.]

See Also:

Nightmare on Core Street
Why Parallel Programming Is So Hard
The Age of Crappy Concurrency: Erlang, Tilera, Intel, AMD, IBM, Freescale, etc…
Half a Century of Crappy Computing
Parallel Computers and the Algorithm: Square Peg vs. Round Hole

3 comments:

Dawn said...

What good is dumping the algorithmic approach when we still hang on to the tired old real-world metaphors of paper-in-file-cabinet documents, or tape-control-style transport buttons for music and video apps?

Really, we straitjacket these intelligences that can do anything into echoes of how we used to do these things without them.

FORTH.
!!

Tobia said...

Hi

Your rant is not without its merits, but do you realize that what you envision already exists?

The "fine-grained parallelism" you describe is exactly the way microchips are programmed (or "designed".)

In particular, an FPGA is a kind of microchip that contains an array (or matrix, or lattice) of programmable logic gates. Each logic gate can be set up to perform an arbitrary boolean function of its inputs. Moreover, the lattice itself can be programmed to connect the gates in an arbitrary way. All the gates fire simultaneously, at the chip's clock speed. At each clock tick, the output from each gate becomes the input for other gates, according to the layout programmed into the lattice. An FPGA can be reprogrammed at will. In fact, it reads its configuration from some kind of memory at each startup.
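To give a rough idea, such a lattice of gates can be emulated in a few lines of Python. This is only a toy illustration of the principle (the gate names and "configuration" are made up; real FPGA tool chains work nothing like this), but it shows every gate firing on the same clock tick:

```python
# Toy model of a lattice of programmable logic gates that all "fire" on the
# same clock tick. For illustration only; not how real FPGA tools work.

def tick(gates, state):
    """Advance the whole lattice by one clock cycle, synchronously."""
    # Every gate reads the *current* outputs of its input gates...
    new_state = {}
    for name, (function, inputs) in gates.items():
        new_state[name] = function(*(state[i] for i in inputs))
    # ...and only then are all outputs updated, as if simultaneously.
    return new_state

# A tiny "configuration": two external inputs feeding an AND and an XOR gate,
# whose outputs feed an OR gate on the following cycle.
gates = {
    "in_a": (lambda: 1, []),                  # constant external inputs
    "in_b": (lambda: 0, []),
    "and1": (lambda a, b: a & b, ["in_a", "in_b"]),
    "xor1": (lambda a, b: a ^ b, ["in_a", "in_b"]),
    "or1":  (lambda x, y: x | y, ["and1", "xor1"]),
}
state = {name: 0 for name in gates}           # all outputs start low
for cycle in range(3):
    state = tick(gates, state)
    print(cycle, state)                       # watch the signal propagate one gate per tick
```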

If I understand your post, what you call a "processor architecture for non-algorithmic, synchronous parallelism" is precisely the FPGA, which has been in widespread use since the 1990s. Here is a very basic video on FPGAs.

Unfortunately, you will find that FPGAs are usually programmed with text-based programming languages (Verilog and VHDL are the big ones) that still require using a keyboard, although they are quite different from traditional programming languages, given the different nature of the hardware.

The reason for using a text-based interface is simply that reading, writing*, scrolling, copying and pasting, searching, and generally working on a long text document is much easier than on a large visual schematic of connections. The difference gets more pronounced as the program becomes bigger and bigger. Just imagine trying to decipher this medium-sized (small-sized, really) program written in Labview, a visual programming language**, as opposed to the equivalent program written in a traditional language.

As far as replacing CPUs with FPGAs (or similar fine-grained parallel architectures) in everyday computing, I don't see it happening any time soon, because most tasks performed on (by) general-purpose computers are sequential in nature. That's my unsubstantiated, personal opinion. I know it differs from yours, so be it.

-Tobia

* Hitting a sequence of keys on a keyboard is much faster than moving a mouse or finger over a screen. If that's not the case for you, may I humbly suggest learning to touch-type, preferably on a Dvorak keyboard, best of all the Programmer Dvorak keyboard. It's one of the things that has helped me the most.

** Labview is a very nice visual programming language, but it's also a very expensive piece of commercial software. If you are interested in this kind of thing, may I suggest looking at PureData. It's free software and its programs (called "patches", because they patch units together) look like this. Each unit has a number of inputs (top marks) and outputs (bottom marks) and can be either a builtin or a user-defined function. User functions are themselves patches, defined by patching together other units. I don't think anybody has ever thought of using PureData to program (or emulate) an FPGA, but it would be a nice project.

Tobia said...

PS: OpenCL is a framework for writing software to run on parallel hardware.

It was created by Apple to offload some of the computational work of their applications onto the GPU, a powerful piece of parallel computing hardware that is present on most contemporary computers and sits idle most of the time.

Nowadays OpenCL is supported by most hardware manufacturers and operating systems. You can use it to write code once and have it run on a traditional CPU, on a GPU, or on specialized parallel architectures.

There are vendors selling FPGA expansion boards and related software to compile OpenCL programs into a "gate configuration" for their massively-parallel FPGA chips.

So I would say the market is already moving in this direction.