Thursday, January 29, 2009

Parallel Computing: The Fourth Crisis, Part II

Part I, II

Abstract

Previously, I wrote that, in order to solve the memory bandwidth problem, the computer industry must design a new kind of computer, one that does away with the central processor. Alternatively, the industry must wait for some sort of breakthrough in quantum tunneling or optical memory. In this post, I argue that a decentralized design will result in extremely fast parallel processing. I further argue that the deterministic timing afforded by a synchronous processing model can be used to implement a signaling mechanism that drastically reduces the number of physical signal pathways required.

Applicability

It should be noted that the multicore memory bandwidth problem is not universal. The use of on-chip caches together with an effective load-balancing system will provide adequate scalability for many applications. If data and event dependencies are localized, inter-core communication is minimal and performance will not degrade as the number of cores increases. Latency can be minimized by automatically preventing data from drifting to caches far from the cores that service them. In this light, I should mention that Tilera’s TILE64 multicore processor with its iMesh™ technology is potentially well suited for this type of parallel computing (see links below). Unfortunately, there are many applications with numerous dependencies that will not scale regardless of the size of cache memory. Let me add here that a single breakthrough in quantum tunneling or optical memory that eliminates memory bus contention would solve the problem, in which case the ideas that I propose in this article would no longer be necessary.
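To make the locality point a little more concrete, here is a toy Python sketch of why localized dependencies scale. The core count, data regions and task counts are made up purely for illustration; this is not a model of the TILE64 or of any real chip.

```python
# Toy illustration: tasks that only touch their own core's data region
# generate no cross-core traffic, so adding cores adds throughput.
# Core count, regions and task counts are hypothetical.

def count_cross_core_messages(tasks, region_of_core):
    """Count remote-cache accesses for a given task placement.

    tasks: list of (home_core, data_region) pairs
    region_of_core: dict mapping core id -> data region it owns
    """
    messages = 0
    for core, region in tasks:
        if region_of_core[core] != region:
            messages += 1  # task must reach into another core's cache
    return messages

cores = range(4)
region_of_core = {c: f"region{c}" for c in cores}

# Localized placement: every task runs on the core that owns its data.
local_tasks = [(c, f"region{c}") for c in cores for _ in range(100)]
print(count_cross_core_messages(local_tasks, region_of_core))      # 0

# Scattered placement: tasks land on cores far from their data.
scattered_tasks = [((i % 4), f"region{(i + 1) % 4}") for i in range(400)]
print(count_cross_core_messages(scattered_tasks, region_of_core))  # 400
```

When the placement is localized, the message count stays at zero no matter how many cores are added; when it is not, every task contributes contention.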

The Brain

My favorite example of a non-centralized, high event-dependency parallel computing system is the brain. In a conventional computer, a central processor loads a stream of op-codes (instructions) from memory, deciphers their meaning and executes them one at a time. The brain, by contrast, uses as many tiny processors (neurons) as there are instructions in its program. Every elementary processor is directly attached to its data, so to speak. The most obvious advantage of this approach is that there is no instruction stream and therefore no need for either a central instruction decoder or an execution unit. Another advantage is the inherent parallelism, an extremely powerful feature that the brain uses to process vast quantities of signals simultaneously. Memory bandwidth is not an issue for the brain. I envision a new programmable, massively parallel computer technology where huge numbers of elementary processors are directly embedded into memory. The performance increase would be tremendous compared to a conventional computer.
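Here is a minimal Python sketch of the contrast I have in mind. It is my own toy model, not a description of any actual hardware: a central unit walking an instruction stream versus a field of elementary processors, each bound to its own datum, all updating in the same tick.

```python
# Toy contrast between a sequential instruction stream and a field of
# elementary processors, each permanently attached to its own datum.
# Everything here is a hypothetical illustration, not a real design.

# --- Von Neumann style: one central unit walks an instruction stream ---
memory = [1, 2, 3, 4]
program = [("inc", i) for i in range(len(memory))]  # one op per cell

for opcode, addr in program:        # instructions decoded one at a time
    if opcode == "inc":
        memory[addr] += 1           # every access goes over the shared bus
print(memory)                       # [2, 3, 4, 5]

# --- Brain style: each cell holds its datum and its own rule ---
class Cell:
    def __init__(self, value):
        self.value = value          # data lives inside the processor
    def tick(self):
        self.value += 1             # no fetch, no decode, no shared bus

cells = [Cell(v) for v in [1, 2, 3, 4]]
for cell in cells:                  # conceptually all cells fire at once;
    cell.tick()                     # this loop only simulates that tick
print([c.value for c in cells])     # [2, 3, 4, 5]
```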

Event Dependencies and Signal Pathways

Signaling is used for event timing purposes. It implies the ability to create a connection between a signal emitter and a receiver. We normally do not think of it as such, but a computer program is a communication system. Every instruction in an algorithm sends a signal to the next instruction in the sequence, meaning essentially “I’m done, now it’s your turn”. In a true parallel system (e.g., the brain), a predecessor instruction can send a signal to an indefinite number of successors; bus contention is not an issue because signal pathways are not shared. Signaling between program elements in a Von Neumann computer, by contrast, is accomplished via a memory addressing mechanism. This is a relatively slow process because it requires a shared address decoder, memory controller, and address and data buses that permit only one memory location to be accessed at a time. This is fine in most cases, as long as memory bandwidth is high enough. However, it becomes a major headache in future multicore environments where thousands or even tens of thousands of cores must have concurrent access to shared memory. That’s when bus contention rears its ugly head and drives a nail in the coffin of scalability, so to speak.
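The “I’m done, now it’s your turn” signaling can be pictured as a graph of cells, each holding a list of successors; when a cell is done, it hands a signal directly to each successor instead of going through a shared bus. The Python sketch below is only my own toy rendering of that idea: the cell names and the per-cycle buffering are illustrative assumptions, not a specification of COSA.

```python
# Toy model of direct signaling: each cell knows its successors and
# signals them when it is done. The per-cycle buffer is an assumption
# used to keep each cycle's activations separate.

class Cell:
    def __init__(self, name):
        self.name = name
        self.successors = []

    def connect(self, *cells):
        self.successors.extend(cells)

    def fire(self):
        print(f"{self.name} fired")
        return self.successors      # "I'm done, now it's your turn"

a, b, c, d = (Cell(n) for n in "abcd")
a.connect(b, c)                     # one predecessor, several successors
b.connect(d)
c.connect(d)

current = [a]                       # cells active in this cycle
while current:
    next_cycle = []                 # buffer for the following cycle
    for cell in current:
        next_cycle.extend(cell.fire())
    current = list(dict.fromkeys(next_cycle))  # dedupe repeated signals
```

Note that no shared addressing is involved: each signal travels along its own dedicated successor link.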

Self-Activation via Deterministic Timing

The brain solves the contention problem by using dedicated axonic fibers to directly connect processing elements together. It would be nice if we could use the same approach in our computers, but the ensuing proliferation of pathways would be prohibitive, given the current state of our field-programmable technology. I’ve been thinking that one way to drastically reduce the number of signals and pathways is to take advantage of the deterministic timing that is inherent in a synchronous programming model such as COSA. The idea is that temporally related elements (operators) would be programmed to know precisely when to activate themselves. All that is needed is a reset pulse and a heartbeat. In other words, all temporally related processors would be connected to only two lines, like pearls on a necklace. Each element would self-activate at its programmed time by counting heartbeat pulses. Some would activate simultaneously while others would do so sequentially, according to the program. Multiple “necklaces” could be laid side by side on a single die to form an entire application. Programming the elements (or cells) could be done via conventional memory addressing. Of course, since some elements (such as an AND cell or an addition effector) will require multiple inputs, field-programmable interconnects would allow cross-connections between elements residing in different necklaces. As I wrote in a previous post, this is still a partially baked idea. I hope to have time to work on it some more in order to iron out the kinks and fill in the details.
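Here is a small Python simulation of the “necklace” idea as described above. The cell names, the tick values and the two necklaces are made-up examples: every cell shares only a reset line and a heartbeat line, keeps its own counter, and activates itself when the count reaches its programmed tick.

```python
# Toy simulation of self-activation by deterministic timing.
# All cells share just two lines: reset and heartbeat. Each cell counts
# heartbeats and activates itself at its pre-programmed tick.
# The cell names and tick values below are made up for illustration.

class Cell:
    def __init__(self, name, activation_tick):
        self.name = name
        self.activation_tick = activation_tick  # programmed at load time
        self.count = 0

    def reset(self):                 # driven by the shared reset pulse
        self.count = 0

    def heartbeat(self):             # driven by the shared heartbeat line
        self.count += 1
        if self.count == self.activation_tick:
            print(f"tick {self.count}: {self.name} activates")

# Two "necklaces" laid side by side; no dedicated point-to-point wiring.
necklace1 = [Cell("add1", 1), Cell("add2", 1), Cell("and1", 2)]
necklace2 = [Cell("copy1", 2), Cell("or1", 3)]
cells = necklace1 + necklace2

for cell in cells:
    cell.reset()
for _ in range(3):                   # three heartbeat pulses
    for cell in cells:
        cell.heartbeat()
```

In this run, add1 and add2 activate on the same pulse while and1, copy1 and or1 follow on later pulses, which is the mix of simultaneous and sequential activation the model calls for.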

Data Dependencies

It would be great if mimicking the brain’s decentralized architecture solved all of our problems but, unfortunately, there are things that computers must do that the brain does not have to do. Directly attaching an elementary processor to its data would speed things up marvelously, but it would only work up to a point because variables are not always local to the processors that use them. Our programs are expected to do things like moving or copying data from one place in memory to another. An example is matrix manipulation, where a variable or constant may be copied to every cell in an array, or one array may be used to transform another. In my next post, I will describe a possible way to extend this model to maximize data bandwidth.
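As a toy illustration of the distinction, consider broadcasting one constant into every cell of an array versus transforming one array into another. If each cell has its own processor, the broadcast needs no data movement at all, but the array-to-array transform still needs a pathway from every source cell to its destination. The Python sketch below is my own illustration only; the sizes and values are arbitrary.

```python
# Toy illustration: a broadcast is trivially parallel when each array cell
# has its own processor, but an array-to-array transform still needs a
# data pathway from every source cell to its destination cell.
# Sizes and values are arbitrary, for illustration only.

source = list(range(8))
destination = [0] * 8

# Broadcast: every cell writes the same constant; no cell needs data
# from any other cell, so all writes could occur in a single tick.
constant = 7
broadcast = [constant for _ in destination]

# Copy/transform: each destination cell needs the value held by a source
# cell somewhere else, so some pathway (bus, mesh, ...) must carry it.
for i in range(len(destination)):
    destination[i] = source[i] * 2   # data must travel between cells

print(broadcast)     # [7, 7, 7, 7, 7, 7, 7, 7]
print(destination)   # [0, 2, 4, 6, 8, 10, 12, 14]
```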

See Also:
How to Solve the Parallel Programming Crisis
Transforming the TILE64 into a Kick-Ass Parallel Machine

Sunday, January 18, 2009

Parallel Computing: The Fourth Crisis, Part I

Part I, II

The Memory Bandwidth Problem

I apologize for the long hiatus. It took me a while to recover from the death of a close relative and my recent move from Texas to California. As I wrote in my last post, I have been thinking about the memory bandwidth problem. This is worse than the parallel programming crisis because it appears that the solution will require some sort of breakthrough in quantum tunneling or optical computing. This could happen now or ten years from now. The computer industry cannot wait. Somehow, it must forge a solution as soon as possible, otherwise even big chip companies like Intel, IBM, and AMD will see their profit stream dwindle down to a trickle. These are companies that rely on constant improvements in performance to maintain their competitive edge.

The Four Crises

The computer world is battling four major crises, in my opinion. The first two, software reliability and programmer productivity, have been with us since the seventies and they show no sign of abating. Parallel programming, the third crisis, emerged only a few years ago with the commercialization of multicore processors. I have forcefully argued in the past that these three crises are really one and the same. What I mean is that it is a single problem that calls for a single solution. The fourth major crisis is the memory bandwidth problem. This is the worst one of them all, in my opinion, because, whether or not the other three are solved, slow memory threatens to repeal Moore’s law and bring progress in the field to a screeching halt. Nobody wants that to happen, at least not in the foreseeable future.

A New Kind of Computer

I think the world needs a new kind of computer. I have been thinking about a radically new way of achieving blazingly fast parallel computing without the use of a central processor. The idea has been percolating in my mind for quite some time. It is still partially baked, but I think it is worth pursuing. Essentially, I believe that the age of the central processor must come to an end. The communication bottleneck that results from segregating the processor from memory is simply intolerable. My idea is based primarily on certain characteristics of the COSA software model. I will describe what I have in mind in greater detail in Part II of this article.

See Also:
How to Solve the Parallel Programming Crisis