Part I, II
Previously, I wrote that in order to solve the memory bandwidth problem, the computer industry must design a new kind of computer, one that does away with the central processor. Alternatively, the industry must wait for some sort of breakthrough in quantum tunneling or optical memory. In this post, I argue that a decentralized design will result in extremely fast parallel processing. I further argue that the deterministic timing afforded by a synchronous processing model can be used to implement a signaling mechanism that significantly reduces the requirement for a huge number of physical signal pathways.
It should be noted that the multicore memory bandwidth problem is not universal. The use of on-chip caches together with an effective load balancing system will provide adequate scalability for many applications. If data and event dependencies are localized, inter-core communication is minimal and performance will not degrade as the number of cores increases. It is possible to minimize latency by automatically preventing data from drifting to distant caches far from the cores that service them. In this light, I should mention that Tilera’s Tile64 multicore processor with its iMesh™ technology is potentially well suited for this type of parallel computing (see links below). Unfortunately, there are many applications with numerous dependencies that will not scale regardless of the size of cache memory. Let me add here that a single breakthrough in quantum tunneling or optical memory that eliminates memory bus contention would solve the problem, in which case the ideas that I propose in this article would no longer be necessary.
My favorite example of a non-centralized, high event-dependency parallel computing system is the brain. In a conventional computer, a central processor loads a stream of op-codes (instructions) from memory, deciphers their meaning and executes them one at a time. The brain, by contrast, uses as many tiny processors (neurons) as there are instructions in its program. Every elementary processor is directly attached to its data, so to speak. The most obvious advantage of this approach is that there is no instruction stream and therefore no need for either a central instruction decoder or an execution unit. Another advantage is the inherent parallelism, an extremely powerful feature that the brain uses to process vast quantities of signals simultaneously. Memory bandwidth is not an issue for the brain. I envision a new programmable, massively parallel computer technology where huge numbers of elementary processors are directly embedded into memory. The performance increase would be tremendous compared to a conventional computer.
Event Dependencies and Signal Pathways
Signaling is used for event timing purposes. It implies the ability to create a connection between a signal emitter and a receiver. We normally do not think of it as such but a computer program is a communication system. Every instruction in an algorithm sends a signal to the next instruction in the sequence meaning essentially “I’m done, now it’s your turn”. In a true parallel system (e.g., the brain), a predecessor instruction can send a signal to an indefinite number of successors: bus contention is not an issue because signal pathways are not shared. Signaling between program elements in a Von Neumann computer, by contrast, is accomplished via a memory addressing mechanism. This is a relatively slow process because it requires shared address decoder, memory controller, address and data buses that permit only one memory location to be accessed at a time. This is fine in most cases if the memory bandwidth is high enough. However, it becomes a major headache in future multicore environments when thousands or even tens of thousands of cores must have concurrent access to shared memory. That’s when bus contention rears its ugly head and drives a nail in the coffin of scalability, so to speak.
Self-Activation via Deterministic Timing
The brain solves the contention problem by using dedicated axonic fibers to directly connect processing elements together. It would be nice if we could use the same approach in our computers but the ensuing proliferation of pathways would be prohibitive, given the current state of our field programmable technology. I’ve been thinking that one way to drastically reduce the number of signals and pathways is to take advantage of the deterministic timing that is inherent in a synchronous programming model such as COSA. The idea is that temporally related elements (operators) would be programmed to know precisely when to activate themselves. All that is needed is a reset pulse and a heartbeat. In other words, all temporally related processors would be connected to only two lines, like pearls on a necklace. Each element would self-activate at their programmed time by counting heartbeat pulses. Some would activate simultaneously while others would do so sequentially according to the program. Multiple “necklaces” could be laid side by side on a single die to form an entire application. Programming the elements (or cells) could be done via conventional memory addressing. Of course, since some elements (such as an AND cell or an addition effector) will require multiple inputs, field-programmable interconnects would allow cross connections between elements residing in different necklaces. As I wrote in a previous post, this is still a partially baked idea. I hope to have time to work on it some more in order to iron out the kinks and fill in the details.
It would be great if mimicking the brain’s decentralized architecture would solve all of our problems but, unfortunately, there are things that computers must do that the brain does not have to do. Directly attaching an elementary processor to its data would speed things up marvelously but it would only work up to a point because variables are not always local to the processors that use them. Our programs are expected to do things like moving or copying data from one place in memory to another. An example is matrix manipulation where a variable or constant may be copied to every cell in an array or one array may be used to transform another. In my next post, I will describe a possible way to extend this model to maximize data bandwidth.
How to Solve the Parallel Programming Crisis
Transforming the TILE64 into a Kick-Ass Parallel Machine