Thursday, January 29, 2009

Parallel Computing: The Fourth Crisis, Part II

Part I, II

Abstract

Previously, I wrote that in order to solve the memory bandwidth problem, the computer industry must design a new kind of computer, one that does away with the central processor. Alternatively, the industry must wait for some sort of breakthrough in quantum tunneling or optical memory. In this post, I argue that a decentralized design will result in extremely fast parallel processing. I further argue that the deterministic timing afforded by a synchronous processing model can be used to implement a signaling mechanism that significantly reduces the number of physical signal pathways required.

Applicability

It should be noted that the multicore memory bandwidth problem is not universal. The use of on-chip caches together with an effective load-balancing system will provide adequate scalability for many applications. If data and event dependencies are localized, inter-core communication is minimal and performance will not degrade as the number of cores increases. It is possible to minimize latency by automatically preventing data from drifting into caches far from the cores that service them. In this light, I should mention that Tilera’s Tile64 multicore processor with its iMesh™ technology is potentially well suited for this type of parallel computing (see links below). Unfortunately, there are many applications with numerous dependencies that will not scale regardless of the size of cache memory. Let me add here that a single breakthrough in quantum tunneling or optical memory that eliminates memory bus contention would solve the problem, in which case the ideas that I propose in this article would no longer be necessary.

The Brain

My favorite example of a non-centralized, high event-dependency parallel computing system is the brain. In a conventional computer, a central processor loads a stream of op-codes (instructions) from memory, deciphers their meaning and executes them one at a time. The brain, by contrast, uses as many tiny processors (neurons) as there are instructions in its program. Every elementary processor is directly attached to its data, so to speak. The most obvious advantage of this approach is that there is no instruction stream and therefore no need for either a central instruction decoder or an execution unit. Another advantage is the inherent parallelism, an extremely powerful feature that the brain uses to process vast quantities of signals simultaneously. Memory bandwidth is not an issue for the brain. I envision a new programmable, massively parallel computer technology where huge numbers of elementary processors are directly embedded into memory. The performance increase would be tremendous compared to a conventional computer.
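
To make the contrast concrete, here is a minimal sketch in Python (simulated sequentially for illustration; the Cell class and its names are hypothetical, not an actual design) of an elementary processor stored right next to its own operand, so that there is no instruction stream to fetch and no central decoder:

    # An elementary processor permanently attached to its own datum; there is
    # no instruction stream to fetch and no central decoder or execution unit.
    class Cell:
        def __init__(self, operation, operand):
            self.operation = operation  # the cell's single, fixed behavior
            self.operand = operand      # the datum embedded with the processor

        def fire(self, signal):
            # The cell reacts to an incoming signal using only local data.
            return self.operation(self.operand, signal)

    # In hardware, all cells would fire simultaneously; the list comprehension
    # below merely simulates that in sequence.
    cells = [Cell(lambda d, s: d + s, 10), Cell(lambda d, s: d * s, 3)]
    print([cell.fire(2) for cell in cells])  # -> [12, 6]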

Event Dependencies and Signal Pathways

Signaling is used for event timing purposes. It implies the ability to create a connection between a signal emitter and a receiver. We normally do not think of it as such, but a computer program is a communication system. Every instruction in an algorithm sends a signal to the next instruction in the sequence meaning essentially “I’m done, now it’s your turn”. In a true parallel system (e.g., the brain), a predecessor instruction can send a signal to an indefinite number of successors: bus contention is not an issue because signal pathways are not shared. Signaling between program elements in a Von Neumann computer, by contrast, is accomplished via a memory addressing mechanism. This is a relatively slow process because it requires a shared address decoder, memory controller, and address and data buses that permit only one memory location to be accessed at a time. This is fine in most cases if the memory bandwidth is high enough. However, it becomes a major headache in future multicore environments when thousands or even tens of thousands of cores must have concurrent access to shared memory. That’s when bus contention rears its ugly head and drives a nail into the coffin of scalability, so to speak.
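
To illustrate the signaling view of a program, here is a small sketch in Python: each operation, when finished, signals every successor that depends on it, and a successor fires once all of its predecessors have reported in. The tiny graph below is invented for illustration and is not meant to depict COSA's actual representation:

    # The arcs of a dependency graph, with each finished operation signaling
    # its successors. A successor fires once all of its predecessors are done.
    from collections import defaultdict

    successors = {"load_a": ["add"], "load_b": ["add"], "add": ["store"]}
    in_degree = {"add": 2, "store": 1}
    done_count = defaultdict(int)

    def signal_done(op):
        # "I'm done, now it's your turn" sent along every outgoing arc.
        for nxt in successors.get(op, []):
            done_count[nxt] += 1
            if done_count[nxt] == in_degree[nxt]:
                print(nxt, "is ready to fire")

    signal_done("load_a")   # add still waits on load_b
    signal_done("load_b")   # prints: add is ready to fire
    signal_done("add")      # prints: store is ready to fire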

Self-Activation via Deterministic Timing

The brain solves the contention problem by using dedicated axonic fibers to directly connect processing elements together. It would be nice if we could use the same approach in our computers, but the ensuing proliferation of pathways would be prohibitive, given the current state of our field-programmable technology. I’ve been thinking that one way to drastically reduce the number of signals and pathways is to take advantage of the deterministic timing that is inherent in a synchronous programming model such as COSA. The idea is that temporally related elements (operators) would be programmed to know precisely when to activate themselves. All that is needed is a reset pulse and a heartbeat. In other words, all temporally related processors would be connected to only two lines, like pearls on a necklace. Each element would self-activate at its programmed time by counting heartbeat pulses. Some would activate simultaneously while others would do so sequentially according to the program. Multiple “necklaces” could be laid side by side on a single die to form an entire application. Programming the elements (or cells) could be done via conventional memory addressing. Of course, since some elements (such as an AND cell or an addition effector) will require multiple inputs, field-programmable interconnects would allow cross connections between elements residing in different necklaces. As I wrote in a previous post, this is still a partially baked idea. I hope to have time to work on it some more in order to iron out the kinks and fill in the details.
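
For what it’s worth, here is a partial sketch in Python of the necklace idea as I currently picture it: every cell sees only the two shared lines (reset and heartbeat), counts pulses, and self-activates at its own programmed time. The cell names and activation ticks below are invented for illustration:

    # Each cell is "threaded on the necklace": it sees only the shared reset
    # and heartbeat lines, and self-activates at its programmed tick.
    class NecklaceCell:
        def __init__(self, name, activation_tick):
            self.name = name
            self.activation_tick = activation_tick  # set via conventional addressing
            self.count = 0

        def on_reset(self):
            self.count = 0

        def on_heartbeat(self):
            self.count += 1
            if self.count == self.activation_tick:
                print(f"tick {self.count}: {self.name} activates")

    # A, B and C share two lines; B and C activate simultaneously at tick 2.
    necklace = [NecklaceCell("A", 1), NecklaceCell("B", 2), NecklaceCell("C", 2)]
    for cell in necklace:
        cell.on_reset()               # the shared reset pulse
    for _ in range(3):                # three pulses on the shared heartbeat line
        for cell in necklace:
            cell.on_heartbeat()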

Data Dependencies

It would be great if mimicking the brain’s decentralized architecture would solve all of our problems but, unfortunately, there are things that computers must do that the brain does not have to do. Directly attaching an elementary processor to its data would speed things up marvelously but it would only work up to a point because variables are not always local to the processors that use them. Our programs are expected to do things like moving or copying data from one place in memory to another. An example is matrix manipulation where a variable or constant may be copied to every cell in an array or one array may be used to transform another. In my next post, I will describe a possible way to extend this model to maximize data bandwidth.
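
To make the data-movement requirement concrete, here are the kinds of operations I have in mind, written as ordinary array operations in Python (using numpy purely as sequential shorthand). On a conventional machine each of these is a loop over memory; in the ideal decentralized design, every target location would be written in the same cycle:

    import numpy as np

    a = np.zeros((4, 4))
    a[:] = 7           # copy one constant to every cell of the array
    b = a * 2 + 1      # transform one array into another, element by element
    print(b[0, 0])     # -> 15.0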

See Also:
How to Solve the Parallel Programming Crisis
Transforming the TILE64 into a Kick-Ass Parallel Machine

5 comments:

chz said...

>Every instruction in an algorithm sends a signal to the next instruction in the sequence meaning essentially “I’m done, now it’s your turn”.

No. This is why instruction-level parallelism works. The arcs on the dependency graph for instructions are probably what you want here.

>This is a relatively slow process because it requires an address decoder and a controller that permits only one memory location to be accessed at a time.

Do you mean that for a given memory location, only a single thread may access it at a time, or do you mean given 'n' threads and 'n' distinct memory addresses, only one thread may access one memory address at a time? For the latter you can just use multiport memory; for the former, a sane cache hierarchy on a CMP should take care of this for the most part. If every single hardware thread is writing to a single memory location every cycle, the algorithm is broken anyway. If not, you have false sharing and should reexamine your cache design.

Even more on that point, when you say access, do you mean writes as well as reads? If it's only reads, you should be OK with even the simplest coherence protocol. If it's reads and writes, you should be OK so long as it stays on chip.

James said...

I'm wondering how COSA would cope with running time-critical applications alongside other programs which would try to steal its memory.

Although seldom implemented by programs, on Windows it is possible to lock RAM pages in physical memory, so in theory you could write a game so that its RAM is immune to something like SETI@home. Normally, though, you would just close SETI@home before launching the game.

If I've understood correctly, though, COSA programs or objects are not loaded and closed, but some of their cells may exist in RAM while the rest lie dormant on the hard disk.

Have you considered whether a COSA OS should allow some objects to be locked in RAM, or whether a 'RAM priority' should be ascribed to each cell? Could this be implemented from within a COSA environment, or would it have to be a feature of the COSA machine?

In general, most of your articles are about the CPU-RAM bottleneck - do you think RAM-storage speed is a minor problem?

Louis Savain said...

>No. This is why instruction-level parallelism works. The arcs on the dependency graph for instructions are probably what you want here.

Well, I was thinking algorithmically, which presupposes a sequence of steps. Note also that dependency graphs are not yet fully automated and can miss important parallelism that occurs at various levels within a sequential program. My thesis (see Project COSA) is that, if programming were based on parallel entities in the first place, there would be no need for tools that attempt to parallelize sequential programs. I feel that programming should be parallel by default. Sequences should be explicitly specified. Right now, the opposite is true.

>Do you mean that for a given memory location, only a single thread may access it at a time,

Yes. As far as I know, this is true for most memory systems.

>or do you mean given 'n' threads and 'n' distinct memory addresses, only one thread may access one memory address at a time?

Yes. I am talking about main memory, not caches.

>For the latter you can just use multiport memory,

Well, multiport memory is a limited solution that is viable only in very expensive systems. It is out of the question when considering future systems with hundreds or thousands of cores. We would not have a memory bandwidth crisis if the solution amounted only to adding more ports to memory. Why? Because the difficulty and price of adding ports grow exponentially.

>for the former, a sane cache hierarchy on a CMP should take care of this for the most part.

See below.

>If every single hardware thread is writing to a single memory location every cycle, the algorithm is broken anyway.

Absolutely.

>If not, you have false sharing and should reexamine your cache design.

As I wrote in the article, caching is adequate for many applications but not all. For example, it is inadequate for programs like brain simulations or very large neural networks. The interconnections (dependencies) are too numerous and too distant and would bring a cache-based system to its knees. In my opinion, super parallel programs like artificial brains will be the norm in the not too distant future. Unless those programs can scale properly, we will have a major crisis.

>Even more on that point, when you say access, do you mean writes as well as reads? If it's only reads, you should be OK with even the simplest coherence protocol. If it's reads and writes, you should be OK so long as it stays on chip.

Well, I don't think this will solve the bandwidth problem; otherwise nobody would be talking about it as if it were a serious problem. Let's say you have a program requirement to fill every location in a huge array with a default value. In an ideal system, you should be able to write to every location in the array within a single cycle (no looping). The same goes for copying or transforming one array with another. These types of operations are not easy to scale automatically using cached multicore processors while maintaining a balanced load. The compiler would have to know in advance how many cores will be available at the time of execution.

In conclusion, let me reiterate that the use of caches and multiport memory is not the solution to the memory bandwidth problem because linear scaling becomes prohibitive in big, high-dependency applications.

Louis Savain said...

James,

COSA programs remain in memory for as long as they are needed. So yes, they should be locked in RAM once loaded. However, it is up to the program to load and unload its modules, not the operating system. I believe that OS-managed virtual memory is unreliable because it not only plays havoc with timing, it is also not user-friendly.

A dormant COSA module is one that does not respond to incoming signals. This is done for performance purposes. It must remain in memory so that it can be instantly revived whenever necessary.

I believe that RAM storage is not an issue for future applications. If there is insufficient RAM, then the application should not be loaded.

I wrote an article in October of last year on the subject of prioritization. See link below:

COSA: A New Kind of Programming, Part VI

James said...

I was under the impression that a COSA object would include a relatively small amount of data which would be loaded with its logic, and that programs would be composed from these objects rather than from objects which act on blocks of data.

I thought the two main purposes of COSA after speed were first to remove the need for the explicit type definitions necessary in the current model, and secondly to add implicit persistence to objects in RAM and, optionally, implicit persistence on disk. This frees programmers from the job* of datatype matching, which in COSA is implicit or, dare I say, no longer necessary.

(*and their jobs)

For example, a game would be composed of a few million COSA objects spanning 500 GB of hard disk. The core objects which interact with the user and hardware would be locked in RAM, but the objects which compose the game world would have to be paged by a fundamental feature of the COSA machine.

If I've misunderstood and you think the distinction between program and data should remain, then could you please explain how a parallel program would interface with a block of typed data? I can't see how this could be achieved other than by emulating a traditional program.