Tuesday, August 26, 2008

Transforming the TILE64 into a Kick-Ass Parallel Machine, Part IV

Part I, II, III, IV

Abstract

Previously in this series, I suggested that Tilera should adopt the COSA software model and I argued that it should dump the sequential processor core it is currently using for its TILE64™ multicore processor in favor of a pure MIMD vector core. I argued that the number of functional units for every instruction should reflect the usage statistics for that instruction. This would increase performance while lowering energy consumption. I went over instruction cache optimization and the importance of the COSA heartbeat, a global synchronization mechanism that enforces deterministic processing, a must for software reliability and security. In today’s post (the last in this series), I will talk about automatic load balancing and data cache optimization. As it turns out, Tilera’s iMesh™ technology is ideally suited for both tasks. Please read the previous posts in the series before continuing.

Automatic Load Balancing

I have always maintained that an application developer should never have to worry about load balancing and scalability. Changing one's processor to a more powerful model with more cores should automatically increase the performance of existing parallel applications without modification. The self-balancing mechanism must be part of the processor's hardware because it has to respond quickly to changing application loads, and it has to do so with high precision in order to optimize bandwidth and minimize energy consumption.

Multiple Caches

Load balancing would not be so bad if all the cores could share a single instruction cache. Then it would simply be a matter of every core fetching and processing as many instructions as possible until the current instruction buffer is empty and then going on to the next buffer. The picture I have in mind, as a metaphor, is that of several farm animals drinking water from a single trough at the same time. Unfortunately, we can't do it this way in the TILE64: 64 cores accessing the same cache would quickly create a performance-killing bottleneck. It is better for every core to have its own cache. Besides, with the kind of pure vector core architecture that I am calling for, the core bandwidth would be at least an order of magnitude greater than, say, that of an x86, a MIPS, or an ARM core. In fact, it would be faster overall than existing GPU cores because it could handle both general-purpose and graphics programs with equal ease and performance.
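To make the trough metaphor concrete, here is a toy software model of the single-cache scheme: every core pulls work from one shared buffer, and a single lock stands in for the contention that kills the scheme at 64 cores. This is a minimal sketch with invented names and an 8-core stand-in for the 64; it does not describe any actual Tilera mechanism.

```python
# Toy model of the "single trough" scheme: all cores drain one shared
# instruction buffer, serialized by a single lock (the bottleneck).
import threading
from collections import deque

shared_buffer = deque(range(1000))   # pending instructions (toy payloads)
lock = threading.Lock()              # the serialization point every core hits

def core(core_id, executed):
    while True:
        with lock:                   # every fetch contends for the same cache
            if not shared_buffer:
                return
            shared_buffer.popleft()
        executed[core_id] += 1       # "process" the fetched instruction

executed = [0] * 8                   # 8 cores instead of 64, for brevity
threads = [threading.Thread(target=core, args=(i, executed))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("instructions executed per core:", executed)
```

The troughs drain evenly, but only because every drink passes through the same gate, which is exactly what we cannot afford on a 64-core chip.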

iMesh to the Rescue

The problem with multiple caches is that you want every input buffer to hold more or less the same number of instructions (see the previous post for more on instruction buffers). If the buffers are completely independent, good luck keeping them balanced. Fortunately, the TILE64 comes with a fast on-chip network that connects all the cores together on an 8x8 grid. If we return to the earlier metaphor in which an instruction cache is a water trough for farm animals, then the iMesh network is a set of pipes linking the bottoms of the troughs.


Just as gravity and fluid dynamics would keep the water surface at the same height in every trough, our load-balancing mechanism should try to keep the number of instructions in the caches about equal. I say 'about' because, unlike water with its vast number of extremely fine-grained molecules, processor instructions are comparatively coarse-grained.

Fast and Tricky

There are probably several solutions to the load-balancing problem. One that comes to mind calls for a central load balancer that is independent of the instruction fetch mechanism. The balancer would use the iMesh to keep track of how many instructions are in every cache and move them from one cache to another when necessary. Doing this might be a little tricky because we don't want any instruction to stray too far from its core of origin, whose cache is where its data operands most likely reside.
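Here is a rough software sketch of how one balancing step of such a central balancer might go: it shifts instructions from the fullest buffer to the emptiest one, refusing any move that would take an instruction more than a few hops from its core of origin. The grid size matches the TILE64's 8x8 mesh, but the hop limit and the (instruction, origin) representation are assumptions of mine, not anything Tilera has published.

```python
# Sketch of a central balancer: move work from the fullest buffer to the
# emptiest one, but never farther than MAX_HOPS from its core of origin,
# where its data operands most likely reside.
GRID = 8          # the TILE64's 8x8 mesh
MAX_HOPS = 2      # assumed locality constraint

def hops(a, b):
    """Manhattan distance between cores a and b on the mesh."""
    return abs(a % GRID - b % GRID) + abs(a // GRID - b // GRID)

def rebalance(buffers):
    """One step; buffers[i] is a list of (instruction, origin_core) pairs."""
    fullest = max(range(len(buffers)), key=lambda i: len(buffers[i]))
    emptiest = min(range(len(buffers)), key=lambda i: len(buffers[i]))
    while len(buffers[fullest]) - len(buffers[emptiest]) > 1:
        # Pick an instruction whose origin stays close to the target core.
        movable = next((item for item in buffers[fullest]
                        if hops(item[1], emptiest) <= MAX_HOPS), None)
        if movable is None:
            break             # nothing can move without breaking locality
        buffers[fullest].remove(movable)
        buffers[emptiest].append(movable)

# Example: buffer 0 is overloaded with instructions that originated on core 9.
buffers = [[("op%d" % k, 9) for k in range(6)]] + [[] for _ in range(63)]
rebalance(buffers)
print([len(b) for b in buffers[:4]])   # the load has been spread out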

Another solution is to place a mini-balancer in every core. It would have access only to the caches of its nearest neighbors and would use that information to decide whether or not to exchange instructions. This too could be tricky because of the danger of falling into never-ending back-and-forth oscillations, which are undesirable, to say the least. Whichever method one chooses, it must be fast: it has to perform its work during instruction prefetch and be done by the time the cores are ready to process the instructions.
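A comparable sketch of the per-core mini-balancer follows, with two simple guards against the oscillation problem: a core only acts when the imbalance with a neighbor exceeds a threshold, and it moves at most half of the difference, so the receiving buffer never ends up fuller than the sender and bounces the work straight back. The threshold value is an assumption for illustration, and the locality bookkeeping from the previous sketch is omitted to keep this one short.

```python
# Sketch of a per-core mini-balancer: each core looks only at its nearest
# neighbors and sheds work when the imbalance exceeds THRESHOLD, moving at
# most half of the difference to damp out back-and-forth oscillations.
GRID = 8
THRESHOLD = 4     # hysteresis: ignore small imbalances (assumed value)

def neighbors(core):
    """Yield the nearest neighbors of a core on the 8x8 mesh."""
    x, y = core % GRID, core // GRID
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < GRID and 0 <= y + dy < GRID:
            yield (y + dy) * GRID + (x + dx)

def mini_balance(buffers, core):
    """One prefetch-time pass for a single core's mini-balancer."""
    for n in neighbors(core):
        diff = len(buffers[core]) - len(buffers[n])
        if diff > THRESHOLD:
            for _ in range(diff // 2):     # at most half of the difference
                buffers[n].append(buffers[core].pop())

# Example: core 0 starts with all the work and sheds it to cores 1 and 8.
buffers = [list(range(12))] + [[] for _ in range(63)]
mini_balance(buffers, 0)
print(len(buffers[0]), len(buffers[1]), len(buffers[8]))
```

Because the sender never drops below the receiver's level, two neighbors cannot hand the same instructions back and forth forever; repeated passes simply diffuse the load across the mesh, much like the water levels in the linked troughs.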

Conclusion

The computer industry is delaying the inevitable: solving the parallel programming problem will require a radical paradigm shift in computing. The big players are all eyeing each other's moves, and none is willing to be the first to make the decisive break from the inadequate computing models of the last century. It seems that the first steps that will trigger the revolution will have to come from an outsider, that is to say, a startup company. In my considered opinion, Tilera is rather well positioned, strategically, to make a winning move and take the computer world by storm. The writing is on the wall.

In a future article, I would like to describe some of the architectural options available to the designer of a pure MIMD vector processor.

Next: Heralding the Impending Death of the CPU

Related Articles:
How to Solve the Parallel Programming Crisis
Tilera’s TILE64: The Good, the Bad and the Possible, Part I, II, III

1 comment:

neotoy said...

One more comment after finishing the post. At the bottom you note:

"In a future article, I would like to describe some of the architectural options available to the designer of a pure MIMD vector processor."

I'd really like to see that article because I do a lot of 3D work in Second Life, and I'd love to experiment with designing some prototypes using the ideas expressed in the COSA project.