Wednesday, August 20, 2008

Transforming the TILE64 into a Kick-Ass Parallel Machine, Part II

Part I, II, III, IV

Abstract

In the previous post I proposed that Tilera Corporation should abandon the sequential architecture for the cores used in its TILE64™ processor and adopt a pure MIMD vector architecture in which every instruction is a vector. The reason is that, in the COSA software model, instructions are not presented to the processor in a sequence but as an array of independent parallel elements. When you stop to think about it, what are sequential cores doing in a parallel processor in the first place? It does not make sense. In this post, I will explain how to optimize a pure MIMD vector processor for the best possible performance and energy saving.

Superscalar Architecture with a Twist

One CPU architecture that comes somewhat closer to what I mean by a pure MIMD vector processor is the superscalar architecture pioneered by the legendary Seymour Cray exept that, in the COSA model, there is no need to check for dependencies between instructions because they are guaranteed to have none. In this context, note that Intel and Hewlett Packard have experimented with explicitly parallel instruction computing, a technique that uses a parallelizing compiler to pre-package instructions for parallel execution. This was the basis for the Itanium processor. I think that was a pretty cool idea because it is not only faster, it also simplifies the processor architecture. But why use an imperfect parallelizing compiler to uncover parallelism in sequential programs when COSA programs are inherently parallel to start with? Just a thought.

At any rate, the way I understand it (my readers will correct me if I’m wrong), parallel execution in a superscalar processor is achieved by placing multiple ALUs and/or FPUs on the same die. I feel that this is an unnecessary waste of both energy and silicon real estate because only one functional unit (adder, multiplier, etc.) within any ALU or FPU can be active at any given time. This is true even if instruction pipelining is used.

Vector Optimization

What I am proposing is this; instead of using multiple ALUs and FPUs, it would be best to have a single ALU or FPU but use multiple independent functional units for every type of operation. In order words, you can have two 32-bit integer adders and two 32-bit integer multipliers within the same ALU. However, just doubling or tripling the functional units within an ALU would not serve any purpose that adding more ALUs could not serve. The reason for using multiple functional units is that some instructions are used more often than others. It would be best if the proportion of parallel functional units for a given instruction reflected the usage statistics for that instruction. For example, we might want to have more adders and multipliers than left or right shifters. The advantage of this approach is that the percentage of functional units that are active simultaneously will be higher on average. The effect is to increase overall performance while using much less energy and silicon than would be neccessary otherwise. Based on this approach, I envision a processor core that can easily handle 16 or more independent instructions concurrently. In fact, the entire instruction set could, in theory, be processed concurrently since every instruction is an indenpendent vector. (I could patent some of this stuff, I know.)

In Part III, I will write about instruction caching and the COSA heartbeat.

7 comments:

Josh McDonald said...

I like the idea of "more adders / multipliers if they're used more often" a lot. Not only that, but firmware could monitor utilisation statistics, and report to the user. If your workstation is doing more than the expected amount of "operation X" the system could suggest you add another chip with more of that operation than others. A GPU could be replaced with simply another core that has more of whatever operations are used for your games / rendering software / Vista Aero :)

Louis Savain said...

A GPU could be replaced with simply another core that has more of whatever operations are used for your games / rendering software / Vista Aero :)

This is a great idea. Custom tailored processors! Indeed, why buy capabilities that are not needed? It would also be nice if the functional vector units could be turned on and off individually. This would make it possible for the processor to tailor overall performance to the needs of the applications that are running. Every application should, in principle, be able to tell the OS of its precise perfomance requirements in terms of functional load.

Fly Mulu said...

That can actually be more elegantly solved using an FPGA, by reprogramming the array in real time to release unnecessary functional units and use the freed gates to create more of the "other" units that are requred in the system at any given time. Granted, FPGAs are slower then "hardwired" dies, but unlike CPUs they are truly universal and reconfigurable.

By the way of FPGAs: did you ever look at JHDL? (http://www.jhdl.org/)

They do some of the graphic modelling, simulation, runtime reprogramming, etc. stuff that we are talking about here. They transform the graphical representation into Java or netlists, depending on whether you want to run a simulation or run it directly on an FPGA board.
Might be an interesting toy to play with.

Wes Felter said...

What you are proposing is similar to the U Texas TRIPS core, although TRIPS uses dataflow dependencies rather than a global barrier to synchronize instructions.

A 16-issue core is going to be much larger than the current Tilera core, so you won't be able to fit 64 of them on a chip. Keep this in mind in your performance estimates.

Louis Savain said...

wes felter wrote: A 16-issue core is going to be much larger than the current Tilera core, so you won't be able to fit 64 of them on a chip. Keep this in mind in your performance estimates.

Thanks for the comment. Well, as I explained in the article, you can have the performance of a 16-vector core without having to implement 16 units for every instruction. The reason is that some instructions are not used as often as others.

Frankly I don't see why that fitting 64 of my vector cores on a die would be a problem. Tilera is planning to have hundreds of their core on a single die. Besides, Intel has already proven that it can be done with Larrabee.

Louis Savain said...

Fly Mulu, I like your FPGA idea. Keep in touch.

Victor said...

well I dont know much but you can construct any function at the level of AND, OR and NOTs, as I imagine a really really big array of such components wich are very simple in internal structure MIGHT be connected or interconected "on the fly" by switching some fets in between them, you might control thousands of conections by probably some fewer amounts of lines, lets say you have a 10 by 10 switches (100) that might be governed by maybe 20 lines, some to do with log2 algebra.

Dont know if fpgas work this way in the low level, a matrix of logical gates commanded its logical function by a few switchings.

If you do an array of an array each array might be rewired "on the fly" maybe in a pair of clock cycles to perform another logical function, u might even "create on the fly" a 32 or 64 bits processor if u have enough logical gates, sounds really crazy. According to time lapses or some nice thought algorithm you might be controlling what function does wich portion of a really big array, after all and this is fundamental every thinkable processor are pure logical gates at the lowest logical level.