Part I, II, III, IV
In the previous post I proposed that Tilera Corporation should abandon the sequential architecture for the cores used in its TILE64™ processor and adopt a pure MIMD vector architecture in which every instruction is a vector. The reason is that, in the COSA software model, instructions are not presented to the processor in a sequence but as an array of independent parallel elements. When you stop to think about it, what are sequential cores doing in a parallel processor in the first place? It does not make sense. In this post, I will explain how to optimize a pure MIMD vector processor for the best possible performance and energy saving.
Superscalar Architecture with a Twist
One CPU architecture that comes somewhat closer to what I mean by a pure MIMD vector processor is the superscalar architecture pioneered by the legendary Seymour Cray exept that, in the COSA model, there is no need to check for dependencies between instructions because they are guaranteed to have none. In this context, note that Intel and Hewlett Packard have experimented with explicitly parallel instruction computing, a technique that uses a parallelizing compiler to pre-package instructions for parallel execution. This was the basis for the Itanium processor. I think that was a pretty cool idea because it is not only faster, it also simplifies the processor architecture. But why use an imperfect parallelizing compiler to uncover parallelism in sequential programs when COSA programs are inherently parallel to start with? Just a thought.
At any rate, the way I understand it (my readers will correct me if I’m wrong), parallel execution in a superscalar processor is achieved by placing multiple ALUs and/or FPUs on the same die. I feel that this is an unnecessary waste of both energy and silicon real estate because only one functional unit (adder, multiplier, etc.) within any ALU or FPU can be active at any given time. This is true even if instruction pipelining is used.
What I am proposing is this; instead of using multiple ALUs and FPUs, it would be best to have a single ALU or FPU but use multiple independent functional units for every type of operation. In order words, you can have two 32-bit integer adders and two 32-bit integer multipliers within the same ALU. However, just doubling or tripling the functional units within an ALU would not serve any purpose that adding more ALUs could not serve. The reason for using multiple functional units is that some instructions are used more often than others. It would be best if the proportion of parallel functional units for a given instruction reflected the usage statistics for that instruction. For example, we might want to have more adders and multipliers than left or right shifters. The advantage of this approach is that the percentage of functional units that are active simultaneously will be higher on average. The effect is to increase overall performance while using much less energy and silicon than would be neccessary otherwise. Based on this approach, I envision a processor core that can easily handle 16 or more independent instructions concurrently. In fact, the entire instruction set could, in theory, be processed concurrently since every instruction is an indenpendent vector. (I could patent some of this stuff, I know.)
In Part III, I will write about instruction caching and the COSA heartbeat.