Friday, September 5, 2008

The Radical Future of Computing, Part II

Part I, II


This post is a continuation of my response to reader Marc’s interesting comments at the end of my recent article, Heralding the Impending Death of the CPU.

The Market Wants Speed and Low Energy Consumption

The microprocessor market is also highly fragmented between cheap low-end processor makers like Microchip and Atmel, and desktop makers. The desktop players have their own mindset that has made them successful in the past. The obviously and easily parallelizable tasks (sound, graphics...) are so common that custom parallel processors were designed for them. You might be able to get Microchip to squeeze 20 16F84 microcontrollers onto one piece of silicon, and you could easily use a bunch of cheap PICs to emulate 20 vector processors with current technology at a chip cost of maybe $100. But then, the optimum bus design would vary with the application.

What application would be most compelling to investors? I don't know... But I think an FPGA or multi-PIC proof of concept would help your idea get implemented at low cost. Software that suggests how to parallelize applications, aimed at sequentially-thinking programmers and combined with a parallel processor emulator for conventional chip architectures, would help programmers see parallel programming as an approachable solution instead of a venture capitalist buzzword.

Well, I am not so sure that this would attract the people with the money. I sense that, when it comes to processors, people are more impressed with proven performance than anything else. And, nowadays, people also want low energy usage to go with the speed. Sure, it would be cool if I could demonstrate a powerful parallel programming tool, but it would be an expensive thing to develop and it would not prove the superiority of the target processor. What I would like to deliver, as an introduction, is a low-wattage, general-purpose, single-core processor that is several times more powerful (measured in MIPS) than, say, an Intel or AMD processor with four or more cores. I think I can do it using vector processing. This, too, is not something that can be built cheaply, in my estimation. It must be designed from scratch.

SIMD Vector Processor: Who Ordered That?

At this point in the game, there should be no doubt in anyone’s mind that vector processing is the way to go. As GPUs have already amply demonstrated, vector processing delivers both high performance and fine-grain deterministic parallelism. Nothing else can come close. That multicore vendors would want to use anything other than a vector core is an indication of the general malaise and wrongheadedness that have gripped the computer industry. As everyone knows, multithreading and vector processing are incompatible approaches to parallelism. For some unfathomable reason that will keep future psycho-historians busy, the computer intelligentsia cannot see past multithreading as a solution to general purpose parallel computing. That's too bad because, unless they change their perspective, they will fall by the wayside.

When I found out that Intel was slapping x86 cores laced together with SIMD vector units in their upcoming Larrabee GPU, I could not help cringing. What a waste of good silicon! The truth is that the only reason that current vector processors (GPUs) are not suitable for general-purpose parallel applications is that they use an SIMD (single instruction, multiple data) configuration. This is absurd in the extreme, in my opinion. Why SIMD? Who ordered that? Is it not obvious that what is needed is an MIMD (multiple instruction, multiple data) vector core? And it is not just that fine-grain MIMD would be ideal for general-purpose parallel applications; it would do wonders for graphics processing as well. Why? Because (correct me if I’m wrong) it happens that many times during processing, a bunch of SIMD vector units will sit idle because the program calls for only a few units (one instruction at a time) to be used on a single batch of data. The result is that the processor is underutilized. Wouldn't it be orders of magnitude better if other batches of data could be processed simultaneously using different instructions? Of course it would, if only because the parallel performance of a processor is directly dependent on the number of instructions that it can execute at any given time.
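To make the underutilization argument concrete, here is a minimal toy model in Python. It is only a sketch under loud assumptions: the batch list, the unit count and the function names are all hypothetical, the batches are assumed to be independent of one another, and the "MIMD" case is an idealized one in which any mix of instructions can share a cycle. It does not model Larrabee or any real GPU.

```python
from math import ceil

# Hypothetical workload: (instruction, number of data elements in the batch).
# These numbers are invented for illustration only.
batches = [("add", 4), ("mul", 2), ("cmp", 1), ("add", 1)]
UNITS = 8  # assumed number of vector units


def total_work(batches):
    """Total number of element operations across all batches."""
    return sum(size for _, size in batches)


def simd_cycles(batches, units):
    """SIMD: one instruction per cycle, so every batch needs its own cycles."""
    return sum(ceil(size / units) for _, size in batches)


def mimd_cycles(batches, units):
    """Idealized MIMD: different instructions may share a cycle,
    so only the total amount of work limits the cycle count."""
    return ceil(total_work(batches) / units)


def utilization(batches, units, cycles):
    """Fraction of unit-cycles that did useful work."""
    return total_work(batches) / (cycles * units)


simd = simd_cycles(batches, UNITS)
mimd = mimd_cycles(batches, UNITS)
print("SIMD:", simd, "cycles, utilization", utilization(batches, UNITS, simd))
print("MIMD:", mimd, "cycles, utilization", utilization(batches, UNITS, mimd))
```

With these made-up numbers, the SIMD model needs 4 cycles at 25% utilization, while the idealized MIMD model needs 1 cycle at 100%. The gap depends entirely on how small the batches are relative to the number of units, and real workloads with data dependencies would not pack this neatly.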

MIMD Vector Processing Is the Way to Go

Most of my readers know that I absolutely abhor the multithreading approach to parallelism. I feel the same way about CPUs. A day will come soon when the CPU will be seen as the abomination that it always was (see Heralding the Impending Death of the CPU for more on this topic). However, SIMD vector processors are not the way to go either, even if they have shown much higher performance than CPUs in limited domains. It is not just that they lack universality (an unforgivable sin, in my view); the underutilization problem that is the bane of the SIMD model will also only get worse when future vector processors are delivered with thousands or even millions of parallel vector units. The solution, obviously, is to design and build pure MIMD vector processors. As I explained in a previous article on Tilera’s TILE64, the best way to design an MIMD vector processor is to ensure that the proportion of vector units for every instruction reflects the overall usage statistics for that instruction. This would guarantee that a greater percentage of the units are used most of the time, which would, in turn, result in much lower power consumption and greater utilization of the die’s real estate for a given performance level. Of course, a pure MIMD vector core is useless unless you also have the correct parallel software model to use it with, which is what COSA is all about.
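The proportional allocation idea can be sketched in a few lines of Python. This is a hedged illustration, not a design: the usage counts, the total of 64 units and the function name allocate_units are all hypothetical, and the largest-remainder rounding is simply one reasonable way to turn fractional shares into whole units.

```python
def allocate_units(usage, total_units):
    """Split total_units among instruction types in proportion to their
    usage counts, using largest-remainder rounding (an assumed choice)."""
    total_usage = sum(usage.values())
    # Exact (fractional) share of units for each instruction type.
    exact = {op: total_units * count / total_usage for op, count in usage.items()}
    # Start from the rounded-down shares.
    alloc = {op: int(share) for op, share in exact.items()}
    leftover = total_units - sum(alloc.values())
    # Give the remaining units to the largest fractional remainders.
    by_remainder = sorted(usage, key=lambda op: exact[op] - alloc[op], reverse=True)
    for op in by_remainder[:leftover]:
        alloc[op] += 1
    return alloc


# Hypothetical instruction usage statistics (invented for illustration).
usage = {"add": 50, "mul": 30, "cmp": 15, "div": 5}
allocation = allocate_units(usage, 64)
print(allocation)
```

With these invented statistics, the 64 units split into 32 for add, 19 for mul, 10 for cmp and 3 for div. In practice the usage statistics would have to be measured across representative programs, and a rarely used instruction could end up with very few units or none, which a real design would need to handle.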

Have you looked at any Opencores designs?
No, I haven’t. The open source issue is very interesting but it opens a whole can of worms that I'd better leave for a future article.

1 comment:

manubot said...

Hi Louis,

I have recently begun to read your blog and I have found your ideas quite interesting. I agree with many of them and I hope you can soon put them at the service of the computing industry (so we can all have a better computing world ;) ).

I have been working for some time in parallel computing (FPGAs first and then GPUs) and now I am doing some research on architectures that will have great future potential. In that respect, I have to thank you for the Tilera analysis, which I found very illustrative.

I would like to know your opinion about a processor I have discovered recently. It is called a Software Configurable Processor and there is a company, Stretch Inc, that manufactures it. It seems quite a nice idea, as the parallelism is done at the instruction level. In addition, the processors can be arranged in an array to cooperate. Have you had the chance to take a look at that architecture? How does it compare with your ideas? Does the configurable unit (ISEF, they call it) have any similarity with your COSA Core?

Thanks again for all the information you share with all of us!!