Monday, March 31, 2008

Why Parallel Programming Is So Hard

The Parallel Brain

The human brain is a super parallel signal-processing machine and, as such, it is perfectly suited to the concurrent processing of huge numbers of parallel streams of sensory and proprioceptive signals. So why is it that we find parallel programming so hard? I will argue that it is not because the human brain finds it hard to think in parallel, but because what passes for parallel programming is not parallel programming in the first place. Switch to a true parallel programming environment and the problem will disappear.

Fake Parallelism

What is the difference between a sequential program and a parallel program? A sequential program is an algorithm or a list of instructions arranged in a specific order such that predecessors and successors are implicit. Is there such a thing as a parallel algorithm? In my opinion, the term ‘parallel algorithm’ is an oxymoron because an algorithm, at least as originally defined, is a sequence of steps. There is nothing parallel about algorithms, whether they are running concurrently on a single processor or on multiple processors. A multithreaded application consists of multiple algorithms (threads) running concurrently. Other than the ability to share memory, this form of parallelism is really no different from multiple communicating programs running concurrently on a distributed network. I call it fake parallelism.

True Parallelism

In a truly parallel system, all events are synchronized to a global clock so that they can be unambiguously identified as being either concurrent or sequential. Synchronization is an absolute must in a deterministic parallel system; otherwise events quickly get out of step and inferring temporal correlations becomes nearly impossible. Note that ‘synchronous processing’ is not synonymous with ‘synchronous messaging’. A truly parallel system must use asynchronous messaging; otherwise the timing of events becomes chaotic and unpredictable. The human brain is a temporal signal-processing network that needs consistent temporal markers to establish correlations. While single-threaded programs provide adequate temporal (sequential) cues, concurrent threads are non-deterministic; temporal cues across threads are therefore hard to establish, which leads to confusion. See also Parallel Programming: Why the Future Is Synchronous for more on this subject.

It is helpful to view a computer program as a communication system in which elementary processes send signals to, and receive signals from, one another. In this light, immediately after execution, an operation (predecessor) in an algorithm sends a signal to the next operation (successor) in the sequence meaning essentially, 'I'm done; now it's your turn'. Whereas in an algorithmic program every element or operation is assumed to have only one predecessor and one successor, in a parallel program there is no limit to the number of predecessors or successors an element can have. This is the reason that sequential order must be explicitly specified in a parallel program. Conversely, concurrency is implicit, i.e., no special construct is needed to specify that two or more elements are to be executed simultaneously.
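
To make the signal-based view concrete, here is a rough sketch in ordinary Python (my own illustration with made-up names, not code from any actual system): each element can signal any number of successors when it finishes, so sequence has to be wired explicitly, while concurrency requires no special construct at all.

```python
class Element:
    """A signal-driven operation with any number of predecessors and successors."""
    def __init__(self, name, action):
        self.name = name
        self.action = action
        self.successors = []        # explicit sequencing: who receives the 'done' signal

    def then(self, *successors):
        self.successors.extend(successors)
        return self

    def fire(self):
        self.action()
        return self.successors      # signal every successor: "I'm done; now it's your turn"

# The two loads have no mutual ordering, so they are implicitly concurrent;
# the add runs only after it has been signaled.
load_a = Element("load_a", lambda: print("load a"))
load_b = Element("load_b", lambda: print("load b"))
add = Element("add", lambda: print("add"))
load_a.then(add)
load_b.then(add)

ready = [load_a, load_b]            # both may fire in the same step
while ready:
    # A successor signaled by several predecessors fires once per step here.
    ready = list({succ for element in ready for succ in element.fire()})
```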

Composition vs. Decomposition

The common wisdom in the industry is that the best way to write a parallel program is to break an existing sequential program down into multiple threads that can be assigned to separate cores in a multicore processor. Decomposition, it seems, is what the experts are recommending as the correct method of parallelization. However, this raises a couple of questions. If composition is the proper method of constructing sequential programs, why should parallel programs be any different? In other words, if we use sequential elements or components to build a sequential program, why should we not use parallel elements or components to build parallel programs? If the compositional approach to software construction is known to work for sequential programs, it follows that the same approach should work for parallel software construction. It turns out that signal-based parallel software lends itself well to the use of plug-compatible components that can snap together automatically. Composition is natural and easy. Decomposition is unnatural and hard.

Conclusion

In conclusion, the reason that parallel programming is hard is that it is not what it is claimed to be. As soon as parallel applications become implicitly parallel, synchronous and compositional in nature, parallel programming will be at least an order of magnitude easier than sequential programming. Debugging is a breeze in a deterministic environment, cutting development time considerably.

[This article is part of my downloadable e-book on the parallel programming crisis.]

See Also:

How to Solve the Parallel Programming Crisis
Why I Hate All Computer Programming Languages
Nightmare on Core Street
Parallel Programming: Why the Future Is Synchronous
Parallel Programming: Why the Future Is Non-Algorithmic
Parallel Programming, Math and the Curse of the Algorithm

Thursday, March 20, 2008

Nightmare on Core Street: No More Excuses

The ‘Not Invented Here’ Crowd

This blog has seen a sharp increase in visitor activity since I began writing my Nightmare on Core Street series. That a high proportion of the visits came from organisations like Intel, AMD, Microsoft, Texas Instruments, Freescale Semiconductor, Mentor Graphics and even UC Berkeley did not escape my notice. Do I expect anything to come out of it? Not at all. I perfectly understand the reluctance of anybody in this business to acknowledge outside contributions. After all, if you are being paid handsomely to work on a solution to a pressing problem, why should you tell your employers that someone else might have already found the solution? The answer is obvious: you shouldn’t, because if you did, you would be putting yourself out of a job. The logical (if not cowardly) thing to do is to keep quiet about it in the hope that your employers will not notice.

All About the Money

I have understood the solution to the parallel programming problem for over twenty years. During all those years, I have tried to explain it to anybody in the business who would listen, but nobody did. And why should they? Everybody was making lots of money and there was no economic reason to rock the boat, so to speak. Besides, I am just a self-taught programmer, a nobody. But, as they say in the American South, I knew all along that, sooner or later, the chickens would come home to roost. Yes, the chickens will soon come home to roost because they sense something bad approaching, something nasty and mean and dangerous. It is all about the money, you see. If the computer industry does not figure out a way to solve its parallel programming crisis real soon, some very rich and not so very rich folks are going to lose a lot of money, real fast. They can’t let that happen.

The Impotent Wizards Of Multicore

What do rich folks do when they have a nasty problem that threatens to take away their source of income? Well, they do what they’ve always done in the past; they get their hands on all the known experts and wizards in the field, throw money at them and tell them to come up with a solution. However, this time around, the wizards are powerless. In fact, they are the cause of the problem. They are the ones who created the problem in the first place. As a case in point, let us consider Dan Reed, the recently hired director of scalable and multicore computing at Microsoft Research. Here is what Reed had to say about the parallel programming problem (source: embedded.com):

Twenty-plus years ago, the research space in parallel computing was looking toward the end of Moore's Law, and so there were bases that were built there to exploit parallelization […] The challenge has been that long-term research had been required to support this. There is no silver bullet there. Some of it is going to be incremental advances; some is going to be new languages.

Any time an expert says things like "there is no silver bullet" or "incremental advances" or "long-term research", you can be absolutely certain that it is a veiled admission of failure, cluelessness and impotence. Besides, "long-term research" implies long-term funding, which is what Reed, after all is said and done, is really concerned about. I have made this point before. Twenty years is a long time in this business. The wizards have been trying for over twenty years to make multithreaded programming easy and they have failed. And they have failed miserably. One would think that, by now, it should have occurred to at least one of them that, maybe, just maybe, threads are not the way to go. Don’t hold your breath. In this business, if something has been good to one’s career in the past, the common wisdom is to hold on to it as much as possible whether or not it works. According to cio.com, “Reed believes that multithreading over time will become ‘part of the skill set of every professional software developer.’” How can anybody still talk about multithreading after spending so many years researching parallel computing? It is enough to make a grown man cry.

No More Excuses

When it comes to understanding the true nature of the parallel programming problem, Daniel Reed of Microsoft Research is as clueless as they come, but he is not alone. The computer science community and the industry are crawling with experts like Reed who make a comfortable living out of not finding solutions to pressing problems. They figure that, as long as the problems persist, and as long as they can convince their employers of the need for continued long-term research, their services will continue to be in demand. They are in what I have called, in the past, the perpetual research project business. The bad news is that, recently, the game has changed abruptly. Now there is a lot of fear and panic in the air. Very big money is at stake and some very important people are beginning to demand actual and dramatic results. It is no longer business as usual. Another twenty years of fruitless long-term research is out of the question. What is a multicore wizard to do?

In my opinion, the experts (they know who they are) no longer have any excuses. Many of them have visited my blog and the Project COSA site in the last few days and many times over the years. They know what COSA is all about. At this stage of the game, they have only one course of action: either they acknowledge that the COSA software model is the answer to the parallel programming and multicore problem, or they will all be fired for incompetence. It is 'shape up or ship out' time. Twenty years of failure is enough and the free ride will come to an end sooner or later. And from the looks of it, it's going to be sooner rather than later. As they say, money talks and bullshit walks. In the meantime, I’ll be waiting patiently for the knock on my door. I can wait.

As always, I tell it like I see it.

Tuesday, March 11, 2008

Nightmare on Core Street, Part V

The COSA Software Model

Part I, II, III, IV, V

Recap

In the previous post, I wrote that the reason the computer industry’s multicore strategy will not work is that it is based on multithreading, a technique that was never intended to be the basis of a parallel software model but merely a mechanism for executing multiple sequential (not parallel) algorithms concurrently. I am proposing an alternative model that is inherently non-algorithmic, deterministic and parallel. It is called the COSA software model and it incorporates the qualities of both MIMD and SIMD parallelism without their shortcomings. The initial reason behind COSA was to solve one of the most pressing problems in computer science today, software unreliability. As it turns out, COSA addresses the parallel programming problem as well.

The COSA Model

Any serious attempt to formulate a parallel software model would do well to emulate parallel systems in nature. One such system is a biological neural network. Imagine an interconnected spiking (pulsed) neural network. Each elementary cell (neuron) in the network is a parallel element or processor that waits for a discrete signal (a pulse or spike) from another cell or a change in the environment (event), performs an action (executes an operation) on its environment and sends a signal to one or more cells. There is no limit to the number of cells that can be executed simultaneously. What I have just described is a behaving system, i.e., a reactive network of cells that use signals to communicate with each other. This is essentially what a COSA program is. In COSA, the cells are the operators; these can be either effectors (addition, subtraction, multiplication, division, etc.) or sensors (comparison or logic/temporal operators). The environment consists of the data variables and/or constants. Below is an example of a COSA low-level module that consists of five elementary cells.


Alternatively, a COSA program can be viewed as a logic circuit with lines linking various gates (operators or cells) together. Indeed, a COSA program can potentially be turned into an actual electronic circuit. This aspect of COSA has applications in exciting future computing technologies like the one being investigated by the Phoenix Project at Carnegie Mellon University. The main difference between a COSA program and a logic circuit is that, in COSA, there is no signal racing. All gates are synchronized to a global virtual clock and signal travel times are equal to zero, i.e., they occur within one cycle. A global clock means that every operation is assumed to have equal duration, one virtual cycle. The advantage of this convention is that COSA programs are 100% deterministic, meaning that the execution order (concurrent or sequential) of the operations in a COSA program is guaranteed to remain the same. Temporal order determinism is essential for automated verification purposes, which, in turn, leads to rock-solid reliability and security.

The COSA Process Emulation

Ideally, every COSA cell should be its own processor, like a neuron in the brain or a logic gate. However, such a super-parallel system must await future advances. In the meantime, we are forced to use one or more very fast processors to do the work of multiple parallel cells. In this light, a COSA processor (see below) should be seen as a cell emulator. The technique is simple and well known. It is used in neural networks, cellular automata and simulations. It is based on an endless loop and two cell buffers. Each pass through the loop represents one cycle. While the processor is executing the cells in one buffer, the downstream cells to be processed during the next cycle are appended to the other buffer. As soon as all the cells in the first buffer are processed, the buffers are swapped and the cycle begins anew. Two buffers are used in order to prevent the signal racing conditions that would otherwise occur. The sketch below shows the idea in code.
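
Here is a minimal sketch of that two-buffer technique in ordinary Python. The cell structure and names are my own placeholders, not actual COSA constructs; the point is only to show how one pass through the loop plays the role of one virtual cycle and how the buffer swap prevents signal racing within a cycle.

```python
class Cell:
    """Placeholder for a COSA-style cell: one operation plus its downstream targets."""
    def __init__(self, name, action, targets=None):
        self.name = name
        self.action = action            # the effector or sensor operation
        self.targets = targets or []    # downstream cells to signal

def run(start_cells, max_cycles=100):
    current, nxt = list(start_cells), []
    cycle = 0
    while current and cycle < max_cycles:
        for cell in current:            # execute every cell activated this cycle
            cell.action()
            nxt.extend(cell.targets)    # signals take effect next cycle only
        current, nxt = nxt, []          # swap buffers: one virtual cycle has elapsed
        cycle += 1

# Three cells wired in sequence: each fires exactly one cycle after its predecessor,
# so the execution order is the same on every run.
c3 = Cell("store", lambda: print("cycle 3: store"))
c2 = Cell("add", lambda: print("cycle 2: add"), targets=[c3])
c1 = Cell("load", lambda: print("cycle 1: load"), targets=[c2])
run([c1])
```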

The COSA Processor

As seen above, we already know how to emulate deterministic parallel processes in software and we can do it without the use of threads. It is not rocket science. However, using a software loop to emulate parallelism at the instruction level would be prohibitively slow. For performance purposes, the two buffers should be integrated into the COSA processor and the cell emulation performed by the processor. In a multicore processor, each core should have its own pair of buffers. I previously began to write a series of blog articles on how to build a self-balancing, fine-grain multicore processor based on the COSA model. I did not finish the series due to business considerations.

Comparison Between COSA and Other Parallel Software Models

I plan to go over each item in the table above in detail in a future article. Let me just mention here that ease of programming is one of the better attributes of COSA. The reason is that programming in COSA is graphical and consists almost entirely of connecting objects together. Most of the time, all that is necessary is to drag the object into the application. High-level objects are plug-compatible and know how to connect themselves automatically.



The figure above is an example of a COSA high-level module under construction. Please take a look at the Project COSA web page for further information.

Related articles:

How to Solve the Parallel Programming Crisis
Parallel Computing: The End of the Turing Madness
Parallel Programming: Why the Future Is Non-Algorithmic
Parallel Programming: Why the Future Is Synchronous
Parallel Computing: Why the Future Is Reactive
Why Parallel Programming Is So Hard
Parallel Programming, Math and the Curse of the Algorithm
The COSA Saga

PS. Everyone should read the comments at the end of Parallel Computing: The End of the Turing Madness. Apparently, Peter Wegner and Dina Goldin of Brown University have been ringing the non-algorithmic/reactive bell for quite some time. Without much success, I might add; otherwise there would be no parallel programming crisis to speak of.

Saturday, March 8, 2008

Nightmare on Core Street, Part IV

Gambling on Threads

Part I, II, III, IV, V

Recap

In Parts I, II and III of this five-part article, I wrote that the computer industry is in a panic because there is no easy way to program multicore processors. I also went over the advantages and disadvantages of MIMD, SIMD and heterogeneous multicore architectures. In my opinion, what is needed is a new programming model/processor architecture that combines the strengths of both SIMD and MIMD while eliminating their weaknesses. I am proposing a universal parallel computing model that uses fine-grain, deterministic parallelism in an MIMD configuration to handle anything you can throw at it. I will describe what I have in mind in Part V. In this post, I want to go over the reasons that I think the computer industry’s current multicore strategy will turn out to be a colossal failure.

High Stakes Gamble on the Future of Parallel Computing

The Chief Software Evangelist for Intel, James Reinders, believes that the key to successful parallel programming centers on scaling, debugging and future-proofing (source: Dr. Dobb’s Journal). To that list I would add automatic load balancing and ease of programming. Of course, being an Intel evangelist and the author of the new book “Intel Threading Building Blocks”, Reinders is necessarily biased. When he says, “think parallel”, what he means is, “think threads”. Reinders strongly advises programmers to use threaded libraries or, better yet, Intel’s own Threading Building Blocks. He is pushing the use of threads because Intel’s multicore processors are pretty much useless for non-threaded applications. The same goes for AMD and other multicore vendors. These guys are so desperate they’re giving all their code away, free.

Let’s say you decided to listen to the industry pundits and you painstakingly rewrote your entire legacy code to use multiple threads and got it stable enough to be useful. That would make Intel and AMD very happy and your code would indeed run faster on their multicore systems. But what if they are wrong about the future of parallel programming? What if (horror of horrors) the future is not multithreaded? Would all your time and effort have been in vain? The answer is yes, of course. All right, I am not trying to be an alarmist for the hell of it. I am trying to drive home an important point, which is this: Intel and AMD have already made the fateful decision to go the multithreading route and they are now irreversibly committed to it. If their gamble does not win out, it will undoubtedly mean tens of billions of dollars in losses for them and their customers. That would be a disaster of enormous proportions and they know it. This is what Reinders really means by ‘future proofing’. He is more concerned about future-proofing Intel's multicore CPUs than anything else. So if you listen to evangelists like Reinders (or AMD's Chuck Moore), you do so at your own risk because the sad reality (see below) is that the industry does not have a clue as to the real future of parallel computing. There’s something sadly pathetic about the blind leading the blind and both falling into the same ditch.

Persistent Problem

The parallel programming problem is not a new one. It has been around for decades. In a GCN article titled “The Multicore Challenge”, Cray J. Henry, the director of the U.S. Defense Department’s High Performance Computing Modernization Program, wrote:

The challenge of writing parallel software has been the key issue for the computational science and supercomputing community for the last 20 years. There is no easy answer; creating parallel software applications is difficult and time consuming.

This is rather telling. For two decades, some of the smartest people in the computer research community have been using threads to program super high-performance parallel computers. Even after spending untold zillions of dollars and man-hours on the parallel programming problem, they still have no answer. It is still “difficult and time consuming.” What is wrong with this picture? The answer should be obvious to anybody who is not blind to reality: the academic research community has no idea what the solution might be. They are not any closer to a solution than they were when they started. They have spent twenty long years (an eternity in this business) trying to fit a square peg into a round hole! And, incredibly, they are still at it.

So what makes either Intel or AMD so confident that threads are the way to go? What is the theoretical basis of their multicore strategy? Well, the truth is that they are not confident at all. They have no leg to stand on, really. If they had solved the problem, they would not continue to pour hundreds of millions of dollars into research labs around the globe to find a solution. They are not pushing threads because they think it is the right way to do things. They are pushing threads because they have no alternative. And they have no alternative because they are following in the footsteps of academia, the same people who, in my opinion, got everybody into this mess in the first place. It is one more example of the blind leading the blind.

The Hidden Nature of Computing

The way I see it, if computer scientists had started out with the correct computing model (yes, there is such a thing, even with a single-core processor), there would be no crisis at all and you would not be reading this article. Adding more cores to a processor would have been a relatively painless evolution of computer technology, a mere engineering problem. Unfortunately, the entire computer industry still has the same conception of what a computer should be that Charles Babbage and Lady Ada Lovelace had one hundred and fifty years ago! Is it any wonder that we are in a crisis? To solve the parallel programming problem, the industry first needs to understand the true nature of computing and then reinvent the computer accordingly, both software and hardware. There are no two ways about it. And the longer they wait to wake up and realize their folly, the worse the problem is going to get.

The true nature of computing has nothing to do with universal Turing machines or the Turing computability model or any of the other stuff that academics have intoxicated themselves with. I have already written plenty about this subject elsewhere and it does not make sense to repeat myself here. Suffice it to say that, as soon as one comes to grips with the true nature of computing, it becomes immediately clear that the multithreaded approach to parallel computing is a complete joke, an abomination even. If you are wise, you will take heed not to listen to the likes of Mr. Reinders or to anybody who goes around selling what I call the threaded snake oil. As the Native Americans used to say, they speak with a forked tongue. :-) Threads are not part of the future of computing. Using threads for so-called future-proofing is a disaster in the making, wishful thinking notwithstanding. Reality can be cruel that way.

There is a way to build self-balancing, MIMD, multicore computers to implement fine-grain, reactive, deterministic parallelism that will not only solve the parallel programming problem but the reliability problem as well. I’ll go over this in my next post. Stay tuned.

Thursday, March 6, 2008

Nightmare on Core Street, Part III

SIMD

Part I, II, III, IV, V

Recap

In Part II of this five-part article I went over the pros and cons of MIMD multicore CPU architectures that are designed to run coarse-grain, multithreaded applications. Current MIMD multicore architectures are an evolutionary step from single core architectures in that they make it easy for existing threaded applications to make the transition to multicore processing without much modification. The bad thing is that multithreaded applications are unreliable and too hard to program and maintain. In addition, coarse-grain parallelism is not well suited to many important types of computations such as graphics and scientific/engineering simulations. Here I describe the advantages and disadvantages of SIMD (single instruction, multiple data) parallelism, also known as data level or vector parallelism.

SIMD

Most multicore processors can be configured to run in SIMD mode. In this mode, all the cores are forced to execute the same instruction on multiple data simultaneously. SIMD is normally used in high-performance computers running scientific applications and simulations. This is great when there is a need to perform a given operation on a large data set, and in situations where programs have low data dependencies, i.e., when the outcome of an operation rarely affects the execution of a succeeding operation.
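
To give a feel for what this looks like from the programmer's side, here is a tiny sketch that uses NumPy's vectorized operations as a stand-in for SIMD-style data parallelism (NumPy is simply an assumption of convenience here, not something the hardware vendors require). One operation is applied across a whole data set and no element depends on any other, which is exactly the situation SIMD loves.

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float32)
b = np.full(1_000_000, 2.0, dtype=np.float32)

# The same multiply-add is applied to every element; since no element depends
# on any other, the elements can be processed in lockstep across SIMD lanes.
c = a * b + 1.0
print(c[:5])
```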

Many graphics processors use SIMD because graphics processing is data intensive. If you have a computer with decent graphics capabilities, chances are that it has a special co-processor that uses SIMD to handle the graphics display. Companies like NVIDIA and ATI (now part of AMD) make and sell SIMD graphics processors. In the last few years, many people in the business have come to realize that these dedicated graphics processors can do more than just handle graphics. They can be equally well suited to non-graphical scientific and/or simulation applications that can benefit from a similar data-flow approach to parallel processing.

The Good

One of the advantages of SIMD processors is that, unlike general-purpose MIMD multicore processors, they handle fine-grain parallel processing, which can result in very high performance under certain conditions. Another advantage is that SIMD processing is temporally deterministic, that is to say, operations are guaranteed to always execute in the same temporal order. Temporal order determinism is icing on the parallel cake, so to speak. It is a very desirable property to have in a computer because it is one of the essential ingredients of stable and reliable software.

The Bad

The bad thing about SIMD is that it is lousy in situations that call for a mixture of operations to be performed in parallel. Under these conditions, performance degrades significantly. Applications that have high data dependencies will also perform poorly. I am talking about situations where a computation is performed based on the outcome of a previous computation. An SIMD processor will choke if you have too many of these. Unfortunately, many applications are like that.
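
By way of contrast with the previous snippet, here is a rough sketch of the kind of data dependency that chokes an SIMD processor (the coefficient and data are arbitrary placeholders): each result needs the previous result, so the elements cannot be computed in lockstep.

```python
def recurrence(x, a=0.5):
    # y[i] depends on y[i-1], so the iterations cannot run in lockstep;
    # a naive SIMD mapping buys nothing here.
    y = [0.0] * len(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = a * y[i - 1] + x[i]
    return y

print(recurrence([1.0, 2.0, 3.0, 4.0]))
```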

Hybrid Processors

The latest trend in multicore CPU design is to mix MIMD and SIMD processing cores on the same die. AMD has been working hard on its Fusion processor, which it plans to release in 2009. Not to be outdone, Intel is quietly working on its own GPU/CPU multicore offering. Indeed, Intel started the trend of mixing graphics and general-purpose cores with its failed MMX Pentium processor a while back. Sony, Toshiba and IBM already have a multicore processor that mixes SIMD and MIMD processing cores on one chip. It is called the Cell processor and it is the processor being shipped with Sony’s PlayStation 3 video game console.

The idea behind these so-called heterogeneous processors is that their designers believe that SIMD and MIMD complement each other’s capabilities, which is true. In addition, having both types of cores on the same chip increases performance because communication between cores is faster since it does not have to use a slow external bus. The problem with hybrid processors, however, is that programming them is extremely painful. In the past, I have compared it to pulling teeth with a crowbar. This is something that the industry is acutely aware of and hundreds of millions of dollars are currently being spent on finding a solution that will alleviate the pain.

Fundamental Flaw

In my opinion, all of the current approaches to multicore parallel processing will fail in the end and they will fail miserably. They will fail because they are fundamentally flawed. And they are flawed because, 150 years after Babbage designed the first general-purpose computer, neither academia nor the computer industry has come to understand the true purpose of a CPU. In Part IV of this series, I will explain why I think the computer industry is making a colossal error that will come back to haunt them. Stay tuned.

Monday, March 3, 2008

Nightmare on Core Street, Part II

MIMD

Part I, II, III, IV, V

Recap

In Part I of this five-part article, I wrote that the computer industry is in a sort of panic, their planned transition from sequential computing to parallel processing having hit a brick wall. The existing multicore architectures are not only hard to program, but most legacy applications also cannot take advantage of the parallelism. Some experts, notably MIT Professor Anant Agarwal, CTO and co-founder of Tilera Corporation, are adamant that the industry needs to come up with a new software model because current thread-based operating systems that use cache coherency snooping will not scale up to the many-core processors of the future (source: EETimes). I agree with Professor Agarwal. In this installment, I will describe the difference between fine-grain and coarse-grain parallelism, the origin of threads and the pros and cons of thread-based MIMD multicore processors. MIMD simply means that the parallel cores can execute different instructions on multiple data simultaneously.

Fine Grain Vs. Coarse Grain

General-purpose multicore processors (the kind built into the newest laptops, desktops and servers) use an MIMD architecture. These processors implement coarse-grain parallelism. That is to say, applications are divided into multiple concurrent modules or threads of various sizes. Each core can be assigned one or more threads so as to share the load; this results in faster processing. By contrast, in fine-grain parallelism, applications are broken down to their smallest constituents, i.e., the individual instructions. Ideally, these instructions can be assigned to separate cores for parallel execution. Fine-grain parallelism is much more desirable than coarse-grain parallelism because it makes it possible to parallelize well-known functions like those used in array sorting or tree searching.

Threads

Threads are a legacy of the early years of electronic digital computing. They stem from an idea called multitasking that was originally invented to eliminate a problem with sequential batch processing. In the old days, multiple programs (jobs) were stacked in a batch and a job controller was used to feed them into the computer to be executed one after the other. This was time consuming and very expensive. Multitasking made it possible for multiple jobs to run simultaneously on the same sequential processor. The rationale is that the processor is so fast that it can switch execution from one concurrent task to another many times a second. Each task would have its own memory space and would behave as if it had the processor entirely to itself. It did not take long for someone to figure out that a single application could be divided into multiple concurrent internal mini-tasks running in the same memory space. These are called threads.
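
Here is a toy sketch of the multitasking idea in Python, with generators standing in for tasks and a list standing in for the scheduler's queue. This is my own illustration of the concept, not how any real operating system is implemented: one sequential processor switches rapidly between tasks, and each task behaves as if it had the machine to itself.

```python
def task(name, steps):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield                       # hand the 'processor' back to the scheduler

def scheduler(tasks):
    while tasks:
        current = tasks.pop(0)      # pick the next task in round-robin order
        try:
            next(current)           # run one time slice of it
            tasks.append(current)   # then put it back at the end of the queue
        except StopIteration:
            pass                    # the task has finished; drop it

scheduler([task("job A", 3), task("job B", 3)])
```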

The Good

Even though multitasking and multithreading were never intended to be a parallel programming model, this is nevertheless the model that most major multicore CPU vendors have embraced. Early on, everybody understood that threads and/or tasks could be divided among multiple parallel cores running concurrently and that this would result in faster processing. In addition, threads provided a direct evolutionary path from single-core to multicore computing without upsetting the apple cart too much, so to speak. Many existing applications that already used threads extensively could make the transition with little effort and programmers could continue to use the same old compilers and languages. Even non-threaded applications could take advantage of the new CPUs as long as several of them could be kept running concurrently on the same multicore computer.

The Bad

The need for programming continuity and for compatibility with the existing code base is the reason that companies like AMD and Intel are trying their best to encourage programmers to use as many threads as possible in their code. There are many problems with threads, however, too many to discuss in a blog article. So I’ll just mention a few here. The biggest problem is programming difficulty. Even after decades of research and hundreds of millions of dollars spent on making multithreaded programming easier, threaded applications are still a pain in the ass to write. Threads are inherently non-deterministic and, as a result, they tend to be unreliable and hard to understand, debug and maintain. In addition, the coarse-grain parallelism used in threaded programs is not well suited to data-parallel applications such as graphics processing, simulations and a whole slew of scientific computations. These types of programs run much faster in a fine-grain parallel environment. Another problem is that a huge number of legacy applications were not written with threads in mind and cannot take advantage of the processing power of multicore CPUs. Converting them into threaded applications is a monumental undertaking that will prove to be prohibitively costly and time consuming. Rewriting them from scratch will be equally costly.
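
To show what that non-determinism looks like in practice, here is a small demonstration using Python's threading module (my own example, not tied to any particular vendor's tools): two threads update a shared counter without any coordination, updates get lost, and the final result can change from one run to the next.

```python
import threading

counter = 0

def work(iterations):
    global counter
    for _ in range(iterations):
        value = counter         # read
        value += 1              # modify
        counter = value         # write; another thread may have written in between

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With perfect interleaving this would print 200000; in practice it is usually
# smaller, and often different on every run.
print(counter)
```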

Obviously, thread-based MIMD parallelism is not the answer the industry is looking for, wishful thinking notwithstanding. In Part III, I will examine the pros and cons of SIMD and heterogeneous multicore processors.

See also:

Why Parallel Programming Is So Hard

Saturday, March 1, 2008

Nightmare on Core Street, Part I

The Parallel Programming Crisis

Part I, II, III, IV, V

Panic in Multicore Land

There is widespread disagreement among experts on how best to design and program multicore processors. Some, like senior AMD fellow Chuck Moore, believe that the industry should move to a new model based on a multiplicity of cores optimized for various tasks. Others (e.g., Anant Agarwal, CTO of Tilera Corporation) disagree on the grounds that heterogeneous processors would be too hard to program. Some see multithreading as the best way for most applications to take advantage of parallel hardware. Others (Agarwal) consider threads to be evil. The only emerging consensus seems to be that multicore computing is facing a major crisis. Here’s a short excerpt from an interview conducted by Dr. Dobb’s Journal with Ryan Schneider, CTO and co-founder of Acceleware:

DDJ: It seems like a system running, say a NVIDIA GPU and a many-core CPU could get pretty complicated to program for. If so, what's a developer to do?

RS: Hide. Look for a different job. Day trade on the stock market... Personally, I find that the fetal position helps. :) In all seriousness though, this is a nasty problem. Your question really describes a heterogeneous system, and most ISVs etc. are probably having enough trouble squeezing decent performance/multiples out of a quad-core CPU without adding another, different beast into the mix.

Schneider is not very optimistic about the future of parallel programming. He goes on to say, “Ultimately there is no easy solution to developing parallel systems.” He’s not alone in his pessimism. In a recent EETimes article titled “Multicore puts screws to parallel-programming models”, AMD’s Moore is reported to have said that “the industry is in a little bit of a panic about how to program multicore processors, especially heterogeneous ones.”

Incompatible Beasts and Hideous Monsters

The main problem with multicore processors is that they are hard to program. In addition, there is a huge legacy of software applications that cannot automatically take advantage of parallel processing. That being said, what is remarkable, in my view, is that there is currently no single architecture and/or programming model that can be used universally across all types of applications. Multicore processors come in essentially two incompatible flavors, MIMD (multiple instructions, multiple data) and SIMD (single instruction, multiple data). Neither flavor is optimal for every situation. Logic dictates that universality should be the primary objective of multicore research. Yet, amazingly, industry leaders like Intel and AMD are now actively pushing the field toward a hybrid (i.e., heterogeneous) type of parallel processor, a truly hideous monster that mixes both MIMD and SIMD cores on a single die. It is obvious, at least from my perspective, why the industry is in a crisis: they don’t seem to have a clue as to the real nature of the problem.

In Part II of this five-part article, I will go over the pros and cons of the MIMD parallel programming model as it is currently used in multicore CPUs. In the meantime, please read How to Solve the Parallel Programming Crisis to get an idea of where I am going with this.

[This article is part of my downloadable e-book on the parallel programming crisis.]

See also:

Why Parallel Programming Is So Hard
How to Solve the Parallel Programming Crisis