Monday, March 31, 2008

Why Parallel Programming Is So Hard

The Parallel Brain

The human brain is a massively parallel signal-processing machine and, as such, is perfectly suited to the concurrent processing of huge numbers of parallel streams of sensory and proprioceptive signals. So why do we find parallel programming so hard? I will argue that it is not because the human brain finds it hard to think in parallel, but because what passes for parallel programming is not parallel programming in the first place. Switch to a true parallel programming environment and the problem will disappear.

Fake Parallelism

What is the difference between a sequential program and a parallel program? A sequential program is an algorithm, that is, a list of instructions arranged in a specific order such that predecessors and successors are implicit. Is there such a thing as a parallel algorithm? In my opinion, the term ‘parallel algorithm’ is an oxymoron because an algorithm, at least as originally defined, is a sequence of steps. There is nothing parallel about algorithms, whether they run concurrently on a single processor or on multiple processors. A multithreaded application consists of multiple algorithms (threads) running concurrently. Other than the ability to share memory, this form of parallelism is really no different from multiple communicating programs running concurrently on a distributed network. I call it fake parallelism.
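As a minimal sketch of this fake parallelism in Python (the worker function and thread names are illustrative assumptions, not anyone's actual code): two ordinary sequential algorithms run as threads, and the interleaving of their steps is left entirely to the scheduler, so it can differ on every run.

import threading

def worker(name):
    # Each thread, taken by itself, is just a sequential algorithm.
    for i in range(3):
        print(name, "step", i)

# Two sequential algorithms running concurrently: the interleaving of
# their output is decided by the scheduler and can vary between runs.
a = threading.Thread(target=worker, args=("A",))
b = threading.Thread(target=worker, args=("B",))
a.start(); b.start()
a.join(); b.join()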

True Parallelism

In a truly parallel system, all events are synchronized to a global clock so that they can be unambiguously identified as being either concurrent or sequential. Synchronization is an absolute must in a deterministic parallel system; otherwise events quickly get out of step and inferring temporal correlations becomes nearly impossible. Note that ‘synchronous processing’ is not synonymous with ‘synchronous messaging’. A truly parallel system must use asynchronous messaging; otherwise the timing of events becomes chaotic and unpredictable. The human brain is a temporal signal-processing network that needs consistent temporal markers to establish correlations. While single-threaded programs provide adequate temporal (sequential) cues, concurrent threads are non-deterministic, so temporal cues across threads are hard to establish, which leads to confusion. See also Parallel Programming: Why the Future Is Synchronous for more on this subject.
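Here is a rough Python sketch of synchronous processing combined with asynchronous messaging; the double-buffering scheme and the ping/pong elements are my own illustrative assumptions, not a prescribed design. Every element runs on a global virtual clock, and a signal sent at tick t is buffered and delivered at tick t+1, so senders never wait, yet every delivery has an unambiguous time and the run is fully deterministic.

from collections import defaultdict

inbox = defaultdict(list)   # signals visible during the current tick
outbox = defaultdict(list)  # signals emitted during the current tick

def send(dest, msg):
    # Asynchronous messaging: the sender never waits for the receiver.
    outbox[dest].append(msg)

def run(elements, ticks):
    global inbox, outbox
    for t in range(ticks):
        for name, behave in elements.items():
            behave(t, inbox[name])  # synchronous: all elements see tick t
        # Flip the buffers: everything sent at tick t arrives at tick t+1.
        inbox, outbox = outbox, defaultdict(list)

# Illustrative elements: "ping" signals "pong" once per tick.
elements = {
    "ping": lambda t, msgs: send("pong", "hello@%d" % t),
    "pong": lambda t, msgs: print("tick", t, "pong received", msgs),
}
run(elements, 3)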

It is beneficial to view a computer program as a communication system in which elementary processes exchange signals with one another. In this light, immediately after executing, an operation (predecessor) in an algorithm sends a signal to the next operation (successor) in the sequence meaning essentially, 'I'm done; now it's your turn'. Whereas in an algorithmic program every element or operation is assumed to have exactly one predecessor and one successor, in a parallel program there is no limit to the number of predecessors or successors an element can have. This is why sequential order must be explicitly specified in a parallel program. Conversely, concurrency is implicit: no special construct is needed to specify that two or more elements are to be executed simultaneously.
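A small Python sketch of this signal model (the element names and the graph are invented for illustration): sequence is an explicit signal edge from predecessor to successor, while concurrency needs no construct at all; every element whose predecessors have all fired executes on the same tick.

# Explicit sequence: a signal edge from a predecessor to its successors.
successors = {
    "read":    ["parse", "log"],   # one predecessor, two successors
    "parse":   ["compute"],
    "log":     ["compute"],        # "compute" has two predecessors
    "compute": [],
}
# An element fires once it has received a signal from every predecessor.
waiting = {e: sum(e in ss for ss in successors.values()) for e in successors}

ready = [e for e, n in waiting.items() if n == 0]
t = 0
while ready:
    # Implicit concurrency: everything in 'ready' fires on the same tick.
    print("tick", t, "fires", sorted(ready))
    fired, ready = ready, []
    for e in fired:
        for s in successors[e]:
            waiting[s] -= 1
            if waiting[s] == 0:
                ready.append(s)
    t += 1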

Composition vs. Decomposition

The common wisdom in the industry is that the best way to write a parallel program is to break an existing sequential program down into multiple threads that can be assigned to separate cores in a multicore processor. Decomposition, it seems, is what the experts are recommending as the correct method of parallelization. However, this raises a couple of questions. If composition is the proper method of constructing sequential programs, why should parallel programs be any different? In other words, if we use sequential elements or components to build a sequential program, why should we not use parallel elements or components to build parallel programs? If the compositional approach to software construction is known to work for sequential programs, it follows that the same approach should be used in parallel software construction. It turns out that signal-based parallel software lends itself well to the use of plug-compatible components that can snap together automatically, as sketched below. Composition is natural and easy. Decomposition is unnatural and hard.
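Here is a hedged Python sketch of what plug-compatible composition could look like; the port-typing scheme and the component names are assumptions of mine, not a specification. A connection is accepted only when the output and input signal types match, so compatible components snap together automatically and incompatible ones refuse to connect.

class Component:
    def __init__(self, name, in_type, out_type):
        self.name, self.in_type, self.out_type = name, in_type, out_type
        self.successors = []

    def connect(self, other):
        # Plug-compatibility check: ports must carry the same signal type.
        if self.out_type is not other.in_type:
            raise TypeError(self.name + " -> " + other.name + ": incompatible ports")
        self.successors.append(other)
        return other  # allows chaining: a.connect(b).connect(c)

sensor  = Component("sensor",  None,  float)
smooth  = Component("smooth",  float, float)
display = Component("display", float, str)
sensor.connect(smooth).connect(display)  # components snap together
print([c.name for c in sensor.successors], [c.name for c in smooth.successors])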

Conclusion

In conclusion, the reason that parallel programming is hard is that it is not what it is claimed to be. As soon as parallel applications become implicitly parallel, synchronous and compositional in nature, parallel programming will be at least an order of magnitude easier than sequential programming. Debugging is a breeze in a deterministic environment, cutting development time considerably.

[This article is part of my downloadable e-book on the parallel programming crisis.]

See Also:

How to Solve the Parallel Programming Crisis
Why I Hate All Computer Programming Languages
Nightmare on Core Street
Parallel Programming: Why the Future Is Synchronous
Parallel Programming: Why the Future Is Non-Algorithmic
Parallel Programming, Math and the Curse of the Algorithm

6 comments:

Mr. Funk said...

Hey, I stumbled across your blog a few days ago, and I just wanted to let you know I find some of your ideas on parallel programming intriguing. It's something I'm very interested in and I'll be reading your articles in the future.

axeoth said...

hello/bonjour louis,

please consider the LabVIEW software (www.ni.com) and its G graphical programming language, and discuss whether it is a true approach to parallel programming

best regards,

felix

Chris said...

Do you have examples of what you're referring to?

I'm confused by your distinction. You leave out the obvious definition of parallel algorithms, which is the idea that a traditional problem can be solved faster using a different approach than the serial one by taking advantage of multiple processors.

Parallel algorithms are not always simply decomposed serial algorithms, and in fact are most often original algorithms that would have horrible performance when run serially. There's a lot of analysis on this, and most of it was done years ago.

Finally, implicit parallelism would be wonderful, but how can it be achieved and can it achieve what can be accomplished explicitly?

Louis Savain said...

Chris wrote: You leave out the obvious definition of parallel algorithms, which is the idea that a traditional problem can be solved faster using a different approach than the serial one by taking advantage of multiple processors.

Well, it may be a definition of parallel algorithms, but it is not a definition of parallel processing as I understand it. Unless you can precisely identify whether any two operations (events) are concurrent or sequential, you don't have true parallelism. The definition of parallel algorithms does not do us any good as far as making parallel programming easier. That's because it says nothing that networked application programmers did not already know, even though these programmers would be surprised to learn that they were doing parallel programming. Does a sequential program running on a processor node in a distributed network suddenly turn into a parallel program simply because another sequential program happens to be running on a different node? I don't think so. My point is that there is no difference.

Parallel algorithms are not always simply decomposed serial algorithms, and in fact are most often original algorithms that would have horrible performance when run serially.

Sorry, I don't get it. How is an original sequential algorithm going to exhibit greater parallel performance unless it is first decomposed into threads?

There's a lot of analysis on this, and most of it was done years ago.

Obviously, all this analysis did not do us any good, because here we are in the middle of a deepening crisis that threatens to choke the life out of the microprocessor industry.

Finally, implicit parallelism would be wonderful, but how can it be achieved and can it achieve what can be accomplished explicitly?

Well, they both purport to achieve parallelism. Implicit parallelism is already routinely achieved in such applications as cellular automata, simulations (including circuit simulators/editors, VHDL and others), neural networks, and so on. A node in a cellular automaton or a neural network is implicitly parallel. That is to say, there is no need to explicitly specify whether it is to run concurrently with other nodes. The sequentiality (serial order of execution), on the other hand, must be explicitly specified by providing a communication link between two or more nodes.

Implicit parallelism is different in that it does not put any artificial constraint on how elements in a program must communicate. That is, unlike algorithmic systems, in which communication is one-dimensional (one predecessor to one successor), in a true parallel system communication is multi-dimensional and an element can have an indefinite number of predecessors and/or successors.
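As a rough illustration of that implicit parallelism, here is a one-dimensional cellular automaton in Python (the particular rule, rule 90, is my own choice for the example): nothing in the code requests concurrency, yet every cell is implicitly concurrent because each generation is computed entirely from the previous one, and the neighborhood links are what establish the sequential dependence between generations.

def step(cells):
    n = len(cells)
    # Rule 90: each cell's next state is the XOR of its two neighbors.
    # All cells read the old generation, so they update simultaneously.
    return [cells[(i - 1) % n] ^ cells[(i + 1) % n] for i in range(n)]

cells = [0] * 7 + [1] + [0] * 7  # a single live cell in the middle
for gen in range(5):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)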

orthorim said...

I agree that threaded programming is horrible, but I don't quite see how COSA can help. I have actually implemented a system that could map to the COSA model: graphical building blocks could be used to assemble an application. This was in the field of scientific software, so it was all parallel - that was the point. Data would then flow through the network in parallel. Come to think of it, it was basically COSA. And it was neat.

I do not think that this solves the problems that writing parallel programs entails though.

Yes, it was easy to operate; yes, users assembling the blocks could basically not make any parallelism-related mistakes. But it would not guarantee a good solution. The network, as it were, might not be used optimally, because some nodes would have to wait for others to complete - that's not a problem related to the central clock but rather to the topology.

COSA seems to be a framework that prevents the abuse of threads, and thread problems, by making the dependencies graphical and explicit. But if the nodes are really as simple as in your description, any normal program would become a giant map - no longer easy to understand. So I don't believe that the inherent difficulty of parallelism is addressed by this model. It may be better than threads though ;)

Yemi Bedu said...

Hello,
orthorim, I do not see how the benefits you described fail to outweigh the problems you mentioned. I too thought about the starvation of some of the cells per cycle. It is a question of whether they are static units or reprogrammable, and one should also consider the starvation of current FPU units in the programs running today. You can lower starvation by using a hash of the effector to allow multiple shared uses of a cell (if some cells are generated dynamically). I think Louis mentioned this as an instruction pointer in the IB (instruction buffer).

The big question I have is what the starvation of the system would be when running a sequential program. I unfortunately think that the current CPU will not die off but will be reduced to optimal cells that better handle some set of sequential operations. For some reason I see the Cell processor as a bridge to the type of system that will give a best-of-both-worlds approach. The SPE units would change from being slave elements to being the main executing blocks, and the CPU would live on as a buffer switcher/instruction mapper/local-memory loader.

The one thing that some of these conversations mix up poorly is the view of low-level execution versus high-level development/composition. I don't feel that I would be crippled by programming a parallel system with text. Also, the complexity of a program would be visible across the system, but with no more detail than Lisp dialects expose over compilation to native code. That is probably where the large map view of interconnecting cells would need to be seen as abstractions. You can more easily hide some details because the deterministic nature would not require the same interrelated view of process execution (less worry about races and some lock conditions) that working with threads does now. Good day.

Yemi Bedu