Tuesday, July 29, 2008

How to Solve the Parallel Programming Crisis

Abstract

Solving the parallel computing problem will require a universal computing model that is easy to program and is equally at home in all types of computing environments. In addition, the applications must be rock-solid. Such a model must implement fine-grained parallelism within a deterministic processing environment. This, in essence, is what I am proposing.

No Threads

The solution to the parallel programming problem is to do away with threads altogether. Threads are evil. There is a way to design and program a parallel computer that is 100% threadless. It is based on a method that has been around for decades. Programmers have been using it to simulate parallelism in such apps as neural networks, cellular automata, simulations, video games and even VHDL. Essentially, it requires two buffers and an endless loop. While the parallel objects in one buffer are being processed, the other buffer is filled with the objects to be processed in the next cycle. At the end of the cycle, the buffers are swapped and the cycle begins anew. Two buffers are used in order to prevent racing conditions. This method guarantees rock-solid deterministic behavior and is thus free of all the problems associated with multithreading. Determinism is essential to mission and safety-critical environments where unreliable software is not an option.

Speed, Transparency and Universality

The two-buffer/loop mechanism described above works great in software but only for coarse-grain objects such as neurons in a network or cells in a cellular automaton. For fine-grain parallelism, it must be applied at the instruction level. That is to say, the processor instructions themselves become the parallel objects. However, doing so in software would be much too slow. What is needed is to make the mechanism an inherent part of the processor itself by incorporating the two buffers on the chip and use internal circuitry for buffer swapping. Of course, this simple two-buffer system can be optimized for performance by adding one or more buffers for use with an instruction prefetch mechanism if so desired. Additionally, since the instructions in the buffer are independent, there is no need to process them sequentially with a traditional CPU. Ideally, the processor core should be a pure MIMD (multiple instructions, multiple data) vector core, which is not to be confused with a GPU core, which uses an SIMD (single instruction, multiple data) configuration.

The processor can be either single core or multicore. In a multicore processor, the cores would divide the instruction load in the buffers among themselves in a way that is completely transparent to the programmer. Adding more cores would simply increase processing power without having to modify the programs. Furthermore, since the model uses fine-grain parallelism and an MIMD configuration, the processor is universal, meaning that it can handle all types of applications. There is no need to have a separate processor for graphics and another for general purpose computing. A single homogeneous processor can do it all. This approach to parallelism will do wonders for productivity and make both GPUs and traditional CPUs obsolete.

Easy to Program

The main philosophy underlying this parallel processing model is that software should behave logically more like hardware. A program is thus a collection of elementary objects that use signals to communicate. This approach is ideal for graphical programming and the use of plug-compatible components. Just drag them and drop them, and they connect themselves automatically. This will open up programming to a huge number of people that were heretofore excluded.
Conclusion

Admittedly, the solution I am proposing will require a reinvention of the computer and of software construction methodology, as we know them. But there is no stopping it. The sooner we get our heads out of the threaded sand and do the right thing, the better off we will be.

Related Articles:

Why Parallel Programming Is So Hard
Parallel Computing: Why the Future Is Non-Algorithmic
Parallel Programming, Math and the Curse of the Algorithm
Half a Century of Crappy Computing
The COSA Saga
Transforming the TILE64 into a Kick-Ass Parallel Machine
Heralding the Impending Death of the CPU
Why Software Is Bad and What We Can Do to Fix It
Parallel Computing: Both CPU and GPU Are Doomed

36 comments:

Ian said...

You do realise that the sheer quantity of synchronisation required for this model would prohibit its use on anything larger than a single box, right?

Louis Savain said...

You do realise that the sheer quantity of synchronisation required for this model would prohibit its use on anything larger than a single box, right?

No. I am not sure what you mean by a single box, but I can assure you that this model can easily support tens of thousands of cores on a die. That is, of course, assuming the hardware technology will have advanced to that level. The synchronization is an automatic characteristic of the loop/buffer mechanism. It does not require any exotic circuitry.

Paul Mohr said...

Very interesting, I always like to look at new ideas for parallelism. I don't immediately grasp the entire concept, but I will read more of your blog and links before I judge the potential. It seems this could be done in an FPGA to determine it's viability. I do think that multiple technologies can be integrated in a single system to specialize the hardware for specific tasks. Content addressable memory is used in routers and if it were more commonly available in the PC it would make programming easier and much faster for some common data base tasks.

Ken Horn said...

certainly seems like an FPGA implementation in verilog / vhdl is the way to go. if you run in on an xtremedata card, it can drop in to a multi-socket AMD box giving good access to main memory. why not start a wiki/mail list to collaborate on the design? imho, graphical programming gets very complicated as soon as you want to change a large design -- grep (aka refactoring) is your friend -- so i'd expect to need a textual representation to be interchangable with the graphical language. starbridge has a data flow tool like this. of course it starts getting complicated (visually) if data dependencies require loops.

Louis Savain said...

Paul and Ken,

Thanks for the comments. You're both right about the feasibility of using FPGA to build a working prototype of a COSA processor. However, I don't think that it is needed to determine its viability. We know this model works because it's been in use for decades. I do agree that it would go a long way to convince doubters. Having said that, note that the people with the purse strings are interested in three things, ease of programming, performance and automatic scalability. These are not things that can be achieved in one's spare time. It requires talent, time and money. By money, I mean millions.

Ken, you are correct that programs created with the graphical dev tool must compile into an intermediate text file using some sort of p-code. Someone (can't remember who, probably Marcus Sundman) once suggested XML, which I think is a good idea.

I get a lot of objections from software engineers regarding graphical programming. Of course, they invariably base their criticism on tools that they've seen. We must remember that most of the research in programming tools have been limited to textual languages. Even so, this has not resulted in moving programming to the masses. Indeed, it seems like computer languages have gotten even more arcane and obscure over the years. Needlessly, I might add.

People often complain about the proliferation of signal lines in graphical languages. However, this should only be seen as the result of a mediocre implementation and not as a charateristic of graphical languages in general. Information hiding can do wonders for program comprehension, regardless of the type of tool. In the COSA programming model, multi-connectors can be used to bundle many connections into one and this goes a long way to eliminate clutter.

By the way, the model that I am proposing is not data flow but signal flow, not unlike a logic circuit.

Ken Horn said...

I'm not convinced it would take millions to show most of the benefits in prototype form -- do you have a software implementation of COSA? I think an open source project could advance the design pretty far.

I don't think a graphical tool is needed to show the underlying model works either -- a simple text language is both easier to develop and to debate since most people who could help will be coming from a text based language background. I think solving one problem at a time is feasible -- for example Verilog for circuit design is a very simple language -- like C both in terms of abstraction level and syntax.

I'm used to building systems in Verilog that I'd call data flow -- but I can't think of the difference you mean by signal flow. Is there a page that describes COSA in more detail? Is it similar to CSP or SEDA?

As for too many signal lines, these can simply be rolled up / filtered similar to interfaces in SystemVerilog. I've long argued that Verilog is almost the right abstraction layer for many tasks but the problem is the tools are stuck in the dark ages.

Paul Mohr said...

I agree with ken and I could simulate the hardware in software and it could give you a measure of it's speed and functionality. If it is a good idea, I have a ton of FPGA's and it isn't that difficult to program a pick and place machine and design a board. There are open source board design programs I use. It could even be hand stuffed. I don't completely understand your concept yet but atomic parallelism seems like it might work.

Louis Savain said...

Ken,

Data flow is something else. In data flow, parallel objects are fed data, as in a multimedia stream. In a signal flow environment, by contrast, such as a logic circuit, only signals move around. A signal is not a piece of data but a temporal event marker that indicates that something (an event) just ocurred. In the COSA model, objects are inherently parallel. Signaling is the explicit mechanism of sequential order. Unlike a logic circuit, signaling is virtual in software. The mere placement of an object into a processing buffer is taken to mean that it just received a signal.

Having said that, I don't think that a textual language will convince anybody to change to the COSA model. A truly revolutionary programming interface is needed. This is the reason that I am so excited about Jeff Han's and multitouch screen technology. They allow the kind of quick and easy, trial-and-error manipulation and rearrangement of parallel visual objects that, I believe, will move application development into the 21st century. It is time we say goodbye to the antiquated and woefully inadequate text-based technologies of the last century.

Louis Savain said...

Paul,

The concept is simple but designing a processor to do the job is not easy. The trick is to keep the cores busy and balanced when there is a heavy load and to turn as many of them off as possible when the load is light in order to save energy. There is also the problem of the data caches used by the cores. Sometimes, a piece of data is needed by one core but resides in another core's cache. Tilera came up with a nice solution with its proprietary I-Mesh technology that links neighboring cores with a high-speed network on chip. Still, load-balancing must be such that inter-core communication (for read/write) is kept to a minimum for performance reasons. I have some good ideas on how to do that, but I'll hold on to that card for the time being.

Ken Horn said...

Is there a page that actually describes COSA? You talk about not being data parallel, but signal -- what do the signals represet? If not, Von Neumann instruction stream and not data, how is the processor programmed? FPGA based systems are signal processors -- this signal flow, is it flags? How is data passed if not as signals? Does the program get loaded and remain in place like a FPGA/CPU design? As ever, the key to performance is either to never move data other than stimuli (as in FPGAs - the data is often held in local mem on chip) or completely hide memory/comms latencies via some context switching mechanism (but hence requires more mem).

Much of this heads in the direction of transputers of course..

I'm still not sold on a graphical language but I think that's an orthogonal concern to the chip architecture.

karl said...

I don't know if this helps, but the way I picture it is like this:

Traditional sequential programs are a big column of instructions, one instruction on every line, and processing works its way down. It can jump up and down and can start and stop anywhere but broadly speaking it goes from top to bottom, one instruction at a time.

Threaded programming attempts to pull chunks of this column out to the sides, but this is pretty tricky with all the dependencies amongst all these instructions.

Instead think of the processing as going from left to right through every instruction at once. Some instructions are waiting for input, so they're shunted to the right into a second column. Once the first column is totally processed, the processor moves to the second column and starts again. If you need multiple copies of a particular instruction, the processor would just shunt a copy of that instruction into the second column, which would be attached to the alternative input.

You could still run this on traditional multicore chips, with multiple threads to process chunks of instructions from the current column. Its just that each instruction is independent of all the others in that column, so the threads don't have to synchronise at all.

Surely it wouldn't take too long to program a rudimentary version something like this?

Louis Savain said...

Ken,

For more information on this model, you can read the Project COSA pages including the Operating System and Software Composition pages.

There are, of course, multiple instruction streams, although, in the model, they are not streams in the conventional sense. They are just elementary objects connected to each other, the same way logic gates or neurons in a neural network are connected to each other with signal lines. Basically, a COSA program is a logic circuit. It consists of elementary cells that wait for one or more signal to do something, after which they may or may not send a signal to one or more other cells. There are also sensors (comparison operators) that detect changes in data and send a signal when a change is detected. COSA is a reactive software model.

Of course, the signals are virtual in software. There is no EMF transition that travels from one cell to another. The way it works is that every cell has a list of cells that it is connected to, if any. During execution, the processor places the items in the list into the second buffer where they will be processed during the next cycle.

One of the most important things about this model is the timing of the signals. Singal timing is 100% deterministic. The arrival times of the signals are essential because some cells must know whether or not its input signals arrived concurrently or consecutively in a given order.

You can also read Parallel Computing: Why the Future Is Non-Algorithmic for a quick introduction.

Ken Horn said...

So is it fine grained CSP with a graphical language? Occam?

Jonathan Leonard said...

This sounds kind of like von Neuman's automaton--the universal constructor. [von Neuman was responsible for more than just computer architecture].

Louis Savain said...

Ken,

COSA is indeed fine-grained but, as far as I understand it, CSP uses message channels to copy messages between parallel objects. COSA does not. In COSA, a message is just shared memory. Also, if I remember correctly, the transputer (for which CSP was created) was based on multithreading and could be easily simulated on a single core machine. This is absolutely not the case with COSA since multithreading is anathema to the COSA model since multithreading is non-deterministic. COSA does not use threads at all and there is no context switching.

Also, in my view, CSP is so arcane that only a nerd could love it. What I am trying to do, among other things, is move parallel application development into the mainstream.

PS. Several people have compared COSA to CSP in the past. I don't get it. To me, COSA looks more like VHDL with the exception that, in COSA, the timing of signals is explicit and available to the program. Has anybody compared CSP to VHDL?

Louis Savain said...

Karl,

I like your analogy. However, I disagree that COSA could be run efficiently on a conventional processor. In COSA, there are no such things as indenpendent threads. Everything must be synchronized to the same global virtual clock. Deterministic timing is absolutely essential to reliability and security in my opinion.

Louis Savain said...

Jonathan,

I am not familiar with von Neumann's universal constructor. However, I agree that cellular automata use the same sort of deterministic parallelism that is used in COSA.

Ken Horn said...

I think the VHDL (or, to me, the much simpler Verilog) analogy is pretty close -- you get massive fine grained parallelism and deterministic behaviour, that's why the FPGA accelerator usage has been popular. As far as I understand it, CSP doesn't necessarily require message channels except in a logical sense, if the data is immutable -- threads are also an implementation detail if you go for blocking reads.

I find CSP / VHDL(or rather chip design) / SEDA / COSA (as far as I understand it) / Unix pipes, all essentially the same mechanism, of partitioned data and calculation flow. The levels of explicitness in the programming models are similar even if the abstraction from silicon is quite different. A channel is a queue is a FIFO.

The other common parallel scheme seems to be (again, from memory) the Occam / parfor / etc used by main systems which can sometimes be derived by a standard compiler based on data dependencies.

As ever tools to help partition sequential and parallel sections and declare/avoid data dependencies are key for these approaches.

It's still not clear to me the lifecycle of COSA objects or how a CPU would be configured other than the existing options of von Neuman type core or FPGA/DSP data flow. The are many mesh/FPOA type approaches for inter node (on chip) communications, but without context switching (in terms of unit of work, not necessarily thread based) or a data path program design you end up with lots of idle silicon, I think (at least has the advantage of low power).

I think to advance the ideas in the discussion a concrete example program, in some text syntax, would be the most productive route.

dvyukov said...

I can't understand one moment in COSA model. How are you going to manage data in COSA?

Handler must read data value, which was actual at the beginning of execution cycle, not current value.

I see 2 solutions here.
First. You can store 2 versions of value for every data component - 'current' and 'next'. And change meaning of those values after every execution cycle. This approach has serious drawbacks.

Second. You can place all new values into some 'redo' log. And apply this 'redo' log at the end of execution cycle. This approach also has serious drawbacks.

Both approaches are serious hits at run-time properties (memory consumption and/or execution time).

Can you describe how are going to manage data components, so that every handler will read value which was actual at the beginning of execution cycle? Or just give some links to the articles where this is described.

Thanks.

Louis Savain said...

dvyukov,

It seems to me that you are talking about a problem that is common in multithreading environments. That is, a data variable may be changed by one thread while it is being read by another thread. The reason that this is a problem is that multithreading is not deterministic. That is to say, it is impossible to determine exactly at what time a piece of data will be read or modified.

The COSA model, by constrast, is 100% deterministic. Data reads and writes occur in a predictable manner and, if there is any conflict, it can be easily discovered by the development tool and the developer will be alerted accordingly.

dvyukov said...

Yes, this problem is common to all multi-threading environments to some degree.
But in non-synchronous/non-deterministic environment this problem can be solved in very straightforward manner. Just lock the mutex and update the data in-place. Reader will read the most current value. Simple.
In COSA the problem cannot be solved this way. Because reader must NOT read the most current value, it must read the value that was actual in the beginning of current execution cycle.
So you have to make some additional bookkeeping and make some additional work to ensure this property. Cost of determinism, so to say.
The most obvious way to solve the problem in COSA, which I see, is to keep redo-log with all writes, and apply it at the end of execution cycle. But this solution has serious drawbacks wrt memory consumption, execution time, needed synchronization and locality of work.

Please, describe how you going to solve this problem. This moment prevents me from obtaining comprehensive notion about COSA.

Dmitriy V'jukov

Louis Savain said...

dvyukov,

I don't think you understood my last comment. In a fine-grain deterministic environment, this is not a problem. The development tool can easily examine the signal flow of a COSA program to determine whether or not there is potential conflict. If so, it will alert the programmer to the conflicting condition so he or she can modify the program to eliminate it.

Modifying the program might consist of simply re-routing a signal and maybe using a wait cell in order to change the timing of an action. It's not rocket science.

dvyukov said...

Ah! Ok, now I get you.
So basically you can update all data in-place, and at the same time readers can't read "wrong" values *by-design*.

Hmmm... I need to think on this...
I'm not yet sure in 2 moments:
(1) Whether all algorithms can be expressed this way. Such scheme has a limitation that on some execution cycles handler can't read a data. And if we can't statically determine on which cycles it's allowed to read a data, and on which it's not?
(2) Whether all algorithms can be expressed *efficiently* this way.

Why I was initially thinking about redo-log? Because this is how synchronous hardware works. And scheme based on redo-log doesn't have limitation that sometimes it's disallowed to read some data. Note that since readers always read values which was actual at the beginning of execution cycle, design is still deterministic! But of course "concurrent" (i.e. inside one execution cycle) writes must be prohibited.

Dmitriy V'jukov

Louis Savain said...

Dmitriy,

I have no idea what you mean by "wrong" values. Your "redo logs" is what some people are referring to as transactional memory, which, in my opinion, is the wrong way to do things. It wreaks havoc with deterministic processing and is not any more efficient. Everything depends on proper timing. If your system is deterministic, you can read data either before or after it has been changed. The choice is yours. What is important is that there is no read/write conflict that could compromise the data.

dvyukov said...

No, I am not referring to transactional memory.
COSA is deterministic. Good.
But determinism achieved at high price from user point of view. You are prohibiting reading values on execution cycles when the values are changed. Right?
Redo-logs it the way to relax this requirement. I.e. handler will be able to read values on the same cycle when the values are changed. AND system STILL stay deterministic.

Dmitriy V'jukov

Louis Savain said...

COSA is deterministic. Good.
But determinism achieved at high price from user point of view.


What high price? It's just a matter of reading a variable after it has been written to. If there is no write, there is no wait. Besides, this is a very rare occurrence in COSA. COSA is a reactive system, which means that, in general, changing a variable causes a reaction. In other words, variables are tested (using comparison operations) immediately and automatically after they are changed and, if necessary, a signal is sent to every object that is dependent on the change. Both the change and the test occur within the same cycle because they are considered to be a single action.

You are prohibiting reading values on execution cycles when the values are changed. Right?
Redo-logs it the way to relax this requirement.


It is a perfectly sensible requirement. Why would I want to relax it? I don't see how "redo-logs" are going to increase performance. If anything, I think they will decrease performance.

I.e. handler will be able to read values on the same cycle when the values are changed. AND system STILL stay deterministic.

I doubt it very much. Redo-logs just add uneeded complexity, in my opinion. My solution only adds a small wait while the variable is being changed. Most of the time, there is no wait at all. Redo-logs waste time and space in my opinion.

dvyukov said...

Well, Ok. Now I understand. Thank you for explanation.

Dmitriy V'jukov

Walter said...

You do realise that the sheer quantity of synchronisation required for this model would prohibit its use on anything larger than a single box, right?

> No. I am not sure what you mean by a single box, but I can assure you that this model can easily support tens of thousands of cores on a die.

I think what Ian means is that you can't reasonably use this on more than one computer. So an asynchronous method of parallel programming would still be required for applications that work on networks of computers. Erlang, for example, will run on multiple cores in a single machine. But it can also be extended to multiple, networked computers just as easily as a single machine.

Louis Savain said...

Walter wrote:

I think what Ian means is that you can't reasonably use this on more than one computer. So an asynchronous method of parallel programming would still be required for applications that work on networks of computers. Erlang, for example, will run on multiple cores in a single machine. But it can also be extended to multiple, networked computers just as easily as a single machine.

Well, COSA uses synchronous processes but it also uses asynchronous messaging. This means that it can just as easily send asynchronous messages to other objects on a distributed network. Erlang's approach to communication is to copy the entire message to a message channel. I think this is absurd because the performance hit is enormous.

Jason said...

Please start a project somewhere to implement the GUI and the hardware simulator.

ilya said...

So, multicore is here, yet the vast majority of applications have not been multicore-enabled. Why? Well, first of all, parallel programming is hard — in fact viewed by many as an unsolved problem. And second, here’s the rough calculus done today by software development teams: On a dual-core system (the typical configuration today), one can typically expect a two-fold improvement at best. This performance improvement is just not compelling enough to make the (potentially considerable) development investment. Once the 8- and 16-core systems become mainstream, however, a single-threaded application may be leaving an order of magnitude in performance and throughput on the table. That’s when multicore enablement moves from “nice to have” to a “must have.”

My sense is that for broad adoption, the change has to be evolutionary, rather than something that's brand new and radical.

Four tools/approaches have a shot at being a mainstream multicore programming approach:
- OpenMP
- Intel's Threading Building Blocks
- Microsoft Concurrency Runtime
- Cilk++

Jonathan Leonard said...

Don't forget:
Parallel FX
Erlang
F# Workflows
Parallel Haskell

igravious said...

metrics or it's not science, only froth

Katarina Malthus said...

Or maybe we could just stop being lazy and learn to properly implement parallelism? Also, Sun's latest line of processors in the E-series, do a very good job of making parallelism on the processor seamless through their automatic branching determination.

Also, parallel studio, aka, parallelism for the incompetent. However, it works.

tetron said...

Practically speaking, since you're always going to have a lot of instructions which are blocked waiting for the results of other instructions (computation is sequential after all!) how is this idea different from a superscalar processor with a deep pipeline? Modern CPUs already achieve a lot of fine-grained instruction level parallelism, it's just not exposed to the programmer.

Louis Savain said...

tetron wrote:

how is this idea different from a superscalar processor with a deep pipeline?
Well, superscalar processors do a great job if the parallelism is local. The reason is that the processor can find independent instructions only within a small segment of the instruction stream. The superscalar approach is blind to what I call 'deep independence'. This is a serious flaw because without deep independence, most of the parallelism cannot be exploited.

The second major problem with superscalar processors (with all processors, for that matter) is that they have no mechanism for deterministic reactive processing. In my opinion, this is unacceptable. Deterministic timing (based on a virtual clock, of course) is essential to reliable software and reactive processing is the key to exposing parallelism in a program.