Tuesday, March 11, 2008

Nightmare on Core Street, Part V

The COSA Software Model

Part I, II, III, IV, V


In the previous post, I wrote that the reason that the computer industry’s multicore strategy will not work is that it is based on multithreading, a technique that was never intended to be the basis of a parallel software model, only as a mechanism for executing multiple sequential (not parallel) algorithms concurrently. I am proposing an alternative model that is inherently non-algorithmic, deterministic and parallel. It is called the COSA software model and it incorporates the qualities of both MIMD and SIMD parallelism without their shortcomings. The initial reason behind COSA was to solve one of the most pressing problems in computer science today, software unreliability. As it turns out, COSA addresses the parallel programming problem as well.

The COSA Model

Any serious attempt to formulate a parallel software model would do well to emulate parallel systems in nature. One such system is a biological neural network. Imagine an interconnected spiking (pulsed) neural network. Each elementary cell (neuron) in the network is a parallel element or processor that waits for a discrete signal (a pulse or spike) from another cell or a change in the environment (event), performs an action (executes an operation) on its environment and sends a signal to one or more cells. There is no limit to the number of cells that can be executed simultaneously. What I have just described is a behaving system, i.e., a reactive network of cells that use signals to communicate with each other. This is essentially what a COSA program is. In COSA, the cells are the operators; and these can be either effectors (addition, subtraction, multiplication, division, etc…) or sensors (comparison or logic/temporal operators). The environment consists of the data variables and/or constants. Below is an example of a COSA low-level module that consists of five elementary cells.

Alternatively, a COSA program can be viewed as a logic circuit with lines linking various gates (operators or cells) together. Indeed, a COSA program can potentially be turned into an actual electronics circuit. This aspect of COSA has applications in future exciting computing technologies like the one being investigated by the Phoenix Project at Carnegie Mellon University. The main difference between a COSA program and a logic circuit is that, in COSA, there is no signal racing. All gates are synchronized to a global virtual clock and signal travel times are equal to zero, i.e., they occur within one cycle. A global clock means that every operation is assumed to have equal duration, one virtual cycle. The advantage of this convention is that COSA programs are 100% deterministic, meaning that the execution order (concurrent or sequential) of the operations in a COSA program is guaranteed to remain the same. Temporal order determinism is essential for automated verification purposes, which, in turn, lead to rock-solid reliability and security.

The COSA Process Emulation

Ideally, every COSA cell should be its own processor, like a neuron in the brain or a logic gate. However, such a super-parallel system must await future advances. In the meantime we are forced to use one or more very fast processors to do the work of multiple parallel cells. In this light, a COSA processor (see below) should be seen as a cell emulator. The technique is simple and well known. It is used in neural networks, cellular automata and simulations. It is based on an endless loop and two cell buffers. Each pass through the loop represents one cycle. While the processor is executing the cells in one buffer, the downstream cells to be processed during the next cycle are appended to the other buffer. As soon as all the cells in the first buffer are processed, the buffers are swapped and the cycle begins anew. Two buffers are used in order to prevent the signal racing conditions that would otherwise occur.

The COSA Processor

As seen above, we already know how to emulate deterministic parallel processes in software and we can do it without the use of threads. It is not rocket science. However, using a software loop to emulate parallelism at the instruction level would be prohibitively slow. For performance purposes, the two buffers should be integrated into the COSA processor and the cell emulation performed by the processor. In a multicore processor, each core should have its own pair of buffers. I previously began to write a series of blog articles on how to build a self-balancing, fine-grain multicore processor based on the COSA model. I did not finish the series due to business considerations.

Comparison Between COSA and Other Parallel Software Models

I plan to go over each item in the table above in detail in a future article. Let me just mention here that ease of programming is one of the better attributes of COSA. The reason is that programming in COSA is graphical and consists almost entirely of connecting objects together. Most of the time, all that is necessary is to drag the object into the application. High-level objects are plug-compatible and know how to connect themselves automatically.

The figure above is an example of a COSA high-level module under construction. Please take a look at the Project COSA web page for further information.

Related articles:

How to Solve the Parallel Programming Crisis
Parallel Computing: The End of the Turing Madness
Parallel Programming: Why the Future Is Non_Algorithmic
Parallel Programming: Why the Future Is Synchronous
Parallel Computing: Why the Future Is Reactive
Why Parallel Programming Is So Hard
Parallel Programming, Math and the Curse of the Algorithm
The COSA Saga

PS. Everyone should read the comments at the end of Parallel Computing: The End of the Turing Madness. Apparently, Peter Wegner and Dina Goldin of Brown University have been ringing the non-algorithmic/reactive bell for quite some time. Without much success, I might add, otherwise there would be no parallel programming crisis to speak of.


Ken said...

Interesting article. I saw another multi-core solution which works on a more macro level over at www.infoquanta.com

Both very promising solutions I feel.

Tim said...

This was a fun read. Its rather funny, I remember reading about Intel's former CEO bashing pharmaceuticals for not advancing anywhere near the rate as microprocessors, he was blaming (broadly) a 'buggy' peer review process getting in the way.

I completely agree that Academia has shot computing in the foot. Ironically, its the same 'buggy' peer review process that _isn't_ getting in the way where it should.

Louis Savain said...

Tim, thanks for that comment about Intel and the inadequacy of peer review. I tend to think of peer review as an incestuous mechanism that feeds on itself. Where is Thomas Kuhn when we need him?

xscott said...

Can someone explain the diagrams for me? The "While Loop" has me confused. From the other blog entry, I got that outputs are red, and inputs are black or white. What do "1/=" and "100/=" do? What does the dashed line represent?

Is there an introductory/explantory article on COSA? What does COSA stand for?

Louis Savain said...

xcott wrote: Is there an introductory/explantory article on COSA? What does COSA stand for?

COSA stands for Complementary Objects for Software Applications. Take a look at the following pages for more info:

COSA Operating System

Software Composition in COSA

AmbricMPPA said...

Checkout Ambric and the MPPA solution.


Tim said...


Could you post a link to specifications so that I don't have to put 20 miles on my mouse to wade through your site to find them?

Or, perhaps surmise them here?

It is truly not my endeavor to personify a jackass, however I sneak time to read neat stuff, so point me to neat stuff ... else you only waste my sneaking time :)


UltranoKei said...

Frankly, COSA looks much more complex to design software in. To me it seems it borders with designing hardware, and I feel it will turn into a nightmare to create the bulk of code. The state-changes will take MBs of RAM, and will need to update it on every cycle - ok, possible in 1000+ core systems that can be built even today on one die. But the routing... it'll be a tech feat to make the optimal silicon for it. Unless you want a "cpu cycle" to take 1-1000ms, that is. With COSA you're shifting the problem from "one instruction per cycle per core, half use LDR/STR" to "all of an app's instructions try to work each cycle, and spread data nonuniformly among themselves, apart from the LDR/STR". At 1k instructions, you already start seeing how inefficient silicon can be for such routing tasks.

I come from microelectronics background, and even to me those networks will be much much slower to design and debug than just typing several asm, C or script lines. Now place an average programmer, that just wants/has to make a medium-sized project with such a model - and you may see the industry dead.

Intel and AMD are more threatened by the "sufficient enough processing power", that we have reached for the majority of PC users. The only three segments, where performance-demands continue to grow, are: http servers, high-end gaming, video-content creation (2D/3D). [Intel/AMD already excel at two, and are threatening nVidia for the third one. ] In these segments, ease of programming is but a mere afterthought; memory bandwidth and latency are the real problem. And missing instructions and OS API to tackle exceedingly common thread-sync-ops continue to trip programmers.

But in the end, what do we expect - using general-purpose hardware (x86) just spoiled everybody with how it could handle everything and would increase clock-rates twofold so often. Now that that joy-ride has ended, the obvious way is to go back to... using task-specific hardware. Stay spoiled, and miss-out on the performance.
Regular PC users have what they need with current x86; gamers can slowly and safely transition to "inefficient" current-trend multicore x86, while using the specialized GPU (that can easily change method of operation drastically) ; http servers, raytracers, and scientific code will be created for the suitable ISA/system.

martin_ch said...

Well, COSA sounds very nice, but we've already gotten there.

What you've just described is an FPGA.

And yes, you too can go out, buy a VHDL or Verilog compiler, and by programming your FPGA, get a huge performance improvement, with (relative) determinism.

The main issue with implementing COSA will be your clock. Representing things in such a manner with many cores is effectively exactly the same thing as a software simulation of a hardware design: either the hardware cells are of the same complexity as a COSA cell, in which case you have an FGPA, or the cells are divided, a hundred or so for each core, in which case, you then have to synchronize when each of the cores finishes it's chunk of work for each clock cycle -> efficiency loss and/or sychronization issues.

Additionally, if you have many cores, then you have the same coherency issues: you need to do some fancy buffering / caching of the state repesenting each of the connections in your graph.

It's a nice model, and I like it, but it's no inherently better than any of the current solutions.

Pranav said...

An extremely interesting read (even for a student like me from academia whom you so kindly bashed).

Well, I like your model. And it is indeed very similar to an FPGA for a simple reason that an FPGA consists of multiple cells and you has a reconfigurable interconnect scheme.

But if you forgive my generalizations, your idea is similar to communicating sequential processes (as a programming language) and the long lost transputer system (as the hardware).

Would you care to comment ?

Louis Savain said...

An extremely interesting read (even for a student like me from academia whom you so kindly bashed).

I am allergic to most academics, especially the scientific kind. Thomas Kuhn and Paul Feyerabend were 100% right about them. :-D

Well, I like your model. And it is indeed very similar to an FPGA for a simple reason that an FPGA consists of multiple cells and you has a reconfigurable interconnect scheme.

Of course COSA is similar to FPGA. It is also similar to VHDL, Verilog and even to spiking neural networks. This is precisely what COSA is supposed to look like since the COSA thesis is that software should behave like hardware. But COSA is way more than any of these things. It is a software model rather than a hardware model. My thesis is that the microprocessor should be redesigned to emulate the parallelism of hardware in software. The COSA model is based on a simple mechanism that programmers have been using to emulate parallelism in such apps as neural networks, cellular automata, simulations and video games. The mechanism consists essentially of a collection of elementary objects to be processed, two buffers and a loop. What I am really proposing, more than anything else, is that this mechanism should be made part of the processor, whether single-core or multicore. In addition, the parallelism should be brought down to the instruction level so as to implment fine-grained deterministic parallelism in an MIMD environment.

Note that there are things that can be done in software that cannot be done in hardware. Try using FPGA or Verilog to implement a general parallel QuickSort module and you'll see what I mean. It can easily be done in COSA. Furthermore, if you read the COSA system page, you will find several essential innovations that do not exist in those hardware models.

But if you forgive my generalizations, your idea is similar to communicating sequential processes (as a programming language) and the long lost transputer system (as the hardware).

Have you taken a good look at Hoare's CSP? It is a hideous language that only a nerd would love. It is one of the most arcane computer languages I have seen in my life. At any rate, COSA is 100% deterministic, a must for reliability, and as far as I know, CSP is not. Besides, if I remember correctly, CSP was designed for the INMOS transputer, a multithreaded parallel computer. The thread concept is anathema to COSA.

In conclusion, let me add that the solution to every major problem in the computer processor industry is here now and free for the taking. It remains to be seen who among the hardware vendors is going to be the first to see the light. The problem is that the industry is putting its trust in academia, i.e., the Turing worshippers who got everybody into this mess in the first place.