Wednesday, September 10, 2008

Parallel Computing: Command Hierarchy


I had originally decided not to publish this article because it reveals a powerful aspect of the COSA software model that I could conceivably use to my advantage should I eventually form a company around COSA, but who cares? I have already given away tons of my ideas; so one more will not matter much. Long ago, I realized that the best ideas are the ones that can be explained via metaphors. The reason is that a metaphor uses a familiar and well-understood concept to elucidate one that is not so familiar but is structurally similar. Below, I describe how the metaphor of a command hierarchy can serve as a powerful organizational principle for composing parallel applications.

Command Hierarchy

In the real world, parallel entities normally go about their business without the need for a hierarchical command structure. They just interact via sensors and effectors according to their set behaviors. However, any time there is a need to perform a complex task or to reach a long-term goal that requires the cooperation of many, it pays to divide the subtasks among various individuals under the direction of a supervisor. If the task is very complex, the supervisors may in turn be placed under the supervision of an even higher authority, and so on. The use of a command hierarchy is a very powerful way to efficiently organize any complex, goal-oriented system consisting of multiple actors or individuals. It is for this reason that it is used by all large organizations such as corporations, churches, armies, etc. Since a COSA application is a collection of behaving entities, it makes sense to use the same method to organize and control their behavior.

The COSA Supervisor

Some of my readers may be familiar with object-oriented programming languages like C++, Smalltalk or Java. It is important not to confuse a class hierarchy in OOP with a command hierarchy. Unlike high-level C++ objects that have their own methods, a high-level command component in COSA does nothing other than supervise the manner in which its subordinate components perform their actions. This is not unlike the way a conductor directs an orchestra. Remember that a COSA component can be either active or inactive. When activated, a component’s sensors immediately detect input changes, if any. If necessary, the component accomplishes its task and, when the task is done, it sends an output signal before going back to sleep. The job of a COSA supervisor component is to determine the right time to activate (or deactivate) the various components under its control in order to accomplish a task. This does not mean that the tasks are necessarily consecutive, however. Any number of components may be running concurrently.
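To make the idea concrete, here is a minimal sketch in Python of a supervisor activating subordinate components. COSA itself is a signal-based model, not Python, and no COSA implementation exists to borrow from, so every name here (`Component`, `Supervisor`, `activate`) is invented for illustration only:

```python
# Hypothetical sketch of the supervisor idea described above.
# COSA components are signal-driven; this is only an analogy in code.

class Component:
    """A behaving entity that sleeps until its supervisor activates it."""
    def __init__(self, name, task):
        self.name = name
        self.task = task          # work to perform when activated
        self.active = False

    def activate(self):
        self.active = True
        result = self.task()      # sensors detect inputs, task runs
        self.active = False       # component goes back to sleep
        return result             # the "output signal" sent when done

class Supervisor:
    """Decides when each subordinate runs; does no work of its own."""
    def __init__(self, components):
        self.components = components

    def run(self):
        signals = []
        # Sequential here for simplicity; in COSA the supervisor could
        # just as well activate several components concurrently.
        for comp in self.components:
            signals.append((comp.name, comp.activate()))
        return signals

# Usage: a supervisor sequencing two subordinate tasks.
sup = Supervisor([
    Component("split", lambda: "split done"),
    Component("merge", lambda: "merge done"),
])
print(sup.run())
```

Note that the supervisor only orders activations; the components themselves carry all the behavior, which is the point of the conductor analogy.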

COSA QuickSort Revisited

One of the nice things about using a supervisor is that there is no longer a need for low-level components to send messages to each other. They just communicate with the supervisor at the start and end of their tasks. These low-level components will use what I call a data sensor, a simple connector that allows one component to see or sense data changes in another. If time permits, I will redesign the COSA QuickSort component to use a COSA supervisor and a command hierarchy.
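The data-sensor idea can also be sketched: a connector through which one component detects a change in another's data, without any message passing between the components themselves. Again, this is a hypothetical illustration; the class and method names (`Cell`, `attach`, `set`) are invented and are not part of any published COSA design:

```python
# Hypothetical sketch of a "data sensor": a connector that lets one
# component sense data changes in another.

class Cell:
    """A data cell that notifies attached sensors whenever it changes."""
    def __init__(self, value):
        self._value = value
        self._sensors = []

    def attach(self, callback):
        """Connect a data sensor (a callback) to this cell."""
        self._sensors.append(callback)

    def set(self, value):
        if value != self._value:          # only real changes are sensed
            old, self._value = self._value, value
            for sense in self._sensors:
                sense(old, value)

# Usage: one component senses changes in another's data.
changes = []
cell = Cell(0)
cell.attach(lambda old, new: changes.append((old, new)))
cell.set(5)   # change sensed
cell.set(5)   # no change, nothing sensed
cell.set(7)   # change sensed
print(changes)
```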

See Also:
COSA: A New Kind of Programming


Marc said...

What this gentleman does is strongly relevant to what you're doing:

You wrote in 2006 that you could take your project in different directions:
1/ Real-time control-type applications, which is what Esterel Technologies has done with SCADE Suite, which comes with a GUI; I would guess (but do not know for sure) that control applications might have somewhat restricted data types.
2/ Take advantage of the fact that messages may be of arbitrary types in project COSA and implement objects that are "high level" and friendly to C++/Java programmers (associative arrays? string arrays? image-handling filters? windowing-environment handling?). Some APL-ish microcode inside the cell, DSP, or FPGA tailored to the specific type of expected calculations would ensure the APPEARANCE of output changing at the same time as all inputs become active.

One of the big problems I see in transferring the COSA approach to dedicated chips is deciding how many cells are going to feed a given receiving cell. If the number is small, the pattern would be doable physically. But if you expect most cells to receive signals from 100 other cells, the cost of etching the connections may be prohibitive, and the connections may even be impossible! It might be possible to etch 10,000 cells, but with only 100 or so of them having enough inbound message space for heavy-duty graphical work. If you ever feel you have to create a cell template that reads from and writes to shared memory through a bus, many of the disadvantages of the Von Neumann architecture will come back to haunt you...

Fly Mulu said...

I would go a little further than Marc: after giving it some thought, I don't think COSA scales on anything that can be economically produced today (i.e., anything closely resembling a Von Neumann architecture, MIMD or not), since having a bus and a multi-CPU or multi-core assembly is basically the definition of the Von Neumann bottleneck, unless you are willing to provide a bus that is wide enough to supply data to all cores simultaneously. That is going to be difficult to build, even with BGA packaging.

The current consensus is that even algorithmic, non-parallel CPUs saturate the bus at 8 to at most 32 units, caching and all. Adding more CPUs will just make the problem worse: caching isn't going to make the fact go away that the CPU is faster than the (much longer) bus. The larger your command or data vector is, the longer it takes to travel to the CPU and the longer the bus is blocked - OR, the more pins you need.

The only solutions within current engineering practice (i.e., without the use of quantum teleportation) that I can think of at the moment are variants of system-on-a-chip, as in:
1. Put the entire memory of the system onto the same die as the cores. That limits the combined amount of memory and cores to the die size, i.e., it doesn't scale.

2. (Variant of 1) Integrate RAM + _one_ core on one chip and connect the chips to each other, creating a more-or-less scalable NUMA system. That would scale for programs that contain relatively large parts that process their input separately and have a limited amount of intercommunication (for example, an AI).

This second approach was tried, to a degree, by INMOS (see the Transputer), but turned out to be too expensive to compete with general-purpose CPUs in its day.

3. As in 2, use separate chips that solve their special tasks independently of each other, communicating mainly small amounts of data that mean a lot (i.e., commands that are short but trigger a lot of functionality on the chip without leaving it and using the bus). This is the approach the industry currently uses (again, after having ignored it for about ten years): you have a general-purpose CPU, a sound chip, a WiFi chip, a graphics chip, etc., all with their own high-level commands.

Looking at all that, I would say the main performance problem of parallel computing is not actually CPU design, but the fact that there is such a thing as a separation into CPU, memory, and bus.