Thursday, May 22, 2008

Encouraging Mediocrity at the Multicore Association

Thread Monkey Society

The Multicore Association is an industry-sponsored group that aims to provide standard approaches to multicore programming. Although I am an avid proponent of standardization, it pains me to see an industry association actively promoting parallel computing standards and practices that are designed to favor only one group: multicore processor and programming-tool makers. In other words, the Multicore Association does not have the interests of customers in mind but those of its members, i.e., the vendors. On the association’s site, we read the following:
In no way, of course, does the effort to establish standard APIs intend to limit innovation in multicore architectures. APIs that reflect the intrinsic concurrency of an application are in no sense a restriction on the creativity and differentiation of any given embodiment.
This is pure BS, of course, because as soon as a given set of parallel programming standards is accepted and established, the industry becomes pretty much locked into one type of multicore architecture or another. As an example, take a look at their Multicore Programming Practices Group. Their goal is to see how the C and C++ programming languages can best be used to create code that is multicore-ready. How can anybody maintain that the use of last century’s programming languages does not limit innovation in multicore architectures? Who are they kidding? That is precisely what it does. It encourages vendors to continue to make and sell multicore processors that use the thread-based model of concurrency. How else are you going to use C or C++ to implement concurrency on a multicore processor without threads or something similar? There is no escaping the fact that the Multicore Association is really a society created for the benefit of thread monkeys. Why? Because the current crop of multicore chips being put out by the likes of Intel, AMD, IBM and the others is worthless without threads. These folks are desperate to find a way to future-proof their multicore technology, and they figure that the Multicore Association can help. Now, if you object to being called a thread monkey, that is too bad. I really don’t want to hear about it.

What the Market Wants

You know, this is getting really tiresome. How many times must it be repeated to the industry that the only thing worse than multithreading is single threading? Is the Multicore Association what the computer industry really needs? I don’t think so. It may be what Intel or AMD or Freescale needs, but it is not what the customers need. And by customers, I mean the multicore processor market, the people who buy and program multicore computers. The market wants super-fast, fine-grain, self-balancing parallel computers that are easy to program. People want to create parallel programs that scale automatically when more cores are added. They want a programming environment that is better than last century’s technology. They don't even want to think about cores other than as a technology that they can buy to increase performance. Do Intel, AMD, Freescale, IBM, or any of the other multicore vendors sell anything that even comes close to delivering what the market wants? I don’t think so. The only board member listed on the Multicore Association's site that can claim to be truly innovative is Plurality Ltd of Israel. Even so, Plurality’s programming model sucks (see my article on Plurality’s Hypercore Architecture) because its task-oriented model is just multithreading in disguise.

We Ain’t Buying this Crap

What is needed is an association that has the interests of multicore customers in mind. Multicore customers must make themselves heard, and the only way to do this is with their pocketbooks. IT directors and IT sponsors should refuse to buy the current crop of multicore processors for the simple reason that they suck. Am I calling for a boycott? You bet I am. The market should refuse to buy into the mediocrity that is the multithreading programming model. And the only way the market is going to get what it wants and change the course of computing in its favor is when those beautiful multicore chips begin to pile up at the fab, all dressed up with nowhere to go. The vendors may have their evangelists, their trade organizations and their snake oil salesmen. The market has something better: the power to say, “We ain’t buying this crap!” That would be a message heard loud and clear.

My Message to Marcus

My message to Marcus Levy is the following. I am not one to foment trouble just for the hell of it. It is just my nature to tell it like I see it. My main interest in multicore technology is that of a consumer and developer. My position is that it is not in the interest of the multicore industry to be the purveyors of mediocrity. In the end, this kind of attitude will come back to haunt you and the members of your association. But it does not have to be that way. The whole thing can be a win-win situation if the leaders of the multicore industry are willing to listen to wisdom and realize their folly. Their approach to parallel programming is crap. I know it, you know it, and they know it. They know it because, no matter how much time and money they spend on trying to make it all work, it is still a royal pain in the ass. They know it because their researchers have visited my blog countless times since I wrote my “Nightmare on Core Street” series. Now I perfectly understand the not-invented-here syndrome but that is no excuse.

As the head of an influential and rapidly growing organization, you have two options, in my opinion. You can choose to take the cowardly route and play along with the mediocrity bunch or you can step up to the plate like a hero and let the industry know in no uncertain terms that it is full of shit. Zero or hero, take your pick. But then again, it may not matter in the grand scheme of things. If the current players refuse to change their ways and do the right thing, some unknown entity may just sneak behind them and steal the pot of gold.

15 comments:

0 said...

Can we get a working implementation of a COSA virtual machine any time soon? Trust me, it would really help in what you're trying to accomplish.

tbcpp said...

ditto

Louis Savain said...

If you guys can fork over about 5 million dollars, I might be able to deliver a COSA-compatible multicore computer with a working OS and a set of graphical tools to write code with.

COSA does not need to show any working code such as a virtual machine. That would prove nothing that is not already well understood, and it would not convince those who do not want to be convinced. COSA is not rocket science. If you want to know the basics of how it works, take a look at a spiking neural network or a cellular automaton or even VHDL. It all comes down to a set of elementary objects to be processed, two buffers and a loop. It's proven technology.
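
For the curious, the gist of that loop can be sketched in a few lines of C. Keep in mind that this is just an illustration of the principle, not a spec: every name in it is made up for the occasion, cells here signal their targets unconditionally, and bounds checking and duplicate removal are left out to keep it short.

    #include <stddef.h>

    typedef struct cosa_cell cosa_cell;
    struct cosa_cell {
        void (*process)(cosa_cell *self); /* the cell's elementary operation */
        cosa_cell **targets;              /* cells that receive its signal   */
        size_t n_targets;
    };

    #define MAX_ACTIVE 1024
    static cosa_cell *buf_a[MAX_ACTIVE], *buf_b[MAX_ACTIVE];
    static size_t len_a, len_b;  /* caller seeds buf_a/len_a before running */

    /* Two buffers and a loop: "cur" holds the cells firing this cycle,
       "next" the cells signaled for the following cycle. Writing only
       into "next" is what eliminates signal racing: nothing processed
       this cycle can affect this cycle. */
    void run_cycles(unsigned cycles) {
        cosa_cell **cur = buf_a, **next = buf_b;
        size_t *cur_len = &len_a, *next_len = &len_b;
        while (cycles--) {
            for (size_t i = 0; i < *cur_len; i++) {
                cosa_cell *c = cur[i];
                c->process(c);
                for (size_t t = 0; t < c->n_targets; t++)
                    next[(*next_len)++] = c->targets[t];
            }
            *cur_len = 0;
            /* swap: next cycle's list becomes the current one */
            { cosa_cell **tp = cur;  cur = next;  next = tp; }
            { size_t *lp = cur_len; cur_len = next_len; next_len = lp; }
        }
    }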

The industry just needs to realize that COSA is the answer to the multicore programming and architecture crisis. It solves not only the parallel programming problem but, since it is deterministic, the software reliability problem as well. That is my crusade, and the point of my post is that the Multicore Association is going about it the wrong way.

0 said...

No offense, but you might be stuck in a vicious circle here. You ask for money to build your CPU, but in turn, I doubt anyone is willing to throw away 5 million dollars without some proof of concept, a guarantee. COSA itself is not proven technology, because it's never been done before; you say so yourself. My suggestion is to build a COSA virtual machine for the horde of ordinary PC users to download, and once COSA becomes proven technology, the funds for a CPU will come.

Louis Savain said...

I don't think you get it. Synchronous reactive (signal-based) programming is an existing technology that is already being used in aviation and other mission-critical applications. I just take it several steps further, down to the instruction level. Unlike what's out there, COSA is a purely reactive and synchronous computing model, the way it should have been done from the start. I also claim that it is the answer to almost everything that is wrong with computing, including the parallel programming and multicore architecture crisis.

There are a lot of OS researchers and other thinkers out there who are aware of the COSA approach to parallelism and know it is the right approach. However, it is not in their interest to promote it because it's not their idea.

Having a virtual machine is useless, in my opinion. Even having a whole set of development tools would be useless. It would only attract computer geeks and hobbyists. Nobody important would be interested because it would be too slow on current processors. Right now, people are only impressed with performance, not realizing that performance is only part of the problem. Reliability and productivity are even more important than speed, in my opinion. The only thing that would get people with money to stand up and take notice is an actual working COSA-compatible processor. Even a single-core COSA processor would be impressive. But, as you said, it's a chicken-and-egg problem.

Another problem is that I don't have a 'PhD' next to my name. Venture capitalists are extremely impressed with academic credentials because they don't have the know-how to evaluate an idea. Having said that, I am not about to quit. Eventually, the money will come because the industry is in a world of hurt right now, and the research labs are not coming up with the right solutions. The current trend is domain-specific tools and heterogeneous processors. That's going to be yet another painful failure they'll have to deal with. I can wait. Besides, the money might even come from somewhere else. COSA is not the only thing I am working on.

tbcpp said...

People could have said the same thing of Linus, no PhD, no funds, and no backing. And yet, what OS runs on almost everything from cell phones to supercomputers? These hobbyist developers are the same people that hold day jobs as real developers and engineers. Give them something to work with and tinker with, and they may just take it to work.

Brian said...

"Nobody important would be interested because it would be too slow on current processors"

Is this necessarily so, or simply an assumption? Is there some a priori reason that a COSA system would be unbearably slow on modern microprocessors?

There is no reason that a COSA system needs to be interpreted. Look at a COSA cell as a CPU instruction. Cells are supposed to be elementary, right? For example, an ADD_AND_ASSIGN effector cell's memory representation can be the exact CPU instructions required to perform the operation. The addresses of the data operands can be hardwired into the effector's code since they do not change (i.e., cell X uses data cells A and B; this is a fixed relationship, right?).

With cells implemented as discrete "chunks" of executable code, you eliminate all the overhead of any kind of "interpreter". COSA simply becomes a way of organizing and coordinating the execution of machine instructions.
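
Here is a rough C approximation of what I'm getting at. To be clear, all the names are mine and purely illustrative; in the real thing each cell would be raw machine code with its operand addresses baked in as immediates, not a C function the compiler resolves for us:

    #include <stddef.h>

    static int A, B, X;  /* data cells at fixed addresses */

    /* An ADD_AND_ASSIGN effector bound to A, B and X at instantiation
       time. The compiler hardwires the addresses for us here; in machine
       code they would be written in as immediate operands. */
    static void add_and_assign_X(void) { X = A + B; }

    /* A cycle's worth of such cells is then just a straight run of
       direct calls: no decode loop, no type dispatch, no interpreter. */
    typedef void (*effector_fn)(void);
    static effector_fn active[] = { add_and_assign_X /* , ... */ };

    static void run_active(void) {
        for (size_t i = 0; i < sizeof active / sizeof active[0]; i++)
            active[i]();
    }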

The main issue with regard to performance in this setup is the cache. The execution order of the active cells in a given virtual cycle needs to be arranged in such a way that cache hits are maximized. This is not as bad as it may first seem.

All data access is localized in COSA. Cells only access data that "lives" close by (i.e. in the same low-level component). This means that you have all the cells (the code) and the data on which they operate in very close proximity in memory.

If you execute cells by low-level component order, you can maintain maximum possible cache validity given the memory distribution of the COSA structure.
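
Concretely, the ordering could be as simple as sorting the active list by component before each cycle. A sketch (the type and field names are again just mine, for illustration):

    #include <stdlib.h>

    typedef struct {
        int component_id;   /* the low-level component the cell lives in */
        /* ... instructions, operand addresses ... */
    } cell;

    static int by_component(const void *pa, const void *pb) {
        const cell *a = *(cell *const *)pa, *b = *(cell *const *)pb;
        return (a->component_id > b->component_id)
             - (a->component_id < b->component_id);
    }

    /* Cells from the same component sit near each other in memory, so
       running them back to back keeps their code and data in cache. */
    void order_for_cache(cell **active, size_t n) {
        qsort(active, n, sizeof *active, by_component);
    }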

But anyways, even if a COSA system would not reach its best performance on current CPU architectures, you have to start somewhere. Even in designing a custom processor, you will need a functionally accurate (or cycle-accurate) simulator to use for validation of the proposed hardware. Even without that, you will most definitely have at least an RTL simulator.

Either way any prospective "COSA CPU" will undoubtedly be simulated in software on current CPUs before it ever sees itself cut into a piece of silicon.

Louis Savain said...

tbcpp,

You're comparing apples and oranges, in my opinion. Linux was never meant to be a disruptive technology. It is merely a Unix work-alike. The main difference is that it is free. Linux did not give us a new type of computer that is incompatible with existing tools and applications. COSA, however, is out to radically change the way we build and program our computers. This is not the sort of thing that most hobbyists do in their spare time.

Louis Savain said...

Brian,

I am preparing a response to your comment.

Louis Savain said...

Brian,

First off, let me say that if I had the time and the money, I would be working on implementing a COSA virtual machine and a set of COSA dev tools right now. Unfortunately, I find myself occupied with other things. The way I see it, anybody is free to work on their own COSA implementation, and it has come to my attention that a few people are indeed doing just that but would rather remain anonymous. I wish them the best of luck.

You wrote, "There is no reason that a COSA system needs to be interpreted."

Well, if you mean that COSA does not have to be interpreted on current processors, I have to disagree. I don't see how it can be otherwise. I admit that I haven't thought of all the possible ways a COSA program can be organized for fast processing. However, consider that current processors are designed to use an instruction pointer that is incremented every time an instruction is processed. It's a simple 1-dimensional implicit process that is interrupted only when the processor encounters a jump instruction that explicitly modifies the instruction pointer.

A COSA processor, by contrast, does not execute instructions directly from the instruction cache but from an on-chip buffer that is filled on the fly. Remember that COSA uses two buffers in order to eliminate signal race conditions and to ensure deterministic timing. There are two ways to manage the buffer: the instructions themselves can be copied directly into the buffer, or the buffer can contain index pointers to the locations of the instructions in the instruction cache. I favor the second option. Of course, the processor must use an instruction pointer to process the buffered instructions during a given cycle. The processor must include some circuitry to manage the buffers and be able to do a look-ahead in memory to prefetch instructions as much as possible. This means that a very fast processor should have big buffers and should divide the buffers into subareas for prefetching. These capabilities are not available on current processors. Therefore, a lot of it would have to be done in software, which would be slow. That's not a good thing when you want to impress venture capitalists who are, for the most part, speed fanatics.

"All data access is localized in COSA. Cells only access data that 'lives' close by (i.e. in the same low-level component)."

Well, there are times when a cell may access data in another component. This occurs during message passing. Since COSA uses a shared-memory message mechanism, the receiving component must have access to the message data structure, which resides in the sender's data area. Having said that, there is a way to automatically minimize cross-cache memory access as much as possible. It cannot be avoided entirely, but the nice thing is that the performance hit does not compound as cores are added; overall performance will still increase proportionally with the number of cores. I choose not to reveal the mechanism here, just in case some venture capitalist might be interested in investing in a COSA multicore project. :-)

"If you execute cells by low-level component order, you can maintain maximum possible cache validity given the memory distribution of the COSA structure."

Yes.

"But anyways, even if a COSA system would not reach its best performance on current CPU architectures, you have to start somewhere. Even in designing a custom processor, you will need a functionally accurate (or cycle-accurate) simulator to use for validation of the proposed hardware."

I agree. However, my position is that, if the claims that I make for COSA are valid (there is no doubt in my mind at this point), then the least the industry can do is pay for its development. I believe that the world at large will benefit tremendously from easy-to-program, reliable and super-fast computing. I've done my part. COSA is not rocket science. The industry should at least provide the resources to design a COSA multicore processor, a full COSA OS and development tools. I would do it but, as I said earlier, I can't afford to spend time on it.

PS. I'm writing a short history of COSA for an upcoming article. Stay tuned.

Brian said...

Hey Louis,

"First off, let me say that if I had the time and the money, I would be working on implementing a COSA virtual machine and a set of COSA dev tools right now."

And I too. The COSA idea itself, and the more general something-completely-new feeling, is much more inspiring than my day job of grinding out business app code. :(

"Unfortunately, I find myself occupied with other things."

Some of your more, how shall I say it, esoteric projects? ;) How is the Animal coming along? I imagine a working COSA implementation would be a useful platform for your AI research.

But anyway, back to the discussion at hand....

"However, consider that current processors are designed to use an instruction pointer that is incremented every time an instruction is processed. It's a simple 1-dimensional implicit process that is interrupted only when the processor encounters a jump instruction that explicitly modifies the instruction pointer." (emphasis added)

Right, so what processing a cell list amounts to is the following:

- Process A
- Jump -> B
- Process B
- Jump to C, etc

The jumps only exist to take you from one cell to the next in the list of cells that need to be processed for the current cycle. If it were possible to remove the jumps, processing a cell list would look like a simple, uninterrupted stream of instructions to the CPU. While copying all the instructions end-to-end into a buffer is definitely impractical, you can get very close to the same effect by using an "executable cell list". I will try to explain. :)

What I'm envisioning is as follows:

- A cell is a distinct entity in the memory space of the system.
- The hardware instructions required to process the cell are stored in the cell memory structure. In other words every cell has its own instructions. There is no code sharing between cells of the same type.
- Cells are instantiated from a template that contains placeholders for the data operands. When a cell is created and assigned data operands, the cell's copy of the hardware instructions is modified, and the operand placeholders are replaced with the actual addresses as immediate (i.e. constant) operands.
- Every cell's "program" ends with two empty instructions (Actually not empty, but RTN opcodes). These two "slots" in the cell's instruction set are used to manage the current cycle, and next cycle processing lists.
- Lastly, a cell includes a linked list of "connection handlers" which are basically instructions that are executed to notify connected cells of a signal event.

OK, that is a lot to take in. The cell would look like this in memory (ASCII art time ;) ):

** EDIT: Blogspot won't take my amazing formatting :( ***

A cells-to-be-processed list can now be maintained by modifying the instruction at A or B to be an unconditional jump to the top of the next cell to process. Two lists can be maintained (current and next) by alternating which location is modified and which location is jumped to during current processing (i.e., use A on even-numbered cycles, and B on odd).

With this setup we have the processing list stored implicitly in the cells themselves. The whole process starts by initializing a set of cells to start. The cell processor routine then pushes an address to return to onto the stack and jumps to the head of the first cell to process. When the final cell in the list is reached, it will encounter a RTN opcode instead of a JMP and return back to the cell processing routine, at which point the whole thing repeats.
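
Here is a rough C emulation of the scheme, with next-pointers standing in for the patched JMP/RTN slots. All the names are invented for illustration, and a guard against scheduling the same cell twice in one cycle is omitted:

    #include <stddef.h>

    typedef struct xcell xcell;
    struct xcell {
        void (*body)(xcell *self); /* the cell's own instruction run      */
        xcell *slot[2];            /* slot A / slot B; NULL plays the RTN */
    };

    static xcell *list_head[2];    /* heads of the even/odd cycle chains  */
    static unsigned g_cycle;       /* current cycle number                */

    /* Signaling a cell = patching it onto the next cycle's chain, just
       as the JMP would be written into slot A or B. */
    static void schedule(xcell *c) {
        unsigned p = (g_cycle + 1) & 1;
        c->slot[p] = list_head[p];
        list_head[p] = c;
    }

    /* One cycle: walk the chain until the NULL slot (the RTN) is hit,
       restoring each slot as we go so it can be reused two cycles on. */
    static void run_cycle(void) {
        unsigned p = g_cycle & 1;
        xcell *c = list_head[p];
        list_head[p] = NULL;
        while (c) {
            xcell *next = c->slot[p];
            c->slot[p] = NULL;     /* put the RTN back */
            c->body(c);            /* may call schedule() on its targets */
            c = next;
        }
        g_cycle++;
    }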

Did any of that make any sense? The basic ideas of self-modifying code and executable data structures come from Massalin's dissertation, Synthesis: An Efficient Implementation of Fundamental Operating System Services (1992).

"The processor must include some circuitry to manage the buffers and be able to do a look-ahead in memory to prefetch instructions as much as possible. This means that a very fast processor should have big buffers and should divide the buffers into subareas for prefetching. These capabilities are not available on current processors."

The cache size on a modern CPU (e.g., an Intel Core 2) is rather large, and the CPU also possesses a rather sophisticated branch predictor. By using only unconditional jumps to move between cells, the CPU's branch prediction work is trivialized. We always take the branch! The CPU can easily prefetch the instructions on the other side of the branch.

However, a COSA-optimized processor could do away with lots of the junk current CPUs have accumulated to deal with all kinds of legacy usage patterns. This could make the chips smaller and cheaper, or the extra transistors could be used to increase on-die cache size.

This comment is already way too long, so I'll wrap it up. The above ideas seem doable to me, and I am about to start building an experimental COSA processor based on them. We'll see what happens. It may very well turn out to be a sh**ty way of going about doing things. :)

As always, enjoy reading the blog. Keep up the good work.

tbcpp said...

Good point! COSA also made me think of Synthesis several times.

I don't think that COSA code would necessarily be that much slower than C code, if Brian's methods were used. Remember, we're in a day when C#, Java, and Python rule the world. This is an age where programmers are more than willing to throw more hardware at a problem if it means that code will be easier to write or more scalable.

Brian said...

tbcpp said, "COSA also made me think of Synthesis several times."

Yay! I'm not the only crazy one! I only stumbled across that paper last week. I must say it's one of the most interesting things I've ever read in regards to computer science. From what I can tell, it has suffered the same fate as COSA: too different, so no one wants to use it.

tbcpp said, "I don't think that COSA code would necessarily be that much slower than C code..."

I think that if we got a functional COSA system up and running there would initially be a performance hit, but the real benefit would come from the ease with which performance could be scaled to large numbers of concurrently executing cores.

This assumes that the load balancing is in working order. (Louis has a top-secret load distribution scheme somewhere up his sleeve. :) ) Obviously, to be scalable to an arbitrarily large number of cores, you have to have a fully decentralized method of distributing the load. My own thinking is that the low-level COSA components (cells and data) would make a good unit of distribution. That way you don't have two cores trying to access the same memory locations during the same cycle (except, of course, for message effectors; however, the load balancer can ensure that cores needing access to the message are provided a local copy).

Any cross-core signaling could be done in hardware via some sort of message passing. Signaling does not need to be uber-fast, since the only result of a signal can be placing a cell on the to-be-processed list. As long as it gets done before the end of the current cycle, all is good.

A COSA-aware processor can only make things better. You could have a stripped-down number-cruncher CPU. The execution path of a COSA cell list is much more straightforward than bunches of threads, context switches, and subroutines all flying about willy-nilly. For example, CPU instructions that directly update the to-be-processed lists, as well as instructions to handle inter-core signaling, would greatly improve performance, in my opinion.

As for custom CPU cores, check out: Tensilica :)

Louis Savain said...

Brian,

Keep in touch. In the event that I stumble upon some venture capital (one never knows), I would not mind having you on the team.

Eddie Edwards said...

Brian's right. COSA doesn't need to be interpreted. It can be compiled. The methods for compiling it are already well-understood because it turns out COSA is *exactly* equivalent to pure functional programming.

I have an elegant proof of this which unfortunately this margin is too small to contain.

(Disclaimer: I may not understand your idea of COSA perfectly, but it seems so similar to ideas of mine that I think what I'm saying is correct.)