Thursday, March 20, 2008

Nightmare on Core Street: No More Excuses

The ‘Not Invented Here’ Crowd

This blog has seen a sharp increase in visitor activity since I began writing my Nightmare on Core Street series. That a high proportion of the visits came from organisations like Intel, AMD, Microsoft, Texas Instruments, Freescale Semiconductor, Mentor Graphics and even UC Berkeley did not escape my notice. Do I expect anything to come out of it? Not at all. I perfectly understand the reluctance of anybody in this business to acknowledge outside contributions. After all, if you are being paid handsomely to work on a solution to a pressing problem, why should you tell your employers that someone else might have already found the solution? The answer is obvious: you shouldn’t, because if you did, you would be putting yourself out of a job. The logical (if not cowardly) thing to do is to keep quiet about it in the hope that your employers will not notice.

All About the Money

I have understood the solution to the parallel programming problem for over twenty years. During all those years, I have tried to explain it to anybody in the business who would listen but nobody would. And why should they? Everybody was making lots of money and there was no economic reason to rock the boat, so to speak. Besides, I am just a self-taught programmer, a nobody. But, as they say in the American south, I knew all along that, sooner or later, the chickens would come home to roost. Yes, the chickens will soon come home to roost because they sense something bad approaching, something nasty and mean and dangerous. It is all about the money, you see. If the computer industry does not figure out a way to solve its parallel programming crisis real soon, some very rich and not so very rich folks are going to lose a lot of money, real fast. They can’t let that happen.

The Impotent Wizards Of Multicore

What do rich folks do when they have a nasty problem that threatens to take away their source of income? Well, they do what they’ve always done in the past; they get their hands on all the known experts and wizards in the field, throw money at them and tell them to come up with a solution. However, this time around, the wizards are powerless. In fact, they are the cause of the problem. They are the ones who created the problem in the first place. As a case in point, let us consider Dan Reed, the recently hired Director of Microsoft Research’s scalable and multicore computing. Here is what Reed had to say about the parallel programming problem (source: embedded.com):

Twenty-plus years ago, the research space in parallel computing was looking toward the end of Moore's Law, and so there were bases that were built there to exploit parallelization […] The challenge has been that long-term research had been required to support this. There is no silver bullet there. Some of it is going to be incremental advances; some is going to be new languages.

Any time an expert says things like "there is no silver bullet" or "incremental advances" or "long-term research", you can be absolutely certain that it is a veiled admission of failure, cluelessness and impotence. Besides, "long-term research" implies long-term funding, which is what Reed, after all is said and done, is really concerned about. I have made this point before. Twenty years is a long time in this business. The wizards have been trying for over twenty years to make multithreaded programming easy and they have failed. And they have failed miserably. One would think that, by now, it should have occurred to at least one of them that, maybe, just maybe, threads are not the way to go. Don't hold your breath. In this business, if something has been good to one's career in the past, the common wisdom is to hold on to it as much as possible whether or not it works. According to cio.com, Reed believes that multithreading will over time become "part of the skill set of every professional software developer." How can anybody still talk about multithreading after spending so many years researching parallel computing? It is enough to make a grown man cry.

No More Excuses

When it comes to understanding the true nature of the parallel programming problem, Daniel Reed of Microsoft Research is as clueless as they come but he is not alone. The computer science community and the industry are crawling with experts like Reed who make a comfortable living out of not finding solutions to pressing problems. They figure that, as long as the problems persist, and as long as they can convince their employers of the need for continued long-term research, their services will continue to be in demand. They are in what I have called, in the past, the perpetual research project business. The bad news is that, recently, the game has changed abruptly. Now there is a lot of fear and panic in the air. Very big money is at stake and some very important people are beginning to demand actual and dramatic results. It is no longer business as usual. Another twenty years of fruitless long-term research is out of the question. What is a multicore wizard to do?

In my opinion, the experts (they know who they are) no longer have any excuses. Many of them have visited my blog and the Project COSA site in the last few days and many times over the years. They know what COSA is all about. At this stage of the game, they have only one course of action, in my opinion: either they acknowledge that the COSA software model is the answer to the parallel programming and multicore problem or they will all be fired for incompetence. It is 'shape up or ship out' time. Twenty years of failure is enough and the free ride will come to an end sooner or later. And from the looks of it, it's going to be sooner rather than later. As they say, money talks and bullshit walks. In the meantime, I'll be waiting patiently for the knock on my door. I can wait.

As always, I tell it like I see it.

58 comments:

msundman said...

The "it's all about money"-rant is a red herring. You also seem to not know which stage of the game we're in if you think people should start using COSA. The fact is that COSA has not been shown in practice to be better at anything than the traditional model(s). The only ones that might be interested in COSA "at this stage of the game" are researchers. Guess what? Many researchers are. Well, not about COSA specifically, but about the synchronous approach of programming reactive systems. I have a shitload (200+) of research papers, and even books, about the topic. Pretty much all of them provide solid research and actual facts, rather than speculations, unsupported hypotheses and flaming like in your COSA pages. Furthermore, research in the field (in the 90s, iirc) have indicated that the step model used at the lowest level of COSA doesn't work very well in practice, mostly because it destroys the clean mathematics that can otherwise be used when verifying and compiling programs. Also, most of the research is focused on embedded systems, because nobody has yet come up with a solid, practical system that would scale really well.

Until someone has managed to create a synchronous reactive programming system (language/ide/whatever) that can be tested and is found to scale well there is absolutely no reason for anyone but a narrow niche of researchers to be interested in the topic. The research is not (yet) at the stage where it would even hint at being "the answer to the parallel programming and multicore problem". If you want to help the research then you're welcome to do so. E.g., try to figure out a mathematical model for representing and verifying properties of COSA/whateverlanguage programs. Then try to figure out if/how this can be made to scale well. If you insist on describing only practical ways of making/using COSA, then it's just hot air until you've created a prototype that actually supports your claims.

Blork the Ork said...

Well said, msundman. This reads far too much like a rant.

Louis Savain said...

First off, Ork, this is indeed a rant and it was meant to be. If you don't like my stuff, why do you read it? It's not meant for you, man. Judging by your comment elsewhere on my blog, you are obviously an atheist/Darwinist who took offense at my Biblical work. You don't like me and I don't like you. Do you put food on my table? And even if you did, I still would not care about your opinion. So buzz off. How about that?

Second, Sundman is ignoring the fact that reactive systems and languages have been around and in use for years and they have already proven themselves over and over. VHDL, for example, is a pure reactive language, and formal verification methods can be used to prove the correctness of a VHDL logic circuit. There are several other reactive languages on the market that also lend themselves quite well to formal verification. Esterel is a case in point and is right now being used very effectively by Airbus and other European aerospace companies for their safety-critical systems.

As far as creating a prototype is concerned, that takes time and money. But don't worry, this is being taken care of as we speak. BTW, all the mathematical formal methods you are referring to are just hocus pocus, in my opinion. The academic community has been using formal methods for decades and what has it done for them other than get the industry into the sorry mess that it is in?

msundman said...

Louis, I'm sure you know that I know how long synchronous reactive systems have been around (since the 80s, and actually beginning in the late 70s). They have proven themselves in embedded, mission-critical systems, but not in large server/desktop applications written by "ordinary" programmers. Formal verification in VHDL is not as straightforward as you make it seem. Since I have to get up in a few hours I'm going to be lazy and just quote someone else here:
"Formally, Esterel's semantics are based on constructive causality[...]. The understanding of this problem and its clean mathematical treatment is one of Esterel's most significant contributions to the semantic foundations of reactive synchronous systems. Incomplete understanding of this problem is exactly what prevented the designers of the Verilog, VHDL, and Statecharts languages from having truly synchronous semantics. Instead they have semantics based on awkward delta-cycle microsteps or delayed relation notions with an unclear semantic relation to the natural models of synchronous circuits and Mealy machines. This produces a gap between simulation behavior and hardware implementations that Esterel avoids." (Source: "The Synchronous Languages 12 Years Later" by Benveniste, Caspi, Edwards, Halbwachs, le Guernic and de Simone).

The formal mathematical methods are used to prove correctness in many situations, and even to gain better insights into specific language features (e.g., optimizing compilers, adding and/or removing semantics to enhance the language, ...). Regarding the mess of the industry, you are contradicting yourself, since you just said that the synchronous reactive languages in the industry have proven themselves over and over. These proven languages are the direct result of rigorous formal reasoning in the academic community.

Louis Savain said...

Marcus,

Programmers and mathematicians have a bad habit of complicating things way beyond necessity. They also enjoy using a vocabulary that makes the simplest processes sound complicated. I believe they do it in order to feel much more important than they actually are. I find that practice very annoying, to say the least. I consider Esterel and other reactive synchronous languages to be at least an order of magnitude more complex than they have to be. Besides, you know my aversion to programming languages. I think that the linguistic approach to programming is primitive and prone to error.

There is nothing hard or mysterious about synchronizing the operations of a software system. It is far from being rocket science. In fact, it is trivial to the extreme. And if you can build a reactive synchronous system (a simple cellular automaton like the game of Life is an example), then you are already over 90% of the way to having a correct system. Deterministic behavior is what you get in a synchronous system. Again, it's not rocket science.
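
To illustrate the point, here is a rough sketch of a synchronous update in the style of the game of Life (hypothetical Python, not COSA itself; the function and names are mine): every cell reads the previous global state, then all new states are committed at once, so the result never depends on update order.

```python
from collections import Counter

def life_tick(alive):
    """alive: set of (x, y) live cells; returns the next generation.
    Two-phase synchronous update: read the whole previous state, then
    commit every new state at once -- fully deterministic."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in alive
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in alive)}

blinker = {(1, 0), (1, 1), (1, 2)}
print(life_tick(life_tick(blinker)) == blinker)   # True: deterministic, period 2
```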

Proving the correctness of such a system is also trivial because it is only a matter of trying an exhaustive combination of discrete signals and detecting conflicts. There is nothing to it and I refuse to let a bunch of pompous academics pull the wool over my eyes. It is all about discrete timing.

The reason that I said that mathematicians got us into our current mess is because our current sequential (algorithmic) computing model was originally designed by mathematicians for mathematicians. This is a fact. We must view the computer as a behaving system where timing (operation execution order) is an essential ingredient of behavior. We must forever stop seeing the computer as a computing (calculating) machine for obtaining results, one in which temporality is of secondary importance.

msundman said...

> Proving the correctness of such a system is
> also trivial because it is only a matter of
> trying an exhaustive combination of discrete
> signals and detecting conflicts.

That's not quite correct. In a tick-based control-flow system that allows input-variables you're never going to be able to perform an exhaustive combination of control signals. E.g., imagine a loop node, which takes the iteration-count from some input variable. Before you know what the input is you can't know how many signals the loop-node will produce. Now, you have two options:
1) Create a formal mathematical model for reasoning about relations between signals and data (variable values).
2) Litter your code with signal-joining serialization-code, such as FIFO-queues, resulting in highly indeterministic behavior.

I have tried doing the former. A loop node is trivial to formalize, but unfortunately most other types are not. E.g., a join node (one that fires only after all N inputs have been fired) is simple conceptually, but if you don't know beforehand which signal input will be fired last (which is the case when the input is from an external source) I've found it impossible to formalize it in a way that is really useful for reasoning about signal-delays.

Louis Savain said...

Marcus,

You are complicating things beyond necessity because, in my opinion, that is how you were trained to think. Frankly, I don't really care what academics say about the issue. The way I see it, there is only one way to prove the correctness of a system. The system must be set up so that its expectations are clearly defined so as to be testable. In a discrete-event deterministic system such as COSA, there are only two types of expectations: events are either concurrent or sequential. That is all! The triviality of the whole thing may be upsetting to an academic but I don't care.

For example, if you have a thermostat that is expected to sound an alarm when the temperature increases above 100 C, there is no need to test every possible temperature. One only needs to test a temperature transition from <= 100 to > 100. Now if the expectation is that the temperature should rise above 100 at a certain time, then one only needs to test that condition.
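
A rough sketch of that transition test (hypothetical Python; the Thermostat class and its names are made up for illustration):

```python
class Thermostat:
    def __init__(self, threshold=100.0):
        self.threshold = threshold
        self.prev = None
        self.alarm = False

    def on_temperature(self, t):
        # fire the alarm on the upward crossing of the threshold
        if self.prev is not None and self.prev <= self.threshold < t:
            self.alarm = True
        self.prev = t

t = Thermostat()
for reading in (99.0, 100.0, 100.5):   # exercise one transition from <= 100 to > 100
    t.on_temperature(reading)
assert t.alarm                         # the only condition that needs testing
```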

The beauty of deterministic systems that are based on discrete events/signals is that it is trivial to incorporate a mechanism that automatically induces or discovers all the temporal expectations of the system while it is running, even the ones that the programmer/developer will invariably miss.

As an aside, someone (apparently an academic who shall remain nameless) just wrote to me regarding his interest in formalizing COSA or, at the very least, prove its Turing compatibility. To give you an idea of what I think of the academic community's take on things, I reproduce my response to him below:

Thanks for writing and for your interest in COSA. Unfortunately, I have a low opinion of academia, in general. The COSA Project is very dear to me but if its eventual success must depend on academic approval, I would rather see it fail. COSA is based on an extremely simple concept and I refuse to allow the academic community to complicate it beyond recognition. In addition, I find the Turing computability model to be irrelevant to COSA or to the true nature of computing. Sorry.

msundman said...

To know whether two output signals of two components are concurrent or not you have to know how many ticks the components consume (and the relative times (measured in ticks) of the various inputs).

If this is as trivial as you claim then you should have no problem whatsoever telling me if the 'done' signal of your quicksort (assuming the array to sort has N elements) is concurrent with a component that consumes M ticks and whose 'start' signal is concurrent with the 'start' signal of the quicksort component. So, is it?

msundman said...

Apparently I left out some words. It should read:
[...] if the 'done' signal of your quicksort (assuming the array to sort has N elements) is concurrent with the 'done' signal of a component that consumes M ticks and whose 'start' signal is concurrent with the 'start' signal of the quicksort component.

Louis Savain said...

Marcus,

I am not sure I understand your question. What is wrong with using ticks (system cycles) and tick counters to determine whether signals are concurrent or sequential? Why is that not trivial?

msundman said...

I don't find it trivial. Maybe it is trivial for you. If you think it's trivial, then please answer whether the two components' 'done' signals are concurrent or not.

msundman said...

Just a clarification (well, more like stating the obvious):
The answer is, of course, not a simple "yes" or "no", but a function, isconcurrent(M:integer,N:integer,X):boolean.

The problem is to construct this function. I.e., for which Ms and Ns is it 'true' and for which is it 'false'. Do we need the X to do this, and if so, what should it be? (There is an infinite (or at least unreasonably large) number of possible Ms and Ns, so we can't try all of them.)

Louis Savain said...

Marcus,

Either I have no idea what you're talking about or you are confused. You know that temporality is a big thing with me. I have been using sequence and concurrency detectors for over 15 years in my pulsed neural network. It is an essential part of my AI research. To determine concurrency or sequentiality, one only needs to compare two time variables. A detector can stamp a local variable with the global time count when it receives a signal. One only needs to compare the two time variables for either equality (concurrent) or not (sequential). Where is the big mystery?

Louis Savain said...

Correction:

One only needs to compare the two time variables from two different detectors for either equality (concurrent) or non-equality (sequential). Where is the big mystery?
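
A minimal sketch of what such a detector might look like (hypothetical Python, not the actual COSA cell): each detector stamps a local variable with the global tick count when its signal arrives, and two signals are concurrent iff their stamps are equal.

```python
class Detector:
    def __init__(self):
        self.stamp = None

    def on_signal(self, global_tick):
        self.stamp = global_tick   # stamp the arrival with the global time count

def concurrent(d1, d2):
    return d1.stamp is not None and d1.stamp == d2.stamp

a, b = Detector(), Detector()
a.on_signal(42)
b.on_signal(42)
print(concurrent(a, b))   # True: same tick, therefore concurrent
```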

msundman said...

You're talking about detecting sequentiality/concurrency for some particular input at runtime. I'm talking about calculating it for any possible input at compile-time.
Does this make my comments above more clear?

msundman said...

...and please don't get stuck at the "compile-time" above; you can substitute it for "when optimizing" or "when verifying some behavior property" or whatever.

Louis Savain said...

Marcus,

None of what you wrote makes any sense to me. Are you saying that, in order to test whether or not a comparison operation is correct, you need to test every single value for the two integer variables? If so, this is nonsense. Boolean algebra can be used to easily prove the correctness of arithmetic and logical operations on numbers. There is no need to test every number in the set.

But then again, maybe I still don't get your point. Please rephrase your argument.

msundman said...

Maybe if I reduce the problem to the bare minimum:
Let's say you have two components, A and B. A has an input, Ia, and an output, Oa. B has an input, Ib, and an output, Ob. Now, assuming Ia and Ib propagate signals at the same time, is it possible to figure out if Oa and Ob will fire at the same time, without actually executing the program?

Louis Savain said...

Marcus,

Are you kidding me? Certainly we can predict when nodes will fire and under what conditions they will fire. This is precisely what is nice about a deterministic system and another reason why there should be a constitutional amendment against the use of threads in software. I'm only half joking.

In a deterministic system such as COSA, we can trace the path of signals and predict exactly when they will arrive at a given destination. A simple example is that of node A sending a signal to 4 other nodes. We know that whenever node A fires, exactly 1 cycle later, the four target nodes will receive their signals simultaneously. We can have more complex circuits than this but the principle is the same. It's just counting cycles.

But why write code to do this at compile time when it can easily be done at run time? As I said many times in the past, a deterministic system lends itself well to a mechanism that can find all temporal constraints automatically during testing. Of course, it finds them inductively. So it is up to the programmer to verify the validity of a violated constraint and either correct the error or invalidate the constraint.
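
Here is a rough sketch of the cycle-counting idea (hypothetical Python scheduler, assuming every signal sent at tick t is delivered at tick t+1; this is not the COSA kernel):

```python
from collections import defaultdict

class Simulator:
    def __init__(self):
        self.tick = 0
        self.pending = defaultdict(list)     # delivery tick -> [(target, signal)]
        self.log = []

    def send(self, target, signal):
        # a signal sent at tick t is delivered at tick t + 1
        self.pending[self.tick + 1].append((target, signal))

    def step(self):
        self.tick += 1
        for target, signal in self.pending.pop(self.tick, []):
            self.log.append((self.tick, target, signal))

sim = Simulator()
for target in ("B", "C", "D", "E"):      # node A fans out to four nodes
    sim.send(target, "fire")
sim.step()
print(sim.log)   # all four arrivals carry the same tick stamp (tick 1)
```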

Raystonn said...

I see what msundman is saying. Perhaps A or B has a while loop. That may mean the number of cycles it takes to get an output could depend on an input value. How do we keep a circuit synchronized so we can use Oa and Ob together in an operation after A and B have completed? Do we need to specifically design A and B to always finish in the same number of cycles if we are to use their outputs together in further operations?

I do agree that MIMD is not the way to go.

msundman said...

A simple "yes" or "no" would have been enough, so I'll just go ahead and assume you replied just "yes".

If A uses a variable, x, then it might not always fire its Oa the same number of ticks after it got the Ia. E.g., if A is a simple delay node then if x=1 then Oa fires 2 ticks after Ia, and if x=2 then Oa fires 3 ticks after Ia, etc. In this case Oa's firing tick is the function f(Ia,x)=Ia+x+1. With me so far?

If we can produce such input-output relationship functions for all possible nodes, then we can do all sorts of nice property verifications. E.g., we can make sure that two particular "synapses" (to borrow from your COSA terminology) will always fire at the same time. Or we can make sure that a particular component is always finished within N ticks.
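
A rough sketch of what I mean (hypothetical Python; the affine representation of firing ticks is my own assumption, chosen because it makes the delay-node function f(Ia,x)=Ia+x+1 easy to express):

```python
from collections import Counter

def affine(const, **coeffs):
    """A firing-tick expression: const + sum(coeff * variable)."""
    return (const, Counter(coeffs))

def delay_node(input_expr, delay_var):
    """Delay node as discussed: fires delay_var + 1 ticks after its input."""
    const, coeffs = input_expr
    return (const + 1, coeffs + Counter({delay_var: 1}))

def fixed_node(input_expr, m):
    """A node that always consumes a fixed m ticks."""
    const, coeffs = input_expr
    return (const + m, coeffs)

def always_concurrent(e1, e2):
    # provably concurrent for every input only if the expressions are identical
    return e1[0] == e2[0] and e1[1] == e2[1]

start = affine(0)                  # both components get their start signal at tick 0
oa = delay_node(start, "x")        # fires at tick x + 1
ob = fixed_node(start, 3)          # fires at tick 3
print(always_concurrent(oa, ob))   # False: concurrency depends on the value of x
```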

Louis Savain said...

Marcus,

The way I see it, if A fires n cycles after receiving a signal Ia with a variable x = x1, it is guaranteed to always fire after the same number of cycles n whenever it receives the same value x = x1 again.

In other words, using the quicksort example, the time (number of cycles) taken to sort a given array is guaranteed to always be the same if the same array is used over and over. This aspect of COSA can be used to effectively secure a system because a temporal constraint violation is a sure indication that the data was altered.

Temporal determinism is a well-understood characteristic of synchronous reactive systems. There is nothing complicated about it. It can be used to create rock-solid parallel software systems that never fail.

msundman said...

Now we're getting somewhere.
As you said, the quicksort component will take the same number of ticks each time we feed it similar input (i.e., an array of the same size and with its elements in the same order). However, if the elements are in a different order, or if the array is larger or smaller, then what? Obviously we can't test-run every possible permutation of every possible number of elements.

Louis Savain said...

Obviously we can't test-run every possible permutation of every possible number of elements.

Why would you want to do that? This brings back the thermostat program example that I often use. There is no point in testing every temperature value in order to validate the program. All that must be tested are the conditions (state transitions) that a system or component is specified to react to. A quicksort component consists of several sub-components. If you test all the possible conditions that these sub-components are designed to respond to, then you can be assured that they will behave correctly when combined with the other components. Why? Because that is the nature of deterministic systems: they consistently exhibit a stable behavior regardless of the environment they are used in.

msundman said...

I never indicated that I would want to test every possible value.
Instead I asked a question: if the elements are in a different order, or if the array is larger or smaller, then what?
I could further clarify: The number of ticks A spends between receiving Ia and firing Oa often depends on the values of variables A refers to. Thus it is impossible to know this number of ticks unless you also know how these variables' values affect this number. Now we have 2.5 options:

1) We don't even try to calculate the number of ticks different components will take to do their things, and if we, e.g., need two "synapses" to always fire at the same time we use joiner-queues or whatever to accomplish that.

2) We make sure that the semantics of each cell/component is such that it's possible to calculate the number of ticks it will use. (Then we can (hopefully) solve the equations needed to verify whatever needs to be verified.)

2b) We make sure that a cell/component always takes the same number of ticks no matter what values the variables it refers to have. (This of course makes it impossible, by definition, to create components that use different amounts of ticks depending on variables. This implies: no quicksort component. At least not without introducing sub-ticks or somesuch.)

(I called the 3rd option "2b" because it's actually a subset of option 2.)

So, Louis, now:
A) Do you agree that these are the available options?
B) Which option do you prefer?

Samuel said...

I would suggest you use a rank-based sequence detector for this. Just set the two slaves to have rank = -1. When the master signal comes in, this detector will signal only if both slave signals have come in. If you require any particular order to the arrival of slave signals then you can alter the rank values appropriately.

msundman said...

Samuel,
Nice of you to join the discussion, but: Huh? What is the problem that your "rank-based sequence detector" (whatever that is) is meant to solve?

Samuel said...

Please read http://www.rebelscience.org/Cosas/System.htm for information on them. In fact, it would be best if you would read all the pertinent information on COSA prior to engaging Louis in debate on its merits.

Samuel said...

Based on your moniker "msundman", I just took note that you may in fact be the one who is creating COSAed. If so, I apologize for the tone of my last post. I'm used to seeing people attack the merits of COSA and may have jumped the gun.

msundman said...

Yes, I made COSAed (and am making something not completely dissimilar), but you don't have to apologize to anyone (if we'd get offended we'd get nowhere). Just answer the question; which of the discussed problem(s) do you think a "rank-based sequence detector" would solve?

Samuel said...

This was your question:

"Let's say you have two components, A and B. A has an input, Ia, and an output, Oa. B has an input, Ib, and an output, Ob. Now, assuming Ia and Ib propagate signals at the same time, is it possible to figure out if Oa and Ob will fire at the same time, without actually executing the program?"

The rank-based sequence detector would allow you to use Oa and Ob in a subsequent operation without requiring they both show up at the same time. When the last of the two signals comes in, the detector signals the start of the subsequent operation.

There is no need to ensure the outputs of A and B show up in the same cycle. In fact, if the processing time for A and/or B varies, then it may be impossible to guarantee that the output of both will be available at the same time.

msundman said...

OK. So, I guess then you would answer my questions on "March 30, 2008 8:09 AM" with:
A) Yes.
B) Option 1. (With a "rank-based sequence detector" being the "whatever" part of the "joiner-queues or whatever".)

However, this leads to what I described on "March 26, 2008 8:59 AM" as option number 2, namely:
"2) Litter your code with signal-joining serialization-code, such as FIFO-queues, resulting in highly indeterministic behavior."
(Well, actually the behavior isn't highly indeterministic as such, but because it's filled with these unknowable delays the various timings will be highly indeterministic.)

Also, since COSA isn't data-flow you can't just use a "rank-based sequence detector", because you have to make sure nobody else starts using A or B before both of them are finished and their result variables read. Only then can A and B be allowed to accept the next inputs and use the same variables for the next calculation. So, instead of a "rank-based sequence detector" I suggest to use a joiner-queue, which simply queues the output variables of A and B and then delivers A's results and B's results paired up as if Oa and Ob fired at the same time (albeit with an indefinite delay). I don't like this approach, though, since it's so sensitive to race-conditions and the "code" gets littered with these serializer-queues.
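
A minimal sketch of such a joiner-queue (hypothetical Python; the class and its names are made up): it buffers A's and B's results and releases them as matched pairs, as if Oa and Ob had fired together, at the cost of an unbounded, data-dependent delay.

```python
from collections import deque

class JoinerQueue:
    def __init__(self, on_pair):
        self.qa, self.qb = deque(), deque()
        self.on_pair = on_pair          # callback receiving a matched (a, b) pair

    def push_a(self, value):
        self.qa.append(value)
        self._drain()

    def push_b(self, value):
        self.qb.append(value)
        self._drain()

    def _drain(self):
        while self.qa and self.qb:      # deliver results paired up, oldest first
            self.on_pair(self.qa.popleft(), self.qb.popleft())

jq = JoinerQueue(lambda a, b: print("paired:", a, b))
jq.push_a("result of A")      # buffered, B not ready yet
jq.push_b("result of B")      # pair delivered now, after an indefinite delay
```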

There are probably over a hundred programming languages that are deterministic in the sense that for the same inputs they produce the same outputs. (And that's not counting every single-threaded program ever written.) This kind of determinism is as such almost useless when we want to verify some properties. Of course we don't want the state-space to explode because of indeterministic thread scheduling, but that's not enough. We also don't want it to explode because of indefinite inputs (and with "inputs" I'm not only counting the timings of when data is received, but also the data itself, if the language allows different behavior based on different data).

Samuel said...

> you have to make sure nobody else starts using A or B before both of them are finished and their result variables read.

That may not be a requirement. It may simply be sufficient to wait for A to complete to reuse A, or wait for B to complete to reuse B. We may not always require the outputs of both A and B.

> I suggest to use a joiner-queue, which simply queues the output variables of A and B and then delivers A's results and B's results paired up as if Oa and Ob fired at the same time

This is certainly an option if you always want the results of A and B paired and never want to use A or B separately. That would be a design restriction.

> I don't like this approach, though, since it's so sensitive to race-conditions and the "code" gets littered with these serializer-queues.

Where is the race condition here? What exactly are you calling litter? If you need the results of both A and B to continue, then you will need to wait for whichever takes longest to complete. However, this is a form of coarse-grained parallelism, not very different from the multiple-thread approach to algorithmic computing. My suggestion would be to try to find an alternative to coarse-grained parallelism.

Louis Savain said...

Samuel,

Thanks for joining this discussion. I must admit that, until you joined in, I had no idea what Marcus had in mind. I guess I was a little too busy to make the effort. Let me point out that a COSA processor is an open-ended design. New sensors and effectors will be added in the future to extend the processor.

Not everything that I've thought about actually makes it to the COSA pages. For example, if you look at the bottom of the Quicksort page, you will notice that I introduced the ALL detector even though it's not described in the System page.

The ALL detector fires as soon as all of its inputs have fired regardless of the order of signal arrival (concurrent or sequential). This can easily be implemented by using an internal count variable.
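
A quick sketch of how such an ALL detector could be implemented (hypothetical Python; the exact COSA cell semantics may differ):

```python
class AllDetector:
    def __init__(self, n_inputs, on_fire):
        self.n_inputs = n_inputs
        self.on_fire = on_fire
        self.seen = set()          # which inputs have fired so far

    def on_input(self, index):
        self.seen.add(index)
        if len(self.seen) == self.n_inputs:   # internal count reached
            self.seen.clear()                 # reset for the next round
            self.on_fire()

d = AllDetector(3, lambda: print("ALL fired"))
for i in (2, 0, 1):     # order of arrival does not matter
    d.on_input(i)
```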

I want to emphasize that COSA is a work in progress and even if a COSA processor were to be launched tomorrow, future designers will be free to add new functionality. For example, I envision that future COSA processors will be specially designed for such applications as spiking neural network simulation and cellular automata. In the world of FPGAs, it will be relatively easy for customers to extend the functionality of the core design.

msundman said...

> > you have to make sure nobody
> > else starts using A or B before
> > both of them are finished and
> > their result variables read.
>
> That may not be a requirement. It
> may simply be sufficient to wait
> for A to complete to reuse A, or
> wait for B to complete to reuse B.
> We may not always require the
> outputs of both A and B.

Since COSA isn't data-flow the variables will get reused, so you need to get results off of them before the results are overwritten. Assuming the next component, C, needs the results of both A and B it either needs to copy the values as they become available, or have some queue-thingy inbetween.

Also, since COSA uses a push-architecture for the signals (i.e., the signals are sent whether the receivers are ready or not, as opposed to having the receivers ask for the signals whenever they are ready to receive them) you have to make sure that components "upstream" don't push signals too fast for components "downstream" to handle.

> > I don't like this approach,
> > though, since it's so sensitive
> > to race-conditions and the
> > "code" gets littered with these
> > serializer-queues.
>
> Where is the race condition here?

If there is no way of knowing at which (relative) tick numbers a variable's value is accessed then you have a race-condition between every component that's accessing it. If you use queues as a workaround then you have a race-condition between the component(s) filling the queue and the one(s) emptying it. And you have a race-condition between the length of the queues (i.e., the reaction delays) and external stimuli.

> What exactly are you calling
> litter?

These variable-synchronization constructs that are needed because COSA isn't data-flow and because COSA doesn't have a formalism making it possible for verifiers/compilers to ensure non-concurrent access.

> If you need the results of both
> A and B to continue, then you
> will need to wait for whichever
> takes longest to complete.

I assume that "you" is the component C, which uses the values produced by A and B. In that case C also has to connect to whatever is feeding A and B stuff, so that they also wait, and so on, right up to the sensor to the external world (or to a queue, which, once more, is an indefinite delay).

> However, this is a form of coarse-
> grained parallelism, not very
> different from the multiple-
> thread approach to algorithmic
> computing. My suggestion would be
> to try to find an alternative to
> coarse-grained parallelism.

Most non-trivial components include parts where one component, C, needs the results of other components, A and B. If each of these takes only a few ticks and there are hundreds of thousands of them, all operating in parallel, I wouldn't exactly call it "coarse-grained parallelism". Or maybe I misunderstood you.

Having to copy values into queues all over the place will consume an awful lot of resources.

In COSA, ticks/time is implicit. Every operation uses at least one tick. In other reactive synchronous languages the ticks/time is an explicit value or operation. This means that several "consecutive" things happen "at once", during the same tick. (E.g., "if signal S1 is present then do A and then fire signal S2" and "if signal S3 is present then do B and then fire signal S1" will make sure that in each tick in which S3 is present also S1 and S2 are present and A and B are done. (If this is new to you, you might have to read this example a few times before you grok it.))

This enables much simpler formalisms, much bigger headroom for optimizers/compilers, much smaller state-spaces and much more realistic chances of verifying various properties. It also has some drawbacks, the obvious one being the possibility of programs being inconsistent. (E.g., "if signal S1 isn't present then fire signal S2" and "if signal S2 is present then fire signal S1" is a causal paradox because S1 can't both be present and not be present in a tick.) There are many solutions to this problem, though. An obvious solution is to forbid causality loops within a tick.
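
A minimal sketch of the explicit-tick idea (hypothetical Python, simplified to positive triggers only so that a least fixed point always exists; negated conditions are exactly what creates the causality paradoxes mentioned above):

```python
def tick(external_inputs, rules):
    """rules: list of (trigger_signal, emitted_signal) pairs, all within one tick.
    Returns the set of signals present in that tick (the least fixed point)."""
    present = set(external_inputs)
    changed = True
    while changed:
        changed = False
        for trigger, emitted in rules:
            if trigger in present and emitted not in present:
                present.add(emitted)
                changed = True
    return present

# "if S3 is present then fire S1" and "if S1 is present then fire S2":
rules = [("S3", "S1"), ("S1", "S2")]
print(tick({"S3"}, rules))   # {'S3', 'S1', 'S2'} -- all present in the same tick
```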

Samuel said...

> Since COSA isn't data-flow the
> variables will get reused, so
> you need to get results off of
> them before the results are
> overwritten. Assuming the next
> component, C, needs the results
> of both A and B it either needs
> to copy the values as they
> become available, or have some
> queue-thingy inbetween.

There is no need for a queue. A and B can signal when they are available. Something that needs A's functionality can either wait for A to become available or use something else that has identical functionality.

> If there is no way of knowing at
> which (relative) tick numbers a
> variable's value is accessed
> then you have a race-condition
> between every component that's
> accessing it.

How would you not know? You can signal when a value is being accessed if you really want to know this. I'm still not seeing any race conditions here.

> These variable-synchronization
> constructs that are needed
> because COSA isn't data-flow and
> because COSA doesn't have a
> formalism making it possible for
> verifiers/compilers to ensure
> non-concurrent access.

Presumably component D in the D->A->C chain would wait until it was signalled before trying to use A again. Think of A as a stage in a pipeline. It is busy until you are told otherwise. You are told this with signals.

> This means
> that several "consecutive"
> things happen "at once", during
> the same tick. (E.g., "if signal
> S1 is present then do A and then
> fire signal S2" and "if signal S3
> is present then do B and then
> fire signal S1" will make sure
> that in each tick in which S3 is
> present also S1 and S2 are
> present and A and B are done.

This is a violation of one of the principles of COSA: "One of the most important aspects of using a master clock is that every elementary operation lasts exactly one execution cycle." It removes one of the benefits of COSA's design: the ability to pipeline. Think of each sequential component as a stage in a pipeline. Doing so allows you to forego many of the queues you discussed earlier. If you create a stage that can take an unknown number of cycles to complete, that stage must signal to the previous stage that it is available once it completes its operations. This also allows that previous stage to choose another available execution unit rather than just wait for the busy unit.

msundman said...

> > If there is no way of knowing at
> > which (relative) tick numbers a
> > variable's value is accessed
> > then you have a race-condition
> > between every component that's
> > accessing it.
>
> How would you not know?
> You can signal when a value is
> being accessed if you really want
> to know this.

Why would I want to know when a variable is being accessed? I don't. I want to know when a variable will be accessed. Knowing this prevents the race-condition.

(Maybe you misunderstood my use of "is" in "is accessed". It wasn't meant as "currently" ("is now"), but "constantly" ("is each time"). E.g., "it is always accessed 15 ticks after the start".)

> I'm still not seeing any race
> conditions here.

OK, back to simpletons:
A and B start at the same time. A writes to variable X Na ticks after starting. B writes to variable X Nb ticks after starting. Both A and B must not write to X at the same tick. If Na and Nb are unknowable (until after the fact) then we have a race-condition. (Therefore we need to manually serialize the access to the variable somehow.)
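
A bare-bones illustration (hypothetical Python; the lengths of the input arrays stand in for the data-dependent delays Na and Nb): because the delays depend on runtime data, the same-tick collision cannot be ruled out before execution.

```python
def writes_collide(a_data, b_data):
    na = len(a_data)   # data-dependent tick at which A writes X (stand-in)
    nb = len(b_data)   # data-dependent tick at which B writes X (stand-in)
    return na == nb    # True means a same-tick write-write conflict on X

print(writes_collide([1, 2, 3], [4, 5]))   # False: writes land on ticks 3 and 2
print(writes_collide([1, 2], [3, 4]))      # True: both writers hit X on tick 2
```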

> This is a violation of one of the
> principles of COSA: "One of the
> most important aspects of using a
> master clock is that every
> elementary operation lasts exactly
> one execution cycle." It removes
> one of the benefits of COSA's
> design: the ability to pipeline.

That's pure hogwash. It removes nothing except inflexibility and indeterminism. You can even add all those sub-ticks to the runtime if you wanted to, but there's no point of doing that.

In fact, COSA suffers from the same "causality"-loop problems, except they are manifested as livelocks (that might happen always, or once every 5 million runs, or maybe even never) instead of as compiler/verifier errors.

> If you create a stage that can
> take an unknown number of cycles
> to complete, that stage must
> signal to the previous stage that
> it is available once it completes
> its operations.

Yes, and the "code" gets littered with these since almost all components take indefinite numbers of ticks.

I find it a bit annoying that Louis time and again states with absolute confidence how things will be in COSA, seemingly without actually trying it out.
I've actually tried to code some COSA(ish). I've found out that I need all kinds of "maintenance code" to work around inflexibilities that do nothing to help with anything. In fact, my latest COSAed doesn't use a global clock anymore, but it's possible to define different tick-rates to different components, and it's possible to hard-code the number of external ticks a component will seem to use, thus hiding the fact that it doesn't use a constant number of ticks internally.

Samuel said...

> That's pure hogwash. It removes
> nothing except inflexibility and
> indeterminism. You can even add
> all those sub-ticks to the
> runtime if you wanted to, but
> there's no point of doing that.

There most definitely is a point. It sounds like you are arguing against the "synchronous" part of synchronous, reactive systems.

With the synchronous model of computation, which discretises time into logical ticks, it is possible to derive exact, tight bounds on its worst case reaction time. This tells how much time it takes the system to react to the environment. A COSA processor should be equipped with a tick manager that can provide a constant logical tick rate and detect internal timing over-runs.

This would serve to detect hardware failures and provide another safeguard so that real-time deadlines are met.

msundman said...

> > That's pure hogwash. It removes
> > nothing except inflexibility and
> > indeterminism. You can even add
> > all those sub-ticks to the
> > runtime if you wanted to, but
> > there's no point of doing that.
>
> There most definitely is a point.

...which is..?

> It sounds like you are arguing
> against the "synchronous" part of
> synchronous, reactive systems.

Where did you get that from??? Are you seriously suggesting normal synchronous languages are not synchronous because ticks (or steps) are explicit instead of implicit? C'mon!

> With the synchronous model of
> computation, which discretises
> time into logical ticks, it is
> possible to derive exact, tight
> bounds on its worst case reaction
> time. This tells how much time it
> takes the system to react to the
> environment.

With most normal synchronous languages, yes, but with COSA, no. In COSA, unlike in other synchronous languages, you can't know how many ticks a certain component will take. Thus it's impossible to know how long it will take.

E.g., if I have a loop-component I have no idea how many ticks it will take, until at runtime when the loop-count-variable is set. Since there is an infinite number of possible loop-count-variable values there is also an infinite number of possible tick counts the loop-component could take.

Samuel said...

> With most normal synchronous
> languages, yes, but with COSA,
> no. In COSA, unlike in other
> synchronous languages, you can't
> know how many ticks a certain
> component will take. Thus it's
> impossible to know how long it
> will take.
>
> E.g., if I have a loop-component
> I have no idea how many ticks it
> will take, until at runtime when
> the loop-count-variable is set.
> Since there is an infinite
> number of possible loop-count-
> variable values there is also an
> infinite number of possible tick
> counts the loop-component could
> take.

You seem to be confusing "worst case reaction time" with "worst case execution time". The worst case reaction time is the maximum amount of real time it takes to completely process a single tick. The worst case execution time is the maximum amount of real time it takes to complete execution of all required ticks.

COSA makes worst case reaction time easy. It's always 1. COSA can easily handle worst case execution time as well, by simply adding an attribute to each effector cell indicating the allowed "repeat count" values.

For example, please take a look at the "while loop" sample at http://www.rebelscience.org/Art/loop.gif. The 100 values used in the cells are counters. These counters could be set based on run-time input. What we need is an attribute to specify the values allowed as a "repeat count" for effector cells. A COSA processor should catch any violation and put up a red flag.

Using the allowed-repeat-count attributes, you should be able to come up with a worst case execution time.

Samuel said...

> COSA makes worst case reaction time easy. It's always 1.

Sorry, this is just plain wrong. Worst case reaction time will be the longest amount of real time it may take to completely process 1 tick.

Samuel said...

> What we need is an attribute to specify the values allowed as a "repeat count" for effector cells.

It occurred to me that the repeat count shown in the effector cells actually *is* an upper bound for the number of iterations to perform. Effector cells have both start and stop inputs. You would normally use the stop input to halt the iterations. The value listed in the effector cell is the worst case if the stop input is never signalled. You can use that worst case to calculate a worst case execution time.

Samuel said...

Just to clarify, those repeat count values you see in the effector cells should be provided at design-time, not run-time.

msundman said...

> Worst case reaction time will be
> the longest amount of real time
> it may take to completely process
> 1 tick.

Usually the reaction time is defined as the time between the input and the output. (What combination of input beginning/ending and output beginning/ending is considered depends on the context.)

Even if you'd consider simply doing something as "reacting" it still wouldn't be enough, since there is no reaction until the components are free to accept the input or while data is in a queue waiting to be processed. Since these delays are indefinite also the reaction-time is indefinite.

Of course you can place runtime constraints on programs (e.g., "if this component is not free after waiting for 20 ticks then reset it"), but then you couldn't verify that programs actually produce (usable) output. Besides, that can be done in pretty much any language.

Only utterly trivial programs in COSA have defined runtime-characteristics (such as worst-case reaction time), and even then the programmer has to do the reasoning manually since there are no formalisms for such things for COSA. This is what I've been saying all along, and I've described why it is that way.

> those repeat count values you see
> in the effector cells should be
> provided at design-time, not
> run-time.

That's often impossible. E.g., let's say you process an array that you get from some input (e.g., a quicksort component). Since you can't know the size of the array you can't know the repeat counts that depend on the size of the array.

Samuel said...

For something such as a quick sort, you generally have some idea of an upper limit on data sizes your application is expected to process. It shouldn't be infinite. If it is, then you are already non-deterministic due to your data, and your choice of programming languages is irrelevant.

Assuming a non-infinite data set (the only type of data set that can allow you to be deterministic), you can specify an upper limit on the size of your input. This limit would be communicated to your data provider.

msundman said...

> you generally have some idea of
> an upper limit on data sizes your
> application is expected to process

So, a loop-component that uses a 64-bit unsigned int could take anything from 1 to 18446744073709551616 ticks. In practice this means you have absolutely no clue if it'll end within a millisecond or after a hundred thousand years. That level of uncertainty is not what I would call having usable worst-case limits.

(In COSAed integers are dynamically resized to accommodate much, much, much larger numbers than what fits in a measly 64 bits, so a simple loop component might take billions upon billions of millennia to complete.)

As I've said, COSA will remain largely indeterministic, unless someone manages to create a formalism that allows for verifying properties, such as the number of ticks there will be between two signals.

> a non-infinite data set (the only
> type of data set that can allow
> you to be deterministic)

Any language that only process data flows can (theoretically) process an infinite amount of data, yet have many fully deterministic properties.

Samuel said...

> So, a loop-component that uses a
> 64-bit unsigned int could take
> anything from 1 to
> 18446744073709551616 ticks. In
> practice this means you have
> absolutely no clue if it'll end
> within a millisecond or after a
> hundred thousand years. That
> level of uncertainty is not what
> I would call having usable worst-
> case limits.

If you set your circuit to accept input of that size, then yes, without knowing the input size ahead of time we can only give a very large worst case execution time.

We need to remove the requirement of knowing the size of the input in order to make this measurement useful. That is why the worst case reaction time is needed. Basically, you take the worst case reaction time (time required to process 1 tick) and multiply that by the number of ticks that must pass before the next input can be accepted. Then you have a measurement of worst case execution time for a single input. That would be a useful measurement.

Samuel said...

Basically, you would want to take the worst-case single-tick processing time, multiply that by the worst-case input-to-next-input tick-count (basically the number of ticks that occurs between pipeline stages), then add the worst-case execution time for the last input the system receives (to get the final input through all pipeline stages.)
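
In rough pseudocode form (hypothetical Python; generalising the bound to N inputs is my own assumption):

```python
def worst_case_execution_time(wcrt_seconds, ticks_between_inputs,
                              n_inputs, flush_ticks_last_input):
    """Worst-case single-tick processing time, times the worst-case number of
    ticks between successive inputs, per input, plus the worst case to flush
    the final input through all pipeline stages."""
    per_input = wcrt_seconds * ticks_between_inputs
    return n_inputs * per_input + wcrt_seconds * flush_ticks_last_input

# e.g. 1 microsecond per tick, 50 ticks between inputs, 1000 inputs,
# 200 ticks to drain the pipeline after the final input
print(worst_case_execution_time(1e-6, 50, 1000, 200))   # 0.0502 seconds
```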

msundman said...

> If you set your circuit to accept
> input of that size, then yes,
> without knowing the input size
> ahead of time we can only give a
> very large worst case execution
> time.

Do you think it's too much for a loop component to use an "input size" of a single 64-bit integer??? It can't get much lower, you know.

> We need to remove the requirement
> of knowing the size of the input in
> order to make this measurement
> useful.

That'll severely limit the kind of components that can be used. E.g., it means that we can't use a loop-component (at least not one that loops a variable number of times, which is the most useful kind). Actually, to avoid confusion about how a loop-component is used, let's switch the discussion to a delay-component. A delay-component is like a loop-component, except that it fires its output only at the last "loop". (E.g., a delay-component that's bound to a variable that happens to have the value 5 will wait 5 ticks after receiving the input signal before firing the output signal.)

The other alternative is to create a formalism for how each cell acts and how variables' values affect this.

My first try at creating a formalism was based on the hypothesis that most tick counts vary very little, creating fairly small ranges of uncertainty wherever a variable's value was ignored. It seems this hypothesis was incorrect. The ranges were huge (such as the possible values of a delay-component that uses a 32-bit integer), and even when they weren't they very quickly added up to completely unrealistic uncertainties.

My second try was to try to include the variables' values in the formalism, making it much more difficult to use, but eliminating uncertainty completely. I never succeeded in creating any such formalism that would be of any use.

> you would want to take the worst-
> case single-tick processing time,
> multiply that by the worst-case
> input-to-next-input tick-count
> (basically the number of ticks
> that occurs between pipeline
> stages), then add the worst-case
> execution time for the last input
> the system receives (to get the
> final input through all pipeline
> stages.)

But, as has already been made copiously clear, it's impossible to know whether it'll take one tick or a billion ticks before the next input can be accepted.

At this early stage of the play we aren't even interested in getting something "through all pipeline stages". C'mon, we can't even get a signal deterministically one step forward yet.

Samuel said...

> Do you think it's too much for a
> loop component to use an "input
> size" of a single 64-bit
> integer??? It can't get much
> lower, you know.

You seem to want to use the data type itself as the only restriction on what values can be used in your delay circuit. You should be specifying an actual value as the maximum delay. For example, you may wish to set a maximum delay of 10 for that particular component. If the data that comes in results in a setting above 10, then the processor would immediately raise a red flag, indicating out-of-spec data.

There is no reason every delay and every loop should always allow MAXINT as a value. Usually there is a reasonable limit that can be expected during design time.

msundman said...

> There is no reason every delay and
> every loop should always allow
> MAXINT as a value.

That's certainly true, but even conservative limits quickly escalate beyond reason. E.g., a quicksort component should certainly be able to handle arrays with over a million elements. That alone makes the following signals' times completely unpredictable. If you have a bunch of them you quickly realize that you can't know when any signal will fire anywhere, so you have to put in synchronizers/serializers, such as joiner-queues. And so we're back at the beginning of this thread (no pun intended).

> Usually there is a reasonable
> limit that can be expected during
> design time.

I must disagree, at least until I've seen more COSA programs. And even if that was the case, those "reasonable" limits quickly add up, so that any non-trivial component has loads and loads of them, making its behavior highly unpredictable. This makes COSA (as such) highly unscalable. All this also makes it impossible to make verifiers that ensure exclusive access to variables etc.

It's easy to claim whatever when we speak of hypothetical scenarios, but I have seen nothing that would indicate that COSA scales beyond trivial components. (And Louis has certainly not proved COSA to work in practice, and in fact not even indicated that it might. Neither have I, and I've actually tried.)

But let's use some more concrete example, such as the quicksort component. What's the reasonable upper limit of the various values in the quicksort component? What's the worst-case time (measured in ticks) of it?

Samuel said...

The Quicksort algorithm processes all of the data at once. The output signal is not given until the entire data set is processed. That makes it impossible to calculate a worst case without prior knowledge of the input.

Tell me, how would Esterel allow you to compute a worst case execution time or worst case reaction time of the same algorithm?

Louis Savain said...

Marcus,

I guess I was wrong. I still have no idea what you're talking about or what your point is. What's with the joiner queues? COSA supports message queues. Is that what you're talking about? If so, what is wrong with message queues? And why would I want to use queues with parallel QSort since I can spawn a new QSort module at any time?

Samuel said...

Louis, we're trying to establish that Worst Case Reaction Time can or cannot be computed for a Quicksort such that we know the rate at which we can issue a new Quicksort. Spawning a new quicksort circuit during runtime will increase the amount of time it takes to process a single tick, as there will be more active cells to process. If we continue spawning off new quicksort circuits (because we are signalling a new quicksort to begin before the last one finished), eventually it will take an indeterminate amount of time to complete a single tick.

Louis Savain said...

Samuel,

All duration calculations must be computed in terms of virtual clock ticks, not real time. Virtual ticks are deterministic but real time is not. Note also that a parallel quicksort partitions the array and spawns new left and right quicksorts to sort the new partitions, and so on. There are no recursions. The bigger the array, the more quicksorts will be launched and the bigger the load on the processor, slowing things down. This is true of any system, not just COSA. Of course, if you had an infinite number of cores and perfect load balancing, it would no longer matter because real time ticks would be proportional to the virtual ticks.

It is impossible to know exactly how long a sort will take unless one knows the size and the exact order of the array in advance. It makes no difference what language is used and whether or not it is synchronous. This is not knowable in normal situations.

The worst case scenario for quicksort, as far as I know, is when the array is already sorted and the sort begins with the left item. Of course, the greater the size of the array, the longer it will take.

It is not easy to calculate the number of ticks that a worst case scenario will take regardless of the system or language used. The easiest method is to run the sort with a tick counter and write down the number of ticks when it is done.
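
Here is a rough sketch of that empirical approach (hypothetical Python; one comparison is counted as one tick, which is only a stand-in for COSA's virtual ticks): run the sort on an already-sorted array with a first-element pivot and read off the count.

```python
import random

def quicksort_ticks(arr):
    """Iterative quicksort with first-element pivot; returns the tick count."""
    ticks = 0
    stack = [(0, len(arr) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot, store = arr[lo], lo
        for i in range(lo + 1, hi + 1):     # partition around the first element
            ticks += 1                      # count each comparison as one tick
            if arr[i] < pivot:
                store += 1
                arr[store], arr[i] = arr[i], arr[store]
        arr[lo], arr[store] = arr[store], arr[lo]
        stack.extend([(lo, store - 1), (store + 1, hi)])
    return ticks

print(quicksort_ticks(list(range(1000))))   # worst case: about n*(n-1)/2 ticks

random.seed(0)
shuffled = list(range(1000))
random.shuffle(shuffled)
print(quicksort_ticks(shuffled))            # a typical input takes far fewer ticks
```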

Having said that, I still don't get what the controversy is all about.

Samuel said...

> Virtual ticks are deterministic but real time is not.

> ...I still don't get what the controversy is all about.

This is a big problem. In fact, I'd say it's a showstopper.

Say we have a company that needs data sets sorted. Every data set has the same number of data elements (let's say 1 million), but the data elements are completely different each time. The company sells these sorted data sets. The company needs to ensure it can produce X sorted data sets per day to maintain revenue.

In order to guarantee the minimum X sorted data sets per day, the company must know the worst case amount of real time it will take to sort any of the data sets.

Is it, or is it not, possible to know the worst case amount of real time it will take a COSA application to sort a single data set, given that we already know the size of the data to be sorted?

Louis Savain said...

Is it, or is it not, possible to know the worst case amount of real time it will take a COSA application to sort a single data set, given that we already know the size of the data to be sorted?

Of course, it is possible. It is also possible in an ancient language like C, or any other computer language. That is, as long as there are no other programs running on the computer that affect the processor load and, as a result, change the real time count in a way that is unpredictable. It's possible if you have time to waste.

A COSA program will be savable in a sort of intermediate p-code that can be analyzed mathematically. But then again, why go through the trouble of performing a very complex calculation to find a result that can be easily obtained by simply running the program? Is there a sort of pleasure in doing this or is that a machismo thing?

I don't think you guys truly appreciate the complexity of calculating the exact real time duration of a worst case quicksort on a multicore computer. The timing interactions are mind-boggling. Are you willing to calculate the time it takes a core to fetch every variable from data memory? Are you prepared to calculate where (i.e., in which cache) every variable is so that you can know how long it will take to fetch? Are you prepared to calculate how long it takes to write and read an item to and from main memory when multiple cores are accessing memory simultaneously? I don't think so. So yes, of course, it is possible to calculate the exact worst case time for a quicksort but at what cost? And why?

Look, this is not really a COSA issue. What I wrote above is true in Esterel or any other computer language running on any number of cores. All that needs to be done, in my opinion, is to run the program on various array sizes, plot the values on a graph and extrapolate from there. And, of course, you must make sure you always run the program under pretty much the same conditions afterwards. What could be simpler?