Tuesday, July 29, 2008

How to Solve the Parallel Programming Crisis

Abstract

Solving the parallel computing problem will require a universal computing model that is easy to program and is equally at home in all types of computing environments. In addition, the applications must be rock-solid. Such a model must implement fine-grained parallelism within a deterministic processing environment. This, in essence, is what I am proposing.

No Threads

The solution to the parallel programming problem is to do away with threads altogether. Threads are evil. There is a way to design and program a parallel computer that is 100% threadless. It is based on a method that has been around for decades. Programmers have been using it to simulate parallelism in applications such as neural networks, cellular automata, simulations, video games and even VHDL simulation. Essentially, it requires two buffers and an endless loop. While the parallel objects in one buffer are being processed, the other buffer is filled with the objects to be processed in the next cycle. At the end of the cycle, the buffers are swapped and the cycle begins anew. Two buffers are used in order to prevent race conditions. This method guarantees rock-solid deterministic behavior and is thus free of all the problems associated with multithreading. Determinism is essential in mission- and safety-critical environments where unreliable software is not an option.
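
For concreteness, here is a minimal sketch of that two-buffer loop in C++. The Cell type, the ring wiring and the trivial update rule are just placeholders I made up to show the structure, not a finished design; only the buffer-swap pattern matters.

```cpp
// Minimal sketch of the two-buffer loop: process one buffer,
// fill the other, swap at the end of the cycle.
#include <cstddef>
#include <utility>
#include <vector>

struct Cell {
    int state = 0;
    std::vector<Cell*> targets;   // cells to be signaled next cycle
};

int main() {
    std::vector<Cell> cells(8);
    for (std::size_t i = 0; i < cells.size(); ++i)       // wire a simple ring
        cells[i].targets.push_back(&cells[(i + 1) % cells.size()]);

    std::vector<Cell*> current{&cells[0]};                // seed the first cycle
    std::vector<Cell*> next;

    for (int cycle = 0; cycle < 100; ++cycle) {           // the "endless" loop, bounded here
        for (Cell* c : current) {
            ++c->state;                                   // process the parallel object
            for (Cell* t : c->targets)
                next.push_back(t);                        // fill the OTHER buffer
        }
        current.clear();
        std::swap(current, next);                         // swap buffers, new cycle
    }
}
```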

Speed, Transparency and Universality

The two-buffer/loop mechanism described above works great in software, but only for coarse-grain objects such as neurons in a network or cells in a cellular automaton. For fine-grain parallelism, it must be applied at the instruction level; that is to say, the processor instructions themselves become the parallel objects. However, doing so in software would be much too slow. What is needed is to make the mechanism an inherent part of the processor itself by incorporating the two buffers on the chip and using internal circuitry for buffer swapping. Of course, this simple two-buffer system can be optimized for performance by adding one or more buffers for use with an instruction prefetch mechanism if so desired. Additionally, since the instructions in the buffer are independent, there is no need to process them sequentially with a traditional CPU. Ideally, the processor core should be a pure MIMD (multiple instruction, multiple data) vector core, not to be confused with a GPU core, which uses a SIMD (single instruction, multiple data) configuration.
The processor can be either single core or multicore. In a multicore processor, the cores would divide the instruction load in the buffers among themselves in a way that is completely transparent to the programmer. Adding more cores would simply increase processing power without having to modify the programs. Furthermore, since the model uses fine-grain parallelism and an MIMD configuration, the processor is universal, meaning that it can handle all types of applications. There is no need to have a separate processor for graphics and another for general purpose computing. A single homogeneous processor can do it all. This approach to parallelism will do wonders for productivity and make both GPUs and traditional CPUs obsolete.
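
As a rough software analogy of how cores might split one cycle's buffer among themselves, consider the sketch below. The Instruction type and the simple chunking scheme are assumptions for illustration only; in the proposed processor the division of labor would happen in hardware, invisibly to the programmer.

```cpp
// Rough software analogy: several cores split one cycle's buffer of
// independent "instructions". Because the entries are independent by
// construction, the workers need no locks; they only join at the end
// of the cycle, before the buffer swap. Assumes cores >= 1.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Instruction = std::function<void()>;   // hypothetical stand-in

void run_cycle(const std::vector<Instruction>& buffer, unsigned cores) {
    std::vector<std::thread> workers;
    std::size_t chunk = (buffer.size() + cores - 1) / cores;
    for (unsigned c = 0; c < cores; ++c) {
        std::size_t begin = c * chunk;
        std::size_t end = std::min(buffer.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&buffer, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                buffer[i]();                  // execute one independent instruction
        });
    }
    for (auto& w : workers) w.join();         // all cores finish before the swap
}
```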

Easy to Program

The main philosophy underlying this parallel processing model is that software should behave logically more like hardware. A program is thus a collection of elementary objects that use signals to communicate. This approach is ideal for graphical programming and the use of plug-compatible components. Just drag them and drop them, and they connect themselves automatically. This will open up programming to a huge number of people who were heretofore excluded.

Conclusion

Admittedly, the solution I am proposing will require a reinvention of the computer and of software construction methodology, as we know them. But there is no stopping it. The sooner we get our heads out of the threaded sand and do the right thing, the better off we will be.

See Also:

Why Parallel Programming Is So Hard
Parallel Computing: Why the Future Is Non-Algorithmic
Computer Scientists Created the Parallel Programming Crisis
UC Berkeley's Edward Lee: A Breath of Fresh Air
Why I Hate All Computer Programming Languages
Half a Century of Crappy Computing
How to Construct 100% Bug-Free Software
Why Software Is Bad and What We Can Do to Fix It
Parallel Computing: Both CPU and GPU Are Doomed

48 comments:

Ian said...

You do realise that the sheer quantity of synchronisation required for this model would prohibit its use on anything larger than a single box, right?

Louis Savain said...

You do realise that the sheer quantity of synchronisation required for this model would prohibit its use on anything larger than a single box, right?

No. I am not sure what you mean by a single box, but I can assure you that this model can easily support tens of thousands of cores on a die. That is, of course, assuming the hardware technology will have advanced to that level. The synchronization is an automatic characteristic of the loop/buffer mechanism. It does not require any exotic circuitry.

Paul Mohr said...

Very interesting; I always like to look at new ideas for parallelism. I don't immediately grasp the entire concept, but I will read more of your blog and links before I judge the potential. It seems this could be done in an FPGA to determine its viability. I do think that multiple technologies can be integrated in a single system to specialize the hardware for specific tasks. Content-addressable memory is used in routers, and if it were more commonly available in the PC it would make programming easier and much faster for some common database tasks.

Ken Horn said...

Certainly seems like an FPGA implementation in Verilog / VHDL is the way to go. If you run it on an XtremeData card, it can drop into a multi-socket AMD box, giving good access to main memory. Why not start a wiki/mailing list to collaborate on the design? IMHO, graphical programming gets very complicated as soon as you want to change a large design -- grep (aka refactoring) is your friend -- so I'd expect to need a textual representation to be interchangeable with the graphical language. Starbridge has a data flow tool like this. Of course it starts getting complicated (visually) if data dependencies require loops.

Louis Savain said...

Paul and Ken,

Thanks for the comments. You're both right about the feasibility of using an FPGA to build a working prototype of a COSA processor. However, I don't think that it is needed to determine its viability. We know this model works because it's been in use for decades. I do agree that it would go a long way toward convincing doubters. Having said that, note that the people with the purse strings are interested in three things: ease of programming, performance and automatic scalability. These are not things that can be achieved in one's spare time. It requires talent, time and money. By money, I mean millions.

Ken, you are correct that programs created with the graphical dev tool must compile into an intermediate text file using some sort of p-code. Someone (can't remember who, probably Marcus Sundman) once suggested XML, which I think is a good idea.

I get a lot of objections from software engineers regarding graphical programming. Of course, they invariably base their criticism on tools that they've seen. We must remember that most of the research in programming tools has been limited to textual languages. Even so, this has not brought programming to the masses. Indeed, it seems that computer languages have gotten even more arcane and obscure over the years. Needlessly, I might add.

People often complain about the proliferation of signal lines in graphical languages. However, this should only be seen as the result of a mediocre implementation and not as a characteristic of graphical languages in general. Information hiding can do wonders for program comprehension, regardless of the type of tool. In the COSA programming model, multi-connectors can be used to bundle many connections into one, and this goes a long way toward eliminating clutter.

By the way, the model that I am proposing is not data flow but signal flow, not unlike a logic circuit.

Ken Horn said...

I'm not convinced it would take millions to show most of the benefits in prototype form -- do you have a software implementation of COSA? I think an open source project could advance the design pretty far.

I don't think a graphical tool is needed to show the underlying model works either -- a simple text language is both easier to develop and to debate since most people who could help will be coming from a text based language background. I think solving one problem at a time is feasible -- for example Verilog for circuit design is a very simple language -- like C both in terms of abstraction level and syntax.

I'm used to building systems in Verilog that I'd call data flow -- but I can't think of the difference you mean by signal flow. Is there a page that describes COSA in more detail? Is it similar to CSP or SEDA?

As for too many signal lines, these can simply be rolled up / filtered similar to interfaces in SystemVerilog. I've long argued that Verilog is almost the right abstraction layer for many tasks but the problem is the tools are stuck in the dark ages.

Paul Mohr said...

I agree with Ken, and I could simulate the hardware in software; that could give you a measure of its speed and functionality. If it is a good idea, I have a ton of FPGAs, and it isn't that difficult to program a pick-and-place machine and design a board. There are open-source board design programs I use. It could even be hand-stuffed. I don't completely understand your concept yet, but atomic parallelism seems like it might work.

Louis Savain said...

Ken,

Data flow is something else. In data flow, parallel objects are fed data, as in a multimedia stream. In a signal flow environment, by contrast, such as a logic circuit, only signals move around. A signal is not a piece of data but a temporal event marker that indicates that something (an event) just occurred. In the COSA model, objects are inherently parallel. Signaling is the explicit mechanism of sequential order. Unlike a logic circuit, signaling is virtual in software. The mere placement of an object into a processing buffer is taken to mean that it just received a signal.
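
To illustrate the difference in code, here is a tiny sketch with made-up names: in the data flow case the value itself travels to the consumer, while in the signal flow case the value stays in shared memory and only the scheduling of the consumer cell moves.

```cpp
// Tiny contrast between the two styles (names are made up).
// Data flow: the value itself travels to the consumer.
// Signal flow: the value stays in shared memory; only an event marker
// moves, i.e. the consumer cell is placed in the next cycle's buffer.
#include <queue>
#include <vector>

struct Cell {};                                // placeholder parallel object

std::queue<int> stream;                        // data flow: the datum travels
void produce_dataflow(int v) { stream.push(v); }

int shared_value = 0;                          // signal flow: data stays put
std::vector<Cell*> next_buffer;                // next cycle's processing buffer
void produce_signalflow(int v, Cell* consumer) {
    shared_value = v;                          // write in place
    next_buffer.push_back(consumer);           // the "signal" is just scheduling
}
```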

Having said that, I don't think that a textual language will convince anybody to switch to the COSA model. A truly revolutionary programming interface is needed. This is the reason that I am so excited about Jeff Han's multitouch screen technology. It allows the kind of quick and easy, trial-and-error manipulation and rearrangement of parallel visual objects that, I believe, will move application development into the 21st century. It is time we said goodbye to the antiquated and woefully inadequate text-based technologies of the last century.

Louis Savain said...

Paul,

The concept is simple but designing a processor to do the job is not easy. The trick is to keep the cores busy and balanced when there is a heavy load and to turn as many of them off as possible when the load is light in order to save energy. There is also the problem of the data caches used by the cores. Sometimes, a piece of data is needed by one core but resides in another core's cache. Tilera came up with a nice solution with its proprietary I-Mesh technology that links neighboring cores with a high-speed network on chip. Still, load-balancing must be such that inter-core communication (for read/write) is kept to a minimum for performance reasons. I have some good ideas on how to do that, but I'll hold on to that card for the time being.

Ken Horn said...

Is there a page that actually describes COSA? You talk about it not being data parallel, but signal based -- what do the signals represent? If they are not a von Neumann instruction stream and not data, how is the processor programmed? FPGA-based systems are signal processors -- this signal flow, is it flags? How is data passed if not as signals? Does the program get loaded and remain in place like an FPGA/CPU design? As ever, the key to performance is either to never move data other than stimuli (as in FPGAs -- the data is often held in local memory on chip) or to completely hide memory/comms latencies via some context-switching mechanism (which hence requires more memory).

Much of this heads in the direction of transputers of course..

I'm still not sold on a graphical language but I think that's an orthogonal concern to the chip architecture.

karl said...

I don't know if this helps, but the way I picture it is like this:

Traditional sequential programs are a big column of instructions, one instruction on every line, and processing works its way down. It can jump up and down and can start and stop anywhere but broadly speaking it goes from top to bottom, one instruction at a time.

Threaded programming attempts to pull chunks of this column out to the sides, but this is pretty tricky with all the dependencies amongst all these instructions.

Instead think of the processing as going from left to right through every instruction at once. Some instructions are waiting for input, so they're shunted to the right into a second column. Once the first column is totally processed, the processor moves to the second column and starts again. If you need multiple copies of a particular instruction, the processor would just shunt a copy of that instruction into the second column, which would be attached to the alternative input.

You could still run this on traditional multicore chips, with multiple threads to process chunks of instructions from the current column. It's just that each instruction is independent of all the others in that column, so the threads don't have to synchronise at all.

Surely it wouldn't take too long to program a rudimentary version something like this?

Louis Savain said...

Ken,

For more information on this model, you can read the Project COSA pages including the Operating System and Software Composition pages.

There are, of course, multiple instruction streams, although, in this model, they are not streams in the conventional sense. They are just elementary objects connected to each other, the same way logic gates or neurons in a neural network are connected to each other with signal lines. Basically, a COSA program is a logic circuit. It consists of elementary cells that wait for one or more signals to do something, after which they may or may not send a signal to one or more other cells. There are also sensors (comparison operators) that detect changes in data and send a signal when a change is detected. COSA is a reactive software model.

Of course, the signals are virtual in software. There is no EMF transition that travels from one cell to another. The way it works is that every cell has a list of cells that it is connected to, if any. During execution, the processor places the items in the list into the second buffer where they will be processed during the next cycle.

One of the most important things about this model is the timing of the signals. Signal timing is 100% deterministic. The arrival times of the signals are essential because some cells must know whether their input signals arrived concurrently or consecutively in a given order.
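
As a rough illustration of what deterministic timing buys you, here is a small sketch with hypothetical names: since every signal is delivered on a definite cycle of the virtual clock, a cell can tell whether its inputs arrived together or one cycle apart.

```cpp
// Sketch of how a cell could tell concurrent from consecutive inputs
// under a global (virtual) cycle counter. The AndCell name and its
// outputs are hypothetical; the point is that every signal carries a
// definite cycle number, so timing comparisons are deterministic.
#include <cstdio>

struct AndCell {
    long last_a = -1, last_b = -1;             // cycle of most recent signal on each input

    void signal_a(long cycle) { last_a = cycle; check(cycle); }
    void signal_b(long cycle) { last_b = cycle; check(cycle); }

    void check(long cycle) {
        if (last_a == cycle && last_b == cycle)
            std::puts("inputs arrived concurrently");        // e.g. fire one output
        else if (last_a == cycle && last_b == cycle - 1)
            std::puts("b then a, on consecutive cycles");    // e.g. fire another
    }
};
```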

You can also read Parallel Computing: Why the Future Is Non-Algorithmic for a quick introduction.

Ken Horn said...

So is it fine grained CSP with a graphical language? Occam?

Jonathan Leonard said...

This comment has been removed by the author.

Louis Savain said...

Ken,

COSA is indeed fine-grained but, as far as I understand it, CSP uses message channels to copy messages between parallel objects. COSA does not. In COSA, a message is just shared memory. Also, if I remember correctly, the transputer (for which CSP was created) was based on multithreading and could be easily simulated on a single-core machine. This is absolutely not the case with COSA: multithreading is anathema to the COSA model because it is non-deterministic. COSA does not use threads at all, and there is no context switching.

Also, in my view, CSP is so arcane that only a nerd could love it. What I am trying to do, among other things, is move parallel application development into the mainstream.

PS. Several people have compared COSA to CSP in the past. I don't get it. To me, COSA looks more like VHDL with the exception that, in COSA, the timing of signals is explicit and available to the program. Has anybody compared CSP to VHDL?

Louis Savain said...

Karl,

I like your analogy. However, I disagree that COSA could be run efficiently on a conventional processor. In COSA, there are no such things as independent threads. Everything must be synchronized to the same global virtual clock. Deterministic timing is absolutely essential to reliability and security, in my opinion.

Louis Savain said...

Jonathan,

I am not familiar with von Neumann's universal constructor. However, I agree that cellular automata use the same sort of deterministic parallelism that is used in COSA.

Ken Horn said...

I think the VHDL (or, to me, the much simpler Verilog) analogy is pretty close -- you get massive fine grained parallelism and deterministic behaviour, that's why the FPGA accelerator usage has been popular. As far as I understand it, CSP doesn't necessarily require message channels except in a logical sense, if the data is immutable -- threads are also an implementation detail if you go for blocking reads.

I find CSP / VHDL(or rather chip design) / SEDA / COSA (as far as I understand it) / Unix pipes, all essentially the same mechanism, of partitioned data and calculation flow. The levels of explicitness in the programming models are similar even if the abstraction from silicon is quite different. A channel is a queue is a FIFO.

The other common parallel scheme seems to be (again, from memory) the Occam / parfor / etc used by main systems which can sometimes be derived by a standard compiler based on data dependencies.

As ever tools to help partition sequential and parallel sections and declare/avoid data dependencies are key for these approaches.

I'm still not clear on the lifecycle of COSA objects or how a CPU would be configured, other than the existing options of a von Neumann type core or FPGA/DSP data flow. There are many mesh/FPOA type approaches for inter-node (on-chip) communications, but without context switching (in terms of unit of work, not necessarily thread based) or a data path program design, you end up with lots of idle silicon, I think (which at least has the advantage of low power).

I think to advance the ideas in the discussion a concrete example program, in some text syntax, would be the most productive route.

dvyukov said...

There is one point in the COSA model that I can't understand: how are you going to manage data in COSA?

A handler must read the data value as it was at the beginning of the execution cycle, not the latest value.

I see 2 solutions here.
First, you can store 2 versions of the value for every data component -- 'current' and 'next' -- and swap the meaning of those values after every execution cycle. This approach has serious drawbacks.

Second, you can place all new values into some 'redo' log and apply this 'redo' log at the end of the execution cycle. This approach also has serious drawbacks.

Both approaches take a serious hit on run-time properties (memory consumption and/or execution time).
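
For concreteness, here is a rough sketch of the first option as I imagine it (the names are made up):

```cpp
// First option: every data component keeps two versions, and their
// roles flip after every execution cycle. Readers see the value from
// the start of the cycle; writers touch only the other copy. A real
// implementation would also have to carry unwritten values forward --
// one of the drawbacks mentioned above.
struct Versioned {
    int value[2] = {0, 0};

    int read(long cycle) const    { return value[cycle & 1]; }        // value at cycle start
    void write(long cycle, int v) { value[(cycle + 1) & 1] = v; }      // visible next cycle
};
```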

Can you describe how you are going to manage data components so that every handler reads the value that was in effect at the beginning of the execution cycle? Or just give some links to articles where this is described.

Thanks.

Louis Savain said...

dvyukov,

It seems to me that you are talking about a problem that is common in multithreading environments. That is, a data variable may be changed by one thread while it is being read by another thread. The reason that this is a problem is that multithreading is not deterministic. That is to say, it is impossible to determine exactly at what time a piece of data will be read or modified.

The COSA model, by contrast, is 100% deterministic. Data reads and writes occur in a predictable manner and, if there is any conflict, it can be easily discovered by the development tool, and the developer will be alerted accordingly.

dvyukov said...

Yes, this problem is common to all multi-threading environments to some degree.
But in a non-synchronous/non-deterministic environment this problem can be solved in a very straightforward manner: just lock a mutex and update the data in place. The reader will read the most current value. Simple.
In COSA the problem cannot be solved this way, because the reader must NOT read the most current value; it must read the value that was in effect at the beginning of the current execution cycle.
So you have to do some additional bookkeeping and some additional work to ensure this property. The cost of determinism, so to say.
The most obvious way to solve the problem in COSA, as I see it, is to keep a redo log of all writes and apply it at the end of the execution cycle. But this solution has serious drawbacks with respect to memory consumption, execution time, the synchronization needed and locality of work.

Please describe how you are going to solve this problem. This point prevents me from forming a comprehensive notion of COSA.

Dmitriy V'jukov

Louis Savain said...

dvyukov,

I don't think you understood my last comment. In a fine-grain deterministic environment, this is not a problem. The development tool can easily examine the signal flow of a COSA program to determine whether or not there is potential conflict. If so, it will alert the programmer to the conflicting condition so he or she can modify the program to eliminate it.

Modifying the program might consist of simply re-routing a signal and maybe using a wait cell in order to change the timing of an action. It's not rocket science.

dvyukov said...

Ah! Ok, now I get you.
So basically you can update all data in place, and at the same time readers can't read "wrong" values *by design*.

Hmmm... I need to think on this...
I'm not yet sure about 2 points:
(1) Whether all algorithms can be expressed this way. Such a scheme has the limitation that on some execution cycles a handler can't read a given piece of data. And what if we can't statically determine on which cycles it's allowed to read the data and on which it's not?
(2) Whether all algorithms can be expressed *efficiently* this way.

Why was I initially thinking about a redo log? Because this is how synchronous hardware works. And a scheme based on a redo log doesn't have the limitation that it's sometimes disallowed to read some data. Note that since readers always read the values that were in effect at the beginning of the execution cycle, the design is still deterministic! But of course "concurrent" (i.e. within one execution cycle) writes must be prohibited.

Dmitriy V'jukov

Louis Savain said...

Dmitriy,

I have no idea what you mean by "wrong" values. Your "redo logs" are what some people refer to as transactional memory, which, in my opinion, is the wrong way to do things. It wreaks havoc with deterministic processing and is not any more efficient. Everything depends on proper timing. If your system is deterministic, you can read data either before or after it has been changed. The choice is yours. What is important is that there is no read/write conflict that could compromise the data.

dvyukov said...

No, I am not referring to transactional memory.
COSA is deterministic. Good.
But determinism is achieved at a high price from the user's point of view. You are prohibiting reading values on execution cycles when the values are changed. Right?
Redo logs are a way to relax this requirement. I.e. a handler will be able to read values on the same cycle when the values are changed, AND the system will STILL stay deterministic.

Dmitriy V'jukov

Louis Savain said...

COSA is deterministic. Good.
But determinism is achieved at a high price from the user's point of view.


What high price? It's just a matter of reading a variable after it has been written to. If there is no write, there is no wait. Besides, this is a very rare occurrence in COSA. COSA is a reactive system, which means that, in general, changing a variable causes a reaction. In other words, variables are tested (using comparison operations) immediately and automatically after they are changed and, if necessary, a signal is sent to every object that is dependent on the change. Both the change and the test occur within the same cycle because they are considered to be a single action.
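
To make that concrete, here is a minimal sketch with made-up names (it is not COSA itself): writing a variable immediately runs the comparison sensors attached to it, and every sensor that fires schedules its dependents for the next cycle.

```cpp
// Write and test as a single action within one cycle: changing the
// variable immediately runs its comparison sensors, and each sensor
// that fires places its dependent cells in the next cycle's buffer.
#include <vector>

struct Cell {};                                 // placeholder parallel object

struct Sensor {
    int threshold;                              // a simple "greater than" sensor
    std::vector<Cell*> dependents;              // cells to signal on a positive test
};

struct Variable {
    int value = 0;
    std::vector<Sensor> sensors;                // comparison sensors watching this variable

    void write(int v, std::vector<Cell*>& next_buffer) {
        value = v;                              // the change...
        for (const Sensor& s : sensors)         // ...and its tests, in the same cycle
            if (value > s.threshold)
                for (Cell* c : s.dependents)
                    next_buffer.push_back(c);   // signal = schedule for the next cycle
    }
};
```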

You are prohibiting reading values on execution cycles when the values are changed. Right?
Redo logs are a way to relax this requirement.


It is a perfectly sensible requirement. Why would I want to relax it? I don't see how "redo-logs" are going to increase performance. If anything, I think they will decrease performance.

I.e. a handler will be able to read values on the same cycle when the values are changed, AND the system will STILL stay deterministic.

I doubt it very much. Redo logs just add unneeded complexity, in my opinion. My solution only adds a small wait while the variable is being changed. Most of the time, there is no wait at all. Redo logs waste time and space.

dvyukov said...

Well, OK. Now I understand. Thank you for the explanation.

Dmitriy V'jukov

Walter said...

You do realise that the sheer quantity of synchronisation required for this model would prohibit its use on anything larger than a single box, right?

> No. I am not sure what you mean by a single box, but I can assure you that this model can easily support tens of thousands of cores on a die.

I think what Ian means is that you can't reasonably use this on more than one computer. So an asynchronous method of parallel programming would still be required for applications that work on networks of computers. Erlang, for example, will run on multiple cores in a single machine. But it can also be extended to multiple, networked computers just as easily as a single machine.

Louis Savain said...

Walter wrote:

I think what Ian means is that you can't reasonably use this on more than one computer. So an asynchronous method of parallel programming would still be required for applications that work on networks of computers. Erlang, for example, will run on multiple cores in a single machine. But it can also be extended to multiple, networked computers just as easily as a single machine.

Well, COSA uses synchronous processes but it also uses asynchronous messaging. This means that it can just as easily send asynchronous messages to other objects on a distributed network. Erlang's approach to communication is to copy the entire message to a message channel. I think this is absurd because the performance hit is enormous.

Jason said...

Please start a project somewhere to implement the GUI and the hardware simulator.

ilya said...

So, multicore is here, yet the vast majority of applications have not been multicore-enabled. Why? Well, first of all, parallel programming is hard — in fact viewed by many as an unsolved problem. And second, here’s the rough calculus done today by software development teams: On a dual-core system (the typical configuration today), one can typically expect a two-fold improvement at best. This performance improvement is just not compelling enough to make the (potentially considerable) development investment. Once the 8- and 16-core systems become mainstream, however, a single-threaded application may be leaving an order of magnitude in performance and throughput on the table. That’s when multicore enablement moves from “nice to have” to a “must have.”

My sense is that for broad adoption, the change has to be evolutionary, rather than something that's brand new and radical.

Four tools/approaches have a shot at being a mainstream multicore programming approach:
- OpenMP
- Intel's Threading Building Blocks
- Microsoft Concurrency Runtime
- Cilk++

Jonathan Leonard said...

Don't forget:
Parallel FX
Erlang
F# Workflows
Parallel Haskell

igravious said...

metrics or it's not science, only froth

Katarina Malthus said...

Or maybe we could just stop being lazy and learn to properly implement parallelism? Also, Sun's latest line of processors, the E-series, does a very good job of making parallelism on the processor seamless through its automatic branching determination.

Also, Parallel Studio, a.k.a. parallelism for the incompetent. However, it works.

tetron said...

Practically speaking, since you're always going to have a lot of instructions which are blocked waiting for the results of other instructions (computation is sequential after all!) how is this idea different from a superscalar processor with a deep pipeline? Modern CPUs already achieve a lot of fine-grained instruction level parallelism, it's just not exposed to the programmer.

Louis Savain said...

tetron wrote:

how is this idea different from a superscalar processor with a deep pipeline?

Well, superscalar processors do a great job if the parallelism is local. The reason is that the processor can find independent instructions only within a small segment of the instruction stream. The superscalar approach is blind to what I call 'deep independence'. This is a serious flaw because, without deep independence, most of the parallelism cannot be exploited.

The second major problem with superscalar processors (with all processors, for that matter) is that they have no mechanism for deterministic reactive processing. In my opinion, this is unacceptable. Deterministic timing (based on a virtual clock, of course) is essential to reliable software and reactive processing is the key to exposing parallelism in a program.

Chris Stevenson said...

I'm just coming across this, and it's really, really hard to grasp. From the blog comments, it seems that I'm not alone.
You may want to focus on a short video presentation to get the concept across, something that the casual watcher of YouTube can understand.

Marijan said...

----------------
Part 2
----------------
The only complication that I see is when a previous operation is not completed at the time the next instruction is about to be executed, in which case that instruction (actually its address only) should be moved into the next time/cycle (that is similar to the two-buffer concept, if I understood it properly). Now there would be a problem with cyclic processing, as in classical program loops, since the proper place for this moved instruction has changed; so perhaps it should remain in its cycle buffer, and the one executed in the next logically following cycle has to be removed after execution. There could be some statistics on this, so that if it happens more often than the normal execution schedule expects, the instruction should be advanced to the next cycle permanently.
Now, in the case of processing a loopful of instructions, there would be some instructions necessary to return to the execution of previous time/cycles. Since the buffers would hold just the addresses of instructions, which may also say on which computer and which core each one is (and the same goes for the variables that an instruction uses, though naturally, as in OO, data should be clustered with the instructions handling it), it is irrelevant where the actual instruction is, so any and all available processors could be used, right?


Regards, Marijan Pollak, IT SA/SE 1st. Class, Instructor and Team Leader (retired)

Marijan said...

----------------
Part 3
----------------
A copy of the cycle buffers could be kept on all CPUs, containing only the instructions that would be executed in the current logical cycle, but then there should be a mechanism for synchronizing the execution of logical cycles so that all CPUs would really work in parallel. Since signals could be implemented as flag bits, as was done on old mainframes, I see no reason to make a new kind of CPU just now, as the compiler can arrange it all and the program loader should do what is necessary to distribute the program over the available hardware. Perhaps the system I am trying to describe was already implemented in some specialized language, such as OCCAM, that I am not familiar with; if so, I apologize. But this seems to me the simplest system, and we all know it should be as simple as possible, right?
Now, my opinion about graphically represented programming is that it is good at the meta level of programming, when you construct a program from Objects, and the Objects encapsulate their own Methods, which essentially react to data arrival or to a request for data export from another Object. From a practical view, to do this at the level of instructions, one would have to have a column of instruction symbols from which to choose (and those would be hard to represent graphically anyway, so why not stick to the instruction mnemonics that already exist?), and then each would have to be supplied with data variable names, which would have to be in another column or alphabetic matrix to choose from, and each variable would have to be defined first as well, so I don't see just now how this would make programming faster...
Again, I hope you will excuse an old colleague if what I write does not really make sense to you, as I am retired and perhaps should let the younger generations carry the torch......

Regards, Marijan Pollak, IT SA/SE 1st. Class, Instructor and Team Leader (retired)

Marijan said...

It seems that somehow the first part of my comment was lost, so here it is again:

Hi there, everybody..
It seems I am limited to 4096 characters, so I must cut most of the fine text I wrote :-((
I would suggest that the script show how many characters are left to write, as this is a waste of time otherwise...
Since I cannot know when I reach the limit, I will divide this into parts.....
----------------
Part 1
----------------
I chanced upon this blog just now, and I have problems following this because it seems to be a discussion of a CPU rather than a programming language. However, even if the concept of instruction objects is identical to my vision of the future of programming languages, I don't get how putting an instruction object into a buffer would solve the problem of timing their execution. The concept is good, but maybe some improvements can be suggested.
All time should be virtual within the system, as that is the only way to be able to execute everything that needs to execute at the same time. The timing/dependencies of instructions should be determined in advance and buffers formed for each time/cycle segment, and then executed one at a time. Signals could then be simple, just telling that data is being processed or that the instruction changing it has finished its work and put the result into the variable.

Marijan said...

Hi, this is Marijan again...
I have read about COSA in the meantime, and it seems that there is one great misunderstanding of what procedural languages are. The problem is with bad OSes of the Windows type, which are faulty to start with and made by people who don't really understand the OO paradigm, so the system of message processing is unreliable and has little to do with the intended sequence of events. If that functioned properly and messages were passed directly to the object that has to receive them, then the system would work the way COSA wants it to behave. The same goes for legacy systems where no programming standard was set, so the people who have to change an existing application don't know where to look to introduce changes or additions.
I once read an official description of what MS calls >>iterative programming<<, where they introduce changes and look where the program breaks, then patch those breaks, and so on ad nauseam....
That is NOT programming, that is HACKING. In my time there were no >>bugs<< in a program, as we would lose the client if the results of data processing were not 100% correct.
Originally, programs were what COSA wants them to be: a strictly deterministic sequence of instructions, one following another, and until one was finished another could not be started. Of course, that is no parallel processing, just sequential.

Marijan said...

Now, suppose this is all nicely functioning and whatever could be done in parallel is executed; but we are talking about a single program. What about multiple programs running in parallel? That was called >>multitasking<<....
I have learned and used 26+ different programming languages, many across several generations and on several platforms, but I have not seen a single one that has no loops, testing constructs and the other things Louis wants to ban from programs. Even if I agree that many types of loops are unnecessary, something that fulfills their role in a program has to exist. In some cases (such as table lookup) it can be substituted by sending messages to all objects or their instances, but that essentially means that each Object has to process the message, which would slow down execution on big data sets. In table lookup, unless it is turned into an instruction and so hidden from the programmer's view, one would not be able to devise faster ways to search the table, as for instance in the binary search algorithm. Louis also wants software to behave like hardware, and in the beginning there was only hardware, programmed by wire connections. Machine code also behaves exactly like hardware, and all higher programming languages are compiled to it, so software basically cannot do anything that hardware cannot do as well. Last but not least, those projects that were started and used up great amounts of money failed because they were not planned properly, and planning is based on requirements analysis, which Louis thinks we can do without. That is not so.

Marijan said...

I am writing in parts because of the limitations, sorry!

Now, dear Louis, how would the one who is supposed to make a program or application know what to make? It is all right if the program is for personal use or one is writing a game; then that person surely knows what to make, as the requirements are known. As soon as one starts working on a program or application for someone else, one first has to analyse the problem or get the requirements from the user, to know what the purpose of the program is and what data it is supposed to process and how, as well as which form of results is expected. I have made many different kinds of programs, from direct hardware control programs to neural networks and evolutionary programming or genetic algorithms.
But most of my work was for other people and companies, and how would I as a programmer know what the client wants? The biggest problem there is that the client assumes the programmer knows what he is talking about, or takes some things for granted since it is his everyday knowledge. I was lucky to have studied economics, so I mostly knew what my clients were talking about, but I am also a systems analyst and systems engineer, and if someone does not study the problem and specify the requirements, nobody can program anything.....
Louis also assumes things that are not correct, like the idea that a program is running all the time while it is being created. By the same paradigm he is proposing, since no sensor is activated, the program is NOT running. The editor/interpreter IS running, but not the program under development.

Marijan said...

To continue, it is also not necessarily true that the program should be interpreted all the time. As each Instruction Object is incorporated (all parameters supplied, graphically or numerically, or by connecting its interface to operands so the Object knows where its input is coming from and where its output is going), such an object could be compiled >>on the fly<< and be ready for execution the next instant, together with the rest of the program. Still, the user has to start the program, and there are almost no programs that have all data hardcoded as constants that would never change, are there?
So, even if Louis is quite clever and obviously knows a lot, there is a need for requirements analysis as well as for testing the program when it is finalized. The simple fact that the programmer decides it is finished is no guarantee that he did not forget some detail, so that it is missing from the program, or that he has not misunderstood the client in some regard; therefore a set of test data should be prepared and run on the program to see if it all produces exactly the results that are expected. Moreover, intentional mistakes should be introduced into the data, or part of the data dropped out, to test how the program handles user errors......
Therefore, dear Louis, some things are indispensable and cannot just be dropped out of the process, however automated it may be.

Mario De Roma said...

Parallelization is a hot potato.

I like this article, but maybe I do so because it pushes the potato away from me (as an application programmer) and passes it on to the CPU architecture guys.

Only they kinda made it clear they want nothing to do with it; they're all into spitting out even more cores on a single chip: they have no time for that.

I surely hope compilers will eventually take the lead and bake that evil potato away for good without making me leave the procedural/object-oriented paradigms.

Otherwise we will really have to bend to some functional (malevolent) god.

Tom said...

I realize this post was written about a year and a half ago, and you may have wised up over that time, but that notwithstanding, you, sir, are a buffoon. Sure, VHDL is nice, but it doesn't work quite so well for application programs. Application programs are, in general, massively sequential. If A, B, and C have to happen before D does, or else you're wrong, then they have to happen in that order, one after the other. It doesn't matter if you have the hardware to do 50 other things at the same time; if things have to happen sequentially, they have to happen sequentially. To a large extent, modern processors implement out-of-order execution to mask away as much of this as they can.

Nondeterminism is not bad. Programs don't care exactly when memory is referenced (with proper synchronization [read: sequentiality]).

Nondeterminism isn't the same as unreliability. Nondeterminism coupled with poor programmers causes unreliability.

M. Simon said...

You might want to look into GreenArrays GA144 (bottom of the page).

It seems to be an example of the type of computing you describe. Work is triggered by data. No data - no work.

Unknown said...

The general approach being considered here is pretty much the FSM approach used in hardware. A similar approach is communicating sequential processes (CSP). I've had a go at incorporating both methodologies into C++ -

http://parallel.cc