Saturday, October 25, 2008

Why Software Is Bad, Part IV

Part I, II, III, IV

The Silver Bullet

The Billion Dollar Question

The billion (trillion?) dollar question is: what is it about the brain and integrated circuits that makes them so much more reliable in spite of their essential complexity? But even more important, can we emulate that reliability in our software? If the answer is yes, then we have found the silver bullet.

Why Software Is Bad

Algorithmic software is unreliable because of the following reasons:

  • Brittleness

    An algorithm is not unlike a chain. Break a link and the entire chain is broken. As a result, algorithmic programs tend to suffer from catastrophic failures even in situations where the actual defect is minor and globally insignificant.


  • Temporal Inconsistency

    With algorithmic software it is virtually impossible to guarantee the timing of various processes because the execution times of subroutines vary unpredictably. They vary mainly because of a construct called 'conditional branching', a necessary decision mechanism used in instruction sequences. But that is not all. While a subroutine is being executed, the calling program goes into a coma. The use of threads and message passing between threads does somewhat alleviate the problem, but the multithreading solution is way too coarse and unwieldy to make a difference in highly complex applications. And besides, a thread is just another algorithm. The inherent temporal uncertainty (from the point of view of the programmer) of algorithmic systems leads to program decisions happening at the wrong time, under the wrong conditions.


  • Unresolved Dependencies

    The biggest contributing factor to unreliability in software has to do with unresolved dependencies. In an algorithmic system, the enforcement of relationships among data items (part of what Brooks defines as the essence of software) is solely the responsibility of the programmer. That is to say, every time a property is changed by a statement or a subroutine, it is up to the programmer to remember to update every other part of the program that is potentially affected by the change. The problem is that relationships can be so numerous and complex that programmers often fail to resolve them all.
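As a minimal illustration of such an unresolved dependency, consider this Python sketch (the Order class and its fields are hypothetical): the derived total is kept up to date by hand in one setter and forgotten in the other.

```python
# Hypothetical example: 'total' depends on 'price' and 'quantity',
# and it is solely the programmer's responsibility to remember to
# update it after every change to either one.
class Order:
    def __init__(self, price, quantity):
        self.price = price
        self.quantity = quantity
        self.total = price * quantity  # derived property

    def set_price(self, price):
        self.price = price
        self.total = self.price * self.quantity  # dependency resolved by hand

    def set_quantity(self, quantity):
        self.quantity = quantity
        # Bug: the update of self.total was forgotten here, so the
        # order now silently carries a stale total.

order = Order(price=10, quantity=2)
order.set_quantity(3)
print(order.total)  # still 20, not 30: an unresolved dependency
```

The compiler accepts both setters without complaint; nothing in the language itself knows that `total` depends on `quantity`.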

Why Hardware Is Good

Brains and integrated circuits are, by contrast, parallel signal-based systems. Their reliability is due primarily to three factors:

  • Strict Enforcement of Signal Timing through Synchronization

    Neurons fire at the right time, under the right temporal conditions. Timing is consistent because of the brain's synchronous architecture. A similar argument can be made with regard to integrated circuits.


  • Distributed Concurrent Architecture

    Since every element runs independently and synchronously, the localized malfunctions of a few (or even many) elements will not cause the catastrophic failure of the entire system.


  • Automatic Resolution of Event Dependencies

    A signal-based synchronous system makes it possible to automatically resolve event dependencies. That is to say, every change in a system's variable is immediately and automatically communicated to every object that depends on it.
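As a rough sketch of this idea, here is a toy 'cell' in Python whose dependents are notified automatically whenever its value changes (the Cell class and the listeners are made up for illustration; this is not any particular signal-based system):

```python
# A toy signal-based cell: assigning a new value automatically
# broadcasts the change event to every registered dependent, so the
# programmer never resolves the dependency by hand.
class Cell:
    def __init__(self, value):
        self._value = value
        self._listeners = []

    def connect(self, listener):
        self._listeners.append(listener)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        for listener in self._listeners:   # broadcast the change event
            listener(new_value)

price = Cell(10)
log = []
price.connect(lambda v: log.append(f"repricing shelf tag to {v}"))
price.connect(lambda v: log.append(f"recomputing cart totals at {v}"))
price.value = 12  # one assignment, every dependent reacts
```

One assignment reaches both dependents; forgetting to "call the update code" is no longer possible, because the update is wired to the event itself.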

Programs as Communication Systems

Although we are not accustomed to thinking of it as such, a computer program is, in reality, a communication system. During execution, every statement or instruction in an algorithmic procedure essentially sends a signal to the next statement, saying: 'I'm done, now it's your turn.' A statement should be seen as an elementary object having a single input and a single output. It waits for an input signal, does something, and then sends an output signal to the next object. Multiple objects are linked together to form a one-dimensional (single path) sequential chain. The problem is that, in an algorithm, communication is limited to only two objects at a time, a sender and a receiver. Consequently, even though there may be forks (conditional branches) along the way, a signal may only take one path at a time.
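The chain described above can be modeled in a few lines of Python (a toy sketch, not a real execution engine): each step waits for its input signal, does something, and passes its output to exactly one successor.

```python
# Toy model of an algorithmic chain: each statement is an object
# with one input and one output, and communication is strictly
# pairwise -- one sender, one receiver, one path at a time.
def chain(*steps):
    def run(signal):
        for step in steps:        # "I'm done, now it's your turn."
            signal = step(signal)
        return signal
    return run

program = chain(lambda x: x + 1, lambda x: x * 2, lambda x: x - 3)
print(program(5))  # ((5 + 1) * 2) - 3 = 9
```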

My thesis is that this mechanism is too restrictive and leads to unreliable software. Why? Because there are occasions when a particular event or action must be communicated to several objects simultaneously. This is known as an event dependency. Algorithmic development environments make it hard to attach orthogonal signaling branches to a sequential thread and therein lies the problem. The burden is on the programmer to remember to add code to handle delayed reaction cases: something that occurred previously in the procedure needs to be addressed at the earliest opportunity by another part of the program. Every so often we either forget to add the necessary code (usually, a call to a subroutine) or we fail to spot the dependency.

Event Dependencies and the Blind Code Problem

The state of a system at any given time is defined by the collection of properties (variables) that comprise the system's data, including the data contained in input/output registers. The relationships or dependencies between properties determine the system's behavior. A dependency simply means that a change in one property (also known as an event) must be followed by a change in one or more related properties. In order to ensure flawless and consistent behavior, it is imperative that all dependencies are resolved during development and are processed in a timely manner during execution.

It takes intimate knowledge of an algorithmic program to identify and remember all the dependencies. Due to the large turnover in the software industry, programmers often inherit strange legacy code which aggravates the problem. Still, even good familiarity is not a guarantee that all dependencies will be spotted and correctly handled. Oftentimes, a program is so big and complex that its original authors completely lose sight of old dependencies.

Blind code leads to wrong assumptions which often result in unexpected and catastrophic failures. The problem is so pervasive and so hard to fix that most managers in charge of maintaining complex mission-critical software systems will try to find alternative ways around a bug that do not involve modifying the existing code.

Related Articles:

How to Solve the Parallel Programming Crisis
Parallel Computing: The End of the Turing Madness
Parallel Computing: Why the Future Is Synchronous
Parallel Computing: Why the Future Is Reactive
Why Parallel Programming Is So Hard
Parallel Programming, Math and the Curse of the Algorithm
The COSA Saga

8 comments:

Mathew Robertson said...

Some questions that I'm hoping you may answer...

- any links to evidence which says the brain is synchronous?

- some (lots of) integrated circuits have global clocks - however, the fastest systems are clock-less and are controlled primarily from event driven sources (ie: edge triggered)

- a single component failure in most electronic systems will pretty much always generate a bad output or worse (eg: smoke!)

- "signal-based synchronous systems" (assuming I understand everything else written here... :) can be created without requiring central-clock-like control. Are you suggesting that now your thoughts are evolving, that a central clock is no longer needed?

chz said...

A few things:

Brittleness - You say that "algorithmic programs tend to suffer from catastrophic failures even in situations where the actual defect is minor and globally insignificant". How does this fit in with Rinard's results for "Failure Oblivious Computing"? Why is this a general computing issue and not just an operating system semantics issue?

Temporal Inconsistency - Why is it important to know how long a function will take to complete? Why is knowing the run time of a function to guarantee no decisions happen "at the wrong time, under the wrong conditions" better than just using existing synchronization primitives? Furthermore, why is a thread an algorithm? Is it because a thread computes a given algorithm?

Signal Timing - For the brain, what is "the right time"? What are the "right temporal conditions"?

Communication Systems - What is an orthogonal signaling branch? Why is a condition variable with broadcast support insufficient for communicating an event to several objects?

Louis Savain said...

Matthew and chz,

Thanks for commenting. I'll respond to your questions soon. I'm a little busy. Bear with me.

Louis Savain said...

Mathew Robertson, sorry for the late reply. You wrote:

- any links to evidence which says the brain is synchronous?

There is plenty of evidence that the brain uses electrical synapses in conjunction with oscillations (usually in the 30 Hz range) to synchronize the operation of huge cell assemblies throughout the brain. Do a Google search for "cortex synchronization".

- some (lots of) integrated circuits have global clocks - however, the fastest systems are clock-less and are controlled primarily from event driven sources (ie: edge triggered)

I agree. However, synchronous operation does not necessarily mean that circuits have to use a global clock to maintain synchronization. What is important is that operations and signal travel times are the same everywhere (i.e., all AND gates have equal processing durations). The idea is to eliminate signal racing, which would be highly detrimental to reliability.

- a single component failure in most electronic systems will pretty much always generate a bad output or worse (eg: smoke!)

Well, I am not so sure about that. The fact is that microprocessors usually have many non-working or faulty circuits and yet function properly within their advertised specs. The reason for this mostly has to do with initial design flaws or subsequent manufacturing defects and/or limitations. In order to save money, the processor vendor can usually come up with a work-around or decide to eliminate a planned feature altogether.

- "signal-based synchronous systems" (assuming I understand everything else written here... :) can be created without requiring central-clock-like control. Are you suggesting that now your thoughts are evolving, that a central clock is no longer needed?

No. The clock in the COSA programming model is virtual. This means that a COSA-compatible processor core can and should be clockless; by this I mean that there is no need for the core to use a real-time clock. Instructions should be processed as fast as possible.
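To make the idea concrete, here is a toy sketch of such a virtual clock (illustrative only, not the actual COSA design): signals emitted during one cycle are buffered and processed together in the next cycle, and no real-time clock is involved.

```python
# Toy virtual clock: every signal sent during cycle N is buffered
# and processed in cycle N+1, simultaneously from the program's
# point of view. The loop itself runs as fast as possible.
def run(cells, signals, cycles):
    for _ in range(cycles):
        current, signals = signals, []            # swap input/output buffers
        for target, value in current:
            signals.extend(cells[target](value))  # a cell may emit new signals
    return signals

# Hypothetical cells: a doubler feeding an accumulator.
total = []
cells = {
    "double": lambda v: [("acc", v * 2)],
    "acc": lambda v: (total.append(v), [])[1],
}
run(cells, [("double", 1), ("double", 2)], cycles=2)
print(total)  # [2, 4]: both doubled signals reach "acc" in the same cycle
```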

Louis Savain said...

chz wrote:

Brittleness - You say that "algorithmic programs tend to suffer from catastrophic failures even in situations where the actual defect is minor and globally insignificant". How does this fit in with Rinard's results for "Failure Oblivious Computing"?

Sorry, I don't know much about "failure-oblivious computing". All I know is that, if you have a program that consists of a bunch of objects running in parallel, a localized failure will only affect the areas that are directly dependent on the failed object. The others should continue to run unaffected. This is what is known as robustness. We see this in nature. For example, an animal may suffer a localized wound but the rest of its body will continue to function well enough for it to survive. Unless the failure occurs in a super-critical object that affects the entire system, the system as a whole should continue to function even if it is somewhat handicapped. Robustness is important in safety- and mission-critical systems used in such application areas as aviation, transportation, finance and energy generation.

Why is this a general computing issue and not just an operating system semantics issue?

It is a general issue because it affects all areas of computing, not just operating systems.

Temporal Inconsistency - Why is it important to know how long a function will take to complete?

It's only important in algorithmic programs that must operate in real-time environments. The program must respond in a timely fashion to external events whenever they occur. Of course, preemptive multitasking and interrupts will take care of it, but they are not part of the Turing Machine model and their timing is non-deterministic.

Why is knowing the run time of a function to guarantee no decisions happen "at the wrong time, under the wrong conditions" better than just using existing synchronization primitives?

Unless you have a model that simulates parallelism (as in a video game) via an infinite loop and fast incremental updates of concurrent objects, or a preemptive interrupt mechanism that can handle critical events as they happen, you will miss real-time events.
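As a toy sketch of the first option, simulated parallelism via a loop with incremental updates (the event source and object names are hypothetical):

```python
# A minimal simulated-parallelism loop: one pass per frame gives each
# "concurrent" object a small incremental update, and external events
# are polled every iteration so no object can miss them for long.
events = ["tick", None, "collision", None]        # stand-in external events
objects = {"player": [], "enemy": []}             # each holds its event history

for frame, event in enumerate(events):            # one pass per frame
    for name, history in objects.items():         # incremental update of each object
        if event is not None:                     # every object reacts in the same pass
            history.append((frame, event))

print(objects["player"])  # [(0, 'tick'), (2, 'collision')]
```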

Furthermore, why is a thread an algorithm? Is it because a thread computes a given algorithm?

No; every program that runs on current computers is algorithmic, and a thread is just a program.

Signal Timing - For the brain, what is "the right time"? What are the "right temporal conditions"?

Precise timing is highly critical to brain function. As an example, you would not be able to recognize anything if the timing of signals generated by the auditory and visual sensors of your brain were random. The same is true for motor behavior. Unless signals arriving at your muscles are precisely timed, you would not be able to accomplish much.

Communication Systems - What is an orthogonal signaling branch?

It is just a signaling mechanism that sends a signal to multiple targets at the same time.

Why is a condition variable with broadcast support insufficient for communicating an event to several objects?

It is sufficient, unless it is temporally non-deterministic and not an integral part of the software model. The way I see it, multiple signals that originate from a single sender should all arrive simultaneously at their targets. The deterministic timing of signals is of paramount importance because any discrepancy in the signal timing of a running program is evidence of a malfunction. This will do wonders for reliability.

James said...

Are you aware of Chuck Moore's new chip? None of your posts match seaforth or intallasys so here's the link:

http://www.intellasys.net/index.php?option=com_content&task=view&id=35

How does this processor compare with the tile64? Would it be worthwhile to write a cosa emulator for either chip?

Louis Savain said...

James,

Thanks for commenting. In my opinion, since the cores in a SEAforth processor can run VentureForth directly, I would guess that they are much faster than the MIPS cores used in the Tile64. I don't know enough about the inter-core communication mechanism used in SEAforth to compare it with the Tilera iMesh technology.

Being a long-time Forth programmer, I am unavoidably biased toward it, if only because I know it's as fast as assembly on a dedicated processor. I would therefore recommend the SEAforth over the Tile64 for a COSA emulation project or any other project that requires speed.

PS. Since Forth is a stack-based language, there might be a way to use the stacks as COSA input buffers, assuming VentureForth allows the swapping of stacks.

James said...

Here's a good overview of the architecture:

http://tinyurl.com/6rsdz7
A couple of paragraphs seemed promising:

"Every time a stack is pushed or popped, the next or previous item is selected. An odd but useful side effect of this is that a pop does not actually destroy data, it just goes to the bottom of the stack. Devious programmers take advantage of this feature to use the stacks as small data caches."

"It is at this point that we see the need for a model to help us partition programs across cores. We can view each core as a node executing a process block. Each node is connected to its neighbors by signal carriers (sometimes called "wires"). Thinking of the chip as a set of process blocks and signal carriers is the key to successfully factoring problems to fit the SEAforth chip. Multicore chip programming introduces communication as a design element that is as important as the programming of the cores."