Friday, February 26, 2010

Unreliable Software, Part III: Brooks's Blunder

Part I, II, III

Abstract

In Part II, I wrote that Brooks's excuse for crappy software is actually a curse because it is much more detrimental to society than is commonly assumed. In this post, I defend the thesis that there is a way to lift the curse and solve the software reliability problem once and for all. Timing is the key.

Brooks's Blunder

In his famous No Silver Bullet paper, computer scientist Frederic Brooks wrote the following:
The complexity of software is an essential property, not an accidental one.
Brooks's thesis is that the accidental complexity of software programming (e.g., syntax, spelling, infinite loops, null pointers, etc.) can be effectively managed but its essential complexity is so vast that errors cannot be prevented. Brooks stakes his entire thesis on the following assertion:
From the complexity comes the difficulty of enumerating, much less understanding, all the possible states of the program, and from that comes the unreliability.
In other words, Brooks argues that, in order for a program to be reliable, one must be able to enumerate and understand all the possible states of the program. This may seem obvious at first glance but is it really true? Well, I hate to disappoint those of you who have believed in this fairy tale for the last twenty years or so but Brooks's thesis can be easily falsified as follows:

Let's take a simple temperature control program that is designed to read a register T that represents room temperature. The program turns on the heater if T decreases below t1 and turns it off when T increases above t2. Is it necessary to enumerate all the possible states of this program in order to know that it works correctly? Of course not. The program will work correctly if the two conditions (i.e., T < t1 and T > t2) that control its behavior are tested and shown to elicit the right behavior. There is no need to test all the possible states of T in order to know that the program is working properly as designed.

Note: Many years ago, speaking about Brooks's 'No Silver Bullet' paper, I wrote that "no other paper in the annals of software engineering has had a more detrimental effect on humanity's efforts to find a solution to the software reliability crisis." I still stand by that statement.

Thinking of Everything: It's All in the Timing

It is common knowledge that the human mind is prone to errors and often overlooks important aspects of a complex design. My claim is that there is a way to solve this problem. In my thesis on the correct approach to programming, timing is an essential and fundamental aspect of software, not an afterthought. This means that a program should be synchronous and reactive and should consist entirely of sensors (comparison operators or event detectors) and effectors (normal data operators). It also means that the relative timing of every elementary operation can be easily discovered via the use of simultaneity and sequence detectors. During testing, a special tool called the temporal discovery tool (TDT) can automatically discover the normal relative temporal order of all the sensors and effectors used in the program. This knowledge can be used to automatically generate alert detectors for every possible temporal anomaly. What would constitute an anomaly? Suppose the TDT discovers that signals A, B, C, and D always occur sequentially in that order. It will then create anomaly detectors that fire if there is a break somewhere in the sequence.
In practice, the TDT will check for both simultaneity and sequentiality in order to cover every possible eventuality. An application must not be deployed unless every discovered anomaly (alert signal) is adequately handled. This way, it is no longer necessary for the programmer to think of every possible anomalous condition in the program since that is the job of the TDT.

An Example

A simple program like the temperature control program I described above would suffice in most situations but what if some anomaly occurred such as an external hardware malfunction? Given enough temperature sensors, the TDT can discover many things about the way temperature changes in the room.  For example, the TDT can easily discover that the temperature cannot increase above X twice in a row without first decreasing below X in between.

In other words, it discovers that an increase above X normally alternates with a decrease below X. So it creates an alert sensor that sends a signal if the rule is broken. This sort of thing can occur if the temperature sensor undergoes a malfunction for whatever reason. The discovery process is fully automatic and thorough and guarantees that every possible contingency is covered.

Dear Toyota Software Managers

Please take note. If Toyota had a tool like the TDT, it would not be in the predicament that it finds itself in. The TDT would have automatically discovered that pressing the brake pedal at the same time as the gas pedal is an anomaly. Of course, Toyota's engineers knew that but, with the TDT, they would also know that everything is guaranteed to be in working order and, as a result, they would not be shy about adding as much complexity to the control program as necessary, or as desired.

Conclusion

The TDT tool described above can only work in a deterministic software environment. What this means is that the system must make it possible to determine the temporal relationship (simultaneous or sequential) of any two events in a program. The beauty of the COSA software model is that it enables temporally deterministic programs. Non-deterministic and/or abnormal behavior can be fully accounted for during development. COSA does not construct your code for you. It simply makes sure that whatever you create is rock-solid and complete with no stone left unturned. If only the not so wise programming managers at Toyota, or wherever safety-critical code is developed, would take notice.

Next: How to Construct 100% Bug-Free Software

See Also:

Parallel Computing: Why the Future Is Synchronous
Parallel Computing: Why the Future Is Reactive
How to Solve the Parallel Programming Crisis
Parallel Computing: The End of the Turing Madness
Why Software Is Bad and What We can Do to Fix It
Project COSA
The COSA Operating System

5 comments:

Josh McDonald said...

Doesn't this imply a solution to the halting problem?

Louis Savain said...

Josh McDonald wrote: Doesn't this imply a solution to the halting problem?

The answer is no since a COSA program is non-algorithmic. That is to say, it does not receive a set of input data, performs a computation and returns a result. A COSA program is simply a behaving system, an inter-connected set of sensors and effectors. External sensors can generate inputs at any time, not just at the beginning, as is done in algorithms.

The halting problem is irrelevant to the COSA software model. That is not to say that a COSA program cannot perform an algorithmic task that is subject to the halting problem. But who really cares anyway? The halting problem is really a useless academic curiosity that has absolutely nothing to do with 99.9999999999999999% of computer applications.

Remember, a COSA machine does not compute. It just behaves. Computation is just one of its many behaviors. And a behavior can last indefinitely.

Sigfred said...

Louis,

I am one of those crackpots; I agree with you. Not in the formal, uptight sense, but in a more vague, cosmic, "I feel what you are feeling" kind of sense.

A logic gate was not designed to DO things. An AND gate does not LET electrons flow through its output wire if both of its input wires are live; rather, its output wire is live BECAUSE both of its input wires are live. It was designed to BE things.

Hoping to articulate more of my thoughts.

- Sigfred

Nate Cull said...

Like Sigfred, I'm another of those crackpots - I feel something similar but can't quite articulate it. I believe very strongly that functional reactive programming with finegrained parallelism is essential for the future of computing and that the von Neumann model is really, really hamstringing us.

I don't agree quite that this will solve ALL software reliability issues, but I think we're needlessly introducing a lot of complexity in current languages which could be removed by going to a pure parallel language.

I also think we can do parallel execution of a serialised language, and that graphic representations aren't necessary (in fact are harmful) at a low level. This is because we need serious parallel software to be (like Lisp) self-describing if we want to build software in it; we need to be able to write macros, compilers, interpreters, VMs etc. Most parallel languages (Pure Data, LabView) so far have been graphical because they're trying to be 'simple' for non-programmers instead of being workmanlike tools for the serious programmer. That's not very useful.

Neither is a parallel system quite 'non-algorithmic' though I think I do get what you mean; it doesn't take one input then run and spit out an answer. Instead, it operates over an infinite stream of data, and multiplexes between event streams; this I think implies something like coroutines or continuations ('yield') as the basic control primitive. This approach is sort of well-known (in the sense of being mathematically describable), but not well-implemented in the sense of having even a single good programming language yet based on it. Scheme was based on continuations but I think still doesn't really 'get' this style at all.

But certainly looking at these 'things not talked about' does feel like pushing on the boundary of madness and can become quite an obsession.

At the moment I'm thinking something along the lines of a Joylike language (but not a stack language; a prefix one) would be the way forward. This is pretty much what the first version of Smalltalk was; but I think Smalltalk hardened into 'object-orientation' (with a function/object split and inheritance hierarchies) far too soon and what we need to rediscover is pure message-passing. Make the messages (events) central and process them in parallel (FRP), distribute across the Net, and I think we've got something.

John Reilly said...

I think you are a bit confused by some terminology. You say

Is it necessary to enumerate all the possible states of this program in order to know that it works correctly? Of course not. The program will work correctly if the two conditions (i.e., T < t1 and T > t2) that control its behavior are tested and shown to elicit the right behavior. There is no need to test all the possible states of T in order to know that the program is working properly as designed.

Testing for all values of T is not testing all states.

If you are testing states, you would have 2 defined states. ON and OFF. You have conditions in which you will transition from ON to OFF and OFF to ON. One would test these state transitions using appropriate values of T (as you describe).

The problem is that not all state is easily enumerated and state transitions may be dictated by complex sets of criteria.

It sounds to me like you have never been involved in any complex software development.